* [PATCH] slub: prevent validate_slab() error due to race condition
@ 2012-04-26 18:57 ` Waiman Long
  0 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2012-04-26 18:57 UTC (permalink / raw)
  To: cl, penberg, mpm
  Cc: linux-mm, linux-kernel, don.morris, Waiman Long, Waiman Long

The SLUB memory allocator was changed substantially from 3.0 to 3.1 by
replacing some of the page locking code for updating the free object
list of the slab with a double-quadword atomic exchange
(cmpxchg_double_slab), or a pseudo one using a page lock when debugging
is turned on.  In the normal case, that should be enough to make sure
that the slab is in a consistent state. However, when CONFIG_SLUB_DEBUG
is turned on and the Redzone debugging flag is set, the Redzone bytes
are also used to mark whether an object is free or allocated. The extra
state information in those Redzone bytes is not protected by
cmpxchg_double_slab(). As a result, validate_slab() may report a
Redzone error if the validation is performed while racing with a free
to a debugged slab.
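
The timing window can be illustrated outside the kernel with a small
userspace model (made-up names, not the mm/slub.c code): one thread
updates a "free" flag atomically but writes a separate redzone byte
outside of that atomic operation, while a second thread checks that
the two agree, the way validate_slab() does.

	/* redzone-race.c: gcc -std=c11 -pthread redzone-race.c */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	#define RED_ACTIVE   0xcc			/* object in use  */
	#define RED_INACTIVE 0xbb			/* object is free */

	static atomic_int on_freelist;			/* updated atomically */
	static volatile unsigned char redzone = RED_ACTIVE; /* updated separately */

	static void *freer(void *arg)
	{
		for (int i = 0; i < 5000000; i++) {
			/* "allocate": mark active, then take off the free list */
			redzone = RED_ACTIVE;
			atomic_store(&on_freelist, 0);
			/* "free": the list update is atomic ... */
			atomic_store(&on_freelist, 1);
			/* ... but the redzone poisoning is a separate store */
			redzone = RED_INACTIVE;
		}
		return NULL;
	}

	static void *validator(void *arg)
	{
		long mismatches = 0;

		for (int i = 0; i < 5000000; i++)
			if (atomic_load(&on_freelist) && redzone != RED_INACTIVE)
				mismatches++;
		printf("spurious redzone mismatches seen: %ld\n", mismatches);
		return NULL;
	}

	int main(void)
	{
		pthread_t t1, t2;

		pthread_create(&t1, NULL, freer, NULL);
		pthread_create(&t2, NULL, validator, NULL);
		pthread_join(t1, NULL);
		pthread_join(t2, NULL);
		return 0;
	}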

The problem was reported in

	https://bugzilla.kernel.org/show_bug.cgi?id=42312

It is fairly easy to reproduce by booting with the kernel parameter
"slub_debug=FZPU".  After booting, run the following command (as root):

	while true ; do ./slabinfo -v ; sleep 3 ; done

The slabinfo test code can be found in tools/vm/slabinfo.c.

At the same time, load the system with heavy I/O activity by, for
example, building the Linux kernel. Messages like the following will
then show up in dmesg:

	BUG names_cache: Redzone overwritten
	SLUB: names_cache 3 slabs counted but counter=4

This patch fixes the BUG message by acquiring the node-level lock for
slabs flagged for debugging, avoiding this possible race condition.
The node-level lock is used instead of the more granular page lock
because the code may later speculatively acquire the node-level lock
anyway. Acquiring the page lock first and then the node lock could
lead to deadlock.
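
A minimal userspace sketch of that ordering rule, with pthread mutexes
standing in for n->list_lock and the slab page lock (an analogy only,
not kernel code): as long as every path takes the node-level lock
before the page-level one, no ABBA deadlock is possible.

	/* lock-order.c: gcc -pthread lock-order.c */
	#include <pthread.h>
	#include <stdio.h>

	static pthread_mutex_t node_lock = PTHREAD_MUTEX_INITIALIZER; /* ~ n->list_lock */
	static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER; /* ~ slab_lock()  */

	/*
	 * Both the "free" path and the "validate" path use the same order:
	 * node lock first, page lock second.  Reversing the order in one
	 * of them (page first, node second) is what could deadlock.
	 */
	static void *locked_work(void *name)
	{
		for (int i = 0; i < 100000; i++) {
			pthread_mutex_lock(&node_lock);
			pthread_mutex_lock(&page_lock);
			/* touch the slab / redzone state here */
			pthread_mutex_unlock(&page_lock);
			pthread_mutex_unlock(&node_lock);
		}
		printf("%s done, no deadlock\n", (char *)name);
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b;

		pthread_create(&a, NULL, locked_work, "free path");
		pthread_create(&b, NULL, locked_work, "validate path");
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		return 0;
	}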

As the increment of the slab node count and the insertion of the new
slab into the partial or full slab list are not a single atomic
operation, there is a small time window in which the two may not match.
This patch temporarily works around the problem by allowing the node
count to be one larger than the number of slabs present in the lists.
This workaround may not hold if more than one CPU is actively adding
slabs to the same node, but it should be good enough to work around
the problem in most cases.

To really fix the issue, the overall synchronization between slub debug
operations and slub validation needs to be revisited.

This patch also fixes a number of "code indent should use tabs where
possible" errors reported by checkpatch.pl in the __slab_free() function
by replacing groups of 8 spaces with real tabs.

After applying the patch, the slub errors and warnings are all gone on
the 4-CPU x86-64 test machine.

Signed-off-by: Waiman Long <waiman.long@hp.com>
Reviewed-by: Don Morris <don.morris@hp.com>
---
 mm/slub.c |   46 +++++++++++++++++++++++++++++++++-------------
 1 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index ffe13fd..4ca3140 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2445,8 +2445,18 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 
 	stat(s, FREE_SLOWPATH);
 
-	if (kmem_cache_debug(s) && !free_debug_processing(s, page, x, addr))
-		return;
+	if (kmem_cache_debug(s)) {
+		/*
+		 * We need to acquire the node lock to prevent spurious error
+		 * with validate_slab().
+		 */
+		n = get_node(s, page_to_nid(page));
+		spin_lock_irqsave(&n->list_lock, flags);
+		if (!free_debug_processing(s, page, x, addr)) {
+			spin_unlock_irqrestore(&n->list_lock, flags);
+			return;
+		}
+	}
 
 	do {
 		prior = page->freelist;
@@ -2467,7 +2477,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 
 			else { /* Needs to be taken off a list */
 
-	                        n = get_node(s, page_to_nid(page));
+				n = get_node(s, page_to_nid(page));
 				/*
 				 * Speculatively acquire the list_lock.
 				 * If the cmpxchg does not succeed then we may
@@ -2501,10 +2511,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		 * The list lock was not taken therefore no list
 		 * activity can be necessary.
 		 */
-                if (was_frozen)
-                        stat(s, FREE_FROZEN);
-                return;
-        }
+		if (was_frozen)
+			stat(s, FREE_FROZEN);
+		return;
+	}
 
 	/*
 	 * was_frozen may have been set after we acquired the list_lock in
@@ -2514,7 +2524,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		stat(s, FREE_FROZEN);
 	else {
 		if (unlikely(!inuse && n->nr_partial > s->min_partial))
-                        goto slab_empty;
+			goto slab_empty;
 
 		/*
 		 * Objects left in the slab. If it was not on the partial list before
@@ -4122,7 +4132,7 @@ static void validate_slab_slab(struct kmem_cache *s, struct page *page,
 static int validate_slab_node(struct kmem_cache *s,
 		struct kmem_cache_node *n, unsigned long *map)
 {
-	unsigned long count = 0;
+	unsigned long count = 0, n_count;
 	struct page *page;
 	unsigned long flags;
 
@@ -4143,10 +4153,20 @@ static int validate_slab_node(struct kmem_cache *s,
 		validate_slab_slab(s, page, map);
 		count++;
 	}
-	if (count != atomic_long_read(&n->nr_slabs))
-		printk(KERN_ERR "SLUB: %s %ld slabs counted but "
-			"counter=%ld\n", s->name, count,
-			atomic_long_read(&n->nr_slabs));
+	n_count = atomic_long_read(&n->nr_slabs);
+	/*
+	 * The following workaround is to greatly reduce the chance of counter
+	 * mismatch messages due to the fact that inc_slabs_node() and the
+	 * subsequent insertion into the partial or full slab list is not
+	 * atomic. Consequently, there is a small timing window when the two
+	 * are not in the same state. A possible fix is to take the node lock
+	 * while doing inc_slabs_node() and slab insertion, but that may
+	 * require substantial changes to existing slow path slab allocation
+	 * logic.
+	 */
+	if ((count != n_count) && (count + 1 != n_count))
+		printk(KERN_ERR "SLUB: %s %ld slabs counted but counter=%ld\n",
+			s->name, count, n_count);
 
 out:
 	spin_unlock_irqrestore(&n->list_lock, flags);
-- 
1.7.8.2


* Re: [PATCH] slub: prevent validate_slab() error due to race condition
  2012-04-26 18:57 ` Waiman Long
@ 2012-04-26 19:12   ` Eric Dumazet
  -1 siblings, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2012-04-26 19:12 UTC (permalink / raw)
  To: Waiman Long; +Cc: cl, penberg, mpm, linux-mm, linux-kernel, don.morris

On Thu, 2012-04-26 at 14:57 -0400, Waiman Long wrote:
> The SLUB memory allocator was changed substantially from 3.0 to 3.1 by
> replacing some of page locking codes for updating the free object list
> of the slab with double-quadword atomic exchange (cmpxchg_double_slab)
> or a pseudo one using a page lock when debugging is turned on.  In the
> normal case, that should be enough to make sure that the slab is in a
> consistent state. However, when CONFIG_SLUB_DEBUG is turned on and the
> Redzone debugging flag is set, the Redzone bytes are also used to mark
> if an object is free or allocated. The extra state information in those
> Redzone bytes is not protected by the cmpxchg_double_slab(). As a
> result,
> validate_slab() may report a Redzone error if the validation is
> performed
> while racing with a free to a debugged slab.
> 
> The problem was reported in
> 
> 	https://bugzilla.kernel.org/show_bug.cgi?id=42312
> 
> It is fairly easy to reproduce by passing in the kernel parameter of
> "slub_debug=FZPU".  After booting, run the command (as root):
> 
> 	while true ; do ./slabinfo -v ; sleep 3 ; done
> 
> The slabinfo test code can be found in tools/vm/slabinfo.c.
> 
> At the same time, load the system with heavy I/O activities by, for
> example, building the Linux kernel. The following kind of dmesg messages
> will then be reported:
> 
> 	BUG names_cache: Redzone overwritten
> 	SLUB: names_cache 3 slabs counted but counter=4
> 
> This patch fixes the BUG message by acquiring the node-level lock for
> slabs flagged for debugging to avoid this possible racing condition.
> The locking is done on the node-level lock instead of the more granular
> page lock because the new code may speculatively acquire the node-level
> lock later on. Acquiring the page lock and then the node lock may lead
> to potential deadlock.
> 
> As the increment of slab node count and insertion of the new slab into
> the partial or full slab list is not an atomic operation, there is a
> small time window where the two may not match. This patch temporarily
> works around this problem by allowing the node count to be one larger
> than the number of slab presents in the lists. This workaround may not
> work if more than one CPU is actively adding slab to the same node,
> but it should be good enough to workaround the problem in most cases.
> 
> To really fix the issue, the overall synchronization between debug slub
> operations and slub validation needs a revisit.
> 
> This patch also fixes a number of "code indent should use tabs where
> possible" error reported by checkpatch.pl in the __slab_free() function
> by replacing groups of 8-space tab by real tabs.
> 
> After applying the patch, the slub error and warnings are all gone in
> the 4-CPU x86-64 test machine.
> 
> Signed-off-by: Waiman Long <waiman.long@hp.com>
> Reviewed-by: Don Morris <don.morris@hp.com>
> ---
>  mm/slub.c |   46 +++++++++++++++++++++++++++++++++-------------
>  1 files changed, 33 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index ffe13fd..4ca3140 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2445,8 +2445,18 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>  
>  	stat(s, FREE_SLOWPATH);
>  
> -	if (kmem_cache_debug(s) && !free_debug_processing(s, page, x, addr))
> -		return;
> +	if (kmem_cache_debug(s)) {
> +		/*
> +		 * We need to acquire the node lock to prevent spurious error
> +		 * with validate_slab().
> +		 */
> +		n = get_node(s, page_to_nid(page));
> +		spin_lock_irqsave(&n->list_lock, flags);
> +		if (!free_debug_processing(s, page, x, addr)) {
> +			spin_unlock_irqrestore(&n->list_lock, flags);
> +			return;
> +		}

		missing spin_unlock_irqrestore(&n->list_lock, flags); ?

> +	}
>  
>  	do {
>  		prior = page->freelist;
> @@ -2467,7 +2477,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>  
>  			else { /* Needs to be taken off a list */
>  
> -	                        n = get_node(s, page_to_nid(page));
> +				n = get_node(s, page_to_nid(page));
>  				/*
>  				 * Speculatively acquire the list_lock.
>  				 * If the cmpxchg does not succeed then we may
> @@ -2501,10 +2511,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>  		 * The list lock was not taken therefore no list
>  		 * activity can be necessary.
>  		 */
> -                if (was_frozen)
> -                        stat(s, FREE_FROZEN);
> -                return;
> -        }
> +		if (was_frozen)
> +			stat(s, FREE_FROZEN);
> +		return;
> +	}
>  
>  	/*
>  	 * was_frozen may have been set after we acquired the list_lock in
> @@ -2514,7 +2524,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>  		stat(s, FREE_FROZEN);
>  	else {
>  		if (unlikely(!inuse && n->nr_partial > s->min_partial))
> -                        goto slab_empty;
> +			goto slab_empty;
>  
>  		/*
>  		 * Objects left in the slab. If it was not on the partial list before
> @@ -4122,7 +4132,7 @@ static void validate_slab_slab(struct kmem_cache *s, struct page *page,
>  static int validate_slab_node(struct kmem_cache *s,
>  		struct kmem_cache_node *n, unsigned long *map)
>  {
> -	unsigned long count = 0;
> +	unsigned long count = 0, n_count;
>  	struct page *page;
>  	unsigned long flags;
>  
> @@ -4143,10 +4153,20 @@ static int validate_slab_node(struct kmem_cache *s,
>  		validate_slab_slab(s, page, map);
>  		count++;
>  	}
> -	if (count != atomic_long_read(&n->nr_slabs))
> -		printk(KERN_ERR "SLUB: %s %ld slabs counted but "
> -			"counter=%ld\n", s->name, count,
> -			atomic_long_read(&n->nr_slabs));
> +	n_count = atomic_long_read(&n->nr_slabs);
> +	/*
> +	 * The following workaround is to greatly reduce the chance of counter
> +	 * mismatch messages due to the fact that inc_slabs_node() and the
> +	 * subsequent insertion into the partial or full slab list is not
> +	 * atomic. Consequently, there is a small timing window when the two
> +	 * are not in the same state. A possible fix is to take the node lock
> +	 * while doing inc_slabs_node() and slab insertion, but that may
> +	 * require substantial changes to existing slow path slab allocation
> +	 * logic.
> +	 */
> +	if ((count != n_count) && (count + 1 != n_count))
> +		printk(KERN_ERR "SLUB: %s %ld slabs counted but counter=%ld\n",
> +			s->name, count, n_count);
>  
>  out:
>  	spin_unlock_irqrestore(&n->list_lock, flags);



* Re: [PATCH] slub: prevent validate_slab() error due to race condition
  2012-04-26 19:12   ` Eric Dumazet
@ 2012-04-26 19:20     ` Don Morris
  -1 siblings, 0 replies; 18+ messages in thread
From: Don Morris @ 2012-04-26 19:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Waiman Long, cl, penberg, mpm, linux-mm, linux-kernel

On 04/26/2012 12:12 PM, Eric Dumazet wrote:
> On Thu, 2012-04-26 at 14:57 -0400, Waiman Long wrote:
>> The SLUB memory allocator was changed substantially from 3.0 to 3.1 by
>> replacing some of page locking codes for updating the free object list
>> of the slab with double-quadword atomic exchange (cmpxchg_double_slab)
>> or a pseudo one using a page lock when debugging is turned on.  In the
>> normal case, that should be enough to make sure that the slab is in a
>> consistent state. However, when CONFIG_SLUB_DEBUG is turned on and the
>> Redzone debugging flag is set, the Redzone bytes are also used to mark
>> if an object is free or allocated. The extra state information in those
>> Redzone bytes is not protected by the cmpxchg_double_slab(). As a
>> result,
>> validate_slab() may report a Redzone error if the validation is
>> performed
>> while racing with a free to a debugged slab.
>>
>> The problem was reported in
>>
>> 	https://bugzilla.kernel.org/show_bug.cgi?id=42312
>>
>> It is fairly easy to reproduce by passing in the kernel parameter of
>> "slub_debug=FZPU".  After booting, run the command (as root):
>>
>> 	while true ; do ./slabinfo -v ; sleep 3 ; done
>>
>> The slabinfo test code can be found in tools/vm/slabinfo.c.
>>
>> At the same time, load the system with heavy I/O activities by, for
>> example, building the Linux kernel. The following kind of dmesg messages
>> will then be reported:
>>
>> 	BUG names_cache: Redzone overwritten
>> 	SLUB: names_cache 3 slabs counted but counter=4
>>
>> This patch fixes the BUG message by acquiring the node-level lock for
>> slabs flagged for debugging to avoid this possible racing condition.
>> The locking is done on the node-level lock instead of the more granular
>> page lock because the new code may speculatively acquire the node-level
>> lock later on. Acquiring the page lock and then the node lock may lead
>> to potential deadlock.
>>
>> As the increment of slab node count and insertion of the new slab into
>> the partial or full slab list is not an atomic operation, there is a
>> small time window where the two may not match. This patch temporarily
>> works around this problem by allowing the node count to be one larger
>> than the number of slab presents in the lists. This workaround may not
>> work if more than one CPU is actively adding slab to the same node,
>> but it should be good enough to workaround the problem in most cases.
>>
>> To really fix the issue, the overall synchronization between debug slub
>> operations and slub validation needs a revisit.
>>
>> This patch also fixes a number of "code indent should use tabs where
>> possible" error reported by checkpatch.pl in the __slab_free() function
>> by replacing groups of 8-space tab by real tabs.
>>
>> After applying the patch, the slub error and warnings are all gone in
>> the 4-CPU x86-64 test machine.
>>
>> Signed-off-by: Waiman Long <waiman.long@hp.com>
>> Reviewed-by: Don Morris <don.morris@hp.com>
>> ---
>>  mm/slub.c |   46 +++++++++++++++++++++++++++++++++-------------
>>  1 files changed, 33 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index ffe13fd..4ca3140 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -2445,8 +2445,18 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>>  
>>  	stat(s, FREE_SLOWPATH);
>>  
>> -	if (kmem_cache_debug(s) && !free_debug_processing(s, page, x, addr))
>> -		return;
>> +	if (kmem_cache_debug(s)) {
>> +		/*
>> +		 * We need to acquire the node lock to prevent spurious error
>> +		 * with validate_slab().
>> +		 */
>> +		n = get_node(s, page_to_nid(page));
>> +		spin_lock_irqsave(&n->list_lock, flags);
>> +		if (!free_debug_processing(s, page, x, addr)) {
>> +			spin_unlock_irqrestore(&n->list_lock, flags);
>> +			return;
>> +		}
> 
> 		missing spin_unlock_irqrestore(&n->list_lock, flags); ?

Note that he sets n here, so the if() block at 2458 cannot be
taken (!n fails) and the if (likely(!n)) branch is not taken for the
same reason. As such, the code falls through to the returns for
either the slab being empty (or not), where the node lock is
released (2529 / 2543).
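
A rough userspace model of that flow, for illustration only (a mutex
stands in for n->list_lock; the helper names are made up, this is not
the actual __slab_free() code):

	#include <pthread.h>
	#include <stdbool.h>
	#include <stdio.h>

	static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER; /* ~ n->list_lock */

	/* stand-in for free_debug_processing(): false means a check failed */
	static bool free_debug_checks_ok(void) { return true; }

	static void slab_free_slowpath(bool debug)
	{
		pthread_mutex_t *n = NULL;

		if (debug) {
			n = &list_lock;		/* n is set before the main loop */
			pthread_mutex_lock(n);
			if (!free_debug_checks_ok()) {
				pthread_mutex_unlock(n); /* early-error exit */
				return;
			}
		}

		/* the cmpxchg loop would run here; it only assigns n if still NULL */

		if (n == NULL) {
			/* lock-free exit: unreachable in the debug case above */
			return;
		}

		pthread_mutex_unlock(n);	/* all remaining exits drop the lock */
	}

	int main(void)
	{
		slab_free_slowpath(true);
		puts("debug free path completed without leaking the node lock");
		return 0;
	}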

Don Morris

> 
>> +	}
>>  
>>  	do {
>>  		prior = page->freelist;
>> @@ -2467,7 +2477,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>>  
>>  			else { /* Needs to be taken off a list */
>>  
>> -	                        n = get_node(s, page_to_nid(page));
>> +				n = get_node(s, page_to_nid(page));
>>  				/*
>>  				 * Speculatively acquire the list_lock.
>>  				 * If the cmpxchg does not succeed then we may
>> @@ -2501,10 +2511,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>>  		 * The list lock was not taken therefore no list
>>  		 * activity can be necessary.
>>  		 */
>> -                if (was_frozen)
>> -                        stat(s, FREE_FROZEN);
>> -                return;
>> -        }
>> +		if (was_frozen)
>> +			stat(s, FREE_FROZEN);
>> +		return;
>> +	}
>>  
>>  	/*
>>  	 * was_frozen may have been set after we acquired the list_lock in
>> @@ -2514,7 +2524,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>>  		stat(s, FREE_FROZEN);
>>  	else {
>>  		if (unlikely(!inuse && n->nr_partial > s->min_partial))
>> -                        goto slab_empty;
>> +			goto slab_empty;
>>  
>>  		/*
>>  		 * Objects left in the slab. If it was not on the partial list before
>> @@ -4122,7 +4132,7 @@ static void validate_slab_slab(struct kmem_cache *s, struct page *page,
>>  static int validate_slab_node(struct kmem_cache *s,
>>  		struct kmem_cache_node *n, unsigned long *map)
>>  {
>> -	unsigned long count = 0;
>> +	unsigned long count = 0, n_count;
>>  	struct page *page;
>>  	unsigned long flags;
>>  
>> @@ -4143,10 +4153,20 @@ static int validate_slab_node(struct kmem_cache *s,
>>  		validate_slab_slab(s, page, map);
>>  		count++;
>>  	}
>> -	if (count != atomic_long_read(&n->nr_slabs))
>> -		printk(KERN_ERR "SLUB: %s %ld slabs counted but "
>> -			"counter=%ld\n", s->name, count,
>> -			atomic_long_read(&n->nr_slabs));
>> +	n_count = atomic_long_read(&n->nr_slabs);
>> +	/*
>> +	 * The following workaround is to greatly reduce the chance of counter
>> +	 * mismatch messages due to the fact that inc_slabs_node() and the
>> +	 * subsequent insertion into the partial or full slab list is not
>> +	 * atomic. Consequently, there is a small timing window when the two
>> +	 * are not in the same state. A possible fix is to take the node lock
>> +	 * while doing inc_slabs_node() and slab insertion, but that may
>> +	 * require substantial changes to existing slow path slab allocation
>> +	 * logic.
>> +	 */
>> +	if ((count != n_count) && (count + 1 != n_count))
>> +		printk(KERN_ERR "SLUB: %s %ld slabs counted but counter=%ld\n",
>> +			s->name, count, n_count);
>>  
>>  out:
>>  	spin_unlock_irqrestore(&n->list_lock, flags);
> 
> 
> .
> 


* Re: [PATCH] slub: prevent validate_slab() error due to race condition
  2012-04-26 19:20     ` Don Morris
@ 2012-04-26 19:44       ` Eric Dumazet
  -1 siblings, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2012-04-26 19:44 UTC (permalink / raw)
  To: Don Morris; +Cc: Waiman Long, cl, penberg, mpm, linux-mm, linux-kernel

On Thu, 2012-04-26 at 12:20 -0700, Don Morris wrote:

> Note that he sets n here, hence the if() block on 2458 can not
> be taken (!n fails) and the if(likely(!n)) is not taken for the
> same reason. As such, the code falls through to the returns for
> either the slab being empty (or not) where the node lock is
> released (2529 / 2543).

Ah yes, you're right, thanks for clarification.




* Re: [PATCH] slub: prevent validate_slab() error due to race condition
  2012-04-26 18:57 ` Waiman Long
@ 2012-04-27 14:59   ` Christoph Lameter
  -1 siblings, 0 replies; 18+ messages in thread
From: Christoph Lameter @ 2012-04-27 14:59 UTC (permalink / raw)
  To: Waiman Long; +Cc: penberg, mpm, linux-mm, linux-kernel, don.morris


On Thu, 26 Apr 2012, Waiman Long wrote:

> The SLUB memory allocator was changed substantially from 3.0 to 3.1 by
> replacing some of page locking codes for updating the free object list
> of the slab with double-quadword atomic exchange (cmpxchg_double_slab)
> or a pseudo one using a page lock when debugging is turned on.  In the
> normal case, that should be enough to make sure that the slab is in a
> consistent state. However, when CONFIG_SLUB_DEBUG is turned on and the
> Redzone debugging flag is set, the Redzone bytes are also used to mark
> if an object is free or allocated. The extra state information in those
> Redzone bytes is not protected by the cmpxchg_double_slab(). As a
> result,
> validate_slab() may report a Redzone error if the validation is
> performed
> while racing with a free to a debugged slab.

Right. The problem is unique to validate_slab because no one else outside
the slab has access to that object that is to be freed.

> This patch fixes the BUG message by acquiring the node-level lock for
> slabs flagged for debugging to avoid this possible racing condition.
> The locking is done on the node-level lock instead of the more granular
> page lock because the new code may speculatively acquire the node-level
> lock later on. Acquiring the page lock and then the node lock may lead
> to potential deadlock.

Correct that would address the issue.

> As the increment of slab node count and insertion of the new slab into
> the partial or full slab list is not an atomic operation, there is a
> small time window where the two may not match. This patch temporarily
> works around this problem by allowing the node count to be one larger
> than the number of slab presents in the lists. This workaround may not
> work if more than one CPU is actively adding slab to the same node,
> but it should be good enough to workaround the problem in most cases.

Well, yeah, that is not a real fix. It's been racy for a long time.

> To really fix the issue, the overall synchronization between debug slub
> operations and slub validation needs a revisit.

True. The synchronization between validation and concurrent operations has
not been in focus so far. Validation has been used (at least by me) only to
check that the slab structures are still okay after running some tests.

Let's just do this in steps. The first patch here simply takes the node
lock in free_debug_processing(). This is similar to what you have done,
but the changes are made to the debugging function instead. Then we can
look at how to address the slab counting issue in a separate patch.


Subject: slub: Take node lock during object free checks

This is needed for proper synchronization with validate_slab()
as pointed out by Waiman Long <Waiman.Long@hp.com>

Reported-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Christoph Lameter <cl@linux.com>


---
 mm/slub.c |   30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2012-04-27 09:40:00.000000000 -0500
+++ linux-2.6/mm/slub.c	2012-04-27 09:50:15.000000000 -0500
@@ -1082,13 +1082,13 @@ bad:
 	return 0;
 }

-static noinline int free_debug_processing(struct kmem_cache *s,
-		 struct page *page, void *object, unsigned long addr)
+static noinline struct kmem_cache_node *free_debug_processing(
+	struct kmem_cache *s, struct page *page, void *object,
+	unsigned long addr, unsigned long *flags)
 {
-	unsigned long flags;
-	int rc = 0;
+	struct kmem_cache_node *n = get_node(s, page_to_nid(page));

-	local_irq_save(flags);
+	spin_lock_irqsave(&n->list_lock, *flags);
 	slab_lock(page);

 	if (!check_slab(s, page))
@@ -1126,15 +1126,19 @@ static noinline int free_debug_processin
 		set_track(s, object, TRACK_FREE, addr);
 	trace(s, page, object, 0);
 	init_object(s, object, SLUB_RED_INACTIVE);
-	rc = 1;
 out:
 	slab_unlock(page);
-	local_irq_restore(flags);
-	return rc;
+	/*
+	 * Keep node_lock to preserve integrity
+	 * until the object is actually freed
+	 */
+	return n;

 fail:
+	slab_unlock(page);
+	spin_unlock_irqrestore(&n->list_lock, *flags);
 	slab_fix(s, "Object at 0x%p not freed", object);
-	goto out;
+	return NULL;
 }

 static int __init setup_slub_debug(char *str)
@@ -1227,8 +1231,9 @@ static inline void setup_object_debug(st
 static inline int alloc_debug_processing(struct kmem_cache *s,
 	struct page *page, void *object, unsigned long addr) { return 0; }

-static inline int free_debug_processing(struct kmem_cache *s,
-	struct page *page, void *object, unsigned long addr) { return 0; }
+static inline struct kmem_cache_node *free_debug_processing(
+	struct kmem_cache *s, struct page *page, void *object,
+	unsigned long addr, unsigned long *flags) { return NULL; }

 static inline int slab_pad_check(struct kmem_cache *s, struct page *page)
 			{ return 1; }
@@ -2445,7 +2450,8 @@ static void __slab_free(struct kmem_cach

 	stat(s, FREE_SLOWPATH);

-	if (kmem_cache_debug(s) && !free_debug_processing(s, page, x, addr))
+	if (kmem_cache_debug(s) &&
+		!(n = free_debug_processing(s, page, x, addr, &flags)))
 		return;

 	do {

* Re: [PATCH] slub: prevent validate_slab() error due to race condition
  2012-04-27 14:59   ` Christoph Lameter
@ 2012-04-27 20:10     ` Waiman Long
  -1 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2012-04-27 20:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: penberg, mpm, linux-mm, linux-kernel, Morris,
	Donald George (HP-UX Cupertino)

On 4/27/2012 10:59 AM, Christoph Lameter wrote:
> On Thu, 26 Apr 2012, Waiman Long wrote:
>
>> The SLUB memory allocator was changed substantially from 3.0 to 3.1 by
>> replacing some of page locking codes for updating the free object list
>> of the slab with double-quadword atomic exchange (cmpxchg_double_slab)
>> or a pseudo one using a page lock when debugging is turned on.  In the
>> normal case, that should be enough to make sure that the slab is in a
>> consistent state. However, when CONFIG_SLUB_DEBUG is turned on and the
>> Redzone debugging flag is set, the Redzone bytes are also used to mark
>> if an object is free or allocated. The extra state information in those
>> Redzone bytes is not protected by the cmpxchg_double_slab(). As a
>> result,
>> validate_slab() may report a Redzone error if the validation is
>> performed
>> while racing with a free to a debugged slab.
> Right. The problem is unique to validate_slab because no one else outside
> the slab has access to that object that is to be freed.
>
>> This patch fixes the BUG message by acquiring the node-level lock for
>> slabs flagged for debugging to avoid this possible racing condition.
>> The locking is done on the node-level lock instead of the more granular
>> page lock because the new code may speculatively acquire the node-level
>> lock later on. Acquiring the page lock and then the node lock may lead
>> to potential deadlock.
> Correct that would address the issue.
>
>> As the increment of slab node count and insertion of the new slab into
>> the partial or full slab list is not an atomic operation, there is a
>> small time window where the two may not match. This patch temporarily
>> works around this problem by allowing the node count to be one larger
>> than the number of slab presents in the lists. This workaround may not
>> work if more than one CPU is actively adding slab to the same node,
>> but it should be good enough to workaround the problem in most cases.
> Well yeah that is not a real fix. Its been racy for a long time.
>
>> To really fix the issue, the overall synchronization between debug slub
>> operations and slub validation needs a revisit.
> True. The sync between validation and concurrent operations was not in
> focus so far. Validation was used (at least by me) so far only to validate
> that the slab structures are still okay after running some tests.
>
> Lets just do this in steps. First patch here is a simple taking of the
> node lock in free_debug_processing. This is similar to what you have done
> but the changes are made to the debugging function instead. Then we can
> look at how to address the slab counting issue in a separate patch.
>

Thanks for the quick response. I have no problem with moving the node-lock 
taking into free_debug_processing. Of the two problems that were reported, 
this is the more serious one and so needs to be fixed sooner rather than 
later. For the other one, we can take more time to find a better solution.

So are you going to integrate your change into the mainline?

> Subject: slub: Take node lock during object free checks
>
> This is needed for proper synchronization with validate_slab()
> as pointed out by Waiman Long<Waiman.Long@hp.com>
>
> Reported-by: Waiman Long<Waiman.Long@hp.com>
> Signed-off-by: Christoph Lameter<cl@linux.com>
>
>
> ---
>   mm/slub.c |   30 ++++++++++++++++++------------
>   1 file changed, 18 insertions(+), 12 deletions(-)
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2012-04-27 09:40:00.000000000 -0500
> +++ linux-2.6/mm/slub.c	2012-04-27 09:50:15.000000000 -0500
> @@ -1082,13 +1082,13 @@ bad:
>   	return 0;
>   }
>
> -static noinline int free_debug_processing(struct kmem_cache *s,
> -		 struct page *page, void *object, unsigned long addr)
> +static noinline struct kmem_cache_node *free_debug_processing(
> +	struct kmem_cache *s, struct page *page, void *object,
> +	unsigned long addr, unsigned long *flags)
>   {
> -	unsigned long flags;
> -	int rc = 0;
> +	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
>
> -	local_irq_save(flags);
> +	spin_lock_irqsave(&n->list_lock, *flags);
>   	slab_lock(page);
>
>   	if (!check_slab(s, page))
> @@ -1126,15 +1126,19 @@ static noinline int free_debug_processin
>   		set_track(s, object, TRACK_FREE, addr);
>   	trace(s, page, object, 0);
>   	init_object(s, object, SLUB_RED_INACTIVE);
> -	rc = 1;
>   out:
>   	slab_unlock(page);
> -	local_irq_restore(flags);
> -	return rc;
> +	/*
> +	 * Keep node_lock to preserve integrity
> +	 * until the object is actually freed
> +	 */
> +	return n;
>
>   fail:
> +	slab_unlock(page);
> +	spin_unlock_irqrestore(&n->list_lock, *flags);
>   	slab_fix(s, "Object at 0x%p not freed", object);
> -	goto out;
> +	return NULL;
>   }
>
>   static int __init setup_slub_debug(char *str)
> @@ -1227,8 +1231,9 @@ static inline void setup_object_debug(st
>   static inline int alloc_debug_processing(struct kmem_cache *s,
>   	struct page *page, void *object, unsigned long addr) { return 0; }
>
> -static inline int free_debug_processing(struct kmem_cache *s,
> -	struct page *page, void *object, unsigned long addr) { return 0; }
> +static inline struct kmem_cache_node *free_debug_processing(
> +	struct kmem_cache *s, struct page *page, void *object,
> +	unsigned long addr, unsigned long *flags) { return NULL; }
>
>   static inline int slab_pad_check(struct kmem_cache *s, struct page *page)
>   			{ return 1; }
> @@ -2445,7 +2450,8 @@ static void __slab_free(struct kmem_cach
>
>   	stat(s, FREE_SLOWPATH);
>
> -	if (kmem_cache_debug(s)&&  !free_debug_processing(s, page, x, addr))
> +	if (kmem_cache_debug(s)&&
> +		!(n = free_debug_processing(s, page, x, addr,&flags)))
>   		return;
>
>   	do {
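
For readers skimming the thread, the key point above is that validation walks a node's slab lists while holding n->list_lock, which is why taking the same lock in free_debug_processing() is enough to serialize the two paths. The following is a rough, from-memory sketch of the validation side; names such as sketch_validate_node() are illustrative and not the actual mm/slub.c function.

/*
 * Rough sketch of the validation side; not verbatim kernel code.
 * Because the walk happens under n->list_lock, a debug free that also
 * holds n->list_lock cannot flip redzone/poison bytes in the middle of
 * a check, which is the false positive seen without the patch.
 */
static unsigned long sketch_validate_node(struct kmem_cache *s,
					  struct kmem_cache_node *n)
{
	unsigned long count = 0;
	unsigned long flags;
	struct page *page;

	spin_lock_irqsave(&n->list_lock, flags);

	list_for_each_entry(page, &n->partial, lru) {
		/* per-object redzone/poison checks go here */
		count++;
	}
	/* slabs on the full list are checked the same way when tracking is on */

	spin_unlock_irqrestore(&n->list_lock, flags);
	return count;
}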


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] slub: prevent validate_slab() error due to race condition
  2012-04-27 20:10     ` Waiman Long
@ 2012-04-30  6:35       ` Pekka Enberg
  -1 siblings, 0 replies; 18+ messages in thread
From: Pekka Enberg @ 2012-04-30  6:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, mpm, linux-mm, linux-kernel, Morris,
	Donald George (HP-UX Cupertino),
	David Rientjes, Eric Dumazet

On Fri, Apr 27, 2012 at 11:10 PM, Waiman Long <waiman.long@hp.com> wrote:
> Thank for the quick response. I have no problem for moving the node-lock
> taking into free_debug_processing. Of the 2 problems that are reported, this
> is a more serious one and so need to be fixed sooner rather than later. For
> the other one, we can take more time to find a better solution.
>
> So are you going to integrate your change to the mainline?

Christoph, can you send the patch with an improved changelog that also
explains what the problem is?

How far back in the stable series do we want to backport this?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] slub: prevent validate_slab() error due to race condition
  2012-04-30  6:35       ` Pekka Enberg
@ 2012-05-01 20:23         ` Christoph Lameter
  -1 siblings, 0 replies; 18+ messages in thread
From: Christoph Lameter @ 2012-05-01 20:23 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Waiman Long, mpm, linux-mm, linux-kernel, Morris,
	Donald George (HP-UX Cupertino),
	David Rientjes, Eric Dumazet

On Mon, 30 Apr 2012, Pekka Enberg wrote:

> On Fri, Apr 27, 2012 at 11:10 PM, Waiman Long <waiman.long@hp.com> wrote:
> > Thank for the quick response. I have no problem for moving the node-lock
> > taking into free_debug_processing. Of the 2 problems that are reported, this
> > is a more serious one and so need to be fixed sooner rather than later. For
> > the other one, we can take more time to find a better solution.
> >
> > So are you going to integrate your change to the mainline?
>
> Christoph, can you send the patch with an improved changelog that also
> explains what the problem is?


Will do so once I get back from the conference I am at.

> How far back in the stable series do we want to backport this?

This only affects slab validation when running with debugging, so I would
suggest merging it in the next cycle.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] slub: prevent validate_slab() error due to race condition
       [not found]           ` <alpine.DEB.2.00.1205301039420.29257@router.home>
@ 2012-05-30 17:54             ` Christoph Lameter
  2012-08-16  7:02               ` Pekka Enberg
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2012-05-30 17:54 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, waiman.long

Updated patch:

Subject: slub: Take node lock during object free checks

Only applies to scenarios where debugging is on:

Validation of slabs can currently occur while debugging
information is updated from the fast paths of the allocator.
This results in various races where we get false reports about
slab metadata not being in order.

This patch makes the fast paths take the node lock so that
serialization with slab validation will occur. Causes additional
slowdown in debug scenarios.

Reported-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Christoph Lameter <cl@linux.com>


---
 mm/slub.c |   30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2012-05-21 08:58:30.000000000 -0500
+++ linux-2.6/mm/slub.c	2012-05-30 12:53:29.000000000 -0500
@@ -1082,13 +1082,13 @@ bad:
 	return 0;
 }

-static noinline int free_debug_processing(struct kmem_cache *s,
-		 struct page *page, void *object, unsigned long addr)
+static noinline struct kmem_cache_node *free_debug_processing(
+	struct kmem_cache *s, struct page *page, void *object,
+	unsigned long addr, unsigned long *flags)
 {
-	unsigned long flags;
-	int rc = 0;
+	struct kmem_cache_node *n = get_node(s, page_to_nid(page));

-	local_irq_save(flags);
+	spin_lock_irqsave(&n->list_lock, *flags);
 	slab_lock(page);

 	if (!check_slab(s, page))
@@ -1126,15 +1126,19 @@ static noinline int free_debug_processin
 		set_track(s, object, TRACK_FREE, addr);
 	trace(s, page, object, 0);
 	init_object(s, object, SLUB_RED_INACTIVE);
-	rc = 1;
 out:
 	slab_unlock(page);
-	local_irq_restore(flags);
-	return rc;
+	/*
+	 * Keep node_lock to preserve integrity
+	 * until the object is actually freed
+	 */
+	return n;

 fail:
+	slab_unlock(page);
+	spin_unlock_irqrestore(&n->list_lock, *flags);
 	slab_fix(s, "Object at 0x%p not freed", object);
-	goto out;
+	return NULL;
 }

 static int __init setup_slub_debug(char *str)
@@ -1227,8 +1231,9 @@ static inline void setup_object_debug(st
 static inline int alloc_debug_processing(struct kmem_cache *s,
 	struct page *page, void *object, unsigned long addr) { return 0; }

-static inline int free_debug_processing(struct kmem_cache *s,
-	struct page *page, void *object, unsigned long addr) { return 0; }
+static inline struct kmem_cache_node *free_debug_processing(
+	struct kmem_cache *s, struct page *page, void *object,
+	unsigned long addr, unsigned long *flags) { return NULL; }

 static inline int slab_pad_check(struct kmem_cache *s, struct page *page)
 			{ return 1; }
@@ -2445,7 +2450,8 @@ static void __slab_free(struct kmem_cach

 	stat(s, FREE_SLOWPATH);

-	if (kmem_cache_debug(s) && !free_debug_processing(s, page, x, addr))
+	if (kmem_cache_debug(s) &&
+		!(n = free_debug_processing(s, page, x, addr, &flags)))
 		return;

 	do {
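
To spell out the locking contract the hunks above establish: on success, free_debug_processing() now returns the kmem_cache_node with n->list_lock held and interrupts disabled (state saved in *flags); on failure it returns NULL with everything already unlocked. A hypothetical caller, shown only to illustrate that contract (example_debug_free() is not a function in the patch):

/* Hypothetical caller, illustrating the locking contract only. */
static void example_debug_free(struct kmem_cache *s, struct page *page,
			       void *object, unsigned long addr)
{
	struct kmem_cache_node *n;
	unsigned long flags;

	n = free_debug_processing(s, page, object, addr, &flags);
	if (!n)
		return;	/* checks failed; all locks already dropped */

	/*
	 * Success: n->list_lock is held with IRQs off and the saved state
	 * in 'flags'.  The caller finishes the free and must release the
	 * lock on every exit path, e.g.:
	 */
	spin_unlock_irqrestore(&n->list_lock, flags);
}

In the real __slab_free() the release happens later, once the object is actually back on the freelist, which is what the "Keep node_lock to preserve integrity until the object is actually freed" comment refers to.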


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] slub: prevent validate_slab() error due to race condition
  2012-05-30 17:54             ` Christoph Lameter
@ 2012-08-16  7:02               ` Pekka Enberg
  0 siblings, 0 replies; 18+ messages in thread
From: Pekka Enberg @ 2012-08-16  7:02 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, waiman.long, rientjes

On Wed, 30 May 2012, Christoph Lameter wrote:
> Subject: slub: Take node lock during object free checks
> 
> Only applies to scenarios where debugging is on:
> 
> Validation of slabs can currently occur while debugging
> information is updated from the fast paths of the allocator.
> This results in various races where we get false reports about
> slab metadata not being in order.
> 
> This patch makes the fast paths take the node lock so that
> serialization with slab validation will occur. Causes additional
> slowdown in debug scenarios.
> 
> Reported-by: Waiman Long <Waiman.Long@hp.com>
> Signed-off-by: Christoph Lameter <cl@linux.com>

Applied! Thanks and sorry for the delay.

> 
> ---
>  mm/slub.c |   30 ++++++++++++++++++------------
>  1 file changed, 18 insertions(+), 12 deletions(-)
> 
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2012-05-21 08:58:30.000000000 -0500
> +++ linux-2.6/mm/slub.c	2012-05-30 12:53:29.000000000 -0500
> @@ -1082,13 +1082,13 @@ bad:
>  	return 0;
>  }
> 
> -static noinline int free_debug_processing(struct kmem_cache *s,
> -		 struct page *page, void *object, unsigned long addr)
> +static noinline struct kmem_cache_node *free_debug_processing(
> +	struct kmem_cache *s, struct page *page, void *object,
> +	unsigned long addr, unsigned long *flags)
>  {
> -	unsigned long flags;
> -	int rc = 0;
> +	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
> 
> -	local_irq_save(flags);
> +	spin_lock_irqsave(&n->list_lock, *flags);
>  	slab_lock(page);
> 
>  	if (!check_slab(s, page))
> @@ -1126,15 +1126,19 @@ static noinline int free_debug_processin
>  		set_track(s, object, TRACK_FREE, addr);
>  	trace(s, page, object, 0);
>  	init_object(s, object, SLUB_RED_INACTIVE);
> -	rc = 1;
>  out:
>  	slab_unlock(page);
> -	local_irq_restore(flags);
> -	return rc;
> +	/*
> +	 * Keep node_lock to preserve integrity
> +	 * until the object is actually freed
> +	 */
> +	return n;
> 
>  fail:
> +	slab_unlock(page);
> +	spin_unlock_irqrestore(&n->list_lock, *flags);
>  	slab_fix(s, "Object at 0x%p not freed", object);
> -	goto out;
> +	return NULL;
>  }
> 
>  static int __init setup_slub_debug(char *str)
> @@ -1227,8 +1231,9 @@ static inline void setup_object_debug(st
>  static inline int alloc_debug_processing(struct kmem_cache *s,
>  	struct page *page, void *object, unsigned long addr) { return 0; }
> 
> -static inline int free_debug_processing(struct kmem_cache *s,
> -	struct page *page, void *object, unsigned long addr) { return 0; }
> +static inline struct kmem_cache_node *free_debug_processing(
> +	struct kmem_cache *s, struct page *page, void *object,
> +	unsigned long addr, unsigned long *flags) { return NULL; }
> 
>  static inline int slab_pad_check(struct kmem_cache *s, struct page *page)
>  			{ return 1; }
> @@ -2445,7 +2450,8 @@ static void __slab_free(struct kmem_cach
> 
>  	stat(s, FREE_SLOWPATH);
> 
> -	if (kmem_cache_debug(s) && !free_debug_processing(s, page, x, addr))
> +	if (kmem_cache_debug(s) &&
> +		!(n = free_debug_processing(s, page, x, addr, &flags)))
>  		return;
> 
>  	do {
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2012-08-16  7:02 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-26 18:57 [PATCH] slub: prevent validate_slab() error due to race condition Waiman Long
2012-04-26 18:57 ` Waiman Long
2012-04-26 19:12 ` Eric Dumazet
2012-04-26 19:12   ` Eric Dumazet
2012-04-26 19:20   ` Don Morris
2012-04-26 19:20     ` Don Morris
2012-04-26 19:44     ` Eric Dumazet
2012-04-26 19:44       ` Eric Dumazet
2012-04-27 14:59 ` Christoph Lameter
2012-04-27 14:59   ` Christoph Lameter
2012-04-27 20:10   ` Waiman Long
2012-04-27 20:10     ` Waiman Long
2012-04-30  6:35     ` Pekka Enberg
2012-04-30  6:35       ` Pekka Enberg
2012-05-01 20:23       ` Christoph Lameter
2012-05-01 20:23         ` Christoph Lameter
     [not found]         ` <alpine.LFD.2.02.1205300946170.2681@tux.localdomain>
     [not found]           ` <alpine.DEB.2.00.1205301039420.29257@router.home>
2012-05-30 17:54             ` Christoph Lameter
2012-08-16  7:02               ` Pekka Enberg
