* [PATCH] FIX: Mdmon crashes after changing RAID level from 1 to 0
@ 2011-09-01 13:10 Lukasz Dorau
  2011-09-03 10:48 ` Jan Ceuleers
  2011-09-07  2:41 ` NeilBrown
  0 siblings, 2 replies; 3+ messages in thread
From: Lukasz Dorau @ 2011-09-01 13:10 UTC
  To: neilb; +Cc: linux-raid, marcin.labun, ed.ciechanowski

Description of the bug:
Sometimes mdmon crashes after changing RAID level from 1 to 0 (takeover).

Cause of the bug:
managemon marks an active_array for removal from monitoring
by setting a->container to NULL (in the "manage_member" function).
Under stress testing this sometimes happens while the monitor
is inside the "read_and_act" function and still using the a->container
pointer, which causes the monitor to crash.

Solution:
The active array has to be marked for removal in some other way
than setting a pointer to NULL while it may still be in use.
A new field "to_remove" is added to the "active_array" structure.
managemon sets it to mark an array for removal
(instead of the old assignment: a->container = NULL),
and the monitor checks it to determine whether the array should be removed.
The field is also checked in "manage" and "reconcile_failed", and
elsewhere in "wait_and_act", so that an array which is about to be
removed is no longer managed.
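
The idea can be sketched outside mdmon as follows. This is only an
illustration under simplifying assumptions: the structures are cut down
to the fields that matter, and "mark_for_removal", "monitor_pass" and
the "polled" counter are hypothetical names that do not exist in mdmon.
The sketch is single-threaded and does not model how the two threads
synchronize; it only shows that the container pointer stays valid once
removal is requested through a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for mdmon's structures. */
struct supertype {
	int dummy;
};

struct active_array {
	struct supertype *container;  /* never cleared while in use */
	struct active_array *next;
	int to_remove;                /* set by the manager side */
	int polled;                   /* illustration only: visit count */
};

/* Manager side: request removal without touching a->container. */
static void mark_for_removal(struct active_array *a)
{
	a->to_remove = 1;
}

/*
 * Monitor side: one pass over the list. Arrays that have no container
 * or are marked for removal are skipped. Returns the number of arrays
 * that were actually handled.
 */
static int monitor_pass(struct active_array *head)
{
	int handled = 0;

	for (struct active_array *a = head; a; a = a->next) {
		if (!a->container || a->to_remove)
			continue;
		/* The pointer is still valid here, so this is safe. */
		(void)a->container->dummy;
		a->polled++;
		handled++;
	}
	return handled;
}
```

With two arrays in the list, a first call to monitor_pass() handles
both; after mark_for_removal() on one of them, the next pass handles
only the other, and the marked array still has its container pointer.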

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
---
 managemon.c |    4 ++--
 mdmon.h     |    1 +
 monitor.c   |    8 ++++----
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/managemon.c b/managemon.c
index d020f82..9e0a34d 100644
--- a/managemon.c
+++ b/managemon.c
@@ -461,7 +461,7 @@ static void manage_member(struct mdstat_ent *mdstat,
 	if (mdstat->level) {
 		int level = map_name(pers, mdstat->level);
 		if (level == 0 || level == LEVEL_LINEAR) {
-			a->container = NULL;
+			a->to_remove = 1;
 			wakeup_monitor();
 			return;
 		}
@@ -739,7 +739,7 @@ void manage(struct mdstat_ent *mdstat, struct supertype *container)
 		/* Looks like a member of this container */
 		for (a = container->arrays; a; a = a->next) {
 			if (mdstat->devnum == a->devnum) {
-				if (a->container)
+				if (a->container && a->to_remove == 0)
 					manage_member(mdstat, a);
 				break;
 			}
diff --git a/mdmon.h b/mdmon.h
index 6d1776f..59e1b53 100644
--- a/mdmon.h
+++ b/mdmon.h
@@ -28,6 +28,7 @@ struct active_array {
 	struct mdinfo info;
 	struct supertype *container;
 	struct active_array *next, *replaces;
+	int to_remove;
 
 	int action_fd;
 	int resync_start_fd;
diff --git a/monitor.c b/monitor.c
index 7ac5907..b002e90 100644
--- a/monitor.c
+++ b/monitor.c
@@ -479,7 +479,7 @@ static void reconcile_failed(struct active_array *aa, struct mdinfo *failed)
 	struct mdinfo *victim;
 
 	for (a = aa; a; a = a->next) {
-		if (!a->container)
+		if (!a->container || a->to_remove)
 			continue;
 		victim = find_device(a, failed->disk.major, failed->disk.minor);
 		if (!victim)
@@ -539,7 +539,7 @@ static int wait_and_act(struct supertype *container, int nowait)
 		/* once an array has been deactivated we want to
 		 * ask the manager to discard it.
 		 */
-		if (!a->container) {
+		if (!a->container || a->to_remove) {
 			if (discard_this) {
 				ap = &(*ap)->next;
 				continue;
@@ -642,7 +642,7 @@ static int wait_and_act(struct supertype *container, int nowait)
 			/* FIXME check if device->state_fd need to be cleared?*/
 			signal_manager();
 		}
-		if (a->container) {
+		if (a->container && !a->to_remove) {
 			is_dirty = read_and_act(a);
 			rv |= 1;
 			dirty_arrays += is_dirty;
@@ -657,7 +657,7 @@ static int wait_and_act(struct supertype *container, int nowait)
 
 	/* propagate failures across container members */
 	for (a = *aap; a ; a = a->next) {
-		if (!a->container)
+		if (!a->container || a->to_remove)
 			continue;
 		for (mdi = a->info.devs ; mdi ; mdi = mdi->next)
 			if (mdi->curr_state & DS_FAULTY)

---------------------------------------------------------------------
Intel Technology Poland sp. z o.o.
z siedziba w Gdansku
ul. Slowackiego 173
80-298 Gdansk

Sad Rejonowy Gdansk Polnoc w Gdansku, 
VII Wydzial Gospodarczy Krajowego Rejestru Sadowego, 
numer KRS 101882

NIP 957-07-52-316
Kapital zakladowy 200.000 zl

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


* Re: [PATCH] FIX: Mdmon crashes after changing RAID level from 1 to 0
  2011-09-01 13:10 [PATCH] FIX: Mdmon crashes after changing RAID level from 1 to 0 Lukasz Dorau
@ 2011-09-03 10:48 ` Jan Ceuleers
  2011-09-07  2:41 ` NeilBrown
  1 sibling, 0 replies; 3+ messages in thread
From: Jan Ceuleers @ 2011-09-03 10:48 UTC
  To: Lukasz Dorau; +Cc: neilb, linux-raid, marcin.labun, ed.ciechanowski

On 09/01/2011 03:10 PM, Lukasz Dorau wrote:
> ---------------------------------------------------------------------
> Intel Technology Poland sp. z o.o.
> z siedziba w Gdansku
> ul. Slowackiego 173
> 80-298 Gdansk
>
> Sad Rejonowy Gdansk Polnoc w Gdansku,
> VII Wydzial Gospodarczy Krajowego Rejestru Sadowego,
> numer KRS 101882
>
> NIP 957-07-52-316
> Kapital zakladowy 200.000 zl
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
Hi Lukasz.

I'm not a maintainer of any kind, but I think your contributions are 
unusable because of the above footer, particularly the confidentiality 
clause. Can you resubmit, after reconfiguring your mail setup not to 
include this footer?

Thanks, Jan


* Re: [PATCH] FIX: Mdmon crashes after changing RAID level from 1 to 0
  2011-09-01 13:10 [PATCH] FIX: Mdmon crashes after changing RAID level from 1 to 0 Lukasz Dorau
  2011-09-03 10:48 ` Jan Ceuleers
@ 2011-09-07  2:41 ` NeilBrown
  1 sibling, 0 replies; 3+ messages in thread
From: NeilBrown @ 2011-09-07  2:41 UTC
  To: Lukasz Dorau; +Cc: linux-raid, marcin.labun, ed.ciechanowski

On Thu, 01 Sep 2011 15:10:34 +0200 Lukasz Dorau <lukasz.dorau@intel.com>
wrote:

> Description of the bug:
> Sometimes mdmon crashes after changing RAID level from 1 to 0 (takeover).
> 
> Cause of the bug:
> The managemon marks an active_array for removal from monitoring
> by assigning a->container to NULL value (in the "manage_member" function).
> Sometimes (during stress test) it happens right when the monitor
> is in the "read_and_act" function and a->container pointer is in use.
> This causes the monitor to crash.
> 
> Solution:
> The active array has to be marked for removal in another way
> than setting NULL pointer when it can be in use.
> A new field "to_remove" was added to the "active_array" structure.
> It is used in the managemon to mark a container to remove
> (instead of the old assignment: a->container = NULL)
> and monitor checks it to determine if the array should be removed.
> The field "to_remove" should be checked in some other places
> to avoid managing of the array which is going to be removed.
> 
> Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>

Thanks.

I have applied this - despite the ridiculous disclaimer at the bottom :-)

NeilBrown



