RE: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* RE: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <BAY151-W32DCB4BAFEC97DD4913A12A1090@phx.gbl>
@ 2011-09-17 17:34 ` Alan Stern
  2011-09-18 23:00   ` Ben Hutchings
  0 siblings, 1 reply; 10+ messages in thread
From: Alan Stern @ 2011-09-17 17:34 UTC (permalink / raw)
  To: Rocko Requin; +Cc: tytso, Kernel development list

On Sat, 17 Sep 2011, Rocko Requin wrote:

> > Why were you using gnome-terminal?  You should be running the tests at
> > a console VT, not under X at all.  Ctrl-Alt-F2 or the equivalent...
> 
> Because with Ted's patch it doesn't crash when run from a console VT, even with an X server running.

That's weird.  Maybe the screen updates change some timing.

> > Here's another patch to address the new problem.  You can apply it on 
> > top of all the other patches.
> 
> Attached is the crash log I get with the latest patch applied.

Okay, more fallout from the same problem.  Here's an updated version of 
the previous patch.

These are really just bandaid-type fixes.  The people who understand
the block layer ought to be involved.  If this keeps up much longer
I'll get in touch with them.

Alan Stern


Index: usb-3.1/block/blk-core.c
===================================================================
--- usb-3.1.orig/block/blk-core.c
+++ usb-3.1/block/blk-core.c
@@ -367,8 +367,10 @@ void blk_cleanup_queue(struct request_qu
 	queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
 	mutex_unlock(&q->sysfs_lock);
 
-	if (q->elevator)
+	if (q->elevator) {
 		elevator_exit(q->elevator);
+		q->elevator = NULL;
+	}
 
 	blk_throtl_exit(q);
 
Index: usb-3.1/block/elevator.c
===================================================================
--- usb-3.1.orig/block/elevator.c
+++ usb-3.1/block/elevator.c
@@ -769,7 +769,7 @@ void elv_put_request(struct request_queu
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->ops->elevator_put_req_fn)
+	if (e && e->ops->elevator_put_req_fn)
 		e->ops->elevator_put_req_fn(rq);
 }
 
@@ -812,7 +812,7 @@ void elv_completed_request(struct reques
 	 */
 	if (blk_account_rq(rq)) {
 		q->in_flight[rq_is_sync(rq)]--;
-		if ((rq->cmd_flags & REQ_SORTED) &&
+		if ((rq->cmd_flags & REQ_SORTED) && e &&
 		    e->ops->elevator_completed_req_fn)
 			e->ops->elevator_completed_req_fn(q, rq);
 	}


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-17 17:34 ` [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed Alan Stern
@ 2011-09-18 23:00   ` Ben Hutchings
  2011-09-20  7:32     ` Jun'ichi Nomura
  0 siblings, 1 reply; 10+ messages in thread
From: Ben Hutchings @ 2011-09-18 23:00 UTC (permalink / raw)
  To: Alan Stern, James Bottomley
  Cc: Rocko Requin, tytso, Kernel development list, linux-scsi

[-- Attachment #1: Type: text/plain, Size: 1864 bytes --]

On Sat, 2011-09-17 at 13:34 -0400, Alan Stern wrote:
> On Sat, 17 Sep 2011, Rocko Requin wrote:
> 
> > > Why were you using gnome-terminal?  You should be running the tests at
> > > a console VT, not under X at all.  Ctrl-Alt-F2 or the equivalent...
> > 
> > Because with Ted's patch it doesn't crash when run from a console VT, even with an X server running.
> 
> That's weird.  Maybe the screen updates change some timing.
> 
> > > Here's another patch to address the new problem.  You can apply it on 
> > > top of all the other patches.
> > 
> > Attached is the crash log I get with the latest patch applied.
> 
> Okay, more fallout from the same problem.  Here's an updated version of 
> the previous patch.
[...]

There have been reports of this in Debian going back to 2.6.39:

http://bugs.debian.org/631187
http://bugs.debian.org/636263
http://bugs.debian.org/642043

Plus possibly related crashes in elv_put_request after CD-ROM removal:

http://bugs.debian.org/633890
http://bugs.debian.org/634681
http://bugs.debian.org/636103

The former was also reported in Ubuntu since their 2.6.38-10:

https://bugs.launchpad.net/debian/+source/linux-2.6/+bug/793796

The result of the discussion there was that it appeared to be a
regression due to commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b 
("[SCSI] put stricter guards on queue dead checks") which was also
included in a stable update for 2.6.38.

There was also a report on bugzilla.kernel.org, though no-one can see
quite what that says now:

https://bugzilla.kernel.org/show_bug.cgi?id=38842

I also reported most of the above to James Bottomley and linux-scsi
nearly 2 months ago, to no response.

Ben.

-- 
Ben Hutchings
Power corrupts.  Absolute power is kind of neat.
                           - John Lehman, Secretary of the US Navy 1981-1987

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-18 23:00   ` Ben Hutchings
@ 2011-09-20  7:32     ` Jun'ichi Nomura
  2011-09-22 12:26       ` Hannes Reinecke
  0 siblings, 1 reply; 10+ messages in thread
From: Jun'ichi Nomura @ 2011-09-20  7:32 UTC (permalink / raw)
  To: Ben Hutchings, jaxboe
  Cc: Alan Stern, James Bottomley, Rocko Requin, tytso,
	Kernel development list, linux-scsi

On 09/19/11 08:00, Ben Hutchings wrote:
> On Sat, 2011-09-17 at 13:34 -0400, Alan Stern wrote:
>> On Sat, 17 Sep 2011, Rocko Requin wrote:
>>
>>>> Why were you using gnome-terminal?  You should be running the tests at
>>>> a console VT, not under X at all.  Ctrl-Alt-F2 or the equivalent...
>>>
>>> Because with Ted's patch it doesn't crash when run from a console VT, even with an X server running.
>>
>> That's weird.  Maybe the screen updates change some timing.
>>
>>>> Here's another patch to address the new problem.  You can apply it on 
>>>> top of all the other patches.
>>>
>>> Attached is the crash log I get with the latest patch applied.
>>
>> Okay, more fallout from the same problem.  Here's an updated version of 
>> the previous patch.
> [...]
> 
> There have been reports of this in Debian going back to 2.6.39:
> 
> http://bugs.debian.org/631187
> http://bugs.debian.org/636263
> http://bugs.debian.org/642043
> 
> Plus possibly related crashes in elv_put_request after CD-ROM removal:
> 
> http://bugs.debian.org/633890
> http://bugs.debian.org/634681
> http://bugs.debian.org/636103
> 
> The former was also reported in Ubuntu since their 2.6.38-10:
> 
> https://bugs.launchpad.net/debian/+source/linux-2.6/+bug/793796
> 
> The result of the discussion there was that it appeared to be a
> regression due to commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b 
> ("[SCSI] put stricter guards on queue dead checks") which was also
> included in a stable update for 2.6.38.
> 
> There was also a report on bugzilla.kernel.org, though no-one can see
> quite what that says now:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=38842
> 
> I also reported most of the above to James Bottomley and linux-scsi
> nearly 2 months ago, to no response.

I've reported a similar oops related to the above commit:
  [BUG] Oops when SCSI device under multipath is removed
  https://lkml.org/lkml/2011/8/10/11

Elevator being removed is the core of the problem.
And the essential issue seems 2 different models of queue/driver relation
implied by queue_lock.

If reverting the commit is not an option,
until somebody comes up to fix the essential issue,
the patch below should close the regressions introduced by the commit.

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation


This patch moves elevator_exit() and blk_throtl_exit() from
blk_cleanup_queue() to blk_release_queue() when it is possible.

elevator_exit() and blk_throtl_exit() were called in blk_cleanup_queue()
because they use queue_lock.

There are 2 types of queue_locks:
  a) supplied by driver (via blk_init_queue)
  b) embedded in struct request_queue (__queue_lock)

When queue_lock is supplied by driver, there is no guarantee that
the pointer is valid after blk_cleanup_queue(), so they have to be
called in blk_cleanup_queue().
In this case, the driver has to make sure nobody is using the queue
before calling blk_cleanup_queue().

However, OTOH, if queue_lock is '__queue_lock' in request_queue,
blk_release_queue() is better place for freeing structures
because the block layer knows for sure there is no reference.

This patch is ugly but should fix various oopses introduced by this change:
  86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b
  [SCSI] put stricter guards on queue dead checks

For example:
  https://lkml.org/lkml/2011/8/10/11

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>

Index: linux-3.1-rc4/block/blk-core.c
===================================================================
--- linux-3.1-rc4.orig/block/blk-core.c	2011-08-29 13:16:01.000000000 +0900
+++ linux-3.1-rc4/block/blk-core.c	2011-09-20 15:53:23.496814819 +0900
@@ -352,6 +352,14 @@
  * unexpectedly as some queue cleanup components like elevator_exit() and
  * blk_throtl_exit() need queue lock.
  */
+void blk_release_queue_components_with_queuelock(struct request_queue *q)
+{
+	if (q->elevator)
+		elevator_exit(q->elevator);
+
+	blk_throtl_exit(q);
+}
+
 void blk_cleanup_queue(struct request_queue *q)
 {
 	/*
@@ -367,10 +375,12 @@
 	queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
 	mutex_unlock(&q->sysfs_lock);
 
-	if (q->elevator)
-		elevator_exit(q->elevator);
-
-	blk_throtl_exit(q);
+	/*
+	 * A driver supplied the queue lock.
+	 * Cleanup components while the queue lock is valid.
+	 */
+	if (q->queue_lock != &q->__queue_lock)
+		blk_release_queue_components_with_queuelock(q);
 
 	blk_put_queue(q);
 }
Index: linux-3.1-rc4/block/blk-sysfs.c
===================================================================
--- linux-3.1-rc4.orig/block/blk-sysfs.c	2011-09-19 09:38:51.000000000 +0900
+++ linux-3.1-rc4/block/blk-sysfs.c	2011-09-20 15:57:50.358807023 +0900
@@ -477,6 +477,9 @@
 
 	blk_sync_queue(q);
 
+	if (q->queue_lock == &q->__queue_lock)
+		blk_release_queue_components_with_queuelock(q);
+
 	if (rl->rq_pool)
 		mempool_destroy(rl->rq_pool);
 
Index: linux-3.1-rc4/block/blk.h
===================================================================
--- linux-3.1-rc4.orig/block/blk.h	2011-08-29 13:16:01.000000000 +0900
+++ linux-3.1-rc4/block/blk.h	2011-09-20 15:57:38.306807136 +0900
@@ -25,6 +25,9 @@
 void blk_add_timer(struct request *);
 void __generic_unplug_device(struct request_queue *);
 
+/* Wrapper to release functions to be called while queue_lock is valid */
+void blk_release_queue_components_with_queuelock(struct request_queue *q);
+
 /*
  * Internal atomic flags for request handling
  */

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-20  7:32     ` Jun'ichi Nomura
@ 2011-09-22 12:26       ` Hannes Reinecke
  2011-09-22 12:35         ` James Bottomley
  2011-09-22 15:16         ` Alan Stern
  0 siblings, 2 replies; 10+ messages in thread
From: Hannes Reinecke @ 2011-09-22 12:26 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Ben Hutchings, jaxboe, Alan Stern, James Bottomley, Rocko Requin,
	tytso, Kernel development list, linux-scsi

On 09/20/2011 09:32 AM, Jun'ichi Nomura wrote:
> On 09/19/11 08:00, Ben Hutchings wrote:
[ .. ]
>>
>> There have been reports of this in Debian going back to 2.6.39:
>>
>> http://bugs.debian.org/631187
>> http://bugs.debian.org/636263
>> http://bugs.debian.org/642043
>>
>> Plus possibly related crashes in elv_put_request after CD-ROM removal:
>>
>> http://bugs.debian.org/633890
>> http://bugs.debian.org/634681
>> http://bugs.debian.org/636103
>>
>> The former was also reported in Ubuntu since their 2.6.38-10:
>>
>> https://bugs.launchpad.net/debian/+source/linux-2.6/+bug/793796
>>
>> The result of the discussion there was that it appeared to be a
>> regression due to commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b 
>> ("[SCSI] put stricter guards on queue dead checks") which was also
>> included in a stable update for 2.6.38.
>>
>> There was also a report on bugzilla.kernel.org, though no-one can see
>> quite what that says now:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=38842
>>
>> I also reported most of the above to James Bottomley and linux-scsi
>> nearly 2 months ago, to no response.
> 
> I've reported a similar oops related to the above commit:
>   [BUG] Oops when SCSI device under multipath is removed
>   https://lkml.org/lkml/2011/8/10/11
> 
> Elevator being removed is the core of the problem.
> And the essential issue seems 2 different models of queue/driver relation
> implied by queue_lock.
> 
> If reverting the commit is not an option,
> until somebody comes up to fix the essential issue,
> the patch below should close the regressions introduced by the commit.
> 
Why do you have to do it that complicated?
Couldn't we just state that any external lock is being disconnected from
queue_lock after blk_cleanup_queue()?

Then something like this should suffice here:

diff --git a/block/blk-core.c b/block/blk-core.c
index 90e1ffd..a4ac005 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -367,10 +367,8 @@ void blk_cleanup_queue(struct request_queue *q)
        queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
        mutex_unlock(&q->sysfs_lock);

-       if (q->elevator)
-               elevator_exit(q->elevator);
-
-       blk_throtl_exit(q);
+       if (q->queue_lock != q->__queue_lock)
+               q->queue_lock = q->__queue_lock;

        blk_put_queue(q);
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0ee17b5..a5a756b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -477,6 +477,11 @@ static void blk_release_queue(struct kobject *kobj)

        blk_sync_queue(q);

+       if (q->elevator)
+               elevator_exit(q->elevator);
+
+       blk_throtl_exit(q);
+
        if (rl->rq_pool)
                mempool_destroy(rl->rq_pool);


And yeah, I find it pretty annoying, too.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke              zSeries & Storage
hare@suse.de                  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-22 12:26       ` Hannes Reinecke
@ 2011-09-22 12:35         ` James Bottomley
  2011-09-22 15:16         ` Alan Stern
  1 sibling, 0 replies; 10+ messages in thread
From: James Bottomley @ 2011-09-22 12:35 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jun'ichi Nomura, Ben Hutchings, jaxboe, Alan Stern,
	Rocko Requin, tytso, Kernel development list, linux-scsi

On Thu, 2011-09-22 at 14:26 +0200, Hannes Reinecke wrote:
> On 09/20/2011 09:32 AM, Jun'ichi Nomura wrote:
> > On 09/19/11 08:00, Ben Hutchings wrote:
> [ .. ]
> >>
> >> There have been reports of this in Debian going back to 2.6.39:
> >>
> >> http://bugs.debian.org/631187
> >> http://bugs.debian.org/636263
> >> http://bugs.debian.org/642043
> >>
> >> Plus possibly related crashes in elv_put_request after CD-ROM removal:
> >>
> >> http://bugs.debian.org/633890
> >> http://bugs.debian.org/634681
> >> http://bugs.debian.org/636103
> >>
> >> The former was also reported in Ubuntu since their 2.6.38-10:
> >>
> >> https://bugs.launchpad.net/debian/+source/linux-2.6/+bug/793796
> >>
> >> The result of the discussion there was that it appeared to be a
> >> regression due to commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b 
> >> ("[SCSI] put stricter guards on queue dead checks") which was also
> >> included in a stable update for 2.6.38.
> >>
> >> There was also a report on bugzilla.kernel.org, though no-one can see
> >> quite what that says now:
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=38842
> >>
> >> I also reported most of the above to James Bottomley and linux-scsi
> >> nearly 2 months ago, to no response.
> > 
> > I've reported a similar oops related to the above commit:
> >   [BUG] Oops when SCSI device under multipath is removed
> >   https://lkml.org/lkml/2011/8/10/11
> > 
> > Elevator being removed is the core of the problem.
> > And the essential issue seems 2 different models of queue/driver relation
> > implied by queue_lock.
> > 
> > If reverting the commit is not an option,
> > until somebody comes up to fix the essential issue,
> > the patch below should close the regressions introduced by the commit.
> > 
> Why do you have to do it that complicated?
> Couldn't we just state that any external lock is being disconnected from
> queue_lock after blk_cleanup_queue()?
> 
> Then something like this should suffice here:
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 90e1ffd..a4ac005 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -367,10 +367,8 @@ void blk_cleanup_queue(struct request_queue *q)
>         queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
>         mutex_unlock(&q->sysfs_lock);
> 
> -       if (q->elevator)
> -               elevator_exit(q->elevator);
> -
> -       blk_throtl_exit(q);
> +       if (q->queue_lock != q->__queue_lock)
> +               q->queue_lock = q->__queue_lock;
> 
>         blk_put_queue(q);
>  }
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 0ee17b5..a5a756b 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -477,6 +477,11 @@ static void blk_release_queue(struct kobject *kobj)
> 
>         blk_sync_queue(q);
> 
> +       if (q->elevator)
> +               elevator_exit(q->elevator);
> +
> +       blk_throtl_exit(q);
> +

OK, I'll buy this one (when you fix the whitespace issue ... you have
spaces instead of tabs).

The fact that the lock check/replacement doesn't actually need any
locking is probably worthy of a comment.

James



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-22 12:26       ` Hannes Reinecke
  2011-09-22 12:35         ` James Bottomley
@ 2011-09-22 15:16         ` Alan Stern
  2011-09-22 16:20           ` Thadeu Lima de Souza Cascardo
  1 sibling, 1 reply; 10+ messages in thread
From: Alan Stern @ 2011-09-22 15:16 UTC (permalink / raw)
  To: Rocko Requin
  Cc: Hannes Reinecke, Jun'ichi Nomura, Ben Hutchings, jaxboe,
	James Bottomley, tytso, Kernel development list, linux-scsi

Rocko:

Can you try testing this patch instead of all the patches I sent to 
you (but keep Ted's patch)?

Alan Stern

On Thu, 22 Sep 2011, Hannes Reinecke wrote:

> On 09/20/2011 09:32 AM, Jun'ichi Nomura wrote:
> > On 09/19/11 08:00, Ben Hutchings wrote:
> [ .. ]
> >>
> >> There have been reports of this in Debian going back to 2.6.39:
> >>
> >> http://bugs.debian.org/631187
> >> http://bugs.debian.org/636263
> >> http://bugs.debian.org/642043
> >>
> >> Plus possibly related crashes in elv_put_request after CD-ROM removal:
> >>
> >> http://bugs.debian.org/633890
> >> http://bugs.debian.org/634681
> >> http://bugs.debian.org/636103
> >>
> >> The former was also reported in Ubuntu since their 2.6.38-10:
> >>
> >> https://bugs.launchpad.net/debian/+source/linux-2.6/+bug/793796
> >>
> >> The result of the discussion there was that it appeared to be a
> >> regression due to commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b 
> >> ("[SCSI] put stricter guards on queue dead checks") which was also
> >> included in a stable update for 2.6.38.
> >>
> >> There was also a report on bugzilla.kernel.org, though no-one can see
> >> quite what that says now:
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=38842
> >>
> >> I also reported most of the above to James Bottomley and linux-scsi
> >> nearly 2 months ago, to no response.
> > 
> > I've reported a similar oops related to the above commit:
> >   [BUG] Oops when SCSI device under multipath is removed
> >   https://lkml.org/lkml/2011/8/10/11
> > 
> > Elevator being removed is the core of the problem.
> > And the essential issue seems 2 different models of queue/driver relation
> > implied by queue_lock.
> > 
> > If reverting the commit is not an option,
> > until somebody comes up to fix the essential issue,
> > the patch below should close the regressions introduced by the commit.
> > 
> Why do you have to do it that complicated?
> Couldn't we just state that any external lock is being disconnected from
> queue_lock after blk_cleanup_queue()?
> 
> Then something like this should suffice here:



diff --git a/block/blk-core.c b/block/blk-core.c
index 90e1ffd..a4ac005 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -367,10 +367,8 @@ void blk_cleanup_queue(struct request_queue *q)
        queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
        mutex_unlock(&q->sysfs_lock);

-       if (q->elevator)
-               elevator_exit(q->elevator);
-
-       blk_throtl_exit(q);
+       if (q->queue_lock != q->__queue_lock)
+               q->queue_lock = q->__queue_lock;

        blk_put_queue(q);
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0ee17b5..a5a756b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -477,6 +477,11 @@ static void blk_release_queue(struct kobject *kobj)

        blk_sync_queue(q);

+       if (q->elevator)
+               elevator_exit(q->elevator);
+
+       blk_throtl_exit(q);
+
        if (rl->rq_pool)
                mempool_destroy(rl->rq_pool);
 

> And yeah, I find it pretty annoying, too.
> 
> Cheers,
> 
> Hannes
> 


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-22 15:16         ` Alan Stern
@ 2011-09-22 16:20           ` Thadeu Lima de Souza Cascardo
  2011-09-22 16:32             ` Hannes Reinecke
  0 siblings, 1 reply; 10+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2011-09-22 16:20 UTC (permalink / raw)
  To: Alan Stern
  Cc: Rocko Requin, Hannes Reinecke, Jun'ichi Nomura,
	Ben Hutchings, jaxboe, James Bottomley, tytso,
	Kernel development list, linux-scsi

On Thu, Sep 22, 2011 at 11:16:30AM -0400, Alan Stern wrote:
> Rocko:
> 
> Can you try testing this patch instead of all the patches I sent to 
> you (but keep Ted's patch)?
> 
> Alan Stern
> 
> On Thu, 22 Sep 2011, Hannes Reinecke wrote:
> 
> > On 09/20/2011 09:32 AM, Jun'ichi Nomura wrote:
> > > On 09/19/11 08:00, Ben Hutchings wrote:
> > [ .. ]
> > >>
> > >> There have been reports of this in Debian going back to 2.6.39:
> > >>
> > >> http://bugs.debian.org/631187
> > >> http://bugs.debian.org/636263
> > >> http://bugs.debian.org/642043
> > >>
> > >> Plus possibly related crashes in elv_put_request after CD-ROM removal:
> > >>
> > >> http://bugs.debian.org/633890
> > >> http://bugs.debian.org/634681
> > >> http://bugs.debian.org/636103
> > >>
> > >> The former was also reported in Ubuntu since their 2.6.38-10:
> > >>
> > >> https://bugs.launchpad.net/debian/+source/linux-2.6/+bug/793796
> > >>
> > >> The result of the discussion there was that it appeared to be a
> > >> regression due to commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b 
> > >> ("[SCSI] put stricter guards on queue dead checks") which was also
> > >> included in a stable update for 2.6.38.
> > >>
> > >> There was also a report on bugzilla.kernel.org, though no-one can see
> > >> quite what that says now:
> > >>
> > >> https://bugzilla.kernel.org/show_bug.cgi?id=38842
> > >>
> > >> I also reported most of the above to James Bottomley and linux-scsi
> > >> nearly 2 months ago, to no response.
> > > 
> > > I've reported a similar oops related to the above commit:
> > >   [BUG] Oops when SCSI device under multipath is removed
> > >   https://lkml.org/lkml/2011/8/10/11
> > > 
> > > Elevator being removed is the core of the problem.
> > > And the essential issue seems 2 different models of queue/driver relation
> > > implied by queue_lock.
> > > 
> > > If reverting the commit is not an option,
> > > until somebody comes up to fix the essential issue,
> > > the patch below should close the regressions introduced by the commit.
> > > 
> > Why do you have to do it that complicated?
> > Couldn't we just state that any external lock is being disconnected from
> > queue_lock after blk_cleanup_queue()?
> > 
> > Then something like this should suffice here:
> 
> 
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 90e1ffd..a4ac005 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -367,10 +367,8 @@ void blk_cleanup_queue(struct request_queue *q)
>         queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
>         mutex_unlock(&q->sysfs_lock);
> 
> -       if (q->elevator)
> -               elevator_exit(q->elevator);
> -
> -       blk_throtl_exit(q);
> +       if (q->queue_lock != q->__queue_lock)
> +               q->queue_lock = q->__queue_lock;

That should be &q->__queue_lock.

Regards,
Cascardo.

> 
>         blk_put_queue(q);
>  }
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 0ee17b5..a5a756b 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -477,6 +477,11 @@ static void blk_release_queue(struct kobject *kobj)
> 
>         blk_sync_queue(q);
> 
> +       if (q->elevator)
> +               elevator_exit(q->elevator);
> +
> +       blk_throtl_exit(q);
> +
>         if (rl->rq_pool)
>                 mempool_destroy(rl->rq_pool);
> 
> 
> > And yeah, I find it pretty annoying, too.
> > 
> > Cheers,
> > 
> > Hannes
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-22 16:20           ` Thadeu Lima de Souza Cascardo
@ 2011-09-22 16:32             ` Hannes Reinecke
  0 siblings, 0 replies; 10+ messages in thread
From: Hannes Reinecke @ 2011-09-22 16:32 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo
  Cc: Alan Stern, Rocko Requin, Jun'ichi Nomura, Ben Hutchings,
	jaxboe, James Bottomley, tytso, Kernel development list,
	linux-scsi

On 09/22/2011 06:20 PM, Thadeu Lima de Souza Cascardo wrote:
> On Thu, Sep 22, 2011 at 11:16:30AM -0400, Alan Stern wrote:
>> Rocko:
>>
>> Can you try testing this patch instead of all the patches I sent to 
>> you (but keep Ted's patch)?
>>
>> Alan Stern
>>
>> On Thu, 22 Sep 2011, Hannes Reinecke wrote:
>>
>>> On 09/20/2011 09:32 AM, Jun'ichi Nomura wrote:
>>>> On 09/19/11 08:00, Ben Hutchings wrote:
>>> [ .. ]
>>>>>
>>>>> There have been reports of this in Debian going back to 2.6.39:
>>>>>
>>>>> http://bugs.debian.org/631187
>>>>> http://bugs.debian.org/636263
>>>>> http://bugs.debian.org/642043
>>>>>
>>>>> Plus possibly related crashes in elv_put_request after CD-ROM removal:
>>>>>
>>>>> http://bugs.debian.org/633890
>>>>> http://bugs.debian.org/634681
>>>>> http://bugs.debian.org/636103
>>>>>
>>>>> The former was also reported in Ubuntu since their 2.6.38-10:
>>>>>
>>>>> https://bugs.launchpad.net/debian/+source/linux-2.6/+bug/793796
>>>>>
>>>>> The result of the discussion there was that it appeared to be a
>>>>> regression due to commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b 
>>>>> ("[SCSI] put stricter guards on queue dead checks") which was also
>>>>> included in a stable update for 2.6.38.
>>>>>
>>>>> There was also a report on bugzilla.kernel.org, though no-one can see
>>>>> quite what that says now:
>>>>>
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=38842
>>>>>
>>>>> I also reported most of the above to James Bottomley and linux-scsi
>>>>> nearly 2 months ago, to no response.
>>>>
>>>> I've reported a similar oops related to the above commit:
>>>>   [BUG] Oops when SCSI device under multipath is removed
>>>>   https://lkml.org/lkml/2011/8/10/11
>>>>
>>>> Elevator being removed is the core of the problem.
>>>> And the essential issue seems 2 different models of queue/driver relation
>>>> implied by queue_lock.
>>>>
>>>> If reverting the commit is not an option,
>>>> until somebody comes up to fix the essential issue,
>>>> the patch below should close the regressions introduced by the commit.
>>>>
>>> Why do you have to do it that complicated?
>>> Couldn't we just state that any external lock is being disconnected from
>>> queue_lock after blk_cleanup_queue()?
>>>
>>> Then something like this should suffice here:
>>
>>
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 90e1ffd..a4ac005 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -367,10 +367,8 @@ void blk_cleanup_queue(struct request_queue *q)
>>         queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
>>         mutex_unlock(&q->sysfs_lock);
>>
>> -       if (q->elevator)
>> -               elevator_exit(q->elevator);
>> -
>> -       blk_throtl_exit(q);
>> +       if (q->queue_lock != q->__queue_lock)
>> +               q->queue_lock = q->__queue_lock;
> 
> That should be &q->__queue_lock.
> 
Why, but of course.
It's been fixed with the official patch
(cf block: Free queue resources at blk_release_queue())

Cheers,

Hannes
-- 
Dr. Hannes Reinecke              zSeries & Storage
hare@suse.de                  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <BAY151-W13DDCCEFEB7B68EE506214A10C0@phx.gbl>
@ 2011-09-23 15:18 ` Alan Stern
  0 siblings, 0 replies; 10+ messages in thread
From: Alan Stern @ 2011-09-23 15:18 UTC (permalink / raw)
  To: Rocko Requin
  Cc: hare, j-nomura, ben, jaxboe, james.bottomley, tytso,
	linux-kernel, linux-scsi

On Thu, 22 Sep 2011, Rocko Requin wrote:

> > Rocko:
> > 
> > Can you try testing this patch instead of all the patches I sent to 
> > you (but keep Ted's patch)?
> > 
> > Alan Stern
> > 
> 
> The simpler patch (in conjunction with Ted's patch) does stop the
> crashes. I get the same results as before: no kernel crashes
> (marvellous!), but the script's attempt to umount fails. I can then
> manually umount afterwards.

That sounds like a problem in the ext4 unmount implementation.  Ted
should be able to help track it down.

What happens if you change your script to try two unmounts in a row?  
In theory, the second should work like your manual unmount.

> Are these patches likely to be backported to the 3.0 kernel?

Yes, I should think so.  The ext4/ext3 patches may be ported even
farther back.

Alan Stern


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <BAY151-W234D9A977DF076A732C2AAA1080@phx.gbl>
@ 2011-09-18 14:43 ` Alan Stern
  0 siblings, 0 replies; 10+ messages in thread
From: Alan Stern @ 2011-09-18 14:43 UTC (permalink / raw)
  To: Rocko Requin; +Cc: tytso, linux-kernel

On Sun, 18 Sep 2011, Rocko Requin wrote:

> That patch worked, thanks - no more kernel crashes!

That's good news.

> I did see some slightly strange behaviour, though: the umounts issued
> by the script were not working correctly, so the drive was mounting
> successively on /dev/sdb1, /dev/sdc1, /dev/sdd1, etc again. After
> stopping the script, I was able to umount the various mountpoints
> manually, except for one (not the last one in the list, so not the
> most recently mounted one) which reported that the device was busy.
> lsof showed process jbd2/sdj1 three times, once with FD=cwd and
> TYPE=DIR, another with FD=rtd and TYPE=DIR, and lastly with FD=txt
> and TYPE=unknown.

Ted may have some ideas on how to find out what part of the unmount
commands is failing.

Alan Stern


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2011-09-23 15:18 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <BAY151-W32DCB4BAFEC97DD4913A12A1090@phx.gbl>
2011-09-17 17:34 ` [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed Alan Stern
2011-09-18 23:00   ` Ben Hutchings
2011-09-20  7:32     ` Jun'ichi Nomura
2011-09-22 12:26       ` Hannes Reinecke
2011-09-22 12:35         ` James Bottomley
2011-09-22 15:16         ` Alan Stern
2011-09-22 16:20           ` Thadeu Lima de Souza Cascardo
2011-09-22 16:32             ` Hannes Reinecke
     [not found] <BAY151-W234D9A977DF076A732C2AAA1080@phx.gbl>
2011-09-18 14:43 ` Alan Stern
     [not found] <BAY151-W13DDCCEFEB7B68EE506214A10C0@phx.gbl>
2011-09-23 15:18 ` Alan Stern

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).