RE: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed

* RE: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <BAY151-W13DDCCEFEB7B68EE506214A10C0@phx.gbl>
@ 2011-09-23 15:18   ` Alan Stern
  0 siblings, 0 replies; 53+ messages in thread
From: Alan Stern @ 2011-09-23 15:18 UTC (permalink / raw)
  To: Rocko Requin
  Cc: hare, j-nomura, ben, jaxboe, james.bottomley, tytso,
	linux-kernel, linux-scsi

On Thu, 22 Sep 2011, Rocko Requin wrote:

> > Rocko:
> > 
> > Can you try testing this patch instead of all the patches I sent to 
> > you (but keep Ted's patch)?
> > 
> > Alan Stern
> > 
> 
> The simpler patch (in conjunction with Ted's patch) does stop the
> crashes. I get the same results as before: no kernel crashes
> (marvellous!), but the script's attempt to umount fails. I can then
> manually umount afterwards.

That sounds like a problem in the ext4 unmount implementation.  Ted
should be able to help track it down.

What happens if you change your script to try two unmounts in a row?  
In theory, the second should work like your manual unmount.

> Are these patches likely to be backported to the 3.0 kernel?

Yes, I should think so.  The ext4/ext3 patches may be ported even
farther back.

Alan Stern


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (30 preceding siblings ...)
  2011-09-05 17:44 ` bugzilla-daemon
@ 2012-07-02 13:24 ` bugzilla-daemon
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2012-07-02 13:24 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832


Alan <alan@lxorguk.ukuu.org.uk> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |alan@lxorguk.ukuu.org.uk
         Resolution|                            |CODE_FIX





^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-22 16:20           ` Thadeu Lima de Souza Cascardo
@ 2011-09-22 16:32               ` Hannes Reinecke
  0 siblings, 0 replies; 53+ messages in thread
From: Hannes Reinecke @ 2011-09-22 16:32 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo
  Cc: Alan Stern, Rocko Requin, Jun'ichi Nomura, Ben Hutchings,
	jaxboe, James Bottomley, tytso, Kernel development list,
	linux-scsi

On 09/22/2011 06:20 PM, Thadeu Lima de Souza Cascardo wrote:
> On Thu, Sep 22, 2011 at 11:16:30AM -0400, Alan Stern wrote:
>> Rocko:
>>
>> Can you try testing this patch instead of all the patches I sent to 
>> you (but keep Ted's patch)?
>>
>> Alan Stern
>>
>> On Thu, 22 Sep 2011, Hannes Reinecke wrote:
>>
>>> On 09/20/2011 09:32 AM, Jun'ichi Nomura wrote:
>>>> On 09/19/11 08:00, Ben Hutchings wrote:
>>> [ .. ]
>>>>>
>>>>> There have been reports of this in Debian going back to 2.6.39:
>>>>>
>>>>> http://bugs.debian.org/631187
>>>>> http://bugs.debian.org/636263
>>>>> http://bugs.debian.org/642043
>>>>>
>>>>> Plus possibly related crashes in elv_put_request after CD-ROM removal:
>>>>>
>>>>> http://bugs.debian.org/633890
>>>>> http://bugs.debian.org/634681
>>>>> http://bugs.debian.org/636103
>>>>>
>>>>> The former was also reported in Ubuntu since their 2.6.38-10:
>>>>>
>>>>> https://bugs.launchpad.net/debian/+source/linux-2.6/+bug/793796
>>>>>
>>>>> The result of the discussion there was that it appeared to be a
>>>>> regression due to commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b 
>>>>> ("[SCSI] put stricter guards on queue dead checks") which was also
>>>>> included in a stable update for 2.6.38.
>>>>>
>>>>> There was also a report on bugzilla.kernel.org, though no-one can see
>>>>> quite what that says now:
>>>>>
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=38842
>>>>>
>>>>> I also reported most of the above to James Bottomley and linux-scsi
>>>>> nearly 2 months ago, to no response.
>>>>
>>>> I've reported a similar oops related to the above commit:
>>>>   [BUG] Oops when SCSI device under multipath is removed
>>>>   https://lkml.org/lkml/2011/8/10/11
>>>>
>>>> Elevator being removed is the core of the problem.
>>>> And the essential issue seems 2 different models of queue/driver relation
>>>> implied by queue_lock.
>>>>
>>>> If reverting the commit is not an option,
>>>> until somebody comes up to fix the essential issue,
>>>> the patch below should close the regressions introduced by the commit.
>>>>
>>> Why do you have to do it that complicated?
>>> Couldn't we just state that any external lock is being disconnected from
>>> queue_lock after blk_cleanup_queue()?
>>>
>>> Then something like this should suffice here:
>>
>>
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 90e1ffd..a4ac005 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -367,10 +367,8 @@ void blk_cleanup_queue(struct request_queue *q)
>>         queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
>>         mutex_unlock(&q->sysfs_lock);
>>
>> -       if (q->elevator)
>> -               elevator_exit(q->elevator);
>> -
>> -       blk_throtl_exit(q);
>> +       if (q->queue_lock != q->__queue_lock)
>> +               q->queue_lock = q->__queue_lock;
> 
> That should be &q->__queue_lock.
> 
Why, but of course.
It's been fixed in the official patch
(cf. "block: Free queue resources at blk_release_queue()").

Cheers,

Hannes
-- 
Dr. Hannes Reinecke              zSeries & Storage
hare@suse.de                  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-22 15:16           ` Alan Stern
  (?)
@ 2011-09-22 16:20           ` Thadeu Lima de Souza Cascardo
  2011-09-22 16:32               ` Hannes Reinecke
  -1 siblings, 1 reply; 53+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2011-09-22 16:20 UTC (permalink / raw)
  To: Alan Stern
  Cc: Rocko Requin, Hannes Reinecke, Jun'ichi Nomura,
	Ben Hutchings, jaxboe, James Bottomley, tytso,
	Kernel development list, linux-scsi

On Thu, Sep 22, 2011 at 11:16:30AM -0400, Alan Stern wrote:
> Rocko:
> 
> Can you try testing this patch instead of all the patches I sent to 
> you (but keep Ted's patch)?
> 
> Alan Stern
> 
> On Thu, 22 Sep 2011, Hannes Reinecke wrote:
> 
> > On 09/20/2011 09:32 AM, Jun'ichi Nomura wrote:
> > > On 09/19/11 08:00, Ben Hutchings wrote:
> > [ .. ]
> > >>
> > >> There have been reports of this in Debian going back to 2.6.39:
> > >>
> > >> http://bugs.debian.org/631187
> > >> http://bugs.debian.org/636263
> > >> http://bugs.debian.org/642043
> > >>
> > >> Plus possibly related crashes in elv_put_request after CD-ROM removal:
> > >>
> > >> http://bugs.debian.org/633890
> > >> http://bugs.debian.org/634681
> > >> http://bugs.debian.org/636103
> > >>
> > >> The former was also reported in Ubuntu since their 2.6.38-10:
> > >>
> > >> https://bugs.launchpad.net/debian/+source/linux-2.6/+bug/793796
> > >>
> > >> The result of the discussion there was that it appeared to be a
> > >> regression due to commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b 
> > >> ("[SCSI] put stricter guards on queue dead checks") which was also
> > >> included in a stable update for 2.6.38.
> > >>
> > >> There was also a report on bugzilla.kernel.org, though no-one can see
> > >> quite what that says now:
> > >>
> > >> https://bugzilla.kernel.org/show_bug.cgi?id=38842
> > >>
> > >> I also reported most of the above to James Bottomley and linux-scsi
> > >> nearly 2 months ago, to no response.
> > > 
> > > I've reported a similar oops related to the above commit:
> > >   [BUG] Oops when SCSI device under multipath is removed
> > >   https://lkml.org/lkml/2011/8/10/11
> > > 
> > > Elevator being removed is the core of the problem.
> > > And the essential issue seems 2 different models of queue/driver relation
> > > implied by queue_lock.
> > > 
> > > If reverting the commit is not an option,
> > > until somebody comes up to fix the essential issue,
> > > the patch below should close the regressions introduced by the commit.
> > > 
> > Why do you have to do it that complicated?
> > Couldn't we just state that any external lock is being disconnected from
> > queue_lock after blk_cleanup_queue()?
> > 
> > Then something like this should suffice here:
> 
> 
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 90e1ffd..a4ac005 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -367,10 +367,8 @@ void blk_cleanup_queue(struct request_queue *q)
>         queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
>         mutex_unlock(&q->sysfs_lock);
> 
> -       if (q->elevator)
> -               elevator_exit(q->elevator);
> -
> -       blk_throtl_exit(q);
> +       if (q->queue_lock != q->__queue_lock)
> +               q->queue_lock = q->__queue_lock;

That should be &q->__queue_lock.

Regards,
Cascardo.
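
The correction above can be illustrated with a small user-space model. This
is only a sketch: it assumes, as the correction implies, that __queue_lock is
a lock embedded in the queue and queue_lock is a pointer to whichever lock is
in use. The spinlock_t stand-in and the two helper functions are hypothetical
and are not kernel code. As posted, the comparison mixes a pointer with a
struct, which a C compiler rejects; taking the address makes both sides
pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's spinlock_t; only the types matter here. */
typedef struct { int locked; } spinlock_t;

/* Reduced user-space model of struct request_queue (two fields only). */
struct request_queue {
	spinlock_t __queue_lock;	/* lock embedded in the queue itself */
	spinlock_t *queue_lock;		/* lock in use; may be a driver's lock */
};

/* Hypothetical helper: the corrected check compares and assigns addresses. */
static void detach_external_lock(struct request_queue *q)
{
	assert(q != NULL);
	if (q->queue_lock != &q->__queue_lock)
		q->queue_lock = &q->__queue_lock;
}

/* Hypothetical helper: true when the queue uses its own embedded lock. */
static bool uses_internal_lock(const struct request_queue *q)
{
	return q->queue_lock == &q->__queue_lock;
}
```

After detach_external_lock() the queue no longer refers to memory owned by
the driver, which seems to be the point of the proposed hunk.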

> 
>         blk_put_queue(q);
>  }
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 0ee17b5..a5a756b 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -477,6 +477,11 @@ static void blk_release_queue(struct kobject *kobj)
> 
>         blk_sync_queue(q);
> 
> +       if (q->elevator)
> +               elevator_exit(q->elevator);
> +
> +       blk_throtl_exit(q);
> +
>         if (rl->rq_pool)
>                 mempool_destroy(rl->rq_pool);
> 
> 
> > And yeah, I find it pretty annoying, too.
> > 
> > Cheers,
> > 
> > Hannes
> > 
> 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-22 12:26         ` Hannes Reinecke
@ 2011-09-22 15:16           ` Alan Stern
  -1 siblings, 0 replies; 53+ messages in thread
From: Alan Stern @ 2011-09-22 15:16 UTC (permalink / raw)
  To: Rocko Requin
  Cc: Hannes Reinecke, Jun'ichi Nomura, Ben Hutchings, jaxboe,
	James Bottomley, tytso, Kernel development list, linux-scsi

Rocko:

Can you try testing this patch instead of all the patches I sent to 
you (but keep Ted's patch)?

Alan Stern

On Thu, 22 Sep 2011, Hannes Reinecke wrote:

> On 09/20/2011 09:32 AM, Jun'ichi Nomura wrote:
> > On 09/19/11 08:00, Ben Hutchings wrote:
> [ .. ]
> >>
> >> There have been reports of this in Debian going back to 2.6.39:
> >>
> >> http://bugs.debian.org/631187
> >> http://bugs.debian.org/636263
> >> http://bugs.debian.org/642043
> >>
> >> Plus possibly related crashes in elv_put_request after CD-ROM removal:
> >>
> >> http://bugs.debian.org/633890
> >> http://bugs.debian.org/634681
> >> http://bugs.debian.org/636103
> >>
> >> The former was also reported in Ubuntu since their 2.6.38-10:
> >>
> >> https://bugs.launchpad.net/debian/+source/linux-2.6/+bug/793796
> >>
> >> The result of the discussion there was that it appeared to be a
> >> regression due to commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b 
> >> ("[SCSI] put stricter guards on queue dead checks") which was also
> >> included in a stable update for 2.6.38.
> >>
> >> There was also a report on bugzilla.kernel.org, though no-one can see
> >> quite what that says now:
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=38842
> >>
> >> I also reported most of the above to James Bottomley and linux-scsi
> >> nearly 2 months ago, to no response.
> > 
> > I've reported a similar oops related to the above commit:
> >   [BUG] Oops when SCSI device under multipath is removed
> >   https://lkml.org/lkml/2011/8/10/11
> > 
> > Elevator being removed is the core of the problem.
> > And the essential issue seems 2 different models of queue/driver relation
> > implied by queue_lock.
> > 
> > If reverting the commit is not an option,
> > until somebody comes up to fix the essential issue,
> > the patch below should close the regressions introduced by the commit.
> > 
> Why do you have to do it that complicated?
> Couldn't we just state that any external lock is being disconnected from
> queue_lock after blk_cleanup_queue()?
> 
> Then something like this should suffice here:



diff --git a/block/blk-core.c b/block/blk-core.c
index 90e1ffd..a4ac005 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -367,10 +367,8 @@ void blk_cleanup_queue(struct request_queue *q)
        queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
        mutex_unlock(&q->sysfs_lock);

-       if (q->elevator)
-               elevator_exit(q->elevator);
-
-       blk_throtl_exit(q);
+       if (q->queue_lock != q->__queue_lock)
+               q->queue_lock = q->__queue_lock;

        blk_put_queue(q);
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0ee17b5..a5a756b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -477,6 +477,11 @@ static void blk_release_queue(struct kobject *kobj)

        blk_sync_queue(q);

+       if (q->elevator)
+               elevator_exit(q->elevator);
+
+       blk_throtl_exit(q);
+
        if (rl->rq_pool)
                mempool_destroy(rl->rq_pool);
 

> And yeah, I find it pretty annoying, too.
> 
> Cheers,
> 
> Hannes
> 


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-22 12:26         ` Hannes Reinecke
  (?)
@ 2011-09-22 12:35         ` James Bottomley
  -1 siblings, 0 replies; 53+ messages in thread
From: James Bottomley @ 2011-09-22 12:35 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jun'ichi Nomura, Ben Hutchings, jaxboe, Alan Stern,
	Rocko Requin, tytso, Kernel development list, linux-scsi

On Thu, 2011-09-22 at 14:26 +0200, Hannes Reinecke wrote:
> On 09/20/2011 09:32 AM, Jun'ichi Nomura wrote:
> > On 09/19/11 08:00, Ben Hutchings wrote:
> [ .. ]
> >>
> >> There have been reports of this in Debian going back to 2.6.39:
> >>
> >> http://bugs.debian.org/631187
> >> http://bugs.debian.org/636263
> >> http://bugs.debian.org/642043
> >>
> >> Plus possibly related crashes in elv_put_request after CD-ROM removal:
> >>
> >> http://bugs.debian.org/633890
> >> http://bugs.debian.org/634681
> >> http://bugs.debian.org/636103
> >>
> >> The former was also reported in Ubuntu since their 2.6.38-10:
> >>
> >> https://bugs.launchpad.net/debian/+source/linux-2.6/+bug/793796
> >>
> >> The result of the discussion there was that it appeared to be a
> >> regression due to commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b 
> >> ("[SCSI] put stricter guards on queue dead checks") which was also
> >> included in a stable update for 2.6.38.
> >>
> >> There was also a report on bugzilla.kernel.org, though no-one can see
> >> quite what that says now:
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=38842
> >>
> >> I also reported most of the above to James Bottomley and linux-scsi
> >> nearly 2 months ago, to no response.
> > 
> > I've reported a similar oops related to the above commit:
> >   [BUG] Oops when SCSI device under multipath is removed
> >   https://lkml.org/lkml/2011/8/10/11
> > 
> > Elevator being removed is the core of the problem.
> > And the essential issue seems 2 different models of queue/driver relation
> > implied by queue_lock.
> > 
> > If reverting the commit is not an option,
> > until somebody comes up to fix the essential issue,
> > the patch below should close the regressions introduced by the commit.
> > 
> Why do you have to do it that complicated?
> Couldn't we just state that any external lock is being disconnected from
> queue_lock after blk_cleanup_queue()?
> 
> Then something like this should suffice here:
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 90e1ffd..a4ac005 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -367,10 +367,8 @@ void blk_cleanup_queue(struct request_queue *q)
>         queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
>         mutex_unlock(&q->sysfs_lock);
> 
> -       if (q->elevator)
> -               elevator_exit(q->elevator);
> -
> -       blk_throtl_exit(q);
> +       if (q->queue_lock != q->__queue_lock)
> +               q->queue_lock = q->__queue_lock;
> 
>         blk_put_queue(q);
>  }
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 0ee17b5..a5a756b 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -477,6 +477,11 @@ static void blk_release_queue(struct kobject *kobj)
> 
>         blk_sync_queue(q);
> 
> +       if (q->elevator)
> +               elevator_exit(q->elevator);
> +
> +       blk_throtl_exit(q);
> +

OK, I'll buy this one (when you fix the whitespace issue ... you have
spaces instead of tabs).

The fact that the lock check/replacement doesn't actually need any
locking is probably worthy of a comment.

James



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-20  7:32     ` Jun'ichi Nomura
@ 2011-09-22 12:26         ` Hannes Reinecke
  0 siblings, 0 replies; 53+ messages in thread
From: Hannes Reinecke @ 2011-09-22 12:26 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Ben Hutchings, jaxboe, Alan Stern, James Bottomley, Rocko Requin,
	tytso, Kernel development list, linux-scsi

On 09/20/2011 09:32 AM, Jun'ichi Nomura wrote:
> On 09/19/11 08:00, Ben Hutchings wrote:
[ .. ]
>>
>> There have been reports of this in Debian going back to 2.6.39:
>>
>> http://bugs.debian.org/631187
>> http://bugs.debian.org/636263
>> http://bugs.debian.org/642043
>>
>> Plus possibly related crashes in elv_put_request after CD-ROM removal:
>>
>> http://bugs.debian.org/633890
>> http://bugs.debian.org/634681
>> http://bugs.debian.org/636103
>>
>> The former was also reported in Ubuntu since their 2.6.38-10:
>>
>> https://bugs.launchpad.net/debian/+source/linux-2.6/+bug/793796
>>
>> The result of the discussion there was that it appeared to be a
>> regression due to commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b 
>> ("[SCSI] put stricter guards on queue dead checks") which was also
>> included in a stable update for 2.6.38.
>>
>> There was also a report on bugzilla.kernel.org, though no-one can see
>> quite what that says now:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=38842
>>
>> I also reported most of the above to James Bottomley and linux-scsi
>> nearly 2 months ago, to no response.
> 
> I've reported a similar oops related to the above commit:
>   [BUG] Oops when SCSI device under multipath is removed
>   https://lkml.org/lkml/2011/8/10/11
> 
> Elevator being removed is the core of the problem.
> And the essential issue seems 2 different models of queue/driver relation
> implied by queue_lock.
> 
> If reverting the commit is not an option,
> until somebody comes up to fix the essential issue,
> the patch below should close the regressions introduced by the commit.
> 
Why does it have to be that complicated?
Couldn't we just state that any external lock is disconnected from
queue_lock after blk_cleanup_queue()?

Then something like this should suffice here:

diff --git a/block/blk-core.c b/block/blk-core.c
index 90e1ffd..a4ac005 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -367,10 +367,8 @@ void blk_cleanup_queue(struct request_queue *q)
        queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
        mutex_unlock(&q->sysfs_lock);

-       if (q->elevator)
-               elevator_exit(q->elevator);
-
-       blk_throtl_exit(q);
+       if (q->queue_lock != q->__queue_lock)
+               q->queue_lock = q->__queue_lock;

        blk_put_queue(q);
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0ee17b5..a5a756b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -477,6 +477,11 @@ static void blk_release_queue(struct kobject *kobj)

        blk_sync_queue(q);

+       if (q->elevator)
+               elevator_exit(q->elevator);
+
+       blk_throtl_exit(q);
+
        if (rl->rq_pool)
                mempool_destroy(rl->rq_pool);
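
The idea behind the patch above, deferring the elevator teardown from the
cleanup path to the release path that runs when the last reference is
dropped, can be sketched in user space. Everything here is a simplified
model under stated assumptions: struct queue, struct elevator and the
queue_*() helpers are hypothetical names, not the block layer's API, and the
refcount is a plain int with no locking.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the elevator and the request queue. */
struct elevator { int pending; };

struct queue {
	int refcount;			/* holders of the queue */
	int dead;			/* set once the device is gone */
	struct elevator *elevator;
};

/* Set up a queue with one reference, owned by the driver. */
static int queue_init(struct queue *q)
{
	q->elevator = calloc(1, sizeof(*q->elevator));
	if (!q->elevator)
		return -1;
	q->refcount = 1;
	q->dead = 0;
	return 0;
}

/* Release path: runs only when the last reference is dropped. */
static void queue_release(struct queue *q)
{
	free(q->elevator);
	q->elevator = NULL;
}

static void queue_get(struct queue *q)
{
	q->refcount++;
}

static void queue_put(struct queue *q)
{
	assert(q->refcount > 0);
	if (--q->refcount == 0)
		queue_release(q);
}

/* Cleanup path: mark the queue dead and drop the driver's reference,
 * but leave the elevator alone for any remaining holder. */
static void queue_cleanup(struct queue *q)
{
	q->dead = 1;
	queue_put(q);
}
```

In this model a holder that still has a reference after queue_cleanup(),
for example a mounted filesystem, can keep touching q->elevator until its
own queue_put(), so the elevator is never freed under an active user.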


And yeah, I find it pretty annoying, too.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke              zSeries & Storage
hare@suse.de                  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
@ 2011-09-22 12:26         ` Hannes Reinecke
  0 siblings, 0 replies; 53+ messages in thread
From: Hannes Reinecke @ 2011-09-22 12:26 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Ben Hutchings, jaxboe, Alan Stern, James Bottomley, Rocko Requin,
	tytso, Kernel development list, linux-scsi

On 09/20/2011 09:32 AM, Jun'ichi Nomura wrote:
> On 09/19/11 08:00, Ben Hutchings wrote:
[ .. ]
>>
>> There have been reports of this in Debian going back to 2.6.39:
>>
>> http://bugs.debian.org/631187
>> http://bugs.debian.org/636263
>> http://bugs.debian.org/642043
>>
>> Plus possibly related crashes in elv_put_request after CD-ROM removal:
>>
>> http://bugs.debian.org/633890
>> http://bugs.debian.org/634681
>> http://bugs.debian.org/636103
>>
>> The former was also reported in Ubuntu since their 2.6.38-10:
>>
>> https://bugs.launchpad.net/debian/+source/linux-2.6/+bug/793796
>>
>> The result of the discussion there was that it appeared to be a
>> regression due to commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b 
>> ("[SCSI] put stricter guards on queue dead checks") which was also
>> included in a stable update for 2.6.38.
>>
>> There was also a report on bugzilla.kernel.org, though no-one can see
>> quite what that says now:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=38842
>>
>> I also reported most of the above to James Bottomley and linux-scsi
>> nearly 2 months ago, to no response.
> 
> I've reported a similar oops related to the above commit:
>   [BUG] Oops when SCSI device under multipath is removed
>   https://lkml.org/lkml/2011/8/10/11
> 
> Elevator being removed is the core of the problem.
> And the essential issue seems 2 different models of queue/driver relation
> implied by queue_lock.
> 
> If reverting the commit is not an option,
> until somebody comes up to fix the essential issue,
> the patch below should close the regressions introduced by the commit.
> 
Why do you have to do it that complicated?
Couldn't we just state that any external lock is being disconnected from
queue_lock after blk_cleanup_queue()?

Then something like this should suffice here:

diff --git a/block/blk-core.c b/block/blk-core.c
index 90e1ffd..a4ac005 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -367,10 +367,8 @@ void blk_cleanup_queue(struct request_queue *q)
        queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
        mutex_unlock(&q->sysfs_lock);

-       if (q->elevator)
-               elevator_exit(q->elevator);
-
-       blk_throtl_exit(q);
+       if (q->queue_lock != &q->__queue_lock)
+               q->queue_lock = &q->__queue_lock;

        blk_put_queue(q);
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0ee17b5..a5a756b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -477,6 +477,11 @@ static void blk_release_queue(struct kobject *kobj)

        blk_sync_queue(q);

+       if (q->elevator)
+               elevator_exit(q->elevator);
+
+       blk_throtl_exit(q);
+
        if (rl->rq_pool)
                mempool_destroy(rl->rq_pool);


And yeah, I find it pretty annoying, too.
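
As a rough userspace illustration of the idea in the diff above: once
cleanup has run, the queue stops pointing at the driver's lock and falls
back to its own embedded one, so later teardown never touches memory the
driver may already have freed. The types and the helper name here are
stand-ins invented for this sketch, not kernel definitions, and the sketch
assumes nobody still holds the old lock when the pointer is switched.

```c
/* Stand-in types for illustration only; not the kernel's definitions. */
typedef struct { int dummy; } sketch_lock_t;

struct sketch_queue {
	sketch_lock_t  __queue_lock;	/* lock embedded in the queue */
	sketch_lock_t *queue_lock;	/* lock in use: embedded or driver-supplied */
};

/*
 * Hypothetical helper: drop any driver-supplied lock and fall back to the
 * embedded one. The embedded lock is a struct member, so its address is
 * taken with '&'. Calling this twice is harmless.
 */
void sketch_detach_external_lock(struct sketch_queue *q)
{
	if (q->queue_lock != &q->__queue_lock)
		q->queue_lock = &q->__queue_lock;
}
```

After this call the queue's lock lives exactly as long as the queue does,
which is what would allow the elevator teardown to move to release time.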

Cheers,

Hannes
-- 
Dr. Hannes Reinecke              zSeries & Storage
hare@suse.de                  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)


* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-18 23:00   ` Ben Hutchings
@ 2011-09-20  7:32     ` Jun'ichi Nomura
  2011-09-22 12:26         ` Hannes Reinecke
  0 siblings, 1 reply; 53+ messages in thread
From: Jun'ichi Nomura @ 2011-09-20  7:32 UTC (permalink / raw)
  To: Ben Hutchings, jaxboe
  Cc: Alan Stern, James Bottomley, Rocko Requin, tytso,
	Kernel development list, linux-scsi

On 09/19/11 08:00, Ben Hutchings wrote:
> On Sat, 2011-09-17 at 13:34 -0400, Alan Stern wrote:
>> On Sat, 17 Sep 2011, Rocko Requin wrote:
>>
>>>> Why were you using gnome-terminal?  You should be running the tests at
>>>> a console VT, not under X at all.  Ctrl-Alt-F2 or the equivalent...
>>>
>>> Because with Ted's patch it doesn't crash when run from a console VT, even with an X server running.
>>
>> That's weird.  Maybe the screen updates change some timing.
>>
>>>> Here's another patch to address the new problem.  You can apply it on 
>>>> top of all the other patches.
>>>
>>> Attached is the crash log I get with the latest patch applied.
>>
>> Okay, more fallout from the same problem.  Here's an updated version of 
>> the previous patch.
> [...]
> 
> There have been reports of this in Debian going back to 2.6.39:
> 
> http://bugs.debian.org/631187
> http://bugs.debian.org/636263
> http://bugs.debian.org/642043
> 
> Plus possibly related crashes in elv_put_request after CD-ROM removal:
> 
> http://bugs.debian.org/633890
> http://bugs.debian.org/634681
> http://bugs.debian.org/636103
> 
> The former was also reported in Ubuntu since their 2.6.38-10:
> 
> https://bugs.launchpad.net/debian/+source/linux-2.6/+bug/793796
> 
> The result of the discussion there was that it appeared to be a
> regression due to commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b 
> ("[SCSI] put stricter guards on queue dead checks") which was also
> included in a stable update for 2.6.38.
> 
> There was also a report on bugzilla.kernel.org, though no-one can see
> quite what that says now:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=38842
> 
> I also reported most of the above to James Bottomley and linux-scsi
> nearly 2 months ago, to no response.

I've reported a similar oops related to the above commit:
  [BUG] Oops when SCSI device under multipath is removed
  https://lkml.org/lkml/2011/8/10/11

The elevator being removed is the core of the problem.
The essential issue seems to be that queue_lock implies 2 different
models of the queue/driver relationship.

If reverting the commit is not an option, then until somebody comes up
with a fix for the essential issue, the patch below should close the
regressions introduced by the commit.

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation


This patch moves elevator_exit() and blk_throtl_exit() from
blk_cleanup_queue() to blk_release_queue() when the queue uses its
embedded lock.

elevator_exit() and blk_throtl_exit() were called in blk_cleanup_queue()
because they use queue_lock.

There are 2 types of queue_locks:
  a) supplied by driver (via blk_init_queue)
  b) embedded in struct request_queue (__queue_lock)

When queue_lock is supplied by driver, there is no guarantee that
the pointer is valid after blk_cleanup_queue(), so they have to be
called in blk_cleanup_queue().
In this case, the driver has to make sure nobody is using the queue
before calling blk_cleanup_queue().

On the other hand, if queue_lock is the '__queue_lock' embedded in
request_queue, blk_release_queue() is a better place for freeing the
structures, because the block layer knows for sure that no references
remain.
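
The decision described above can be sketched in userspace roughly as
follows. This is only a model under stated assumptions: the types, the
enum, and both helper names are made up for illustration and do not exist
in the kernel.

```c
#include <stdbool.h>

/* Stand-in types for illustration only; not the kernel's definitions. */
typedef struct { int dummy; } mock_lock_t;

struct mock_queue {
	mock_lock_t  __queue_lock;	/* case b): embedded in the queue */
	mock_lock_t *queue_lock;	/* points at a) or b) */
};

enum mock_teardown { MOCK_AT_CLEANUP, MOCK_AT_RELEASE };

/* Hypothetical helper: true when the driver supplied its own lock. */
bool mock_lock_is_external(const struct mock_queue *q)
{
	return q->queue_lock != &q->__queue_lock;
}

/*
 * Hypothetical helper: pick where the lock-dependent components are freed.
 * A driver-supplied lock may vanish after cleanup, so free early; the
 * embedded lock stays valid until the last reference is dropped.
 */
enum mock_teardown mock_choose_teardown(const struct mock_queue *q)
{
	return mock_lock_is_external(q) ? MOCK_AT_CLEANUP : MOCK_AT_RELEASE;
}
```

Exactly one of the two call sites runs the teardown for any given queue,
which is the property the two complementary tests in the patch rely on.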

This patch is ugly but should fix various oopses introduced by this change:
  86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b
  [SCSI] put stricter guards on queue dead checks

For example:
  https://lkml.org/lkml/2011/8/10/11

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>

Index: linux-3.1-rc4/block/blk-core.c
===================================================================
--- linux-3.1-rc4.orig/block/blk-core.c	2011-08-29 13:16:01.000000000 +0900
+++ linux-3.1-rc4/block/blk-core.c	2011-09-20 15:53:23.496814819 +0900
@@ -352,6 +352,14 @@
  * unexpectedly as some queue cleanup components like elevator_exit() and
  * blk_throtl_exit() need queue lock.
  */
+void blk_release_queue_components_with_queuelock(struct request_queue *q)
+{
+	if (q->elevator)
+		elevator_exit(q->elevator);
+
+	blk_throtl_exit(q);
+}
+
 void blk_cleanup_queue(struct request_queue *q)
 {
 	/*
@@ -367,10 +375,12 @@
 	queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
 	mutex_unlock(&q->sysfs_lock);
 
-	if (q->elevator)
-		elevator_exit(q->elevator);
-
-	blk_throtl_exit(q);
+	/*
+	 * A driver supplied the queue lock.
+	 * Cleanup components while the queue lock is valid.
+	 */
+	if (q->queue_lock != &q->__queue_lock)
+		blk_release_queue_components_with_queuelock(q);
 
 	blk_put_queue(q);
 }
Index: linux-3.1-rc4/block/blk-sysfs.c
===================================================================
--- linux-3.1-rc4.orig/block/blk-sysfs.c	2011-09-19 09:38:51.000000000 +0900
+++ linux-3.1-rc4/block/blk-sysfs.c	2011-09-20 15:57:50.358807023 +0900
@@ -477,6 +477,9 @@
 
 	blk_sync_queue(q);
 
+	if (q->queue_lock == &q->__queue_lock)
+		blk_release_queue_components_with_queuelock(q);
+
 	if (rl->rq_pool)
 		mempool_destroy(rl->rq_pool);
 
Index: linux-3.1-rc4/block/blk.h
===================================================================
--- linux-3.1-rc4.orig/block/blk.h	2011-08-29 13:16:01.000000000 +0900
+++ linux-3.1-rc4/block/blk.h	2011-09-20 15:57:38.306807136 +0900
@@ -25,6 +25,9 @@
 void blk_add_timer(struct request *);
 void __generic_unplug_device(struct request_queue *);
 
+/* Wrapper to release functions to be called while queue_lock is valid */
+void blk_release_queue_components_with_queuelock(struct request_queue *q);
+
 /*
  * Internal atomic flags for request handling
  */


* RE: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-17 17:34 ` Alan Stern
@ 2011-09-18 23:00   ` Ben Hutchings
  2011-09-20  7:32     ` Jun'ichi Nomura
  0 siblings, 1 reply; 53+ messages in thread
From: Ben Hutchings @ 2011-09-18 23:00 UTC (permalink / raw)
  To: Alan Stern, James Bottomley
  Cc: Rocko Requin, tytso, Kernel development list, linux-scsi


On Sat, 2011-09-17 at 13:34 -0400, Alan Stern wrote:
> On Sat, 17 Sep 2011, Rocko Requin wrote:
> 
> > > Why were you using gnome-terminal?  You should be running the tests at
> > > a console VT, not under X at all.  Ctrl-Alt-F2 or the equivalent...
> > 
> > Because with Ted's patch it doesn't crash when run from a console VT, even with an X server running.
> 
> That's weird.  Maybe the screen updates change some timing.
> 
> > > Here's another patch to address the new problem.  You can apply it on 
> > > top of all the other patches.
> > 
> > Attached is the crash log I get with the latest patch applied.
> 
> Okay, more fallout from the same problem.  Here's an updated version of 
> the previous patch.
[...]

There have been reports of this in Debian going back to 2.6.39:

http://bugs.debian.org/631187
http://bugs.debian.org/636263
http://bugs.debian.org/642043

Plus possibly related crashes in elv_put_request after CD-ROM removal:

http://bugs.debian.org/633890
http://bugs.debian.org/634681
http://bugs.debian.org/636103

The former was also reported in Ubuntu since their 2.6.38-10:

https://bugs.launchpad.net/debian/+source/linux-2.6/+bug/793796

The result of the discussion there was that it appeared to be a
regression due to commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b 
("[SCSI] put stricter guards on queue dead checks") which was also
included in a stable update for 2.6.38.

There was also a report on bugzilla.kernel.org, though no-one can see
quite what that says now:

https://bugzilla.kernel.org/show_bug.cgi?id=38842

I also reported most of the above to James Bottomley and linux-scsi
nearly 2 months ago, to no response.

Ben.

-- 
Ben Hutchings
Power corrupts.  Absolute power is kind of neat.
                           - John Lehman, Secretary of the US Navy 1981-1987



* RE: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <BAY151-W234D9A977DF076A732C2AAA1080@phx.gbl>
@ 2011-09-18 14:43 ` Alan Stern
  0 siblings, 0 replies; 53+ messages in thread
From: Alan Stern @ 2011-09-18 14:43 UTC (permalink / raw)
  To: Rocko Requin; +Cc: tytso, linux-kernel

On Sun, 18 Sep 2011, Rocko Requin wrote:

> That patch worked, thanks - no more kernel crashes!

That's good news.

> I did see some slightly strange behaviour, though: the umounts issued
> by the script were not working correctly, so the drive was mounting
> successively on /dev/sdb1, /dev/sdc1, /dev/sdd1, etc again. After
> stopping the script, I was able to umount the various mountpoints
> manually, except for one (not the last one in the list, so not the
> most recently mounted one) which reported that the device was busy.
> lsof showed process jbd2/sdj1 three times, once with FD=cwd and
> TYPE=DIR, another with FD=rtd and TYPE=DIR, and lastly with FD=txt
> and TYPE=unknown.

Ted may have some ideas on how to find out what part of the unmount
commands is failing.

Alan Stern



* RE: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <BAY151-W32DCB4BAFEC97DD4913A12A1090@phx.gbl>
@ 2011-09-17 17:34 ` Alan Stern
  2011-09-18 23:00   ` Ben Hutchings
  0 siblings, 1 reply; 53+ messages in thread
From: Alan Stern @ 2011-09-17 17:34 UTC (permalink / raw)
  To: Rocko Requin; +Cc: tytso, Kernel development list

On Sat, 17 Sep 2011, Rocko Requin wrote:

> > Why were you using gnome-terminal?  You should be running the tests at
> > a console VT, not under X at all.  Ctrl-Alt-F2 or the equivalent...
> 
> Because with Ted's patch it doesn't crash when run from a console VT, even with an X server running.

That's weird.  Maybe the screen updates change some timing.

> > Here's another patch to address the new problem.  You can apply it on 
> > top of all the other patches.
> 
> Attached is the crash log I get with the latest patch applied.

Okay, more fallout from the same problem.  Here's an updated version of 
the previous patch.

These are really just bandaid-type fixes.  The people who understand
the block layer ought to be involved.  If this keeps up much longer
I'll get in touch with them.
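
For readers skimming the diff, the guard pattern it uses can be modelled
in userspace roughly like this. Every name below is a toy stand-in
invented for the sketch, and like the patch it only covers callers that
read the pointer after it has been cleared; it does nothing about a caller
that fetched the pointer just before the free.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for the elevator structures; not kernel definitions. */
struct toy_elevator_ops {
	void (*put_req)(int *counter);	/* optional hook, may be NULL */
};

struct toy_elevator {
	const struct toy_elevator_ops *ops;
};

struct toy_queue {
	struct toy_elevator *elevator;	/* NULL once torn down */
};

/* Example hook: counts how many times it was invoked. */
void toy_count_put(int *counter)
{
	++*counter;
}

/* Free the elevator and clear the pointer so later callers can test it. */
void toy_elevator_exit(struct toy_queue *q)
{
	free(q->elevator);
	q->elevator = NULL;
}

/* Call the hook only if the elevator still exists and defines one. */
void toy_put_request(struct toy_queue *q, int *counter)
{
	struct toy_elevator *e = q->elevator;

	if (e && e->ops->put_req)
		e->ops->put_req(counter);
}
```
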

Alan Stern


Index: usb-3.1/block/blk-core.c
===================================================================
--- usb-3.1.orig/block/blk-core.c
+++ usb-3.1/block/blk-core.c
@@ -367,8 +367,10 @@ void blk_cleanup_queue(struct request_qu
 	queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
 	mutex_unlock(&q->sysfs_lock);
 
-	if (q->elevator)
+	if (q->elevator) {
 		elevator_exit(q->elevator);
+		q->elevator = NULL;
+	}
 
 	blk_throtl_exit(q);
 
Index: usb-3.1/block/elevator.c
===================================================================
--- usb-3.1.orig/block/elevator.c
+++ usb-3.1/block/elevator.c
@@ -769,7 +769,7 @@ void elv_put_request(struct request_queu
 {
 	struct elevator_queue *e = q->elevator;
 
-	if (e->ops->elevator_put_req_fn)
+	if (e && e->ops->elevator_put_req_fn)
 		e->ops->elevator_put_req_fn(rq);
 }
 
@@ -812,7 +812,7 @@ void elv_completed_request(struct reques
 	 */
 	if (blk_account_rq(rq)) {
 		q->in_flight[rq_is_sync(rq)]--;
-		if ((rq->cmd_flags & REQ_SORTED) &&
+		if ((rq->cmd_flags & REQ_SORTED) && e &&
 		    e->ops->elevator_completed_req_fn)
 			e->ops->elevator_completed_req_fn(q, rq);
 	}



* RE: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <BAY151-W1224E6C1A20D179965A149A1090@phx.gbl>
@ 2011-09-17 13:21 ` Alan Stern
  0 siblings, 0 replies; 53+ messages in thread
From: Alan Stern @ 2011-09-17 13:21 UTC (permalink / raw)
  To: Rocko Requin; +Cc: tytso, linux-ext4

Please don't include an entire 400-line message in your reply if you're
only going to add a few new lines of text.  Use some judicious trimming.

On Sat, 17 Sep 2011, Rocko Requin wrote:

> Here's a log of the latest kernel from git crashing with that patch
> applied (as well as Ted's patch), does it help any?

It does.  It shows a new problem that's completely unrelated to the 
earlier one.

> The gnome-terminal console cursor

Why were you using gnome-terminal?  You should be running the tests at
a console VT, not under X at all.  Ctrl-Alt-F2 or the equivalent...

>  stopped flashing after the last
> 'detaching wakeup' message for a while (it *seemed* to have locked up
> at this point) but then it came back after what looks like 17 seconds
> or so from the log (apport reported something else crashing at this
> point) and then the oops happened and it crashed for good.

Here's another patch to address the new problem.  You can apply it on 
top of all the other patches.

Alan Stern


Index: usb-3.1/block/blk-core.c
===================================================================
--- usb-3.1.orig/block/blk-core.c
+++ usb-3.1/block/blk-core.c
@@ -367,8 +367,10 @@ void blk_cleanup_queue(struct request_qu
 	queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
 	mutex_unlock(&q->sysfs_lock);
 
-	if (q->elevator)
+	if (q->elevator) {
 		elevator_exit(q->elevator);
+		q->elevator = NULL;
+	}
 
 	blk_throtl_exit(q);
 
Index: usb-3.1/block/elevator.c
===================================================================
--- usb-3.1.orig/block/elevator.c
+++ usb-3.1/block/elevator.c
@@ -812,7 +812,7 @@ void elv_completed_request(struct reques
 	 */
 	if (blk_account_rq(rq)) {
 		q->in_flight[rq_is_sync(rq)]--;
-		if ((rq->cmd_flags & REQ_SORTED) &&
+		if ((rq->cmd_flags & REQ_SORTED) && e &&
 		    e->ops->elevator_completed_req_fn)
 			e->ops->elevator_completed_req_fn(q, rq);
 	}



* RE: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <BAY151-W3498E8491E671BDAE90421A1070@phx.gbl>
@ 2011-09-16 16:28 ` Alan Stern
  0 siblings, 0 replies; 53+ messages in thread
From: Alan Stern @ 2011-09-16 16:28 UTC (permalink / raw)
  To: Rocko Requin; +Cc: tytso, linux-ext4

On Thu, 15 Sep 2011, Rocko Requin wrote:

> Unfortunately the lockup is complete - I can't switch away from the X
> server and sysrq-t/p doesn't work if I'm in a tty console when it
> happens. The stack traces are like the ones I posted earlier in the
> bug, and they didn't contain any useful information.

Try applying the patch below.  It will print out some extra debugging
information during normal operation and especially when the USB drive
is mounted and unmounted.  Oh yes -- and be certain to run the test 
from a tty console so that the messages don't get lost.  Maybe you can 
capture the log messages using a network console.

This may not give any useful information in the end, because it 
concentrates on the BDI interface which Ted's patch should have fixed.  
If something else is causing your crashes, you might not see anything.  
But it's worth a try.
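
The core check the patch performs on each timer list can be sketched in
userspace like this: walk the circular list and report the first node
whose next or prev link is NULL. The struct and function names are
stand-ins made up for this sketch, and, as with the loop in the patch, it
assumes any intact chain eventually leads back to the starting node.

```c
#include <stddef.h>

/* Stand-in for a circular doubly-linked list node; not the kernel's type. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

/*
 * Walk the list from 'start' and return the first node that has a NULL
 * next or prev pointer, or NULL if the walk returns to 'start' without
 * finding one.
 */
struct demo_list_head *demo_find_bad_node(struct demo_list_head *start)
{
	struct demo_list_head *h = start;

	do {
		if (!h->next || !h->prev)
			return h;
		h = h->next;
	} while (h != start);

	return NULL;
}
```

A node that was detached while its neighbours still point at it shows up
here, because detaching clears its own links but not theirs.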

Alan Stern



Index: usb-3.1/kernel/timer.c
===================================================================
--- usb-3.1.orig/kernel/timer.c
+++ usb-3.1/kernel/timer.c
@@ -111,6 +111,143 @@ timer_set_base(struct timer_list *timer,
 				      tbase_get_deferrable(timer->base));
 }
 
+static void check_timer_list(struct list_head *start, char *name)
+{
+	struct timer_list *t, *tnext, *tprev, *nt;
+	struct list_head *h = start;
+
+	nt = NULL;
+	do {
+		if (!h->next || !h->prev) {
+			nt = list_entry(h, struct timer_list, entry);
+			break;
+		}
+		h = h->next;
+	} while (h != start);
+	if (!nt)
+		return;
+	pr_err("%s: Found bad timer at %p\n", name, nt);
+
+	tnext = tprev = list_entry(start, struct timer_list, entry);
+	list_for_each_entry(t, start, entry) {
+		if (!t)
+			break;
+		pr_info(" Entry %p cb %pS list %p\n", t, t->function,
+				t->entry.prev);
+		if (t == nt)
+			break;
+		tprev = t;
+	}
+	pr_info(" -----\n");
+
+	tnext = list_entry(start, struct timer_list, entry);
+	list_for_each_entry_reverse(t, start, entry) {
+		if (!t) {
+			pr_info(" Broken link\n");
+			break;
+		}
+		if (t == nt)
+			break;
+		pr_info(" Entry %p cb %pS list %p\n", t, t->function,
+				t->entry.next);
+		tnext = t;
+	}
+	pr_info(" ----- Fixing\n");
+	nt->entry.prev = &tprev->entry;
+	tprev->entry.next = &nt->entry;
+	nt->entry.next = &tnext->entry;
+	tnext->entry.prev = &nt->entry;
+}
+
+struct timer_list *alantimer;
+int alanok;
+
+#include <linux/perf_event.h>
+#include <linux/hw_breakpoint.h>
+
+struct perf_event * __percpu *alanhbp;
+unsigned long alanunused;
+int alanhbp_enabled;
+struct list_head **alanptr;
+
+extern void *last_bdi_unreg;
+
+static void check_alan(char *type)
+{
+	if (!alanok)
+		return;
+	if (!alantimer->entry.next || !alantimer->entry.prev) {
+		pr_err("ERROR %s: Bad alantimer %p\n", type, alantimer);
+		alanok = 0;
+	}
+}
+
+static void alanhbp_handler(struct perf_event *bp,
+			       struct perf_sample_data *data,
+			       struct pt_regs *regs)
+{
+	pr_info("*alanptr written: %p\n", *alanptr);
+	if (!alanok || !alanhbp_enabled)
+		return;
+	if (alantimer->entry.next)
+		return;
+	dump_stack();
+}
+
+static void set_alan(struct timer_list *timer)
+{
+	if (alantimer)
+		return;
+	alantimer = timer;
+	alanok = 1;
+
+	if (alanhbp)
+		alanhbp_enabled = (alanptr == &alantimer->entry.next);
+}
+
+static void clear_alan(struct timer_list *timer)
+{
+	if (alantimer != timer)
+		return;
+	alanok = 0;
+	alantimer = NULL;
+	alanhbp_enabled = 0;
+}
+
+void init_alan(unsigned long addr)
+{
+	struct perf_event_attr attr;
+
+	if (alanhbp) {
+		unregister_wide_hw_breakpoint(alanhbp);
+		alanhbp = NULL;
+		alanhbp_enabled = 0;
+	}
+
+	if (addr) {
+		hw_breakpoint_init(&attr);
+		attr.bp_addr = addr;
+		attr.bp_len = HW_BREAKPOINT_LEN_4;
+		attr.bp_type = HW_BREAKPOINT_W;
+		alanhbp = register_wide_hw_breakpoint(&attr, alanhbp_handler,
+				NULL);
+		if (IS_ERR((void __force *) alanhbp)) {
+			pr_info("Breakpoint reg failed %ld\n",
+					PTR_ERR((void __force *) alanhbp));
+			alanhbp = NULL;
+		} else if (!alanhbp) {
+			pr_info("alanhbp was not created\n");
+		} else {
+			pr_info("alanhbp created\n");
+		}
+
+		alanptr = (struct list_head **) addr;
+		alanhbp_enabled = (alanok && alanptr == &alantimer->entry.next);
+		pr_info("alanhbp set for %p\n", alanptr);
+	}
+}
+EXPORT_SYMBOL(init_alan);
+
 static unsigned long round_jiffies_common(unsigned long j, int cpu,
 		bool force_up)
 {
@@ -330,6 +467,9 @@ void set_timer_slack(struct timer_list *
 }
 EXPORT_SYMBOL_GPL(set_timer_slack);
 
+extern void wakeup_timer_fn(unsigned long data);
+#include <linux/backing-dev.h>
+
 static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)
 {
 	unsigned long expires = timer->expires;
@@ -369,7 +509,17 @@ static void internal_add_timer(struct tv
 	/*
 	 * Timers are FIFO:
 	 */
+check_timer_list(vec, "internal_add_1");
+if (timer->function == wakeup_timer_fn) {
+	struct backing_dev_info *bdi = (struct backing_dev_info *) timer->data;
+
+	pr_info("Adding wakeup %p: bdi %p name %s\n", timer, bdi, bdi->name);
+}
 	list_add_tail(&timer->entry, vec);
+if (timer->function == wakeup_timer_fn)
+	set_alan(timer);
+check_timer_list(vec, "internal_add_2");
+check_alan("add");
 }
 
 #ifdef CONFIG_TIMER_STATS
@@ -608,17 +758,24 @@ void init_timer_deferrable_key(struct ti
 }
 EXPORT_SYMBOL(init_timer_deferrable_key);
 
-static inline void detach_timer(struct timer_list *timer,
+static void detach_timer(struct timer_list *timer,
 				int clear_pending)
 {
 	struct list_head *entry = &timer->entry;
 
+check_alan("detach 1");
+if (timer->function == wakeup_timer_fn) {
+	pr_info("Detaching wakeup %p\n", timer);
+	clear_alan(timer);
+}
+
 	debug_deactivate(timer);
 
 	__list_del(entry->prev, entry->next);
 	if (clear_pending)
 		entry->next = NULL;
 	entry->prev = LIST_POISON2;
+check_alan("detach 2");
 }
 
 /*
@@ -1026,6 +1183,7 @@ static int cascade(struct tvec_base *bas
 	struct list_head tv_list;
 
 	list_replace_init(tv->vec + index, &tv_list);
+check_alan("cascade 1");
 
 	/*
 	 * We are removing _all_ timers from the list, so we
@@ -1033,7 +1191,10 @@ static int cascade(struct tvec_base *bas
 	 */
 	list_for_each_entry_safe(timer, tmp, &tv_list, entry) {
 		BUG_ON(tbase_get_base(timer->base) != base);
+if (timer->function == wakeup_timer_fn)
+	pr_info("Cascading wakeup_timer %p\n", timer);
 		internal_add_timer(base, timer);
+check_alan("cascade 2");
 	}
 
 	return index;
@@ -1109,6 +1270,7 @@ static inline void __run_timers(struct t
 			cascade(base, &base->tv5, INDEX(3));
 		++base->timer_jiffies;
 		list_replace_init(base->tv1.vec + index, &work_list);
+check_alan("run 1");
 		while (!list_empty(head)) {
 			void (*fn)(unsigned long);
 			unsigned long data;
@@ -1148,6 +1310,7 @@ static unsigned long __next_timer_interr
 	/* Look for timer events in tv1. */
 	index = slot = timer_jiffies & TVR_MASK;
 	do {
+check_timer_list(base->tv1.vec + slot, "next_timer_1");
 		list_for_each_entry(nte, base->tv1.vec + slot, entry) {
 			if (tbase_get_deferrable(nte->base))
 				continue;
@@ -1179,6 +1342,7 @@ cascade:
 
 		index = slot = timer_jiffies & TVN_MASK;
 		do {
+check_timer_list(varp->vec + slot, "next_timer_2");
 			list_for_each_entry(nte, varp->vec + slot, entry) {
 				if (tbase_get_deferrable(nte->base))
 					continue;
Index: usb-3.1/mm/backing-dev.c
===================================================================
--- usb-3.1.orig/mm/backing-dev.c
+++ usb-3.1/mm/backing-dev.c
@@ -308,7 +308,7 @@ static void sync_supers_timer_fn(unsigne
 	bdi_arm_supers_timer();
 }
 
-static void wakeup_timer_fn(unsigned long data)
+void wakeup_timer_fn(unsigned long data)
 {
 	struct backing_dev_info *bdi = (struct backing_dev_info *)data;
 
@@ -328,6 +328,8 @@ static void wakeup_timer_fn(unsigned lon
 	spin_unlock_bh(&bdi->wb_lock);
 }
 
+void *last_bdi_unreg;
+
 /*
  * This function is used when the first inode for this bdi is marked dirty. It
  * wakes-up the corresponding bdi thread which should then take care of the
@@ -345,6 +347,8 @@ void bdi_wakeup_thread_delayed(struct ba
 
 	timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
 	mod_timer(&bdi->wb.wakeup_timer, jiffies + timeout);
+if (bdi == last_bdi_unreg)
+	dump_stack();
 }
 
 /*
@@ -547,6 +551,7 @@ int bdi_register(struct backing_dev_info
 			return PTR_ERR(wb->task);
 	}
 
+pr_info("bdi register %s %p\n", dev_name(dev), bdi);
 	bdi_debug_register(bdi, dev_name(dev));
 	set_bit(BDI_registered, &bdi->state);
 
@@ -617,6 +622,8 @@ void bdi_unregister(struct backing_dev_i
 		bdi_set_min_ratio(bdi, 0);
 		trace_writeback_bdi_unregister(bdi);
 		bdi_prune_sb(bdi);
+pr_info("bdi_unreg: wb %p bdi %p\n", &bdi->wb, bdi);
+last_bdi_unreg = bdi;
 		del_timer_sync(&bdi->wb.wakeup_timer);
 
 		if (!bdi_cap_flush_forker(bdi))
@@ -632,6 +639,8 @@ static void bdi_wb_init(struct bdi_write
 {
 	memset(wb, 0, sizeof(*wb));
 
+pr_info("bdi_wb_init: wb %p bdi %p\n", wb, bdi);
+last_bdi_unreg = NULL;
 	wb->bdi = bdi;
 	wb->last_old_flush = jiffies;
 	INIT_LIST_HEAD(&wb->b_dirty);
Index: usb-3.1/drivers/usb/core/usb.c
===================================================================
--- usb-3.1.orig/drivers/usb/core/usb.c
+++ usb-3.1/drivers/usb/core/usb.c
@@ -974,6 +974,29 @@ struct dentry *usb_debug_root;
 EXPORT_SYMBOL_GPL(usb_debug_root);
 
 static struct dentry *usb_debug_devices;
+static struct dentry *alan_dentry;
+
+static ssize_t alan_write(struct file *fd, const char __user *buf,
+		size_t len, loff_t *ptr)
+{
+	unsigned long addr;
+	char buf2[16];
+	void init_alan(unsigned long);
+
+	if (len >= 16)
+		return -EINVAL;
+	buf2[len] = 0;
+	if (copy_from_user(buf2, buf, len))
+		return -EFAULT;
+
+	addr = simple_strtoul(buf2, NULL, 16);
+	init_alan(addr);
+	return len;
+}
+
+static const struct file_operations alan_fops = {
+	.write = alan_write,
+};
 
 static int usb_debugfs_init(void)
 {
@@ -990,11 +1013,17 @@ static int usb_debugfs_init(void)
 		return -ENOENT;
 	}
 
+	alan_dentry = debugfs_create_file("alan", 0200,
+				usb_debug_root, NULL, &alan_fops);
+	if (!alan_dentry)
+		pr_err("Unable to register alan\n");
+
 	return 0;
 }
 
 static void usb_debugfs_cleanup(void)
 {
+	debugfs_remove(alan_dentry);
 	debugfs_remove(usb_debug_devices);
 	debugfs_remove(usb_debug_root);
 }



* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-09 19:13   ` Ted Ts'o
  2011-09-09 22:10     ` Alan Stern
  2011-09-10 18:07     ` Alan Stern
@ 2011-09-12  1:58     ` Alan Stern
  2 siblings, 0 replies; 53+ messages in thread
From: Alan Stern @ 2011-09-12  1:58 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: linux-ext4, rockorequin

Ted:

You also ought to look at this bug report and the follow-up thread.  
The symptoms are similar, although not exactly the same.

	http://marc.info/?l=linux-kernel&m=131504588401397&w=2
	([BUG] D state process after unplug and umount usb disk)

Alan Stern



* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-09 19:13   ` Ted Ts'o
  2011-09-09 22:10     ` Alan Stern
@ 2011-09-10 18:07     ` Alan Stern
  2011-09-12  1:58     ` Alan Stern
  2 siblings, 0 replies; 53+ messages in thread
From: Alan Stern @ 2011-09-10 18:07 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: bugzilla-daemon, linux-ext4, rockorequin

On Fri, 9 Sep 2011, Ted Ts'o wrote:

> commit 6e478d46e58181ec4814f25a2fd91c6323e16ad4
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Fri Sep 9 15:02:54 2011 -0400
> 
>     ext4: add ext4-specific kludge to avoid an oops after the disk disappears
>     
>     The del_gendisk() function uninitializes the disk-specific data
>     structures, including the bdi structure, without telling anyone
>     else.  Once this happens, any attempt to call mark_buffer_dirty()
>     (for example, by ext4_commit_super), will cause a kernel OOPS.
>     
>     Fix this for now until we can fix things in an architecturally correct
>     way.
>     
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

Further testing revealed the following problem.  I changed the test 
script so that after the USB device is unbound, the script tries to 
write a file before unmounting the ext4 filesystem.

There was no drastic failure; the unregistered bdi structure wasn't
accessed.  But lockdep complained.  This is what I got:

[  166.932194] end_request: I/O error, dev uba, sector 136
[  166.940903] EXT4-fs error (device uba): ext4_find_entry:934: inode #2: comm sh: reading directory lblock 0
[  166.949284] end_request: I/O error, dev uba, sector 164
[  166.952084] EXT4-fs error (device uba): ext4_read_inode_bitmap:161: comm sh: Cannot read inode bitmap - block_group = 0, inode_bitmap = 82
[  166.952906] EXT4-fs error (device uba) in ext4_new_inode:1073: IO failure
[  166.953357] 
[  166.953381] =============================================
[  166.953624] [ INFO: possible recursive locking detected ]
[  166.953958] 3.1.0-rc4 #34
[  166.954099] ---------------------------------------------
[  166.954295] sh/819 is trying to acquire lock:
[  166.954613]  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<c1101290>] ext4_evict_inode+0x17/0x288
[  166.955947] 
[  166.955969] but task is already holding lock:
[  166.956281]  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<c10aeb45>] do_last+0x165/0x4ff
[  166.956586] 
[  166.956586] other info that might help us debug this:
[  166.956586]  Possible unsafe locking scenario:
[  166.956586] 
[  166.956586]        CPU0
[  166.956586]        ----
[  166.956586]   lock(&sb->s_type->i_mutex_key);
[  166.956586]   lock(&sb->s_type->i_mutex_key);
[  166.956586] 
[  166.956586]  *** DEADLOCK ***
[  166.956586] 
[  166.956586]  May be due to missing lock nesting notation
[  166.956586] 
[  166.956586] 2 locks held by sh/819:
[  166.956586]  #0:  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<c10aeb45>] do_last+0x165/0x4ff
[  166.956586]  #1:  (jbd2_handle){+.+...}, at: [<c112469f>] start_this_handle+0x3c2/0x41e
[  166.956586] 
[  166.956586] stack backtrace:
[  166.956586] Pid: 819, comm: sh Not tainted 3.1.0-rc4 #34
[  166.956586] Call Trace:
[  166.956586]  [<c135f26e>] ? printk+0xf/0x11
[  166.956586]  [<c105223c>] __lock_acquire+0x875/0xbe7
[  166.956586]  [<c1361600>] ? _raw_spin_unlock_irq+0x2d/0x30
[  166.956586]  [<c105183a>] ? mark_lock+0x26/0x1b3
[  166.956586]  [<c105183a>] ? mark_lock+0x26/0x1b3
[  166.956586]  [<c1052944>] lock_acquire+0x59/0x70
[  166.956586]  [<c1101290>] ? ext4_evict_inode+0x17/0x288
[  166.956586]  [<c13601f9>] __mutex_lock_common+0x38/0x2d4
[  166.956586]  [<c1101290>] ? ext4_evict_inode+0x17/0x288
[  166.956586]  [<c1360573>] mutex_lock_nested+0x32/0x3b
[  166.956586]  [<c1101290>] ? ext4_evict_inode+0x17/0x288
[  166.956586]  [<c1101290>] ext4_evict_inode+0x17/0x288
[  166.956586]  [<c10b5f63>] evict+0x7b/0x11c
[  166.956586]  [<c10b6136>] iput+0x132/0x137
[  166.956586]  [<c10fc467>] ext4_new_inode+0xa53/0xa92
[  166.956586]  [<c1108942>] ? ext4_journal_start_sb+0xdd/0xec
[  166.956586]  [<c10b4afb>] ? d_splice_alias+0xa9/0xb1
[  166.956586]  [<c11045ec>] ext4_create+0xa6/0x10b
[  166.956586]  [<c10ae2d7>] vfs_create+0x61/0x7b
[  166.956586]  [<c10aebd7>] do_last+0x1f7/0x4ff
[  166.956586]  [<c10aefa1>] path_openat+0x9d/0x2b7
[  166.956586]  [<c1052636>] ? lock_release_non_nested+0x88/0x1f7
[  166.956586]  [<c10af1f3>] do_filp_open+0x21/0x5d
[  166.956586]  [<c1361666>] ? _raw_spin_unlock+0x1d/0x2a
[  166.956586]  [<c10b78b1>] ? alloc_fd+0xc0/0xcb
[  166.956586]  [<c10a4207>] do_sys_open+0x54/0xcd
[  166.956586]  [<c10a429e>] sys_open+0x1e/0x26
[  166.956586]  [<c1361820>] syscall_call+0x7/0xb
[  167.175766] end_request: I/O error, dev uba, sector 16534
[  167.177204] Aborting journal on device uba-8.
[  167.179255] end_request: I/O error, dev uba, sector 16516
[  167.179768] Buffer I/O error on device uba, logical block 8258
[  167.179983] lost page write due to I/O error on uba
[  167.180866] JBD2: I/O error detected when updating journal superblock for uba-8.
[  167.181956] journal commit I/O error
[  167.195334] EXT4-fs error (device uba): ext4_put_super:817: Couldn't clean up the journal
[  167.195777] EXT4-fs (uba): Remounting filesystem read-only

It appears to be an unrelated error, but worth looking at.
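
As background on why the report above fires even though two different
inodes are involved: the checker tracks lock classes rather than
individual lock instances, so taking a second lock of a class the task
already holds is flagged unless a nesting annotation says otherwise. A
minimal userspace sketch of that bookkeeping, with every name invented for
illustration and no nesting annotations modelled:

```c
#include <stddef.h>

#define MINI_MAX_HELD 8

/* Each lock instance carries a key identifying its class. */
struct mini_lock {
	const void *class_key;
};

/* Per-task record of the lock classes currently held. */
struct mini_task {
	const void *held[MINI_MAX_HELD];
	size_t depth;
};

enum mini_result { MINI_OK, MINI_RECURSIVE, MINI_FULL };

/*
 * Record an acquisition. Report MINI_RECURSIVE when the task already
 * holds any lock of the same class, even a different instance.
 */
enum mini_result mini_acquire(struct mini_task *t, const struct mini_lock *l)
{
	size_t i;

	for (i = 0; i < t->depth; i++)
		if (t->held[i] == l->class_key)
			return MINI_RECURSIVE;

	if (t->depth == MINI_MAX_HELD)
		return MINI_FULL;

	t->held[t->depth++] = l->class_key;
	return MINI_OK;
}
```

In the trace, the mutex taken in do_last() and the one taken in
ext4_evict_inode() share the class i_mutex_key#9, which corresponds to
the MINI_RECURSIVE case here.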

Alan Stern


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found]       ` <BAY151-W6176D929049AA9E2BDBAEBA1000@phx.gbl>
@ 2011-09-10 14:06         ` Ted Ts'o
  0 siblings, 0 replies; 53+ messages in thread
From: Ted Ts'o @ 2011-09-10 14:06 UTC (permalink / raw)
  To: Rocko Requin; +Cc: stern, bugzilla-daemon, linux-ext4

On Sat, Sep 10, 2011 at 01:49:44AM +0000, Rocko Requin wrote:
> 
> The patch does stop my console-only test script from crashing the
> kernel (thanks for figuring this patch out, Ted!), but if I try it
> from a desktop, the desktop still freezes after two or three
> bind/unbind iterations. So I guess there must be another way to try
> and access the missing file system that also needs patching.

Can you get stack traces or register information?  Via sysrq-t /
sysrq-p?  This might require setting up serial console on your desktop
if you don't have kernel mode switching set up so you can switch away
from the X server.
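
For reference, a minimal sketch of requesting the same dumps from a shell
instead of the keyboard, assuming root access and a kernel with magic SysRq
support.  The helper name sysrq_dump and the SYSRQ_TRIGGER override are
made up for illustration; /proc/sysrq-trigger is the standard interface.

```shell
#!/bin/sh
# Write SysRq command characters to the trigger file.
#   t = dump task states and stack traces
#   p = dump registers of the current CPU
# SYSRQ_TRIGGER is a hypothetical override so the helper can be pointed
# at a scratch file for a dry run.
sysrq_dump() {
    trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    for key in "$@"; do
        case "$key" in
            t|p) printf '%s' "$key" > "$trigger" ;;
            *)   echo "unsupported sysrq key: $key" >&2; return 1 ;;
        esac
    done
}
# Example (as root): sysrq_dump t p; dmesg | tail -n 200
```

The output lands in the kernel log, so it still has to be captured via a
serial console or netconsole if the machine is about to hang.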

							- Ted

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-09 19:13   ` Ted Ts'o
@ 2011-09-09 22:10     ` Alan Stern
       [not found]       ` <BAY151-W6176D929049AA9E2BDBAEBA1000@phx.gbl>
  2011-09-10 18:07     ` Alan Stern
  2011-09-12  1:58     ` Alan Stern
  2 siblings, 1 reply; 53+ messages in thread
From: Alan Stern @ 2011-09-09 22:10 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: bugzilla-daemon, linux-ext4, rockorequin

On Fri, 9 Sep 2011, Ted Ts'o wrote:

> Rocko, Alan, could you try this patch and see what happens?  It may be
> that we'll crash somewhere else; the problem is that the low-level
> generic hd routines in Linux don't have a formal way of informing
> the VFS and layers below that the disk has disappeared.  It just yanks
> it out from under the file system, and we've been manually patching
> around kernel crashes....

I confirm that this patch fixes the issue with my test script.  The 
unmounts occurred with no apparent problems.

However, you probably should make the same change to the ext3 driver, 
because it has exactly the same issue and some people may still be 
using it.

Alan Stern


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
  2011-09-05 17:44 ` bugzilla-daemon
@ 2011-09-09 19:13   ` Ted Ts'o
  2011-09-09 22:10     ` Alan Stern
                       ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Ted Ts'o @ 2011-09-09 19:13 UTC (permalink / raw)
  To: bugzilla-daemon; +Cc: linux-ext4, rockorequin, stern

Bugzilla.kernel.org is down, so apologies to people who have
subscribed to this bug but whom I didn't cc explicitly...

Rocko, Alan, could you try this patch and see what happens?  It may be
that we'll crash somewhere else; the problem is that the low-level
generic hd routines in Linux don't have a formal way of informing
the VFS and layers below that the disk has disappeared.  It just yanks
it out from under the file system, and we've been manually patching
around kernel crashes....

     	   	 	   	       	    - Ted

commit 6e478d46e58181ec4814f25a2fd91c6323e16ad4
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Fri Sep 9 15:02:54 2011 -0400

    ext4: add ext4-specific kludge to avoid an oops after the disk disappears
    
    The del_gendisk() function uninitializes the disk-specific data
    structures, including the bdi structure, without telling anyone
    else.  Once this happens, any attempt to call mark_buffer_dirty()
    (for example, by ext4_commit_super), will cause a kernel OOPS.
    
    Fix this for now until we can fix things in an architecturally correct
    way.
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ee2f74a..48cb615 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -414,6 +414,22 @@ static void save_error_info(struct super_block *sb, const char *func,
 	ext4_commit_super(sb, 1);
 }
 
+/*
+ * The del_gendisk() function deactivates the inode and deactivates
+ * the bdi without telling the file system.  Once this happens, any
+ * attempt to call mark_buffer_dirty() (for example, by
+ * ext4_commit_super), will cause a kernel OOPS.  This is a kludge to
+ * prevent these oops until we can put in a proper hook in
+ * del_gendisk() to inform the VFS and file system layers.
+ */
+static int block_device_ejected(struct super_block *sb)
+{
+	struct inode *bd_inode = sb->s_bdev->bd_inode;
+	struct backing_dev_info *bdi = bd_inode->i_mapping->backing_dev_info;
+
+	return bdi->dev == NULL;
+}
+
 
 /* Deal with the reporting of failure conditions on a filesystem such as
  * inconsistencies detected or read IO failures.
@@ -4072,7 +4088,7 @@ static int ext4_commit_super(struct super_block *sb, int sync)
 	struct buffer_head *sbh = EXT4_SB(sb)->s_sbh;
 	int error = 0;
 
-	if (!sbh)
+	if (!sbh || block_device_ejected(sb))
 		return error;
 	if (buffer_write_io_error(sbh)) {
 		/*

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (29 preceding siblings ...)
  2011-09-04 14:00 ` bugzilla-daemon
@ 2011-09-05 17:44 ` bugzilla-daemon
  2011-09-09 19:13   ` Ted Ts'o
  2012-07-02 13:24 ` bugzilla-daemon
  31 siblings, 1 reply; 53+ messages in thread
From: bugzilla-daemon @ 2011-09-05 17:44 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #93 from Alan Stern <stern@rowland.harvard.edu>  2011-09-05 17:44:30 ---
It causes the ext4 driver to be used for ext2 and ext3 filesystems, instead of
using the ext2 and ext3 drivers.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (28 preceding siblings ...)
  2011-09-04 13:55 ` bugzilla-daemon
@ 2011-09-04 14:00 ` bugzilla-daemon
  2011-09-05 17:44 ` bugzilla-daemon
  2012-07-02 13:24 ` bugzilla-daemon
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-09-04 14:00 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #92 from rocko <rockorequin@hotmail.com>  2011-09-04 14:00:29 ---
Hey, that is good news!

I can't find that setting at all in my config. I only have these for
CONFIG_EXT4_:

CONFIG_EXT4_FS=y
CONFIG_EXT4_FS_XATTR=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_DEBUG is not set

What is it supposed to do?

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (27 preceding siblings ...)
  2011-09-04  3:53 ` bugzilla-daemon
@ 2011-09-04 13:55 ` bugzilla-daemon
  2011-09-04 14:00 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-09-04 13:55 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #91 from Alan Stern <stern@rowland.harvard.edu>  2011-09-04 13:55:27 ---
Good news: I have been able to reproduce the same sort of crash regularly,
using a variant of your script.  That will make debugging a lot easier, even
though it will still be difficult.

Incidentally, is CONFIG_EXT4_USE_FOR_EXT23 set in your test kernel
configuration?
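
A quick way to check, sketched under the assumption that the distribution
installs the kernel configuration as /boot/config-$(uname -r) (as Ubuntu
does).  Both function names are illustrative only.

```shell
#!/bin/sh
# Print every CONFIG_EXT4_* line, including "is not set" comments.
# The default path is an assumption; pass another file as $1 if needed.
show_ext4_config() {
    cfg="${1:-/boot/config-$(uname -r)}"
    grep 'CONFIG_EXT4_' "$cfg"
}

# Succeed only if the ext4 driver is configured to handle ext2/ext3 too.
ext4_handles_ext23() {
    cfg="${1:-/boot/config-$(uname -r)}"
    grep -q '^CONFIG_EXT4_USE_FOR_EXT23=y' "$cfg"
}
# Example: ext4_handles_ext23 && echo "ext4 driver is used for ext2/ext3"
```

If the option does not appear in the file at all, it is either not offered
by that kernel version or hidden by its dependencies.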

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (26 preceding siblings ...)
  2011-09-01  1:30 ` bugzilla-daemon
@ 2011-09-04  3:53 ` bugzilla-daemon
  2011-09-04 13:55 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-09-04  3:53 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #90 from rocko <rockorequin@hotmail.com>  2011-09-04 03:53:16 ---
BTW, CONFIG_NETCONSOLE=y doesn't make any difference for me (it still crashes
the kernel), although I didn't really expect it to since I had it compiled in
as a module and was specifically loading it prior to running the test that
crashes the kernel.
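
For anyone trying the modular setup, a minimal sketch of loading netconsole
by hand.  The interface name, target address and the helper name
netconsole_param are placeholders; the parameter layout follows the
kernel's netconsole documentation, with the source port and source IP left
empty so the defaults apply.

```shell
#!/bin/sh
# Build the netconsole= module parameter:
#   [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# Usage: netconsole_param DEV TARGET_IP [TARGET_PORT] [TARGET_MAC]
netconsole_param() {
    printf 'netconsole=@/%s,%s@%s/%s\n' "$1" "${3:-6666}" "$2" "${4:-}"
}
# Example (as root):
#   modprobe netconsole "$(netconsole_param eth0 192.168.1.10)"
# The receiving machine then listens for UDP packets on port 6666.
```

Leaving the target MAC empty makes the module use the broadcast address,
which is usually enough on a single LAN segment.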

So I have a reproducible test case (which, even better, works in VirtualBox
VM), but the error information generated in the crash log isn't sufficient for
tracking down this problem. Where do I go from here?

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (25 preceding siblings ...)
  2011-08-31 23:43 ` bugzilla-daemon
@ 2011-09-01  1:30 ` bugzilla-daemon
  2011-09-04  3:53 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-09-01  1:30 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832

--- Comment #89 from rocko <rockorequin@hotmail.com>  2011-09-01 01:30:13 ---
Created an attachment (id=71072)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=71072)
usbonoff-with-mount.sh - script to reproduce the crash

For the record, it is very easy to reproduce the bug with the attached script,
which repeatedly mounts, forces an eject, then umounts the USB device until the
kernel eventually crashes. The procedure is:

1. Create a VirtualBox VM running Ubuntu 11.04 64-bit with the kernel being
tested and boot the VM in recovery mode, ie so no desktop is running.

2. Use the VirtualBox USB facility (right click on the USB icon in the status
bar) to attach an ext3/4 drive to the VM.

3. Run the attached script, specifying the target USB device as the argument
(the script lists possible devices if you don't supply an argument). There's a
commented-out option to redirect output to an external machine via
netconsole as well, ie to capture the crash log, which you can't fully see in
the VM window.
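
The attachment itself is not reproduced in this message.  As a rough
sketch of the loop described above (mount, force an eject by unbinding
usb-storage, then umount), something along these lines should behave
similarly.  The USB interface ID, device node, mount point, sleep times and
all function names are assumptions for illustration; only the usb-storage
bind/unbind files in sysfs are standard.  Setting DRY_RUN=1 prints the
commands instead of running them.

```shell
#!/bin/sh
DRV=/sys/bus/usb/drivers/usb-storage

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# write_sysfs VALUE FILE: write VALUE to a sysfs attribute.
write_sysfs() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "echo $1 > $2"
    else
        printf '%s' "$1" > "$2"
    fi
}

# cycle USB_INTERFACE_ID BLOCK_DEVICE MOUNT_POINT
cycle() {
    run mount "$2" "$3"
    write_sysfs "$1" "$DRV/unbind"   # device vanishes while still mounted
    run sleep 1
    run umount "$3"                  # the step that tended to crash
    write_sysfs "$1" "$DRV/bind"     # bring the device back
    run sleep 5                      # give it time to be probed again
}
# Example (as root): while :; do cycle 1-1:1.0 /dev/sdb1 /tmp/usb; done
```

The interface ID is the directory name of the device under
/sys/bus/usb/drivers/usb-storage/ while it is bound.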

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (24 preceding siblings ...)
  2011-08-31 14:36 ` bugzilla-daemon
@ 2011-08-31 23:43 ` bugzilla-daemon
  2011-09-01  1:30 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-08-31 23:43 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #88 from rocko <rockorequin@hotmail.com>  2011-08-31 23:43:20 ---
Running my desktop-less usb-on-off-with-mount.sh test script in 3.1-rc4, the
drive stayed at /dev/sdb1 each time. So that seems to be fixed.

The kernel eventually crashed on a umount (not straightaway like last time).
load_balance appears in the limited stack trace that I can see in the VM.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (23 preceding siblings ...)
  2011-08-31  5:07 ` bugzilla-daemon
@ 2011-08-31 14:36 ` bugzilla-daemon
  2011-08-31 23:43 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-08-31 14:36 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832

--- Comment #87 from Alan Stern <stern@rowland.harvard.edu>  2011-08-31 14:35:39 ---
I think all I had was  CONFIG_NETCONSOLE=y.

As for fixing the problem...  To be honest, you shouldn't expect it to get
fixed until somebody can identify what's causing it.  Since you seem to be one
of the very few people experiencing it regularly, the situation doesn't look
good until you can provide more information.

The best course of action is to narrow down the variables as much as possible. 
That means not running any extraneous programs (i.e., don't run a graphical
desktop, and indeed, don't run X at all).  It also means coming up with a very
repeatable scenario to trigger the problem.  Something like what you described
in comment #75 would be good.

Speaking of which, you mentioned in that comment that on each loop through the
test, the drive letter would go up by one.  That should not have happened! 
It's another indication of something strange.  Can you verify -- does this
still happen with 3.1-rc4?

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (22 preceding siblings ...)
  2011-08-31  5:00 ` bugzilla-daemon
@ 2011-08-31  5:07 ` bugzilla-daemon
  2011-08-31 14:36 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-08-31  5:07 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #86 from rocko <rockorequin@hotmail.com>  2011-08-31 05:07:27 ---
@Alan: re your comment #78, how did you compile in netconsole support? I have
the following set in .config for my builds, which is the default in Ubuntu's
kernel:

CONFIG_NETCONSOLE=m
CONFIG_NETCONSOLE_DYNAMIC=y

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (21 preceding siblings ...)
  2011-07-13  7:52 ` bugzilla-daemon
@ 2011-08-31  5:00 ` bugzilla-daemon
  2011-08-31  5:07 ` bugzilla-daemon
                   ` (8 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-08-31  5:00 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #85 from rocko <rockorequin@hotmail.com>  2011-08-31 05:00:32 ---
This bug is still an issue in 3.0.4 and 3.1-rc4. Sadly, this is making linux
look rather unreliable for real-world use. :(

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (20 preceding siblings ...)
  2011-05-26 14:27 ` bugzilla-daemon
@ 2011-07-13  7:52 ` bugzilla-daemon
  2011-08-31  5:00 ` bugzilla-daemon
                   ` (9 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-07-13  7:52 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832


Bryce Nesbitt <bryce2@obviously.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bryce2@obviously.com




--- Comment #84 from Bryce Nesbitt <bryce2@obviously.com>  2011-07-13 07:52:10 ---
The bug bit me today, after I pulled out a mounted ext3 USB drive.  A total
kernel hang.  Video still displayed.  No ping.  No mouse.  No keyboard. 
Unfortunately I forgot about Magic SysRq
(http://en.wikipedia.org/wiki/Magic_SysRq_key ) and did not probe.  Here's all
I got:

# vi /var/log/syslog
Jul 13 00:31:43 ubuntu kernel: [181916.094245] JBD2: I/O error detected when
updating journal superblock for sde1-8.
Jul 13 00:40:49 ubuntu kernel: imklog 4.6.4, log source = /proc/kmsg started.

# uname -a
Linux ubuntu 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011
x86_64 x86_64 x86_64 GNU/Linux

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (19 preceding siblings ...)
  2011-05-26  6:44 ` bugzilla-daemon
@ 2011-05-26 14:27 ` bugzilla-daemon
  2011-07-13  7:52 ` bugzilla-daemon
                   ` (10 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-05-26 14:27 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #83 from Alan Stern <stern@rowland.harvard.edu>  2011-05-26 14:27:43 ---
This is a completely different issue from the main problem in this bug report. 
You should report it separately -- perhaps by posting it on the linux-usb
mailing list.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (18 preceding siblings ...)
  2011-05-10 23:27 ` bugzilla-daemon
@ 2011-05-26  6:44 ` bugzilla-daemon
  2011-05-26 14:27 ` bugzilla-daemon
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-05-26  6:44 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #82 from rocko <rockorequin@hotmail.com>  2011-05-26 06:44:12 ---
I just hit this again in 2.6.39. I was trying to delete a partition from an
external drive, something went wrong and for some reason the kernel dumped all
the other attached USB drives, whereupon the entire machine crashed with an
"unable to handle paging" then "scheduling while atomic" oops. It's a really
annoying bug.


This is what was recorded in the system log (it actually got written this
time):

May 26 14:28:02 hercules kernel: [128626.171297] xhci_hcd 0000:04:00.0: WARN:
Stalled endpoint
May 26 14:29:36 hercules kernel: [128720.267229] xhci_hcd 0000:04:00.0: WARN:
Stalled endpoint
May 26 14:29:36 hercules kernel: [128720.353640] xhci_hcd 0000:04:00.0: WARN:
Stalled endpoint
May 26 14:29:36 hercules kernel: [128720.380617] xhci_hcd 0000:04:00.0: WARN:
Stalled endpoint
May 26 14:29:41 hercules kernel: [128725.194501] xhci_hcd 0000:04:00.0: WARN:
Stalled endpoint
May 26 14:29:41 hercules kernel: [128725.222284] xhci_hcd 0000:04:00.0: WARN:
Stalled endpoint
May 26 14:29:46 hercules kernel: [128729.766858] xhci_hcd 0000:04:00.0: WARN:
Stalled endpoint
May 26 14:29:46 hercules kernel: [128729.801933] xhci_hcd 0000:04:00.0: WARN:
Stalled endpoint
May 26 14:29:53 hercules kernel: [128736.817629] xhci_hcd 0000:04:00.0: WARN:
Stalled endpoint
May 26 14:29:53 hercules kernel: [128736.817863] xhci_hcd 0000:04:00.0: ERROR
no room on ep ring
May 26 14:29:53 hercules kernel: [128736.817873] xhci_hcd 0000:04:00.0: ERR: No
room for command on command ring
May 26 14:29:53 hercules kernel: [128736.817882] xhci_hcd 0000:04:00.0: FIXME
allocate a new ring segment
May 26 14:30:00 hercules kernel: [128743.863985] xhci_hcd 0000:04:00.0: ERROR
no room on ep ring
May 26 14:30:00 hercules kernel: [128743.863998] xhci_hcd 0000:04:00.0: ERR: No
room for command on command ring
May 26 14:30:05 hercules kernel: [128748.872230] xhci_hcd 0000:04:00.0: xHCI
host not responding to stop endpoint command.
May 26 14:30:05 hercules kernel: [128748.872240] xhci_hcd 0000:04:00.0:
Assuming host is dying, halting host.
May 26 14:30:05 hercules kernel: [128748.878855] xhci_hcd 0000:04:00.0: HC
died; cleaning up
May 26 14:30:05 hercules kernel: [128748.878960] usb 3-1: USB disconnect,
device number 12
May 26 14:30:05 hercules kernel: [128748.878971] usb 3-1.1: USB disconnect,
device number 14
May 26 14:30:05 hercules kernel: [128748.879684] sd 18:0:0:0: Device offlined -
not ready after error recovery
May 26 14:30:05 hercules kernel: [128748.879717] sd 18:0:0:0: rejecting I/O to
offline device
May 26 14:30:05 hercules kernel: [128748.879786] sd 18:0:0:0: [sdf] Unhandled
error code
May 26 14:30:05 hercules kernel: [128748.879793] sd 18:0:0:0: [sdf]  Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 26 14:30:05 hercules kernel: [128748.879804] sd 18:0:0:0: [sdf] CDB:
Read(10): 28 00 00 00 00 00 00 00 08 00
May 26 14:30:05 hercules kernel: [128748.879830] end_request: I/O error, dev
sdf, sector 0
May 26 14:30:05 hercules kernel: [128748.879840] Buffer I/O error on device
sdf, logical block 0
May 26 14:30:05 hercules kernel: [128748.879915] sd 18:0:0:0: rejecting I/O to
offline device
May 26 14:30:05 hercules kernel: [128748.879978] sd 18:0:0:0: rejecting I/O to
offline device
May 26 14:30:05 hercules kernel: [128748.963133] usb 3-1.5: USB disconnect,
device number 15
May 26 14:30:05 hercules kernel: [128749.116280] JBD2: I/O error detected when
updating journal superblock for sdd1-8.
May 26 14:30:05 hercules kernel: [128749.222460] usb 3-1.6: USB disconnect,
device number 13
May 26 14:30:05 hercules kernel: [128749.240211] JBD2: I/O error detected when
updating journal superblock for sdb1-8.
May 26 14:30:05 hercules kernel: [128749.262545] usb 3-1.7: USB disconnect,
device number 16
May 26 14:30:05 hercules kernel: [128749.269556] JBD2: I/O error detected when
updating journal superblock for sde1-8.
May 26 14:30:05 hercules kernel: [128749.363281] usb 3-2: USB disconnect,
device number 17
May 26 14:30:10 hercules kernel: [128754.270529] BUG: unable to handle kernel
paging request at 00000001638ca0f8


The "scheduling while atomic" message was shown on the screen, with a short
stack trace ("cpuidle_idle_call / cpu_idle / start_secondary").

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (17 preceding siblings ...)
  2011-05-04  7:36 ` bugzilla-daemon
@ 2011-05-10 23:27 ` bugzilla-daemon
  2011-05-26  6:44 ` bugzilla-daemon
                   ` (12 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-05-10 23:27 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #81 from rocko <rockorequin@hotmail.com>  2011-05-10 23:27:56 ---
Still present in 2.6.39-rc7.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (16 preceding siblings ...)
  2011-05-03  2:19 ` bugzilla-daemon
@ 2011-05-04  7:36 ` bugzilla-daemon
  2011-05-10 23:27 ` bugzilla-daemon
                   ` (13 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-05-04  7:36 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #80 from rocko <rockorequin@hotmail.com>  2011-05-04 07:36:48 ---
The null pointer dereference still happens in 2.6.39-rc6.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (15 preceding siblings ...)
  2011-04-26 18:15 ` bugzilla-daemon
@ 2011-05-03  2:19 ` bugzilla-daemon
  2011-05-04  7:36 ` bugzilla-daemon
                   ` (14 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-05-03  2:19 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #79 from rocko <rockorequin@hotmail.com>  2011-05-03 02:19:42 ---
Created an attachment (id=56272)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=56272)
oops log for null pointer dereference 2.6.39-rc5

Still present in 2.6.39-rc5 (log attached).

Yes, I can readily reproduce this bug in a VM with my current ext4 USB test key
- the attached log is from a fresh install of Ubuntu 11.04 amd64 in VirtualBox
4.0.6, with the kernel upgraded to 2.6.39-rc5.

Fwiw, 2.6.39-rc5 behaved a bit differently from 2.6.38: the system reported
issues with eg /dev/sdb1 and I started seeing multiple mountpoints in the
desktop that I couldn't unmount (/media/disk, /media/disk_, /media/disk__).

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (14 preceding siblings ...)
  2011-04-26  4:02 ` bugzilla-daemon
@ 2011-04-26 18:15 ` bugzilla-daemon
  2011-05-03  2:19 ` bugzilla-daemon
                   ` (15 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-26 18:15 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #78 from Alan Stern <stern@rowland.harvard.edu>  2011-04-26 18:15:00 ---
I can report triggering a similar problem -- once.  I attached a USB drive with
an ext4 filesystem, mounted it, read a file from it, unbound usb-storage, and
then unmounted it.  No desktop was running at the time.  About a second later I
got a nasty crash -- an unending stream of log messages scrolling up the
console, with no way to stop it other than powering off the machine.

So then I rebuilt the test kernel to add netconsole support, and I have not
been able to trigger the problem since.

It wouldn't be at all surprising to find out that this bug lies not in the
filesystem code but somewhere lower, such as the block layer or even the SCSI
core.  It seems to have a large random component as well as a delayed impact. 
Rocko is able to trigger it a lot more reproducibly.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (13 preceding siblings ...)
  2011-04-26  3:29 ` bugzilla-daemon
@ 2011-04-26  4:02 ` bugzilla-daemon
  2011-04-26 18:15 ` bugzilla-daemon
                   ` (16 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-26  4:02 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #76 from Theodore Tso <tytso@mit.edu>  2011-04-26 03:28:39 ---
So for this latest result, which file systems have you been using?  ext3? ext4?
ntfs? vfat?

--- Comment #77 from rocko <rockorequin@hotmail.com>  2011-04-26 04:02:41 ---
This is with ext4 on both the root file system and the external USB drive.

AFAIK, I have only ever reproduced the crash on ext4, but I haven't done much
experimentation with ext3 as I don't use it. Another user told me he had the
same issue - oops on resume after removing drive - and he was using an ext3
drive. I did try reproducing the crash with vfat and ntfs but couldn't make the
kernel crash.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (12 preceding siblings ...)
  2011-04-26  1:22 ` bugzilla-daemon
@ 2011-04-26  3:29 ` bugzilla-daemon
  2011-04-26  4:02 ` bugzilla-daemon
                   ` (17 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-26  3:29 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #76 from Theodore Tso <tytso@mit.edu>  2011-04-26 03:28:39 ---
So for this latest result, which file systems have you been using?  ext3? ext4?
ntfs? vfat?

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (11 preceding siblings ...)
  2011-04-26  0:44 ` bugzilla-daemon
@ 2011-04-26  1:22 ` bugzilla-daemon
  2011-04-26  3:29 ` bugzilla-daemon
                   ` (18 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-26  1:22 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832

--- Comment #75 from rocko <rockorequin@hotmail.com>  2011-04-26 01:22:52 ---
Created an attachment (id=55522)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=55522)
null pointer dereference hit while running without a desktop

I just now managed to crash the kernel *without* a desktop running. Running the
VM in recovery mode (ie console with no desktop), I tried two ways:

a) I modified the script to mount the device after binding it. With no other
modifications, the kernel did not crash with around 50 bind/mount/unbind
attempts (which is not conclusive but seems a reasonable number of tests to
try). Note that with this setup, the device kept getting a new drive letter on
each bind, ie /dev/sdb, /dev/sdc, /dev/sdd, etc, whereas with a desktop running
it is assigned each time to /dev/sdb.

b) I modified the script to umount the device immediately after the subsequent
unbind, ie the process is bind, mount on /tmp/usb, unbind, umount /tmp/usb. It
crashed with the null pointer dereference first time (log attached just in
case).

So the umount might be the key to the issue. I should think the desktop's
auto-mounting code would also be trying to umount devices once it realises
they're no longer there.
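
For anyone following along, the sequence in (b) can be sketched roughly as
below. This is not the script used in this report (it is not shown in this
comment). The sysfs path is the usual usb-storage driver location, and the
interface ID, partition, mount point and function names are hypothetical
placeholders. By default it only prints the planned steps. run_cycle() would
actually perform them as root and may crash an affected kernel, so it belongs
in a throwaway VM.

```python
import subprocess
import time

# Assumed sysfs directory holding the usb-storage driver's bind/unbind files.
DRIVER_DIR = "/sys/bus/usb/drivers/usb-storage"


def cycle_steps(interface, partition, mountpoint):
    """Describe one bind / mount / unbind / umount cycle as (action, args) pairs."""
    return [
        ("write", (f"{DRIVER_DIR}/bind", interface)),
        ("run", ("mount", partition, mountpoint)),
        ("write", (f"{DRIVER_DIR}/unbind", interface)),
        ("run", ("umount", mountpoint)),
    ]


def run_cycle(interface, partition, mountpoint, settle=2.0):
    """Execute one cycle as root. Only for a test machine or VM."""
    for action, args in cycle_steps(interface, partition, mountpoint):
        if action == "write":
            path, value = args
            with open(path, "w") as f:
                f.write(value)
            if path.endswith("/bind"):
                # Crude wait for the block device node to appear after binding.
                time.sleep(settle)
        else:
            # umount right after the unbind may fail, so do not raise on error.
            subprocess.run(list(args), check=False)


if __name__ == "__main__":
    # Dry run: print the planned steps for hypothetical device names.
    for step in cycle_steps("1-1:1.0", "/dev/sdb1", "/tmp/usb"):
        print(step)
```

The point of keeping the umount as the very last step is that it runs after
the device has already gone away, which is the ordering described in (b).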


* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (10 preceding siblings ...)
  2011-04-26  0:28 ` bugzilla-daemon
@ 2011-04-26  0:44 ` bugzilla-daemon
  2011-04-26  1:22 ` bugzilla-daemon
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-26  0:44 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #73 from rocko <rockorequin@hotmail.com>  2011-04-26 00:27:38 ---
Yes, sdb is in this case the external USB key that has just been inserted (or
bound by the script).

--- Comment #74 from Theodore Tso <tytso@mit.edu>  2011-04-26 00:44:42 ---
So have you confirmed that if you're not running a desktop, the system doesn't
crash?

Part of the problem is I have absolutely *no* idea what the desktop was doing
at the time of the crash.   The stack trace is completely useless.....


* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (9 preceding siblings ...)
  2011-04-25 20:28 ` bugzilla-daemon
@ 2011-04-26  0:28 ` bugzilla-daemon
  2011-04-26  0:44 ` bugzilla-daemon
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-26  0:28 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #73 from rocko <rockorequin@hotmail.com>  2011-04-26 00:27:38 ---
Yes, sdb is in this case the external USB key that has just been inserted (or
bound by the script).


* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (8 preceding siblings ...)
  2011-04-25  0:39 ` bugzilla-daemon
@ 2011-04-25 20:28 ` bugzilla-daemon
  2011-04-26  0:28 ` bugzilla-daemon
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-25 20:28 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #72 from Theodore Tso <tytso@mit.edu>  2011-04-25 20:28:11 ---
The thing is, given the stack trace and the fact that it's caused by a null
pointer in __next_timer_interrupt, I'm highly dubious this has anything to do
with ext4.   You could put a printk in fs/ext4/super.c:ext4_put_super() and
make sure it's not triggered, since that's the only place where we mess with
timers at all.  But del_timer() properly disables interrupts before mucking
with the pointer, so I'm not convinced it was caused by ext4.  (Also, the ext4
timer is only present if the file system has reported any errors, and it only
fires once every 24 hours, so it's highly unlikely it would be on a timer
bucket that the next timer interrupt would trip against right afterwards.  So
I very much doubt it's caused by the ext4 error reporting timer.)

And if you're seeing this on ext3, which doesn't use a timer at all, then it's
definitely not the fault of the file system layer, but probably something in
the usb block device driver...

In all of the stack traces, there's an scsi disk attach going on right before
the crash:

[ 1255.355192] scsi4 : usb-storage 1-1:1.0
[ 1256.387085] scsi 4:0:0:0: Direct-Access     JetFlash TS2GJF110        0.00 PQ: 0 ANSI: 2
[ 1256.409758] sd 4:0:0:0: Attached scsi generic sg2 type 0
[ 1256.425575] sd 4:0:0:0: [sdb] 4063232 512-byte logical blocks: (2.08 GB/1.93 GiB)
[ 1256.434955] sd 4:0:0:0: [sdb] Write Protect is off
[ 1256.435172] sd 4:0:0:0: [sdb] Mode Sense: 00 00 00 00
[ 1256.447520] sd 4:0:0:0: [sdb] Asking for cache data failed
[ 1256.448031] sd 4:0:0:0: [sdb] Assuming drive cache: write through
[ 1256.484174] sd 4:0:0:0: [sdb] Asking for cache data failed
[ 1256.484174] sd 4:0:0:0: [sdb] Assuming drive cache: write through
[ 1256.493409]  sdb: sdb1
[ 1256.540082] sd 4:0:0:0: [sdb] Asking for cache data failed
[ 1256.540083] sd 4:0:0:0: [sdb] Assuming drive cache: write through
[ 1256.540083] sd 4:0:0:0: [sdb] Attached SCSI removable disk
[ 1257.566980] usb-storage 1-1:1.0: Quirks match for vid 0457 pid 0150: 80
[ 1257.566983] scsi5 : usb-storage 1-1:1.0
[ 1258.400641] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 1258.400818] IP: [<c105ccf8>] __next_timer_interrupt+0xa8/0x160

Was sdb the device that had been just yanked out?  Or was this some other SCSI
device?
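
The "attach right before the crash" pattern can be checked mechanically across
several saved logs. The following is only a rough sketch: the function name
and the choice of "Attached SCSI" and "BUG:" as markers are assumptions based
on the excerpt above, not an established tool. It reports the last SCSI disk
attach seen before the first BUG line, with their timestamps.

```python
import re

# Matches kernel log lines of the form "[ 1256.540083] message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def last_attach_before_oops(log_text):
    """Return ((ts, msg) of last disk attach, (ts, msg) of first BUG line)."""
    last_attach = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip wrapped continuation lines without a timestamp
        ts, msg = float(match.group(1)), match.group(2)
        if "Attached SCSI" in msg:
            last_attach = (ts, msg)
        elif msg.startswith("BUG:"):
            return last_attach, (ts, msg)
    return last_attach, None


SAMPLE = """\
[ 1256.540083] sd 4:0:0:0: [sdb] Attached SCSI removable disk
[ 1257.566983] scsi5 : usb-storage 1-1:1.0
[ 1258.400641] BUG: unable to handle kernel NULL pointer dereference at (null)
"""

if __name__ == "__main__":
    attach, oops = last_attach_before_oops(SAMPLE)
    print("last attach:", attach)
    print("oops:", oops)
    print("gap in seconds:", round(oops[0] - attach[0], 6))
```

Run against the three lines quoted from the log above, the gap between the
sdb attach and the oops comes out at a little under two seconds.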


* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (7 preceding siblings ...)
  2011-04-25  0:37 ` bugzilla-daemon
@ 2011-04-25  0:39 ` bugzilla-daemon
  2011-04-25 20:28 ` bugzilla-daemon
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-25  0:39 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #71 from rocko <rockorequin@hotmail.com>  2011-04-25 00:39:14 ---
Created an attachment (id=55332)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=55332)
and another null pointer dereference

Do these logs have enough information to continue?


* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (6 preceding siblings ...)
  2011-04-25  0:36 ` bugzilla-daemon
@ 2011-04-25  0:37 ` bugzilla-daemon
  2011-04-25  0:39 ` bugzilla-daemon
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-25  0:37 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #70 from rocko <rockorequin@hotmail.com>  2011-04-25 00:37:51 ---
Created an attachment (id=55322)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=55322)
another null pointer dereference


* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (5 preceding siblings ...)
  2011-04-24  1:35 ` bugzilla-daemon
@ 2011-04-25  0:36 ` bugzilla-daemon
  2011-04-25  0:37 ` bugzilla-daemon
                   ` (24 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-25  0:36 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #69 from rocko <rockorequin@hotmail.com>  2011-04-25 00:36:39 ---
Created an attachment (id=55312)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=55312)
oops log for null pointer dereference sync + unbind


* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (4 preceding siblings ...)
  2011-04-23 19:31 ` bugzilla-daemon
@ 2011-04-24  1:35 ` bugzilla-daemon
  2011-04-25  0:36 ` bugzilla-daemon
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-24  1:35 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832

--- Comment #68 from rocko <rockorequin@hotmail.com>  2011-04-24 01:35:15 ---
Created an attachment (id=55282)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=55282)
oops on USB unbind - unable to handle kernel paging request

> Just because *I* don't like desktops that initiate I/O at random times
> when I don't request doesn't mean that other users shouldn't use it.

Honest, I wasn't having a go, I meant it tongue-in-cheek!

I have managed to reproduce the crash on a VM and log the output via
netconsole. An important thing to note is that it made no difference when my
script was set to call sync just prior to the unbind. (In fact, it crashed on
the very first unbind when I did this.)

The VM was doing very little: I booted into the desktop, ran gnome-terminal,
ran the modprobe command to load netconsole, and then ran the unbind/rebind
script. The first crash happened on the fourth unbind.

I've attached the resulting log, which is for an 'unable to handle kernel
paging request' - hopefully it's sufficiently complete, but it doesn't look
much longer than some of the ones I captured manually earlier. Note that I
wasn't doing suspend/resume, just running the unbind/rebind script, and this
one is without sync being called.
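
For reference, the netconsole module takes a single parameter string of the
form [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]. The exact
modprobe command used here was not posted, so the sketch below only shows how
such a command line could be put together. The helper names, the eth0
interface and the 192.0.2.10 address (a documentation range) are
placeholders. The ports are the module's documented defaults.

```python
def netconsole_param(target_ip, target_port=6666, source_port=6665,
                     source_ip="", device="", target_mac=""):
    """Build the netconsole= module parameter; empty fields use kernel defaults."""
    return (f"netconsole={source_port}@{source_ip}/{device},"
            f"{target_port}@{target_ip}/{target_mac}")


def modprobe_command(target_ip, **kwargs):
    """Return the modprobe argument list that loads netconsole (hypothetical helper)."""
    return ["modprobe", "netconsole", netconsole_param(target_ip, **kwargs)]


if __name__ == "__main__":
    # Placeholder values: log from the VM's eth0 to a listener on 192.0.2.10.
    print(" ".join(modprobe_command("192.0.2.10", device="eth0")))
```

On the receiving machine, any UDP listener on the target port will do for
capturing the oops text.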


* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (3 preceding siblings ...)
  2011-04-23  4:12 ` bugzilla-daemon
@ 2011-04-23 19:31 ` bugzilla-daemon
  2011-04-24  1:35 ` bugzilla-daemon
                   ` (26 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-23 19:31 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832

--- Comment #67 from Theodore Tso <tytso@mit.edu>  2011-04-23 19:31:45 ---
Just because *I* don't like desktops that initiate I/O at random times when I
don't request doesn't mean that other users shouldn't use it.  It's just that
if we're talking about making a reliable test case, it's much better if it
doesn't depend on random I/O initiated by a desktop.  The test case should do
whatever I/O is needed, so that it is completely reproducible, even by people who
don't necessarily use the same desktop as you.

Thinking about this some more, *very* recently (as in the most recent merge
window) there have been some changes to avoid deadlock in ext3/4
when freezing and unfreezing file systems for snapshots, and that code path is
also used on suspend/resume.  Those changes came in way after 2.6.36-rc1, so
yes, if they are also causing some issues with 2.6.39-rc2+ systems, it's very
likely that there are different bugs involved.   Which is why I insist on
getting full and accurate OOPS logs, so we can see if they are different
crashes that happen to have apparently the same symptoms caused by the same
event (i.e. USB keys getting rudely yanked out of the system).

By the way, for years and years and years USB disks just didn't work at all
across suspend/resumes.  Which is why I have scripts in my suspend/resume
framework to automatically unmount removable disks at suspend-time....


* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
                   ` (2 preceding siblings ...)
  2011-04-23  0:32 ` bugzilla-daemon
@ 2011-04-23  4:12 ` bugzilla-daemon
  2011-04-23 19:31 ` bugzilla-daemon
                   ` (27 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-23  4:12 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832





--- Comment #66 from rocko <rockorequin@hotmail.com>  2011-04-23 04:12:30 ---
FYI, I reproduced it on the second attempt at suspend/resume with 2.6.38.4. I
was slightly hopeful it might be fixed as some of the patches seemed to be
addressing this kind of situation.

I had a thought about the consistency of the logs: I think the ones related to
this bug might be consistent, it's just that I've posted logs that might be due
to different bugs entirely. The ntfs/fuse one turned out to be another bug, for
instance. It's possible the logs from 2.6.36-rc1 were caused by a number of
other bugs, rc1 being in all likelihood the least robust of any kernel release.


* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
  2011-04-22 13:42 ` bugzilla-daemon
  2011-04-22 15:00 ` bugzilla-daemon
@ 2011-04-23  0:32 ` bugzilla-daemon
  2011-04-23  4:12 ` bugzilla-daemon
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-23  0:32 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832

--- Comment #65 from rocko <rockorequin@hotmail.com>  2011-04-23 00:32:32 ---
Yes, the crash only happens when I'm running a desktop (gdm in this case),
partly because this is what handles the auto-mounting of the USB drives. I
suppose we *could* tell people to just not use a desktop at all :)

I don't think users deliberately yank out mounted USB drives. I think the most
likely real-world scenarios that trigger this crash are (1) suspend, remove
drive, resume [ie how I first noticed this], (2) remove the wrong drive by
accident, (3) a power failure makes the drive suddenly go offline [I've seen
that, too].

Anyway, my test crash case using the script above that was working so reliably
for this ext4 USB key is no longer crashing the kernel in either 2.6.38.3 or
2.6.39-rc4 (I've done over a thousand bind/unbind cycles for each now). My
guess is that the suspend/resume test results in a higher likelihood of I/O
(especially if multiple drives are involved) and therefore triggering the bug.

I'm also still curious whether it's possible for the ext3/4 drive to somehow
get its format into a state that causes this to happen, given that yesterday it
crashed so reliably but not today. If so, couldn't it be possible that such a
state could be corrected by a fsck and therefore reduce the chances of this
mystery I/O pattern happening?


* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
  2011-04-22 13:42 ` bugzilla-daemon
@ 2011-04-22 15:00 ` bugzilla-daemon
  2011-04-23  0:32 ` bugzilla-daemon
                   ` (29 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-22 15:00 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832

--- Comment #64 from Theodore Tso <tytso@mit.edu>  2011-04-22 15:00:15 ---
I'm going to guess that your script depends on your desktop trying to access
(and possibly write to) your USB stick during the time that you are causing
the unbind to happen?

A more useful test case would be one that works even if no desktop is running
(i.e., you're logged in via SSH, or the VT console, or the serial console), and
the script contains all of the commands which are accessing the USB storage
device.  Otherwise, it might be dependent on what desktop you are running, and
someone who is using fvwm or KDE (for example) might not be able to reproduce
it if you happened to create the test case while using the GNOME desktop (and
then you have to answer the question of which version of the GNOME desktop,
and what desktop packages you might have installed, etc.)

I very much doubt it has to do with when the file system was fsck'ed.  The real
question is what specific I/O pattern happened to be going on at the time when
the USB stick was yanked out.  And that might explain why I don't see it,
because normally I'm not crazy enough to yank out a device while it's actively
being accessed.  (And I don't like desktops that initiate a lot of I/O behind my
back.... since that generally means it's doing this at times when I might not
like it, such as when I'm running on battery and am trying to conserve
power....)


* [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed
       [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
@ 2011-04-22 13:42 ` bugzilla-daemon
  2011-04-22 15:00 ` bugzilla-daemon
                   ` (30 subsequent siblings)
  31 siblings, 0 replies; 53+ messages in thread
From: bugzilla-daemon @ 2011-04-22 13:42 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=25832


rocko <rockorequin@hotmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|kernel crashes upon resume  |kernel crashes when a
                   |if usb devices are removed  |mounted ext3/4 file system
                   |when suspended              |is physically removed




--- Comment #63 from rocko <rockorequin@hotmail.com>  2011-04-22 13:41:45 ---
2.6.39-rc4 is either _much_ harder to crash, or my script isn't as reliable at
crashing the kernel as I thought (until now I've mostly used the suspend/resume
method with multiple drives attached). I've now done over 200 bind/unbind
cycles of this external ext4 USB key without a crash. But I certainly did crash
it once earlier today.

An observation from earlier that might be relevant here: a couple of weeks ago
one of my drives got itself into a state that made it crash the kernel almost
every time I unplugged it, but after I did an fsck on it it became
significantly less likely to cause the crash. And after my last reboot there
was a lot of fsck'ing going on, probably including the external drive.


end of thread, other threads:[~2012-07-02 13:25 UTC | newest]

Thread overview: 53+ messages
     [not found] <BAY151-W13DDCCEFEB7B68EE506214A10C0@phx.gbl>
2011-09-23 15:18 ` [Bug 25832] kernel crashes when a mounted ext3/4 file system is physically removed Alan Stern
2011-09-23 15:18   ` Alan Stern
     [not found] <BAY151-W234D9A977DF076A732C2AAA1080@phx.gbl>
2011-09-18 14:43 ` Alan Stern
     [not found] <BAY151-W32DCB4BAFEC97DD4913A12A1090@phx.gbl>
2011-09-17 17:34 ` Alan Stern
2011-09-18 23:00   ` Ben Hutchings
2011-09-20  7:32     ` Jun'ichi Nomura
2011-09-22 12:26       ` Hannes Reinecke
2011-09-22 12:26         ` Hannes Reinecke
2011-09-22 12:35         ` James Bottomley
2011-09-22 15:16         ` Alan Stern
2011-09-22 15:16           ` Alan Stern
2011-09-22 16:20           ` Thadeu Lima de Souza Cascardo
2011-09-22 16:32             ` Hannes Reinecke
2011-09-22 16:32               ` Hannes Reinecke
     [not found] <BAY151-W1224E6C1A20D179965A149A1090@phx.gbl>
2011-09-17 13:21 ` Alan Stern
     [not found] <BAY151-W3498E8491E671BDAE90421A1070@phx.gbl>
2011-09-16 16:28 ` Alan Stern
     [not found] <bug-25832-13602@https.bugzilla.kernel.org/>
2011-04-22 13:42 ` bugzilla-daemon
2011-04-22 15:00 ` bugzilla-daemon
2011-04-23  0:32 ` bugzilla-daemon
2011-04-23  4:12 ` bugzilla-daemon
2011-04-23 19:31 ` bugzilla-daemon
2011-04-24  1:35 ` bugzilla-daemon
2011-04-25  0:36 ` bugzilla-daemon
2011-04-25  0:37 ` bugzilla-daemon
2011-04-25  0:39 ` bugzilla-daemon
2011-04-25 20:28 ` bugzilla-daemon
2011-04-26  0:28 ` bugzilla-daemon
2011-04-26  0:44 ` bugzilla-daemon
2011-04-26  1:22 ` bugzilla-daemon
2011-04-26  3:29 ` bugzilla-daemon
2011-04-26  4:02 ` bugzilla-daemon
2011-04-26 18:15 ` bugzilla-daemon
2011-05-03  2:19 ` bugzilla-daemon
2011-05-04  7:36 ` bugzilla-daemon
2011-05-10 23:27 ` bugzilla-daemon
2011-05-26  6:44 ` bugzilla-daemon
2011-05-26 14:27 ` bugzilla-daemon
2011-07-13  7:52 ` bugzilla-daemon
2011-08-31  5:00 ` bugzilla-daemon
2011-08-31  5:07 ` bugzilla-daemon
2011-08-31 14:36 ` bugzilla-daemon
2011-08-31 23:43 ` bugzilla-daemon
2011-09-01  1:30 ` bugzilla-daemon
2011-09-04  3:53 ` bugzilla-daemon
2011-09-04 13:55 ` bugzilla-daemon
2011-09-04 14:00 ` bugzilla-daemon
2011-09-05 17:44 ` bugzilla-daemon
2011-09-09 19:13   ` Ted Ts'o
2011-09-09 22:10     ` Alan Stern
     [not found]       ` <BAY151-W6176D929049AA9E2BDBAEBA1000@phx.gbl>
2011-09-10 14:06         ` Ted Ts'o
2011-09-10 18:07     ` Alan Stern
2011-09-12  1:58     ` Alan Stern
2012-07-02 13:24 ` bugzilla-daemon
