linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] block: Free queue resources at blk_release_queue()
       [not found] <201109221318.p8MDIXPP020015@pentland.suse.de>
@ 2011-09-28  0:47 ` Jens Axboe
  2011-09-28  0:55   ` Linus Torvalds
  0 siblings, 1 reply; 60+ messages in thread
From: Jens Axboe @ 2011-09-28  0:47 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: James Bottomley, linux-scsi, Linux Kernel, Linus Torvalds

On Thu, Sep 22 2011, Hannes Reinecke wrote:
> 
> A kernel crash is observed when a mounted ext3/ext4 filesystem
> is physically removed.
> The problem is that blk_cleanup_queue() frees up some resources
> eg by calling elevator_exit(), which are not checked for in
> normal operation.
> So we should rather move these calls to the destructor function
> blk_release_queue() as at that point all remaining references
> are gone.
> However, in doing so we have to ensure that any externally
> supplied queue_lock is disconnected as the driver might
> free up the lock after the call of blk_cleanup_queue(),

Do we have acks from the people who saw this oops, that this really does
fix it for them? I ask since this fix is vastly cleaner than the other
patch floating around, so I'd much rather pull this one in.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] block: Free queue resources at blk_release_queue()
  2011-09-28  0:47 ` [PATCH] block: Free queue resources at blk_release_queue() Jens Axboe
@ 2011-09-28  0:55   ` Linus Torvalds
  2011-09-28  1:15     ` Jens Axboe
  0 siblings, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2011-09-28  0:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Hannes Reinecke, James Bottomley, linux-scsi, Linux Kernel

On Tue, Sep 27, 2011 at 5:47 PM, Jens Axboe <axboe@kernel.dk> wrote:
>
> Do we have acks from the people who saw this oops, that this really does
> fix it for them? I ask since this fix is vastly cleaner than the other
> patch floating around, so I'd much rather pull this one in.

I don't think I can't test it, since it seems to be some hard-to-hit
race condition. It certainly doesn't seem to happen all the time -
today was the first time I've seen it on one of my computers.

But if you forward the actual patch to me (the one I see on lkml seems
to be broken and doesn't compile cleanly because it's assiging a
structure to a pointer), I'll try it out anyway.

                     Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] block: Free queue resources at blk_release_queue()
  2011-09-28  0:55   ` Linus Torvalds
@ 2011-09-28  1:15     ` Jens Axboe
  2011-09-28  1:59       ` Linus Torvalds
  0 siblings, 1 reply; 60+ messages in thread
From: Jens Axboe @ 2011-09-28  1:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Hannes Reinecke, James Bottomley, linux-scsi, Linux Kernel

On 2011-09-27 18:55, Linus Torvalds wrote:
> On Tue, Sep 27, 2011 at 5:47 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Do we have acks from the people who saw this oops, that this really does
>> fix it for them? I ask since this fix is vastly cleaner than the other
>> patch floating around, so I'd much rather pull this one in.
> 
> I don't think I can't test it, since it seems to be some hard-to-hit
> race condition. It certainly doesn't seem to happen all the time -
> today was the first time I've seen it on one of my computers.
> 
> But if you forward the actual patch to me (the one I see on lkml seems
> to be broken and doesn't compile cleanly because it's assiging a
> structure to a pointer), I'll try it out anyway.

Thanks, that would be great. It's inlined below.

diff --git a/block/blk-core.c b/block/blk-core.c
index b2ed78a..d34433a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -348,9 +348,10 @@ void blk_put_queue(struct request_queue *q)
 EXPORT_SYMBOL(blk_put_queue);
 
 /*
- * Note: If a driver supplied the queue lock, it should not zap that lock
- * unexpectedly as some queue cleanup components like elevator_exit() and
- * blk_throtl_exit() need queue lock.
+ * Note: If a driver supplied the queue lock, it is disconnected
+ * by this function. The actual state of the lock doesn't matter
+ * here as the request_queue isn't accessible after this point
+ * (QUEUE_FLAG_DEAD is set) and no other requests will be queued.
  */
 void blk_cleanup_queue(struct request_queue *q)
 {
@@ -367,10 +368,8 @@ void blk_cleanup_queue(struct request_queue *q)
 	queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
 	mutex_unlock(&q->sysfs_lock);
 
-	if (q->elevator)
-		elevator_exit(q->elevator);
-
-	blk_throtl_exit(q);
+	if (q->queue_lock != &q->__queue_lock)
+		q->queue_lock = &q->__queue_lock;
 
 	blk_put_queue(q);
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index e681805..60fda88 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -479,6 +479,11 @@ static void blk_release_queue(struct kobject *kobj)
 
 	blk_sync_queue(q);
 
+	if (q->elevator)
+		elevator_exit(q->elevator);
+
+	blk_throtl_exit(q);
+
 	if (rl->rq_pool)
 		mempool_destroy(rl->rq_pool);
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH] block: Free queue resources at blk_release_queue()
  2011-09-28  1:15     ` Jens Axboe
@ 2011-09-28  1:59       ` Linus Torvalds
  2011-09-28  2:02         ` Jens Axboe
  2011-09-28  4:10         ` James Bottomley
  0 siblings, 2 replies; 60+ messages in thread
From: Linus Torvalds @ 2011-09-28  1:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Hannes Reinecke, James Bottomley, linux-scsi, Linux Kernel

On Tue, Sep 27, 2011 at 6:15 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> But if you forward the actual patch to me (the one I see on lkml seems
>> to be broken and doesn't compile cleanly because it's assiging a
>> structure to a pointer), I'll try it out anyway.
>
> Thanks, that would be great. It's inlined below.

Well, I did several USB eject events, and nothing bad happened.

But as mentioned, I don't think that means much. Have you tried this
with slub debugging and poisoning? It might be worth some extra
testing that way.

                   Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] block: Free queue resources at blk_release_queue()
  2011-09-28  1:59       ` Linus Torvalds
@ 2011-09-28  2:02         ` Jens Axboe
  2011-09-28  4:10         ` James Bottomley
  1 sibling, 0 replies; 60+ messages in thread
From: Jens Axboe @ 2011-09-28  2:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Hannes Reinecke, James Bottomley, linux-scsi, Linux Kernel

On 2011-09-27 19:59, Linus Torvalds wrote:
> On Tue, Sep 27, 2011 at 6:15 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> But if you forward the actual patch to me (the one I see on lkml seems
>>> to be broken and doesn't compile cleanly because it's assiging a
>>> structure to a pointer), I'll try it out anyway.
>>
>> Thanks, that would be great. It's inlined below.
> 
> Well, I did several USB eject events, and nothing bad happened.
> 
> But as mentioned, I don't think that means much. Have you tried this
> with slub debugging and poisoning? It might be worth some extra
> testing that way.

I'm traveling at the moment, and don't have a stick handy here. I'll try
it tomorrow morning, before and after (both with pagealloc etc debuggin
on).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] block: Free queue resources at blk_release_queue()
  2011-09-28  1:59       ` Linus Torvalds
  2011-09-28  2:02         ` Jens Axboe
@ 2011-09-28  4:10         ` James Bottomley
  2011-09-28 14:08           ` Jens Axboe
  1 sibling, 1 reply; 60+ messages in thread
From: James Bottomley @ 2011-09-28  4:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Hannes Reinecke, James Bottomley, linux-scsi, Linux Kernel

On Tue, 2011-09-27 at 18:59 -0700, Linus Torvalds wrote:
> On Tue, Sep 27, 2011 at 6:15 PM, Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> But if you forward the actual patch to me (the one I see on lkml seems
> >> to be broken and doesn't compile cleanly because it's assiging a
> >> structure to a pointer), I'll try it out anyway.
> >
> > Thanks, that would be great. It's inlined below.
> 
> Well, I did several USB eject events, and nothing bad happened.
> 
> But as mentioned, I don't think that means much. Have you tried this
> with slub debugging and poisoning? It might be worth some extra
> testing that way.

I've been testing a simpler version:

http://marc.info/?l=linux-kernel&m=131300594629839

For a long time now with great success.  The only difference is the
locking cleanup, but SCSI doesn't need that since it doesn't supply its
own lock, so I've been voting for this as the final fix for a while now.

James



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] block: Free queue resources at blk_release_queue()
  2011-09-28  4:10         ` James Bottomley
@ 2011-09-28 14:08           ` Jens Axboe
  2011-09-28 14:11             ` James Bottomley
  0 siblings, 1 reply; 60+ messages in thread
From: Jens Axboe @ 2011-09-28 14:08 UTC (permalink / raw)
  To: James Bottomley
  Cc: Linus Torvalds, Hannes Reinecke, James Bottomley, linux-scsi,
	Linux Kernel

On 2011-09-27 22:10, James Bottomley wrote:
> On Tue, 2011-09-27 at 18:59 -0700, Linus Torvalds wrote:
>> On Tue, Sep 27, 2011 at 6:15 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> But if you forward the actual patch to me (the one I see on lkml seems
>>>> to be broken and doesn't compile cleanly because it's assiging a
>>>> structure to a pointer), I'll try it out anyway.
>>>
>>> Thanks, that would be great. It's inlined below.
>>
>> Well, I did several USB eject events, and nothing bad happened.
>>
>> But as mentioned, I don't think that means much. Have you tried this
>> with slub debugging and poisoning? It might be worth some extra
>> testing that way.
> 
> I've been testing a simpler version:
> 
> http://marc.info/?l=linux-kernel&m=131300594629839
> 
> For a long time now with great success.  The only difference is the
> locking cleanup, but SCSI doesn't need that since it doesn't supply its
> own lock, so I've been voting for this as the final fix for a while now.

The locking cleanup looks good, though, for devices that do use the
embedded lock. But functionally they should be the same for SCSI, so if
you had great success with it, then that's a good data point.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH] block: Free queue resources at blk_release_queue()
  2011-09-28 14:08           ` Jens Axboe
@ 2011-09-28 14:11             ` James Bottomley
  2011-09-28 14:14               ` [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue()) Jens Axboe
  0 siblings, 1 reply; 60+ messages in thread
From: James Bottomley @ 2011-09-28 14:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Hannes Reinecke, James Bottomley, linux-scsi,
	Linux Kernel

On Wed, 2011-09-28 at 08:08 -0600, Jens Axboe wrote:
> On 2011-09-27 22:10, James Bottomley wrote:
> > On Tue, 2011-09-27 at 18:59 -0700, Linus Torvalds wrote:
> >> On Tue, Sep 27, 2011 at 6:15 PM, Jens Axboe <axboe@kernel.dk> wrote:
> >>>>
> >>>> But if you forward the actual patch to me (the one I see on lkml seems
> >>>> to be broken and doesn't compile cleanly because it's assiging a
> >>>> structure to a pointer), I'll try it out anyway.
> >>>
> >>> Thanks, that would be great. It's inlined below.
> >>
> >> Well, I did several USB eject events, and nothing bad happened.
> >>
> >> But as mentioned, I don't think that means much. Have you tried this
> >> with slub debugging and poisoning? It might be worth some extra
> >> testing that way.
> > 
> > I've been testing a simpler version:
> > 
> > http://marc.info/?l=linux-kernel&m=131300594629839
> > 
> > For a long time now with great success.  The only difference is the
> > locking cleanup, but SCSI doesn't need that since it doesn't supply its
> > own lock, so I've been voting for this as the final fix for a while now.
> 
> The locking cleanup looks good, though, for devices that do use the
> embedded lock.

Exactly ... it's the missing piece; without it my patch is really only
addressing SCSI.

>  But functionally they should be the same for SCSI, so if
> you had great success with it, then that's a good data point.

Right, I've run it through the memory debugger and USB ejection testing
(with ext2, which seems to be the fs that triggers this problem the
most).

James




^ permalink raw reply	[flat|nested] 60+ messages in thread

* [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-09-28 14:11             ` James Bottomley
@ 2011-09-28 14:14               ` Jens Axboe
  2011-09-28 15:22                 ` Linus Torvalds
  2011-10-13 13:09                 ` Steffen Maier
  0 siblings, 2 replies; 60+ messages in thread
From: Jens Axboe @ 2011-09-28 14:14 UTC (permalink / raw)
  To: James Bottomley
  Cc: Linus Torvalds, Hannes Reinecke, James Bottomley, linux-scsi,
	Linux Kernel

On 2011-09-28 08:11, James Bottomley wrote:
> On Wed, 2011-09-28 at 08:08 -0600, Jens Axboe wrote:
>> On 2011-09-27 22:10, James Bottomley wrote:
>>> On Tue, 2011-09-27 at 18:59 -0700, Linus Torvalds wrote:
>>>> On Tue, Sep 27, 2011 at 6:15 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> But if you forward the actual patch to me (the one I see on lkml seems
>>>>>> to be broken and doesn't compile cleanly because it's assiging a
>>>>>> structure to a pointer), I'll try it out anyway.
>>>>>
>>>>> Thanks, that would be great. It's inlined below.
>>>>
>>>> Well, I did several USB eject events, and nothing bad happened.
>>>>
>>>> But as mentioned, I don't think that means much. Have you tried this
>>>> with slub debugging and poisoning? It might be worth some extra
>>>> testing that way.
>>>
>>> I've been testing a simpler version:
>>>
>>> http://marc.info/?l=linux-kernel&m=131300594629839
>>>
>>> For a long time now with great success.  The only difference is the
>>> locking cleanup, but SCSI doesn't need that since it doesn't supply its
>>> own lock, so I've been voting for this as the final fix for a while now.
>>
>> The locking cleanup looks good, though, for devices that do use the
>> embedded lock.
> 
> Exactly ... it's the missing piece; without it my patch is really only
> addressing SCSI.
> 
>>  But functionally they should be the same for SCSI, so if
>> you had great success with it, then that's a good data point.
> 
> Right, I've run it through the memory debugger and USB ejection testing
> (with ext2, which seems to be the fs that triggers this problem the
> most).

Alright, lets call this fully tested and fixed then. Linus, I committed
it, please pull:

  git://git.kernel.dk/linux-block.git for-linus

Hannes Reinecke (1):
      block: Free queue resources at blk_release_queue()

 block/blk-core.c  |   13 ++++++-------
 block/blk-sysfs.c |    5 +++++
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b2ed78a..d34433a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -348,9 +348,10 @@ void blk_put_queue(struct request_queue *q)
 EXPORT_SYMBOL(blk_put_queue);
 
 /*
- * Note: If a driver supplied the queue lock, it should not zap that lock
- * unexpectedly as some queue cleanup components like elevator_exit() and
- * blk_throtl_exit() need queue lock.
+ * Note: If a driver supplied the queue lock, it is disconnected
+ * by this function. The actual state of the lock doesn't matter
+ * here as the request_queue isn't accessible after this point
+ * (QUEUE_FLAG_DEAD is set) and no other requests will be queued.
  */
 void blk_cleanup_queue(struct request_queue *q)
 {
@@ -367,10 +368,8 @@ void blk_cleanup_queue(struct request_queue *q)
 	queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
 	mutex_unlock(&q->sysfs_lock);
 
-	if (q->elevator)
-		elevator_exit(q->elevator);
-
-	blk_throtl_exit(q);
+	if (q->queue_lock != &q->__queue_lock)
+		q->queue_lock = &q->__queue_lock;
 
 	blk_put_queue(q);
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index e681805..60fda88 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -479,6 +479,11 @@ static void blk_release_queue(struct kobject *kobj)
 
 	blk_sync_queue(q);
 
+	if (q->elevator)
+		elevator_exit(q->elevator);
+
+	blk_throtl_exit(q);
+
 	if (rl->rq_pool)
 		mempool_destroy(rl->rq_pool);
 


-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-09-28 14:14               ` [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue()) Jens Axboe
@ 2011-09-28 15:22                 ` Linus Torvalds
  2011-09-28 15:43                   ` James Bottomley
  2011-10-13 13:09                 ` Steffen Maier
  1 sibling, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2011-09-28 15:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: James Bottomley, Hannes Reinecke, James Bottomley, linux-scsi,
	Linux Kernel

On Wed, Sep 28, 2011 at 7:14 AM, Jens Axboe <axboe@kernel.dk> wrote:
>
>  /*
> - * Note: If a driver supplied the queue lock, it should not zap that lock
> - * unexpectedly as some queue cleanup components like elevator_exit() and
> - * blk_throtl_exit() need queue lock.
> + * Note: If a driver supplied the queue lock, it is disconnected
> + * by this function. The actual state of the lock doesn't matter
> + * here as the request_queue isn't accessible after this point
> + * (QUEUE_FLAG_DEAD is set) and no other requests will be queued.
>  */

So quite frankly, I just don't believe in that comment.

If no more requests will be queued or completed, then the queue lock
is irrelevant and should not be changed.

More importantly, if no more requests are queued or completed after
blk_cleanup_queue(), then we wouldn't have had the bug that we clearly
had with the elevator accesses, now would we? So the comment seems to
be obviously bogus and wrong.

I pulled this, but I think the "just move the teardown" would have
been the safer option. What  happens if a request completes on another
CPU just as we are changing locks, and we lock one lock and then
unlock another?!

                              Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-09-28 15:22                 ` Linus Torvalds
@ 2011-09-28 15:43                   ` James Bottomley
  2011-09-28 17:48                     ` Vivek Goyal
  0 siblings, 1 reply; 60+ messages in thread
From: James Bottomley @ 2011-09-28 15:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Hannes Reinecke, James Bottomley, linux-scsi, Linux Kernel

On Wed, 2011-09-28 at 08:22 -0700, Linus Torvalds wrote:
> On Wed, Sep 28, 2011 at 7:14 AM, Jens Axboe <axboe@kernel.dk> wrote:
> >
> >  /*
> > - * Note: If a driver supplied the queue lock, it should not zap that lock
> > - * unexpectedly as some queue cleanup components like elevator_exit() and
> > - * blk_throtl_exit() need queue lock.
> > + * Note: If a driver supplied the queue lock, it is disconnected
> > + * by this function. The actual state of the lock doesn't matter
> > + * here as the request_queue isn't accessible after this point
> > + * (QUEUE_FLAG_DEAD is set) and no other requests will be queued.
> >  */
> 
> So quite frankly, I just don't believe in that comment.
> 
> If no more requests will be queued or completed, then the queue lock
> is irrelevant and should not be changed.

That was my original argument for my patch.  I lost it because you can
still hold a queue reference in the sysfs code for block, which means
that the put in blk_cleanup_queue() won't be the final one and you'll
get a use after free of the lock when the sysfs directory is exited
because we take the lock again as we destroy the elevator.

> More importantly, if no more requests are queued or completed after
> blk_cleanup_queue(), then we wouldn't have had the bug that we clearly
> had with the elevator accesses, now would we? So the comment seems to
> be obviously bogus and wrong.

So this I agree with.  blk_cleanup_queue() prevents any new access to
the queue, but we still have the old reference holders to contend with.
They can submit requests, although we try to error them again with the
queue guards check.

> I pulled this, but I think the "just move the teardown" would have
> been the safer option. What  happens if a request completes on another
> CPU just as we are changing locks, and we lock one lock and then
> unlock another?!

The only code for which this could be true is code where we use the
block supplied lock, so effectively it never changes.  The drivers which
supply their own lock are supposed to have already ensured that the
queue is unused.  I don't really believe this given the sysfs example
above, but this fix is no worse than the use after free that would have
resulted with the previous code.

James



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-09-28 15:43                   ` James Bottomley
@ 2011-09-28 17:48                     ` Vivek Goyal
  2011-09-28 17:53                       ` Christoph Hellwig
  2011-09-28 17:59                       ` James Bottomley
  0 siblings, 2 replies; 60+ messages in thread
From: Vivek Goyal @ 2011-09-28 17:48 UTC (permalink / raw)
  To: James Bottomley
  Cc: Linus Torvalds, Jens Axboe, Hannes Reinecke, James Bottomley,
	linux-scsi, Linux Kernel

On Wed, Sep 28, 2011 at 10:43:36AM -0500, James Bottomley wrote:
> On Wed, 2011-09-28 at 08:22 -0700, Linus Torvalds wrote:
> > On Wed, Sep 28, 2011 at 7:14 AM, Jens Axboe <axboe@kernel.dk> wrote:
> > >
> > >  /*
> > > - * Note: If a driver supplied the queue lock, it should not zap that lock
> > > - * unexpectedly as some queue cleanup components like elevator_exit() and
> > > - * blk_throtl_exit() need queue lock.
> > > + * Note: If a driver supplied the queue lock, it is disconnected
> > > + * by this function. The actual state of the lock doesn't matter
> > > + * here as the request_queue isn't accessible after this point
> > > + * (QUEUE_FLAG_DEAD is set) and no other requests will be queued.
> > >  */
> > 
> > So quite frankly, I just don't believe in that comment.
> > 
> > If no more requests will be queued or completed, then the queue lock
> > is irrelevant and should not be changed.
> 
> That was my original argument for my patch.  I lost it because you can
> still hold a queue reference in the sysfs code for block, which means
> that the put in blk_cleanup_queue() won't be the final one and you'll
> get a use after free of the lock when the sysfs directory is exited
> because we take the lock again as we destroy the elevator.
> 
> > More importantly, if no more requests are queued or completed after
> > blk_cleanup_queue(), then we wouldn't have had the bug that we clearly
> > had with the elevator accesses, now would we? So the comment seems to
> > be obviously bogus and wrong.
> 
> So this I agree with.  blk_cleanup_queue() prevents any new access to
> the queue, but we still have the old reference holders to contend with.
> They can submit requests, although we try to error them again with the
> queue guards check.
> 
> > I pulled this, but I think the "just move the teardown" would have
> > been the safer option. What  happens if a request completes on another
> > CPU just as we are changing locks, and we lock one lock and then
> > unlock another?!
> 
> The only code for which this could be true is code where we use the
> block supplied lock, so effectively it never changes.

> The drivers which supply their own lock are supposed to have already
> ensured that the queue is unused.

Hi James,

For my education purposes, how will driver come to know that queue is
unused? Does it happen by checking if any requsts are queued or not? If
yes, we might run into issues with throttling logic.

For example, if some bio have been throttled and are queued in some data
structures on queue.  In that case driver does not even know that some bios
are queued and will be submitted later. Now if driver calls blk_cleanup_queue()
it might happen that throttling related worker is already queue lock and
trying to do some housekeeping or trying to dispatch bio etc. Now if queue
lock is swapped, it will just cause all the kind of issues.

I am wondering if we should retain blk_throtl_exit() in blk_cleanup_queue()
before lock swap and just move elevator cleanup in blk_release_queue().

A note to myself, I should probably enhance blk_throtl_exit() to look for any
queued throttled bio and single their completion with error (-ENODEV) or
something like that.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-09-28 17:48                     ` Vivek Goyal
@ 2011-09-28 17:53                       ` Christoph Hellwig
  2011-09-28 18:09                         ` Vivek Goyal
  2011-09-28 17:59                       ` James Bottomley
  1 sibling, 1 reply; 60+ messages in thread
From: Christoph Hellwig @ 2011-09-28 17:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: James Bottomley, Linus Torvalds, Jens Axboe, Hannes Reinecke,
	James Bottomley, linux-scsi, Linux Kernel

On Wed, Sep 28, 2011 at 01:48:59PM -0400, Vivek Goyal wrote:
> I am wondering if we should retain blk_throtl_exit() in blk_cleanup_queue()
> before lock swap and just move elevator cleanup in blk_release_queue().
> 
> A note to myself, I should probably enhance blk_throtl_exit() to look for any
> queued throttled bio and single their completion with error (-ENODEV) or
> something like that.

The root of this evil is how queue_lock is implemented and (ab)used.
Instead of letting the driver assign a pointer to make the core use
its locks we really need to make the queue_lock a lock embedded directly
into the queue, and drivers may or may not use that lock for their
internal data structures.  For high performance drivers they preferable
should use their own locks as queue_lock is far too contended already
for any high IOPS device.   The same applies to throtteling btw -
instead of overloading an already highly contended lock it really
should have its own.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-09-28 17:48                     ` Vivek Goyal
  2011-09-28 17:53                       ` Christoph Hellwig
@ 2011-09-28 17:59                       ` James Bottomley
  1 sibling, 0 replies; 60+ messages in thread
From: James Bottomley @ 2011-09-28 17:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Linus Torvalds, Jens Axboe, Hannes Reinecke, James Bottomley,
	linux-scsi, Linux Kernel

On Wed, 2011-09-28 at 13:48 -0400, Vivek Goyal wrote:
> On Wed, Sep 28, 2011 at 10:43:36AM -0500, James Bottomley wrote:
> > On Wed, 2011-09-28 at 08:22 -0700, Linus Torvalds wrote:
> > > On Wed, Sep 28, 2011 at 7:14 AM, Jens Axboe <axboe@kernel.dk> wrote:
> > > >
> > > >  /*
> > > > - * Note: If a driver supplied the queue lock, it should not zap that lock
> > > > - * unexpectedly as some queue cleanup components like elevator_exit() and
> > > > - * blk_throtl_exit() need queue lock.
> > > > + * Note: If a driver supplied the queue lock, it is disconnected
> > > > + * by this function. The actual state of the lock doesn't matter
> > > > + * here as the request_queue isn't accessible after this point
> > > > + * (QUEUE_FLAG_DEAD is set) and no other requests will be queued.
> > > >  */
> > > 
> > > So quite frankly, I just don't believe in that comment.
> > > 
> > > If no more requests will be queued or completed, then the queue lock
> > > is irrelevant and should not be changed.
> > 
> > That was my original argument for my patch.  I lost it because you can
> > still hold a queue reference in the sysfs code for block, which means
> > that the put in blk_cleanup_queue() won't be the final one and you'll
> > get a use after free of the lock when the sysfs directory is exited
> > because we take the lock again as we destroy the elevator.
> > 
> > > More importantly, if no more requests are queued or completed after
> > > blk_cleanup_queue(), then we wouldn't have had the bug that we clearly
> > > had with the elevator accesses, now would we? So the comment seems to
> > > be obviously bogus and wrong.
> > 
> > So this I agree with.  blk_cleanup_queue() prevents any new access to
> > the queue, but we still have the old reference holders to contend with.
> > They can submit requests, although we try to error them again with the
> > queue guards check.
> > 
> > > I pulled this, but I think the "just move the teardown" would have
> > > been the safer option. What  happens if a request completes on another
> > > CPU just as we are changing locks, and we lock one lock and then
> > > unlock another?!
> > 
> > The only code for which this could be true is code where we use the
> > block supplied lock, so effectively it never changes.
> 
> > The drivers which supply their own lock are supposed to have already
> > ensured that the queue is unused.
> 
> Hi James,
> 
> For my education purposes, how will driver come to know that queue is
> unused? Does it happen by checking if any requsts are queued or not? If
> yes, we might run into issues with throttling logic.

I can't explain this ... it's the bit I think is bogus.  If we need
refcounted queues, it's impossible to satisfy and if we don't, why do we
have refcounts?

The root cause of this is allowing drivers to specify locks.  If we
really want to continue doing this, we should have a lock release
function which is called as part of the queue refcounting model where
the final lock free can be done.  That's a big amount of work if you
look at all the block drivers which use their own locks.

James


> For example, if some bio have been throttled and are queued in some data
> structures on queue.  In that case driver does not even know that some bios
> are queued and will be submitted later. Now if driver calls blk_cleanup_queue()
> it might happen that throttling related worker is already queue lock and
> trying to do some housekeeping or trying to dispatch bio etc. Now if queue
> lock is swapped, it will just cause all the kind of issues.
> 
> I am wondering if we should retain blk_throtl_exit() in blk_cleanup_queue()
> before lock swap and just move elevator cleanup in blk_release_queue().
> 
> A note to myself, I should probably enhance blk_throtl_exit() to look for any
> queued throttled bio and single their completion with error (-ENODEV) or
> something like that.
> 
> Thanks
> Vivek
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-09-28 17:53                       ` Christoph Hellwig
@ 2011-09-28 18:09                         ` Vivek Goyal
  2011-09-28 18:16                           ` Christoph Hellwig
  0 siblings, 1 reply; 60+ messages in thread
From: Vivek Goyal @ 2011-09-28 18:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, Linus Torvalds, Jens Axboe, Hannes Reinecke,
	James Bottomley, linux-scsi, Linux Kernel

On Wed, Sep 28, 2011 at 01:53:04PM -0400, Christoph Hellwig wrote:
> On Wed, Sep 28, 2011 at 01:48:59PM -0400, Vivek Goyal wrote:
> > I am wondering if we should retain blk_throtl_exit() in blk_cleanup_queue()
> > before lock swap and just move elevator cleanup in blk_release_queue().
> > 
> > A note to myself, I should probably enhance blk_throtl_exit() to look for any
> > queued throttled bio and single their completion with error (-ENODEV) or
> > something like that.
> 
> The root of this evil is how queue_lock is implemented and (ab)used.
> Instead of letting the driver assign a pointer to make the core use
> its locks we really need to make the queue_lock a lock embedded directly
> into the queue, and drivers may or may not use that lock for their
> internal data structures.  For high performance drivers they preferable
> should use their own locks as queue_lock is far too contended already
> for any high IOPS device.   The same applies to throtteling btw -
> instead of overloading an already highly contended lock it really
> should have its own.

I had thought of implementing a separate lock for throttling. Then I
noticed few operations like checking for queue flags where I would
be required to hold queue locks.

So I could do lock nesting.

hold throttling_lock;
   hold queue_lock;

IIRC, I also had noticed some operations where queue might want to call
into throttling with queue lock held and that would have led to lock
order problems.

So I had given up on the idea and continued to use queue lock for
throttling. Thought it could still probably be done if one could
justify additional complexity.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-09-28 18:09                         ` Vivek Goyal
@ 2011-09-28 18:16                           ` Christoph Hellwig
  2011-09-28 19:05                             ` Eric Seppanen
  2011-09-28 22:34                             ` Vivek Goyal
  0 siblings, 2 replies; 60+ messages in thread
From: Christoph Hellwig @ 2011-09-28 18:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: James Bottomley, Linus Torvalds, Jens Axboe, Hannes Reinecke,
	James Bottomley, linux-scsi, Linux Kernel

On Wed, Sep 28, 2011 at 02:09:05PM -0400, Vivek Goyal wrote:
> I had thought of implementing a separate lock for throttling. Then I
> noticed few operations like checking for queue flags where I would
> be required to hold queue locks.

Can you do a writeup of these issues?  E.g. if the flags are throtteling
specific we can move them to a separate flags field, if not we can
see if we can make them atomic or similar.

Right now on high iops device queue_lock is the major killer for
performance.  It's one major reason (*) why a lot of the high iops devices
are all moving to ->make_request, which has other issues.

(*) others are struct request allocation and the pointless merge hash

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-09-28 18:16                           ` Christoph Hellwig
@ 2011-09-28 19:05                             ` Eric Seppanen
  2011-09-28 19:14                               ` Christoph Hellwig
  2011-11-30 10:18                               ` Jens Axboe
  2011-09-28 22:34                             ` Vivek Goyal
  1 sibling, 2 replies; 60+ messages in thread
From: Eric Seppanen @ 2011-09-28 19:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Vivek Goyal, James Bottomley, Linus Torvalds, Jens Axboe,
	Hannes Reinecke, James Bottomley, linux-scsi, Linux Kernel

On Wed, Sep 28, 2011 at 11:16 AM, Christoph Hellwig <hch@infradead.org> wrote:
> Right now on high iops device queue_lock is the major killer for
> performance.  It's one major reason (*) why a lot of the high iops devices
> are all moving to ->make_request, which has other issues.
>
> (*) others are struct request allocation and the pointless merge hash

I agree: queue lock is the worst performance killer when hw can do
>100K IOPS per block device.

Rather than just being chased away from the request queue due to
performance issues, I could argue there's very little point to having
a queue for devices that
(a) have no seek penalty (and always use noop elevator)
(b) have hardware queues at least as deep as the default request queue
(c) don't benefit from merging

(c) is maybe debatable, but if a device can saturate its bus bandwidth
on 4KB IO, the latency is probably not worth it.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-09-28 19:05                             ` Eric Seppanen
@ 2011-09-28 19:14                               ` Christoph Hellwig
  2011-11-30 10:18                               ` Jens Axboe
  1 sibling, 0 replies; 60+ messages in thread
From: Christoph Hellwig @ 2011-09-28 19:14 UTC (permalink / raw)
  To: Eric Seppanen
  Cc: Vivek Goyal, James Bottomley, Linus Torvalds, Jens Axboe,
	Hannes Reinecke, James Bottomley, linux-scsi, Linux Kernel

On Wed, Sep 28, 2011 at 12:05:10PM -0700, Eric Seppanen wrote:
> I agree: queue lock is the worst performance killer when hw can do
> >100K IOPS per block device.
> 
> Rather than just being chased away from the request queue due to
> performance issues, I could argue there's very little point to having
> a queue for devices that
> (a) have no seek penalty (and always use noop elevator)
> (b) have hardware queues at least as deep as the default request queue
> (c) don't benefit from merging
> 
> (c) is maybe debatable, but if a device can saturate its bus bandwidth
> on 4KB IO, the latency is probably not worth it.

In theory, yes.  But at some point we will be able to saturate them,
and then people want proportional I/O, light amounts of queueing, etc.

And I really don't want to reinvent that for every little device.  The
other problem is that we a single driver might driver totally different
types of devices, already today we have iSCSI or FC accessibly high
IOPS devices, there are good PCI-e flash devices masquerading as AHCI,
and my current problem is that the queue_lock really hurts me in
virtio-blk when using a PCIe flash device underneath.

So we really need some infrastructure that allows a generic interface
to the driver, and allow us to plug in merging, scheduling, queueing
on an as needed basis.

That is my long term plan - making request_lock suck a little less,
and improving the driver interface is a good first step, though.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-09-28 18:16                           ` Christoph Hellwig
  2011-09-28 19:05                             ` Eric Seppanen
@ 2011-09-28 22:34                             ` Vivek Goyal
  1 sibling, 0 replies; 60+ messages in thread
From: Vivek Goyal @ 2011-09-28 22:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, Linus Torvalds, Jens Axboe, Hannes Reinecke,
	James Bottomley, linux-scsi, Linux Kernel

On Wed, Sep 28, 2011 at 02:16:49PM -0400, Christoph Hellwig wrote:
> On Wed, Sep 28, 2011 at 02:09:05PM -0400, Vivek Goyal wrote:
> > I had thought of implementing a separate lock for throttling. Then I
> > noticed few operations like checking for queue flags where I would
> > be required to hold queue locks.
> 
> Can you do a writeup of these issues?  E.g. if the flags are throtteling
> specific we can move them to a separate flags field, if not we can
> see if we can make them atomic or similar.

Sure. I am going through the code and trying to list down some of the
dependencies.

But before that I would say that tyring to optimize the throttling code
to reduce the queue lock contention should be last on the priority list.
The simple reason being that in common case it does not incur any locking
cost. One needs to have IO cgroup configured and needs to have some
throttling rules put in place only then queue lock cost will come into
picture. And for most of the people, I don't think that's the common
case.

Now coming back to question of dependencies. We take some services from
request queue which needs to happen under queue lock.

- Looks like in current code I am only accessing QUEUE_FLAG_DEAD. I see
  that flag is not synchronized using queue lock. So it probably is not
  a concern.

- I am using tracing functionality. blk_add_trace_msg(). I guess it is
  not required to take queue lock for this. But I am not sure.

- I am accessing q->backing_dev_info to get to associated devices's
  major and minor number.

That's it I guess. That's what a quick look tells me. So looks like
it is not too hard to convert existing blk throttle code to use its
own lock. That will also get rid of dependency on queue lock and
in turn uncertainity associated with queue lock being valid (if
driver provided the lock).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-09-28 14:14               ` [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue()) Jens Axboe
  2011-09-28 15:22                 ` Linus Torvalds
@ 2011-10-13 13:09                 ` Steffen Maier
  2011-10-14 16:03                   ` James Bottomley
  1 sibling, 1 reply; 60+ messages in thread
From: Steffen Maier @ 2011-10-13 13:09 UTC (permalink / raw)
  To: linux-scsi
  Cc: Jens Axboe, James Bottomley, Hannes Reinecke, Linux Kernel,
	Alan Stern, Jun'ichi Nomura, Thadeu Lima de Souza Cascardo,
	Taraka R. Bodireddy, Seshagiri N. Ippili,
	Manvanthara B. Puttashankar, Heiko Carstens

This fix also went into 3.0.5 via
http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob;f=releases/3.0.5/block-free-queue-resources-at-blk_release_queue.patch
(originated at http://marc.info/?l=linux-scsi&m=131669751909474&w=2 and 
http://marc.info/?l=linux-scsi&m=131669414205696&w=2)

However, it seems we still have a use-after-free bug, now causing the 
following oops with kernel 3.0.6 when removing paths to storage while 
generating load on device-mapper multipath disks:

> Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> Oops: 0038 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> Modules linked in: iptable_filter ip_tables x_tables dm_round_robin sunrpc qeth_l3 binfmt_misc dm_multipath scsi_dh dm_mod ipv6 qeth ccwgroup [last unloaded: scsi_wait_scan]
> CPU: 1 Not tainted 3.0.6-50.x.20111006-s390xdefault #1
> Process blast.LzS_64_SL (pid: 26613, task: 0000000063e2a398, ksp: 0000000064de3560)
> Krnl PSW : 0704100180000000 000000000048a038 (scsi_print_command+0x44/0xf8)
>            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
> Krnl GPRS: 000000000000006b 6b6b6b6b6b6b6b6b 000000006717f800 000000000094f2e0
>            000000000061242e 000000000062bd88 0000000066fb90d8 0000000065391ad7
>            000000006717f800 000000006717f800 000000006716a090 000000006717f800
>            0000000000000004 0000000000672f88 0000000064de3838 0000000064de3808
> Krnl Code: 000000000048a026: f0b80004ebbf        srp     4(12,%r0),3007(%r14),8
>            000000000048a02c: f0a0000407f4        srp     4(11,%r0),2036,0
>            000000000048a032: e31020800004        lg      %r1,128(%r2)
>           >000000000048a038: e31010b00004        lg      %r1,176(%r1)
>            000000000048a03e: b9020011            ltgr    %r1,%r1
>            000000000048a042: a7840032            brc     8,48a0a6
>            000000000048a046: e33020000004        lg      %r3,0(%r2)
>            000000000048a04c: c04000182f4c        larl    %r4,78fee4
> Call Trace:
> ([<000000006717f800>] 0x6717f800)
>  [<0000000000487f28>] scsi_log_send+0xf0/0x130
>  [<000000000048824c>] scsi_dispatch_cmd+0xc8/0x4bc
>  [<0000000000490694>] scsi_request_fn+0x3b8/0x480
>  [<00000000003fe4c8>] __elv_add_request+0x174/0x35c
>  [<0000000000402fd6>] blk_insert_cloned_request+0x72/0xcc
>  [<000003e001e134f2>] dm_dispatch_request+0x5a/0x94 [dm_mod]
>  [<000003e001e15f36>] dm_request_fn+0x216/0x398 [dm_mod]
>  [<0000000000405590>] queue_unplugged+0x60/0x1a8
>  [<0000000000407470>] blk_flush_plug_list+0x218/0x29c
>  [<00000000006136ec>] io_schedule+0x94/0x100
>  [<000000000020e6d6>] sleep_on_page_killable+0x22/0x60
>  [<00000000006140a8>] __wait_on_bit_lock+0xc0/0x124
>  [<000000000020e5f8>] __lock_page_killable+0x78/0x84
>  [<0000000000210690>] generic_file_aio_read+0x56c/0x800
>  [<000000000027a0a0>] do_sync_read+0xcc/0x11c
>  [<000000000027a974>] vfs_read+0xb0/0x194
>  [<000000000027aab0>] SyS_read+0x58/0xb4
>  [<00000000006185f6>] sysc_noemu+0x16/0x1c
>  [<000003fffd473bd4>] 0x3fffd473bd4
> INFO: lockdep is turned off.
> Last Breaking-Event-Address:
>  [<000000000048a020>] scsi_print_command+0x2c/0xf8

During a different run, we also got a similar oops which was not logged 
but only its panic_on_oops:

> Kernel panic - not syncing: Fatal exception: panic_on_oops
> CPU: 1 Tainted: G      D     3.0.6-50.x.20111006-s390xdefault #1
> Process blast.LzS_64_SL (pid: 6551, task: 00000000a2942398, ksp: 00000000a11834e0)
> 00000000a1183630 00000000a11835b0 0000000000000002 0000000000000000
>        00000000a1183650 00000000a11835c8 00000000a11835c8 000000000061242e
>        0000000000000000 0000000000000000 00000000a3176ed7 0000000000100ed2
>        000000000000000d 000000000000000c 00000000a1183618 0000000000000000
>        0000000000000000 0000000000100a18 00000000a11835b0 00000000a11835f0
> Call Trace:
> ([<0000000000100926>] show_trace+0xee/0x144)
>  [<0000000000612258>] panic+0xb0/0x234
>  [<0000000000100ed2>] die+0x166/0x168
>  [<000000000011eede>] do_no_context+0xba/0xf8
>  [<000000000061ad28>] do_asce_exception+0x130/0x18c
>  [<0000000000618918>] pgm_exit+0x0/0x4
>  [<000000000048a0d4>] scsi_print_command+0xe0/0xf8
> ([<000000000048a0e6>] scsi_print_command+0xf2/0xf8)
>  [<0000000000487f28>] scsi_log_send+0xf0/0x130
>  [<000000000048824c>] scsi_dispatch_cmd+0xc8/0x4bc
>  [<0000000000490694>] scsi_request_fn+0x3b8/0x480
>  [<00000000003fe4c8>] __elv_add_request+0x174/0x35c
>  [<0000000000402fd6>] blk_insert_cloned_request+0x72/0xcc
>  [<000003e001e134f2>] dm_dispatch_request+0x5a/0x94 [dm_mod]
>  [<000003e001e15f36>] dm_request_fn+0x216/0x398 [dm_mod]
>  [<0000000000405590>] queue_unplugged+0x60/0x1a8
>  [<0000000000407470>] blk_flush_plug_list+0x218/0x29c
>  [<000000000040751e>] blk_finish_plug+0x2a/0x58
>  [<000000000021034e>] generic_file_aio_read+0x22a/0x800
>  [<000000000027a0a0>] do_sync_read+0xcc/0x11c
>  [<000000000027a974>] vfs_read+0xb0/0x194
>  [<000000000027aab0>] SyS_read+0x58/0xb4
>  [<00000000006185f6>] sysc_noemu+0x16/0x1c
>  [<000003fffd395bd4>] 0x3fffd395bd4
> INFO: lockdep is turned off.

The other CPU was doing scsi_print_command successfully to the console 
(sclp_console_write) when this panic occurred.

Earlier, we tried Alan's patch
http://marc.info/?l=linux-scsi&m=130963676807728&w=2
on 3.0.4, but in contrast to Cascardo's experience
http://marc.info/?l=linux-scsi&m=131482024119345&w=2
it did not work for us.

Initially, we encountered use-after-free bugs in
scsi_print_command / scsi_dispatch_cmd
http://marc.info/?l=linux-scsi&m=130824013229933&w=2
__elv_add_request
http://marc.info/?l=linux-scsi&m=130858383702949&w=2

Also related is the following thread:
http://marc.info/?l=linux-scsi&m=130977887419062&w=2

Unfortunately, I don't understand what exactly is going wrong here.
Can you shed some light on this?

On 09/28/2011 04:14 PM, Jens Axboe wrote:
> Alright, lets call this fully tested and fixed then. Linus, I committed
> it, please pull:
>
>    git://git.kernel.dk/linux-block.git for-linus
>
> Hannes Reinecke (1):
>        block: Free queue resources at blk_release_queue()
>
>   block/blk-core.c  |   13 ++++++-------
>   block/blk-sysfs.c |    5 +++++
>   2 files changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b2ed78a..d34433a 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -348,9 +348,10 @@ void blk_put_queue(struct request_queue *q)
>   EXPORT_SYMBOL(blk_put_queue);
>
>   /*
> - * Note: If a driver supplied the queue lock, it should not zap that lock
> - * unexpectedly as some queue cleanup components like elevator_exit() and
> - * blk_throtl_exit() need queue lock.
> + * Note: If a driver supplied the queue lock, it is disconnected
> + * by this function. The actual state of the lock doesn't matter
> + * here as the request_queue isn't accessible after this point
> + * (QUEUE_FLAG_DEAD is set) and no other requests will be queued.
>    */
>   void blk_cleanup_queue(struct request_queue *q)
>   {
> @@ -367,10 +368,8 @@ void blk_cleanup_queue(struct request_queue *q)
>   	queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
>   	mutex_unlock(&q->sysfs_lock);
>
> -	if (q->elevator)
> -		elevator_exit(q->elevator);
> -
> -	blk_throtl_exit(q);
> +	if (q->queue_lock !=&q->__queue_lock)
> +		q->queue_lock =&q->__queue_lock;
>
>   	blk_put_queue(q);
>   }
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index e681805..60fda88 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -479,6 +479,11 @@ static void blk_release_queue(struct kobject *kobj)
>
>   	blk_sync_queue(q);
>
> +	if (q->elevator)
> +		elevator_exit(q->elevator);
> +
> +	blk_throtl_exit(q);
> +
>   	if (rl->rq_pool)
>   		mempool_destroy(rl->rq_pool);
>
>
>

-- 
Steffen Maier

Linux on System z Development

IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-10-13 13:09                 ` Steffen Maier
@ 2011-10-14 16:03                   ` James Bottomley
  2011-10-17  8:46                     ` Jun'ichi Nomura
  0 siblings, 1 reply; 60+ messages in thread
From: James Bottomley @ 2011-10-14 16:03 UTC (permalink / raw)
  To: Steffen Maier
  Cc: linux-scsi, Jens Axboe, Hannes Reinecke, Linux Kernel,
	Alan Stern, Jun'ichi Nomura, Thadeu Lima de Souza Cascardo,
	Taraka R. Bodireddy, Seshagiri N. Ippili,
	Manvanthara B. Puttashankar, Heiko Carstens

On Thu, 2011-10-13 at 15:09 +0200, Steffen Maier wrote:
> This fix also went into 3.0.5 via
> http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob;f=releases/3.0.5/block-free-queue-resources-at-blk_release_queue.patch
> (originated at http://marc.info/?l=linux-scsi&m=131669751909474&w=2 and 
> http://marc.info/?l=linux-scsi&m=131669414205696&w=2)
> 
> However, it seems we still have a use-after-free bug, now causing the 
> following oops with kernel 3.0.6 when removing paths to storage while 
> generating load on device-mapper multipath disks:
> 
> > Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> > Oops: 0038 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > Modules linked in: iptable_filter ip_tables x_tables dm_round_robin sunrpc qeth_l3 binfmt_misc dm_multipath scsi_dh dm_mod ipv6 qeth ccwgroup [last unloaded: scsi_wait_scan]
> > CPU: 1 Not tainted 3.0.6-50.x.20111006-s390xdefault #1
> > Process blast.LzS_64_SL (pid: 26613, task: 0000000063e2a398, ksp: 0000000064de3560)
> > Krnl PSW : 0704100180000000 000000000048a038 (scsi_print_command+0x44/0xf8)
> >            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
> > Krnl GPRS: 000000000000006b 6b6b6b6b6b6b6b6b 000000006717f800 000000000094f2e0
> >            000000000061242e 000000000062bd88 0000000066fb90d8 0000000065391ad7
> >            000000006717f800 000000006717f800 000000006716a090 000000006717f800
> >            0000000000000004 0000000000672f88 0000000064de3838 0000000064de3808
> > Krnl Code: 000000000048a026: f0b80004ebbf        srp     4(12,%r0),3007(%r14),8
> >            000000000048a02c: f0a0000407f4        srp     4(11,%r0),2036,0
> >            000000000048a032: e31020800004        lg      %r1,128(%r2)
> >           >000000000048a038: e31010b00004        lg      %r1,176(%r1)
> >            000000000048a03e: b9020011            ltgr    %r1,%r1
> >            000000000048a042: a7840032            brc     8,48a0a6
> >            000000000048a046: e33020000004        lg      %r3,0(%r2)
> >            000000000048a04c: c04000182f4c        larl    %r4,78fee4
> > Call Trace:
> > ([<000000006717f800>] 0x6717f800)
> >  [<0000000000487f28>] scsi_log_send+0xf0/0x130
> >  [<000000000048824c>] scsi_dispatch_cmd+0xc8/0x4bc
> >  [<0000000000490694>] scsi_request_fn+0x3b8/0x480

Correct me if I'm wrong, but this seems to be saying that struct
scsi_cmnd was used after free.  This looks to be a different problem
because the command has a separate refcounting model which wasn't
impacted by the change ... it could be we've just exposed yet another
refcounting problem outside of the queue one.

If I had to guess, I'd say a bio got cloned with a SCSI command already
attached, but the ref count on the SCSI command wasn't correctly
adjusted.

James



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-10-14 16:03                   ` James Bottomley
@ 2011-10-17  8:46                     ` Jun'ichi Nomura
  2011-10-17 14:06                       ` James Bottomley
  0 siblings, 1 reply; 60+ messages in thread
From: Jun'ichi Nomura @ 2011-10-17  8:46 UTC (permalink / raw)
  To: James Bottomley
  Cc: Steffen Maier, linux-scsi, Jens Axboe, Hannes Reinecke,
	Linux Kernel, Alan Stern, Thadeu Lima de Souza Cascardo,
	Taraka R. Bodireddy, Seshagiri N. Ippili,
	Manvanthara B. Puttashankar, Heiko Carstens

On 10/15/11 01:03, James Bottomley wrote:
> On Thu, 2011-10-13 at 15:09 +0200, Steffen Maier wrote:
>> This fix also went into 3.0.5 via
>> http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob;f=releases/3.0.5/block-free-queue-resources-at-blk_release_queue.patch
>> (originated at http://marc.info/?l=linux-scsi&m=131669751909474&w=2 and 
>> http://marc.info/?l=linux-scsi&m=131669414205696&w=2)
>>
>> However, it seems we still have a use-after-free bug, now causing the 
>> following oops with kernel 3.0.6 when removing paths to storage while 
>> generating load on device-mapper multipath disks:
>>
>>> Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
>>> Oops: 0038 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>> Modules linked in: iptable_filter ip_tables x_tables dm_round_robin sunrpc qeth_l3 binfmt_misc dm_multipath scsi_dh dm_mod ipv6 qeth ccwgroup [last unloaded: scsi_wait_scan]
>>> CPU: 1 Not tainted 3.0.6-50.x.20111006-s390xdefault #1
>>> Process blast.LzS_64_SL (pid: 26613, task: 0000000063e2a398, ksp: 0000000064de3560)
>>> Krnl PSW : 0704100180000000 000000000048a038 (scsi_print_command+0x44/0xf8)
>>>            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
>>> Krnl GPRS: 000000000000006b 6b6b6b6b6b6b6b6b 000000006717f800 000000000094f2e0
>>>            000000000061242e 000000000062bd88 0000000066fb90d8 0000000065391ad7
>>>            000000006717f800 000000006717f800 000000006716a090 000000006717f800
>>>            0000000000000004 0000000000672f88 0000000064de3838 0000000064de3808
>>> Krnl Code: 000000000048a026: f0b80004ebbf        srp     4(12,%r0),3007(%r14),8
>>>            000000000048a02c: f0a0000407f4        srp     4(11,%r0),2036,0
>>>            000000000048a032: e31020800004        lg      %r1,128(%r2)
>>>           >000000000048a038: e31010b00004        lg      %r1,176(%r1)
>>>            000000000048a03e: b9020011            ltgr    %r1,%r1
>>>            000000000048a042: a7840032            brc     8,48a0a6
>>>            000000000048a046: e33020000004        lg      %r3,0(%r2)
>>>            000000000048a04c: c04000182f4c        larl    %r4,78fee4
>>> Call Trace:
>>> ([<000000006717f800>] 0x6717f800)
>>>  [<0000000000487f28>] scsi_log_send+0xf0/0x130
>>>  [<000000000048824c>] scsi_dispatch_cmd+0xc8/0x4bc
>>>  [<0000000000490694>] scsi_request_fn+0x3b8/0x480
> 
> Correct me if I'm wrong, but this seems to be saying that struct
> scsi_cmnd was used after free.  This looks to be a different problem
> because the command has a separate refcounting model which wasn't
> impacted by the change ... it could be we've just exposed yet another
> refcounting problem outside of the queue one.
> 
> If I had to guess, I'd say a bio got cloned with a SCSI command already
> attached, but the ref count on the SCSI command wasn't correctly
> adjusted.

As far as dm is concerned, it shouldn't happen.
Clone is made from a dm request, not from SCSI one.
Also clone is not reused when retrying.

>> Initially, we encountered use-after-free bugs in
>> scsi_print_command / scsi_dispatch_cmd
>> http://marc.info/?l=linux-scsi&m=130824013229933&w=2

It is interesting that both this and the older report
got oopsed in scsi_log_send(), while there are other
dereferences of 'cmd' around scsi_dispatch_cmd().
Are there any reason they are special? Just by accident?

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-10-17  8:46                     ` Jun'ichi Nomura
@ 2011-10-17 14:06                       ` James Bottomley
  2011-10-18 13:31                         ` Jun'ichi Nomura
  0 siblings, 1 reply; 60+ messages in thread
From: James Bottomley @ 2011-10-17 14:06 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Steffen Maier, linux-scsi, Jens Axboe, Hannes Reinecke,
	Linux Kernel, Alan Stern, Thadeu Lima de Souza Cascardo,
	Taraka R. Bodireddy, Seshagiri N. Ippili,
	Manvanthara B. Puttashankar, Heiko Carstens

On Mon, 2011-10-17 at 17:46 +0900, Jun'ichi Nomura wrote:
> On 10/15/11 01:03, James Bottomley wrote:
> > On Thu, 2011-10-13 at 15:09 +0200, Steffen Maier wrote:
> >> This fix also went into 3.0.5 via
> >> http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob;f=releases/3.0.5/block-free-queue-resources-at-blk_release_queue.patch
> >> (originated at http://marc.info/?l=linux-scsi&m=131669751909474&w=2 and 
> >> http://marc.info/?l=linux-scsi&m=131669414205696&w=2)
> >>
> >> However, it seems we still have a use-after-free bug, now causing the 
> >> following oops with kernel 3.0.6 when removing paths to storage while 
> >> generating load on device-mapper multipath disks:
> >>
> >>> Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> >>> Oops: 0038 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> >>> Modules linked in: iptable_filter ip_tables x_tables dm_round_robin sunrpc qeth_l3 binfmt_misc dm_multipath scsi_dh dm_mod ipv6 qeth ccwgroup [last unloaded: scsi_wait_scan]
> >>> CPU: 1 Not tainted 3.0.6-50.x.20111006-s390xdefault #1
> >>> Process blast.LzS_64_SL (pid: 26613, task: 0000000063e2a398, ksp: 0000000064de3560)
> >>> Krnl PSW : 0704100180000000 000000000048a038 (scsi_print_command+0x44/0xf8)
> >>>            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
> >>> Krnl GPRS: 000000000000006b 6b6b6b6b6b6b6b6b 000000006717f800 000000000094f2e0
> >>>            000000000061242e 000000000062bd88 0000000066fb90d8 0000000065391ad7
> >>>            000000006717f800 000000006717f800 000000006716a090 000000006717f800
> >>>            0000000000000004 0000000000672f88 0000000064de3838 0000000064de3808
> >>> Krnl Code: 000000000048a026: f0b80004ebbf        srp     4(12,%r0),3007(%r14),8
> >>>            000000000048a02c: f0a0000407f4        srp     4(11,%r0),2036,0
> >>>            000000000048a032: e31020800004        lg      %r1,128(%r2)
> >>>           >000000000048a038: e31010b00004        lg      %r1,176(%r1)
> >>>            000000000048a03e: b9020011            ltgr    %r1,%r1
> >>>            000000000048a042: a7840032            brc     8,48a0a6
> >>>            000000000048a046: e33020000004        lg      %r3,0(%r2)
> >>>            000000000048a04c: c04000182f4c        larl    %r4,78fee4
> >>> Call Trace:
> >>> ([<000000006717f800>] 0x6717f800)
> >>>  [<0000000000487f28>] scsi_log_send+0xf0/0x130
> >>>  [<000000000048824c>] scsi_dispatch_cmd+0xc8/0x4bc
> >>>  [<0000000000490694>] scsi_request_fn+0x3b8/0x480
> > 
> > Correct me if I'm wrong, but this seems to be saying that struct
> > scsi_cmnd was used after free.  This looks to be a different problem
> > because the command has a separate refcounting model which wasn't
> > impacted by the change ... it could be we've just exposed yet another
> > refcounting problem outside of the queue one.
> > 
> > If I had to guess, I'd say a bio got cloned with a SCSI command already
> > attached, but the ref count on the SCSI command wasn't correctly
> > adjusted.
> 
> As far as dm is concerned, it shouldn't happen.
> Clone is made from a dm request, not from SCSI one.
> Also clone is not reused when retrying.

It was just a guess.  Assuming the command got freed prematurely, there
has to be something in the dm path to explain why the SCSI refcounting
model got screwed up.  Cloning a bio with an attached command was what
first occurred to me, but perhaps there are other ways I'm not seeing.

> >> Initially, we encountered use-after-free bugs in
> >> scsi_print_command / scsi_dispatch_cmd
> >> http://marc.info/?l=linux-scsi&m=130824013229933&w=2
> 
> It is interesting that both this and the older report
> got oopsed in scsi_log_send(), while there are other
> dereferences of 'cmd' around scsi_dispatch_cmd().
> Are there any reason they are special? Just by accident?

Right, that's why it looks like the command area got freed rather than
the command pointer was bogus (6b is a poison free pattern).  Perhaps if
the reporter could pin down the failing source line, we'd know better
what was going on?

James



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-10-17 14:06                       ` James Bottomley
@ 2011-10-18 13:31                         ` Jun'ichi Nomura
  2011-10-18 15:45                           ` Heiko Carstens
  0 siblings, 1 reply; 60+ messages in thread
From: Jun'ichi Nomura @ 2011-10-18 13:31 UTC (permalink / raw)
  To: James Bottomley, Steffen Maier
  Cc: linux-scsi, Jens Axboe, Hannes Reinecke, Linux Kernel,
	Alan Stern, Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Heiko Carstens

On 10/17/11 23:06, James Bottomley wrote:
> On Mon, 2011-10-17 at 17:46 +0900, Jun'ichi Nomura wrote:
>> On 10/15/11 01:03, James Bottomley wrote:
>>> On Thu, 2011-10-13 at 15:09 +0200, Steffen Maier wrote:
>>>> This fix also went into 3.0.5 via
>>>> http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob;f=releases/3.0.5/block-free-queue-resources-at-blk_release_queue.patch
>>>> (originated at http://marc.info/?l=linux-scsi&m=131669751909474&w=2 and 
>>>> http://marc.info/?l=linux-scsi&m=131669414205696&w=2)
>>>>
>>>> However, it seems we still have a use-after-free bug, now causing the 
>>>> following oops with kernel 3.0.6 when removing paths to storage while 
>>>> generating load on device-mapper multipath disks:
>>>>
>>>>> Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
>>>>> Oops: 0038 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>>>> Modules linked in: iptable_filter ip_tables x_tables dm_round_robin sunrpc qeth_l3 binfmt_misc dm_multipath scsi_dh dm_mod ipv6 qeth ccwgroup [last unloaded: scsi_wait_scan]
>>>>> CPU: 1 Not tainted 3.0.6-50.x.20111006-s390xdefault #1
>>>>> Process blast.LzS_64_SL (pid: 26613, task: 0000000063e2a398, ksp: 0000000064de3560)
>>>>> Krnl PSW : 0704100180000000 000000000048a038 (scsi_print_command+0x44/0xf8)
>>>>>            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
>>>>> Krnl GPRS: 000000000000006b 6b6b6b6b6b6b6b6b 000000006717f800 000000000094f2e0
>>>>>            000000000061242e 000000000062bd88 0000000066fb90d8 0000000065391ad7
>>>>>            000000006717f800 000000006717f800 000000006716a090 000000006717f800
>>>>>            0000000000000004 0000000000672f88 0000000064de3838 0000000064de3808
>>>>> Krnl Code: 000000000048a026: f0b80004ebbf        srp     4(12,%r0),3007(%r14),8
>>>>>            000000000048a02c: f0a0000407f4        srp     4(11,%r0),2036,0
>>>>>            000000000048a032: e31020800004        lg      %r1,128(%r2)
>>>>>           >000000000048a038: e31010b00004        lg      %r1,176(%r1)
>>>>>            000000000048a03e: b9020011            ltgr    %r1,%r1
>>>>>            000000000048a042: a7840032            brc     8,48a0a6
>>>>>            000000000048a046: e33020000004        lg      %r3,0(%r2)
>>>>>            000000000048a04c: c04000182f4c        larl    %r4,78fee4
>>>>> Call Trace:
>>>>> ([<000000006717f800>] 0x6717f800)
>>>>>  [<0000000000487f28>] scsi_log_send+0xf0/0x130
>>>>>  [<000000000048824c>] scsi_dispatch_cmd+0xc8/0x4bc
>>>>>  [<0000000000490694>] scsi_request_fn+0x3b8/0x480
>>>
>>> Correct me if I'm wrong, but this seems to be saying that struct
>>> scsi_cmnd was used after free.  This looks to be a different problem
>>> because the command has a separate refcounting model which wasn't
>>> impacted by the change ... it could be we've just exposed yet another
>>> refcounting problem outside of the queue one.
>>>
>>> If I had to guess, I'd say a bio got cloned with a SCSI command already
>>> attached, but the ref count on the SCSI command wasn't correctly
>>> adjusted.
>>
>> As far as dm is concerned, it shouldn't happen.
>> Clone is made from a dm request, not from SCSI one.
>> Also clone is not reused when retrying.
> 
> It was just a guess.  Assuming the command got freed prematurely, there
> has to be something in the dm path to explain why the SCSI refcounting
> model got screwed up.  Cloning a bio with an attached command was what
> first occurred to me, but perhaps there are other ways I'm not seeing.
> 
>>>> Initially, we encountered use-after-free bugs in
>>>> scsi_print_command / scsi_dispatch_cmd
>>>> http://marc.info/?l=linux-scsi&m=130824013229933&w=2
>>
>> It is interesting that both this and the older report
>> got oopsed in scsi_log_send(), while there are other
>> dereferences of 'cmd' around scsi_dispatch_cmd().
>> Are there any reason they are special? Just by accident?
> 
> Right, that's why it looks like the command area got freed rather than
> the command pointer was bogus (6b is a poison free pattern).  Perhaps if
> the reporter could pin down the failing source line, we'd know better
> what was going on?

Yeah, that might be useful.

One remote possibility I imagined is if the submitting process
took very long after blk_start_request until scsi_dispatch_cmd
and timeout handler kicks in, can cmd be freed?

Also Tejun's report here could be related to possible data corruption:
  [PATCH] block: make gendisk hold a reference to its queue
  https://lkml.org/lkml/2011/10/16/148

Though I haven't hit the reported oops myself,
a process closing a removed device may modify freed memory.
And his patch will fix it.

So if the problem is easily reproducible, I think it's worth
trying his patch to see if the problem disappear.

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-10-18 13:31                         ` Jun'ichi Nomura
@ 2011-10-18 15:45                           ` Heiko Carstens
  2011-10-18 16:29                             ` James Bottomley
  0 siblings, 1 reply; 60+ messages in thread
From: Heiko Carstens @ 2011-10-18 15:45 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: James Bottomley, Steffen Maier, linux-scsi, Jens Axboe,
	Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar

On Tue, Oct 18, 2011 at 10:31:20PM +0900, Jun'ichi Nomura wrote:
> On 10/17/11 23:06, James Bottomley wrote:
> > On Mon, 2011-10-17 at 17:46 +0900, Jun'ichi Nomura wrote:
> >> On 10/15/11 01:03, James Bottomley wrote:
> >>> On Thu, 2011-10-13 at 15:09 +0200, Steffen Maier wrote:
> >>>> Initially, we encountered use-after-free bugs in
> >>>> scsi_print_command / scsi_dispatch_cmd
> >>>> http://marc.info/?l=linux-scsi&m=130824013229933&w=2
> >>
> >> It is interesting that both this and the older report
> >> got oopsed in scsi_log_send(), while there are other
> >> dereferences of 'cmd' around scsi_dispatch_cmd().
> >> Are there any reason they are special? Just by accident?
> > 
> > Right, that's why it looks like the command area got freed rather than
> > the command pointer was bogus (6b is a poison free pattern).  Perhaps if
> > the reporter could pin down the failing source line, we'd know better
> > what was going on?
> 
> Yeah, that might be useful.

The struct scsi_cmnd that was passed to scsi_log_send() was already freed
(contents completely 6b6b6b...).
Since SLUB debugging was turned on we can see that it was freed from
__scsi_put_command(). Not too much of a surprise.

> One remote possibility I imagined is if the submitting process
> took very long after blk_start_request until scsi_dispatch_cmd
> and timeout handler kicks in, can cmd be freed?
> 
> Also Tejun's report here could be related to possible data corruption:
>   [PATCH] block: make gendisk hold a reference to its queue
>   https://lkml.org/lkml/2011/10/16/148
> 
> Though I haven't hit the reported oops myself,
> a process closing a removed device may modify freed memory.
> And his patch will fix it.
> 
> So if the problem is easily reproducible, I think it's worth
> trying his patch to see if the problem disappear.

In addition it might be worthwhile to apply this small slub patch as well,
so that we get a full trace of the caller in the object which was freed:

---
 mm/slub.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

--- a/mm/slub.c
+++ b/mm/slub.c
@@ -27,6 +27,7 @@
 #include <linux/memory.h>
 #include <linux/math64.h>
 #include <linux/fault-inject.h>
+#include <linux/stacktrace.h>
 
 #include <trace/events/kmem.h>
 
@@ -191,11 +192,14 @@ static LIST_HEAD(slab_caches);
 /*
  * Tracking user of a slab.
  */
+#define SAVE_TRACE_NUM 30
 struct track {
 	unsigned long addr;	/* Called from address */
 	int cpu;		/* Was running on cpu */
 	int pid;		/* Pid context */
 	unsigned long when;	/* When did the operation occur */
+	unsigned long trace[SAVE_TRACE_NUM];
+	unsigned long long stamp;
 };
 
 enum track_item { TRACK_ALLOC, TRACK_FREE };
@@ -397,10 +401,17 @@ static void set_track(struct kmem_cache
 	struct track *p = get_track(s, object, alloc);
 
 	if (addr) {
+		struct stack_trace trace = {
+			.max_entries = SAVE_TRACE_NUM,
+			.entries = p->trace,
+		};
+
 		p->addr = addr;
 		p->cpu = smp_processor_id();
 		p->pid = current->pid;
 		p->when = jiffies;
+		p->stamp = get_clock();
+		save_stack_trace(&trace);
 	} else
 		memset(p, 0, sizeof(struct track));
 }

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-10-18 15:45                           ` Heiko Carstens
@ 2011-10-18 16:29                             ` James Bottomley
  2011-10-31 10:05                               ` Heiko Carstens
  0 siblings, 1 reply; 60+ messages in thread
From: James Bottomley @ 2011-10-18 16:29 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Jun'ichi Nomura, Steffen Maier, linux-scsi, Jens Axboe,
	Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar

On Tue, 2011-10-18 at 17:45 +0200, Heiko Carstens wrote:
> On Tue, Oct 18, 2011 at 10:31:20PM +0900, Jun'ichi Nomura wrote:
> > On 10/17/11 23:06, James Bottomley wrote:
> > > On Mon, 2011-10-17 at 17:46 +0900, Jun'ichi Nomura wrote:
> > >> On 10/15/11 01:03, James Bottomley wrote:
> > >>> On Thu, 2011-10-13 at 15:09 +0200, Steffen Maier wrote:
> > >>>> Initially, we encountered use-after-free bugs in
> > >>>> scsi_print_command / scsi_dispatch_cmd
> > >>>> http://marc.info/?l=linux-scsi&m=130824013229933&w=2
> > >>
> > >> It is interesting that both this and the older report
> > >> got oopsed in scsi_log_send(), while there are other
> > >> dereferences of 'cmd' around scsi_dispatch_cmd().
> > >> Are there any reason they are special? Just by accident?
> > > 
> > > Right, that's why it looks like the command area got freed rather than
> > > the command pointer was bogus (6b is a poison free pattern).  Perhaps if
> > > the reporter could pin down the failing source line, we'd know better
> > > what was going on?
> > 
> > Yeah, that might be useful.
> 
> The struct scsi_cmnd that was passed to scsi_log_send() was already freed
> (contents completely 6b6b6b...).
> Since SLUB debugging was turned on we can see that it was freed from
> __scsi_put_command(). Not too much of a surprise.

But it does tell us the put must be racing with dispatch, since
dereferencing the command to find the device worked higher up in
scsi_dispatch_cmd().

There is one way to invalidate the theory that we cloned something with
an attached command, and that's to put 

BUG_ON(rq->special)

in blk_insert_cloned_request().  I think we're careful about clearing
it, so it should work (perhaps a warn on just in case).

James



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-10-18 16:29                             ` James Bottomley
@ 2011-10-31 10:05                               ` Heiko Carstens
  2011-10-31 10:42                                 ` James Bottomley
  0 siblings, 1 reply; 60+ messages in thread
From: Heiko Carstens @ 2011-10-31 10:05 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jun'ichi Nomura, Steffen Maier, linux-scsi, Jens Axboe,
	Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, Mike Snitzer, gmuelas

On Tue, Oct 18, 2011 at 11:29:40AM -0500, James Bottomley wrote:
> On Tue, 2011-10-18 at 17:45 +0200, Heiko Carstens wrote:
> > On Tue, Oct 18, 2011 at 10:31:20PM +0900, Jun'ichi Nomura wrote:
> > > On 10/17/11 23:06, James Bottomley wrote:
> > > > On Mon, 2011-10-17 at 17:46 +0900, Jun'ichi Nomura wrote:
> > > >> On 10/15/11 01:03, James Bottomley wrote:
> > > >>> On Thu, 2011-10-13 at 15:09 +0200, Steffen Maier wrote:
> > > >>>> Initially, we encountered use-after-free bugs in
> > > >>>> scsi_print_command / scsi_dispatch_cmd
> > > >>>> http://marc.info/?l=linux-scsi&m=130824013229933&w=2
> > > >>
> > > >> It is interesting that both this and the older report
> > > >> got oopsed in scsi_log_send(), while there are other
> > > >> dereferences of 'cmd' around scsi_dispatch_cmd().
> > > >> Are there any reason they are special? Just by accident?
> > > > 
> > > > Right, that's why it looks like the command area got freed rather than
> > > > the command pointer was bogus (6b is a poison free pattern).  Perhaps if
> > > > the reporter could pin down the failing source line, we'd know better
> > > > what was going on?
> > > 
> > > Yeah, that might be useful.
> > 
> > The struct scsi_cmnd that was passed to scsi_log_send() was already freed
> > (contents completely 6b6b6b...).
> > Since SLUB debugging was turned on we can see that it was freed from
> > __scsi_put_command(). Not too much of a surprise.
> 
> But it does tell us the put must be racing with dispatch, since
> dereferencing the command to find the device worked higher up in
> scsi_dispatch_cmd().
> 
> There is one way to invalidate the theory that we cloned something with
> an attached command, and that's to put 
> 
> BUG_ON(rq->special)
> 
> in blk_insert_cloned_request().  I think we're careful about clearing
> it, so it should work (perhaps a warn on just in case).

It _looks_ like we do not hit the BUG_ON() that. This time we get this instead:

[ 4024.937870] Unable to handle kernel pointer dereference at virtual kernel address 000003e004d41000
[ 4024.937886] Oops: 0011 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 4024.937899] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [las
t unloaded: scsi_wait_scan]
[ 4024.937925] CPU: 1 Not tainted 3.0.7-50.x.20111021-s390xdefault #1
[ 4024.937930] Process ksoftirqd/1 (pid: 1942, task: 0000000079c6c750, ksp: 0000000073adfc50)
[ 4024.937936] Krnl PSW : 0704000180000000 000003e00126263a (dm_softirq_done+0x72/0x140 [dm_mod])
[ 4024.937959]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
[ 4024.937966] Krnl GPRS: 000000007b9156b0 000003e004d41100 000000000e14b600 000000000000006d
[ 4024.937971]            00000000715332b0 000000000c140ce8 000000000090d2ef 0000000000000005
[ 4024.937977]            0000000000000001 0000000000000101 000000000c140d00 0000000000000000
[ 4024.937983]            000003e001260000 000003e00126f098 0000000073adfd08 0000000073adfcb8
[ 4024.938001] Krnl Code: 000003e00126262a: f0a0000407f1        srp     4(11,%r0),2033,0
[ 4024.938009]            000003e001262630: e31050080004        lg      %r1,8(%r5)
[ 4024.938017]            000003e001262636: 58b05180            l       %r11,384(%r5)
[ 4024.938024]           >000003e00126263a: e31010080004        lg      %r1,8(%r1)
[ 4024.938031]            000003e001262640: e31010500004        lg      %r1,80(%r1)
[ 4024.938038]            000003e001262646: b9020011            ltgr    %r1,%r1
[ 4024.938045]            000003e00126264a: a784ffdf            brc     8,3e001262608
[ 4024.938053]            000003e00126264e: e32050080004        lg      %r2,8(%r5)
[ 4024.938060] Call Trace:
[ 4024.938063] ([<070000000040716c>] 0x70000000040716c)
[ 4024.938069]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
[ 4024.938080]  [<00000000001587c2>] __do_softirq+0xda/0x398
[ 4024.938088]  [<0000000000158ba0>] run_ksoftirqd+0x120/0x23c
[ 4024.938095]  [<000000000017c2aa>] kthread+0xa6/0xb0
[ 4024.938102]  [<000000000061970e>] kernel_thread_starter+0x6/0xc
[ 4024.938112]  [<0000000000619708>] kernel_thread_starter+0x0/0xc
[ 4024.938118] INFO: lockdep is turned off.
[ 4024.938121] Last Breaking-Event-Address:
[ 4024.938124]  [<000003e001262600>] dm_softirq_done+0x38/0x140 [dm_mod]
[ 4024.938135]  
[ 4024.938139] Kernel panic - not syncing: Fatal exception in interrupt
[ 4024.938144] CPU: 1 Tainted: G      D     3.0.7-50.x.20111021-s390xdefault #1
[ 4024.938150] Process ksoftirqd/1 (pid: 1942, task: 0000000079c6c750, ksp: 0000000073adfc50)
[ 4024.938155] 0000000073adf958 0000000073adf8d8 0000000000000002 0000000000000000 
[ 4024.938164]        0000000073adf978 0000000073adf8f0 0000000073adf8f0 000000000061386a 
[ 4024.938174]        0000000000000000 0000000000000000 0000000000000005 0000000000100ec6 
[ 4024.938184]        000000000000000d 000000000000000c 0000000073adf940 0000000000000000 
[ 4024.938194]        0000000000000000 0000000000100a18 0000000073adf8d8 0000000073adf918 
[ 4024.938205] Call Trace:
[ 4024.938208] ([<0000000000100926>] show_trace+0xee/0x144)
[ 4024.938216]  [<0000000000613694>] panic+0xb0/0x234
[ 4024.938224]  [<0000000000100ec6>] die+0x15a/0x168
[ 4024.938230]  [<000000000011fb9e>] do_no_context+0xba/0xf8
[ 4024.938306]  [<000000000061c074>] do_dat_exception+0x378/0x3e4
[ 4024.938314]  [<0000000000619e02>] pgm_exit+0x0/0x4
[ 4024.938319]  [<000003e00126263a>] dm_softirq_done+0x72/0x140 [dm_mod]
[ 4024.938329] ([<070000000040716c>] 0x70000000040716c)
[ 4024.938334]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
[ 4024.938341]  [<00000000001587c2>] __do_softirq+0xda/0x398
[ 4024.938347]  [<0000000000158ba0>] run_ksoftirqd+0x120/0x23c
[ 4024.938354]  [<000000000017c2aa>] kthread+0xa6/0xb0
[ 4024.938360]  [<000000000061970e>] kernel_thread_starter+0x6/0xc
[ 4024.938366]  [<0000000000619708>] kernel_thread_starter+0x0/0xc
[ 4024.938373] INFO: lockdep is turned off.

So we thought we might as well upgrade to 3.1 but immediately got a

kernel BUG at block/blk-flush.c:323!

which was handled here https://lkml.org/lkml/2011/10/4/105 and
here https://lkml.org/lkml/2011/10/12/408 .

But no patches for that one went upstream AFAICS.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-10-31 10:05                               ` Heiko Carstens
@ 2011-10-31 10:42                                 ` James Bottomley
  2011-10-31 11:46                                   ` Jun'ichi Nomura
  2011-10-31 13:21                                   ` Mike Snitzer
  0 siblings, 2 replies; 60+ messages in thread
From: James Bottomley @ 2011-10-31 10:42 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Jun'ichi Nomura, Steffen Maier, linux-scsi, Jens Axboe,
	Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, Mike Snitzer, gmuelas

On Mon, 2011-10-31 at 11:05 +0100, Heiko Carstens wrote:
> On Tue, Oct 18, 2011 at 11:29:40AM -0500, James Bottomley wrote:
> > On Tue, 2011-10-18 at 17:45 +0200, Heiko Carstens wrote:
> > > On Tue, Oct 18, 2011 at 10:31:20PM +0900, Jun'ichi Nomura wrote:
> > > > On 10/17/11 23:06, James Bottomley wrote:
> > > > > On Mon, 2011-10-17 at 17:46 +0900, Jun'ichi Nomura wrote:
> > > > >> On 10/15/11 01:03, James Bottomley wrote:
> > > > >>> On Thu, 2011-10-13 at 15:09 +0200, Steffen Maier wrote:
> > > > >>>> Initially, we encountered use-after-free bugs in
> > > > >>>> scsi_print_command / scsi_dispatch_cmd
> > > > >>>> http://marc.info/?l=linux-scsi&m=130824013229933&w=2
> > > > >>
> > > > >> It is interesting that both this and the older report
> > > > >> got oopsed in scsi_log_send(), while there are other
> > > > >> dereferences of 'cmd' around scsi_dispatch_cmd().
> > > > >> Are there any reason they are special? Just by accident?
> > > > > 
> > > > > Right, that's why it looks like the command area got freed rather than
> > > > > the command pointer was bogus (6b is a poison free pattern).  Perhaps if
> > > > > the reporter could pin down the failing source line, we'd know better
> > > > > what was going on?
> > > > 
> > > > Yeah, that might be useful.
> > > 
> > > The struct scsi_cmnd that was passed to scsi_log_send() was already freed
> > > (contents completely 6b6b6b...).
> > > Since SLUB debugging was turned on we can see that it was freed from
> > > __scsi_put_command(). Not too much of a surprise.
> > 
> > But it does tell us the put must be racing with dispatch, since
> > dereferencing the command to find the device worked higher up in
> > scsi_dispatch_cmd().
> > 
> > There is one way to invalidate the theory that we cloned something with
> > an attached command, and that's to put 
> > 
> > BUG_ON(rq->special)
> > 
> > in blk_insert_cloned_request().  I think we're careful about clearing
> > it, so it should work (perhaps a warn on just in case).
> 
> It _looks_ like we do not hit the BUG_ON() that. This time we get this instead:
> 
> [ 4024.937870] Unable to handle kernel pointer dereference at virtual kernel address 000003e004d41000
> [ 4024.937886] Oops: 0011 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 4024.937899] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [las
> t unloaded: scsi_wait_scan]
> [ 4024.937925] CPU: 1 Not tainted 3.0.7-50.x.20111021-s390xdefault #1
> [ 4024.937930] Process ksoftirqd/1 (pid: 1942, task: 0000000079c6c750, ksp: 0000000073adfc50)
> [ 4024.937936] Krnl PSW : 0704000180000000 000003e00126263a (dm_softirq_done+0x72/0x140 [dm_mod])
> [ 4024.937959]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
> [ 4024.937966] Krnl GPRS: 000000007b9156b0 000003e004d41100 000000000e14b600 000000000000006d
> [ 4024.937971]            00000000715332b0 000000000c140ce8 000000000090d2ef 0000000000000005
> [ 4024.937977]            0000000000000001 0000000000000101 000000000c140d00 0000000000000000
> [ 4024.937983]            000003e001260000 000003e00126f098 0000000073adfd08 0000000073adfcb8
> [ 4024.938001] Krnl Code: 000003e00126262a: f0a0000407f1        srp     4(11,%r0),2033,0
> [ 4024.938009]            000003e001262630: e31050080004        lg      %r1,8(%r5)
> [ 4024.938017]            000003e001262636: 58b05180            l       %r11,384(%r5)
> [ 4024.938024]           >000003e00126263a: e31010080004        lg      %r1,8(%r1)
> [ 4024.938031]            000003e001262640: e31010500004        lg      %r1,80(%r1)
> [ 4024.938038]            000003e001262646: b9020011            ltgr    %r1,%r1
> [ 4024.938045]            000003e00126264a: a784ffdf            brc     8,3e001262608
> [ 4024.938053]            000003e00126264e: e32050080004        lg      %r2,8(%r5)
> [ 4024.938060] Call Trace:
> [ 4024.938063] ([<070000000040716c>] 0x70000000040716c)
> [ 4024.938069]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> [ 4024.938080]  [<00000000001587c2>] __do_softirq+0xda/0x398
> [ 4024.938088]  [<0000000000158ba0>] run_ksoftirqd+0x120/0x23c
> [ 4024.938095]  [<000000000017c2aa>] kthread+0xa6/0xb0
> [ 4024.938102]  [<000000000061970e>] kernel_thread_starter+0x6/0xc
> [ 4024.938112]  [<0000000000619708>] kernel_thread_starter+0x0/0xc
> [ 4024.938118] INFO: lockdep is turned off.
> [ 4024.938121] Last Breaking-Event-Address:
> [ 4024.938124]  [<000003e001262600>] dm_softirq_done+0x38/0x140 [dm_mod]
> [ 4024.938135]  
> [ 4024.938139] Kernel panic - not syncing: Fatal exception in interrupt
> [ 4024.938144] CPU: 1 Tainted: G      D     3.0.7-50.x.20111021-s390xdefault #1
> [ 4024.938150] Process ksoftirqd/1 (pid: 1942, task: 0000000079c6c750, ksp: 0000000073adfc50)
> [ 4024.938155] 0000000073adf958 0000000073adf8d8 0000000000000002 0000000000000000 
> [ 4024.938164]        0000000073adf978 0000000073adf8f0 0000000073adf8f0 000000000061386a 
> [ 4024.938174]        0000000000000000 0000000000000000 0000000000000005 0000000000100ec6 
> [ 4024.938184]        000000000000000d 000000000000000c 0000000073adf940 0000000000000000 
> [ 4024.938194]        0000000000000000 0000000000100a18 0000000073adf8d8 0000000073adf918 
> [ 4024.938205] Call Trace:
> [ 4024.938208] ([<0000000000100926>] show_trace+0xee/0x144)
> [ 4024.938216]  [<0000000000613694>] panic+0xb0/0x234
> [ 4024.938224]  [<0000000000100ec6>] die+0x15a/0x168
> [ 4024.938230]  [<000000000011fb9e>] do_no_context+0xba/0xf8
> [ 4024.938306]  [<000000000061c074>] do_dat_exception+0x378/0x3e4
> [ 4024.938314]  [<0000000000619e02>] pgm_exit+0x0/0x4
> [ 4024.938319]  [<000003e00126263a>] dm_softirq_done+0x72/0x140 [dm_mod]
> [ 4024.938329] ([<070000000040716c>] 0x70000000040716c)
> [ 4024.938334]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> [ 4024.938341]  [<00000000001587c2>] __do_softirq+0xda/0x398
> [ 4024.938347]  [<0000000000158ba0>] run_ksoftirqd+0x120/0x23c
> [ 4024.938354]  [<000000000017c2aa>] kthread+0xa6/0xb0
> [ 4024.938360]  [<000000000061970e>] kernel_thread_starter+0x6/0xc
> [ 4024.938366]  [<0000000000619708>] kernel_thread_starter+0x0/0xc
> [ 4024.938373] INFO: lockdep is turned off.
> 
> So we thought we might as well upgrade to 3.1 but immediately got a
> 
> kernel BUG at block/blk-flush.c:323!
> 
> which was handled here https://lkml.org/lkml/2011/10/4/105 and
> here https://lkml.org/lkml/2011/10/12/408 .
> 
> But no patches for that one went upstream AFAICS.

Well, all I can say is "hm".  You put only a BUG_ON() in the code, which
wasn't triggered, but now we get a completely different oops.  However,
I think it does point to the dm barrier handling code.  Can you turn off
barriers and see if all oopses go away?

James



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-10-31 10:42                                 ` James Bottomley
@ 2011-10-31 11:46                                   ` Jun'ichi Nomura
  2011-10-31 13:00                                     ` Heiko Carstens
  2011-10-31 13:21                                   ` Mike Snitzer
  1 sibling, 1 reply; 60+ messages in thread
From: Jun'ichi Nomura @ 2011-10-31 11:46 UTC (permalink / raw)
  To: Heiko Carstens, James Bottomley
  Cc: Steffen Maier, linux-scsi, Jens Axboe, Hannes Reinecke,
	Linux Kernel, Alan Stern, Thadeu Lima de Souza Cascardo,
	Taraka R. Bodireddy, Seshagiri N. Ippili,
	Manvanthara B. Puttashankar, Jeff Moyer, Shaohua Li,
	Mike Snitzer, gmuelas

On 10/31/11 19:42, James Bottomley wrote:
> On Mon, 2011-10-31 at 11:05 +0100, Heiko Carstens wrote:
>> On Tue, Oct 18, 2011 at 11:29:40AM -0500, James Bottomley wrote:
>>> On Tue, 2011-10-18 at 17:45 +0200, Heiko Carstens wrote:
>>>> On Tue, Oct 18, 2011 at 10:31:20PM +0900, Jun'ichi Nomura wrote:
>>>>> On 10/17/11 23:06, James Bottomley wrote:
>>>>>> On Mon, 2011-10-17 at 17:46 +0900, Jun'ichi Nomura wrote:
>>>>>>> On 10/15/11 01:03, James Bottomley wrote:
>>>>>>>> On Thu, 2011-10-13 at 15:09 +0200, Steffen Maier wrote:
>>>>>>>>> Initially, we encountered use-after-free bugs in
>>>>>>>>> scsi_print_command / scsi_dispatch_cmd
>>>>>>>>> http://marc.info/?l=linux-scsi&m=130824013229933&w=2
>>>>>>>
>>>>>>> It is interesting that both this and the older report
>>>>>>> got oopsed in scsi_log_send(), while there are other
>>>>>>> dereferences of 'cmd' around scsi_dispatch_cmd().
>>>>>>> Are there any reason they are special? Just by accident?
>>>>>>
>>>>>> Right, that's why it looks like the command area got freed rather than
>>>>>> the command pointer was bogus (6b is a poison free pattern).  Perhaps if
>>>>>> the reporter could pin down the failing source line, we'd know better
>>>>>> what was going on?
>>>>>
>>>>> Yeah, that might be useful.
>>>>
>>>> The struct scsi_cmnd that was passed to scsi_log_send() was already freed
>>>> (contents completely 6b6b6b...).
>>>> Since SLUB debugging was turned on we can see that it was freed from
>>>> __scsi_put_command(). Not too much of a surprise.
>>>
>>> But it does tell us the put must be racing with dispatch, since
>>> dereferencing the command to find the device worked higher up in
>>> scsi_dispatch_cmd().
>>>
>>> There is one way to invalidate the theory that we cloned something with
>>> an attached command, and that's to put 
>>>
>>> BUG_ON(rq->special)
>>>
>>> in blk_insert_cloned_request().  I think we're careful about clearing
>>> it, so it should work (perhaps a warn on just in case).
>>
>> It _looks_ like we do not hit the BUG_ON() that. This time we get this instead:
>>
>> [ 4024.937870] Unable to handle kernel pointer dereference at virtual kernel address 000003e004d41000
>> [ 4024.937886] Oops: 0011 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>> [ 4024.937899] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [las
>> t unloaded: scsi_wait_scan]
>> [ 4024.937925] CPU: 1 Not tainted 3.0.7-50.x.20111021-s390xdefault #1
>> [ 4024.937930] Process ksoftirqd/1 (pid: 1942, task: 0000000079c6c750, ksp: 0000000073adfc50)
>> [ 4024.937936] Krnl PSW : 0704000180000000 000003e00126263a (dm_softirq_done+0x72/0x140 [dm_mod])
>> [ 4024.937959]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
>> [ 4024.937966] Krnl GPRS: 000000007b9156b0 000003e004d41100 000000000e14b600 000000000000006d
>> [ 4024.937971]            00000000715332b0 000000000c140ce8 000000000090d2ef 0000000000000005
>> [ 4024.937977]            0000000000000001 0000000000000101 000000000c140d00 0000000000000000
>> [ 4024.937983]            000003e001260000 000003e00126f098 0000000073adfd08 0000000073adfcb8
>> [ 4024.938001] Krnl Code: 000003e00126262a: f0a0000407f1        srp     4(11,%r0),2033,0
>> [ 4024.938009]            000003e001262630: e31050080004        lg      %r1,8(%r5)
>> [ 4024.938017]            000003e001262636: 58b05180            l       %r11,384(%r5)
>> [ 4024.938024]           >000003e00126263a: e31010080004        lg      %r1,8(%r1)
>> [ 4024.938031]            000003e001262640: e31010500004        lg      %r1,80(%r1)
>> [ 4024.938038]            000003e001262646: b9020011            ltgr    %r1,%r1
>> [ 4024.938045]            000003e00126264a: a784ffdf            brc     8,3e001262608
>> [ 4024.938053]            000003e00126264e: e32050080004        lg      %r2,8(%r5)
>> [ 4024.938060] Call Trace:
>> [ 4024.938063] ([<070000000040716c>] 0x70000000040716c)
>> [ 4024.938069]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
>> [ 4024.938080]  [<00000000001587c2>] __do_softirq+0xda/0x398
>> [ 4024.938088]  [<0000000000158ba0>] run_ksoftirqd+0x120/0x23c
>> [ 4024.938095]  [<000000000017c2aa>] kthread+0xa6/0xb0
>> [ 4024.938102]  [<000000000061970e>] kernel_thread_starter+0x6/0xc
>> [ 4024.938112]  [<0000000000619708>] kernel_thread_starter+0x0/0xc
>> [ 4024.938118] INFO: lockdep is turned off.
>> [ 4024.938121] Last Breaking-Event-Address:
>> [ 4024.938124]  [<000003e001262600>] dm_softirq_done+0x38/0x140 [dm_mod]
>> [ 4024.938135]  
>> [ 4024.938139] Kernel panic - not syncing: Fatal exception in interrupt
>> [ 4024.938144] CPU: 1 Tainted: G      D     3.0.7-50.x.20111021-s390xdefault #1
>> [ 4024.938150] Process ksoftirqd/1 (pid: 1942, task: 0000000079c6c750, ksp: 0000000073adfc50)
>> [ 4024.938155] 0000000073adf958 0000000073adf8d8 0000000000000002 0000000000000000 
>> [ 4024.938164]        0000000073adf978 0000000073adf8f0 0000000073adf8f0 000000000061386a 
>> [ 4024.938174]        0000000000000000 0000000000000000 0000000000000005 0000000000100ec6 
>> [ 4024.938184]        000000000000000d 000000000000000c 0000000073adf940 0000000000000000 
>> [ 4024.938194]        0000000000000000 0000000000100a18 0000000073adf8d8 0000000073adf918 
>> [ 4024.938205] Call Trace:
>> [ 4024.938208] ([<0000000000100926>] show_trace+0xee/0x144)
>> [ 4024.938216]  [<0000000000613694>] panic+0xb0/0x234
>> [ 4024.938224]  [<0000000000100ec6>] die+0x15a/0x168
>> [ 4024.938230]  [<000000000011fb9e>] do_no_context+0xba/0xf8
>> [ 4024.938306]  [<000000000061c074>] do_dat_exception+0x378/0x3e4
>> [ 4024.938314]  [<0000000000619e02>] pgm_exit+0x0/0x4
>> [ 4024.938319]  [<000003e00126263a>] dm_softirq_done+0x72/0x140 [dm_mod]
>> [ 4024.938329] ([<070000000040716c>] 0x70000000040716c)
>> [ 4024.938334]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
>> [ 4024.938341]  [<00000000001587c2>] __do_softirq+0xda/0x398
>> [ 4024.938347]  [<0000000000158ba0>] run_ksoftirqd+0x120/0x23c
>> [ 4024.938354]  [<000000000017c2aa>] kthread+0xa6/0xb0
>> [ 4024.938360]  [<000000000061970e>] kernel_thread_starter+0x6/0xc
>> [ 4024.938366]  [<0000000000619708>] kernel_thread_starter+0x0/0xc
>> [ 4024.938373] INFO: lockdep is turned off.
>>
>> So we thought we might as well upgrade to 3.1 but immediately got a
>>
>> kernel BUG at block/blk-flush.c:323!
>>
>> which was handled here https://lkml.org/lkml/2011/10/4/105 and
>> here https://lkml.org/lkml/2011/10/12/408 .
>>
>> But no patches for that one went upstream AFAICS.
> 
> Well, all I can say is "hm".  You put only a BUG_ON() in the code, which
> wasn't triggered, but now we get a completely different oops.  However,
> I think it does point to the dm barrier handling code.  Can you turn off
> barriers and see if all oopses go away?

Hm, dm_softirq_done is generic completion code of original
request in dm-multipath.
So oops here might be another manifestation of use-after-free.

Do you always hit the oops at the same address?

Could you find corresponding source code line for
the crashed address, dm_softirq_done+0x72/0x140,
and which pointer was invalid?

-- 
Jun'ichi Nomura, NEC Corporation

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-10-31 11:46                                   ` Jun'ichi Nomura
@ 2011-10-31 13:00                                     ` Heiko Carstens
  2011-11-02 12:37                                       ` Jun'ichi Nomura
  2011-11-03 18:25                                       ` Mike Snitzer
  0 siblings, 2 replies; 60+ messages in thread
From: Heiko Carstens @ 2011-10-31 13:00 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: James Bottomley, Steffen Maier, linux-scsi, Jens Axboe,
	Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, Mike Snitzer, gmuelas

On Mon, Oct 31, 2011 at 08:46:06PM +0900, Jun'ichi Nomura wrote:
> Hm, dm_softirq_done is generic completion code of original
> request in dm-multipath.
> So oops here might be another manifestation of use-after-free.
> 
> Do you always hit the oops at the same address?

I think we saw this bug the first time. But before that the scsi
logging level was higher. Gonzalo is trying to recreate it with
the same (old) scsi logging level.
Afterwards we will try with barrier=0.

Both on v3.0.7 btw.

> Could you find corresponding source code line for
> the crashed address, dm_softirq_done+0x72/0x140,
> and which pointer was invalid?

It crashes in the inlined function dm_done() when trying to
dereference tio (aka clone->end_io_data):

static void dm_done(struct request *clone, int error, bool mapped)
{
        int r = error;
        struct dm_rq_target_io *tio = clone->end_io_data;
        dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-10-31 10:42                                 ` James Bottomley
  2011-10-31 11:46                                   ` Jun'ichi Nomura
@ 2011-10-31 13:21                                   ` Mike Snitzer
  2011-10-31 13:40                                     ` Heiko Carstens
  1 sibling, 1 reply; 60+ messages in thread
From: Mike Snitzer @ 2011-10-31 13:21 UTC (permalink / raw)
  To: James Bottomley
  Cc: Heiko Carstens, Jun'ichi Nomura, Steffen Maier, linux-scsi,
	Jens Axboe, Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

On Mon, Oct 31 2011 at  6:42am -0400,
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> On Mon, 2011-10-31 at 11:05 +0100, Heiko Carstens wrote:
> > On Tue, Oct 18, 2011 at 11:29:40AM -0500, James Bottomley wrote:
> > > On Tue, 2011-10-18 at 17:45 +0200, Heiko Carstens wrote:
> > > > On Tue, Oct 18, 2011 at 10:31:20PM +0900, Jun'ichi Nomura wrote:
> > > > > On 10/17/11 23:06, James Bottomley wrote:
> > > > > > On Mon, 2011-10-17 at 17:46 +0900, Jun'ichi Nomura wrote:
> > > > > >> On 10/15/11 01:03, James Bottomley wrote:
> > > > > >>> On Thu, 2011-10-13 at 15:09 +0200, Steffen Maier wrote:
> > > > > >>>> Initially, we encountered use-after-free bugs in
> > > > > >>>> scsi_print_command / scsi_dispatch_cmd
> > > > > >>>> http://marc.info/?l=linux-scsi&m=130824013229933&w=2
> > > > > >>
> > > > > >> It is interesting that both this and the older report
> > > > > >> got oopsed in scsi_log_send(), while there are other
> > > > > >> dereferences of 'cmd' around scsi_dispatch_cmd().
> > > > > >> Are there any reason they are special? Just by accident?
> > > > > > 
> > > > > > Right, that's why it looks like the command area got freed rather than
> > > > > > the command pointer was bogus (6b is a poison free pattern).  Perhaps if
> > > > > > the reporter could pin down the failing source line, we'd know better
> > > > > > what was going on?
> > > > > 
> > > > > Yeah, that might be useful.
> > > > 
> > > > The struct scsi_cmnd that was passed to scsi_log_send() was already freed
> > > > (contents completely 6b6b6b...).
> > > > Since SLUB debugging was turned on we can see that it was freed from
> > > > __scsi_put_command(). Not too much of a surprise.
> > > 
> > > But it does tell us the put must be racing with dispatch, since
> > > dereferencing the command to find the device worked higher up in
> > > scsi_dispatch_cmd().
> > > 
> > > There is one way to invalidate the theory that we cloned something with
> > > an attached command, and that's to put 
> > > 
> > > BUG_ON(rq->special)
> > > 
> > > in blk_insert_cloned_request().  I think we're careful about clearing
> > > it, so it should work (perhaps a warn on just in case).
> > 
> > It _looks_ like we do not hit the BUG_ON() that. This time we get this instead:
> > 
> > [ 4024.937870] Unable to handle kernel pointer dereference at virtual kernel address 000003e004d41000
> > [ 4024.937886] Oops: 0011 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > [ 4024.937899] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [las
> > t unloaded: scsi_wait_scan]
> > [ 4024.937925] CPU: 1 Not tainted 3.0.7-50.x.20111021-s390xdefault #1
> > [ 4024.937930] Process ksoftirqd/1 (pid: 1942, task: 0000000079c6c750, ksp: 0000000073adfc50)
> > [ 4024.937936] Krnl PSW : 0704000180000000 000003e00126263a (dm_softirq_done+0x72/0x140 [dm_mod])
> > [ 4024.937959]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
> > [ 4024.937966] Krnl GPRS: 000000007b9156b0 000003e004d41100 000000000e14b600 000000000000006d
> > [ 4024.937971]            00000000715332b0 000000000c140ce8 000000000090d2ef 0000000000000005
> > [ 4024.937977]            0000000000000001 0000000000000101 000000000c140d00 0000000000000000
> > [ 4024.937983]            000003e001260000 000003e00126f098 0000000073adfd08 0000000073adfcb8
> > [ 4024.938001] Krnl Code: 000003e00126262a: f0a0000407f1        srp     4(11,%r0),2033,0
> > [ 4024.938009]            000003e001262630: e31050080004        lg      %r1,8(%r5)
> > [ 4024.938017]            000003e001262636: 58b05180            l       %r11,384(%r5)
> > [ 4024.938024]           >000003e00126263a: e31010080004        lg      %r1,8(%r1)
> > [ 4024.938031]            000003e001262640: e31010500004        lg      %r1,80(%r1)
> > [ 4024.938038]            000003e001262646: b9020011            ltgr    %r1,%r1
> > [ 4024.938045]            000003e00126264a: a784ffdf            brc     8,3e001262608
> > [ 4024.938053]            000003e00126264e: e32050080004        lg      %r2,8(%r5)
> > [ 4024.938060] Call Trace:
> > [ 4024.938063] ([<070000000040716c>] 0x70000000040716c)
> > [ 4024.938069]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> > [ 4024.938080]  [<00000000001587c2>] __do_softirq+0xda/0x398
> > [ 4024.938088]  [<0000000000158ba0>] run_ksoftirqd+0x120/0x23c
> > [ 4024.938095]  [<000000000017c2aa>] kthread+0xa6/0xb0
> > [ 4024.938102]  [<000000000061970e>] kernel_thread_starter+0x6/0xc
> > [ 4024.938112]  [<0000000000619708>] kernel_thread_starter+0x0/0xc
> > [ 4024.938118] INFO: lockdep is turned off.
> > [ 4024.938121] Last Breaking-Event-Address:
> > [ 4024.938124]  [<000003e001262600>] dm_softirq_done+0x38/0x140 [dm_mod]
> > [ 4024.938135]  
> > [ 4024.938139] Kernel panic - not syncing: Fatal exception in interrupt
> > [ 4024.938144] CPU: 1 Tainted: G      D     3.0.7-50.x.20111021-s390xdefault #1
> > [ 4024.938150] Process ksoftirqd/1 (pid: 1942, task: 0000000079c6c750, ksp: 0000000073adfc50)
> > [ 4024.938155] 0000000073adf958 0000000073adf8d8 0000000000000002 0000000000000000 
> > [ 4024.938164]        0000000073adf978 0000000073adf8f0 0000000073adf8f0 000000000061386a 
> > [ 4024.938174]        0000000000000000 0000000000000000 0000000000000005 0000000000100ec6 
> > [ 4024.938184]        000000000000000d 000000000000000c 0000000073adf940 0000000000000000 
> > [ 4024.938194]        0000000000000000 0000000000100a18 0000000073adf8d8 0000000073adf918 
> > [ 4024.938205] Call Trace:
> > [ 4024.938208] ([<0000000000100926>] show_trace+0xee/0x144)
> > [ 4024.938216]  [<0000000000613694>] panic+0xb0/0x234
> > [ 4024.938224]  [<0000000000100ec6>] die+0x15a/0x168
> > [ 4024.938230]  [<000000000011fb9e>] do_no_context+0xba/0xf8
> > [ 4024.938306]  [<000000000061c074>] do_dat_exception+0x378/0x3e4
> > [ 4024.938314]  [<0000000000619e02>] pgm_exit+0x0/0x4
> > [ 4024.938319]  [<000003e00126263a>] dm_softirq_done+0x72/0x140 [dm_mod]
> > [ 4024.938329] ([<070000000040716c>] 0x70000000040716c)
> > [ 4024.938334]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> > [ 4024.938341]  [<00000000001587c2>] __do_softirq+0xda/0x398
> > [ 4024.938347]  [<0000000000158ba0>] run_ksoftirqd+0x120/0x23c
> > [ 4024.938354]  [<000000000017c2aa>] kthread+0xa6/0xb0
> > [ 4024.938360]  [<000000000061970e>] kernel_thread_starter+0x6/0xc
> > [ 4024.938366]  [<0000000000619708>] kernel_thread_starter+0x0/0xc
> > [ 4024.938373] INFO: lockdep is turned off.
> > 
> > So we thought we might as well upgrade to 3.1 but immediately got a
> > 
> > kernel BUG at block/blk-flush.c:323!
> > 
> > which was handled here https://lkml.org/lkml/2011/10/4/105 and
> > here https://lkml.org/lkml/2011/10/12/408 .
> > 
> > But no patches for that one went upstream AFAICS.
> 
> Well, all I can say is "hm".  You put only a BUG_ON() in the code, which
> wasn't triggered, but now we get a completely different oops.  However,
> I think it does point to the dm barrier handling code.  Can you turn off
> barriers and see if all oopses go away?

There are two 3.1-stable fixes from Jeff Moyer that Jens staged for
Linus to pick up (but seems Jens hasn't sent his 3.2 pull to Linus yet):

http://git.kernel.dk/?p=linux-block.git;a=commit;h=8f02b3a09b1b7d2a4d24b8cd7008f2a441f19a14
http://git.kernel.dk/?p=linux-block.git;a=commit;h=f26d8f0562da76731cb049943a0e9d9fa81d946a

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-10-31 13:21                                   ` Mike Snitzer
@ 2011-10-31 13:40                                     ` Heiko Carstens
  2011-10-31 14:01                                       ` Mike Snitzer
  0 siblings, 1 reply; 60+ messages in thread
From: Heiko Carstens @ 2011-10-31 13:40 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: James Bottomley, Jun'ichi Nomura, Steffen Maier, linux-scsi,
	Jens Axboe, Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

On Mon, Oct 31, 2011 at 09:21:58AM -0400, Mike Snitzer wrote:
> > > It _looks_ like we do not hit the BUG_ON() that. This time we get this instead:
> > > 
> > > [ 4024.937870] Unable to handle kernel pointer dereference at virtual kernel address 000003e004d41000
> > > [ 4024.937886] Oops: 0011 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > > [ 4024.937899] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [las
> > > t unloaded: scsi_wait_scan]
> > > [ 4024.937925] CPU: 1 Not tainted 3.0.7-50.x.20111021-s390xdefault #1
> > > [ 4024.937930] Process ksoftirqd/1 (pid: 1942, task: 0000000079c6c750, ksp: 0000000073adfc50)
> > > [ 4024.937936] Krnl PSW : 0704000180000000 000003e00126263a (dm_softirq_done+0x72/0x140 [dm_mod])
> > > [ 4024.937959]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
> > > [ 4024.937966] Krnl GPRS: 000000007b9156b0 000003e004d41100 000000000e14b600 000000000000006d
> > > [ 4024.937971]            00000000715332b0 000000000c140ce8 000000000090d2ef 0000000000000005
> > > [ 4024.937977]            0000000000000001 0000000000000101 000000000c140d00 0000000000000000
> > > [ 4024.937983]            000003e001260000 000003e00126f098 0000000073adfd08 0000000073adfcb8
> > > [ 4024.938001] Krnl Code: 000003e00126262a: f0a0000407f1        srp     4(11,%r0),2033,0
> > > [ 4024.938009]            000003e001262630: e31050080004        lg      %r1,8(%r5)
> > > [ 4024.938017]            000003e001262636: 58b05180            l       %r11,384(%r5)
> > > [ 4024.938024]           >000003e00126263a: e31010080004        lg      %r1,8(%r1)
> > > [ 4024.938031]            000003e001262640: e31010500004        lg      %r1,80(%r1)
> > > [ 4024.938038]            000003e001262646: b9020011            ltgr    %r1,%r1
> > > [ 4024.938045]            000003e00126264a: a784ffdf            brc     8,3e001262608
> > > [ 4024.938053]            000003e00126264e: e32050080004        lg      %r2,8(%r5)
> > > [ 4024.938060] Call Trace:
> > > [ 4024.938063] ([<070000000040716c>] 0x70000000040716c)
> > > [ 4024.938069]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> > > [ 4024.938080]  [<00000000001587c2>] __do_softirq+0xda/0x398
> > > [ 4024.938088]  [<0000000000158ba0>] run_ksoftirqd+0x120/0x23c
> > > [ 4024.938095]  [<000000000017c2aa>] kthread+0xa6/0xb0
> > > [ 4024.938102]  [<000000000061970e>] kernel_thread_starter+0x6/0xc
> > > [ 4024.938112]  [<0000000000619708>] kernel_thread_starter+0x0/0xc
> > > [ 4024.938118] INFO: lockdep is turned off.
> > > [ 4024.938121] Last Breaking-Event-Address:
> > > [ 4024.938124]  [<000003e001262600>] dm_softirq_done+0x38/0x140 [dm_mod]
> > > [ 4024.938135]  
> > > [ 4024.938139] Kernel panic - not syncing: Fatal exception in interrupt
> > > [ 4024.938144] CPU: 1 Tainted: G      D     3.0.7-50.x.20111021-s390xdefault #1
> > > [ 4024.938150] Process ksoftirqd/1 (pid: 1942, task: 0000000079c6c750, ksp: 0000000073adfc50)
> > > [ 4024.938155] 0000000073adf958 0000000073adf8d8 0000000000000002 0000000000000000 
> > > [ 4024.938164]        0000000073adf978 0000000073adf8f0 0000000073adf8f0 000000000061386a 
> > > [ 4024.938174]        0000000000000000 0000000000000000 0000000000000005 0000000000100ec6 
> > > [ 4024.938184]        000000000000000d 000000000000000c 0000000073adf940 0000000000000000 
> > > [ 4024.938194]        0000000000000000 0000000000100a18 0000000073adf8d8 0000000073adf918 
> > > [ 4024.938205] Call Trace:
> > > [ 4024.938208] ([<0000000000100926>] show_trace+0xee/0x144)
> > > [ 4024.938216]  [<0000000000613694>] panic+0xb0/0x234
> > > [ 4024.938224]  [<0000000000100ec6>] die+0x15a/0x168
> > > [ 4024.938230]  [<000000000011fb9e>] do_no_context+0xba/0xf8
> > > [ 4024.938306]  [<000000000061c074>] do_dat_exception+0x378/0x3e4
> > > [ 4024.938314]  [<0000000000619e02>] pgm_exit+0x0/0x4
> > > [ 4024.938319]  [<000003e00126263a>] dm_softirq_done+0x72/0x140 [dm_mod]
> > > [ 4024.938329] ([<070000000040716c>] 0x70000000040716c)
> > > [ 4024.938334]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> > > [ 4024.938341]  [<00000000001587c2>] __do_softirq+0xda/0x398
> > > [ 4024.938347]  [<0000000000158ba0>] run_ksoftirqd+0x120/0x23c
> > > [ 4024.938354]  [<000000000017c2aa>] kthread+0xa6/0xb0
> > > [ 4024.938360]  [<000000000061970e>] kernel_thread_starter+0x6/0xc
> > > [ 4024.938366]  [<0000000000619708>] kernel_thread_starter+0x0/0xc
> > > [ 4024.938373] INFO: lockdep is turned off.
> > > 
> > > So we thought we might as well upgrade to 3.1 but immediately got a
> > > 
> > > kernel BUG at block/blk-flush.c:323!
> > > 
> > > which was handled here https://lkml.org/lkml/2011/10/4/105 and
> > > here https://lkml.org/lkml/2011/10/12/408 .
> > > 
> > > But no patches for that one went upstream AFAICS.
> > 
> > Well, all I can say is "hm".  You put only a BUG_ON() in the code, which
> > wasn't triggered, but now we get a completely different oops.  However,
> > I think it does point to the dm barrier handling code.  Can you turn off
> > barriers and see if all oopses go away?
> 
> There are two 3.1-stable fixes from Jeff Moyer that Jens staged for
> Linus to pick up (but seems Jens hasn't sent his 3.2 pull to Linus yet):
> 
> http://git.kernel.dk/?p=linux-block.git;a=commit;h=8f02b3a09b1b7d2a4d24b8cd7008f2a441f19a14
> http://git.kernel.dk/?p=linux-block.git;a=commit;h=f26d8f0562da76731cb049943a0e9d9fa81d946a

Those two fixes would only address the "kernel BUG at block/blk-flush.c:323!" but not the
crash report above, right?
Since looking at the changelog they refer to a patch that went in with 3.1-rc1 while the
crash report above is with 3.0.7. Oh well...

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-10-31 13:40                                     ` Heiko Carstens
@ 2011-10-31 14:01                                       ` Mike Snitzer
  0 siblings, 0 replies; 60+ messages in thread
From: Mike Snitzer @ 2011-10-31 14:01 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: James Bottomley, Jun'ichi Nomura, Steffen Maier, linux-scsi,
	Jens Axboe, Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas, dm-devel

On Mon, Oct 31 2011 at  9:40am -0400,
Heiko Carstens <heiko.carstens@de.ibm.com> wrote:

> On Mon, Oct 31, 2011 at 09:21:58AM -0400, Mike Snitzer wrote:
> > > > It _looks_ like we do not hit the BUG_ON() that. This time we get this instead:
> > > > 
> > > > [ 4024.937870] Unable to handle kernel pointer dereference at virtual kernel address 000003e004d41000
> > > > [ 4024.937886] Oops: 0011 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > > > [ 4024.937899] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [las
> > > > t unloaded: scsi_wait_scan]
> > > > [ 4024.937925] CPU: 1 Not tainted 3.0.7-50.x.20111021-s390xdefault #1
> > > > [ 4024.937930] Process ksoftirqd/1 (pid: 1942, task: 0000000079c6c750, ksp: 0000000073adfc50)
> > > > [ 4024.937936] Krnl PSW : 0704000180000000 000003e00126263a (dm_softirq_done+0x72/0x140 [dm_mod])
> > > > [ 4024.937959]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
> > > > [ 4024.937966] Krnl GPRS: 000000007b9156b0 000003e004d41100 000000000e14b600 000000000000006d
> > > > [ 4024.937971]            00000000715332b0 000000000c140ce8 000000000090d2ef 0000000000000005
> > > > [ 4024.937977]            0000000000000001 0000000000000101 000000000c140d00 0000000000000000
> > > > [ 4024.937983]            000003e001260000 000003e00126f098 0000000073adfd08 0000000073adfcb8
> > > > [ 4024.938001] Krnl Code: 000003e00126262a: f0a0000407f1        srp     4(11,%r0),2033,0
> > > > [ 4024.938009]            000003e001262630: e31050080004        lg      %r1,8(%r5)
> > > > [ 4024.938017]            000003e001262636: 58b05180            l       %r11,384(%r5)
> > > > [ 4024.938024]           >000003e00126263a: e31010080004        lg      %r1,8(%r1)
> > > > [ 4024.938031]            000003e001262640: e31010500004        lg      %r1,80(%r1)
> > > > [ 4024.938038]            000003e001262646: b9020011            ltgr    %r1,%r1
> > > > [ 4024.938045]            000003e00126264a: a784ffdf            brc     8,3e001262608
> > > > [ 4024.938053]            000003e00126264e: e32050080004        lg      %r2,8(%r5)
> > > > [ 4024.938060] Call Trace:
> > > > [ 4024.938063] ([<070000000040716c>] 0x70000000040716c)
> > > > [ 4024.938069]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> > > > [ 4024.938080]  [<00000000001587c2>] __do_softirq+0xda/0x398
> > > > [ 4024.938088]  [<0000000000158ba0>] run_ksoftirqd+0x120/0x23c
> > > > [ 4024.938095]  [<000000000017c2aa>] kthread+0xa6/0xb0
> > > > [ 4024.938102]  [<000000000061970e>] kernel_thread_starter+0x6/0xc
> > > > [ 4024.938112]  [<0000000000619708>] kernel_thread_starter+0x0/0xc
> > > > [ 4024.938118] INFO: lockdep is turned off.
> > > > [ 4024.938121] Last Breaking-Event-Address:
> > > > [ 4024.938124]  [<000003e001262600>] dm_softirq_done+0x38/0x140 [dm_mod]
> > > > [ 4024.938135]  
> > > > [ 4024.938139] Kernel panic - not syncing: Fatal exception in interrupt
> > > > [ 4024.938144] CPU: 1 Tainted: G      D     3.0.7-50.x.20111021-s390xdefault #1
> > > > [ 4024.938150] Process ksoftirqd/1 (pid: 1942, task: 0000000079c6c750, ksp: 0000000073adfc50)
> > > > [ 4024.938155] 0000000073adf958 0000000073adf8d8 0000000000000002 0000000000000000 
> > > > [ 4024.938164]        0000000073adf978 0000000073adf8f0 0000000073adf8f0 000000000061386a 
> > > > [ 4024.938174]        0000000000000000 0000000000000000 0000000000000005 0000000000100ec6 
> > > > [ 4024.938184]        000000000000000d 000000000000000c 0000000073adf940 0000000000000000 
> > > > [ 4024.938194]        0000000000000000 0000000000100a18 0000000073adf8d8 0000000073adf918 
> > > > [ 4024.938205] Call Trace:
> > > > [ 4024.938208] ([<0000000000100926>] show_trace+0xee/0x144)
> > > > [ 4024.938216]  [<0000000000613694>] panic+0xb0/0x234
> > > > [ 4024.938224]  [<0000000000100ec6>] die+0x15a/0x168
> > > > [ 4024.938230]  [<000000000011fb9e>] do_no_context+0xba/0xf8
> > > > [ 4024.938306]  [<000000000061c074>] do_dat_exception+0x378/0x3e4
> > > > [ 4024.938314]  [<0000000000619e02>] pgm_exit+0x0/0x4
> > > > [ 4024.938319]  [<000003e00126263a>] dm_softirq_done+0x72/0x140 [dm_mod]
> > > > [ 4024.938329] ([<070000000040716c>] 0x70000000040716c)
> > > > [ 4024.938334]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> > > > [ 4024.938341]  [<00000000001587c2>] __do_softirq+0xda/0x398
> > > > [ 4024.938347]  [<0000000000158ba0>] run_ksoftirqd+0x120/0x23c
> > > > [ 4024.938354]  [<000000000017c2aa>] kthread+0xa6/0xb0
> > > > [ 4024.938360]  [<000000000061970e>] kernel_thread_starter+0x6/0xc
> > > > [ 4024.938366]  [<0000000000619708>] kernel_thread_starter+0x0/0xc
> > > > [ 4024.938373] INFO: lockdep is turned off.
> > > > 
> > > > So we thought we might as well upgrade to 3.1 but immediately got a
> > > > 
> > > > kernel BUG at block/blk-flush.c:323!
> > > > 
> > > > which was handled here https://lkml.org/lkml/2011/10/4/105 and
> > > > here https://lkml.org/lkml/2011/10/12/408 .
> > > > 
> > > > But no patches for that one went upstream AFAICS.
> > > 
> > > Well, all I can say is "hm".  You put only a BUG_ON() in the code, which
> > > wasn't triggered, but now we get a completely different oops.  However,
> > > I think it does point to the dm barrier handling code.  Can you turn off
> > > barriers and see if all oopses go away?
> > 
> > There are two 3.1-stable fixes from Jeff Moyer that Jens staged for
> > Linus to pick up (but seems Jens hasn't sent his 3.2 pull to Linus yet):
> > 
> > http://git.kernel.dk/?p=linux-block.git;a=commit;h=8f02b3a09b1b7d2a4d24b8cd7008f2a441f19a14
> > http://git.kernel.dk/?p=linux-block.git;a=commit;h=f26d8f0562da76731cb049943a0e9d9fa81d946a
> 
> Those two fixes would only address the "kernel BUG at block/blk-flush.c:323!" but not the
> crash report above, right?

Right.

> Since looking at the changelog they refer to a patch that went in with 3.1-rc1 while the
> crash report above is with 3.0.7. Oh well...

Good data point.  This is the second request-based DM report I've seen
now with 3.0 (first was with fedora on btrfs and request-based DM).

Will look closer but it should be noted that DM didn't change
significantly in 3.0.  So it is likely a lingering oversight from the
block changes introduced for onstack plugging (from 2.6.39) or some
other change.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-10-31 13:00                                     ` Heiko Carstens
@ 2011-11-02 12:37                                       ` Jun'ichi Nomura
  2011-11-02 12:44                                         ` Hannes Reinecke
  2011-11-02 13:47                                         ` Heiko Carstens
  2011-11-03 18:25                                       ` Mike Snitzer
  1 sibling, 2 replies; 60+ messages in thread
From: Jun'ichi Nomura @ 2011-11-02 12:37 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: James Bottomley, Steffen Maier, linux-scsi, Jens Axboe,
	Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, Mike Snitzer, gmuelas

On 10/31/11 22:00, Heiko Carstens wrote:
> On Mon, Oct 31, 2011 at 08:46:06PM +0900, Jun'ichi Nomura wrote:
>> Hm, dm_softirq_done is generic completion code of original
>> request in dm-multipath.
>> So oops here might be another manifestation of use-after-free.
>>
>> Do you always hit the oops at the same address?
> 
> I think we saw this bug the first time. But before that the scsi
> logging level was higher. Gonzalo is trying to recreate it with
> the same (old) scsi logging level.
> Afterwards we will try with barrier=0.
> 
> Both on v3.0.7 btw.
> 
>> Could you find corresponding source code line for
>> the crashed address, dm_softirq_done+0x72/0x140,
>> and which pointer was invalid?
> 
> It crashes in the inlined function dm_done() when trying to
> dereference tio (aka clone->end_io_data):
> 
> static void dm_done(struct request *clone, int error, bool mapped)
> {
>         int r = error;
>         struct dm_rq_target_io *tio = clone->end_io_data;
>         dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;

Thank you. But, hmm. I have no idea about scenario.

struct dm_rq_target_io is a container of clone request
and clone->end_io_data points to its container.

  struct dm_rq_target_io {
        struct mapped_device *md;
        struct dm_target *ti;
        struct request *orig, clone;
        int error;
        union map_info info;
  };

If clone can be dereferenced, clone->end_io_data should be, too.

-- 
Jun'ichi Nomura, NEC Corporation

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-02 12:37                                       ` Jun'ichi Nomura
@ 2011-11-02 12:44                                         ` Hannes Reinecke
  2011-11-02 13:47                                         ` Heiko Carstens
  1 sibling, 0 replies; 60+ messages in thread
From: Hannes Reinecke @ 2011-11-02 12:44 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Heiko Carstens, James Bottomley, Steffen Maier, linux-scsi,
	Jens Axboe, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, Mike Snitzer, gmuelas

On 11/02/2011 01:37 PM, Jun'ichi Nomura wrote:
> On 10/31/11 22:00, Heiko Carstens wrote:
>> On Mon, Oct 31, 2011 at 08:46:06PM +0900, Jun'ichi Nomura wrote:
>>> Hm, dm_softirq_done is generic completion code of original
>>> request in dm-multipath.
>>> So oops here might be another manifestation of use-after-free.
>>>
>>> Do you always hit the oops at the same address?
>>
>> I think we saw this bug the first time. But before that the scsi
>> logging level was higher. Gonzalo is trying to recreate it with
>> the same (old) scsi logging level.
>> Afterwards we will try with barrier=0.
>>
>> Both on v3.0.7 btw.
>>
>>> Could you find corresponding source code line for
>>> the crashed address, dm_softirq_done+0x72/0x140,
>>> and which pointer was invalid?
>>
>> It crashes in the inlined function dm_done() when trying to
>> dereference tio (aka clone->end_io_data):
>>
>> static void dm_done(struct request *clone, int error, bool mapped)
>> {
>>          int r = error;
>>          struct dm_rq_target_io *tio = clone->end_io_data;
>>          dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
>
> Thank you. But, hmm. I have no idea about scenario.
>
> struct dm_rq_target_io is a container of clone request
> and clone->end_io_data points to its container.
>
>    struct dm_rq_target_io {
>          struct mapped_device *md;
>          struct dm_target *ti;
>          struct request *orig, clone;
>          int error;
>          union map_info info;
>    };
>
> If clone can be dereferenced, clone->end_io_data should be, too.
>
Well, actually it _always_ can be dereferenced.

At the very least we'd need to do an integrity check, ie
if the pointer 'clone->end_io_data' is indeed of the
required type.
More to the point, the end_io_data pointer could've been
assigned to something else; so even though the pointer is set
(which we don't check, either), it might not be pointing
to a 'struct dm_rq_target_io'.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-02 12:37                                       ` Jun'ichi Nomura
  2011-11-02 12:44                                         ` Hannes Reinecke
@ 2011-11-02 13:47                                         ` Heiko Carstens
  2011-11-04  4:07                                           ` Jun'ichi Nomura
  1 sibling, 1 reply; 60+ messages in thread
From: Heiko Carstens @ 2011-11-02 13:47 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: James Bottomley, Steffen Maier, linux-scsi, Jens Axboe,
	Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, Mike Snitzer, gmuelas

On Wed, Nov 02, 2011 at 09:37:06PM +0900, Jun'ichi Nomura wrote:
> On 10/31/11 22:00, Heiko Carstens wrote:
> > On Mon, Oct 31, 2011 at 08:46:06PM +0900, Jun'ichi Nomura wrote:
> >> Hm, dm_softirq_done is generic completion code of original
> >> request in dm-multipath.
> >> So oops here might be another manifestation of use-after-free.
> >>
> >> Do you always hit the oops at the same address?
> > 
> > I think we saw this bug the first time. But before that the scsi
> > logging level was higher. Gonzalo is trying to recreate it with
> > the same (old) scsi logging level.
> > Afterwards we will try with barrier=0.
> > 
> > Both on v3.0.7 btw.
> > 
> >> Could you find corresponding source code line for
> >> the crashed address, dm_softirq_done+0x72/0x140,
> >> and which pointer was invalid?
> > 
> > It crashes in the inlined function dm_done() when trying to
> > dereference tio (aka clone->end_io_data):
> > 
> > static void dm_done(struct request *clone, int error, bool mapped)
> > {
> >         int r = error;
> >         struct dm_rq_target_io *tio = clone->end_io_data;
> >         dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
> 
> Thank you. But, hmm. I have no idea about scenario.
> 
> struct dm_rq_target_io is a container of clone request
> and clone->end_io_data points to its container.
> 
>   struct dm_rq_target_io {
>         struct mapped_device *md;
>         struct dm_target *ti;
>         struct request *orig, clone;
>         int error;
>         union map_info info;
>   };
> 
> If clone can be dereferenced, clone->end_io_data should be, too.

If it helps: the above *ti pointer is the only one that points to
an (invalid) vmalloc area address. Invalid means the page was unmapped
because it was freed because of DEBUG_PAGEALLOC.
All other addresses I followed to get to this one belong to
the 1:1 mapping of the kernel, so no vmalloc involved.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-10-31 13:00                                     ` Heiko Carstens
  2011-11-02 12:37                                       ` Jun'ichi Nomura
@ 2011-11-03 18:25                                       ` Mike Snitzer
  2011-11-04  9:19                                         ` Heiko Carstens
  1 sibling, 1 reply; 60+ messages in thread
From: Mike Snitzer @ 2011-11-03 18:25 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Jun'ichi Nomura, James Bottomley, Steffen Maier, linux-scsi,
	Jens Axboe, Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

On Mon, Oct 31 2011 at  9:00am -0400,
Heiko Carstens <heiko.carstens@de.ibm.com> wrote:

> On Mon, Oct 31, 2011 at 08:46:06PM +0900, Jun'ichi Nomura wrote:
> > Hm, dm_softirq_done is generic completion code of original
> > request in dm-multipath.
> > So oops here might be another manifestation of use-after-free.
> > 
> > Do you always hit the oops at the same address?
> 
> I think we saw this bug the first time. But before that the scsi
> logging level was higher. Gonzalo is trying to recreate it with
> the same (old) scsi logging level.
> Afterwards we will try with barrier=0.
> 
> Both on v3.0.7 btw.
> 
> > Could you find corresponding source code line for
> > the crashed address, dm_softirq_done+0x72/0x140,
> > and which pointer was invalid?
> 
> It crashes in the inlined function dm_done() when trying to
> dereference tio (aka clone->end_io_data):
> 
> static void dm_done(struct request *clone, int error, bool mapped)
> {
>         int r = error;
>         struct dm_rq_target_io *tio = clone->end_io_data;
>         dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;

Hi,

Which underlying storage driver is being used by this multipath device?

Thanks,
Mike

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-02 13:47                                         ` Heiko Carstens
@ 2011-11-04  4:07                                           ` Jun'ichi Nomura
  2011-11-04  9:12                                             ` Heiko Carstens
  0 siblings, 1 reply; 60+ messages in thread
From: Jun'ichi Nomura @ 2011-11-04  4:07 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: James Bottomley, Steffen Maier, linux-scsi, Jens Axboe,
	Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, Mike Snitzer, gmuelas

On 11/02/11 22:47, Heiko Carstens wrote:
> On Wed, Nov 02, 2011 at 09:37:06PM +0900, Jun'ichi Nomura wrote:
>> On 10/31/11 22:00, Heiko Carstens wrote:
>>> On Mon, Oct 31, 2011 at 08:46:06PM +0900, Jun'ichi Nomura wrote:
>>>> Hm, dm_softirq_done is generic completion code of original
>>>> request in dm-multipath.
>>>> So oops here might be another manifestation of use-after-free.
>>>>
>>>> Do you always hit the oops at the same address?
>>>
>>> I think we saw this bug the first time. But before that the scsi
>>> logging level was higher. Gonzalo is trying to recreate it with
>>> the same (old) scsi logging level.
>>> Afterwards we will try with barrier=0.
>>>
>>> Both on v3.0.7 btw.
>>>
>>>> Could you find corresponding source code line for
>>>> the crashed address, dm_softirq_done+0x72/0x140,
>>>> and which pointer was invalid?
>>>
>>> It crashes in the inlined function dm_done() when trying to
>>> dereference tio (aka clone->end_io_data):
>>>
>>> static void dm_done(struct request *clone, int error, bool mapped)
>>> {
>>>         int r = error;
>>>         struct dm_rq_target_io *tio = clone->end_io_data;
>>>         dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
>>
>> Thank you. But, hmm. I have no idea about scenario.
>>
>> struct dm_rq_target_io is a container of clone request
>> and clone->end_io_data points to its container.
>>
>>   struct dm_rq_target_io {
>>         struct mapped_device *md;
>>         struct dm_target *ti;
>>         struct request *orig, clone;
>>         int error;
>>         union map_info info;
>>   };
>>
>> If clone can be dereferenced, clone->end_io_data should be, too.
> 
> If it helps: the above *ti pointer is the only one that points to
> an (invalid) vmalloc area address. Invalid means the page was unmapped
> because it was freed because of DEBUG_PAGEALLOC.
> All other addresses I followed to get to this one belong to
> the 1:1 mapping of the kernel, so no vmalloc involved.

Thanks, ok it was ti which was invalid. Not tio.
ti is a pointer to dm table entry, which is vmalloc-ed.
So it means the dm table was replaced while I/O was in-flight.

dm has a machanism to prevent it: in dm_suspend(),
stop_queue() is called to stop block queue processing
and no new I/O becomes in-flight after that.
Then all in-flight I/Os are waited to be completed or requeued
(dm_wait_for_completion()).
If the wait was successful, the table can become "suspended",
i.e. ready to be replaced.

So ti should be always valid.
Hmm..

-- 
Jun'ichi Nomura, NEC Corporation

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-04  4:07                                           ` Jun'ichi Nomura
@ 2011-11-04  9:12                                             ` Heiko Carstens
  0 siblings, 0 replies; 60+ messages in thread
From: Heiko Carstens @ 2011-11-04  9:12 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: James Bottomley, Steffen Maier, linux-scsi, Jens Axboe,
	Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, Mike Snitzer, gmuelas

On Fri, Nov 04, 2011 at 01:07:53PM +0900, Jun'ichi Nomura wrote:
> Thanks, ok it was ti which was invalid. Not tio.
> ti is a pointer to dm table entry, which is vmalloc-ed.
> So it means the dm table was replaced while I/O was in-flight.
> 
> dm has a machanism to prevent it: in dm_suspend(),
> stop_queue() is called to stop block queue processing
> and no new I/O becomes in-flight after that.
> Then all in-flight I/Os are waited to be completed or requeued
> (dm_wait_for_completion()).
> If the wait was successful, the table can become "suspended",
> i.e. ready to be replaced.
> 
> So ti should be always valid.
> Hmm..

Here is another one (crash at same place in code). Might be of interest
since userspace triggered a call to dm_suspend(). While executing an irq
arrives and the same crash as observed before happened:

[96875.876768] Unable to handle kernel pointer dereference at virtual kernel address 000003e004bcf000
[96875.876773] Oops: 0011 [#2] PREEMPT SMP DEBUG_PAGEALLOC
[96875.876780] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [last unloaded: scsi_wait_scan]
[96875.876800] CPU: 0 Tainted: G      D W   3.0.7-50.x.20111024-s390xdefault #1
[96875.876804] Process kpartx (pid: 36986, task: 0000000058754750, ksp: 000000005874b318)
[96875.876808] Krnl PSW : 0704000180000000 000003e0012e363a (dm_softirq_done+0x72/0x140 [dm_mod])
[96875.876819]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
[96875.876825] Krnl GPRS: 000000007b9156b0 000003e004bcf100 00000000586f2f40 0000000000000065
[96875.876829]            00000000586f3610 00000000717a9f58 0000000000000002 0000000000000005
[96875.876834]            0000000000000000 0400000000000102 00000000717a9f70 0000000000000000
[96875.876838]            000003e0012e1000 000003e0012f0098 000000007a9c7c40 000000007a9c7bf0
[96875.876852] Krnl Code: 000003e0012e362a: f0a0000407f1        srp     4(11,%r0),2033,0
[96875.876858]            000003e0012e3630: e31050080004        lg      %r1,8(%r5)
[96875.876864]            000003e0012e3636: 58b05180            l       %r11,384(%r5)
[96875.876869]           >000003e0012e363a: e31010080004        lg      %r1,8(%r1)
[96875.876875]            000003e0012e3640: e31010500004        lg      %r1,80(%r1)
[96875.876880]            000003e0012e3646: b9020011            ltgr    %r1,%r1
[96875.876885]            000003e0012e364a: a784ffdf            brc     8,3e0012e3608
[96875.876891]            000003e0012e364e: e32050080004        lg      %r2,8(%r5)
[96875.876895] Call Trace:
[96875.876898] ([<070000000040716c>] 0x70000000040716c)
[96875.876902]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
[96875.876909]  [<00000000001587c2>] __do_softirq+0xda/0x398
[96875.876914]  [<000000000010f47e>] do_softirq+0xe2/0xe8
[96875.876919]  [<0000000000158e2c>] irq_exit+0xc8/0xcc
[96875.876924]  [<00000000004ceb48>] do_IRQ+0x910/0x1bfc
[96875.876930]  [<000000000061a164>] io_return+0x0/0x16
[96875.876935]  [<000000000061c2a0>] sub_preempt_count+0x34/0xd4
[96875.876941] ([<000000005874ba38>] 0x5874ba38)
[96875.876945]  [<000000000061918c>] _raw_spin_unlock_irq+0x50/0x7c
[96875.876951]  [<00000000001729ea>] flush_workqueue_prep_cwqs+0x222/0x4dc
[96875.876959]  [<0000000000174718>] flush_workqueue+0x1e8/0x4c0
[96875.876964]  [<000003e0012e63e8>] dm_suspend+0x174/0x384 [dm_mod]
[96875.876973]  [<000003e0012ebe1e>] dev_suspend+0x21e/0x250 [dm_mod]
[96875.876983]  [<000003e0012eccb6>] ctl_ioctl+0x1e2/0x2f4 [dm_mod]
[96875.876992]  [<000003e0012ecdf2>] dm_ctl_ioctl+0x2a/0x38 [dm_mod]
[96875.877000]  [<0000000000291c98>] do_vfs_ioctl+0x94/0x588
[96875.877005]  [<0000000000292220>] SyS_ioctl+0x94/0xac
[96875.877010]  [<0000000000619af2>] sysc_noemu+0x16/0x1c
[96875.877015]  [<000003fffd32f7ca>] 0x3fffd32f7ca
[96875.877019] INFO: lockdep is turned off.
[96875.877022] Last Breaking-Event-Address:
[96875.877025]  [<000003e0012e3600>] dm_softirq_done+0x38/0x140 [dm_mod]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-03 18:25                                       ` Mike Snitzer
@ 2011-11-04  9:19                                         ` Heiko Carstens
  2011-11-04 13:30                                           ` Mike Snitzer
                                                             ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Heiko Carstens @ 2011-11-04  9:19 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jun'ichi Nomura, James Bottomley, Steffen Maier, linux-scsi,
	Jens Axboe, Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

On Thu, Nov 03, 2011 at 02:25:48PM -0400, Mike Snitzer wrote:
> On Mon, Oct 31 2011 at  9:00am -0400,
> Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> 
> > On Mon, Oct 31, 2011 at 08:46:06PM +0900, Jun'ichi Nomura wrote:
> > > Hm, dm_softirq_done is generic completion code of original
> > > request in dm-multipath.
> > > So oops here might be another manifestation of use-after-free.
> > > 
> > > Do you always hit the oops at the same address?
> > 
> > I think we saw this bug the first time. But before that the scsi
> > logging level was higher. Gonzalo is trying to recreate it with
> > the same (old) scsi logging level.
> > Afterwards we will try with barrier=0.
> > 
> > Both on v3.0.7 btw.
> > 
> > > Could you find corresponding source code line for
> > > the crashed address, dm_softirq_done+0x72/0x140,
> > > and which pointer was invalid?
> > 
> > It crashes in the inlined function dm_done() when trying to
> > dereference tio (aka clone->end_io_data):
> > 
> > static void dm_done(struct request *clone, int error, bool mapped)
> > {
> >         int r = error;
> >         struct dm_rq_target_io *tio = clone->end_io_data;
> >         dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
> 
> Hi,
> 
> Which underlying storage driver is being used by this multipath device?

It's the s390 only zfcp device driver.

FWIW, yet another use-after-free crash, this time however in multipath_end_io:

[96875.870593] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
[96875.870602] Oops: 0038 [#1]
[96875.870674] PREEMPT SMP DEBUG_PAGEALLOC
[96875.870683] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [la\
st unloaded: scsi_wait_scan]
[96875.870722] CPU: 2 Tainted: G        W   3.0.7-50.x.20111024-s390xdefault #1
[96875.870728] Process udevd (pid: 36697, task: 0000000072c8a3a8, ksp: 0000000057c43868)
[96875.870732] Krnl PSW : 0704200180000000 000003e001347138 (multipath_end_io+0x50/0x140 [dm_multipath])
[96875.870746]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
[96875.870751] Krnl GPRS: 0000000000000000 000003e000000000 6b6b6b6b6b6b6b6b 00000000717ab940
[96875.870755]            0000000000000000 00000000717abab0 0000000000000002 0700000000000008
[96875.870759]            0000000000000002 0000000000000000 0000000058dd37a8 000000006f845478
[96875.870764]            000003e0012e1000 000000005613d1f0 000000007a737bf0 000000007a737ba0
[96875.870768] Krnl Code: 000003e00134712a: b90200dd            ltgr %r13,%r13
[96875.870793]            000003e00134712e: a7840017            brc 8,3e00134715c
[96875.870800]            000003e001347132: e320d0100004        lg %r2,16(%r13)
[96875.870809]           >000003e001347138: e31020180004        lg %r1,24(%r2)
[96875.870818]            000003e00134713e: e31010580004        lg %r1,88(%r1)
[96875.870827]            000003e001347144: b9020011            ltgr %r1,%r1
[96875.870835]            000003e001347148: a784000a            brc 8,3e00134715c
[96875.870841]            000003e00134714c: 41202018            la %r2,24(%r2)
[96875.870889] Call Trace:
[96875.870892] ([<0700000000000008>] 0x700000000000008)
[96875.870897]  [<000003e0012e3662>] dm_softirq_done+0x9a/0x140 [dm_mod]
[96875.870915]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
[96875.870925]  [<00000000001587c2>] __do_softirq+0xda/0x398
[96875.870932]  [<000000000010f47e>] do_softirq+0xe2/0xe8
[96875.870940]  [<0000000000158e2c>] irq_exit+0xc8/0xcc
[96875.870945]  [<00000000004ceb48>] do_IRQ+0x910/0x1bfc
[96875.870953]  [<000000000061a164>] io_return+0x0/0x16
[96875.870961]  [<000000000019c84e>] lock_acquire+0xd2/0x204
[96875.870969] ([<000000000019c836>] lock_acquire+0xba/0x204)
[96875.870974]  [<0000000000615f8e>] mutex_lock_killable_nested+0x92/0x520
[96875.870983]  [<0000000000292796>] vfs_readdir+0x8a/0xe4
[96875.870992]  [<00000000002928e0>] SyS_getdents+0x60/0xe8
[96875.870999]  [<0000000000619af2>] sysc_noemu+0x16/0x1c
[96875.871024]  [<000003fffd1ec83e>] 0x3fffd1ec83e
[96875.871028] INFO: lockdep is turned off.
[96875.871031] Last Breaking-Event-Address:
[96875.871037]  [<000003e0012e3660>] dm_softirq_done+0x98/0x140 [dm_mod]

static int multipath_end_io(struct dm_target *ti, struct request *clone,
			    int error, union map_info *map_context)
{
	struct multipath *m = ti->private;
	struct dm_mpath_io *mpio = map_context->ptr;
	struct pgpath *pgpath = mpio->pgpath;
	struct path_selector *ps;
	int r;

	r  = do_end_io(m, clone, error, mpio);
	if (pgpath) {
		ps = &pgpath->pg->ps; <--- crashes here
		if (ps->type->end_io)
			ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
	}
	mempool_free(mpio, m->mpio_pool);

	return r;
}

It crashes when trying to derefence pgpath, which was freed. Since we have
SLUB debugging turned on the freed object tells us that it was allocated
via a call to multipath_ctr() and freed via a call to free_priority_group().

FWIW, this reproduction was done with barriers off.

Btw, since I cc'ed you rather late I'm not sure if you are aware of the
test scenario: it's I/O stress with multipathing were paths come and go
all the time. It usually takes quite a few hours before the crashes are
observed.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-04  9:19                                         ` Heiko Carstens
@ 2011-11-04 13:30                                           ` Mike Snitzer
  2011-11-04 13:37                                             ` Hannes Reinecke
                                                               ` (2 more replies)
  2011-11-07 11:30                                           ` Jun'ichi Nomura
  2011-11-09  9:37                                           ` Hannes Reinecke
  2 siblings, 3 replies; 60+ messages in thread
From: Mike Snitzer @ 2011-11-04 13:30 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Jun'ichi Nomura, James Bottomley, Steffen Maier, linux-scsi,
	Jens Axboe, Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

On Fri, Nov 04 2011 at  5:19am -0400,
Heiko Carstens <heiko.carstens@de.ibm.com> wrote:

> On Thu, Nov 03, 2011 at 02:25:48PM -0400, Mike Snitzer wrote:
> > On Mon, Oct 31 2011 at  9:00am -0400,
> > Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> > 
> > > On Mon, Oct 31, 2011 at 08:46:06PM +0900, Jun'ichi Nomura wrote:
> > > > Hm, dm_softirq_done is generic completion code of original
> > > > request in dm-multipath.
> > > > So oops here might be another manifestation of use-after-free.
> > > > 
> > > > Do you always hit the oops at the same address?
> > > 
> > > I think we saw this bug the first time. But before that the scsi
> > > logging level was higher. Gonzalo is trying to recreate it with
> > > the same (old) scsi logging level.
> > > Afterwards we will try with barrier=0.
> > > 
> > > Both on v3.0.7 btw.
> > > 
> > > > Could you find corresponding source code line for
> > > > the crashed address, dm_softirq_done+0x72/0x140,
> > > > and which pointer was invalid?
> > > 
> > > It crashes in the inlined function dm_done() when trying to
> > > dereference tio (aka clone->end_io_data):
> > > 
> > > static void dm_done(struct request *clone, int error, bool mapped)
> > > {
> > >         int r = error;
> > >         struct dm_rq_target_io *tio = clone->end_io_data;
> > >         dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
> > 
> > Hi,
> > 
> > Which underlying storage driver is being used by this multipath device?
> 
> It's the s390 only zfcp device driver.
> 
> FWIW, yet another use-after-free crash, this time however in multipath_end_io:
> 
> [96875.870593] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> [96875.870602] Oops: 0038 [#1]
> [96875.870674] PREEMPT SMP DEBUG_PAGEALLOC
> [96875.870683] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [la\
> st unloaded: scsi_wait_scan]
> [96875.870722] CPU: 2 Tainted: G        W   3.0.7-50.x.20111024-s390xdefault #1
> [96875.870728] Process udevd (pid: 36697, task: 0000000072c8a3a8, ksp: 0000000057c43868)
> [96875.870732] Krnl PSW : 0704200180000000 000003e001347138 (multipath_end_io+0x50/0x140 [dm_multipath])
> [96875.870746]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
> [96875.870751] Krnl GPRS: 0000000000000000 000003e000000000 6b6b6b6b6b6b6b6b 00000000717ab940
> [96875.870755]            0000000000000000 00000000717abab0 0000000000000002 0700000000000008
> [96875.870759]            0000000000000002 0000000000000000 0000000058dd37a8 000000006f845478
> [96875.870764]            000003e0012e1000 000000005613d1f0 000000007a737bf0 000000007a737ba0
> [96875.870768] Krnl Code: 000003e00134712a: b90200dd            ltgr %r13,%r13
> [96875.870793]            000003e00134712e: a7840017            brc 8,3e00134715c
> [96875.870800]            000003e001347132: e320d0100004        lg %r2,16(%r13)
> [96875.870809]           >000003e001347138: e31020180004        lg %r1,24(%r2)
> [96875.870818]            000003e00134713e: e31010580004        lg %r1,88(%r1)
> [96875.870827]            000003e001347144: b9020011            ltgr %r1,%r1
> [96875.870835]            000003e001347148: a784000a            brc 8,3e00134715c
> [96875.870841]            000003e00134714c: 41202018            la %r2,24(%r2)
> [96875.870889] Call Trace:
> [96875.870892] ([<0700000000000008>] 0x700000000000008)
> [96875.870897]  [<000003e0012e3662>] dm_softirq_done+0x9a/0x140 [dm_mod]
> [96875.870915]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> [96875.870925]  [<00000000001587c2>] __do_softirq+0xda/0x398
> [96875.870932]  [<000000000010f47e>] do_softirq+0xe2/0xe8
> [96875.870940]  [<0000000000158e2c>] irq_exit+0xc8/0xcc
> [96875.870945]  [<00000000004ceb48>] do_IRQ+0x910/0x1bfc
> [96875.870953]  [<000000000061a164>] io_return+0x0/0x16
> [96875.870961]  [<000000000019c84e>] lock_acquire+0xd2/0x204
> [96875.870969] ([<000000000019c836>] lock_acquire+0xba/0x204)
> [96875.870974]  [<0000000000615f8e>] mutex_lock_killable_nested+0x92/0x520
> [96875.870983]  [<0000000000292796>] vfs_readdir+0x8a/0xe4
> [96875.870992]  [<00000000002928e0>] SyS_getdents+0x60/0xe8
> [96875.870999]  [<0000000000619af2>] sysc_noemu+0x16/0x1c
> [96875.871024]  [<000003fffd1ec83e>] 0x3fffd1ec83e
> [96875.871028] INFO: lockdep is turned off.
> [96875.871031] Last Breaking-Event-Address:
> [96875.871037]  [<000003e0012e3660>] dm_softirq_done+0x98/0x140 [dm_mod]
> 
> static int multipath_end_io(struct dm_target *ti, struct request *clone,
> 			    int error, union map_info *map_context)
> {
> 	struct multipath *m = ti->private;
> 	struct dm_mpath_io *mpio = map_context->ptr;
> 	struct pgpath *pgpath = mpio->pgpath;
> 	struct path_selector *ps;
> 	int r;
> 
> 	r  = do_end_io(m, clone, error, mpio);
> 	if (pgpath) {
> 		ps = &pgpath->pg->ps; <--- crashes here
> 		if (ps->type->end_io)
> 			ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
> 	}
> 	mempool_free(mpio, m->mpio_pool);
> 
> 	return r;
> }
> 
> It crashes when trying to derefence pgpath, which was freed. Since we have
> SLUB debugging turned on the freed object tells us that it was allocated
> via a call to multipath_ctr() and freed via a call to free_priority_group().
> 
> FWIW, this reproduction was done with barriers off.
> 
> Btw, since I cc'ed you rather late I'm not sure if you are aware of the
> test scenario: it's I/O stress with multipathing were paths come and go
> all the time. It usually takes quite a few hours before the crashes are
> observed.

OK, thanks for the backstory.

That is the same type of testing we've been doing with some partners
for RHEL6.2 with the qla2xxx driver.  They have seen the same crash that
you originally reported here: https://lkml.org/lkml/2011/10/31/64

The really interesting observation that was made is that the qla2xxx
driver was made lockless in RHEL6.2.  We've found that reverting the
qla2xxx lockless changes eliminates the problems seen with it and I/O
stress testing with multipath path failures.

The zfcp driver was also made lockless upstream, via this commit:
e55f875 [SCSI] zfcp: Issue FCP command without holding SCSI host_lock

It would be great if you could try reverting e55f875 and see how your
testing goes.

If doing so resolves the crashes for you then the post mortem on why
these lockless SCSI driver changes are causing such odd multipath
completion failures is going to be "fun" ;)

Mike

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-04 13:30                                           ` Mike Snitzer
@ 2011-11-04 13:37                                             ` Hannes Reinecke
  2011-11-07 11:31                                             ` Jun'ichi Nomura
  2011-11-07 12:23                                             ` Heiko Carstens
  2 siblings, 0 replies; 60+ messages in thread
From: Hannes Reinecke @ 2011-11-04 13:37 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Heiko Carstens, Jun'ichi Nomura, James Bottomley,
	Steffen Maier, linux-scsi, Jens Axboe, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

On 11/04/2011 02:30 PM, Mike Snitzer wrote:
> On Fri, Nov 04 2011 at  5:19am -0400,
> Heiko Carstens<heiko.carstens@de.ibm.com>  wrote:
>
>> On Thu, Nov 03, 2011 at 02:25:48PM -0400, Mike Snitzer wrote:
>>> On Mon, Oct 31 2011 at  9:00am -0400,
>>> Heiko Carstens<heiko.carstens@de.ibm.com>  wrote:
>>>
>>>> On Mon, Oct 31, 2011 at 08:46:06PM +0900, Jun'ichi Nomura wrote:
>>>>> Hm, dm_softirq_done is generic completion code of original
>>>>> request in dm-multipath.
>>>>> So oops here might be another manifestation of use-after-free.
>>>>>
>>>>> Do you always hit the oops at the same address?
>>>>
>>>> I think we saw this bug the first time. But before that the scsi
>>>> logging level was higher. Gonzalo is trying to recreate it with
>>>> the same (old) scsi logging level.
>>>> Afterwards we will try with barrier=0.
>>>>
>>>> Both on v3.0.7 btw.
>>>>
>>>>> Could you find corresponding source code line for
>>>>> the crashed address, dm_softirq_done+0x72/0x140,
>>>>> and which pointer was invalid?
>>>>
>>>> It crashes in the inlined function dm_done() when trying to
>>>> dereference tio (aka clone->end_io_data):
>>>>
>>>> static void dm_done(struct request *clone, int error, bool mapped)
>>>> {
>>>>          int r = error;
>>>>          struct dm_rq_target_io *tio = clone->end_io_data;
>>>>          dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
>>>
>>> Hi,
>>>
>>> Which underlying storage driver is being used by this multipath device?
>>
>> It's the s390 only zfcp device driver.
>>
>> FWIW, yet another use-after-free crash, this time however in multipath_end_io:
>>
>> [96875.870593] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
>> [96875.870602] Oops: 0038 [#1]
>> [96875.870674] PREEMPT SMP DEBUG_PAGEALLOC
>> [96875.870683] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [la\
>> st unloaded: scsi_wait_scan]
>> [96875.870722] CPU: 2 Tainted: G        W   3.0.7-50.x.20111024-s390xdefault #1
>> [96875.870728] Process udevd (pid: 36697, task: 0000000072c8a3a8, ksp: 0000000057c43868)
>> [96875.870732] Krnl PSW : 0704200180000000 000003e001347138 (multipath_end_io+0x50/0x140 [dm_multipath])
>> [96875.870746]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
>> [96875.870751] Krnl GPRS: 0000000000000000 000003e000000000 6b6b6b6b6b6b6b6b 00000000717ab940
>> [96875.870755]            0000000000000000 00000000717abab0 0000000000000002 0700000000000008
>> [96875.870759]            0000000000000002 0000000000000000 0000000058dd37a8 000000006f845478
>> [96875.870764]            000003e0012e1000 000000005613d1f0 000000007a737bf0 000000007a737ba0
>> [96875.870768] Krnl Code: 000003e00134712a: b90200dd            ltgr %r13,%r13
>> [96875.870793]            000003e00134712e: a7840017            brc 8,3e00134715c
>> [96875.870800]            000003e001347132: e320d0100004        lg %r2,16(%r13)
>> [96875.870809]>000003e001347138: e31020180004        lg %r1,24(%r2)
>> [96875.870818]            000003e00134713e: e31010580004        lg %r1,88(%r1)
>> [96875.870827]            000003e001347144: b9020011            ltgr %r1,%r1
>> [96875.870835]            000003e001347148: a784000a            brc 8,3e00134715c
>> [96875.870841]            000003e00134714c: 41202018            la %r2,24(%r2)
>> [96875.870889] Call Trace:
>> [96875.870892] ([<0700000000000008>] 0x700000000000008)
>> [96875.870897]  [<000003e0012e3662>] dm_softirq_done+0x9a/0x140 [dm_mod]
>> [96875.870915]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
>> [96875.870925]  [<00000000001587c2>] __do_softirq+0xda/0x398
>> [96875.870932]  [<000000000010f47e>] do_softirq+0xe2/0xe8
>> [96875.870940]  [<0000000000158e2c>] irq_exit+0xc8/0xcc
>> [96875.870945]  [<00000000004ceb48>] do_IRQ+0x910/0x1bfc
>> [96875.870953]  [<000000000061a164>] io_return+0x0/0x16
>> [96875.870961]  [<000000000019c84e>] lock_acquire+0xd2/0x204
>> [96875.870969] ([<000000000019c836>] lock_acquire+0xba/0x204)
>> [96875.870974]  [<0000000000615f8e>] mutex_lock_killable_nested+0x92/0x520
>> [96875.870983]  [<0000000000292796>] vfs_readdir+0x8a/0xe4
>> [96875.870992]  [<00000000002928e0>] SyS_getdents+0x60/0xe8
>> [96875.870999]  [<0000000000619af2>] sysc_noemu+0x16/0x1c
>> [96875.871024]  [<000003fffd1ec83e>] 0x3fffd1ec83e
>> [96875.871028] INFO: lockdep is turned off.
>> [96875.871031] Last Breaking-Event-Address:
>> [96875.871037]  [<000003e0012e3660>] dm_softirq_done+0x98/0x140 [dm_mod]
>>
>> static int multipath_end_io(struct dm_target *ti, struct request *clone,
>> 			    int error, union map_info *map_context)
>> {
>> 	struct multipath *m = ti->private;
>> 	struct dm_mpath_io *mpio = map_context->ptr;
>> 	struct pgpath *pgpath = mpio->pgpath;
>> 	struct path_selector *ps;
>> 	int r;
>>
>> 	r  = do_end_io(m, clone, error, mpio);
>> 	if (pgpath) {
>> 		ps =&pgpath->pg->ps;<--- crashes here
>> 		if (ps->type->end_io)
>> 			ps->type->end_io(ps,&pgpath->path, mpio->nr_bytes);
>> 	}
>> 	mempool_free(mpio, m->mpio_pool);
>>
>> 	return r;
>> }
>>
>> It crashes when trying to derefence pgpath, which was freed. Since we have
>> SLUB debugging turned on the freed object tells us that it was allocated
>> via a call to multipath_ctr() and freed via a call to free_priority_group().
>>
>> FWIW, this reproduction was done with barriers off.
>>
>> Btw, since I cc'ed you rather late I'm not sure if you are aware of the
>> test scenario: it's I/O stress with multipathing were paths come and go
>> all the time. It usually takes quite a few hours before the crashes are
>> observed.
>
> OK, thanks for the backstory.
>
> That is the same type of testing we've been doing with some partners
> for RHEL6.2 with the qla2xxx driver.  They have seen the same crash that
> you originally reported here: https://lkml.org/lkml/2011/10/31/64
>
> The really interesting observation that was made is that the qla2xxx
> driver was made lockless in RHEL6.2.  We've found that reverting the
> qla2xxx lockless changes eliminates the problems seen with it and I/O
> stress testing with multipath path failures.
>
> The zfcp driver was also made lockless upstream, via this commit:
> e55f875 [SCSI] zfcp: Issue FCP command without holding SCSI host_lock
>
> It would be great if you could try reverting e55f875 and see how your
> testing goes.
>
> If doing so resolves the crashes for you then the post mortem on why
> these lockless SCSI driver changes are causing such odd multipath
> completion failures is going to be "fun" ;)
>
And before anyone asks:

We will _not_ be reverting the lockless changes.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-04  9:19                                         ` Heiko Carstens
  2011-11-04 13:30                                           ` Mike Snitzer
@ 2011-11-07 11:30                                           ` Jun'ichi Nomura
  2011-11-07 15:36                                             ` Mike Snitzer
  2011-11-09  9:37                                           ` Hannes Reinecke
  2 siblings, 1 reply; 60+ messages in thread
From: Jun'ichi Nomura @ 2011-11-07 11:30 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Mike Snitzer, James Bottomley, Steffen Maier, linux-scsi,
	Jens Axboe, Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

On 11/04/11 18:19, Heiko Carstens wrote:
> On Thu, Nov 03, 2011 at 02:25:48PM -0400, Mike Snitzer wrote:
>> On Mon, Oct 31 2011 at  9:00am -0400,
>> Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
>>
>>> On Mon, Oct 31, 2011 at 08:46:06PM +0900, Jun'ichi Nomura wrote:
>>>> Hm, dm_softirq_done is generic completion code of original
>>>> request in dm-multipath.
>>>> So oops here might be another manifestation of use-after-free.
>>>>
>>>> Do you always hit the oops at the same address?
>>>
>>> I think we saw this bug the first time. But before that the scsi
>>> logging level was higher. Gonzalo is trying to recreate it with
>>> the same (old) scsi logging level.
>>> Afterwards we will try with barrier=0.
>>>
>>> Both on v3.0.7 btw.
>>>
>>>> Could you find corresponding source code line for
>>>> the crashed address, dm_softirq_done+0x72/0x140,
>>>> and which pointer was invalid?
>>>
>>> It crashes in the inlined function dm_done() when trying to
>>> dereference tio (aka clone->end_io_data):
>>>
>>> static void dm_done(struct request *clone, int error, bool mapped)
>>> {
>>>         int r = error;
>>>         struct dm_rq_target_io *tio = clone->end_io_data;
>>>         dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
>>
>> Hi,
>>
>> Which underlying storage driver is being used by this multipath device?
> 
> It's the s390 only zfcp device driver.
> 
> FWIW, yet another use-after-free crash, this time however in multipath_end_io:
> 
> [96875.870593] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> [96875.870602] Oops: 0038 [#1]
> [96875.870674] PREEMPT SMP DEBUG_PAGEALLOC
> [96875.870683] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [la\
> st unloaded: scsi_wait_scan]
> [96875.870722] CPU: 2 Tainted: G        W   3.0.7-50.x.20111024-s390xdefault #1
> [96875.870728] Process udevd (pid: 36697, task: 0000000072c8a3a8, ksp: 0000000057c43868)
> [96875.870732] Krnl PSW : 0704200180000000 000003e001347138 (multipath_end_io+0x50/0x140 [dm_multipath])
> [96875.870746]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
> [96875.870751] Krnl GPRS: 0000000000000000 000003e000000000 6b6b6b6b6b6b6b6b 00000000717ab940
> [96875.870755]            0000000000000000 00000000717abab0 0000000000000002 0700000000000008
> [96875.870759]            0000000000000002 0000000000000000 0000000058dd37a8 000000006f845478
> [96875.870764]            000003e0012e1000 000000005613d1f0 000000007a737bf0 000000007a737ba0
> [96875.870768] Krnl Code: 000003e00134712a: b90200dd            ltgr %r13,%r13
> [96875.870793]            000003e00134712e: a7840017            brc 8,3e00134715c
> [96875.870800]            000003e001347132: e320d0100004        lg %r2,16(%r13)
> [96875.870809]           >000003e001347138: e31020180004        lg %r1,24(%r2)
> [96875.870818]            000003e00134713e: e31010580004        lg %r1,88(%r1)
> [96875.870827]            000003e001347144: b9020011            ltgr %r1,%r1
> [96875.870835]            000003e001347148: a784000a            brc 8,3e00134715c
> [96875.870841]            000003e00134714c: 41202018            la %r2,24(%r2)
> [96875.870889] Call Trace:
> [96875.870892] ([<0700000000000008>] 0x700000000000008)
> [96875.870897]  [<000003e0012e3662>] dm_softirq_done+0x9a/0x140 [dm_mod]
> [96875.870915]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> [96875.870925]  [<00000000001587c2>] __do_softirq+0xda/0x398
> [96875.870932]  [<000000000010f47e>] do_softirq+0xe2/0xe8
> [96875.870940]  [<0000000000158e2c>] irq_exit+0xc8/0xcc
> [96875.870945]  [<00000000004ceb48>] do_IRQ+0x910/0x1bfc
> [96875.870953]  [<000000000061a164>] io_return+0x0/0x16
> [96875.870961]  [<000000000019c84e>] lock_acquire+0xd2/0x204
> [96875.870969] ([<000000000019c836>] lock_acquire+0xba/0x204)
> [96875.870974]  [<0000000000615f8e>] mutex_lock_killable_nested+0x92/0x520
> [96875.870983]  [<0000000000292796>] vfs_readdir+0x8a/0xe4
> [96875.870992]  [<00000000002928e0>] SyS_getdents+0x60/0xe8
> [96875.870999]  [<0000000000619af2>] sysc_noemu+0x16/0x1c
> [96875.871024]  [<000003fffd1ec83e>] 0x3fffd1ec83e
> [96875.871028] INFO: lockdep is turned off.
> [96875.871031] Last Breaking-Event-Address:
> [96875.871037]  [<000003e0012e3660>] dm_softirq_done+0x98/0x140 [dm_mod]
> 
> static int multipath_end_io(struct dm_target *ti, struct request *clone,
> 			    int error, union map_info *map_context)
> {
> 	struct multipath *m = ti->private;
> 	struct dm_mpath_io *mpio = map_context->ptr;
> 	struct pgpath *pgpath = mpio->pgpath;
> 	struct path_selector *ps;
> 	int r;
> 
> 	r  = do_end_io(m, clone, error, mpio);
> 	if (pgpath) {
> 		ps = &pgpath->pg->ps; <--- crashes here
> 		if (ps->type->end_io)
> 			ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
> 	}
> 	mempool_free(mpio, m->mpio_pool);
> 
> 	return r;
> }
> 
> It crashes when trying to derefence pgpath, which was freed. Since we have
> SLUB debugging turned on the freed object tells us that it was allocated
> via a call to multipath_ctr() and freed via a call to free_priority_group().

struct pgpath is freed before dm_target when tearing down dm table.
So if the problematic completion was being done after freeing pgpath
but before freeing dm_target, crash would look like that
and what's happening seems the same for these dm crashes:
dm table was somehow destroyed while I/O was in-flight.

It's interesting that your test started to crash in dm with v3.0.7.
Have you gotten these dm crashes with v3.0.6 or before?
Have you hit the initially-reported scsi oops with v3.0.7?
Are your v3.0.6 and v3.0.7 compiled with same config and the tests
ran on same system?

> 
> FWIW, this reproduction was done with barriers off.
> 
> Btw, since I cc'ed you rather late I'm not sure if you are aware of the
> test scenario: it's I/O stress with multipathing were paths come and go
> all the time. It usually takes quite a few hours before the crashes are
> observed.

-- 
Jun'ichi Nomura, NEC Corporation

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-04 13:30                                           ` Mike Snitzer
  2011-11-04 13:37                                             ` Hannes Reinecke
@ 2011-11-07 11:31                                             ` Jun'ichi Nomura
  2011-11-07 13:42                                               ` Mike Snitzer
  2011-11-07 12:23                                             ` Heiko Carstens
  2 siblings, 1 reply; 60+ messages in thread
From: Jun'ichi Nomura @ 2011-11-07 11:31 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Heiko Carstens, James Bottomley, Steffen Maier, linux-scsi,
	Jens Axboe, Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

On 11/04/11 22:30, Mike Snitzer wrote:
> On Fri, Nov 04 2011 at  5:19am -0400,
> Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> 
>> On Thu, Nov 03, 2011 at 02:25:48PM -0400, Mike Snitzer wrote:
>>> On Mon, Oct 31 2011 at  9:00am -0400,
>>> Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
>>>
>>>> On Mon, Oct 31, 2011 at 08:46:06PM +0900, Jun'ichi Nomura wrote:
>>>>> Hm, dm_softirq_done is generic completion code of original
>>>>> request in dm-multipath.
>>>>> So oops here might be another manifestation of use-after-free.
>>>>>
>>>>> Do you always hit the oops at the same address?
>>>>
>>>> I think we saw this bug the first time. But before that the scsi
>>>> logging level was higher. Gonzalo is trying to recreate it with
>>>> the same (old) scsi logging level.
>>>> Afterwards we will try with barrier=0.
>>>>
>>>> Both on v3.0.7 btw.
>>>>
>>>>> Could you find corresponding source code line for
>>>>> the crashed address, dm_softirq_done+0x72/0x140,
>>>>> and which pointer was invalid?
>>>>
>>>> It crashes in the inlined function dm_done() when trying to
>>>> dereference tio (aka clone->end_io_data):
>>>>
>>>> static void dm_done(struct request *clone, int error, bool mapped)
>>>> {
>>>>         int r = error;
>>>>         struct dm_rq_target_io *tio = clone->end_io_data;
>>>>         dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
>>>
>>> Hi,
>>>
>>> Which underlying storage driver is being used by this multipath device?
>>
>> It's the s390 only zfcp device driver.
>>
>> FWIW, yet another use-after-free crash, this time however in multipath_end_io:
>>
>> [96875.870593] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
>> [96875.870602] Oops: 0038 [#1]
>> [96875.870674] PREEMPT SMP DEBUG_PAGEALLOC
>> [96875.870683] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [la\
>> st unloaded: scsi_wait_scan]
>> [96875.870722] CPU: 2 Tainted: G        W   3.0.7-50.x.20111024-s390xdefault #1
>> [96875.870728] Process udevd (pid: 36697, task: 0000000072c8a3a8, ksp: 0000000057c43868)
>> [96875.870732] Krnl PSW : 0704200180000000 000003e001347138 (multipath_end_io+0x50/0x140 [dm_multipath])
>> [96875.870746]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
>> [96875.870751] Krnl GPRS: 0000000000000000 000003e000000000 6b6b6b6b6b6b6b6b 00000000717ab940
>> [96875.870755]            0000000000000000 00000000717abab0 0000000000000002 0700000000000008
>> [96875.870759]            0000000000000002 0000000000000000 0000000058dd37a8 000000006f845478
>> [96875.870764]            000003e0012e1000 000000005613d1f0 000000007a737bf0 000000007a737ba0
>> [96875.870768] Krnl Code: 000003e00134712a: b90200dd            ltgr %r13,%r13
>> [96875.870793]            000003e00134712e: a7840017            brc 8,3e00134715c
>> [96875.870800]            000003e001347132: e320d0100004        lg %r2,16(%r13)
>> [96875.870809]           >000003e001347138: e31020180004        lg %r1,24(%r2)
>> [96875.870818]            000003e00134713e: e31010580004        lg %r1,88(%r1)
>> [96875.870827]            000003e001347144: b9020011            ltgr %r1,%r1
>> [96875.870835]            000003e001347148: a784000a            brc 8,3e00134715c
>> [96875.870841]            000003e00134714c: 41202018            la %r2,24(%r2)
>> [96875.870889] Call Trace:
>> [96875.870892] ([<0700000000000008>] 0x700000000000008)
>> [96875.870897]  [<000003e0012e3662>] dm_softirq_done+0x9a/0x140 [dm_mod]
>> [96875.870915]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
>> [96875.870925]  [<00000000001587c2>] __do_softirq+0xda/0x398
>> [96875.870932]  [<000000000010f47e>] do_softirq+0xe2/0xe8
>> [96875.870940]  [<0000000000158e2c>] irq_exit+0xc8/0xcc
>> [96875.870945]  [<00000000004ceb48>] do_IRQ+0x910/0x1bfc
>> [96875.870953]  [<000000000061a164>] io_return+0x0/0x16
>> [96875.870961]  [<000000000019c84e>] lock_acquire+0xd2/0x204
>> [96875.870969] ([<000000000019c836>] lock_acquire+0xba/0x204)
>> [96875.870974]  [<0000000000615f8e>] mutex_lock_killable_nested+0x92/0x520
>> [96875.870983]  [<0000000000292796>] vfs_readdir+0x8a/0xe4
>> [96875.870992]  [<00000000002928e0>] SyS_getdents+0x60/0xe8
>> [96875.870999]  [<0000000000619af2>] sysc_noemu+0x16/0x1c
>> [96875.871024]  [<000003fffd1ec83e>] 0x3fffd1ec83e
>> [96875.871028] INFO: lockdep is turned off.
>> [96875.871031] Last Breaking-Event-Address:
>> [96875.871037]  [<000003e0012e3660>] dm_softirq_done+0x98/0x140 [dm_mod]
>>
>> static int multipath_end_io(struct dm_target *ti, struct request *clone,
>> 			    int error, union map_info *map_context)
>> {
>> 	struct multipath *m = ti->private;
>> 	struct dm_mpath_io *mpio = map_context->ptr;
>> 	struct pgpath *pgpath = mpio->pgpath;
>> 	struct path_selector *ps;
>> 	int r;
>>
>> 	r  = do_end_io(m, clone, error, mpio);
>> 	if (pgpath) {
>> 		ps = &pgpath->pg->ps; <--- crashes here
>> 		if (ps->type->end_io)
>> 			ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
>> 	}
>> 	mempool_free(mpio, m->mpio_pool);
>>
>> 	return r;
>> }
>>
>> It crashes when trying to derefence pgpath, which was freed. Since we have
>> SLUB debugging turned on the freed object tells us that it was allocated
>> via a call to multipath_ctr() and freed via a call to free_priority_group().
>>
>> FWIW, this reproduction was done with barriers off.
>>
>> Btw, since I cc'ed you rather late I'm not sure if you are aware of the
>> test scenario: it's I/O stress with multipathing were paths come and go
>> all the time. It usually takes quite a few hours before the crashes are
>> observed.
> 
> OK, thanks for the backstory.
> 
> That is the same type of testing we've been doing with some partners
> for RHEL6.2 with the qla2xxx driver.  They have seen the same crash that
> you originally reported here: https://lkml.org/lkml/2011/10/31/64
> 
> The really interesting observation that was made is that the qla2xxx
> driver was made lockless in RHEL6.2.  We've found that reverting the
> qla2xxx lockless changes eliminates the problems seen with it and I/O
> stress testing with multipath path failures.

It's interesting.
Have you analyzed how the lockless change caused the dm crash?

> The zfcp driver was also made lockless upstream, via this commit:
> e55f875 [SCSI] zfcp: Issue FCP command without holding SCSI host_lock
> 
> It would be great if you could try reverting e55f875 and see how your
> testing goes.
> 
> If doing so resolves the crashes for you then the post mortem on why
> these lockless SCSI driver changes are causing such odd multipath
> completion failures is going to be "fun" ;)

-- 
Jun'ichi Nomura, NEC Corporation

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-04 13:30                                           ` Mike Snitzer
  2011-11-04 13:37                                             ` Hannes Reinecke
  2011-11-07 11:31                                             ` Jun'ichi Nomura
@ 2011-11-07 12:23                                             ` Heiko Carstens
  2 siblings, 0 replies; 60+ messages in thread
From: Heiko Carstens @ 2011-11-07 12:23 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jun'ichi Nomura, James Bottomley, Steffen Maier, linux-scsi,
	Jens Axboe, Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

On Fri, Nov 04, 2011 at 09:30:52AM -0400, Mike Snitzer wrote:
> > FWIW, yet another use-after-free crash, this time however in multipath_end_io:
> > 
> > [96875.870593] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> > [96875.870602] Oops: 0038 [#1]
> > [96875.870674] PREEMPT SMP DEBUG_PAGEALLOC
> > [96875.870683] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [la\
> > st unloaded: scsi_wait_scan]
> > [96875.870722] CPU: 2 Tainted: G        W   3.0.7-50.x.20111024-s390xdefault #1
> > [96875.870728] Process udevd (pid: 36697, task: 0000000072c8a3a8, ksp: 0000000057c43868)
> > [96875.870732] Krnl PSW : 0704200180000000 000003e001347138 (multipath_end_io+0x50/0x140 [dm_multipath])
> > [96875.870746]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
> > [96875.870751] Krnl GPRS: 0000000000000000 000003e000000000 6b6b6b6b6b6b6b6b 00000000717ab940
> > [96875.870755]            0000000000000000 00000000717abab0 0000000000000002 0700000000000008
> > [96875.870759]            0000000000000002 0000000000000000 0000000058dd37a8 000000006f845478
> > [96875.870764]            000003e0012e1000 000000005613d1f0 000000007a737bf0 000000007a737ba0
> > [96875.870768] Krnl Code: 000003e00134712a: b90200dd            ltgr %r13,%r13
> > [96875.870793]            000003e00134712e: a7840017            brc 8,3e00134715c
> > [96875.870800]            000003e001347132: e320d0100004        lg %r2,16(%r13)
> > [96875.870809]           >000003e001347138: e31020180004        lg %r1,24(%r2)
> > [96875.870818]            000003e00134713e: e31010580004        lg %r1,88(%r1)
> > [96875.870827]            000003e001347144: b9020011            ltgr %r1,%r1
> > [96875.870835]            000003e001347148: a784000a            brc 8,3e00134715c
> > [96875.870841]            000003e00134714c: 41202018            la %r2,24(%r2)
> > [96875.870889] Call Trace:
> > [96875.870892] ([<0700000000000008>] 0x700000000000008)
> > [96875.870897]  [<000003e0012e3662>] dm_softirq_done+0x9a/0x140 [dm_mod]
> > [96875.870915]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> > [96875.870925]  [<00000000001587c2>] __do_softirq+0xda/0x398
> > [96875.870932]  [<000000000010f47e>] do_softirq+0xe2/0xe8
> > [96875.870940]  [<0000000000158e2c>] irq_exit+0xc8/0xcc
> > [96875.870945]  [<00000000004ceb48>] do_IRQ+0x910/0x1bfc
> > [96875.870953]  [<000000000061a164>] io_return+0x0/0x16
> > [96875.870961]  [<000000000019c84e>] lock_acquire+0xd2/0x204
> > [96875.870969] ([<000000000019c836>] lock_acquire+0xba/0x204)
> > [96875.870974]  [<0000000000615f8e>] mutex_lock_killable_nested+0x92/0x520
> > [96875.870983]  [<0000000000292796>] vfs_readdir+0x8a/0xe4
> > [96875.870992]  [<00000000002928e0>] SyS_getdents+0x60/0xe8
> > [96875.870999]  [<0000000000619af2>] sysc_noemu+0x16/0x1c
> > [96875.871024]  [<000003fffd1ec83e>] 0x3fffd1ec83e
> > [96875.871028] INFO: lockdep is turned off.
> > [96875.871031] Last Breaking-Event-Address:
> > [96875.871037]  [<000003e0012e3660>] dm_softirq_done+0x98/0x140 [dm_mod]

[...]

> OK, thanks for the backstory.
> 
> That is the same type of testing we've been doing with some partners
> for RHEL6.2 with the qla2xxx driver.  They have seen the same crash that
> you originally reported here: https://lkml.org/lkml/2011/10/31/64
> 
> The really interesting observation that was made is that the qla2xxx
> driver was made lockless in RHEL6.2.  We've found that reverting the
> qla2xxx lockless changes eliminates the problems seen with it and I/O
> stress testing with multipath path failures.
> 
> The zfcp driver was also made lockless upstream, via this commit:
> e55f875 [SCSI] zfcp: Issue FCP command without holding SCSI host_lock
> 
> It would be great if you could try reverting e55f875 and see how your
> testing goes.

Ok, we did that, and of course it sometimes runs into deadlocks (since
the commit fixed a deadlock), however we still see the same crash in
multipath_end_io() from above.

> If doing so resolves the crashes for you then the post mortem on why
> these lockless SCSI driver changes are causing such odd multipath
> completion failures is going to be "fun" ;)

Must be something different then...

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-07 11:31                                             ` Jun'ichi Nomura
@ 2011-11-07 13:42                                               ` Mike Snitzer
  0 siblings, 0 replies; 60+ messages in thread
From: Mike Snitzer @ 2011-11-07 13:42 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Heiko Carstens, James Bottomley, Steffen Maier, linux-scsi,
	Jens Axboe, Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

On Mon, Nov 07 2011 at  6:31am -0500,
Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote:

> On 11/04/11 22:30, Mike Snitzer wrote:
> > On Fri, Nov 04 2011 at  5:19am -0400,
> > Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> > 
> >> On Thu, Nov 03, 2011 at 02:25:48PM -0400, Mike Snitzer wrote:
> >>> On Mon, Oct 31 2011 at  9:00am -0400,
> >>> Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> >>>
> >>>> On Mon, Oct 31, 2011 at 08:46:06PM +0900, Jun'ichi Nomura wrote:
> >>>>> Hm, dm_softirq_done is generic completion code of original
> >>>>> request in dm-multipath.
> >>>>> So oops here might be another manifestation of use-after-free.
> >>>>>
> >>>>> Do you always hit the oops at the same address?
> >>>>
> >>>> I think we saw this bug the first time. But before that the scsi
> >>>> logging level was higher. Gonzalo is trying to recreate it with
> >>>> the same (old) scsi logging level.
> >>>> Afterwards we will try with barrier=0.
> >>>>
> >>>> Both on v3.0.7 btw.
> >>>>
> >>>>> Could you find corresponding source code line for
> >>>>> the crashed address, dm_softirq_done+0x72/0x140,
> >>>>> and which pointer was invalid?
> >>>>
> >>>> It crashes in the inlined function dm_done() when trying to
> >>>> dereference tio (aka clone->end_io_data):
> >>>>
> >>>> static void dm_done(struct request *clone, int error, bool mapped)
> >>>> {
> >>>>         int r = error;
> >>>>         struct dm_rq_target_io *tio = clone->end_io_data;
> >>>>         dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
> >>>
> >>> Hi,
> >>>
> >>> Which underlying storage driver is being used by this multipath device?
> >>
> >> It's the s390 only zfcp device driver.
> >>
> >> FWIW, yet another use-after-free crash, this time however in multipath_end_io:
> >>
> >> [96875.870593] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> >> [96875.870602] Oops: 0038 [#1]
> >> [96875.870674] PREEMPT SMP DEBUG_PAGEALLOC
> >> [96875.870683] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [la\
> >> st unloaded: scsi_wait_scan]
> >> [96875.870722] CPU: 2 Tainted: G        W   3.0.7-50.x.20111024-s390xdefault #1
> >> [96875.870728] Process udevd (pid: 36697, task: 0000000072c8a3a8, ksp: 0000000057c43868)
> >> [96875.870732] Krnl PSW : 0704200180000000 000003e001347138 (multipath_end_io+0x50/0x140 [dm_multipath])
> >> [96875.870746]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
> >> [96875.870751] Krnl GPRS: 0000000000000000 000003e000000000 6b6b6b6b6b6b6b6b 00000000717ab940
> >> [96875.870755]            0000000000000000 00000000717abab0 0000000000000002 0700000000000008
> >> [96875.870759]            0000000000000002 0000000000000000 0000000058dd37a8 000000006f845478
> >> [96875.870764]            000003e0012e1000 000000005613d1f0 000000007a737bf0 000000007a737ba0
> >> [96875.870768] Krnl Code: 000003e00134712a: b90200dd            ltgr %r13,%r13
> >> [96875.870793]            000003e00134712e: a7840017            brc 8,3e00134715c
> >> [96875.870800]            000003e001347132: e320d0100004        lg %r2,16(%r13)
> >> [96875.870809]           >000003e001347138: e31020180004        lg %r1,24(%r2)
> >> [96875.870818]            000003e00134713e: e31010580004        lg %r1,88(%r1)
> >> [96875.870827]            000003e001347144: b9020011            ltgr %r1,%r1
> >> [96875.870835]            000003e001347148: a784000a            brc 8,3e00134715c
> >> [96875.870841]            000003e00134714c: 41202018            la %r2,24(%r2)
> >> [96875.870889] Call Trace:
> >> [96875.870892] ([<0700000000000008>] 0x700000000000008)
> >> [96875.870897]  [<000003e0012e3662>] dm_softirq_done+0x9a/0x140 [dm_mod]
> >> [96875.870915]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> >> [96875.870925]  [<00000000001587c2>] __do_softirq+0xda/0x398
> >> [96875.870932]  [<000000000010f47e>] do_softirq+0xe2/0xe8
> >> [96875.870940]  [<0000000000158e2c>] irq_exit+0xc8/0xcc
> >> [96875.870945]  [<00000000004ceb48>] do_IRQ+0x910/0x1bfc
> >> [96875.870953]  [<000000000061a164>] io_return+0x0/0x16
> >> [96875.870961]  [<000000000019c84e>] lock_acquire+0xd2/0x204
> >> [96875.870969] ([<000000000019c836>] lock_acquire+0xba/0x204)
> >> [96875.870974]  [<0000000000615f8e>] mutex_lock_killable_nested+0x92/0x520
> >> [96875.870983]  [<0000000000292796>] vfs_readdir+0x8a/0xe4
> >> [96875.870992]  [<00000000002928e0>] SyS_getdents+0x60/0xe8
> >> [96875.870999]  [<0000000000619af2>] sysc_noemu+0x16/0x1c
> >> [96875.871024]  [<000003fffd1ec83e>] 0x3fffd1ec83e
> >> [96875.871028] INFO: lockdep is turned off.
> >> [96875.871031] Last Breaking-Event-Address:
> >> [96875.871037]  [<000003e0012e3660>] dm_softirq_done+0x98/0x140 [dm_mod]
> >>
> >> static int multipath_end_io(struct dm_target *ti, struct request *clone,
> >> 			    int error, union map_info *map_context)
> >> {
> >> 	struct multipath *m = ti->private;
> >> 	struct dm_mpath_io *mpio = map_context->ptr;
> >> 	struct pgpath *pgpath = mpio->pgpath;
> >> 	struct path_selector *ps;
> >> 	int r;
> >>
> >> 	r  = do_end_io(m, clone, error, mpio);
> >> 	if (pgpath) {
> >> 		ps = &pgpath->pg->ps; <--- crashes here
> >> 		if (ps->type->end_io)
> >> 			ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
> >> 	}
> >> 	mempool_free(mpio, m->mpio_pool);
> >>
> >> 	return r;
> >> }
> >>
> >> It crashes when trying to derefence pgpath, which was freed. Since we have
> >> SLUB debugging turned on the freed object tells us that it was allocated
> >> via a call to multipath_ctr() and freed via a call to free_priority_group().
> >>
> >> FWIW, this reproduction was done with barriers off.
> >>
> >> Btw, since I cc'ed you rather late I'm not sure if you are aware of the
> >> test scenario: it's I/O stress with multipathing were paths come and go
> >> all the time. It usually takes quite a few hours before the crashes are
> >> observed.
> > 
> > OK, thanks for the backstory.
> > 
> > That is the same type of testing we've been doing with some partners
> > for RHEL6.2 with the qla2xxx driver.  They have seen the same crash that
> > you originally reported here: https://lkml.org/lkml/2011/10/31/64
> > 
> > The really interesting observation that was made is that the qla2xxx
> > driver was made lockless in RHEL6.2.  We've found that reverting the
> > qla2xxx lockless changes eliminates the problems seen with it and I/O
> > stress testing with multipath path failures.
> 
> It's interesting.
> Have you analyzed how the lockless change caused the dm crash?

Not yet.  It will take me some time (as I'm not overly familiar with the
SCSI lockless changes).  Any assistance would be appreciated.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-07 11:30                                           ` Jun'ichi Nomura
@ 2011-11-07 15:36                                             ` Mike Snitzer
  2011-11-07 16:43                                               ` Heiko Carstens
  2011-11-07 17:10                                               ` Mike Snitzer
  0 siblings, 2 replies; 60+ messages in thread
From: Mike Snitzer @ 2011-11-07 15:36 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Heiko Carstens, James Bottomley, Steffen Maier, linux-scsi,
	Jens Axboe, Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

On Mon, Nov 07 2011 at  6:30am -0500,
Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote:

> On 11/04/11 18:19, Heiko Carstens wrote:
> > On Thu, Nov 03, 2011 at 02:25:48PM -0400, Mike Snitzer wrote:
> >> On Mon, Oct 31 2011 at  9:00am -0400,
> >> Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> >>
> >>> On Mon, Oct 31, 2011 at 08:46:06PM +0900, Jun'ichi Nomura wrote:
> >>>> Hm, dm_softirq_done is generic completion code of original
> >>>> request in dm-multipath.
> >>>> So oops here might be another manifestation of use-after-free.
> >>>>
> >>>> Do you always hit the oops at the same address?
> >>>
> >>> I think we saw this bug the first time. But before that the scsi
> >>> logging level was higher. Gonzalo is trying to recreate it with
> >>> the same (old) scsi logging level.
> >>> Afterwards we will try with barrier=0.
> >>>
> >>> Both on v3.0.7 btw.
> >>>
> >>>> Could you find corresponding source code line for
> >>>> the crashed address, dm_softirq_done+0x72/0x140,
> >>>> and which pointer was invalid?
> >>>
> >>> It crashes in the inlined function dm_done() when trying to
> >>> dereference tio (aka clone->end_io_data):
> >>>
> >>> static void dm_done(struct request *clone, int error, bool mapped)
> >>> {
> >>>         int r = error;
> >>>         struct dm_rq_target_io *tio = clone->end_io_data;
> >>>         dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
> >>
> >> Hi,
> >>
> >> Which underlying storage driver is being used by this multipath device?
> > 
> > It's the s390 only zfcp device driver.
> > 
> > FWIW, yet another use-after-free crash, this time however in multipath_end_io:
> > 
> > [96875.870593] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> > [96875.870602] Oops: 0038 [#1]
> > [96875.870674] PREEMPT SMP DEBUG_PAGEALLOC
> > [96875.870683] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [la\
> > st unloaded: scsi_wait_scan]
> > [96875.870722] CPU: 2 Tainted: G        W   3.0.7-50.x.20111024-s390xdefault #1
> > [96875.870728] Process udevd (pid: 36697, task: 0000000072c8a3a8, ksp: 0000000057c43868)
> > [96875.870732] Krnl PSW : 0704200180000000 000003e001347138 (multipath_end_io+0x50/0x140 [dm_multipath])
> > [96875.870746]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
> > [96875.870751] Krnl GPRS: 0000000000000000 000003e000000000 6b6b6b6b6b6b6b6b 00000000717ab940
> > [96875.870755]            0000000000000000 00000000717abab0 0000000000000002 0700000000000008
> > [96875.870759]            0000000000000002 0000000000000000 0000000058dd37a8 000000006f845478
> > [96875.870764]            000003e0012e1000 000000005613d1f0 000000007a737bf0 000000007a737ba0
> > [96875.870768] Krnl Code: 000003e00134712a: b90200dd            ltgr %r13,%r13
> > [96875.870793]            000003e00134712e: a7840017            brc 8,3e00134715c
> > [96875.870800]            000003e001347132: e320d0100004        lg %r2,16(%r13)
> > [96875.870809]           >000003e001347138: e31020180004        lg %r1,24(%r2)
> > [96875.870818]            000003e00134713e: e31010580004        lg %r1,88(%r1)
> > [96875.870827]            000003e001347144: b9020011            ltgr %r1,%r1
> > [96875.870835]            000003e001347148: a784000a            brc 8,3e00134715c
> > [96875.870841]            000003e00134714c: 41202018            la %r2,24(%r2)
> > [96875.870889] Call Trace:
> > [96875.870892] ([<0700000000000008>] 0x700000000000008)
> > [96875.870897]  [<000003e0012e3662>] dm_softirq_done+0x9a/0x140 [dm_mod]
> > [96875.870915]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> > [96875.870925]  [<00000000001587c2>] __do_softirq+0xda/0x398
> > [96875.870932]  [<000000000010f47e>] do_softirq+0xe2/0xe8
> > [96875.870940]  [<0000000000158e2c>] irq_exit+0xc8/0xcc
> > [96875.870945]  [<00000000004ceb48>] do_IRQ+0x910/0x1bfc
> > [96875.870953]  [<000000000061a164>] io_return+0x0/0x16
> > [96875.870961]  [<000000000019c84e>] lock_acquire+0xd2/0x204
> > [96875.870969] ([<000000000019c836>] lock_acquire+0xba/0x204)
> > [96875.870974]  [<0000000000615f8e>] mutex_lock_killable_nested+0x92/0x520
> > [96875.870983]  [<0000000000292796>] vfs_readdir+0x8a/0xe4
> > [96875.870992]  [<00000000002928e0>] SyS_getdents+0x60/0xe8
> > [96875.870999]  [<0000000000619af2>] sysc_noemu+0x16/0x1c
> > [96875.871024]  [<000003fffd1ec83e>] 0x3fffd1ec83e
> > [96875.871028] INFO: lockdep is turned off.
> > [96875.871031] Last Breaking-Event-Address:
> > [96875.871037]  [<000003e0012e3660>] dm_softirq_done+0x98/0x140 [dm_mod]
> > 
> > static int multipath_end_io(struct dm_target *ti, struct request *clone,
> > 			    int error, union map_info *map_context)
> > {
> > 	struct multipath *m = ti->private;
> > 	struct dm_mpath_io *mpio = map_context->ptr;
> > 	struct pgpath *pgpath = mpio->pgpath;
> > 	struct path_selector *ps;
> > 	int r;
> > 
> > 	r  = do_end_io(m, clone, error, mpio);
> > 	if (pgpath) {
> > 		ps = &pgpath->pg->ps; <--- crashes here
> > 		if (ps->type->end_io)
> > 			ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
> > 	}
> > 	mempool_free(mpio, m->mpio_pool);
> > 
> > 	return r;
> > }
> > 
> > It crashes when trying to derefence pgpath, which was freed. Since we have
> > SLUB debugging turned on the freed object tells us that it was allocated
> > via a call to multipath_ctr() and freed via a call to free_priority_group().
> 
> struct pgpath is freed before dm_target when tearing down dm table.
> So if the problematic completion was being done after freeing pgpath
> but before freeing dm_target, crash would look like that
> and what's happening seems the same for these dm crashes:
> dm table was somehow destroyed while I/O was in-flight.

Could be the block layer's onstack plugging changes are at the heart of
this.

I voiced onstack plugging concerns relative to DM some time ago
(https://lkml.org/lkml/2011/3/9/450) but somehow convinced myself DM was
fine to no longer need dm_table_unplug_all() etc.  Unfortunately I
cannot recall _why_ I felt that was the case.

So DM needs further review relative to block's onstack plugging changes
and DM IO completion.

> It's interesting that your test started to crash in dm with v3.0.7.
> Have you gotten these dm crashes with v3.0.6 or before?
> Have you hit the initially-reported scsi oops with v3.0.7?
> Are your v3.0.6 and v3.0.7 compiled with same config and the tests
> ran on same system?

If all 3.0.x fail: it would be interesting to know if 2.6.39 (which
introduced the onstack plugging) also has these problems.  Testing with
2.6.38 would be very insightful because it obviously doesn't have any of
the onstack plugging churn.

Mike

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-07 15:36                                             ` Mike Snitzer
@ 2011-11-07 16:43                                               ` Heiko Carstens
  2011-11-07 17:10                                               ` Mike Snitzer
  1 sibling, 0 replies; 60+ messages in thread
From: Heiko Carstens @ 2011-11-07 16:43 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jun'ichi Nomura, James Bottomley, Steffen Maier, linux-scsi,
	Jens Axboe, Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

On Mon, Nov 07, 2011 at 10:36:49AM -0500, Mike Snitzer wrote:
> > struct pgpath is freed before dm_target when tearing down dm table.
> > So if the problematic completion was being done after freeing pgpath
> > but before freeing dm_target, crash would look like that
> > and what's happening seems the same for these dm crashes:
> > dm table was somehow destroyed while I/O was in-flight.
> 
> Could be the block layer's onstack plugging changes are at the heart of
> this.
> 
> I voiced onstack plugging concerns relative to DM some time ago
> (https://lkml.org/lkml/2011/3/9/450) but somehow convinced myself DM was
> fine to no longer need dm_table_unplug_all() etc.  Unfortunately I
> cannot recall _why_ I felt that was the case.
> 
> So DM needs further review relative to block's onstack plugging changes
> and DM IO completion.
> 
> > It's interesting that your test started to crash in dm with v3.0.7.
> > Have you gotten these dm crashes with v3.0.6 or before?
> > Have you hit the initially-reported scsi oops with v3.0.7?
> > Are your v3.0.6 and v3.0.7 compiled with same config and the tests
> > ran on same system?
> 
> If all 3.0.x fail: it would be interesting to know if 2.6.39 (which
> introduced the onstack plugging) also has these problems.

At least 2.6.39.1 failed as well, as reported already in June. At least
the crash looked similar back then:

http://marc.info/?l=linux-scsi&m=130824013229933&w=2

> Testing with
> 2.6.38 would be very insightful because it obviously doesn't have any of
> the onstack plugging churn.

Ok, we might try 2.6.38.8 (latest stable) and see what happens.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-07 15:36                                             ` Mike Snitzer
  2011-11-07 16:43                                               ` Heiko Carstens
@ 2011-11-07 17:10                                               ` Mike Snitzer
  2011-11-07 21:44                                                 ` Mike Snitzer
  1 sibling, 1 reply; 60+ messages in thread
From: Mike Snitzer @ 2011-11-07 17:10 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Heiko Carstens, James Bottomley, Steffen Maier, linux-scsi,
	Jens Axboe, Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas, dm-devel

On Mon, Nov 07 2011 at 10:36am -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Nov 07 2011 at  6:30am -0500,
> Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote:
> 
> > On 11/04/11 18:19, Heiko Carstens wrote:
> > > It's the s390 only zfcp device driver.
> > > 
> > > FWIW, yet another use-after-free crash, this time however in multipath_end_io:
> > > 
> > > [96875.870593] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> > > [96875.870602] Oops: 0038 [#1]
> > > [96875.870674] PREEMPT SMP DEBUG_PAGEALLOC
> > > [96875.870683] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [la\
> > > st unloaded: scsi_wait_scan]
> > > [96875.870722] CPU: 2 Tainted: G        W   3.0.7-50.x.20111024-s390xdefault #1
> > > [96875.870728] Process udevd (pid: 36697, task: 0000000072c8a3a8, ksp: 0000000057c43868)
> > > [96875.870732] Krnl PSW : 0704200180000000 000003e001347138 (multipath_end_io+0x50/0x140 [dm_multipath])
> > > [96875.870746]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
> > > [96875.870751] Krnl GPRS: 0000000000000000 000003e000000000 6b6b6b6b6b6b6b6b 00000000717ab940
> > > [96875.870755]            0000000000000000 00000000717abab0 0000000000000002 0700000000000008
> > > [96875.870759]            0000000000000002 0000000000000000 0000000058dd37a8 000000006f845478
> > > [96875.870764]            000003e0012e1000 000000005613d1f0 000000007a737bf0 000000007a737ba0
> > > [96875.870768] Krnl Code: 000003e00134712a: b90200dd            ltgr %r13,%r13
> > > [96875.870793]            000003e00134712e: a7840017            brc 8,3e00134715c
> > > [96875.870800]            000003e001347132: e320d0100004        lg %r2,16(%r13)
> > > [96875.870809]           >000003e001347138: e31020180004        lg %r1,24(%r2)
> > > [96875.870818]            000003e00134713e: e31010580004        lg %r1,88(%r1)
> > > [96875.870827]            000003e001347144: b9020011            ltgr %r1,%r1
> > > [96875.870835]            000003e001347148: a784000a            brc 8,3e00134715c
> > > [96875.870841]            000003e00134714c: 41202018            la %r2,24(%r2)
> > > [96875.870889] Call Trace:
> > > [96875.870892] ([<0700000000000008>] 0x700000000000008)
> > > [96875.870897]  [<000003e0012e3662>] dm_softirq_done+0x9a/0x140 [dm_mod]
> > > [96875.870915]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> > > [96875.870925]  [<00000000001587c2>] __do_softirq+0xda/0x398
> > > [96875.870932]  [<000000000010f47e>] do_softirq+0xe2/0xe8
> > > [96875.870940]  [<0000000000158e2c>] irq_exit+0xc8/0xcc
> > > [96875.870945]  [<00000000004ceb48>] do_IRQ+0x910/0x1bfc
> > > [96875.870953]  [<000000000061a164>] io_return+0x0/0x16
> > > [96875.870961]  [<000000000019c84e>] lock_acquire+0xd2/0x204
> > > [96875.870969] ([<000000000019c836>] lock_acquire+0xba/0x204)
> > > [96875.870974]  [<0000000000615f8e>] mutex_lock_killable_nested+0x92/0x520
> > > [96875.870983]  [<0000000000292796>] vfs_readdir+0x8a/0xe4
> > > [96875.870992]  [<00000000002928e0>] SyS_getdents+0x60/0xe8
> > > [96875.870999]  [<0000000000619af2>] sysc_noemu+0x16/0x1c
> > > [96875.871024]  [<000003fffd1ec83e>] 0x3fffd1ec83e
> > > [96875.871028] INFO: lockdep is turned off.
> > > [96875.871031] Last Breaking-Event-Address:
> > > [96875.871037]  [<000003e0012e3660>] dm_softirq_done+0x98/0x140 [dm_mod]
> > > 
> > > static int multipath_end_io(struct dm_target *ti, struct request *clone,
> > > 			    int error, union map_info *map_context)
> > > {
> > > 	struct multipath *m = ti->private;
> > > 	struct dm_mpath_io *mpio = map_context->ptr;
> > > 	struct pgpath *pgpath = mpio->pgpath;
> > > 	struct path_selector *ps;
> > > 	int r;
> > > 
> > > 	r  = do_end_io(m, clone, error, mpio);
> > > 	if (pgpath) {
> > > 		ps = &pgpath->pg->ps; <--- crashes here
> > > 		if (ps->type->end_io)
> > > 			ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
> > > 	}
> > > 	mempool_free(mpio, m->mpio_pool);
> > > 
> > > 	return r;
> > > }
> > > 
> > > It crashes when trying to derefence pgpath, which was freed. Since we have
> > > SLUB debugging turned on the freed object tells us that it was allocated
> > > via a call to multipath_ctr() and freed via a call to free_priority_group().
> > 
> > struct pgpath is freed before dm_target when tearing down dm table.
> > So if the problematic completion was being done after freeing pgpath
> > but before freeing dm_target, crash would look like that
> > and what's happening seems the same for these dm crashes:
> > dm table was somehow destroyed while I/O was in-flight.
> 
> Could be the block layer's onstack plugging changes are at the heart of
> this.
> 
> I voiced onstack plugging concerns relative to DM some time ago
> (https://lkml.org/lkml/2011/3/9/450) but somehow convinced myself DM was
> fine to no longer need dm_table_unplug_all() etc.  Unfortunately I
> cannot recall _why_ I felt that was the case.
> 
> So DM needs further review relative to block's onstack plugging changes
> and DM IO completion.

dm_suspend is performed as part of the DM table reload that is being
done my multipathd during path failure.  Seems DM no longer insures
inflight requests have finished during dm_suspend().

Before onstack plugging (< 2.6.39):
dm_suspend() -> dm_wait_for_completion() -> dm_unplug_all() -> dm_table_unplug_all()

After onstack plugging (>= 2.6.39, commit 7eaceaccab5f40bb):
dm_suspend's call to dm_wait_for_completion() no longer unplugs IO
(dm_unplug_all and dm_table_unplug_all were removed without introducing
a clear equivalent).

Mike

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-07 17:10                                               ` Mike Snitzer
@ 2011-11-07 21:44                                                 ` Mike Snitzer
  0 siblings, 0 replies; 60+ messages in thread
From: Mike Snitzer @ 2011-11-07 21:44 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: Heiko Carstens, James Bottomley, Steffen Maier, linux-scsi,
	Jens Axboe, Hannes Reinecke, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas, dm-devel

On Mon, Nov 07 2011 at 12:10pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Nov 07 2011 at 10:36am -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> > Could be the block layer's onstack plugging changes are at the heart of
> > this.
> > 
> > I voiced onstack plugging concerns relative to DM some time ago
> > (https://lkml.org/lkml/2011/3/9/450) but somehow convinced myself DM was
> > fine to no longer need dm_table_unplug_all() etc.  Unfortunately I
> > cannot recall _why_ I felt that was the case.
> > 
> > So DM needs further review relative to block's onstack plugging changes
> > and DM IO completion.
> 
> dm_suspend is performed as part of the DM table reload that is being
> done my multipathd during path failure.  Seems DM no longer insures
> inflight requests have finished during dm_suspend().
> 
> Before onstack plugging (< 2.6.39):
> dm_suspend() -> dm_wait_for_completion() -> dm_unplug_all() -> dm_table_unplug_all()
> 
> After onstack plugging (>= 2.6.39, commit 7eaceaccab5f40bb):
> dm_suspend's call to dm_wait_for_completion() no longer unplugs IO
> (dm_unplug_all and dm_table_unplug_all were removed without introducing
> a clear equivalent).

If a missing unplug were a concern then dm_wait_for_completion()'s check
for inflight IO would cause dm_suspend() to hang.

So the onstack plugging changes are unlikely to be the reason for this.

Mike

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-04  9:19                                         ` Heiko Carstens
  2011-11-04 13:30                                           ` Mike Snitzer
  2011-11-07 11:30                                           ` Jun'ichi Nomura
@ 2011-11-09  9:37                                           ` Hannes Reinecke
  2011-11-10 16:10                                             ` Heiko Carstens
  2 siblings, 1 reply; 60+ messages in thread
From: Hannes Reinecke @ 2011-11-09  9:37 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Mike Snitzer, Jun'ichi Nomura, James Bottomley,
	Steffen Maier, linux-scsi, Jens Axboe, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

On 11/04/2011 10:19 AM, Heiko Carstens wrote:
> On Thu, Nov 03, 2011 at 02:25:48PM -0400, Mike Snitzer wrote:
>> On Mon, Oct 31 2011 at  9:00am -0400,
>> Heiko Carstens<heiko.carstens@de.ibm.com>  wrote:
>>
>>> On Mon, Oct 31, 2011 at 08:46:06PM +0900, Jun'ichi Nomura wrote:
>>>> Hm, dm_softirq_done is generic completion code of original
>>>> request in dm-multipath.
>>>> So oops here might be another manifestation of use-after-free.
>>>>
>>>> Do you always hit the oops at the same address?
>>>
>>> I think we saw this bug the first time. But before that the scsi
>>> logging level was higher. Gonzalo is trying to recreate it with
>>> the same (old) scsi logging level.
>>> Afterwards we will try with barrier=0.
>>>
>>> Both on v3.0.7 btw.
>>>
>>>> Could you find corresponding source code line for
>>>> the crashed address, dm_softirq_done+0x72/0x140,
>>>> and which pointer was invalid?
>>>
>>> It crashes in the inlined function dm_done() when trying to
>>> dereference tio (aka clone->end_io_data):
>>>
>>> static void dm_done(struct request *clone, int error, bool mapped)
>>> {
>>>          int r = error;
>>>          struct dm_rq_target_io *tio = clone->end_io_data;
>>>          dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
>>
>> Hi,
>>
>> Which underlying storage driver is being used by this multipath device?
>
> It's the s390 only zfcp device driver.
>
> FWIW, yet another use-after-free crash, this time however in multipath_end_io:
>
> [96875.870593] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> [96875.870602] Oops: 0038 [#1]
> [96875.870674] PREEMPT SMP DEBUG_PAGEALLOC
> [96875.870683] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [la\
> st unloaded: scsi_wait_scan]
> [96875.870722] CPU: 2 Tainted: G        W   3.0.7-50.x.20111024-s390xdefault #1
> [96875.870728] Process udevd (pid: 36697, task: 0000000072c8a3a8, ksp: 0000000057c43868)
> [96875.870732] Krnl PSW : 0704200180000000 000003e001347138 (multipath_end_io+0x50/0x140 [dm_multipath])
> [96875.870746]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
> [96875.870751] Krnl GPRS: 0000000000000000 000003e000000000 6b6b6b6b6b6b6b6b 00000000717ab940
> [96875.870755]            0000000000000000 00000000717abab0 0000000000000002 0700000000000008
> [96875.870759]            0000000000000002 0000000000000000 0000000058dd37a8 000000006f845478
> [96875.870764]            000003e0012e1000 000000005613d1f0 000000007a737bf0 000000007a737ba0
> [96875.870768] Krnl Code: 000003e00134712a: b90200dd            ltgr %r13,%r13
> [96875.870793]            000003e00134712e: a7840017            brc 8,3e00134715c
> [96875.870800]            000003e001347132: e320d0100004        lg %r2,16(%r13)
> [96875.870809]>000003e001347138: e31020180004        lg %r1,24(%r2)
> [96875.870818]            000003e00134713e: e31010580004        lg %r1,88(%r1)
> [96875.870827]            000003e001347144: b9020011            ltgr %r1,%r1
> [96875.870835]            000003e001347148: a784000a            brc 8,3e00134715c
> [96875.870841]            000003e00134714c: 41202018            la %r2,24(%r2)
> [96875.870889] Call Trace:
> [96875.870892] ([<0700000000000008>] 0x700000000000008)
> [96875.870897]  [<000003e0012e3662>] dm_softirq_done+0x9a/0x140 [dm_mod]
> [96875.870915]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> [96875.870925]  [<00000000001587c2>] __do_softirq+0xda/0x398
> [96875.870932]  [<000000000010f47e>] do_softirq+0xe2/0xe8
> [96875.870940]  [<0000000000158e2c>] irq_exit+0xc8/0xcc
> [96875.870945]  [<00000000004ceb48>] do_IRQ+0x910/0x1bfc
> [96875.870953]  [<000000000061a164>] io_return+0x0/0x16
> [96875.870961]  [<000000000019c84e>] lock_acquire+0xd2/0x204
> [96875.870969] ([<000000000019c836>] lock_acquire+0xba/0x204)
> [96875.870974]  [<0000000000615f8e>] mutex_lock_killable_nested+0x92/0x520
> [96875.870983]  [<0000000000292796>] vfs_readdir+0x8a/0xe4
> [96875.870992]  [<00000000002928e0>] SyS_getdents+0x60/0xe8
> [96875.870999]  [<0000000000619af2>] sysc_noemu+0x16/0x1c
> [96875.871024]  [<000003fffd1ec83e>] 0x3fffd1ec83e
> [96875.871028] INFO: lockdep is turned off.
> [96875.871031] Last Breaking-Event-Address:
> [96875.871037]  [<000003e0012e3660>] dm_softirq_done+0x98/0x140 [dm_mod]
>
> static int multipath_end_io(struct dm_target *ti, struct request *clone,
> 			    int error, union map_info *map_context)
> {
> 	struct multipath *m = ti->private;
> 	struct dm_mpath_io *mpio = map_context->ptr;
> 	struct pgpath *pgpath = mpio->pgpath;
> 	struct path_selector *ps;
> 	int r;
>
> 	r  = do_end_io(m, clone, error, mpio);
> 	if (pgpath) {
> 		ps =&pgpath->pg->ps;<--- crashes here
> 		if (ps->type->end_io)
> 			ps->type->end_io(ps,&pgpath->path, mpio->nr_bytes);
> 	}
> 	mempool_free(mpio, m->mpio_pool);
>
> 	return r;
> }
>
> It crashes when trying to derefence pgpath, which was freed. Since we have
> SLUB debugging turned on the freed object tells us that it was allocated
> via a call to multipath_ctr() and freed via a call to free_priority_group().
>
> FWIW, this reproduction was done with barriers off.
>
> Btw, since I cc'ed you rather late I'm not sure if you are aware of the
> test scenario: it's I/O stress with multipathing were paths come and go
> all the time. It usually takes quite a few hours before the crashes are
> observed.

Hmm. Just to be on the safe side, could you try this one:

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 5e0090e..e6fad46 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -920,8 +920,10 @@ static int multipath_map(struct dm_target *ti, 
struct reque
st *clone,
         map_context->ptr = mpio;
         clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
         r = map_io(m, clone, mpio, 0);
-       if (r < 0 || r == DM_MAPIO_REQUEUE)
+       if (r < 0 || r == DM_MAPIO_REQUEUE) {
                 mempool_free(mpio, m->mpio_pool);
+               map_context->ptr = NULL;
+       }

         return r;
  }

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-09  9:37                                           ` Hannes Reinecke
@ 2011-11-10 16:10                                             ` Heiko Carstens
  2011-11-17 16:29                                               ` Mike Snitzer
  0 siblings, 1 reply; 60+ messages in thread
From: Heiko Carstens @ 2011-11-10 16:10 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Mike Snitzer, Jun'ichi Nomura, James Bottomley,
	Steffen Maier, linux-scsi, Jens Axboe, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

On Wed, Nov 09, 2011 at 10:37:06AM +0100, Hannes Reinecke wrote:
> >FWIW, yet another use-after-free crash, this time however in multipath_end_io:
> >
> >[96875.870593] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> >[96875.870602] Oops: 0038 [#1]
> >[96875.870674] PREEMPT SMP DEBUG_PAGEALLOC
> >[96875.870683] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [la\
> >st unloaded: scsi_wait_scan]
> >[96875.870722] CPU: 2 Tainted: G        W   3.0.7-50.x.20111024-s390xdefault #1
> >[96875.870728] Process udevd (pid: 36697, task: 0000000072c8a3a8, ksp: 0000000057c43868)
> >[96875.870732] Krnl PSW : 0704200180000000 000003e001347138 (multipath_end_io+0x50/0x140 [dm_multipath])
> >[96875.870746]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
> >[96875.870751] Krnl GPRS: 0000000000000000 000003e000000000 6b6b6b6b6b6b6b6b 00000000717ab940
> >[96875.870755]            0000000000000000 00000000717abab0 0000000000000002 0700000000000008
> >[96875.870759]            0000000000000002 0000000000000000 0000000058dd37a8 000000006f845478
> >[96875.870764]            000003e0012e1000 000000005613d1f0 000000007a737bf0 000000007a737ba0
> >[96875.870768] Krnl Code: 000003e00134712a: b90200dd            ltgr %r13,%r13
> >[96875.870793]            000003e00134712e: a7840017            brc 8,3e00134715c
> >[96875.870800]            000003e001347132: e320d0100004        lg %r2,16(%r13)
> >[96875.870809]>000003e001347138: e31020180004        lg %r1,24(%r2)
> >[96875.870818]            000003e00134713e: e31010580004        lg %r1,88(%r1)
> >[96875.870827]            000003e001347144: b9020011            ltgr %r1,%r1
> >[96875.870835]            000003e001347148: a784000a            brc 8,3e00134715c
> >[96875.870841]            000003e00134714c: 41202018            la %r2,24(%r2)
> >[96875.870889] Call Trace:
> >[96875.870892] ([<0700000000000008>] 0x700000000000008)
> >[96875.870897]  [<000003e0012e3662>] dm_softirq_done+0x9a/0x140 [dm_mod]
> >[96875.870915]  [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> >[96875.870925]  [<00000000001587c2>] __do_softirq+0xda/0x398
> >[96875.870932]  [<000000000010f47e>] do_softirq+0xe2/0xe8
> >[96875.870940]  [<0000000000158e2c>] irq_exit+0xc8/0xcc
> >[96875.870945]  [<00000000004ceb48>] do_IRQ+0x910/0x1bfc
> >[96875.870953]  [<000000000061a164>] io_return+0x0/0x16
> >[96875.870961]  [<000000000019c84e>] lock_acquire+0xd2/0x204
> >[96875.870969] ([<000000000019c836>] lock_acquire+0xba/0x204)
> >[96875.870974]  [<0000000000615f8e>] mutex_lock_killable_nested+0x92/0x520
> >[96875.870983]  [<0000000000292796>] vfs_readdir+0x8a/0xe4
> >[96875.870992]  [<00000000002928e0>] SyS_getdents+0x60/0xe8
> >[96875.870999]  [<0000000000619af2>] sysc_noemu+0x16/0x1c
> >[96875.871024]  [<000003fffd1ec83e>] 0x3fffd1ec83e
> >[96875.871028] INFO: lockdep is turned off.
> >[96875.871031] Last Breaking-Event-Address:
> >[96875.871037]  [<000003e0012e3660>] dm_softirq_done+0x98/0x140 [dm_mod]

[...]

> Hmm. Just to be on the safe side, could you try this one:
> 
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 5e0090e..e6fad46 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -920,8 +920,10 @@ static int multipath_map(struct dm_target *ti,
> struct reque
> st *clone,
>         map_context->ptr = mpio;
>         clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
>         r = map_io(m, clone, mpio, 0);
> -       if (r < 0 || r == DM_MAPIO_REQUEUE)
> +       if (r < 0 || r == DM_MAPIO_REQUEUE) {
>                 mempool_free(mpio, m->mpio_pool);
> +               map_context->ptr = NULL;
> +       }
> 
>         return r;
>  }

With your patch we haven't been able to reproduce the kernel crash until now.
Now we "only" run into I/O stalls, which before your patch we also did. But
repeatedly rebooting and retrying and ignoring the I/O stalls always lead to
a crash.
Gonzalo will run a couple of extra rounds so we can have a feeling if at least
one of the bugs could be fixed with your patch ;)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-10 16:10                                             ` Heiko Carstens
@ 2011-11-17 16:29                                               ` Mike Snitzer
  2011-11-29 12:00                                                 ` Heiko Carstens
  0 siblings, 1 reply; 60+ messages in thread
From: Mike Snitzer @ 2011-11-17 16:29 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Hannes Reinecke, Jun'ichi Nomura, James Bottomley,
	Steffen Maier, linux-scsi, Jens Axboe, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

On Thu, Nov 10 2011 at 11:10am -0500,
Heiko Carstens <heiko.carstens@de.ibm.com> wrote:

> On Wed, Nov 09, 2011 at 10:37:06AM +0100, Hannes Reinecke wrote:

...
 
> > Hmm. Just to be on the safe side, could you try this one:
> > 
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index 5e0090e..e6fad46 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -920,8 +920,10 @@ static int multipath_map(struct dm_target *ti,
> > struct reque
> > st *clone,
> >         map_context->ptr = mpio;
> >         clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
> >         r = map_io(m, clone, mpio, 0);
> > -       if (r < 0 || r == DM_MAPIO_REQUEUE)
> > +       if (r < 0 || r == DM_MAPIO_REQUEUE) {
> >                 mempool_free(mpio, m->mpio_pool);
> > +               map_context->ptr = NULL;
> > +       }
> > 
> >         return r;
> >  }
> 
> With your patch we haven't been able to reproduce the kernel crash until now.
> Now we "only" run into I/O stalls, which before your patch we also did. But
> repeatedly rebooting and retrying and ignoring the I/O stalls always lead to
> a crash.
> Gonzalo will run a couple of extra rounds so we can have a feeling if at least
> one of the bugs could be fixed with your patch ;)

Hi,

Any update after further testing with Hannes' patch?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-17 16:29                                               ` Mike Snitzer
@ 2011-11-29 12:00                                                 ` Heiko Carstens
  2011-11-29 20:18                                                   ` Mike Snitzer
  0 siblings, 1 reply; 60+ messages in thread
From: Heiko Carstens @ 2011-11-29 12:00 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Hannes Reinecke, Jun'ichi Nomura, James Bottomley,
	Steffen Maier, linux-scsi, Jens Axboe, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas

> > > Hmm. Just to be on the safe side, could you try this one:
> > > 
> > > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > > index 5e0090e..e6fad46 100644
> > > --- a/drivers/md/dm-mpath.c
> > > +++ b/drivers/md/dm-mpath.c
> > > @@ -920,8 +920,10 @@ static int multipath_map(struct dm_target *ti,
> > > struct reque
> > > st *clone,
> > >         map_context->ptr = mpio;
> > >         clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
> > >         r = map_io(m, clone, mpio, 0);
> > > -       if (r < 0 || r == DM_MAPIO_REQUEUE)
> > > +       if (r < 0 || r == DM_MAPIO_REQUEUE) {
> > >                 mempool_free(mpio, m->mpio_pool);
> > > +               map_context->ptr = NULL;
> > > +       }
> > > 
> > >         return r;
> > >  }
> > 
> > With your patch we haven't been able to reproduce the kernel crash until now.
> > Now we "only" run into I/O stalls, which before your patch we also did. But
> > repeatedly rebooting and retrying and ignoring the I/O stalls always lead to
> > a crash.
> > Gonzalo will run a couple of extra rounds so we can have a feeling if at least
> > one of the bugs could be fixed with your patch ;)
> 
> Hi,
> 
> Any update after further testing with Hannes' patch?

Sorry for the late update, our internal IBM IMAP servers have been down
for nearly a week :/

So, we were unable to reproduce the original bug with the patch applied
during various runs.
However, we ran into this one instead, which is yet another use-after-free bug
(I need to double check, but I'm quite sure that a freed struct scsi_cmnd
caused this).

[ 4906.683654] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
[ 4906.683662] Oops: 0038 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 4906.683672] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [last unloaded: scsi_wait_scan]
[ 4906.683696] CPU: 3 Not tainted 3.1.0-52.x.20111111-s390xdefault #1
[ 4906.683700] Process flush-252:12 (pid: 2489, task: 0000000072b4a490, ksp: 0000000072f8fb48)
[ 4906.683705] Krnl PSW : 0404200180000000 000000000052a98c (zfcp_fsf_fcp_handler_common+0x3c/0x2f4)
[ 4906.683719]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
[ 4906.683728] Krnl GPRS: 0000000000000000 00000000726dc800 0000000037e1c4e8 0400043100d78e40
[ 4906.683733]            0000000070ccc000 0000000000000010 0700000074b4dcd0 00000000726dc800
[ 4906.683738]            0000000037e1c4e8 070000000d427960 0000000074b4dcd0 0000000070ccc000
[ 4906.683743]            6b6b6b6b6b6b6b6b 0000000000688560 000000000d427980 000000000d427920
[ 4906.683761] Krnl Code: 000000000052a97c: 58502090            l       %r5,144(%r2)
[ 4906.683767]            000000000052a980: e3c010000004        lg      %r12,0(%r1)
[ 4906.683773]            000000000052a986: e34020980004        lg      %r4,152(%r2)
[ 4906.683780]           >000000000052a98c: e330c0000004        lg      %r3,0(%r12)
[ 4906.683786]            000000000052a992: a7510008            tmll    %r5,8
[ 4906.683792]            000000000052a996: e33032080004        lg      %r3,520(%r3)
[ 4906.683798]            000000000052a99c: 58303204            l       %r3,516(%r3)
[ 4906.683803]            000000000052a9a0: a774001c            brc     7,52a9d8
[ 4906.683809] Call Trace:
[ 4906.683811] ([<000000000d427980>] 0xd427980)
[ 4906.683817]  [<000000000052aff2>] zfcp_fsf_fcp_cmnd_handler+0x52/0x448
[ 4906.683824]  [<000000000052c3f8>] zfcp_fsf_req_complete+0x1d8/0x7e4
[ 4906.683829]  [<000000000052ef2c>] zfcp_fsf_reqid_check+0xc4/0x13c
[ 4906.683835]  [<000000000052fe92>] zfcp_qdio_int_resp+0x72/0x1a4
[ 4906.683841]  [<00000000004eb6fe>] qdio_kick_handler+0x12e/0x2e0
[ 4906.683848]  [<00000000004ecfb2>] __tiqdio_inbound_processing+0xea/0xd98
[ 4906.683854]  [<00000000001552f2>] tasklet_action+0xd2/0x29c
[ 4906.683862]  [<00000000001563e2>] __do_softirq+0xda/0x398
[ 4906.683868]  [<000000000010f47e>] do_softirq+0xe2/0xe8
[ 4906.683876]  [<0000000000156a4c>] irq_exit+0xc8/0xcc
[ 4906.683881]  [<00000000004d79fa>] do_IRQ+0x20e/0x320
[ 4906.683889]  [<000000000061de8c>] io_return+0x0/0x16
[ 4906.683897]  [<000000000061cf78>] _raw_spin_unlock_irqrestore+0x98/0xa8
[ 4906.683904] ([<000000000061cf6e>] _raw_spin_unlock_irqrestore+0x8e/0xa8)
[ 4906.683910]  [<0000000000218262>] test_set_page_writeback+0x10e/0x248
[ 4906.683919]  [<00000000002b9254>] __block_write_full_page+0x310/0x5cc
[ 4906.683926]  [<00000000002b9628>] block_write_full_page_endio+0x118/0x168
[ 4906.683932]  [<000000000031050e>] ext3_writeback_writepage+0x1fa/0x28c
[ 4906.683940]  [<0000000000218006>] __writepage+0x2e/0x88
[ 4906.683945]  [<0000000000218be0>] write_cache_pages+0x224/0x600
[ 4906.683951]  [<000000000021901c>] generic_writepages+0x60/0x94
[ 4906.683957]  [<00000000002ace14>] writeback_single_inode+0x13c/0x53c
[ 4906.683964]  [<00000000002adb80>] writeback_sb_inodes+0x1d4/0x2e4
[ 4906.683970]  [<00000000002ae44c>] __writeback_inodes_wb+0xa0/0xec
[ 4906.683976]  [<00000000002ae926>] wb_writeback+0x48e/0x5f8
[ 4906.683981]  [<00000000002af03a>] wb_do_writeback+0x302/0x3ac
[ 4906.683987]  [<00000000002af194>] bdi_writeback_thread+0xb0/0x4e0
[ 4906.683993]  [<000000000017a3ea>] kthread+0xa6/0xb0
[ 4906.683999]  [<000000000061d436>] kernel_thread_starter+0x6/0xc
[ 4906.684005]  [<000000000061d430>] kernel_thread_starter+0x0/0xc
[ 4906.684010] INFO: lockdep is turned off.
[ 4906.684013] Last Breaking-Event-Address:
[ 4906.684016]  [<000000000052afec>] zfcp_fsf_fcp_cmnd_handler+0x4c/0x448

Gonzalo also tried 2.6.38.8 as suggested and ran into this one:

[  292.877936] ------------[ cut here ]------------
[  292.877939] Kernel BUG at 6b6b6b6b6b6b6b6d [verbose debug info unavailable]
[  292.877947] specification exception: 0006 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  292.877956] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [last unloaded: scsi_wait_scan]
[  292.877979] CPU: 1 Not tainted 2.6.38.8 #1
[  292.877982] Process multipathd (pid: 352, task: 000000007bab8000, ksp: 000000007ba3ba00)
[  292.877988] Krnl PSW : 0704000180000000 6b6b6b6b6b6b6b6d (0x6b6b6b6b6b6b6b6d)
[  292.877997]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
[  292.878003] Krnl GPRS: 17c0000000000000 6b6b6b6b6b6b6b6b 0000000078dc49f0 0000000000000000
[  292.878008]            000003c001f6a728 00000000005ec230 00000000738e2910 00000000756d4aa0
[  292.878013]            000003c000000001 000000007ba3bc58 00000000738e2910 00000000738e2a08
[  292.878018]            000003c001f63000 0000000078dc49f0 00000000003e6c0a 000000007ba3bb80
[  292.878024] Krnl Code: Bad PSW.
[  292.878027] Call Trace:
[  292.878030] ([<00000000003e6c0a>] blk_unplug+0x42/0x150)
[  292.878040]  [<000003c001f6a728>] dm_table_unplug_all+0x60/0x10c [dm_mod]
[  292.878060]  [<000003c001f65926>] dm_unplug_all+0x86/0xa8 [dm_mod]
[  292.878069]  [<000003c001f68508>] dm_suspend+0x1a4/0x394 [dm_mod]
[  292.878078]  [<000003c001f6dce6>] dev_suspend+0x21e/0x250 [dm_mod]
[  292.878087]  [<000003c001f6eaa8>] ctl_ioctl+0x1c8/0x28c [dm_mod]
[  292.878096]  [<000003c001f6eb96>] dm_ctl_ioctl+0x2a/0x38 [dm_mod]
[  292.878105]  [<000000000027df74>] do_vfs_ioctl+0x94/0x5b8
[  292.878112]  [<000000000027e52c>] SyS_ioctl+0x94/0xac
[  292.878117]  [<00000000005d8f5e>] sysc_noemu+0x16/0x1c
[  292.878125]  [<000003fffd2097ca>] 0x3fffd2097ca
[  292.878130] INFO: lockdep is turned off.
[  292.878133] Last Breaking-Event-Address:
[  292.878136]  [<00000000003e6c08>] blk_unplug+0x40/0x150


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-29 12:00                                                 ` Heiko Carstens
@ 2011-11-29 20:18                                                   ` Mike Snitzer
  2011-11-30  7:25                                                     ` Hannes Reinecke
  2011-12-12 12:39                                                     ` Heiko Carstens
  0 siblings, 2 replies; 60+ messages in thread
From: Mike Snitzer @ 2011-11-29 20:18 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Hannes Reinecke, Jun'ichi Nomura, James Bottomley,
	Steffen Maier, linux-scsi, Jens Axboe, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas, dm-devel

On Tue, Nov 29 2011 at  7:00am -0500,
Heiko Carstens <heiko.carstens@de.ibm.com> wrote:

> > > > Hmm. Just to be on the safe side, could you try this one:
> > > > 
> > > > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > > > index 5e0090e..e6fad46 100644
> > > > --- a/drivers/md/dm-mpath.c
> > > > +++ b/drivers/md/dm-mpath.c
> > > > @@ -920,8 +920,10 @@ static int multipath_map(struct dm_target *ti,
> > > > struct reque
> > > > st *clone,
> > > >         map_context->ptr = mpio;
> > > >         clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
> > > >         r = map_io(m, clone, mpio, 0);
> > > > -       if (r < 0 || r == DM_MAPIO_REQUEUE)
> > > > +       if (r < 0 || r == DM_MAPIO_REQUEUE) {
> > > >                 mempool_free(mpio, m->mpio_pool);
> > > > +               map_context->ptr = NULL;
> > > > +       }
> > > > 
> > > >         return r;
> > > >  }
> > > 
> > > With your patch we haven't been able to reproduce the kernel crash until now.
> > > Now we "only" run into I/O stalls, which before your patch we also did. But
> > > repeatedly rebooting and retrying and ignoring the I/O stalls always lead to
> > > a crash.
> > > Gonzalo will run a couple of extra rounds so we can have a feeling if at least
> > > one of the bugs could be fixed with your patch ;)
> > 
> > Hi,
> > 
> > Any update after further testing with Hannes' patch?
> 
> Sorry for the late update, our internal IBM IMAP servers have been down
> for nearly a week :/
> 
> So, we were unable to reproduce the original bug with the patch applied
> during various runs.

OK, so it seems to be a benenficial change (and obviously correct to
me).  Hannes, care to formally post your fix to dm-devel so we can get
it in 3.2-rc?

> However, we ran into this one instead, which is yet another use-after-free bug
> (I need to double check, but I'm quite sure that a freed struct scsi_cmnd
> caused this).

OK, yeah something is causing poisoned (POISON_FREE) memory to be used.

> [ 4906.683654] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000

...

> Gonzalo also tried 2.6.38.8 as suggested and ran into this one:
> 
> [  292.877936] ------------[ cut here ]------------
> [  292.877939] Kernel BUG at 6b6b6b6b6b6b6b6d [verbose debug info unavailable]

Again, more poison.

Seems this test is causing us to fall on our face no matter what.
Likely, best to leave this 2.6.38 blk_unplug crash to one side and
continue focusing on latest upstream.

Mike

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-29 20:18                                                   ` Mike Snitzer
@ 2011-11-30  7:25                                                     ` Hannes Reinecke
  2011-12-12 12:39                                                     ` Heiko Carstens
  1 sibling, 0 replies; 60+ messages in thread
From: Hannes Reinecke @ 2011-11-30  7:25 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Heiko Carstens, Jun'ichi Nomura, James Bottomley,
	Steffen Maier, linux-scsi, Jens Axboe, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas, dm-devel

On 11/29/2011 09:18 PM, Mike Snitzer wrote:
> On Tue, Nov 29 2011 at  7:00am -0500,
> Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> 
>>>>> Hmm. Just to be on the safe side, could you try this one:
>>>>>
>>>>> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
>>>>> index 5e0090e..e6fad46 100644
>>>>> --- a/drivers/md/dm-mpath.c
>>>>> +++ b/drivers/md/dm-mpath.c
>>>>> @@ -920,8 +920,10 @@ static int multipath_map(struct dm_target *ti,
>>>>> struct reque
>>>>> st *clone,
>>>>>         map_context->ptr = mpio;
>>>>>         clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
>>>>>         r = map_io(m, clone, mpio, 0);
>>>>> -       if (r < 0 || r == DM_MAPIO_REQUEUE)
>>>>> +       if (r < 0 || r == DM_MAPIO_REQUEUE) {
>>>>>                 mempool_free(mpio, m->mpio_pool);
>>>>> +               map_context->ptr = NULL;
>>>>> +       }
>>>>>
>>>>>         return r;
>>>>>  }
>>>>
>>>> With your patch we haven't been able to reproduce the kernel crash until now.
>>>> Now we "only" run into I/O stalls, which before your patch we also did. But
>>>> repeatedly rebooting and retrying and ignoring the I/O stalls always lead to
>>>> a crash.
>>>> Gonzalo will run a couple of extra rounds so we can have a feeling if at least
>>>> one of the bugs could be fixed with your patch ;)
>>>
>>> Hi,
>>>
>>> Any update after further testing with Hannes' patch?
>>
>> Sorry for the late update, our internal IBM IMAP servers have been down
>> for nearly a week :/
>>
>> So, we were unable to reproduce the original bug with the patch applied
>> during various runs.
> 
> OK, so it seems to be a benenficial change (and obviously correct to
> me).  Hannes, care to formally post your fix to dm-devel so we can get
> it in 3.2-rc?
> 
Yep, will do.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-09-28 19:05                             ` Eric Seppanen
  2011-09-28 19:14                               ` Christoph Hellwig
@ 2011-11-30 10:18                               ` Jens Axboe
  2011-11-30 10:26                                 ` Christoph Hellwig
  1 sibling, 1 reply; 60+ messages in thread
From: Jens Axboe @ 2011-11-30 10:18 UTC (permalink / raw)
  To: Eric Seppanen
  Cc: Christoph Hellwig, Vivek Goyal, James Bottomley, Linus Torvalds,
	Hannes Reinecke, James Bottomley, linux-scsi, Linux Kernel

On 2011-09-28 21:05, Eric Seppanen wrote:
> On Wed, Sep 28, 2011 at 11:16 AM, Christoph Hellwig <hch@infradead.org> wrote:
>> Right now on high iops device queue_lock is the major killer for
>> performance.  It's one major reason (*) why a lot of the high iops devices
>> are all moving to ->make_request, which has other issues.
>>
>> (*) others are struct request allocation and the pointless merge hash
> 
> I agree: queue lock is the worst performance killer when hw can do
>> 100K IOPS per block device.
> 
> Rather than just being chased away from the request queue due to
> performance issues, I could argue there's very little point to having
> a queue for devices that
> (a) have no seek penalty (and always use noop elevator)
> (b) have hardware queues at least as deep as the default request queue
> (c) don't benefit from merging
> 
> (c) is maybe debatable, but if a device can saturate its bus bandwidth
> on 4KB IO, the latency is probably not worth it.

I agree on a+b, but c is definitely more than debatable. I have yet to
see a device saturate its bandwidth on 4KB IOS. So merging on the write
side is always going to be a win.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-30 10:18                               ` Jens Axboe
@ 2011-11-30 10:26                                 ` Christoph Hellwig
  0 siblings, 0 replies; 60+ messages in thread
From: Christoph Hellwig @ 2011-11-30 10:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Eric Seppanen, Christoph Hellwig, Vivek Goyal, James Bottomley,
	Linus Torvalds, Hannes Reinecke, James Bottomley, linux-scsi,
	Linux Kernel

On Wed, Nov 30, 2011 at 11:18:42AM +0100, Jens Axboe wrote:
> I agree on a+b, but c is definitely more than debatable. I have yet to
> see a device saturate its bandwidth on 4KB IOS. So merging on the write
> side is always going to be a win.

We shouldn't submit 4k I/O in the first place unless they truely are
random.  If you look at XFS for example you'd basically never see them.
The only small I/O in a typical workloads are individual btree blocks
and AG header updates, which generally can't be merged anyway.  Data I/O
happens in larger chunks generally, as do reads/writes of inodes
(generally in 32k chunks, with plans to go larger).

Note that the merges in the way the are done currently are one of the
most painful bits of the current request_fn based drivers.  They require
the addition of the struct request data structure, which needs to be
allocated and initialized for every bio, no matter if it's beeing merged
or not, and are the prime reason why make_request_fn and request_fn
style drivers can't operate on the same data structures.

I've been wondering if it would be possible to simply chain bios for
merging, and while it's theoretically possible I'm wondering about
the impact on drivers.  Fortunately scsi drivers generally don't touch
struct request directly, so it might be doable.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-11-29 20:18                                                   ` Mike Snitzer
  2011-11-30  7:25                                                     ` Hannes Reinecke
@ 2011-12-12 12:39                                                     ` Heiko Carstens
  2011-12-13 16:50                                                       ` Mike Snitzer
  1 sibling, 1 reply; 60+ messages in thread
From: Heiko Carstens @ 2011-12-12 12:39 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Hannes Reinecke, Jun'ichi Nomura, James Bottomley,
	Steffen Maier, linux-scsi, Jens Axboe, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas, dm-devel

On Tue, Nov 29, 2011 at 03:18:03PM -0500, Mike Snitzer wrote:
> On Tue, Nov 29 2011 at  7:00am -0500,
> Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> > [ 4906.683654] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> 
> ...
> 
> > Gonzalo also tried 2.6.38.8 as suggested and ran into this one:
> > 
> > [  292.877936] ------------[ cut here ]------------
> > [  292.877939] Kernel BUG at 6b6b6b6b6b6b6b6d [verbose debug info unavailable]
> 
> Again, more poison.
> 
> Seems this test is causing us to fall on our face no matter what.
> Likely, best to leave this 2.6.38 blk_unplug crash to one side and
> continue focusing on latest upstream.

Sorry again, for taking so long to come back. This time however with good news:

With 3.2.0-rc4.00255.g77a7300 we were unable to reproduce any I/O stall or
user-after-free bugs even after nearly 3000 test iterations.

The only patches on top we have are:

two patches from Hannes:
http://www.spinics.net/lists/linux-scsi/msg55112.html
http://www.spinics.net/lists/linux-scsi/msg55413.html

and the patch below from Steffen:

Btw. James, any chance to get this one upstream soon? It should be in your
queue for quite some time already, IIRC.

Subject: [PATCH] zfcp: return early from slave_destroy if slave_alloc returned early

From: Steffen Maier <maier@linux.vnet.ibm.com>

zfcp_scsi_slave_destroy erroneously always tried to finish its task
even if the corresponding previous zfcp_scsi_slave_alloc returned
early. This can lead to kernel page faults on accessing uninitialized
fields of struct zfcp_scsi_dev in zfcp_erp_lun_shutdown_wait. Take the
port field of the struct to determine if slave_alloc returned early.

This zfcp bug is exposed by 4e6c82b (in turn fixing f7c9c6b to be
compatible with 21208ae) which can call slave_destroy for a
corresponding previous slave_alloc that did not finish.

This patch is based on James Bottomley's fix suggestion in
http://www.spinics.net/lists/linux-scsi/msg55449.html.

Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Cc: <stable@kernel.org> #2.6.38+
---

 drivers/s390/scsi/zfcp_scsi.c |    4 ++++
 1 file changed, 4 insertions(+)

diff -urpN linux-2.6/drivers/s390/scsi/zfcp_scsi.c linux-2.6-patched/drivers/s390/scsi/zfcp_scsi.c
--- linux-2.6/drivers/s390/scsi/zfcp_scsi.c	2011-12-01 13:08:32.000000000 +0100
+++ linux-2.6-patched/drivers/s390/scsi/zfcp_scsi.c	2011-12-01 13:08:52.000000000 +0100
@@ -55,6 +55,10 @@ static void zfcp_scsi_slave_destroy(stru
 {
 	struct zfcp_scsi_dev *zfcp_sdev = sdev_to_zfcp(sdev);
 
+	/* if previous slave_alloc returned early, there is nothing to do */
+	if (!zfcp_sdev->port)
+		return;
+
 	zfcp_erp_lun_shutdown_wait(sdev, "scssd_1");
 	put_device(&zfcp_sdev->port->dev);
 }


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue())
  2011-12-12 12:39                                                     ` Heiko Carstens
@ 2011-12-13 16:50                                                       ` Mike Snitzer
  0 siblings, 0 replies; 60+ messages in thread
From: Mike Snitzer @ 2011-12-13 16:50 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Hannes Reinecke, Jun'ichi Nomura, James Bottomley,
	Steffen Maier, linux-scsi, Jens Axboe, Linux Kernel, Alan Stern,
	Thadeu Lima de Souza Cascardo, Taraka R. Bodireddy,
	Seshagiri N. Ippili, Manvanthara B. Puttashankar, Jeff Moyer,
	Shaohua Li, gmuelas, dm-devel

On Mon, Dec 12 2011 at  7:39am -0500,
Heiko Carstens <heiko.carstens@de.ibm.com> wrote:

> On Tue, Nov 29, 2011 at 03:18:03PM -0500, Mike Snitzer wrote:
> > On Tue, Nov 29 2011 at  7:00am -0500,
> > Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> > > [ 4906.683654] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> > 
> > ...
> > 
> > > Gonzalo also tried 2.6.38.8 as suggested and ran into this one:
> > > 
> > > [  292.877936] ------------[ cut here ]------------
> > > [  292.877939] Kernel BUG at 6b6b6b6b6b6b6b6d [verbose debug info unavailable]
> > 
> > Again, more poison.
> > 
> > Seems this test is causing us to fall on our face no matter what.
> > Likely, best to leave this 2.6.38 blk_unplug crash to one side and
> > continue focusing on latest upstream.
> 
> Sorry again, for taking so long to come back. This time however with good news:
> 
> With 3.2.0-rc4.00255.g77a7300 we were unable to reproduce any I/O stall or
> user-after-free bugs even after nearly 3000 test iterations.

Great news, so with an eye towards getting these fixes upstream:

> The only patches on top we have are:
> 
> two patches from Hannes:
> http://www.spinics.net/lists/linux-scsi/msg55112.html

Has that scsi_lib.c patch been posted with a formal patch header?

James, I'm not clear on where I should be looking to see what you have
staged but not yet sent to Linus.  Does such a branch exist in your
scsi-misc-2.6 tree?

> http://www.spinics.net/lists/linux-scsi/msg55413.html

Jun'ichi and Hannes said that additional NULL pointer check is needed:
http://www.redhat.com/archives/dm-devel/2011-December/msg00022.html

Hannes said he'd re-post an updated patch (but hasn't yet).

Mike

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2011-12-13 16:51 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <201109221318.p8MDIXPP020015@pentland.suse.de>
2011-09-28  0:47 ` [PATCH] block: Free queue resources at blk_release_queue() Jens Axboe
2011-09-28  0:55   ` Linus Torvalds
2011-09-28  1:15     ` Jens Axboe
2011-09-28  1:59       ` Linus Torvalds
2011-09-28  2:02         ` Jens Axboe
2011-09-28  4:10         ` James Bottomley
2011-09-28 14:08           ` Jens Axboe
2011-09-28 14:11             ` James Bottomley
2011-09-28 14:14               ` [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue resources at blk_release_queue()) Jens Axboe
2011-09-28 15:22                 ` Linus Torvalds
2011-09-28 15:43                   ` James Bottomley
2011-09-28 17:48                     ` Vivek Goyal
2011-09-28 17:53                       ` Christoph Hellwig
2011-09-28 18:09                         ` Vivek Goyal
2011-09-28 18:16                           ` Christoph Hellwig
2011-09-28 19:05                             ` Eric Seppanen
2011-09-28 19:14                               ` Christoph Hellwig
2011-11-30 10:18                               ` Jens Axboe
2011-11-30 10:26                                 ` Christoph Hellwig
2011-09-28 22:34                             ` Vivek Goyal
2011-09-28 17:59                       ` James Bottomley
2011-10-13 13:09                 ` Steffen Maier
2011-10-14 16:03                   ` James Bottomley
2011-10-17  8:46                     ` Jun'ichi Nomura
2011-10-17 14:06                       ` James Bottomley
2011-10-18 13:31                         ` Jun'ichi Nomura
2011-10-18 15:45                           ` Heiko Carstens
2011-10-18 16:29                             ` James Bottomley
2011-10-31 10:05                               ` Heiko Carstens
2011-10-31 10:42                                 ` James Bottomley
2011-10-31 11:46                                   ` Jun'ichi Nomura
2011-10-31 13:00                                     ` Heiko Carstens
2011-11-02 12:37                                       ` Jun'ichi Nomura
2011-11-02 12:44                                         ` Hannes Reinecke
2011-11-02 13:47                                         ` Heiko Carstens
2011-11-04  4:07                                           ` Jun'ichi Nomura
2011-11-04  9:12                                             ` Heiko Carstens
2011-11-03 18:25                                       ` Mike Snitzer
2011-11-04  9:19                                         ` Heiko Carstens
2011-11-04 13:30                                           ` Mike Snitzer
2011-11-04 13:37                                             ` Hannes Reinecke
2011-11-07 11:31                                             ` Jun'ichi Nomura
2011-11-07 13:42                                               ` Mike Snitzer
2011-11-07 12:23                                             ` Heiko Carstens
2011-11-07 11:30                                           ` Jun'ichi Nomura
2011-11-07 15:36                                             ` Mike Snitzer
2011-11-07 16:43                                               ` Heiko Carstens
2011-11-07 17:10                                               ` Mike Snitzer
2011-11-07 21:44                                                 ` Mike Snitzer
2011-11-09  9:37                                           ` Hannes Reinecke
2011-11-10 16:10                                             ` Heiko Carstens
2011-11-17 16:29                                               ` Mike Snitzer
2011-11-29 12:00                                                 ` Heiko Carstens
2011-11-29 20:18                                                   ` Mike Snitzer
2011-11-30  7:25                                                     ` Hannes Reinecke
2011-12-12 12:39                                                     ` Heiko Carstens
2011-12-13 16:50                                                       ` Mike Snitzer
2011-10-31 13:21                                   ` Mike Snitzer
2011-10-31 13:40                                     ` Heiko Carstens
2011-10-31 14:01                                       ` Mike Snitzer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).