* Deadlocks due to per-process plugging
@ 2012-07-11 13:37 Jan Kara
  2012-07-11 16:05 ` Jeff Moyer
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2012-07-11 13:37 UTC (permalink / raw)
  To: LKML; +Cc: linux-fsdevel, Tejun Heo, Jens Axboe

  Hello,

  we've recently hit a deadlock in our QA runs which is caused by the
per-process plugging code. The problem is as follows:
  process A					process B (kjournald)
  generic_file_aio_write()
    blk_start_plug(&plug);
    ...
    somewhere in here we allocate memory and
    direct reclaim submits buffer X for IO
    ...
    ext3_write_begin()
      ext3_journal_start()
        we need more space in a journal
        so we want to checkpoint old transactions,
        we block waiting for kjournald to commit
        a currently running transaction.
						journal_commit_transaction()
						  wait for IO on buffer X
						  to complete as it is part
						  of the current transaction

  => deadlock since A waits for B and B waits for A to do unplug.
BTW: I don't think this is really ext3/ext4 specific. I think other
filesystems can get into problems as well when direct reclaim submits some
IO and the process subsequently blocks without submitting the IO.

Effectively the per-process plugging introduces a lock dependency from
buffer_lock to any lock acquired after IO submission but before the process'
queue is unplugged. This certainly creates lots of cycles in the lock
dependency graph...

I'm wondering how best to fix this. A trivial fix would be to flush the
IO plug on every schedule(), not just in io_schedule(), but that can have
some performance implications I guess (the effect of plugging would be very
limited). A better (although more tedious) solution would be to push the
plugs from the higher levels down into the filesystems, where they could be
managed so as not to create problematic lock dependencies (but e.g. for ext3/ext4
that means we would have to unplug after writing each page, so it is effectively
rather similar to unplugging on every schedule()).

Thoughts?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Deadlocks due to per-process plugging
  2012-07-11 13:37 Deadlocks due to per-process plugging Jan Kara
@ 2012-07-11 16:05 ` Jeff Moyer
  2012-07-11 20:16   ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: Jeff Moyer @ 2012-07-11 16:05 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-fsdevel, Tejun Heo, Jens Axboe

Jan Kara <jack@suse.cz> writes:

>   Hello,
>
>   we've recently hit a deadlock in our QA runs which is caused by the
> per-process plugging code. The problem is as follows:
>   process A					process B (kjournald)
>   generic_file_aio_write()
>     blk_start_plug(&plug);
>     ...
>     somewhere in here we allocate memory and
>     direct reclaim submits buffer X for IO
>     ...
>     ext3_write_begin()
>       ext3_journal_start()
>         we need more space in a journal
>         so we want to checkpoint old transactions,
>         we block waiting for kjournald to commit
>         a currently running transaction.
> 						journal_commit_transaction()
> 						  wait for IO on buffer X
> 						  to complete as it is part
> 						  of the current transaction
>
>   => deadlock since A waits for B and B waits for A to do unplug.
> BTW: I don't think this is really ext3/ext4 specific. I think other
> filesystems can get into problems as well when direct reclaim submits some
> IO and the process subsequently blocks without submitting the IO.

So, I thought schedule would do the flush.  Checking the code:

asmlinkage void __sched schedule(void)
{
        struct task_struct *tsk = current;

        sched_submit_work(tsk);
        __schedule();
}

And sched_submit_work looks like this:

static inline void sched_submit_work(struct task_struct *tsk)
{
        if (!tsk->state || tsk_is_pi_blocked(tsk))
                return;
        /*
         * If we are going to sleep and we have plugged IO queued,
         * make sure to submit it to avoid deadlocks.
         */
        if (blk_needs_flush_plug(tsk))
                blk_schedule_flush_plug(tsk);
}

This eventually ends in a call to blk_run_queue_async(q) after
submitting the I/O from the plug list.  Right?  So is the question
really why doesn't the kblockd workqueue get scheduled?

Cheers,
Jeff


* Re: Deadlocks due to per-process plugging
  2012-07-11 16:05 ` Jeff Moyer
@ 2012-07-11 20:16   ` Jan Kara
  2012-07-11 22:12     ` Thomas Gleixner
                       ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Jan Kara @ 2012-07-11 20:16 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, LKML, linux-fsdevel, Tejun Heo, Jens Axboe, mgalbraith,
	Thomas Gleixner

On Wed 11-07-12 12:05:51, Jeff Moyer wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> >   Hello,
> >
> >   we've recently hit a deadlock in our QA runs which is caused by the
> > per-process plugging code. The problem is as follows:
> >   process A					process B (kjournald)
> >   generic_file_aio_write()
> >     blk_start_plug(&plug);
> >     ...
> >     somewhere in here we allocate memory and
> >     direct reclaim submits buffer X for IO
> >     ...
> >     ext3_write_begin()
> >       ext3_journal_start()
> >         we need more space in a journal
> >         so we want to checkpoint old transactions,
> >         we block waiting for kjournald to commit
> >         a currently running transaction.
> > 						journal_commit_transaction()
> > 						  wait for IO on buffer X
> > 						  to complete as it is part
> > 						  of the current transaction
> >
> >   => deadlock since A waits for B and B waits for A to do unplug.
> > BTW: I don't think this is really ext3/ext4 specific. I think other
> > filesystems can get into problems as well when direct reclaim submits some
> > IO and the process subsequently blocks without submitting the IO.
> 
> So, I thought schedule would do the flush.  Checking the code:
> 
> asmlinkage void __sched schedule(void)
> {
>         struct task_struct *tsk = current;
> 
>         sched_submit_work(tsk);
>         __schedule();
> }
> 
> And sched_submit_work looks like this:
> 
> static inline void sched_submit_work(struct task_struct *tsk)
> {
>         if (!tsk->state || tsk_is_pi_blocked(tsk))
>                 return;
>         /*
>          * If we are going to sleep and we have plugged IO queued,
>          * make sure to submit it to avoid deadlocks.
>          */
>         if (blk_needs_flush_plug(tsk))
>                 blk_schedule_flush_plug(tsk);
> }
> 
> This eventually ends in a call to blk_run_queue_async(q) after
> submitting the I/O from the plug list.  Right?  So is the question
> really why doesn't the kblockd workqueue get scheduled?
  Ah, I didn't know this. Thanks for the hint. So in the kdump I have, I can
see requests queued in tsk->plug even though the process is sleeping in
TASK_UNINTERRUPTIBLE state.  So the only way the unplug could have been
omitted is if tsk_is_pi_blocked() was true. Rummaging through the dump...
indeed the task has pi_blocked_on = 0xffff8802717d79c8. The dump is from an -rt
kernel (I just didn't originally think that makes any difference), so
actually every mutex is an rtmutex and thus tsk_is_pi_blocked() is true whenever
we are sleeping on a mutex. So this seems like a bug in the rtmutex code.
Thomas, you seem to have added that condition... Any idea how to avoid
the deadlock?

									Honza 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Deadlocks due to per-process plugging
  2012-07-11 20:16   ` Jan Kara
@ 2012-07-11 22:12     ` Thomas Gleixner
  2012-07-12  4:12       ` Mike Galbraith
  2012-07-13 12:38       ` Jan Kara
  2012-07-12  2:07     ` Mike Galbraith
  2012-07-12 14:15     ` Thomas Gleixner
  2 siblings, 2 replies; 33+ messages in thread
From: Thomas Gleixner @ 2012-07-11 22:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe, mgalbraith

On Wed, 11 Jul 2012, Jan Kara wrote:
> On Wed 11-07-12 12:05:51, Jeff Moyer wrote:
> > This eventually ends in a call to blk_run_queue_async(q) after
> > submitting the I/O from the plug list.  Right?  So is the question
> > really why doesn't the kblockd workqueue get scheduled?
>   Ah, I didn't know this. Thanks for the hint. So in the kdump I have I can
> see requests queued in tsk->plug despite the process is sleeping in
> TASK_UNINTERRUPTIBLE state.  So the only way how unplug could have been
> omitted is if tsk_is_pi_blocked() was true. Rummaging through the dump...
> indeed task has pi_blocked_on = 0xffff8802717d79c8. The dump is from an -rt
> kernel (I just didn't originally thought that makes any difference) so
> actually any mutex is rtmutex and thus tsk_is_pi_blocked() is true whenever
> we are sleeping on a mutex. So this seems like a bug in rtmutex code.
> Thomas, you seemed to have added that condition... Any idea how to avoid
> the deadlock?

Mike has sent out a fix related to the plug stuff, which I just posted
for the rt stable series. Can you verify against that ?

Thanks,

	tglx


* Re: Deadlocks due to per-process plugging
  2012-07-11 20:16   ` Jan Kara
  2012-07-11 22:12     ` Thomas Gleixner
@ 2012-07-12  2:07     ` Mike Galbraith
  2012-07-12 14:15     ` Thomas Gleixner
  2 siblings, 0 replies; 33+ messages in thread
From: Mike Galbraith @ 2012-07-12  2:07 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith, Thomas Gleixner

On Wed, 2012-07-11 at 22:16 +0200, Jan Kara wrote: 
> On Wed 11-07-12 12:05:51, Jeff Moyer wrote:
> > Jan Kara <jack@suse.cz> writes:
> > 
> > >   Hello,
> > >
> > >   we've recently hit a deadlock in our QA runs which is caused by the
> > > per-process plugging code. The problem is as follows:
> > >   process A					process B (kjournald)
> > >   generic_file_aio_write()
> > >     blk_start_plug(&plug);
> > >     ...
> > >     somewhere in here we allocate memory and
> > >     direct reclaim submits buffer X for IO
> > >     ...
> > >     ext3_write_begin()
> > >       ext3_journal_start()
> > >         we need more space in a journal
> > >         so we want to checkpoint old transactions,
> > >         we block waiting for kjournald to commit
> > >         a currently running transaction.
> > > 						journal_commit_transaction()
> > > 						  wait for IO on buffer X
> > > 						  to complete as it is part
> > > 						  of the current transaction
> > >
> > >   => deadlock since A waits for B and B waits for A to do unplug.
> > > BTW: I don't think this is really ext3/ext4 specific. I think other
> > > filesystems can get into problems as well when direct reclaim submits some
> > > IO and the process subsequently blocks without submitting the IO.
> > 
> > So, I thought schedule would do the flush.  Checking the code:
> > 
> > asmlinkage void __sched schedule(void)
> > {
> >         struct task_struct *tsk = current;
> > 
> >         sched_submit_work(tsk);
> >         __schedule();
> > }
> > 
> > And sched_submit_work looks like this:
> > 
> > static inline void sched_submit_work(struct task_struct *tsk)
> > {
> >         if (!tsk->state || tsk_is_pi_blocked(tsk))
> >                 return;
> >         /*
> >          * If we are going to sleep and we have plugged IO queued,
> >          * make sure to submit it to avoid deadlocks.
> >          */
> >         if (blk_needs_flush_plug(tsk))
> >                 blk_schedule_flush_plug(tsk);
> > }
> > 
> > This eventually ends in a call to blk_run_queue_async(q) after
> > submitting the I/O from the plug list.  Right?  So is the question
> > really why doesn't the kblockd workqueue get scheduled?
>   Ah, I didn't know this. Thanks for the hint. So in the kdump I have I can
> see requests queued in tsk->plug despite the process is sleeping in
> TASK_UNINTERRUPTIBLE state.  So the only way how unplug could have been
> omitted is if tsk_is_pi_blocked() was true. Rummaging through the dump...
> indeed task has pi_blocked_on = 0xffff8802717d79c8. The dump is from an -rt
> kernel (I just didn't originally thought that makes any difference) so
> actually any mutex is rtmutex and thus tsk_is_pi_blocked() is true whenever
> we are sleeping on a mutex. So this seems like a bug in rtmutex code.
> Thomas, you seemed to have added that condition... Any idea how to avoid
> the deadlock?

Tsk tsk, I completely overlooked sched_submit_work().

-Mike



* Re: Deadlocks due to per-process plugging
  2012-07-11 22:12     ` Thomas Gleixner
@ 2012-07-12  4:12       ` Mike Galbraith
  2012-07-13 12:38       ` Jan Kara
  1 sibling, 0 replies; 33+ messages in thread
From: Mike Galbraith @ 2012-07-12  4:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Thu, 2012-07-12 at 00:12 +0200, Thomas Gleixner wrote: 
> On Wed, 11 Jul 2012, Jan Kara wrote:
> > On Wed 11-07-12 12:05:51, Jeff Moyer wrote:
> > > This eventually ends in a call to blk_run_queue_async(q) after
> > > submitting the I/O from the plug list.  Right?  So is the question
> > > really why doesn't the kblockd workqueue get scheduled?
> >   Ah, I didn't know this. Thanks for the hint. So in the kdump I have I can
> > see requests queued in tsk->plug despite the process is sleeping in
> > TASK_UNINTERRUPTIBLE state.  So the only way how unplug could have been
> > omitted is if tsk_is_pi_blocked() was true. Rummaging through the dump...
> > indeed task has pi_blocked_on = 0xffff8802717d79c8. The dump is from an -rt
> > kernel (I just didn't originally thought that makes any difference) so
> > actually any mutex is rtmutex and thus tsk_is_pi_blocked() is true whenever
> > we are sleeping on a mutex. So this seems like a bug in rtmutex code.
> > Thomas, you seemed to have added that condition... Any idea how to avoid
> > the deadlock?
> 
> Mike has sent out a fix related to the plug stuff, which I just posted
> for the rt stable series. Can you verify against that ?

btw, I called io_schedule() instead of a plain unplug, thinking we're
going to schedule anyway. But if we unplug and schedule, and we're not
leftmost (non-rt task 'course), then while we're away the likely contended
mutex we're about to take may be released, or at least become less
contended. What we won't be doing is accruing sleep time to help trigger
yet more preemption.  Anyone more deserving can move smartly rightward, and
thus out of our way for a bit.

If we're leftmost or rt, all was for naught, but it seemed worth a shot.

-Mike



* Re: Deadlocks due to per-process plugging
  2012-07-11 20:16   ` Jan Kara
  2012-07-11 22:12     ` Thomas Gleixner
  2012-07-12  2:07     ` Mike Galbraith
@ 2012-07-12 14:15     ` Thomas Gleixner
  2012-07-13 12:33       ` Jan Kara
  2 siblings, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2012-07-12 14:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe, mgalbraith

On Wed, 11 Jul 2012, Jan Kara wrote:
> On Wed 11-07-12 12:05:51, Jeff Moyer wrote:
> > Jan Kara <jack@suse.cz> writes:
> > 
> > >   Hello,
> > >
> > >   we've recently hit a deadlock in our QA runs which is caused by the
> > > per-process plugging code. The problem is as follows:
> > >   process A					process B (kjournald)
> > >   generic_file_aio_write()
> > >     blk_start_plug(&plug);
> > >     ...
> > >     somewhere in here we allocate memory and
> > >     direct reclaim submits buffer X for IO
> > >     ...
> > >     ext3_write_begin()
> > >       ext3_journal_start()
> > >         we need more space in a journal
> > >         so we want to checkpoint old transactions,
> > >         we block waiting for kjournald to commit
> > >         a currently running transaction.
> > > 						journal_commit_transaction()
> > > 						  wait for IO on buffer X
> > > 						  to complete as it is part
> > > 						  of the current transaction
> > >
> > >   => deadlock since A waits for B and B waits for A to do unplug.
> > > BTW: I don't think this is really ext3/ext4 specific. I think other
> > > filesystems can get into problems as well when direct reclaim submits some
> > > IO and the process subsequently blocks without submitting the IO.
> > 
> > So, I thought schedule would do the flush.  Checking the code:
> > 
> > asmlinkage void __sched schedule(void)
> > {
> >         struct task_struct *tsk = current;
> > 
> >         sched_submit_work(tsk);
> >         __schedule();
> > }
> > 
> > And sched_submit_work looks like this:
> > 
> > static inline void sched_submit_work(struct task_struct *tsk)
> > {
> >         if (!tsk->state || tsk_is_pi_blocked(tsk))
> >                 return;
> >         /*
> >          * If we are going to sleep and we have plugged IO queued,
> >          * make sure to submit it to avoid deadlocks.
> >          */
> >         if (blk_needs_flush_plug(tsk))
> >                 blk_schedule_flush_plug(tsk);
> > }
> > 
> > This eventually ends in a call to blk_run_queue_async(q) after
> > submitting the I/O from the plug list.  Right?  So is the question
> > really why doesn't the kblockd workqueue get scheduled?

>   Ah, I didn't know this. Thanks for the hint. So in the kdump I have I can
> see requests queued in tsk->plug despite the process is sleeping in
> TASK_UNINTERRUPTIBLE state.  So the only way how unplug could have been
> omitted is if tsk_is_pi_blocked() was true. Rummaging through the dump...
> indeed task has pi_blocked_on = 0xffff8802717d79c8. The dump is from an -rt
> kernel (I just didn't originally thought that makes any difference) so
> actually any mutex is rtmutex and thus tsk_is_pi_blocked() is true whenever
> we are sleeping on a mutex. So this seems like a bug in rtmutex code.

Well, the reason why this check is there is that the task which is
blocked on a lock can hold another lock which might cause a deadlock
in the flush path.

> Thomas, you seemed to have added that condition... Any idea how to avoid
> the deadlock?

Good question. We could do the flush when the blocked task does not
hold a lock itself. Might be worth a try.

Thanks,

	tglx



* Re: Deadlocks due to per-process plugging
  2012-07-12 14:15     ` Thomas Gleixner
@ 2012-07-13 12:33       ` Jan Kara
  2012-07-13 14:25         ` Thomas Gleixner
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2012-07-13 12:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Thu 12-07-12 16:15:29, Thomas Gleixner wrote:
> On Wed, 11 Jul 2012, Jan Kara wrote:
> > On Wed 11-07-12 12:05:51, Jeff Moyer wrote:
> > > Jan Kara <jack@suse.cz> writes:
> > > 
> > > >   Hello,
> > > >
> > > >   we've recently hit a deadlock in our QA runs which is caused by the
> > > > per-process plugging code. The problem is as follows:
> > > >   process A					process B (kjournald)
> > > >   generic_file_aio_write()
> > > >     blk_start_plug(&plug);
> > > >     ...
> > > >     somewhere in here we allocate memory and
> > > >     direct reclaim submits buffer X for IO
> > > >     ...
> > > >     ext3_write_begin()
> > > >       ext3_journal_start()
> > > >         we need more space in a journal
> > > >         so we want to checkpoint old transactions,
> > > >         we block waiting for kjournald to commit
> > > >         a currently running transaction.
> > > > 						journal_commit_transaction()
> > > > 						  wait for IO on buffer X
> > > > 						  to complete as it is part
> > > > 						  of the current transaction
> > > >
> > > >   => deadlock since A waits for B and B waits for A to do unplug.
> > > > BTW: I don't think this is really ext3/ext4 specific. I think other
> > > > filesystems can get into problems as well when direct reclaim submits some
> > > > IO and the process subsequently blocks without submitting the IO.
> > > 
> > > So, I thought schedule would do the flush.  Checking the code:
> > > 
> > > asmlinkage void __sched schedule(void)
> > > {
> > >         struct task_struct *tsk = current;
> > > 
> > >         sched_submit_work(tsk);
> > >         __schedule();
> > > }
> > > 
> > > And sched_submit_work looks like this:
> > > 
> > > static inline void sched_submit_work(struct task_struct *tsk)
> > > {
> > >         if (!tsk->state || tsk_is_pi_blocked(tsk))
> > >                 return;
> > >         /*
> > >          * If we are going to sleep and we have plugged IO queued,
> > >          * make sure to submit it to avoid deadlocks.
> > >          */
> > >         if (blk_needs_flush_plug(tsk))
> > >                 blk_schedule_flush_plug(tsk);
> > > }
> > > 
> > > This eventually ends in a call to blk_run_queue_async(q) after
> > > submitting the I/O from the plug list.  Right?  So is the question
> > > really why doesn't the kblockd workqueue get scheduled?
> 
> >   Ah, I didn't know this. Thanks for the hint. So in the kdump I have I can
> > see requests queued in tsk->plug despite the process is sleeping in
> > TASK_UNINTERRUPTIBLE state.  So the only way how unplug could have been
> > omitted is if tsk_is_pi_blocked() was true. Rummaging through the dump...
> > indeed task has pi_blocked_on = 0xffff8802717d79c8. The dump is from an -rt
> > kernel (I just didn't originally thought that makes any difference) so
> > actually any mutex is rtmutex and thus tsk_is_pi_blocked() is true whenever
> > we are sleeping on a mutex. So this seems like a bug in rtmutex code.
> 
> Well, the reason why this check is there is that the task which is
> blocked on a lock can hold another lock which might cause a deadlock
> in the flush path.
  OK. Let me understand the details. The block layer needs just the
queue_lock for the unplug to succeed. That is a spinlock, but in an RT
kernel even a process holding a spinlock can be preempted, if I remember
correctly. So that condition is there effectively to avoid unplugging when a
task is being scheduled away while holding queue_lock? Did I get it right?

> > Thomas, you seemed to have added that condition... Any idea how to avoid
> > the deadlock?
> 
> Good question. We could do the flush when the blocked task does not
> hold a lock itself. Might be worth a try.
  Yeah, that should work for avoiding the deadlock as well.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Deadlocks due to per-process plugging
  2012-07-11 22:12     ` Thomas Gleixner
  2012-07-12  4:12       ` Mike Galbraith
@ 2012-07-13 12:38       ` Jan Kara
  1 sibling, 0 replies; 33+ messages in thread
From: Jan Kara @ 2012-07-13 12:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Thu 12-07-12 00:12:44, Thomas Gleixner wrote:
> On Wed, 11 Jul 2012, Jan Kara wrote:
> > On Wed 11-07-12 12:05:51, Jeff Moyer wrote:
> > > This eventually ends in a call to blk_run_queue_async(q) after
> > > submitting the I/O from the plug list.  Right?  So is the question
> > > really why doesn't the kblockd workqueue get scheduled?
> >   Ah, I didn't know this. Thanks for the hint. So in the kdump I have I can
> > see requests queued in tsk->plug despite the process is sleeping in
> > TASK_UNINTERRUPTIBLE state.  So the only way how unplug could have been
> > omitted is if tsk_is_pi_blocked() was true. Rummaging through the dump...
> > indeed task has pi_blocked_on = 0xffff8802717d79c8. The dump is from an -rt
> > kernel (I just didn't originally thought that makes any difference) so
> > actually any mutex is rtmutex and thus tsk_is_pi_blocked() is true whenever
> > we are sleeping on a mutex. So this seems like a bug in rtmutex code.
> > Thomas, you seemed to have added that condition... Any idea how to avoid
> > the deadlock?
> 
> Mike has sent out a fix related to the plug stuff, which I just posted
> for the rt stable series. Can you verify against that ?
  Yeah, that fix from Mike makes us unable to reproduce the problem. But
frankly it is a hack and I wouldn't bet a penny that there isn't another
similar problem hiding elsewhere in the code. It would just need different
timing / load to trigger. So I think a better solution needs to be found
(advice from the JBD maintainer TM ;).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Deadlocks due to per-process plugging
  2012-07-13 12:33       ` Jan Kara
@ 2012-07-13 14:25         ` Thomas Gleixner
  2012-07-13 14:46           ` Jan Kara
  2012-07-14 11:00           ` Mike Galbraith
  0 siblings, 2 replies; 33+ messages in thread
From: Thomas Gleixner @ 2012-07-13 14:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe, mgalbraith

On Fri, 13 Jul 2012, Jan Kara wrote:
> On Thu 12-07-12 16:15:29, Thomas Gleixner wrote:
> > >   Ah, I didn't know this. Thanks for the hint. So in the kdump I have I can
> > > see requests queued in tsk->plug despite the process is sleeping in
> > > TASK_UNINTERRUPTIBLE state.  So the only way how unplug could have been
> > > omitted is if tsk_is_pi_blocked() was true. Rummaging through the dump...
> > > indeed task has pi_blocked_on = 0xffff8802717d79c8. The dump is from an -rt
> > > kernel (I just didn't originally thought that makes any difference) so
> > > actually any mutex is rtmutex and thus tsk_is_pi_blocked() is true whenever
> > > we are sleeping on a mutex. So this seems like a bug in rtmutex code.
> > 
> > Well, the reason why this check is there is that the task which is
> > blocked on a lock can hold another lock which might cause a deadlock
> > in the flush path.
>   OK. Let me understand the details. Block layer needs just queue_lock for
> unplug to succeed. That is a spinlock but in RT kernel, even a process
> holding a spinlock can be preempted if I remember correctly. So that
> condition is there effectively to not unplug when a task is being scheduled
> away while holding queue_lock? Did I get it right?

blk_flush_plug_list() is not only queue_lock. There can be other locks
taken in the callbacks, elevator ...

> > > Thomas, you seemed to have added that condition... Any idea how to avoid
> > > the deadlock?
> > 
> > Good question. We could do the flush when the blocked task does not
> > hold a lock itself. Might be worth a try.
>   Yeah, that should work for avoiding the deadlock as well.

Though we don't have a lock-held count except when lockdep is enabled,
which you probably don't want to run on a production system.

But we only care about stuff being scheduled out while blocked on a
"sleeping spinlock" - i.e. spinlock, rwlock.

So the patch below should allow the unplug to take place when blocked
on mutexes etc.

Thanks,

	tglx
----
Index: linux-stable-rt/include/linux/sched.h
===================================================================
--- linux-stable-rt.orig/include/linux/sched.h
+++ linux-stable-rt/include/linux/sched.h
@@ -2145,9 +2145,10 @@ extern unsigned int sysctl_sched_cfs_ban
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
+extern bool pi_blocked_on_rt_lock(struct task_struct *tsk);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
 {
-	return tsk->pi_blocked_on != NULL;
+	return tsk->pi_blocked_on != NULL && pi_blocked_on_rt_lock(tsk);
 }
 #else
 static inline int rt_mutex_getprio(struct task_struct *p)
Index: linux-stable-rt/kernel/rtmutex.c
===================================================================
--- linux-stable-rt.orig/kernel/rtmutex.c
+++ linux-stable-rt/kernel/rtmutex.c
@@ -699,6 +699,11 @@ static int adaptive_wait(struct rt_mutex
 # define pi_lock(lock)			raw_spin_lock_irq(lock)
 # define pi_unlock(lock)		raw_spin_unlock_irq(lock)
 
+bool pi_blocked_on_rt_lock(struct task_struct *tsk)
+{
+	return tsk->pi_blocked_on && tsk->pi_blocked_on->savestate;
+}
+
 /*
  * Slow path lock function spin_lock style: this variant is very
  * careful not to miss any non-lock wakeups.


* Re: Deadlocks due to per-process plugging
  2012-07-13 14:25         ` Thomas Gleixner
@ 2012-07-13 14:46           ` Jan Kara
  2012-07-15  8:59             ` Thomas Gleixner
  2012-07-14 11:00           ` Mike Galbraith
  1 sibling, 1 reply; 33+ messages in thread
From: Jan Kara @ 2012-07-13 14:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Fri 13-07-12 16:25:05, Thomas Gleixner wrote:
> On Fri, 13 Jul 2012, Jan Kara wrote:
> > On Thu 12-07-12 16:15:29, Thomas Gleixner wrote:
> > > >   Ah, I didn't know this. Thanks for the hint. So in the kdump I have I can
> > > > see requests queued in tsk->plug despite the process is sleeping in
> > > > TASK_UNINTERRUPTIBLE state.  So the only way how unplug could have been
> > > > omitted is if tsk_is_pi_blocked() was true. Rummaging through the dump...
> > > > indeed task has pi_blocked_on = 0xffff8802717d79c8. The dump is from an -rt
> > > > kernel (I just didn't originally thought that makes any difference) so
> > > > actually any mutex is rtmutex and thus tsk_is_pi_blocked() is true whenever
> > > > we are sleeping on a mutex. So this seems like a bug in rtmutex code.
> > > 
> > > Well, the reason why this check is there is that the task which is
> > > blocked on a lock can hold another lock which might cause a deadlock
> > > in the flush path.
> >   OK. Let me understand the details. Block layer needs just queue_lock for
> > unplug to succeed. That is a spinlock but in RT kernel, even a process
> > holding a spinlock can be preempted if I remember correctly. So that
> > condition is there effectively to not unplug when a task is being scheduled
> > away while holding queue_lock? Did I get it right?
> 
> blk_flush_plug_list() is not only queue_lock. There can be other locks
> taken in the callbacks, elevator ...
  Yeah, right.

> > > > Thomas, you seemed to have added that condition... Any idea how to avoid
> > > > the deadlock?
> > > 
> > > Good question. We could do the flush when the blocked task does not
> > > hold a lock itself. Might be worth a try.
> >   Yeah, that should work for avoiding the deadlock as well.
> 
> Though we don't have a lock held count except when lockdep is enabled,
> which you probably don't want to do when running a production system.
  Agreed :).

> But we only care about stuff being scheduled out while blocked on a
> "sleeping spinlock" - i.e. spinlock, rwlock.
> 
> So the patch below should allow the unplug to take place when blocked
> on mutexes etc.
  Thanks for the patch! Mike will give it some testing.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Deadlocks due to per-process plugging
  2012-07-13 14:25         ` Thomas Gleixner
  2012-07-13 14:46           ` Jan Kara
@ 2012-07-14 11:00           ` Mike Galbraith
  2012-07-14 11:06             ` Mike Galbraith
  2012-07-15  7:14             ` Mike Galbraith
  1 sibling, 2 replies; 33+ messages in thread
From: Mike Galbraith @ 2012-07-14 11:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

I have your patch burning on my 64 core rt box.  If it survives the
weekend, you should be able to replace my jbd hack with your fix..

Tested-by: Mike Galbraith <mgalbraith@suse.de>

..so here, one each chop in advance.  It wouldn't dare work ;-)

On Fri, 2012-07-13 at 16:25 +0200, Thomas Gleixner wrote: 
> On Fri, 13 Jul 2012, Jan Kara wrote:
> > On Thu 12-07-12 16:15:29, Thomas Gleixner wrote:
> > > >   Ah, I didn't know this. Thanks for the hint. So in the kdump I have I can
> > > > see requests queued in tsk->plug despite the process is sleeping in
> > > > TASK_UNINTERRUPTIBLE state.  So the only way how unplug could have been
> > > > omitted is if tsk_is_pi_blocked() was true. Rummaging through the dump...
> > > > indeed task has pi_blocked_on = 0xffff8802717d79c8. The dump is from an -rt
> > > > kernel (I just didn't originally thought that makes any difference) so
> > > > actually any mutex is rtmutex and thus tsk_is_pi_blocked() is true whenever
> > > > we are sleeping on a mutex. So this seems like a bug in rtmutex code.
> > > 
> > > Well, the reason why this check is there is that the task which is
> > > blocked on a lock can hold another lock which might cause a deadlock
> > > in the flush path.
> >   OK. Let me understand the details. Block layer needs just queue_lock for
> > unplug to succeed. That is a spinlock but in RT kernel, even a process
> > holding a spinlock can be preempted if I remember correctly. So that
> > condition is there effectively to not unplug when a task is being scheduled
> > away while holding queue_lock? Did I get it right?
> 
> blk_flush_plug_list() is not only queue_lock. There can be other locks
> taken in the callbacks, elevator ...
> 
> > > > Thomas, you seemed to have added that condition... Any idea how to avoid
> > > > the deadlock?
> > > 
> > > Good question. We could do the flush when the blocked task does not
> > > hold a lock itself. Might be worth a try.
> >   Yeah, that should work for avoiding the deadlock as well.
> 
> Though we don't have a lock held count except when lockdep is enabled,
> which you probably don't want to do when running a production system.
> 
> But we only care about stuff being scheduled out while blocked on a
> "sleeping spinlock" - i.e. spinlock, rwlock.
> 
> So the patch below should allow the unplug to take place when blocked
> on mutexes etc.
> 
> Thanks,
> 
> 	tglx
> ----
> Index: linux-stable-rt/include/linux/sched.h
> ===================================================================
> --- linux-stable-rt.orig/include/linux/sched.h
> +++ linux-stable-rt/include/linux/sched.h
> @@ -2145,9 +2145,10 @@ extern unsigned int sysctl_sched_cfs_ban
>  extern int rt_mutex_getprio(struct task_struct *p);
>  extern void rt_mutex_setprio(struct task_struct *p, int prio);
>  extern void rt_mutex_adjust_pi(struct task_struct *p);
> +extern bool pi_blocked_on_rt_lock(struct task_struct *tsk);
>  static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
>  {
> -	return tsk->pi_blocked_on != NULL;
> +	return tsk->pi_blocked_on != NULL && pi_blocked_on_rt_lock(tsk);
>  }
>  #else
>  static inline int rt_mutex_getprio(struct task_struct *p)
> Index: linux-stable-rt/kernel/rtmutex.c
> ===================================================================
> --- linux-stable-rt.orig/kernel/rtmutex.c
> +++ linux-stable-rt/kernel/rtmutex.c
> @@ -699,6 +699,11 @@ static int adaptive_wait(struct rt_mutex
>  # define pi_lock(lock)			raw_spin_lock_irq(lock)
>  # define pi_unlock(lock)		raw_spin_unlock_irq(lock)
>  
> +bool pi_blocked_on_rt_lock(struct task_struct *tsk)
> +{
> +	return tsk->pi_blocked_on && tsk->pi_blocked_on->savestate;
> +}
> +
>  /*
>   * Slow path lock function spin_lock style: this variant is very
>   * careful not to miss any non-lock wakeups.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/




* Re: Deadlocks due to per-process plugging
  2012-07-14 11:00           ` Mike Galbraith
@ 2012-07-14 11:06             ` Mike Galbraith
  2012-07-15  7:14             ` Mike Galbraith
  1 sibling, 0 replies; 33+ messages in thread
From: Mike Galbraith @ 2012-07-14 11:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Sat, 2012-07-14 at 13:00 +0200, Mike Galbraith wrote: 
> I have your patch burning on my 64 core rt box.  If it survives the
> weekend, you should be able to replace my jbd hack with your fix..
> 
> Tested-by: Mike Galbraith <mgalbraith@suse.de>
> 
> ..so here, one each chop in advance.  It wouldn't dare work ;-)
                                                        ^not



* Re: Deadlocks due to per-process plugging
  2012-07-14 11:00           ` Mike Galbraith
  2012-07-14 11:06             ` Mike Galbraith
@ 2012-07-15  7:14             ` Mike Galbraith
  1 sibling, 0 replies; 33+ messages in thread
From: Mike Galbraith @ 2012-07-15  7:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Sat, 2012-07-14 at 13:00 +0200, Mike Galbraith wrote: 
> I have your patch burning on my 64 core rt box.  If it survives the
> weekend, you should be able to replace my jbd hack with your fix.

As expected, box is still going strong.  It would have died by now if
the problem were still lurking.

-Mike



* Re: Deadlocks due to per-process plugging
  2012-07-13 14:46           ` Jan Kara
@ 2012-07-15  8:59             ` Thomas Gleixner
  2012-07-15  9:14               ` Mike Galbraith
  0 siblings, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2012-07-15  8:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe, mgalbraith

On Fri, 13 Jul 2012, Jan Kara wrote:
> On Fri 13-07-12 16:25:05, Thomas Gleixner wrote:
> > So the patch below should allow the unplug to take place when blocked
> > on mutexes etc.
>   Thanks for the patch! Mike will give it some testing.

I just found out that this patch will explode nicely when the unplug
code runs into a contended lock. Then we try to block on that lock and
make the rtmutex code unhappy as we are already blocked on something
else.

So no, it's not a solution to the problem. Sigh.

Can you figure out on which lock the stuck thread which did not unplug
due to tsk_is_pi_blocked was blocked?

Thanks,

	tglx


* Re: Deadlocks due to per-process plugging
  2012-07-15  8:59             ` Thomas Gleixner
@ 2012-07-15  9:14               ` Mike Galbraith
  2012-07-15  9:51                 ` Thomas Gleixner
  2012-07-16  2:22                 ` Mike Galbraith
  0 siblings, 2 replies; 33+ messages in thread
From: Mike Galbraith @ 2012-07-15  9:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Sun, 2012-07-15 at 10:59 +0200, Thomas Gleixner wrote: 
> On Fri, 13 Jul 2012, Jan Kara wrote:
> > On Fri 13-07-12 16:25:05, Thomas Gleixner wrote:
> > > So the patch below should allow the unplug to take place when blocked
> > > on mutexes etc.
> >   Thanks for the patch! Mike will give it some testing.
> 
> I just found out that this patch will explode nicely when the unplug
> code runs into a contended lock. Then we try to block on that lock and
> make the rtmutex code unhappy as we are already blocked on something
> else.

Kinda like so?  My x3550 M3 just exploded.  Aw poo. 

[ 6669.133081] Kernel panic - not syncing: rt_mutex_real_waiter(task->pi_blocked_on) lock: 0xffff880175dfd588 waiter: 0xffff880121fc2d58
[ 6669.133083] 
[ 6669.133086] Pid: 28240, comm: bonnie++ Tainted: G           N  3.0.35-rt56-rt #20
[ 6669.133088] Call Trace:
[ 6669.133102]  [<ffffffff81004562>] dump_trace+0x82/0x2e0
[ 6669.133109]  [<ffffffff8154d1ee>] dump_stack+0x69/0x6f
[ 6669.133114]  [<ffffffff8154d295>] panic+0xa1/0x1e5
[ 6669.133121]  [<ffffffff81095289>] task_blocks_on_rt_mutex+0x279/0x2c0
[ 6669.133127]  [<ffffffff8154f5d5>] rt_spin_lock_slowlock+0xb5/0x290
[ 6669.133134]  [<ffffffff8131d7e4>] blk_flush_plug_list+0x164/0x200
[ 6669.133139]  [<ffffffff8154dffe>] schedule+0x5e/0xb0
[ 6669.133143]  [<ffffffff8154f1ab>] __rt_mutex_slowlock+0x4b/0xd0
[ 6669.133148]  [<ffffffff8154f39b>] rt_mutex_slowlock+0xeb/0x210
[ 6669.133154]  [<ffffffff81127bce>] page_referenced_file+0x4e/0x190
[ 6669.133160]  [<ffffffff8112954a>] page_referenced+0x6a/0x230
[ 6669.133166]  [<ffffffff8110b5e4>] shrink_active_list+0x214/0x3d0
[ 6669.133170]  [<ffffffff8110b874>] shrink_list+0xd4/0x120
[ 6669.133176]  [<ffffffff8110bc3c>] shrink_zone+0x9c/0x1d0
[ 6669.133180]  [<ffffffff8110c07f>] shrink_zones+0x7f/0x1f0
[ 6669.133185]  [<ffffffff8110c27d>] do_try_to_free_pages+0x8d/0x370
[ 6669.133189]  [<ffffffff8110c8ba>] try_to_free_pages+0xea/0x210
[ 6669.133197]  [<ffffffff810ff5e3>] __alloc_pages_nodemask+0x5b3/0x9f0
[ 6669.133205]  [<ffffffff81138294>] alloc_pages_current+0xc4/0x150
[ 6669.133211]  [<ffffffff810f6296>] find_or_create_page+0x46/0xb0
[ 6669.133217]  [<ffffffff81296cc6>] alloc_extent_buffer+0x226/0x4b0
[ 6669.133225]  [<ffffffff8126f6b9>] readahead_tree_block+0x19/0x50
[ 6669.133231]  [<ffffffff8124f4bf>] reada_for_search+0x1cf/0x230
[ 6669.133237]  [<ffffffff81252faa>] read_block_for_search+0x18a/0x200
[ 6669.133242]  [<ffffffff8125525a>] btrfs_search_slot+0x25a/0x7e0
[ 6669.133248]  [<ffffffff81269144>] btrfs_lookup_csum+0x74/0x180
[ 6669.133254]  [<ffffffff8126940f>] __btrfs_lookup_bio_sums+0x1bf/0x3b0
[ 6669.133260]  [<ffffffff812775c8>] btrfs_submit_bio_hook+0x158/0x1a0
[ 6669.133270]  [<ffffffff81291216>] submit_one_bio+0x66/0xa0
[ 6669.133274]  [<ffffffff81295017>] submit_extent_page+0x107/0x220
[ 6669.133278]  [<ffffffff81295629>] __extent_read_full_page+0x4b9/0x6e0
[ 6669.133284]  [<ffffffff8129669f>] extent_readpages+0xbf/0x100
[ 6669.133289]  [<ffffffff811020fe>] __do_page_cache_readahead+0x1ae/0x250
[ 6669.133295]  [<ffffffff811024dc>] ra_submit+0x1c/0x30
[ 6669.133299]  [<ffffffff810f67eb>] do_generic_file_read.clone.0+0x27b/0x450
[ 6669.133305]  [<ffffffff810f7a9b>] generic_file_aio_read+0x1fb/0x2a0
[ 6669.133313]  [<ffffffff8115454f>] do_sync_read+0xbf/0x100
[ 6669.133319]  [<ffffffff81154e03>] vfs_read+0xc3/0x180
[ 6669.133323]  [<ffffffff81154f11>] sys_read+0x51/0xa0
[ 6669.133329]  [<ffffffff81557092>] system_call_fastpath+0x16/0x1b
[ 6669.133347]  [<00007ff8b95bb370>] 0x7ff8b95bb36f

> So no, it's not a solution to the problem. Sigh.
> 
> Can you figure out on which lock the stuck thread which did not unplug
> due to tsk_is_pi_blocked was blocked?

I'll take a peek.

-Mike



* Re: Deadlocks due to per-process plugging
  2012-07-15  9:14               ` Mike Galbraith
@ 2012-07-15  9:51                 ` Thomas Gleixner
  2012-07-16  2:22                 ` Mike Galbraith
  1 sibling, 0 replies; 33+ messages in thread
From: Thomas Gleixner @ 2012-07-15  9:51 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Sun, 15 Jul 2012, Mike Galbraith wrote:

> On Sun, 2012-07-15 at 10:59 +0200, Thomas Gleixner wrote: 
> > On Fri, 13 Jul 2012, Jan Kara wrote:
> > > On Fri 13-07-12 16:25:05, Thomas Gleixner wrote:
> > > > So the patch below should allow the unplug to take place when blocked
> > > > on mutexes etc.
> > >   Thanks for the patch! Mike will give it some testing.
> > 
> > I just found out that this patch will explode nicely when the unplug
> > code runs into a contended lock. Then we try to block on that lock and
> > make the rtmutex code unhappy as we are already blocked on something
> > else.
> 
> Kinda like so?  My x3550 M3 just exploded.  Aw poo. 

Yep. Would have surprised me if it never triggered.

> > So no, it's not a solution to the problem. Sigh.
> > 
> > Can you figure out on which lock the stuck thread which did not unplug
> > due to tsk_is_pi_blocked was blocked?
> 
> I'll take a peek.

Thanks!


* Re: Deadlocks due to per-process plugging
  2012-07-15  9:14               ` Mike Galbraith
  2012-07-15  9:51                 ` Thomas Gleixner
@ 2012-07-16  2:22                 ` Mike Galbraith
  2012-07-16  8:59                   ` Thomas Gleixner
  1 sibling, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2012-07-16  2:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Sun, 2012-07-15 at 11:14 +0200, Mike Galbraith wrote: 
> On Sun, 2012-07-15 at 10:59 +0200, Thomas Gleixner wrote: 

> > Can you figure out on which lock the stuck thread which did not unplug
> > due to tsk_is_pi_blocked was blocked?
> 
> I'll take a peek.

Sorry for late reply, took a half day away from box.  Jan had already
done the full ext3 IO deadlock analysis:

Again kjournald is waiting for buffer IO on block 4367635 (sector
78364838) to finish.  Now it is dbench thread 0xffff88026f330e70 which
has submitted this buffer for IO and is still holding this buffer behind
its plug (request for sector 78364822..78364846). The dbench thread is
waiting on j_checkpoint mutex (apparently it has successfully got the
mutex in the past, checkpointed some buffers, released the mutex and
hung when trying to acquire it again in the next loop of
__log_wait_for_space()). 

-Mike



* Re: Deadlocks due to per-process plugging
  2012-07-16  2:22                 ` Mike Galbraith
@ 2012-07-16  8:59                   ` Thomas Gleixner
  2012-07-16  9:48                     ` Mike Galbraith
  0 siblings, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2012-07-16  8:59 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Mon, 16 Jul 2012, Mike Galbraith wrote:
> On Sun, 2012-07-15 at 11:14 +0200, Mike Galbraith wrote: 
> > On Sun, 2012-07-15 at 10:59 +0200, Thomas Gleixner wrote: 
> 
> > > Can you figure out on which lock the stuck thread which did not unplug
> > > due to tsk_is_pi_blocked was blocked?
> > 
> > I'll take a peek.
> 
> Sorry for late reply, took a half day away from box.  Jan had already
> done the full ext3 IO deadlock analysis:
> 
> Again kjournald is waiting for buffer IO on block 4367635 (sector
> 78364838) to finish.  Now it is dbench thread 0xffff88026f330e70 which
> has submitted this buffer for IO and is still holding this buffer behind
> its plug (request for sector 78364822..78364846). The dbench thread is
> waiting on j_checkpoint mutex (apparently it has successfully got the
> mutex in the past, checkpointed some buffers, released the mutex and
> hung when trying to acquire it again in the next loop of
> __log_wait_for_space()). 

And what's holding j_checkpoint mutex and not making progress?

Thanks,

	tglx


* Re: Deadlocks due to per-process plugging
  2012-07-16  8:59                   ` Thomas Gleixner
@ 2012-07-16  9:48                     ` Mike Galbraith
  2012-07-16  9:59                       ` Thomas Gleixner
  2012-07-16 10:08                       ` Mike Galbraith
  0 siblings, 2 replies; 33+ messages in thread
From: Mike Galbraith @ 2012-07-16  9:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Mon, 2012-07-16 at 10:59 +0200, Thomas Gleixner wrote: 
> On Mon, 16 Jul 2012, Mike Galbraith wrote:
> > On Sun, 2012-07-15 at 11:14 +0200, Mike Galbraith wrote: 
> > > On Sun, 2012-07-15 at 10:59 +0200, Thomas Gleixner wrote: 
> > 
> > > > Can you figure out on which lock the stuck thread which did not unplug
> > > > due to tsk_is_pi_blocked was blocked?
> > > 
> > > I'll take a peek.
> > 
> > Sorry for late reply, took a half day away from box.  Jan had already
> > done the full ext3 IO deadlock analysis:
> > 
> > Again kjournald is waiting for buffer IO on block 4367635 (sector
> > 78364838) to finish.  Now it is dbench thread 0xffff88026f330e70 which
> > has submitted this buffer for IO and is still holding this buffer behind
> > its plug (request for sector 78364822..78364846). The dbench thread is
> > waiting on j_checkpoint mutex (apparently it has successfully got the
> > mutex in the past, checkpointed some buffers, released the mutex and
> > hung when trying to acquire it again in the next loop of
> > __log_wait_for_space()). 
> 
> And what's holding j_checkpoint mutex and not making progress?

Waiting for wakeup from kjournald.

crash> bt 0xffff880189dd9560
PID: 33382  TASK: ffff880189dd9560  CPU: 3   COMMAND: "dbench"
#0 [ffff880274b61898] schedule at ffffffff8145178e
#1 [ffff880274b61a00] log_wait_commit at ffffffffa0174205 [jbd]
#2 [ffff880274b61a80] __process_buffer at ffffffffa017291b [jbd]
#3 [ffff880274b61ab0] log_do_checkpoint at ffffffffa0172bba [jbd]
#4 [ffff880274b61d20] __log_wait_for_space at ffffffffa0172dcf [jbd]
#5 [ffff880274b61d70] start_this_handle at ffffffffa016ebdf [jbd]
#6 [ffff880274b61e10] journal_start at ffffffffa016f11e [jbd]
#7 [ffff880274b61e40] ext3_unlink at ffffffffa01af757 [ext3]
#8 [ffff880274b61e80] vfs_unlink at ffffffff8115febc
#9 [ffff880274b61ea0] do_unlinkat at ffffffff811645ad
#10 [ffff880274b61f80] system_call_fastpath at ffffffff8145ad92
    RIP: 00007f811338dc37  RSP: 00007fffe247ef78  RFLAGS: 00010216
    RAX: 0000000000000057  RBX: ffffffff8145ad92  RCX: 000000000000000a
    RDX: 0000000000000000  RSI: 00007fffe247eef0  RDI: 0000000000608a10
    RBP: 00007fffe247f830   R8: 0000000000000006   R9: 0000000000000010
    R10: 0000000000000000  R11: 0000000000000206  R12: 00007f811384fc70
    R13: 00007fffe247f17c  R14: 0000000000608a10  R15: 0000000000000000
    ORIG_RAX: 0000000000000057  CS: 0033  SS: 002b




* Re: Deadlocks due to per-process plugging
  2012-07-16  9:48                     ` Mike Galbraith
@ 2012-07-16  9:59                       ` Thomas Gleixner
  2012-07-16 10:13                         ` Mike Galbraith
  2012-07-16 10:08                       ` Mike Galbraith
  1 sibling, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2012-07-16  9:59 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Mon, 16 Jul 2012, Mike Galbraith wrote:
> On Mon, 2012-07-16 at 10:59 +0200, Thomas Gleixner wrote: 
> > On Mon, 16 Jul 2012, Mike Galbraith wrote:
> > > On Sun, 2012-07-15 at 11:14 +0200, Mike Galbraith wrote: 
> > > > On Sun, 2012-07-15 at 10:59 +0200, Thomas Gleixner wrote: 
> > > 
> > > > > Can you figure out on which lock the stuck thread which did not unplug
> > > > > due to tsk_is_pi_blocked was blocked?
> > > > 
> > > > I'll take a peek.
> > > 
> > > Sorry for late reply, took a half day away from box.  Jan had already
> > > done the full ext3 IO deadlock analysis:
> > > 
> > > Again kjournald is waiting for buffer IO on block 4367635 (sector
> > > 78364838) to finish.  Now it is dbench thread 0xffff88026f330e70 which
> > > has submitted this buffer for IO and is still holding this buffer behind
> > > its plug (request for sector 78364822..78364846). The dbench thread is
> > > waiting on j_checkpoint mutex (apparently it has successfully got the
> > > mutex in the past, checkpointed some buffers, released the mutex and
> > > hung when trying to acquire it again in the next loop of
> > > __log_wait_for_space()). 
> > 
> > And what's holding j_checkpoint mutex and not making progress?
> 
> Waiting for wakeup from kjournald.

So kicking the io in __log_wait_for_space()

                spin_unlock(&journal->j_state_lock);

		--> HERE

                mutex_lock(&journal->j_checkpoint_mutex);

should fix the issue, right ?

Thanks,

	tglx
 


* Re: Deadlocks due to per-process plugging
  2012-07-16  9:48                     ` Mike Galbraith
  2012-07-16  9:59                       ` Thomas Gleixner
@ 2012-07-16 10:08                       ` Mike Galbraith
  2012-07-16 10:19                         ` Thomas Gleixner
  1 sibling, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2012-07-16 10:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

Hm, wonder how bad this sucks.. and if I should go hide under a big
sturdy rock after I poke xmit :)

---
 block/blk-core.c |    1 +
 kernel/rtmutex.c |   11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2782,6 +2782,7 @@ void blk_flush_plug_list(struct blk_plug
 	if (q)
 		queue_unplugged(q, depth, from_schedule);
 }
+EXPORT_SYMBOL(blk_flush_plug_list);
 
 void blk_finish_plug(struct blk_plug *plug)
 {
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -20,6 +20,7 @@
 #include <linux/module.h>
 #include <linux/sched.h>
 #include <linux/timer.h>
+#include <linux/blkdev.h>
 
 #include "rtmutex_common.h"
 
@@ -647,8 +648,11 @@ static inline void rt_spin_lock_fastlock
 
 	if (likely(rt_mutex_cmpxchg(lock, NULL, current)))
 		rt_mutex_deadlock_account_lock(lock, current);
-	else
+	else {
+		if (blk_needs_flush_plug(current))
+			blk_schedule_flush_plug(current);
 		slowfn(lock);
+	}
 }
 
 static inline void rt_spin_lock_fastunlock(struct rt_mutex *lock,
@@ -1104,8 +1108,11 @@ rt_mutex_fastlock(struct rt_mutex *lock,
 	if (!detect_deadlock && likely(rt_mutex_cmpxchg(lock, NULL, current))) {
 		rt_mutex_deadlock_account_lock(lock, current);
 		return 0;
-	} else
+	} else {
+		if (blk_needs_flush_plug(current))
+			blk_schedule_flush_plug(current);
 		return slowfn(lock, state, NULL, detect_deadlock);
+	}
 }
 
 static inline int




* Re: Deadlocks due to per-process plugging
  2012-07-16  9:59                       ` Thomas Gleixner
@ 2012-07-16 10:13                         ` Mike Galbraith
  0 siblings, 0 replies; 33+ messages in thread
From: Mike Galbraith @ 2012-07-16 10:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Mon, 2012-07-16 at 11:59 +0200, Thomas Gleixner wrote: 
> On Mon, 16 Jul 2012, Mike Galbraith wrote:
> > On Mon, 2012-07-16 at 10:59 +0200, Thomas Gleixner wrote: 
> > > On Mon, 16 Jul 2012, Mike Galbraith wrote:
> > > > On Sun, 2012-07-15 at 11:14 +0200, Mike Galbraith wrote: 
> > > > > On Sun, 2012-07-15 at 10:59 +0200, Thomas Gleixner wrote: 
> > > > 
> > > > > > Can you figure out on which lock the stuck thread which did not unplug
> > > > > > due to tsk_is_pi_blocked was blocked?
> > > > > 
> > > > > I'll take a peek.
> > > > 
> > > > Sorry for late reply, took a half day away from box.  Jan had already
> > > > done the full ext3 IO deadlock analysis:
> > > > 
> > > > Again kjournald is waiting for buffer IO on block 4367635 (sector
> > > > 78364838) to finish.  Now it is dbench thread 0xffff88026f330e70 which
> > > > has submitted this buffer for IO and is still holding this buffer behind
> > > > its plug (request for sector 78364822..78364846). The dbench thread is
> > > > waiting on j_checkpoint mutex (apparently it has successfully got the
> > > > mutex in the past, checkpointed some buffers, released the mutex and
> > > > hung when trying to acquire it again in the next loop of
> > > > __log_wait_for_space()). 
> > > 
> > > And what's holding j_checkpoint mutex and not making progress?
> > 
> > Waiting for wakeup from kjournald.
> 
> So kicking the io in __log_wait_for_space()
> 
>                 spin_unlock(&journal->j_state_lock);
> 
> 		--> HERE
> 
>                 mutex_lock(&journal->j_checkpoint_mutex);
> 
> should fix the issue, right ?

Should for this scenario.  Jan said he could think of ways to likely
make the same happen in xfs.

-Mike



* Re: Deadlocks due to per-process plugging
  2012-07-16 10:08                       ` Mike Galbraith
@ 2012-07-16 10:19                         ` Thomas Gleixner
  2012-07-16 10:30                           ` Mike Galbraith
                                             ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Thomas Gleixner @ 2012-07-16 10:19 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Mon, 16 Jul 2012, Mike Galbraith wrote:

> Hm, wonder how bad this sucks.. and if I should go hide under a big
> sturdy rock after I poke xmit :)
> 
> ---
>  block/blk-core.c |    1 +
>  kernel/rtmutex.c |   11 +++++++++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2782,6 +2782,7 @@ void blk_flush_plug_list(struct blk_plug
>  	if (q)
>  		queue_unplugged(q, depth, from_schedule);
>  }
> +EXPORT_SYMBOL(blk_flush_plug_list);

You don't need that one. blk-core and rtmutex are builtin

>  void blk_finish_plug(struct blk_plug *plug)
>  {
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -20,6 +20,7 @@
>  #include <linux/module.h>
>  #include <linux/sched.h>
>  #include <linux/timer.h>
> +#include <linux/blkdev.h>
>  
>  #include "rtmutex_common.h"
>  
> @@ -647,8 +648,11 @@ static inline void rt_spin_lock_fastlock
>  
>  	if (likely(rt_mutex_cmpxchg(lock, NULL, current)))
>  		rt_mutex_deadlock_account_lock(lock, current);
> -	else
> +	else {
> +		if (blk_needs_flush_plug(current))
> +			blk_schedule_flush_plug(current);
>  		slowfn(lock);
> +	}

That should do the trick.



* Re: Deadlocks due to per-process plugging
  2012-07-16 10:19                         ` Thomas Gleixner
@ 2012-07-16 10:30                           ` Mike Galbraith
  2012-07-16 11:24                           ` Mike Galbraith
  2012-07-17 13:10                           ` Mike Galbraith
  2 siblings, 0 replies; 33+ messages in thread
From: Mike Galbraith @ 2012-07-16 10:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

(hohum, gmx server went down, back to random address mode;)

On Mon, 2012-07-16 at 12:19 +0200, Thomas Gleixner wrote: 
> 
> That should do the trick.

I'll put it to work on 64 core box, and see if it survives.  It better.

-Mike



* Re: Deadlocks due to per-process plugging
  2012-07-16 10:19                         ` Thomas Gleixner
  2012-07-16 10:30                           ` Mike Galbraith
@ 2012-07-16 11:24                           ` Mike Galbraith
  2012-07-16 14:35                             ` Mike Galbraith
  2012-07-17 13:10                           ` Mike Galbraith
  2 siblings, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2012-07-16 11:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Mon, 2012-07-16 at 12:19 +0200, Thomas Gleixner wrote: 
> On Mon, 16 Jul 2012, Mike Galbraith wrote:

> > ---
> >  block/blk-core.c |    1 +
> >  kernel/rtmutex.c |   11 +++++++++--
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> > 
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -2782,6 +2782,7 @@ void blk_flush_plug_list(struct blk_plug
> >  	if (q)
> >  		queue_unplugged(q, depth, from_schedule);
> >  }
> > +EXPORT_SYMBOL(blk_flush_plug_list);
> 
> You don't need that one. blk-core and rtmutex are builtin

Box disagrees.

Waiting for device /dev/cciss/c0d0p6 to appear:  ok
fsck from util-linux 2.19.1
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/cciss/c0d0p6 
SLES11: clean, 141001/305824 files, 1053733/1222940 blocks
fsck succeeded. Mounting root device read-write.
Mounting root /dev/cciss/c0d0p6
mount -o rw,acl,user_xattr -t [   27.558934] jbd: disagrees about
version of symbol rt_spin_trylock
ext3 /dev/cciss/[   27.558938] jbd: Unknown symbol rt_spin_trylock (err
-22)
c0d0p6 /root
[   27.558959] jbd: disagrees about version of symbol
__rt_spin_lock_init
[   27.558961] jbd: Unknown symbol __rt_spin_lock_init (err -22)
[   27.558986] jbd: disagrees about version of symbol rt_spin_unlock
[   27.558988] jbd: Unknown symbol rt_spin_unlock (err -22)
[   27.558990] jbd: disagrees about version of symbol rt_spin_lock
[   27.558992] jbd: Unknown symbol rt_spin_lock (err -22)
[   27.559004] jbd: disagrees about version of symbol __rt_mutex_init
[   27.559006] jbd: Unknown symbol __rt_mutex_init (err -22)
modprobe: FATAL: Error inserting ext3
(/lib/modules/3.0.35-rt56-rt/extra/ext3.ko): Invalid argument

mount: unknown filesystem type 'ext3'
could not mount root filesystem -- exiting to /bin/sh
$




* Re: Deadlocks due to per-process plugging
  2012-07-16 11:24                           ` Mike Galbraith
@ 2012-07-16 14:35                             ` Mike Galbraith
  0 siblings, 0 replies; 33+ messages in thread
From: Mike Galbraith @ 2012-07-16 14:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Mon, 2012-07-16 at 13:24 +0200, Mike Galbraith wrote:

> Box disagrees.
> 
> Waiting for device /dev/cciss/c0d0p6 to appear:  ok
> fsck from util-linux 2.19.1
> [/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/cciss/c0d0p6 
> SLES11: clean, 141001/305824 files, 1053733/1222940 blocks
> fsck succeeded. Mounting root device read-write.
> Mounting root /dev/cciss/c0d0p6
> mount -o rw,acl,user_xattr -t [   27.558934] jbd: disagrees about
> version of symbol rt_spin_trylock
> ext3 /dev/cciss/[   27.558938] jbd: Unknown symbol rt_spin_trylock (err
> -22)
> c0d0p6 /root
> [   27.558959] jbd: disagrees about version of symbol
> __rt_spin_lock_init
> [   27.558961] jbd: Unknown symbol __rt_spin_lock_init (err -22)
> [   27.558986] jbd: disagrees about version of symbol rt_spin_unlock
> [   27.558988] jbd: Unknown symbol rt_spin_unlock (err -22)
> [   27.558990] jbd: disagrees about version of symbol rt_spin_lock
> [   27.558992] jbd: Unknown symbol rt_spin_lock (err -22)
> [   27.559004] jbd: disagrees about version of symbol __rt_mutex_init
> [   27.559006] jbd: Unknown symbol __rt_mutex_init (err -22)
> modprobe: FATAL: Error inserting ext3
> (/lib/modules/3.0.35-rt56-rt/extra/ext3.ko): Invalid argument
> 
> mount: unknown filesystem type 'ext3'
> could not mount root filesystem -- exiting to /bin/sh
> $

Weird.  Make mrproper didn't help.  Whacking every trace of the damn
tree, kernel and modules did.  Box is grinding away at the ext3 dbench
testcase.  At 7 hours per run, it'll take a while.  This time tomorrow,
box should still be happily grinding away.

-Mike



* Re: Deadlocks due to per-process plugging
  2012-07-16 10:19                         ` Thomas Gleixner
  2012-07-16 10:30                           ` Mike Galbraith
  2012-07-16 11:24                           ` Mike Galbraith
@ 2012-07-17 13:10                           ` Mike Galbraith
  2012-07-18  4:44                             ` Mike Galbraith
  2 siblings, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2012-07-17 13:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith

On Mon, 2012-07-16 at 12:19 +0200, Thomas Gleixner wrote:

> > @@ -647,8 +648,11 @@ static inline void rt_spin_lock_fastlock
> >  
> >  	if (likely(rt_mutex_cmpxchg(lock, NULL, current)))
> >  		rt_mutex_deadlock_account_lock(lock, current);
> > -	else
> > +	else {
> > +		if (blk_needs_flush_plug(current))
> > +			blk_schedule_flush_plug(current);
> >  		slowfn(lock);
> > +	}
> 
> That should do the trick.

Box has been grinding away long enough now to agree that it did.

rt: pull your plug before blocking

Queued IO can lead to IO deadlock should a task require wakeup from a task
which is blocked on that queued IO.

ext3: dbench1 queues a buffer, blocks on journal mutex, its plug is not
pulled.  dbench2 mutex owner is waiting for kjournald, who is waiting for
the buffer queued by dbench1.  Game over.

Signed-off-by: Mike Galbraith <efault@gmx.de>

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index d58db99..39140a5 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -20,6 +20,7 @@
 #include <linux/module.h>
 #include <linux/sched.h>
 #include <linux/timer.h>
+#include <linux/blkdev.h>
 
 #include "rtmutex_common.h"
 
@@ -647,8 +648,11 @@ static inline void rt_spin_lock_fastlock(struct rt_mutex *lock,
 
 	if (likely(rt_mutex_cmpxchg(lock, NULL, current)))
 		rt_mutex_deadlock_account_lock(lock, current);
-	else
+	else {
+		if (blk_needs_flush_plug(current))
+			blk_schedule_flush_plug(current);
 		slowfn(lock);
+	}
 }
 
 static inline void rt_spin_lock_fastunlock(struct rt_mutex *lock,
@@ -1104,8 +1108,11 @@ rt_mutex_fastlock(struct rt_mutex *lock, int state,
 	if (!detect_deadlock && likely(rt_mutex_cmpxchg(lock, NULL, current))) {
 		rt_mutex_deadlock_account_lock(lock, current);
 		return 0;
-	} else
+	} else {
+		if (blk_needs_flush_plug(current))
+			blk_schedule_flush_plug(current);
 		return slowfn(lock, state, NULL, detect_deadlock);
+	}
 }
 
 static inline int



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: Deadlocks due to per-process plugging
  2012-07-17 13:10                           ` Mike Galbraith
@ 2012-07-18  4:44                             ` Mike Galbraith
  2012-07-18  5:30                               ` Mike Galbraith
  0 siblings, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2012-07-18  4:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith, Steven Rostedt

(adds rather important missing Cc)

On Tue, 2012-07-17 at 15:10 +0200, Mike Galbraith wrote: 
> On Mon, 2012-07-16 at 12:19 +0200, Thomas Gleixner wrote:
> 
> > > @@ -647,8 +648,11 @@ static inline void rt_spin_lock_fastlock
> > >  
> > >  	if (likely(rt_mutex_cmpxchg(lock, NULL, current)))
> > >  		rt_mutex_deadlock_account_lock(lock, current);
> > > -	else
> > > +	else {
> > > +		if (blk_needs_flush_plug(current))
> > > +			blk_schedule_flush_plug(current);
> > >  		slowfn(lock);
> > > +	}
> > 
> > That should do the trick.
> 
> Box has been grinding away long enough now to agree that it did.
> 
> rt: pull your plug before blocking

Hm.  x3550 seems to have lost interest in the nearly-instant-gratification
ext4 deadlock testcase: taskset -c 3 dbench -t 30 -s 8 in enterprise.
Previously, it _might_ have survived one 30 second test, but never for
minutes, much less several minutes of very many threads, so it appears
to have been another flavor of IO dependency deadlock. 

I just tried virgin 3.4.4-rt13, and it too happily churned away.. until
I tried dbench -t 300 -s 500 that is.  That (seemingly 100% repeatably)
triggers an RCU stall that doesn't get to the serial console, nor will my virgin
source/config setup crash dump.  Damn.  Enterprise kernel will dump, but
won't stall, so I guess I'd better check out the other virgin 3.x-rt
trees to at least narrow down where stall started.

Whatever, RCU stall is a different problem.  Revert unplug patchlet, and
ext4 deadlock is back in virgin 3.4-rt, so methinks it's sufficiently
verified that either we need some form of unplug before blocking, or we
need a pull-your-plug point in at least two filesystems, maybe more.

-Mike

The patch in question for missing Cc.  Maybe should be only mutex, but I
see no reason why IO dependency can only possibly exist for mutexes...

rt: pull your plug before blocking

Queued IO can lead to IO deadlock should a task require wakeup from a task
which is blocked on that queued IO.

ext3: dbench1 queues a buffer, blocks on journal mutex, its plug is not
pulled.  dbench2 mutex owner is waiting for kjournald, who is waiting for
the buffer queued by dbench1.  Game over.

Signed-off-by: Mike Galbraith <efault@gmx.de>

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 3bff726..3f6ae32 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -20,6 +20,7 @@
 #include <linux/export.h>
 #include <linux/sched.h>
 #include <linux/timer.h>
+#include <linux/blkdev.h>
 
 #include "rtmutex_common.h"
 
@@ -647,8 +648,11 @@ static inline void rt_spin_lock_fastlock(struct rt_mutex *lock,
 
 	if (likely(rt_mutex_cmpxchg(lock, NULL, current)))
 		rt_mutex_deadlock_account_lock(lock, current);
-	else
+	else {
+		if (blk_needs_flush_plug(current))
+			blk_schedule_flush_plug(current);
 		slowfn(lock);
+	}
 }
 
 static inline void rt_spin_lock_fastunlock(struct rt_mutex *lock,
@@ -1104,8 +1108,11 @@ rt_mutex_fastlock(struct rt_mutex *lock, int state,
 	if (!detect_deadlock && likely(rt_mutex_cmpxchg(lock, NULL, current))) {
 		rt_mutex_deadlock_account_lock(lock, current);
 		return 0;
-	} else
+	} else {
+		if (blk_needs_flush_plug(current))
+			blk_schedule_flush_plug(current);
 		return slowfn(lock, state, NULL, detect_deadlock);
+	}
 }
 
 static inline int




^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: Deadlocks due to per-process plugging
  2012-07-18  4:44                             ` Mike Galbraith
@ 2012-07-18  5:30                               ` Mike Galbraith
  2012-07-21  7:47                                 ` Mike Galbraith
  0 siblings, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2012-07-18  5:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith, Steven Rostedt

On Wed, 2012-07-18 at 06:44 +0200, Mike Galbraith wrote:

> The patch in question for missing Cc.  Maybe should be only mutex, but I
> see no reason why IO dependency can only possibly exist for mutexes...

Well that was easy, box quickly said "nope, mutex only does NOT cut it".

-Mike


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Deadlocks due to per-process plugging
  2012-07-18  5:30                               ` Mike Galbraith
@ 2012-07-21  7:47                                 ` Mike Galbraith
  2012-07-22 18:43                                   ` Mike Galbraith
  0 siblings, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2012-07-21  7:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith, Steven Rostedt

On Wed, 2012-07-18 at 07:30 +0200, Mike Galbraith wrote: 
> On Wed, 2012-07-18 at 06:44 +0200, Mike Galbraith wrote:
> 
> > The patch in question for missing Cc.  Maybe should be only mutex, but I
> > see no reason why IO dependency can only possibly exist for mutexes...
> 
> Well that was easy, box quickly said "nope, mutex only does NOT cut it".

And I also learned (ouch) that doing both doesn't cut it either.  Ksoftirqd
(or sirq-blk) being nailed by q->lock in blk_done_softirq() is.. not
particularly wonderful.  As long as that doesn't happen, IO deadlock
doesn't happen, troublesome filesystems just work.  If it does happen
though, you've instantly got a problem.

-Mike


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Deadlocks due to per-process plugging
  2012-07-21  7:47                                 ` Mike Galbraith
@ 2012-07-22 18:43                                   ` Mike Galbraith
  2012-07-23  9:46                                     ` Mike Galbraith
  0 siblings, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2012-07-22 18:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith, Steven Rostedt

On Sat, 2012-07-21 at 09:47 +0200, Mike Galbraith wrote: 
> On Wed, 2012-07-18 at 07:30 +0200, Mike Galbraith wrote: 
> > On Wed, 2012-07-18 at 06:44 +0200, Mike Galbraith wrote:
> > 
> > > The patch in question for missing Cc.  Maybe should be only mutex, but I
> > > see no reason why IO dependency can only possibly exist for mutexes...
> > 
> > Well that was easy, box quickly said "nope, mutex only does NOT cut it".
> 
> And I also learned (ouch) that both doesn't cut it either.  Ksoftirqd
> (or sirq-blk) being nailed by q->lock in blk_done_softirq() is.. not
> particularly wonderful.  As long as that doesn't happen, IO deadlock
> doesn't happen, troublesome filesystems just work.  If it does happen
> though, you've instantly got a problem.

That problem being slab_lock in practice btw, though I suppose it could
do the same with any number of others.  In the encountered case, ksoftirqd
(or sirq-blk) blocks on slab_lock while holding q->queue_lock, while a
userspace task (dbench) blocks on q->queue_lock while holding slab_lock
on the same cpu.  Game over.

Odd is that it doesn't seem to materialize if you have the rt_mutex
deadlock detector enabled, not that that matters.  My 64 core box beat on
ext3 for 35 hours without ever hitting it with no deadlock detector (this
time.. other long runs on top thereof, totaling lots of hours), and my
x3550 beat crap out of several fs for a very long week without hitting
it with the deadlock detector, but hits it fairly easily without.

Hohum, regardless of the fickle timing gods' mood of the moment, deadlocks
are most definitely possible, and will happen, which leaves us with at
least two filesystems needing strategically placed -rt unplug points,
with no guarantee that this is really solving anything at all (other
than empirical evidence that the bad thing ain't happening 'course).

-Mike


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Deadlocks due to per-process plugging
  2012-07-22 18:43                                   ` Mike Galbraith
@ 2012-07-23  9:46                                     ` Mike Galbraith
  0 siblings, 0 replies; 33+ messages in thread
From: Mike Galbraith @ 2012-07-23  9:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jan Kara, Jeff Moyer, LKML, linux-fsdevel, Tejun Heo, Jens Axboe,
	mgalbraith, Steven Rostedt

On Sun, 2012-07-22 at 20:43 +0200, Mike Galbraith wrote: 
> On Sat, 2012-07-21 at 09:47 +0200, Mike Galbraith wrote: 
> > On Wed, 2012-07-18 at 07:30 +0200, Mike Galbraith wrote: 
> > > On Wed, 2012-07-18 at 06:44 +0200, Mike Galbraith wrote:
> > > 
> > > > The patch in question for missing Cc.  Maybe should be only mutex, but I
> > > > see no reason why IO dependency can only possibly exist for mutexes...
> > > 
> > > Well that was easy, box quickly said "nope, mutex only does NOT cut it".
> > 
> > And I also learned (ouch) that both doesn't cut it either.  Ksoftirqd
> > (or sirq-blk) being nailed by q->lock in blk_done_softirq() is.. not
> > particularly wonderful.  As long as that doesn't happen, IO deadlock
> > doesn't happen, troublesome filesystems just work.  If it does happen
> > though, you've instantly got a problem.
> 
> That problem being slab_lock in practice btw, though I suppose it could
> do the same with any number of others.  In encountered case, ksoftirqd
> (or sirq-blk) blocks on slab_lock while holding q->queue_lock, while a
> userspace task (dbench) blocks on q->queue_lock while holding slab_lock
> on the same cpu.  Game over.

Hello vacationing rt wizards' mail boxen (and others so bored they're
actually reading about obscure -rt IO troubles;).

ext4 is still alive, which is a positive sign, and box hasn't yet
deadlocked either, another sign.  Now all I have to do is (sigh) grind
filesystems to fine powder for a few days.. again.

---
 kernel/rtmutex.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -649,7 +649,14 @@ static inline void rt_spin_lock_fastlock
 	if (likely(rt_mutex_cmpxchg(lock, NULL, current)))
 		rt_mutex_deadlock_account_lock(lock, current);
 	else {
-		if (blk_needs_flush_plug(current))
+		/*
+		 * We can't pull the plug if we're already holding a lock
+		 * else we can deadlock.  eg, if we're holding slab_lock,
+		 * ksoftirqd can block while processing BLOCK_SOFTIRQ after
+		 * having acquired q->queue_lock.  If _we_ then block on
+		 * that q->queue_lock while flushing our plug, deadlock.
+		 */
+		if (__migrate_disabled(current) < 2 && blk_needs_flush_plug(current))
 			blk_schedule_flush_plug(current);
 		slowfn(lock);
 	}



^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2012-07-23  9:47 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-11 13:37 Deadlocks due to per-process plugging Jan Kara
2012-07-11 16:05 ` Jeff Moyer
2012-07-11 20:16   ` Jan Kara
2012-07-11 22:12     ` Thomas Gleixner
2012-07-12  4:12       ` Mike Galbraith
2012-07-13 12:38       ` Jan Kara
2012-07-12  2:07     ` Mike Galbraith
2012-07-12 14:15     ` Thomas Gleixner
2012-07-13 12:33       ` Jan Kara
2012-07-13 14:25         ` Thomas Gleixner
2012-07-13 14:46           ` Jan Kara
2012-07-15  8:59             ` Thomas Gleixner
2012-07-15  9:14               ` Mike Galbraith
2012-07-15  9:51                 ` Thomas Gleixner
2012-07-16  2:22                 ` Mike Galbraith
2012-07-16  8:59                   ` Thomas Gleixner
2012-07-16  9:48                     ` Mike Galbraith
2012-07-16  9:59                       ` Thomas Gleixner
2012-07-16 10:13                         ` Mike Galbraith
2012-07-16 10:08                       ` Mike Galbraith
2012-07-16 10:19                         ` Thomas Gleixner
2012-07-16 10:30                           ` Mike Galbraith
2012-07-16 11:24                           ` Mike Galbraith
2012-07-16 14:35                             ` Mike Galbraith
2012-07-17 13:10                           ` Mike Galbraith
2012-07-18  4:44                             ` Mike Galbraith
2012-07-18  5:30                               ` Mike Galbraith
2012-07-21  7:47                                 ` Mike Galbraith
2012-07-22 18:43                                   ` Mike Galbraith
2012-07-23  9:46                                     ` Mike Galbraith
2012-07-14 11:00           ` Mike Galbraith
2012-07-14 11:06             ` Mike Galbraith
2012-07-15  7:14             ` Mike Galbraith
