From: Austin Schuh <austin@peloton-tech.com>
To: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>, xfs <xfs@oss.sgi.com>,
	linux-kernel@vger.kernel.org,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: On-stack work item completion race? (was Re: XFS crash?)
Date: Tue, 24 Jun 2014 20:16:26 -0700	[thread overview]
Message-ID: <CANGgnMa99dCKmFkVAY1XdXVY-UVdbg7Ey8-8iTbs5tm=bYS3uA@mail.gmail.com> (raw)
In-Reply-To: <20140624032521.GA12164@htj.dyndns.org>

[Adding tglx to the cc.  Sorry for any double sends]

On Mon, Jun 23, 2014 at 8:25 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Tue, Jun 24, 2014 at 01:02:40PM +1000, Dave Chinner wrote:
>> start_flush_work() is effectively a special queue_work()
>> implementation, so if it's not safe to call complete() from the
>> workqueue as the above patch implies then this code has the same
>> problem.
>>
>> Tejun - is this "do it yourself completion" a known issue w.r.t.
>> workqueues? I can't find any documentation that says "don't do
>> that" so...?
>
> It's more complex than using flush_work() but there's nothing
> fundamentally wrong with it.  A work item is completely unlinked
> before its execution starts.  It's safe to free the work item once its
> work function has started, whether through kfree() or returning.
>
> One difference between flush_work() and manual completion would be
> that if the work item gets requeued, flush_work() would wait for the
> queued one to finish but given the work item is one-shot this doesn't
> make any difference.
>
> I can see no reason why manual completion would behave differently
> from flush_work() in this case.
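
For anyone following along, the pattern in question looks roughly
like this (a minimal sketch with made-up names, not the actual XFS
code):

  struct my_args {
          struct work_struct work;
          struct completion  done;
          int                result;
  };

  static void my_worker(struct work_struct *work)
  {
          struct my_args *args = container_of(work, struct my_args, work);

          args->result = do_the_thing();  /* stand-in for the real work */
          complete(&args->done);          /* manual "do it yourself" completion */
  }

  /* In the caller, everything lives on the stack: */
          struct my_args args;

          INIT_WORK_ONSTACK(&args.work, my_worker);
          init_completion(&args.done);
          queue_work(my_wq, &args.work);   /* my_wq: some workqueue */
          wait_for_completion(&args.done);
          /* For a one-shot item, flush_work(&args.work) would be the
           * simpler equivalent of the wait above. */
          destroy_work_on_stack(&args.work);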

I went looking for a short trace in my original log to show the
problem, and instead found evidence of the second problem.  I still
prefer the shorter flush_work() call, but that decision isn't mine
to make.

I did find this comment in process_one_work().  It sounds like this
behavior could be better documented.

  /*
   * It is permissible to free the struct work_struct from
   * inside the function that is called from it, this we need to
   * take into account for lockdep too.  To avoid bogus "held
   * lock freed" warnings as well as problems when looking into
   * work->lockdep_map, make a copy and use that here.
   */
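
In other words, something like this is fine (made-up names again):

  static void my_work_fn(struct work_struct *work)
  {
          struct my_ctx *ctx = container_of(work, struct my_ctx, work);

          do_stuff(ctx);
          /* The item was unlinked before execution started, so the
           * workqueue core won't touch it again and we may free it. */
          kfree(ctx);
  }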

>> As I understand it, what then happens is that the workqueue code
>> grabs another kworker thread and runs the next work item in its
>> queue. IOWs, work items can block, but doing that does not prevent
>> execution of other work items queued on other work queues or even on
>> the same work queue. Tejun, did I get that correct?
>
> Yes, as long as the workqueue is under its @max_active limit and has
> access to an existing kworker or can create a new one, it'll start
> executing the next work item immediately; however, the guaranteed
> level of concurrency is 1 even for WQ_RECLAIM workqueues.  IOW, the
> work items queued on a workqueue must be able to make forward progress
> with a single work item if the work items are being depended upon for
> memory reclaim.

I mentioned this to Dave when I initially started this thread, but
I'm running an RT-patched kernel, and I don't see forward progress.
The workqueue is only used in one spot (xfs_alloc_wq) and has
WQ_MEM_RECLAIM set.  I started with a 10,000,000-line trace and
pulled out the entries that reference the workqueue and pool leading
up to the lockup.
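
For reference, the workqueue is created like this (quoting
fs/xfs/xfs_super.c from my tree; the exact flags may differ in other
versions):

  	xfs_alloc_wq = alloc_workqueue("xfsalloc",
  			WQ_MEM_RECLAIM | WQ_FREEZABLE, 0);

Here is the trimmed trace: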

             scp-4110  [001] ....1.3   101.184640: workqueue_queue_work: work struct=ffff8803c782d900 function=xfs_bmapi_allocate_worker workqueue=ffff88040c9f5a00 pool=ffff88042dae3fc0 req_cpu=512 cpu=1
             scp-4110  [001] ....1.3   101.184641: workqueue_activate_work: work struct ffff8803c782d900
     kworker/1:1-89    [001] ....1.1   101.184883: workqueue_nr_running: pool=ffff88042dae3fc0 nr_running=1
     kworker/1:1-89    [001] ....1..   101.184885: workqueue_execute_start: work struct ffff8803c782d900: function xfs_bmapi_allocate_worker
     irq/44-ahci-275   [001] ....1.5   101.185086: workqueue_queue_work: work struct=ffff8800ae3f01e0 function=xfs_end_io workqueue=ffff88040b459a00 pool=ffff88042dae3fc0 req_cpu=512 cpu=1
     irq/44-ahci-275   [001] ....1.5   101.185088: workqueue_activate_work: work struct ffff8800ae3f01e0
             scp-4110  [001] ....1..   101.187911: xfs_ilock: dev 8:5 ino 0xf9e flags ILOCK_EXCL caller xfs_iomap_write_allocate
             scp-4110  [001] ....1.3   101.187914: workqueue_queue_work: work struct=ffff8803c782d900 function=xfs_bmapi_allocate_worker workqueue=ffff88040c9f5a00 pool=ffff88042dae3fc0 req_cpu=512 cpu=1
             scp-4110  [001] ....1.3   101.187915: workqueue_activate_work: work struct ffff8803c782d900
             scp-4110  [001] ....1.4   101.187926: workqueue_queue_work: work struct=ffff88040a6a01c8 function=blk_delay_work workqueue=ffff88040c9f4a00 pool=ffff88042dae44c0 req_cpu=512 cpu=1
             scp-4110  [001] ....1.4   101.187926: workqueue_activate_work: work struct ffff88040a6a01c8
     kworker/1:1-89    [001] ....1..   101.187998: workqueue_execute_end: work struct ffff8803c782d900
     kworker/1:1-89    [001] ....1..   101.188000: workqueue_execute_start: work struct ffff8800ae3f01e0: function xfs_end_io
     kworker/1:1-89    [001] ....1..   101.188001: xfs_ilock: dev 8:5 ino 0xf9e flags ILOCK_EXCL caller xfs_setfilesize

The last queued work item never runs, and hangcheck fires shortly
afterwards.  As I read the trace, scp-4110 takes the inode ILOCK in
xfs_iomap_write_allocate and queues xfs_bmapi_allocate_worker again,
but the pool's only kworker moves on to xfs_end_io, which blocks on
that same ILOCK in xfs_setfilesize, and no replacement worker is
ever started.

I spent some more time debugging, and I am seeing that
tsk_is_pi_blocked is returning 1 in sched_submit_work
(kernel/sched/core.c).  It looks like sched_submit_work is not
detecting that the worker task is blocked on a mutex.
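
For context, here is roughly what that function looks like in my RT
tree (paraphrased, so the details may differ between versions):

  static inline void sched_submit_work(struct task_struct *tsk)
  {
          if (!tsk->state || tsk_is_pi_blocked(tsk))
                  return;         /* <-- we bail out here on RT */

          /*
           * If a worker went to sleep, notify the workqueue so it can
           * wake another worker to maintain concurrency.  A worker
           * blocked on an rt_mutex never reaches this point.
           */
          if (tsk->flags & PF_WQ_WORKER)
                  wq_worker_sleeping(tsk);

          if (blk_needs_flush_plug(tsk))
                  blk_schedule_flush_plug(tsk);
  }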

This looks very RT-related right now.  I see two problems from my
reading (and experimentation).  The first is that the second worker
isn't getting started because tsk_is_pi_blocked is reporting that the
task isn't blocked on a mutex.  The second is that even if another
worker needs to be scheduled because the original worker is blocked
on a mutex, we need the pool lock to schedule another worker.  The
pool lock can be acquired from any CPU and is a spin_lock, which on
RT is itself a sleeping rt_mutex.  A task can only be blocked on one
rt_mutex at a time (pi_blocked_on is a single pointer), so if we end
up on the slow path for the pool lock, we hit
BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on)) in
task_blocks_on_rt_mutex() in rtmutex.c.  I'm not sure how to deal
with either problem.  Thomas, any ideas?

Hopefully I've got all my facts right...  Debugging kernel code is a
whole new world from userspace code.

Thanks!
Austin

Thread overview: 30+ messages
2014-03-05 23:08 XFS crash? Austin Schuh
2014-03-05 23:35 ` Dave Chinner
2014-03-06  0:53   ` Austin Schuh
2014-05-13  1:29     ` Austin Schuh
2014-05-13  3:10       ` Austin Schuh
2014-05-13  3:33       ` Austin Schuh
2014-05-13  3:46       ` Dave Chinner
2014-05-13  4:03         ` Austin Schuh
2014-05-13  6:39           ` Dave Chinner
2014-05-13  7:02             ` Austin Schuh
2014-05-13  9:03               ` Dave Chinner
2014-05-13 17:11                 ` Austin Schuh
2014-06-23 20:05                   ` Austin Schuh
2014-06-24  3:02                     ` On-stack work item completion race? (was Re: XFS crash?) Dave Chinner
2014-06-24  3:25                       ` Tejun Heo
2014-06-25  3:05                         ` Austin Schuh
2014-06-25 14:00                           ` Tejun Heo
2014-06-25 17:04                             ` Austin Schuh
2014-06-25  3:16                         ` Austin Schuh [this message]
2014-06-25  5:56                         ` Dave Chinner
2014-06-25 14:18                           ` Tejun Heo
2014-06-25 22:08                             ` Dave Chinner
