From: Austin Schuh <austin@peloton-tech.com>
To: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>, xfs <xfs@oss.sgi.com>, linux-kernel@vger.kernel.org, Thomas Gleixner <tglx@linutronix.de>
Subject: Re: On-stack work item completion race? (was Re: XFS crash?)
Date: Tue, 24 Jun 2014 20:16:26 -0700
Message-ID: <CANGgnMa99dCKmFkVAY1XdXVY-UVdbg7Ey8-8iTbs5tm=bYS3uA@mail.gmail.com>
In-Reply-To: <20140624032521.GA12164@htj.dyndns.org>

[Adding tglx to the cc. Sorry for any double sends.]

On Mon, Jun 23, 2014 at 8:25 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Tue, Jun 24, 2014 at 01:02:40PM +1000, Dave Chinner wrote:
>> start_flush_work() is effectively a special queue_work()
>> implementation, so if it's not safe to call complete() from the
>> workqueue as the above patch implies, then this code has the same
>> problem.
>>
>> Tejun - is this "do it yourself completion" a known issue w.r.t.
>> workqueues? I can't find any documentation that says "don't do
>> that" so...?
>
> It's more complex than using flush_work(), but there's nothing
> fundamentally wrong with it. A work item is completely unlinked
> before its execution starts. It's safe to free the work item once its
> work function has started, whether through kfree() or returning.
>
> One difference between flush_work() and manual completion would be
> that if the work item gets requeued, flush_work() would wait for the
> queued one to finish, but given the work item is one-shot this doesn't
> make any difference.
>
> I can see no reason why manual completion would behave differently
> from flush_work() in this case.

I went looking for a short trace in my original log to show the problem, and instead found evidence of the second problem. I still prefer the shorter flush_work() call, but that's not my call.

I did find this comment in process_one_work(). It sounds like this behaviour could be better documented:

	/*
	 * It is permissible to free the struct work_struct from
	 * inside the function that is called from it, this we need to
	 * take into account for lockdep too. To avoid bogus "held
	 * lock freed" warnings as well as problems when looking into
	 * work->lockdep_map, make a copy and use that here.
	 */

>> As I understand it, what then happens is that the workqueue code
>> grabs another kworker thread and runs the next work item in its
>> queue. IOWs, work items can block, but doing that does not prevent
>> execution of other work items queued on other work queues or even on
>> the same work queue. Tejun, did I get that correct?
>
> Yes, as long as the workqueue is under its @max_active limit and has
> access to an existing kworker or can create a new one, it'll start
> executing the next work item immediately; however, the guaranteed
> level of concurrency is 1 even for WQ_RECLAIM workqueues. IOW, the
> work items queued on a workqueue must be able to make forward progress
> with a single work item if the work items are being depended upon for
> memory reclaim.

I mentioned this to Dave when I initially started this thread, but I'm running an RT-patched kernel, and I don't see forward progress. The workqueue is only used in one spot (xfs_alloc_wq) and has WQ_MEM_RECLAIM set.
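To make the pattern concrete, here is roughly the shape of what I understand the allocation path to be doing. This is a paraphrased sketch for illustration, not the actual XFS code; the struct and function names other than xfs_alloc_wq are made up:

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

extern struct workqueue_struct *xfs_alloc_wq;	/* created with WQ_MEM_RECLAIM */

struct alloc_args {
	struct work_struct	work;
	struct completion	*done;
	int			result;
};

/* Worker side: do the allocation, then wake whoever queued us. */
static void alloc_worker(struct work_struct *work)
{
	struct alloc_args *args = container_of(work, struct alloc_args, work);

	args->result = 0;		/* the real allocation work goes here */
	complete(args->done);		/* wake the submitting task */
}

/*
 * Submitter side: queue an on-stack work item and sleep until the
 * worker completes it.  If no kworker ever runs the item, this task
 * sleeps forever.
 */
static int alloc_via_workqueue(struct alloc_args *args)
{
	DECLARE_COMPLETION_ONSTACK(done);

	args->done = &done;
	INIT_WORK_ONSTACK(&args->work, alloc_worker);
	queue_work(xfs_alloc_wq, &args->work);
	wait_for_completion(&done);
	destroy_work_on_stack(&args->work);
	return args->result;
}

Calling flush_work() on &args->work instead of waiting on the explicit completion would be the alternative discussed above; for a one-shot item the two should behave the same.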
I started with a 10,000,000-line trace and pulled out the entries that reference this workqueue and pool leading up to the lockup:

  scp-4110        [001] ....1.3  101.184640: workqueue_queue_work: work struct=ffff8803c782d900 function=xfs_bmapi_allocate_worker workqueue=ffff88040c9f5a00 pool=ffff88042dae3fc0 req_cpu=512 cpu=1
  scp-4110        [001] ....1.3  101.184641: workqueue_activate_work: work struct ffff8803c782d900
  kworker/1:1-89  [001] ....1.1  101.184883: workqueue_nr_running: pool=ffff88042dae3fc0 nr_running=1
  kworker/1:1-89  [001] ....1..  101.184885: workqueue_execute_start: work struct ffff8803c782d900: function xfs_bmapi_allocate_worker
  irq/44-ahci-275 [001] ....1.5  101.185086: workqueue_queue_work: work struct=ffff8800ae3f01e0 function=xfs_end_io workqueue=ffff88040b459a00 pool=ffff88042dae3fc0 req_cpu=512 cpu=1
  irq/44-ahci-275 [001] ....1.5  101.185088: workqueue_activate_work: work struct ffff8800ae3f01e0
  scp-4110        [001] ....1..  101.187911: xfs_ilock: dev 8:5 ino 0xf9e flags ILOCK_EXCL caller xfs_iomap_write_allocate
  scp-4110        [001] ....1.3  101.187914: workqueue_queue_work: work struct=ffff8803c782d900 function=xfs_bmapi_allocate_worker workqueue=ffff88040c9f5a00 pool=ffff88042dae3fc0 req_cpu=512 cpu=1
  scp-4110        [001] ....1.3  101.187915: workqueue_activate_work: work struct ffff8803c782d900
  scp-4110        [001] ....1.4  101.187926: workqueue_queue_work: work struct=ffff88040a6a01c8 function=blk_delay_work workqueue=ffff88040c9f4a00 pool=ffff88042dae44c0 req_cpu=512 cpu=1
  scp-4110        [001] ....1.4  101.187926: workqueue_activate_work: work struct ffff88040a6a01c8
  kworker/1:1-89  [001] ....1..  101.187998: workqueue_execute_end: work struct ffff8803c782d900
  kworker/1:1-89  [001] ....1..  101.188000: workqueue_execute_start: work struct ffff8800ae3f01e0: function xfs_end_io
  kworker/1:1-89  [001] ....1..  101.188001: xfs_ilock: dev 8:5 ino 0xf9e flags ILOCK_EXCL caller xfs_setfilesize

The last work never runs. The hang check triggers shortly after.

I spent some more time debugging, and I am seeing tsk_is_pi_blocked() return 1 in sched_submit_work() (kernel/sched/core.c). It looks like sched_submit_work() is not detecting that the worker task is blocked on a mutex. This looks very RT-related right now.

I see two problems from my reading (and experimentation). The first is that the second worker isn't getting started, because tsk_is_pi_blocked() is reporting that the task isn't blocked on a mutex. The second is that even if another worker needs to be scheduled because the original worker is blocked on a mutex, we need the pool lock to schedule another worker. The pool lock can be acquired by any CPU, and it is a spin_lock. If we end up on the slow path for the pool lock, we hit BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on)) in task_blocks_on_rt_mutex() in rtmutex.c. I'm not sure how to deal with either problem. Thomas, any ideas?

Hopefully I've got all my facts right... Debugging kernel code is a whole new world from userspace code.

Thanks!
Austin
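P.S. For reference, the code I'm referring to looks roughly like this (paraphrased from kernel/sched/core.c and kernel/sched/sched.h of the kernel I'm running, so treat it as a sketch rather than an exact quote; the -rt patches also change some of the surrounding code):

/*
 * tsk_is_pi_blocked() (CONFIG_RT_MUTEXES variant): true whenever the
 * task has an rt_mutex waiter record, i.e. it is blocked on an rt_mutex.
 */
static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
{
	return tsk->pi_blocked_on != NULL;
}

/*
 * sched_submit_work(): called on the way into schedule().  If the task
 * going to sleep is pi-blocked, we return immediately and skip the
 * pre-sleep handling below.  On the -rt kernel I'm running, this early
 * return appears to be what keeps the workqueue code from noticing that
 * its worker is going to sleep (that last part is my inference).
 */
static inline void sched_submit_work(struct task_struct *tsk)
{
	if (!tsk->state || tsk_is_pi_blocked(tsk))
		return;
	/*
	 * If we are going to sleep and we have plugged IO queued,
	 * make sure to submit it to avoid deadlocks.
	 */
	if (blk_needs_flush_plug(tsk))
		blk_schedule_flush_plug(tsk);
}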