From: Dave Chinner <david@fromorbit.com> To: Austin Schuh <austin@peloton-tech.com> Cc: xfs <xfs@oss.sgi.com>, linux-kernel@vger.kernel.org, Tejun Heo <tj@kernel.org> Subject: On-stack work item completion race? (was Re: XFS crash?) Date: Tue, 24 Jun 2014 13:02:40 +1000 [thread overview] Message-ID: <20140624030240.GB9508@dastard> (raw) In-Reply-To: <CANGgnMZ8OwzfBj5m9H7c6q2yahGhU7oFZLsJfVxnWoqZExkZmQ@mail.gmail.com> [Adding Tejun and lkml to the cc list] On Mon, Jun 23, 2014 at 01:05:54PM -0700, Austin Schuh wrote: > I found 1 bug in XFS which I fixed, and I've uncovered something else that > I'm not completely sure how to fix. > > In xfs_bmapi_allocate, you create a completion, and use that to wait until > the work has finished. Occasionally, I'm seeing a case where I get a > context switch after the completion has been completed, but before the > workqueue has finished doing it's internal book-keeping. This results in > the work being deleted before the workqueue is done using it, corrupting > the internal data structures. I fixed it by waiting using flush_work and > removing the completion entirely. > > --- a/fs/xfs/xfs_bmap_util.c 2014-06-23 12:59:10.008678410 -0700 > +++ b/fs/xfs/xfs_bmap_util.c 2014-06-23 12:59:14.116678239 -0700 > @@ -263,7 +263,6 @@ > current_set_flags_nested(&pflags, PF_FSTRANS); > > args->result = __xfs_bmapi_allocate(args); > - complete(args->done); > > current_restore_flags_nested(&pflags, PF_FSTRANS); > } > @@ -277,16 +276,13 @@ > xfs_bmapi_allocate( > struct xfs_bmalloca *args) > { > - DECLARE_COMPLETION_ONSTACK(done); > - > if (!args->stack_switch) > return __xfs_bmapi_allocate(args); > > > - args->done = &done; > INIT_WORK_ONSTACK(&args->work, xfs_bmapi_allocate_worker); > queue_work(xfs_alloc_wq, &args->work); > - wait_for_completion(&done); > + flush_work(&args->work); > destroy_work_on_stack(&args->work); > return args->result; > } > --- a/fs/xfs/xfs_bmap_util.h 2014-06-23 12:59:10.008678410 -0700 > +++ b/fs/xfs/xfs_bmap_util.h 2014-06-23 12:59:11.708678340 -0700 > @@ -57,7 +57,6 @@ > char conv; /* overwriting unwritten extents */ > char stack_switch; > int flags; > - struct completion *done; > struct work_struct work; > int result; > }; Ok, that's a surprise. However, I can't see how using flush_work() solves that underlying context switch problem, because it's implemented the same way: bool flush_work(struct work_struct *work) { struct wq_barrier barr; lock_map_acquire(&work->lockdep_map); lock_map_release(&work->lockdep_map); if (start_flush_work(work, &barr)) { wait_for_completion(&barr.done); destroy_work_on_stack(&barr.work); return true; } else { return false; } } start_flush_work() is effectively a special queue_work() implementation, so if if it's not safe to call complete() from the workqueue as the above patch implies then this code has the same problem. Tejun - is this "do it yourself completion" a known issue w.r.t. workqueues? I can't find any documentation that says "don't do that" so...? A quick grep also shows up the same queue_work/wait_for_completion pattern in arch/x86/kernel/hpet.c, drivers/md/dm-thin.c, fs/fs-writeback.c, drivers/block/drbd/drbd_main.c.... > I enabled event tracing (and added a new event which lists the number of > workers running in a queue whenever that is changed). > > To me, it looks like work is scheduled from irq/44-ahci-273 that will > acquire an inode lock. scp-3986 then acquires the lock, and then goes and > schedules work. That work is then scheduled behind the work from > irq/44-ahci-273 in the same pool. The first work blocks waiting on the > lock, and scp-3986 won't finish and release that lock until the second work > gets run. IOWs, scp takes an inode lock and queues work to the xfs_alloc_wq, then schedules. Then a kworker runs an xfs-data work item, which tries to take the inode lock and blocks. As I understand it, what then happens is that the workqueue code grabs another kworker thread and runs the next work item in it's queue. IOWs, work items can block, but doing that does not prevent execution of other work items queued on other work queues or even on the same work queue. Tejun, did I get that correct? Hence the work on the xfs-data queue will block until another kworker processes the item on the xfs-alloc-wq which means progress is made and the inode gets unlocked. Then the kworker for the work on the xfs-data queue will get the lock, complete it's work and everything has resolved itself. Cheers, Dave. -- Dave Chinner david@fromorbit.com
WARNING: multiple messages have this Message-ID (diff)
From: Dave Chinner <david@fromorbit.com> To: Austin Schuh <austin@peloton-tech.com> Cc: Tejun Heo <tj@kernel.org>, linux-kernel@vger.kernel.org, xfs <xfs@oss.sgi.com> Subject: On-stack work item completion race? (was Re: XFS crash?) Date: Tue, 24 Jun 2014 13:02:40 +1000 [thread overview] Message-ID: <20140624030240.GB9508@dastard> (raw) In-Reply-To: <CANGgnMZ8OwzfBj5m9H7c6q2yahGhU7oFZLsJfVxnWoqZExkZmQ@mail.gmail.com> [Adding Tejun and lkml to the cc list] On Mon, Jun 23, 2014 at 01:05:54PM -0700, Austin Schuh wrote: > I found 1 bug in XFS which I fixed, and I've uncovered something else that > I'm not completely sure how to fix. > > In xfs_bmapi_allocate, you create a completion, and use that to wait until > the work has finished. Occasionally, I'm seeing a case where I get a > context switch after the completion has been completed, but before the > workqueue has finished doing it's internal book-keeping. This results in > the work being deleted before the workqueue is done using it, corrupting > the internal data structures. I fixed it by waiting using flush_work and > removing the completion entirely. > > --- a/fs/xfs/xfs_bmap_util.c 2014-06-23 12:59:10.008678410 -0700 > +++ b/fs/xfs/xfs_bmap_util.c 2014-06-23 12:59:14.116678239 -0700 > @@ -263,7 +263,6 @@ > current_set_flags_nested(&pflags, PF_FSTRANS); > > args->result = __xfs_bmapi_allocate(args); > - complete(args->done); > > current_restore_flags_nested(&pflags, PF_FSTRANS); > } > @@ -277,16 +276,13 @@ > xfs_bmapi_allocate( > struct xfs_bmalloca *args) > { > - DECLARE_COMPLETION_ONSTACK(done); > - > if (!args->stack_switch) > return __xfs_bmapi_allocate(args); > > > - args->done = &done; > INIT_WORK_ONSTACK(&args->work, xfs_bmapi_allocate_worker); > queue_work(xfs_alloc_wq, &args->work); > - wait_for_completion(&done); > + flush_work(&args->work); > destroy_work_on_stack(&args->work); > return args->result; > } > --- a/fs/xfs/xfs_bmap_util.h 2014-06-23 12:59:10.008678410 -0700 > +++ b/fs/xfs/xfs_bmap_util.h 2014-06-23 12:59:11.708678340 -0700 > @@ -57,7 +57,6 @@ > char conv; /* overwriting unwritten extents */ > char stack_switch; > int flags; > - struct completion *done; > struct work_struct work; > int result; > }; Ok, that's a surprise. However, I can't see how using flush_work() solves that underlying context switch problem, because it's implemented the same way: bool flush_work(struct work_struct *work) { struct wq_barrier barr; lock_map_acquire(&work->lockdep_map); lock_map_release(&work->lockdep_map); if (start_flush_work(work, &barr)) { wait_for_completion(&barr.done); destroy_work_on_stack(&barr.work); return true; } else { return false; } } start_flush_work() is effectively a special queue_work() implementation, so if if it's not safe to call complete() from the workqueue as the above patch implies then this code has the same problem. Tejun - is this "do it yourself completion" a known issue w.r.t. workqueues? I can't find any documentation that says "don't do that" so...? A quick grep also shows up the same queue_work/wait_for_completion pattern in arch/x86/kernel/hpet.c, drivers/md/dm-thin.c, fs/fs-writeback.c, drivers/block/drbd/drbd_main.c.... > I enabled event tracing (and added a new event which lists the number of > workers running in a queue whenever that is changed). > > To me, it looks like work is scheduled from irq/44-ahci-273 that will > acquire an inode lock. scp-3986 then acquires the lock, and then goes and > schedules work. That work is then scheduled behind the work from > irq/44-ahci-273 in the same pool. The first work blocks waiting on the > lock, and scp-3986 won't finish and release that lock until the second work > gets run. IOWs, scp takes an inode lock and queues work to the xfs_alloc_wq, then schedules. Then a kworker runs an xfs-data work item, which tries to take the inode lock and blocks. As I understand it, what then happens is that the workqueue code grabs another kworker thread and runs the next work item in it's queue. IOWs, work items can block, but doing that does not prevent execution of other work items queued on other work queues or even on the same work queue. Tejun, did I get that correct? Hence the work on the xfs-data queue will block until another kworker processes the item on the xfs-alloc-wq which means progress is made and the inode gets unlocked. Then the kworker for the work on the xfs-data queue will get the lock, complete it's work and everything has resolved itself. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2014-06-24 3:03 UTC|newest] Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top 2014-03-05 23:08 XFS crash? Austin Schuh 2014-03-05 23:35 ` Dave Chinner 2014-03-06 0:53 ` Austin Schuh 2014-05-13 1:29 ` Austin Schuh 2014-05-13 3:10 ` Austin Schuh 2014-05-13 3:33 ` Austin Schuh 2014-05-13 3:46 ` Dave Chinner 2014-05-13 4:03 ` Austin Schuh 2014-05-13 6:39 ` Dave Chinner 2014-05-13 7:02 ` Austin Schuh 2014-05-13 9:03 ` Dave Chinner 2014-05-13 17:11 ` Austin Schuh 2014-06-23 20:05 ` Austin Schuh 2014-06-24 3:02 ` Dave Chinner [this message] 2014-06-24 3:02 ` On-stack work item completion race? (was Re: XFS crash?) Dave Chinner 2014-06-24 3:25 ` Tejun Heo 2014-06-24 3:25 ` Tejun Heo 2014-06-25 3:05 ` Austin Schuh 2014-06-25 14:00 ` Tejun Heo 2014-06-25 14:00 ` Tejun Heo 2014-06-25 17:04 ` Austin Schuh 2014-06-25 17:04 ` Austin Schuh 2014-06-25 3:16 ` Austin Schuh 2014-06-25 3:16 ` Austin Schuh 2014-06-25 5:56 ` Dave Chinner 2014-06-25 5:56 ` Dave Chinner 2014-06-25 14:18 ` Tejun Heo 2014-06-25 14:18 ` Tejun Heo 2014-06-25 22:08 ` Dave Chinner 2014-06-25 22:08 ` Dave Chinner
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20140624030240.GB9508@dastard \ --to=david@fromorbit.com \ --cc=austin@peloton-tech.com \ --cc=linux-kernel@vger.kernel.org \ --cc=tj@kernel.org \ --cc=xfs@oss.sgi.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.