* Writeback threads and freezable
From: Tejun Heo @ 2013-12-13 17:49 UTC (54+ messages in thread)
To: Rafael J. Wysocki, Jens Axboe
Cc: tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hello, guys.

This was discovered while investigating bug 62801 - "EliteBook hangs at
dock, suspend, undock, resume".

  https://bugzilla.kernel.org/show_bug.cgi?id=62801

The laptop locks up during resume if it is undocked while suspended.
The dock contains a hard drive, and both the drive removal path and the
resume path get stuck.  The reporter bisected this to 839a8e8660b6
("writeback: replace custom worker pool implementation with unbound
workqueue").  The problem can be reproduced simply by removing a
mounted hard drive while the machine is suspended.  I first thought it
was some dumb mistake, but it turns out to be something fundamental.

So, here's the lockup.

* Resume starts and libata resume is kicked off.

* libata EH determines that the device is gone.  Eventually it invokes
  device_del() through a work item.

* device_del() tries to delete the block device, which invokes
  writeback_inodes_sb() on the mounted filesystem, which in turn
  schedules and flushes a bdi work item.  Note that at this point the
  kworker is holding multiple driver layer locks.

Now, if the rest of dpm resume races with the above and gets stuck
behind one of the locks the libata device removal path is holding, it
deadlocks, because thaw_workqueues() is called only after dpm resume is
complete.  libata device removal waits for the scheduled writeback work
item to finish, which in turn is waiting for the workqueue to be
thawed, and dpm resume can't make it to thaw_workqueues() because it's
stuck behind one of the mutexes held by the libata device removal path.
I think the issue already existed before 839a8e8660b6 ("writeback:
replace custom worker pool implementation with unbound workqueue"),
although the patch seems to have exposed it a lot more.  The issue was
probably a lot less visible before because, if the bdi had a flusher
spawned, kthread_stop() overrides the freezer, and, if not, the device
had been idle and clean for some time and was unlikely to require any
action on shutdown; but the forker thread was still freezable, and I
think it should be possible to trigger some condition where forker
action is waited upon during the shutdown sequence.

The fundamental problem, really, is that device drivers make use of
kthreads and/or workqueues in their operation, and freezable kthreads
and workqueues essentially behave as a giant lock.  If any
suspend/resume operation somehow ends up depending on a freezable
kthread or workqueue, which can happen through a quite convoluted and
long dependency chain, there are possibilities for deadlock even if
those operations themselves aren't interlocked.  In this case, libata
resume finishes fine and device removal runs completely asynchronously,
but the dependency chain is still completed through driver layer locks.

I've always thought that using the freezer for kernel threads /
workqueues is fundamentally undesirable, exactly because the freezer
acts as a giant lock and we don't have any meaningful debug support
around it.  Even in leaf drivers proper, where the dependency isn't
complicated, this has ample possibility to lead to surprise lockups.

I'm unsure why bdi should be freezable.  It's not like corking things
at the bdi can reliably prevent the block layer from issuing further
commands down to drivers.  Maybe there are things which directly hook
into the bdi and depend on it being frozen, I don't know, but it just
seems dangerous to inject a large locking dependency into the chain
from such a high level construct.

I'm not completely sure what to do.
We can make an exception for bdi_wq and thaw it early, but that breaks
the guarantee that the bdi is frozen between device suspend and resume
invocations, and we might as well just remove WQ_FREEZABLE from it.  I
don't know.  Ideas?

Thanks.

-- 
tejun
* Re: Writeback threads and freezable
From: Tejun Heo @ 2013-12-13 18:52 UTC
To: Rafael J. Wysocki, Jens Axboe
Cc: tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Fri, Dec 13, 2013 at 12:49:32PM -0500, Tejun Heo wrote:
> I'm not completely sure what to do.  We can make an exception for
> bdi_wq and thaw it early but that breaks the guarantee that bdi is
> frozen between device suspend and resume invocations and we might as
> well just remove WQ_FREEZABLE from it.  I don't know.

Ugh, so I removed WQ_FREEZABLE from the bdi and the same deadlock now
happens through jbd2, which is a freezable kthread.  This whole kthread
freezer thing is rotten and fundamentally broken. :(

-- 
tejun
* [PATCH] libata, freezer: avoid block device removal while system is frozen
From: Tejun Heo @ 2013-12-13 20:40 UTC
To: Rafael J. Wysocki, Jens Axboe
Cc: tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hello,

So, this is the laughable workaround that I came up with.  Seriously,
this is tragic. :(

Thanks.

------- 8< -------
Freezable kthreads and workqueues are fundamentally problematic in that
they effectively introduce a big kernel lock widely used in the kernel
and have already been the culprit of several deadlock scenarios.  This
is the latest occurrence.

During resume, libata rescans all the ports and revalidates all
pre-existing devices.  If it determines that a device has gone missing,
the device is removed from the system, which involves invalidating the
block device and flushing the bdi while holding driver core layer
locks.  Unfortunately, this can race with the rest of device resume.
Because freezable kthreads and workqueues are thawed after device
resume is complete, and block device removal depends on freezable
workqueues and kthreads (e.g. bdi_wq, jbd2) to make progress, this can
lead to deadlock - block device removal can't proceed because kthreads
are frozen, and kthreads can't be thawed because device resume is
blocked behind block device removal.

839a8e8660b6 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible, but the underlying problem has always been there - the
original forker task and jbd2 are freezable too.  In fact, this is
highly likely just one of many possible deadlock scenarios given that
the freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.

I believe the right thing to do is getting rid of freezable kthreads
and workqueues.
This is something fundamentally broken.  For now, implement a funny
workaround in libata - just avoid doing block device hot[un]plug while
the system is frozen.  Kernel engineering at its finest. :(

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
---
 drivers/ata/libata-scsi.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index ab58556..60a01d3 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3872,6 +3872,25 @@ void ata_scsi_hotplug(struct work_struct *work)
 		return;
 	}
 
+	/*
+	 * XXX - UGLY HACK
+	 *
+	 * The core suspend/resume path is fundamentally broken due to
+	 * freezable kthreads and workqueues and may deadlock if a block
+	 * device gets removed while resume is in progress.  I don't know
+	 * what the solution is short of removing freezable kthreads and
+	 * workqueues altogether.
+	 *
+	 * The following is an ugly hack to avoid kicking off device
+	 * removal while the freezer is active.  This is a joke but does
+	 * avoid this particular deadlock scenario.
+	 *
+	 * https://bugzilla.kernel.org/show_bug.cgi?id=62801
+	 * http://marc.info/?l=linux-kernel&m=138695698516487
+	 */
+	while (pm_freezing)
+		msleep(100);
+
 	DPRINTK("ENTER\n");
 	mutex_lock(&ap->scsi_scan_mutex);
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
From: Nigel Cunningham @ 2013-12-13 22:45 UTC
To: Tejun Heo, Rafael J. Wysocki, Jens Axboe
Cc: tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hi Tejun.

Thanks for your work on this.

In your first email, in the first substantial paragraph (starting "Now,
if the rest.."), you say "libata device removal waits for the scheduled
writeback work item to finish".  I wonder if that's the lynchpin.  If
we know the device is gone, why are we trying to write to it?  All
pending I/O should have been flushed when suspend/hibernate started,
and there's no point in trying to update metadata on a device we can't
access, so there should be no writeback needed (and anything that does
somehow get there should just be discarded, since it will never succeed
anyway).  Or am I missing something?

Having said the above, I agree that we shouldn't need to freeze kernel
threads and workqueues themselves.  I think we should be giving the
producers of I/O the nous needed to avoid producing I/O during
suspend/hibernate.  But perhaps I'm missing something here, too.

Regards,

Nigel
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
From: Tejun Heo @ 2013-12-13 23:07 UTC
To: Nigel Cunningham
Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hello, Nigel.

On Sat, Dec 14, 2013 at 09:45:59AM +1100, Nigel Cunningham wrote:
> In your first email, in the first substantial paragraph (starting
> "Now, if the rest.."), you say "libata device removal waits for the
> scheduled writeback work item to finish".  I wonder if that's the
> lynchpin.  If we know the device is gone, why are we trying to write
> to it?

It's just a standard part of block device removal -
invalidate_partition(), bdi_wb_shutdown().

> All pending I/O should have been flushed when suspend/hibernate
> started, and there's no point in trying to update metadata on a

Frozen or not, it isn't guaranteed that the bdi wb queue is empty when
the system went to suspend.  They're likely to be empty but there's no
guarantee.  Conversion to workqueue only makes the behavior more
deterministic.

> device we can't access, so there should be no writeback needed (and
> anything that does somehow get there should just be discarded since
> it will never succeed anyway).

Even if they'll never succeed, they still need to be issued and
drained; otherwise, we'll end up with leaked items and hung issuers.

> Having said the above, I agree that we shouldn't need to freeze
> kernel threads and workqueues themselves.  I think we should be
> giving the producers of I/O the nous needed to avoid producing I/O
> during suspend/hibernate.  But perhaps I'm missing something here,
> too.

I never understood that part.  Why do we need to control the producers?
The chain between the producer and the consumer is a long one, and no
matter what we do with the producers, the consumers need to be plugged
all the same.  Why bother with the producers at all?  I think that's
where all this freezable kthread business started, but I don't
understand what the benefit of it is.  Not only that, the freezer is
awfully inadequate in its role too.  A flurry of activities happen in
the IO path without any thread involved, and many of them can lead to
the issuance of new IO, so the only thing the freezer achieves is
making existing bugs less visible, which is a bad thing especially for
suspend/resume, as the failure mode often doesn't yield to easy
debugging.

I asked the same question years ago and ISTR getting only fairly vague
answers, but this whole freezable kthread thing is, expectedly, proving
to be a continuous source of problems.  Let's at least find out whether
we need it, and why if so.  Not some "I feel better knowing things are
calmer" type vagueness, but the actual technical necessity of it.

Thanks.

-- 
tejun
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
From: Nigel Cunningham @ 2013-12-13 23:15 UTC
To: Tejun Heo
Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hi again.

On 14/12/13 10:07, Tejun Heo wrote:
> Hello, Nigel.
>
> On Sat, Dec 14, 2013 at 09:45:59AM +1100, Nigel Cunningham wrote:
>> In your first email, in the first substantial paragraph (starting
>> "Now, if the rest.."), you say "libata device removal waits for the
>> scheduled writeback work item to finish".  I wonder if that's the
>> lynchpin.  If we know the device is gone, why are we trying to write
>> to it?
> It's just a standard part of block device removal -
> invalidate_partition(), bdi_wb_shutdown().

Mmm.  But perhaps there needs to be some special code in there to
handle the "we can't write to this device anymore" case?

>> All pending I/O should have been flushed when suspend/hibernate
>> started, and there's no point in trying to update metadata on a
> Frozen or not, it isn't guaranteed that the bdi wb queue is empty when
> the system went to suspend.  They're likely to be empty but there's no
> guarantee.  Conversion to workqueue only makes the behavior more
> deterministic.
>
>> device we can't access, so there should be no writeback needed (and
>> anything that does somehow get there should just be discarded since
>> it will never succeed anyway).
> Even if they'll never succeed, they still need to be issued and
> drained; otherwise, we'll end up with leaked items and hung issuers.

Yeah - I get that, but draining needs to work differently if the device
doesn't exist?
>> Having said the above, I agree that we shouldn't need to freeze
>> kernel threads and workqueues themselves.  I think we should be
>> giving the producers of I/O the nous needed to avoid producing I/O
>> during suspend/hibernate.  But perhaps I'm missing something here,
>> too.
> I never understood that part.  Why do we need to control the
> producers?  The chain between the producer and consumer is a long one
> and no matter what we do with the producers, the consumers need to be
> plugged all the same.  Why bother with the producers at all?  I think
> that's where all this freezable kthread business started but I don't
> understand what the benefit of that is.  Not only that, the freezer
> is awfully inadequate in its role too.  A flurry of activities happen
> in the IO path without any thread involved and many of them can lead
> to issuance of new IO, so the only thing the freezer is achieving is
> making existing bugs less visible, which is a bad thing especially
> for suspend/resume as the failure mode often doesn't yield to easy
> debugging.
>
> I asked the same question years ago and ISTR getting only fairly
> vague answers but this whole freezable kthread thing is expectedly
> proving to be a continuous source of problems.  Let's at least find
> out whether we need it and why if so.  Not some "I feel better
> knowing things are calmer" type vagueness but actual technical
> necessity of it.

My understanding is that the point is ensuring that - particularly in
the case of hibernation - we don't cause filesystem corruption by
writing one thing while writing the image and then doing something else
(without knowledge of what happened while the image was being written)
while reading the image or after restoring it.

Regards,

Nigel
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
From: Dave Chinner @ 2013-12-14 1:55 UTC
To: Nigel Cunningham
Cc: Tejun Heo, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Sat, Dec 14, 2013 at 10:15:21AM +1100, Nigel Cunningham wrote:
> Hi again.
>
> On 14/12/13 10:07, Tejun Heo wrote:
> >On Sat, Dec 14, 2013 at 09:45:59AM +1100, Nigel Cunningham wrote:
> >>In your first email, in the first substantial paragraph (starting
> >>"Now, if the rest.."), you say "libata device removal waits for the
> >>scheduled writeback work item to finish".  I wonder if that's the
> >>lynchpin.  If we know the device is gone, why are we trying to
> >>write to it?
> >It's just a standard part of block device removal -
> >invalidate_partition(), bdi_wb_shutdown().
> Mmm.  But perhaps there needs to be some special code in there to
> handle the "we can't write to this device anymore" case?

I've been asking for that "special code" for years, Nigel... :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
From: Tejun Heo @ 2013-12-14 20:31 UTC
To: Nigel Cunningham
Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hello, Nigel.

On Sat, Dec 14, 2013 at 10:15:21AM +1100, Nigel Cunningham wrote:
> My understanding is that the point is ensuring that - particularly
> in the case of hibernation - we don't cause filesystem corruption by
> writing one thing while writing the image and then doing something
> else (without knowledge of what happened while the image was being
> written) while reading the image or after restoring it.

So, all this is about hibernation?  Does that mean that it's safe to
unfreeze before invoking resume?  ie. we currently do,

	freeze
	suspend devs
	resume devs
	unfreeze

If we can just do,

	freeze
	suspend devs
	unfreeze
	resume devs

it should be a lot better and would remove all locking dependency
chains in the resume path.

Thanks.

-- 
tejun
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
From: Tejun Heo @ 2013-12-14 20:36 UTC
To: Nigel Cunningham
Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Sat, Dec 14, 2013 at 03:31:21PM -0500, Tejun Heo wrote:
> So, all this is about hibernation?  Does that mean that it's safe to
> unfreeze before invoking resume?  ie. we currently do,
>
>	freeze
>	suspend devs
>	resume devs
>	unfreeze
>
> If we can just do,
>
>	freeze
>	suspend devs
>	unfreeze
>	resume devs

Ummm... even better, does that mean that, in the long term, we can
decouple freezing and device pm operations?  If this is just about
hibernation, there's no reason to wrap device driver pm ops with the
freezer, right?

Thanks.

-- 
tejun
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
From: Nigel Cunningham @ 2013-12-14 21:21 UTC
To: Tejun Heo, Rafael J. Wysocki
Cc: Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hi.

On 15/12/13 07:36, Tejun Heo wrote:
> On Sat, Dec 14, 2013 at 03:31:21PM -0500, Tejun Heo wrote:
>> So, all this is about hibernation?  Does that mean that it's safe to
>> unfreeze before invoking resume?  ie. we currently do,
>>
>>	freeze
>>	suspend devs
>>	resume devs
>>	unfreeze
>>
>> If we can just do,
>>
>>	freeze
>>	suspend devs
>>	unfreeze
>>	resume devs
> Ummm... even better, does that mean, in the long term, we can decouple
> freezing and device pm operations?  If this is just about hibernation,
> there's no reason to wrap device driver pm ops with freezer, right?

Rafael has spent far more time than me thinking about the semantics
here, so I'd like to defer to him.  Rafael?...

Regards,

Nigel
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
From: Rafael J. Wysocki @ 2013-12-17 2:35 UTC
To: Nigel Cunningham
Cc: Tejun Heo, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Sunday, December 15, 2013 08:21:36 AM Nigel Cunningham wrote:
> Hi.
>
> On 15/12/13 07:36, Tejun Heo wrote:
> > On Sat, Dec 14, 2013 at 03:31:21PM -0500, Tejun Heo wrote:
> >> So, all this is about hibernation?  Does that mean that it's safe
> >> to unfreeze before invoking resume?  ie. we currently do,
> >>
> >>	freeze
> >>	suspend devs
> >>	resume devs
> >>	unfreeze
> >>
> >> If we can just do,
> >>
> >>	freeze
> >>	suspend devs
> >>	unfreeze
> >>	resume devs
> > Ummm... even better, does that mean, in the long term, we can
> > decouple freezing and device pm operations?  If this is just about
> > hibernation, there's no reason to wrap device driver pm ops with
> > freezer, right?
>
> Rafael has spent far more time than me thinking about the semantics
> here, so I'd like to defer to him.  Rafael?...

Please see my reply to Tejun.

Thanks,

Rafael
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
From: Rafael J. Wysocki @ 2013-12-17 2:34 UTC
To: Tejun Heo
Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Saturday, December 14, 2013 03:31:21 PM Tejun Heo wrote:
> Hello, Nigel.
>
> On Sat, Dec 14, 2013 at 10:15:21AM +1100, Nigel Cunningham wrote:
> > My understanding is that the point is ensuring that - particularly
> > in the case of hibernation - we don't cause filesystem corruption by
> > writing one thing while writing the image and then doing something
> > else (without knowledge of what happened while the image was being
> > written) while reading the image or after restoring it.
>
> So, all this is about hibernation?

No, it isn't.  [I guess it was originally, but it has not been the case
for a very long time.]  It is about getting user space interactions
(all of the sysfs/ioctl/mmap/read/write/you-name-it thingies user space
can do to devices) out of the way when we're calling device
suspend/resume routines.  The reason is that otherwise all of them
would have had to do an "oh, are we suspending, by the way?" check on
pretty much every code path that can be triggered by user space.

And I'd appreciate it if people didn't repeat the nonsense about
possible filesystem corruption.  The freezing of tasks doesn't prevent
that from happening (which is easy to demonstrate by artificially
failing restore from hibernation multiple times in a row).

> Does that mean that it's safe to unfreeze before invoking resume?

No, it isn't.

Thanks,

Rafael
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
From: Tejun Heo @ 2013-12-17 12:34 UTC
To: Rafael J. Wysocki
Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hello, Rafael.

On Tue, Dec 17, 2013 at 03:34:00AM +0100, Rafael J. Wysocki wrote:
> No, it isn't.  [I guess it was originally, but it has not been the
> case for a very long time.]  It is about getting user space
> interactions (all of

Heh... no wonder people are all so confused about this thing.

> the sysfs/ioctl/mmap/read/write/you-name-it thingies user space can
> do to devices) out of the way when we're calling device
> suspend/resume routines.  The reason is that otherwise all of them
> would have had to do an "oh, are we suspending, by the way?" check on
> pretty much every code path that can be triggered by user space.

Freezing userland is fine.  I have no problem with that, but up until
now the only use case that seems fundamentally valid to me is freezing
an IO processing kthread in a driver as a cheap way to implement
suspend/resume.  At this point, given the general level of confusion,
it seems to be costing more than it's benefiting.

> > Does that mean that it's safe to unfreeze before invoking resume?
>
> No, it isn't.

So, are you saying it's really about giving device drivers an easy way
to implement suspend/resume?  If that's the case, let's please make it
*way* more specific and clear - ie. things like helpers to implement
suspend/resume hooks trivially or whatnot.  Freezable kthreads (and now
workqueues) have been becoming a giant mess for a while now.

Thanks.

-- 
tejun
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
From: Rafael J. Wysocki @ 2013-12-18 0:35 UTC
To: Tejun Heo
Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Tuesday, December 17, 2013 07:34:57 AM Tejun Heo wrote:
> Hello, Rafael.
>
> On Tue, Dec 17, 2013 at 03:34:00AM +0100, Rafael J. Wysocki wrote:
> > No, it isn't.  [I guess it was originally, but it has not been the
> > case for a very long time.]  It is about getting user space
> > interactions (all of
>
> Heh... no wonder people are all so confused about this thing.
>
> > the sysfs/ioctl/mmap/read/write/you-name-it thingies user space can
> > do to devices) out of the way when we're calling device
> > suspend/resume routines.  The reason is that otherwise all of them
> > would have had to do an "oh, are we suspending, by the way?" check
> > on pretty much every code path that can be triggered by user space.
>
> Freezing userland is fine.

OK

So do I understand correctly that you're talking about kernel
threads/workqueues freezing below?

> I have no problem with that but up until now the only use case that
> seems fundamentally valid to me is freezing an IO processing kthread
> in a driver as a cheap way to implement suspend/resume.  At this
> point, given the general level of confusion, it seems to be costing
> more than benefiting.

There are corner cases like the runtime PM workqueue, which is
freezable by design.

> > > Does that mean that it's safe to unfreeze before invoking resume?
> >
> > No, it isn't.
>
> So, are you saying it's really about giving device drivers an easy
> way to implement suspend/resume?

Well, that's a side effect rather than a recommended interface.
A *few* pieces of code need to freeze kernel threads/workqueues, but
they should know who they are, and they really, really should know
*why* they need that (the above-mentioned runtime PM workqueue is one
example).  The rest is just doing it because they can, which may not be
entirely reasonable (or because they did it in the past and the
original author is not responsive and everyone else does not dare to
try removing it).

> If that's the case, let's please make it *way* more specific and
> clear - ie. things like helpers to implement suspend/resume hooks
> trivially or whatnot.  Freezable kthreads (and now workqueues) have
> been becoming a giant mess for a while now.

They were a lot more of that to start with really.  We removed quite a
number of try_to_freeze() instances from the kernel a few years ago,
but apparently people are taking shortcuts.

The rule should be to require patch submitters to put in comments
explaining why they need their kernel threads/workqueues to be
freezable, and generally there should be no such things in drivers.

Thanks,

Rafael
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
From: Tejun Heo @ 2013-12-18 11:17 UTC
To: Rafael J. Wysocki
Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hello, Rafael.

On Wed, Dec 18, 2013 at 01:35:13AM +0100, Rafael J. Wysocki wrote:
> So do I understand correctly that you're talking about kernel
> threads/workqueues freezing below?

Yeap, I'm strictly talking about kernel freezables.

> > So, are you saying it's really about giving device drivers an easy
> > way to implement suspend/resume?
>
> Well, that's a side effect rather than a recommended interface.  A
> *few* pieces of code need to freeze kernel threads/workqueues, but
> they should know who they are and they really really should know
> *why* they need that (the above-mentioned runtime PM workqueue is one
> example).  The rest is just doing that because they can, which may
> not be entirely reasonable (or because they did that in the past and
> the original author is not responsive and everyone else does not dare
> to try removing that).

I see.  In the long term, I think the right thing to do is to make the
freezer interface more specific so that only the ones which actually
need it use it explicitly.  Right now, kernel freezables are
conceptually at a very high level - it's a global task attribute and a
major knob in workqueue.  I suppose most of that is historical, but by
perpetuating the model we're encouraging misuse of the freezer in large
swaths of the kernel.  Even in this specific case, both the writeback
and jbd workers have no fundamental reason to be freezable, and yet
they are, eventually developing into totally unnecessary deadlocks.

> They were a lot more of that to start with really.
> We removed quite a number > of try_to_freeze() instances from the kernel a few years ago, but apparently > people are taking shortcuts. Great, so we're at least headed in the right direction. > The rule should be to require patch submitters to put in comments explaining > why they need their kernel threads/workqueues to be freezable and generally > there should be no such things in drivers. I'm not so sure whether that's something which can be effectively enforced given the current high visibility and confusion around the freezer. I think the only way to get this under control is to actively weed out the current spurious users, deprecate the existing interface and switch the real ones to something a lot more specific. Thanks. -- tejun ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen 2013-12-18 11:17 ` Tejun Heo @ 2013-12-18 21:48 ` Rafael J. Wysocki 2013-12-18 21:39 ` Tejun Heo 0 siblings, 1 reply; 54+ messages in thread From: Rafael J. Wysocki @ 2013-12-18 21:48 UTC (permalink / raw) To: Tejun Heo Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Wednesday, December 18, 2013 06:17:26 AM Tejun Heo wrote: > Hello, Rafael. > > On Wed, Dec 18, 2013 at 01:35:13AM +0100, Rafael J. Wysocki wrote: > > So do I understand correctly that you're talking about kernel threads/worqueues > > freezing below? > > Yeap, I'm strictly talking about kernel freezables. > > > > So, are you saying it's really about giving device drivers easy way to > > > implement suspend/resume? > > > > Well, that's a side effect rather than a recommeded interface. A *few* pieces > > of code need to freeze kernel threads/workqueues, but they should know who they > > are and they really really should know *why* they need that (the above-mentioned > > runtime PM workqueue is one example). The rest is just doing that because they > > can, which may not be entirely reasonable (or because they did that in the past > > and the original author is not responsive and everyone else does not dare to try > > removing that). > > I see. In the long term, I think the right thing to do is making the > freezer interface more specific so that only the ones which actually > need it do so explicitly. Right now, kernel freezables are > conceptually at a very high level - it's a global task attribute and a > major knob in workqueue. I suppose most of that is historical but by > perpetuating the model we're encouraging misuse of freezer in large > swaths of the kernel. 
> Even in this specific case, both writeback and > jbd workers have no fundamental reason to be freezable and yet > they're, eventually developing into totally unnecessary deadlocks. You're right, but I'm not sure how we can make the interface for workqueues more specific, for example. I guess we can simply drop create_freezable_workqueue() so that whoever wants to create a freezable workqueue has to use the right combination of flags. Can we make it more specific than that? BTW, pm_start_workqueue(), which is a legitimate user, doesn't even use that macro. :-) > > They were a lot more of that to start with really. We removed quite a number > > of try_to_freeze() instances from the kernel a few years ago, but apparently > > people are taking shortcuts. > > Great, so we're at least headed towards the right direction. > > > The rule should be to require patch submitters to put in comments explaining > > why they need their kernel threads/workqueues to be freezable and generally > > there should be no such things in drivers. > > I'm not so sure whether that's something which can be effectively > enforced given the current high visibility and confusion around > freezer. I think the only way to get this under control is weed out > the current spurious users actively, deprecate the existing interface > and switch the real ones to something a lot more specific. That'd be fine by me modulo the above remarks. Thanks, Rafael ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen 2013-12-18 21:48 ` Rafael J. Wysocki @ 2013-12-18 21:39 ` Tejun Heo 2013-12-18 21:41 ` Tejun Heo 0 siblings, 1 reply; 54+ messages in thread From: Tejun Heo @ 2013-12-18 21:39 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu Hello, Rafael. On Wed, Dec 18, 2013 at 10:48:31PM +0100, Rafael J. Wysocki wrote: > > I see. In the long term, I think the right thing to do is making the > > freezer interface more specific so that only the ones which actually > > need it do so explicitly. Right now, kernel freezables are > > conceptually at a very high level - it's a global task attribute and a > > major knob in workqueue. I suppose most of that is historical but by > > perpetuating the model we're encouraging misuse of freezer in large > > swaths of the kernel. Even in this specific case, both writeback and > > jbd workers have no fundamental reason to be freezable and yet > > they're, eventually developing into totally unnecessary deadlocks. > > You're right, but I'm not sure how we can make the interface for workqueues > more specific, for example. I guess we can simply drop create_freezable_workqueue() > so that whoever wants to create a freezable workqueue has to use the right > combination of flags. Can we make it more specific than that? > > BTW, pm_start_workqueue(), which is a legitimate user, doesn't even use that macro. :-) Yeah, we can just rip out the whole freezer support and let the caller's pm notification hooks implement it by doing workqueue_set_max_active(wq, 0); flush_workqueue(wq); when it needs to "freeze" the workqueue and then reverse it by doing the following. workqueue_set_max_active(wq, WQ_DFL_ACTIVE); It'll be a bit more code in the specific users but given the specificity of the usage I think that's the appropriate way to do it. 
It'll drop quite a bit of complexity from the core freezer and workqueue code paths too. Thanks. -- tejun ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen 2013-12-18 21:39 ` Tejun Heo @ 2013-12-18 21:41 ` Tejun Heo 2013-12-18 22:04 ` Rafael J. Wysocki 0 siblings, 1 reply; 54+ messages in thread From: Tejun Heo @ 2013-12-18 21:41 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Wed, Dec 18, 2013 at 04:39:36PM -0500, Tejun Heo wrote: > Yeah, we can just rip out the whole freezer support and let the > caller's pm notification hooks implement it by doing > > workqueue_set_max_active(wq, 0); > flush_workqueue(wq); Oops, the above doesn't work but I can trivially add a new interface for this. Something which waits for nr_active to drop to the newly set max_active level before returning. workqueue_set_max_active_and_wait(wq, 0); Thanks. -- tejun ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen 2013-12-18 21:41 ` Tejun Heo @ 2013-12-18 22:04 ` Rafael J. Wysocki 2013-12-19 23:35 ` [PATCH wq/for-3.14 1/2] workqueue: update max_active clamping rules Tejun Heo 2013-12-19 23:37 ` [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active() Tejun Heo 0 siblings, 2 replies; 54+ messages in thread From: Rafael J. Wysocki @ 2013-12-18 22:04 UTC (permalink / raw) To: Tejun Heo Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Wednesday, December 18, 2013 04:41:28 PM Tejun Heo wrote: > On Wed, Dec 18, 2013 at 04:39:36PM -0500, Tejun Heo wrote: > > Yeah, we can just rip out the whole freezer support and let the > > caller's pm notification hooks implement it by doing > > > > workqueue_set_max_active(wq, 0); > > flush_workqueue(wq); > > Oops, the above doesn't work but I can trivially add a new interface > for this. Something which waits for the max_active to drop below the > newly set level before returning. > > workqueue_set_max_active_and_wait(wq, 0); If you add that, I can convert the PM workqueue to using it and then we can go through all of the existing users and make similar changes - or just make them non-freezable if there's no good reason for freezing those particular ones. Thanks, Rafael ^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH wq/for-3.14 1/2] workqueue: update max_active clamping rules 2013-12-18 22:04 ` Rafael J. Wysocki @ 2013-12-19 23:35 ` Tejun Heo 2013-12-20 1:26 ` Rafael J. Wysocki 2013-12-19 23:37 ` [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active() Tejun Heo 1 sibling, 1 reply; 54+ messages in thread From: Tejun Heo @ 2013-12-19 23:35 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu, Lai Jiangshan, David Howells >From bdd220b2a1b86fee14a12b69fb0cadafe60a1dac Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Thu, 19 Dec 2013 18:33:09 -0500 @max_active handling is currently inconsistent. * In alloc_workqueue(), 0 gets converted to the default and the value gets clamped to [1, lim]. * In workqueue_set_max_active(), 0 is not converted to the default and the value is clamped to [1, lim]. * When a workqueue is exposed through sysfs, input < 1 fails with -EINVAL; otherwise, the value is clamped to [1, lim]. * fscache exposes max_active through a sysctl and clamps the value in [1, lim]. We want to be able to express zero @max_active but as it's a special case and 0 is already used for default, don't want to use 0 for it. Update @max_active clamping so that the following rules are followed. * In both alloc_workqueue() and workqueue_set_max_active(), 0 is converted to the default, and a new special value WQ_FROZEN_ACTIVE becomes 0; otherwise, the value is clamped to [1, lim]. * In both sysfs and fscache sysctl, input < 1 fails with -EINVAL; otherwise, the value is clamped to [1, lim]. 
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: David Howells <dhowells@redhat.com> --- fs/fscache/main.c | 10 +++++++--- include/linux/workqueue.h | 1 + kernel/workqueue.c | 6 +++++- 3 files changed, 13 insertions(+), 4 deletions(-) diff --git a/fs/fscache/main.c b/fs/fscache/main.c index 7c27907..9d5a716 100644 --- a/fs/fscache/main.c +++ b/fs/fscache/main.c @@ -62,9 +62,13 @@ static int fscache_max_active_sysctl(struct ctl_table *table, int write, int ret; ret = proc_dointvec(table, write, buffer, lenp, ppos); - if (ret == 0) - workqueue_set_max_active(*wqp, *datap); - return ret; + if (ret < 0) + return ret; + if (*datap < 1) + return -EINVAL; + + workqueue_set_max_active(*wqp, *datap); + return 0; } ctl_table fscache_sysctls[] = { diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 594521b..334daa3 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -338,6 +338,7 @@ enum { __WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */ __WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */ + WQ_FROZEN_ACTIVE = -1, /* special value for frozen wq */ WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */ WQ_MAX_UNBOUND_PER_CPU = 4, /* 4 * #cpus for unbound wq */ WQ_DFL_ACTIVE = WQ_MAX_ACTIVE / 2, diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 987293d..6748fbf 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4136,6 +4136,11 @@ static int wq_clamp_max_active(int max_active, unsigned int flags, { int lim = flags & WQ_UNBOUND ? 
WQ_UNBOUND_MAX_ACTIVE : WQ_MAX_ACTIVE; + if (max_active == 0) + return WQ_DFL_ACTIVE; + if (max_active == WQ_FROZEN_ACTIVE) + return 0; + if (max_active < 1 || max_active > lim) pr_warn("workqueue: max_active %d requested for %s is out of range, clamping between %d and %d\n", max_active, name, 1, lim); @@ -4176,7 +4181,6 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt, vsnprintf(wq->name, sizeof(wq->name), fmt, args); va_end(args); - max_active = max_active ?: WQ_DFL_ACTIVE; max_active = wq_clamp_max_active(max_active, flags, wq->name); /* init wq */ -- 1.8.4.2 ^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH wq/for-3.14 1/2] workqueue: update max_active clamping rules 2013-12-19 23:35 ` [PATCH wq/for-3.14 1/2] workqueue: update max_active clamping rules Tejun Heo @ 2013-12-20 1:26 ` Rafael J. Wysocki 0 siblings, 0 replies; 54+ messages in thread From: Rafael J. Wysocki @ 2013-12-20 1:26 UTC (permalink / raw) To: Tejun Heo Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu, Lai Jiangshan, David Howells On Thursday, December 19, 2013 06:35:26 PM Tejun Heo wrote: > From bdd220b2a1b86fee14a12b69fb0cadafe60a1dac Mon Sep 17 00:00:00 2001 > From: Tejun Heo <tj@kernel.org> > Date: Thu, 19 Dec 2013 18:33:09 -0500 > > @max_active handling is currently inconsistent. > > * In alloc_workqueue(), 0 gets converted to the default and the value > gets clamped to [1, lim]. > > * In workqueue_set_max_active(), 0 is not converted to the default and > the value is clamped to [1, lim]. > > * When a workqueue is exposed through sysfs, input < 1 fails with > -EINVAL; otherwise, the value is clamped to [1, lim]. > > * fscache exposes max_active through a sysctl and clamps the value in > [1, lim]. > > We want to be able to express zero @max_active but as it's a special > case and 0 is already used for default, don't want to use 0 for it. > Update @max_active clamping so that the following rules are followed. > > * In both alloc_workqueue() and workqueue_set_max_active(), 0 is > converted to the default, and a new special value WQ_FROZEN_ACTIVE > becomes 0; otherwise, the value is clamped to [1, lim]. > > * In both sysfs and fscache sysctl, input < 1 fails with -EINVAL; > otherwise, the value is clamped to [1, lim]. > > Signed-off-by: Tejun Heo <tj@kernel.org> > Cc: Lai Jiangshan <laijs@cn.fujitsu.com> > Cc: David Howells <dhowells@redhat.com> Well, this one looks good to me: Acked-by: Rafael J. 
Wysocki <rafael.j.wysocki@intel.com> > --- > fs/fscache/main.c | 10 +++++++--- > include/linux/workqueue.h | 1 + > kernel/workqueue.c | 6 +++++- > 3 files changed, 13 insertions(+), 4 deletions(-) > > diff --git a/fs/fscache/main.c b/fs/fscache/main.c > index 7c27907..9d5a716 100644 > --- a/fs/fscache/main.c > +++ b/fs/fscache/main.c > @@ -62,9 +62,13 @@ static int fscache_max_active_sysctl(struct ctl_table *table, int write, > int ret; > > ret = proc_dointvec(table, write, buffer, lenp, ppos); > - if (ret == 0) > - workqueue_set_max_active(*wqp, *datap); > - return ret; > + if (ret < 0) > + return ret; > + if (*datap < 1) > + return -EINVAL; > + > + workqueue_set_max_active(*wqp, *datap); > + return 0; > } > > ctl_table fscache_sysctls[] = { > diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h > index 594521b..334daa3 100644 > --- a/include/linux/workqueue.h > +++ b/include/linux/workqueue.h > @@ -338,6 +338,7 @@ enum { > __WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */ > __WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */ > > + WQ_FROZEN_ACTIVE = -1, /* special value for frozen wq */ > WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */ > WQ_MAX_UNBOUND_PER_CPU = 4, /* 4 * #cpus for unbound wq */ > WQ_DFL_ACTIVE = WQ_MAX_ACTIVE / 2, > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > index 987293d..6748fbf 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -4136,6 +4136,11 @@ static int wq_clamp_max_active(int max_active, unsigned int flags, > { > int lim = flags & WQ_UNBOUND ? 
WQ_UNBOUND_MAX_ACTIVE : WQ_MAX_ACTIVE; > > + if (max_active == 0) > + return WQ_DFL_ACTIVE; > + if (max_active == WQ_FROZEN_ACTIVE) > + return 0; > + > if (max_active < 1 || max_active > lim) > pr_warn("workqueue: max_active %d requested for %s is out of range, clamping between %d and %d\n", > max_active, name, 1, lim); > @@ -4176,7 +4181,6 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt, > vsnprintf(wq->name, sizeof(wq->name), fmt, args); > va_end(args); > > - max_active = max_active ?: WQ_DFL_ACTIVE; > max_active = wq_clamp_max_active(max_active, flags, wq->name); > > /* init wq */ > -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active() 2013-12-18 22:04 ` Rafael J. Wysocki 2013-12-19 23:35 ` [PATCH wq/for-3.14 1/2] workqueue: update max_active clamping rules Tejun Heo @ 2013-12-19 23:37 ` Tejun Heo 2013-12-20 1:31 ` Rafael J. Wysocki 1 sibling, 1 reply; 54+ messages in thread From: Tejun Heo @ 2013-12-19 23:37 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu, Lai Jiangshan, David Howells Hello, Rafael. If this looks good to you, I'll commit it to wq/for-3.14 and we can at least start to clean up things. Thanks! ------- 8< ------- >From 6b5182f3f193d0ff9296f53a4665a55b6477aa77 Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Thu, 19 Dec 2013 18:33:12 -0500 workqueue_set_max_active() currently doesn't wait for the number of in-flight work items to fall under the new @max_active. This patch adds a @drain parameter to workqueue_set_max_active(); if set, the function sleeps until nr_active on each pool_workqueue of the target workqueue drops below the current saved_max_active. This is planned to replace freezable workqueues. It is determined that kernel freezables - both kthreads and workqueues - aren't necessary and just add to the confusion and unnecessary deadlocks. There are only a handful which actually need to stop processing for system power events and they can be converted to use workqueue_set_max_active(WQ_FROZEN_ACTIVE) instead of freezable workqueues. Ultimately, all other uses of freezables will be dropped and the whole system freezer machinery will be excised. Well, that's the plan anyway. The implementation is fairly straightforward. As this is expected to be used by only a handful and most likely not concurrently, a single wait queue is used. 
set_max_active drainers wait on it as necessary and pwq_dec_nr_in_flight() triggers it if nr_active == max_active after nr_active is decremented. This unfortunately adds on unlikely branch to the work item execution path but this is extremely unlikely to be noticeable and I think it's worthwhile to avoid polling here as there may be multiple calls to this function in succession during suspend and some platforms use suspend quite frequently. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/g/20131218213936.GA8218@mtj.dyndns.org Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: David Howells <dhowells@redhat.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> --- fs/fscache/main.c | 2 +- include/linux/workqueue.h | 2 +- kernel/workqueue.c | 58 ++++++++++++++++++++++++++++++++++++++++++++--- 3 files changed, 57 insertions(+), 5 deletions(-) diff --git a/fs/fscache/main.c b/fs/fscache/main.c index 9d5a716..e2837f2 100644 --- a/fs/fscache/main.c +++ b/fs/fscache/main.c @@ -67,7 +67,7 @@ static int fscache_max_active_sysctl(struct ctl_table *table, int write, if (*datap < 1) return -EINVAL; - workqueue_set_max_active(*wqp, *datap); + workqueue_set_max_active(*wqp, *datap, false); return 0; } diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 334daa3..8b9c628 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -488,7 +488,7 @@ extern bool cancel_delayed_work(struct delayed_work *dwork); extern bool cancel_delayed_work_sync(struct delayed_work *dwork); extern void workqueue_set_max_active(struct workqueue_struct *wq, - int max_active); + int max_active, bool drain); extern bool current_is_workqueue_rescuer(void); extern bool workqueue_congested(int cpu, struct workqueue_struct *wq); extern unsigned int work_busy(struct work_struct *work); diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 6748fbf..f18c35b 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -293,6 +293,12 @@ static 
DEFINE_SPINLOCK(wq_mayday_lock); /* protects wq->maydays list */ static LIST_HEAD(workqueues); /* PL: list of all workqueues */ static bool workqueue_freezing; /* PL: have wqs started freezing? */ +/* + * Wait for nr_active to drain after max_active adjustment. This is a cold + * path and not expected to have many users. A global waitq should do. + */ +static DECLARE_WAIT_QUEUE_HEAD(wq_nr_active_drain_waitq); + /* the per-cpu worker pools */ static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS], cpu_worker_pools); @@ -1123,6 +1129,14 @@ static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color) pwq->nr_in_flight[color]--; pwq->nr_active--; + + /* + * This can happen only after max_active is lowered. Tell the + * waiters that draining might be complete. + */ + if (unlikely(pwq->nr_active == pwq->max_active)) + wake_up_all(&wq_nr_active_drain_waitq); + if (!list_empty(&pwq->delayed_works)) { /* one down, submit a delayed one */ if (pwq->nr_active < pwq->max_active) @@ -3140,7 +3154,7 @@ static ssize_t max_active_store(struct device *dev, if (sscanf(buf, "%d", &val) != 1 || val <= 0) return -EINVAL; - workqueue_set_max_active(wq, val); + workqueue_set_max_active(wq, val, false); return count; } static DEVICE_ATTR_RW(max_active); @@ -4339,16 +4353,22 @@ EXPORT_SYMBOL_GPL(destroy_workqueue); * workqueue_set_max_active - adjust max_active of a workqueue * @wq: target workqueue * @max_active: new max_active value. + * @drain: wait until the actual level of concurrency becomes <= @max_active * - * Set max_active of @wq to @max_active. + * Set max_active of @wq to @max_active. If @drain is true, wait until the + * in-flight work items are drained to the new level. * * CONTEXT: * Don't call from IRQ context. 
*/ -void workqueue_set_max_active(struct workqueue_struct *wq, int max_active) +void workqueue_set_max_active(struct workqueue_struct *wq, int max_active, + bool drain) { + DEFINE_WAIT(wait); struct pool_workqueue *pwq; + might_sleep_if(drain); + /* disallow meddling with max_active for ordered workqueues */ if (WARN_ON(wq->flags & __WQ_ORDERED)) return; @@ -4363,6 +4383,38 @@ void workqueue_set_max_active(struct workqueue_struct *wq, int max_active) pwq_adjust_max_active(pwq); mutex_unlock(&wq->mutex); + + /* + * If we have increased max_active, pwq_dec_nr_in_flight() might + * not trigger for other instances of this function waiting for + * drain. Force them to check. + */ + wake_up_all(&wq_nr_active_drain_waitq); + + if (!drain) + return; + + /* let's wait for the drain to complete */ +restart: + mutex_lock(&wq->mutex); + prepare_to_wait(&wq_nr_active_drain_waitq, &wait, TASK_UNINTERRUPTIBLE); + + for_each_pwq(pwq, wq) { + /* + * nr_active should be monotonously decreasing as long as + * it's over max_active, so no need to grab pool lock. + * Also, test against saved_max_active in case multiple + * instances and/or system freezer are racing. + */ + if (pwq->nr_active > wq->saved_max_active) { + mutex_unlock(&wq->mutex); + schedule(); + goto restart; + } + } + + finish_wait(&wq_nr_active_drain_waitq, &wait); + mutex_unlock(&wq->mutex); } EXPORT_SYMBOL_GPL(workqueue_set_max_active); -- 1.8.4.2 ^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active() 2013-12-19 23:37 ` [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active() Tejun Heo @ 2013-12-20 1:31 ` Rafael J. Wysocki 2013-12-20 13:32 ` Tejun Heo 0 siblings, 1 reply; 54+ messages in thread From: Rafael J. Wysocki @ 2013-12-20 1:31 UTC (permalink / raw) To: Tejun Heo Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu, Lai Jiangshan, David Howells On Thursday, December 19, 2013 06:37:20 PM Tejun Heo wrote: > Hello, Rafael. Hi, > If this looks good to you, I'll commit it to wq/for-3.14 and we can at > least start to clean up things. Yes, it does. So with that I need to do workqueue_set_max_active(wq, WQ_FROZEN_ACTIVE) during suspend and then workqueue_set_max_active(wq, WQ_DFL_ACTIVE) during resume on my workqueue, right? Rafael > ------- 8< ------- > From 6b5182f3f193d0ff9296f53a4665a55b6477aa77 Mon Sep 17 00:00:00 2001 > From: Tejun Heo <tj@kernel.org> > Date: Thu, 19 Dec 2013 18:33:12 -0500 > > workqueue_set_max_active() currently doesn't wait for the number of > in-flight work items to fall under the new @max_active. This patch > adds @drain paramter to workqueue_set_max_active(), if set, the > function sleeps until nr_active on each pool_workqueue of the target > workqueue drops below the current saved_max_active. > > This is planned to replace freezable workqueues. It is determined > that kernel freezables - both kthreads and workqueues - aren't > necessary and just add to the confusion and unnecessary deadlocks. > There are only a handful which actually need to stop processing for > system power events and they can be converted to use > workqueue_set_max_active(WQ_FROZEN_ACTIVE) instead of freezable > workqueues. Ultimately, all other uses of freezables will be dropped > and the whole system freezer machinery will be excised. 
Well, that's > the plan anyway. > > The implementation is fairly straight-forward. As this is expected to > be used by only a handful and most likely not concurrently, a single > wait queue is used. set_max_active drainers wait on it as necessary > and pwq_dec_nr_in_flight() triggers it if nr_active == max_active > after nr_active is decremented. This unfortunately adds on unlikely > branch to the work item execution path but this is extremely unlikely > to be noticeable and I think it's worthwhile to avoid polling here as > there may be multiple calls to this function in succession during > suspend and some platforms use suspend quite frequently. > > Signed-off-by: Tejun Heo <tj@kernel.org> > Link: http://lkml.kernel.org/g/20131218213936.GA8218@mtj.dyndns.org > Cc: Lai Jiangshan <laijs@cn.fujitsu.com> > Cc: David Howells <dhowells@redhat.com> > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> > --- > fs/fscache/main.c | 2 +- > include/linux/workqueue.h | 2 +- > kernel/workqueue.c | 58 ++++++++++++++++++++++++++++++++++++++++++++--- > 3 files changed, 57 insertions(+), 5 deletions(-) > > diff --git a/fs/fscache/main.c b/fs/fscache/main.c > index 9d5a716..e2837f2 100644 > --- a/fs/fscache/main.c > +++ b/fs/fscache/main.c > @@ -67,7 +67,7 @@ static int fscache_max_active_sysctl(struct ctl_table *table, int write, > if (*datap < 1) > return -EINVAL; > > - workqueue_set_max_active(*wqp, *datap); > + workqueue_set_max_active(*wqp, *datap, false); > return 0; > } > > diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h > index 334daa3..8b9c628 100644 > --- a/include/linux/workqueue.h > +++ b/include/linux/workqueue.h > @@ -488,7 +488,7 @@ extern bool cancel_delayed_work(struct delayed_work *dwork); > extern bool cancel_delayed_work_sync(struct delayed_work *dwork); > > extern void workqueue_set_max_active(struct workqueue_struct *wq, > - int max_active); > + int max_active, bool drain); > extern bool current_is_workqueue_rescuer(void); > extern bool 
workqueue_congested(int cpu, struct workqueue_struct *wq); > extern unsigned int work_busy(struct work_struct *work); > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > index 6748fbf..f18c35b 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -293,6 +293,12 @@ static DEFINE_SPINLOCK(wq_mayday_lock); /* protects wq->maydays list */ > static LIST_HEAD(workqueues); /* PL: list of all workqueues */ > static bool workqueue_freezing; /* PL: have wqs started freezing? */ > > +/* > + * Wait for nr_active to drain after max_active adjustment. This is a cold > + * path and not expected to have many users. A global waitq should do. > + */ > +static DECLARE_WAIT_QUEUE_HEAD(wq_nr_active_drain_waitq); > + > /* the per-cpu worker pools */ > static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS], > cpu_worker_pools); > @@ -1123,6 +1129,14 @@ static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color) > pwq->nr_in_flight[color]--; > > pwq->nr_active--; > + > + /* > + * This can happen only after max_active is lowered. Tell the > + * waiters that draining might be complete. > + */ > + if (unlikely(pwq->nr_active == pwq->max_active)) > + wake_up_all(&wq_nr_active_drain_waitq); > + > if (!list_empty(&pwq->delayed_works)) { > /* one down, submit a delayed one */ > if (pwq->nr_active < pwq->max_active) > @@ -3140,7 +3154,7 @@ static ssize_t max_active_store(struct device *dev, > if (sscanf(buf, "%d", &val) != 1 || val <= 0) > return -EINVAL; > > - workqueue_set_max_active(wq, val); > + workqueue_set_max_active(wq, val, false); > return count; > } > static DEVICE_ATTR_RW(max_active); > @@ -4339,16 +4353,22 @@ EXPORT_SYMBOL_GPL(destroy_workqueue); > * workqueue_set_max_active - adjust max_active of a workqueue > * @wq: target workqueue > * @max_active: new max_active value. > + * @drain: wait until the actual level of concurrency becomes <= @max_active > * > - * Set max_active of @wq to @max_active. 
> + * Set max_active of @wq to @max_active. If @drain is true, wait until the > + * in-flight work items are drained to the new level. > * > * CONTEXT: > * Don't call from IRQ context. > */ > -void workqueue_set_max_active(struct workqueue_struct *wq, int max_active) > +void workqueue_set_max_active(struct workqueue_struct *wq, int max_active, > + bool drain) > { > + DEFINE_WAIT(wait); > struct pool_workqueue *pwq; > > + might_sleep_if(drain); > + > /* disallow meddling with max_active for ordered workqueues */ > if (WARN_ON(wq->flags & __WQ_ORDERED)) > return; > @@ -4363,6 +4383,38 @@ void workqueue_set_max_active(struct workqueue_struct *wq, int max_active) > pwq_adjust_max_active(pwq); > > mutex_unlock(&wq->mutex); > + > + /* > + * If we have increased max_active, pwq_dec_nr_in_flight() might > + * not trigger for other instances of this function waiting for > + * drain. Force them to check. > + */ > + wake_up_all(&wq_nr_active_drain_waitq); > + > + if (!drain) > + return; > + > + /* let's wait for the drain to complete */ > +restart: > + mutex_lock(&wq->mutex); > + prepare_to_wait(&wq_nr_active_drain_waitq, &wait, TASK_UNINTERRUPTIBLE); > + > + for_each_pwq(pwq, wq) { > + /* > + * nr_active should be monotonously decreasing as long as > + * it's over max_active, so no need to grab pool lock. > + * Also, test against saved_max_active in case multiple > + * instances and/or system freezer are racing. > + */ > + if (pwq->nr_active > wq->saved_max_active) { > + mutex_unlock(&wq->mutex); > + schedule(); > + goto restart; > + } > + } > + > + finish_wait(&wq_nr_active_drain_waitq, &wait); > + mutex_unlock(&wq->mutex); > } > EXPORT_SYMBOL_GPL(workqueue_set_max_active); > > -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active() 2013-12-20 1:31 ` Rafael J. Wysocki @ 2013-12-20 13:32 ` Tejun Heo 2013-12-20 13:56 ` Rafael J. Wysocki 0 siblings, 1 reply; 54+ messages in thread From: Tejun Heo @ 2013-12-20 13:32 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu, Lai Jiangshan, David Howells Hello, Rafael. On Fri, Dec 20, 2013 at 02:31:37AM +0100, Rafael J. Wysocki wrote: > > If this looks good to you, I'll commit it to wq/for-3.14 and we can at > > least start to clean up things. > > Yes, it does. > > So with that I need to do workqueue_set_max_active(wq, WQ_FROZEN_ACTIVE) during > suspend and then workqueue_set_max_active(wq, WQ_DFL_ACTIVE) during resume on > my workqueue, right? Yeah, you can also do workqueue_set_max_active(wq, 0) during resume to restore the default value, which is what's used by alloc_workqueue() anyway. I'll apply the two patches to wq/for-3.14 once David reviews them or if you wanna take them through the pm tree, that's fine too. I don't have anything scheduled for wq for the next merge window. Thanks. -- tejun ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active() 2013-12-20 13:32 ` Tejun Heo @ 2013-12-20 13:56 ` Rafael J. Wysocki 2013-12-20 14:23 ` Tejun Heo 0 siblings, 1 reply; 54+ messages in thread From: Rafael J. Wysocki @ 2013-12-20 13:56 UTC (permalink / raw) To: Tejun Heo Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu, Lai Jiangshan, David Howells On Friday, December 20, 2013 08:32:20 AM Tejun Heo wrote: > Hello, Rafael. > > On Fri, Dec 20, 2013 at 02:31:37AM +0100, Rafael J. Wysocki wrote: > > > If this looks good to you, I'll commit it to wq/for-3.14 and we can at > > > least start to clean up things. > > > > Yes, it does. > > > > So with that I need to do workqueue_set_max_active(wq, WQ_FROZEN_ACTIVE) during > > suspend and then workqueue_set_max_active(wq, WQ_DFL_ACTIVE) during resume on > > my workqueue, right? > > Yeah, you can also do workqueue_set_max_active(wq, 0) during resume to > restore the default value, which is what's used by alloc_workqueue() > anyway. OK > I'll apply the two patches to wq/for-3.14 once David reviews them or > if you wanna take them through the pm tree, that's fine too. I don't > have anything scheduled for wq for the next merge window. I have some material queued up, so I can take them once they've been reviewed. Thanks, Rafael ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active() 2013-12-20 13:56 ` Rafael J. Wysocki @ 2013-12-20 14:23 ` Tejun Heo 0 siblings, 0 replies; 54+ messages in thread From: Tejun Heo @ 2013-12-20 14:23 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu, Lai Jiangshan, David Howells On Fri, Dec 20, 2013 at 02:56:09PM +0100, Rafael J. Wysocki wrote: > On Friday, December 20, 2013 08:32:20 AM Tejun Heo wrote: > > Hello, Rafael. > > > > On Fri, Dec 20, 2013 at 02:31:37AM +0100, Rafael J. Wysocki wrote: > > > > If this looks good to you, I'll commit it to wq/for-3.14 and we can at > > > > least start to clean up things. > > > > > > Yes, it does. > > > > > > So with that I need to do workqueue_set_max_active(wq, WQ_FROZEN_ACTIVE) during > > > suspend and then workqueue_set_max_active(wq, WQ_DFL_ACTIVE) during resume on > > > my workqueue, right? > > > > Yeah, you can also do workqueue_set_max_active(wq, 0) during resume to > > restore the default value, which is what's used by alloc_workqueue() > > anyway. Oops, it should be obvious but just in case. The call to use during suspend is workqueue_set_max_active(wq, WQ_FROZEN_ACTIVE, true). :) -- tejun ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen 2013-12-13 23:07 ` Tejun Heo 2013-12-13 23:15 ` Nigel Cunningham @ 2013-12-16 12:12 ` Ming Lei 2013-12-16 12:45 ` Tejun Heo 1 sibling, 1 reply; 54+ messages in thread From: Ming Lei @ 2013-12-16 12:12 UTC (permalink / raw) To: Tejun Heo Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, Linux Kernel Mailing List, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Sat, Dec 14, 2013 at 7:07 AM, Tejun Heo <tj@kernel.org> wrote: > >> All pending I/O should have been flushed when suspend/hibernate >> started, and there's no point in trying to update metadata on a > > Frozen or not, it isn't guaranteed that bdi wb queue is empty when the > system went to suspend. They're likely to be empty but there's no > guarantee. Conversion to workqueue only makes the behavior more > deterministic. It looks like we should guarantee that the I/O queue is emptied before system sleep; otherwise it may be a bug, since the battery may be drained, causing data loss, or block devices may be disconnected during sleep. Thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen 2013-12-16 12:12 ` [PATCH] libata, freezer: avoid block device removal while system is frozen Ming Lei @ 2013-12-16 12:45 ` Tejun Heo 2013-12-16 13:24 ` Ming Lei 0 siblings, 1 reply; 54+ messages in thread From: Tejun Heo @ 2013-12-16 12:45 UTC (permalink / raw) To: Ming Lei Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, Linux Kernel Mailing List, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu Hello, On Mon, Dec 16, 2013 at 08:12:07PM +0800, Ming Lei wrote: > It looks like we should guarantee that the I/O queue is emptied before > system sleep; otherwise it may be a bug, since the battery may be drained, > causing data loss, or block devices may be disconnected during sleep. It's a bit beside the point as that's what journaling is for. Power loss during suspend isn't all that different from power loss while suspended. Also, the filesystems are already synced before entering suspend, so it shouldn't be an issue. However, that doesn't guarantee there are no pending bdi work items. We *hope* to control the sources with freezer but there's no guarantee about the coverage and no way to verify that either. Thanks. -- tejun ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen 2013-12-16 12:45 ` Tejun Heo @ 2013-12-16 13:24 ` Ming Lei 2013-12-16 16:05 ` Tejun Heo 0 siblings, 1 reply; 54+ messages in thread From: Ming Lei @ 2013-12-16 13:24 UTC (permalink / raw) To: Tejun Heo Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, Linux Kernel Mailing List, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Mon, Dec 16, 2013 at 8:45 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Mon, Dec 16, 2013 at 08:12:07PM +0800, Ming Lei wrote: >> It looks like we should guarantee that the I/O queue is emptied before >> system sleep; otherwise it may be a bug, since the battery may be drained, >> causing data loss, or block devices may be disconnected during sleep. > > It's a bit beside the point as that's what journaling is for. Power > loss during suspend isn't all that different from power loss while > suspended. Also, the filesystems are already synced before entering > suspend, so it shouldn't be an issue. However, that doesn't guarantee > there are no pending bdi work items. We *hope* to control the You mean there is still some write I/O scheduled after processes are frozen? By unfreezable kernel threads? > sources with freezer but there's no guarantee about the coverage and > no way to verify that either. Thanks, -- Ming Lei ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen 2013-12-16 13:24 ` Ming Lei @ 2013-12-16 16:05 ` Tejun Heo 0 siblings, 0 replies; 54+ messages in thread From: Tejun Heo @ 2013-12-16 16:05 UTC (permalink / raw) To: Ming Lei Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, Linux Kernel Mailing List, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu Hello, Ming. On Mon, Dec 16, 2013 at 09:24:40PM +0800, Ming Lei wrote: > You mean there is still some write I/O scheduled after processes are > frozen? By unfreezable kernel threads? By unfreezable kernel threads, timer, bh, freezable kernel threads, whatever really. Please note that freezer doesn't enforce any order in how the target tasks are frozen. There's no mechanism to guarantee that flusher is frozen after all other freezable tasks are frozen. There isn't even a mechanism which guarantees flusher flushes all its queues before getting frozen. There's no interlocking whatsoever. The only thing which happens is that we flush all filesystems before freezing the kernel threads, so the queues are *likely* to be empty. I think trying to control all IO sources using the freezer is a fundamentally flawed idea. There are N sources which are difficult to track down reliably while there is a *single* queue all those have to go through, which needs to be quiesced anyway. The only thing which makes sense is controlling the queue. Maybe it really is necessary for hibernation. If so, let's please make it something tailored for that purpose - quiesce only the ones which are actually relevant and in the places where it's necessary. Not this "we stopped most of the system, whatever that means, and it feels good" thing which doesn't really solve anything while introducing this fuzzy wishful idea of a mostly stopped system and giant lock semantics all over the place. Thanks. -- tejun ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen 2013-12-13 20:40 ` [PATCH] libata, freezer: avoid block device removal while system is frozen Tejun Heo 2013-12-13 22:45 ` Nigel Cunningham @ 2013-12-17 2:38 ` Rafael J. Wysocki 2013-12-17 12:36 ` Tejun Heo 2013-12-17 12:50 ` [PATCH v2] " Tejun Heo 2 siblings, 1 reply; 54+ messages in thread From: Rafael J. Wysocki @ 2013-12-17 2:38 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Friday, December 13, 2013 03:40:34 PM Tejun Heo wrote: > Hello, > > So, this is the laughable workaround that I came up with. Seriously, > this is tragic. :( > > Thanks. > > ------- 8< ------- > Freezable kthreads and workqueues are fundamentally problematic in > that they effectively introduce a big kernel lock widely used in the > kernel and have already been the culprit of several deadlock > scenarios. This is the latest occurrence. OK, so I'm too tired now to go through all that, but I'll look at it tomorrow. The rule of thumb is to get rid of freezable kernel threads in the first place if possible anyway if they are causing problems to happen. Thanks! > During resume, libata rescans all the ports and revalidates all > pre-existing devices. If it determines that a device has gone > missing, the device is removed from the system which involves > invalidating block device and flushing bdi while holding driver core > layer locks. Unfortunately, this can race with the rest of device > resume. Because freezable kthreads and workqueues are thawed after > device resume is complete and block device removal depends on > freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make > progress, this can lead to deadlock - block device removal can't > proceed because kthreads are frozen and kthreads can't be thawed > because device resume is blocked behind block device removal. 
> > 839a8e8660b6 ("writeback: replace custom worker pool implementation > with unbound workqueue") made this particular deadlock scenario more > visible but the underlying problem has always been there - the > original forker task and jbd2 are freezable too. In fact, this is > highly likely just one of many possible deadlock scenarios given that > freezer behaves as a big kernel lock and we don't have any debug > mechanism around it. > > I believe the right thing to do is getting rid of freezable kthreads > and workqueues. This is something fundamentally broken. For now, > implement a funny workaround in libata - just avoid doing block device > hot[un]plug while the system is frozen. Kernel engineering at its > finest. :( > > Signed-off-by: Tejun Heo <tj@kernel.org> > Reported-by: Tomaž Šolc <tomaz.solc@tablix.org> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801 > Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Cc: Len Brown <len.brown@intel.com> > Cc: Oleg Nesterov <oleg@redhat.com> > Cc: stable@vger.kernel.org > --- > drivers/ata/libata-scsi.c | 19 +++++++++++++++++++ > 1 file changed, 19 insertions(+) > > diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c > index ab58556..60a01d3 100644 > --- a/drivers/ata/libata-scsi.c > +++ b/drivers/ata/libata-scsi.c > @@ -3872,6 +3872,25 @@ void ata_scsi_hotplug(struct work_struct *work) > return; > } > > + /* > + * XXX - UGLY HACK > + * > + * The core suspend/resume path is fundamentally broken due to > + * freezable kthreads and workqueue and may deadlock if a block > + * device gets removed while resume is in progress. I don't know > + * what the solution is short of removing freezable kthreads and > + * workqueues altogether. > + * > + * The following is an ugly hack to avoid kicking off device > + * removal while freezer is active. 
This is a joke but does avoid > + * this particular deadlock scenario. > + * > + * https://bugzilla.kernel.org/show_bug.cgi?id=62801 > + * http://marc.info/?l=linux-kernel&m=138695698516487 > + */ > + while (pm_freezing) > + msleep(100); > + > DPRINTK("ENTER\n"); > mutex_lock(&ap->scsi_scan_mutex); > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen 2013-12-17 2:38 ` Rafael J. Wysocki @ 2013-12-17 12:36 ` Tejun Heo 2013-12-18 0:23 ` Rafael J. Wysocki 0 siblings, 1 reply; 54+ messages in thread From: Tejun Heo @ 2013-12-17 12:36 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu Hello, Rafael. On Tue, Dec 17, 2013 at 03:38:05AM +0100, Rafael J. Wysocki wrote: > On Friday, December 13, 2013 03:40:34 PM Tejun Heo wrote: > > Hello, > > > > So, this is the laughable workaround that I came up with. Seriously, > > this is tragic. :( > > > > Thanks. > > > > ------- 8< ------- > > Freezable kthreads and workqueues are fundamentally problematic in > > that they effectively introduce a big kernel lock widely used in the > > kernel and have already been the culprit of several deadlock > > scenarios. This is the latest occurrence. > > OK, so I'm too tired now to go through all that, but I'll look at it tomorrow. > > The rule of thumb is to get rid of freezable kernel threads in the first > place if possible anyway if they are causing problems to happen. Yes, that'd be awesome. In fact, getting rid of all kernel freezables in non-low-level-drivers would be a great step forward. That said, we need something easily backportable so I think we probably need this bandaid for immediate fix for now. Thanks. -- tejun ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen 2013-12-17 12:36 ` Tejun Heo @ 2013-12-18 0:23 ` Rafael J. Wysocki 0 siblings, 0 replies; 54+ messages in thread From: Rafael J. Wysocki @ 2013-12-18 0:23 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Tuesday, December 17, 2013 07:36:43 AM Tejun Heo wrote: > Hello, Rafael. > > On Tue, Dec 17, 2013 at 03:38:05AM +0100, Rafael J. Wysocki wrote: > > On Friday, December 13, 2013 03:40:34 PM Tejun Heo wrote: > > > Hello, > > > > > > So, this is the laughable workaround that I came up with. Seriously, > > > this is tragic. :( > > > > > > Thanks. > > > > > > ------- 8< ------- > > > Freezable kthreads and workqueues are fundamentally problematic in > > > that they effectively introduce a big kernel lock widely used in the > > > kernel and have already been the culprit of several deadlock > > > scenarios. This is the latest occurrence. > > > > OK, so I'm too tired now to go through all that, but I'll look at it tomorrow. > > > > The rule of thumb is to get rid of freezable kernel threads in the first > > place if possible anyway if they are causing problems to happen. > > Yes, that'd be awesome. In fact, getting rid of all kernel freezables > in non-low-level-drivers would be a great step forward. Agreed. > That said, we need something easily backportable so I think we probably need > this bandaid for immediate fix for now. Well, we need something, but I have a couple of concerns about this particular patch (I'll reply to it with comments shortly). Thanks, Rafael ^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v2] libata, freezer: avoid block device removal while system is frozen 2013-12-13 20:40 ` [PATCH] libata, freezer: avoid block device removal while system is frozen Tejun Heo 2013-12-13 22:45 ` Nigel Cunningham 2013-12-17 2:38 ` Rafael J. Wysocki @ 2013-12-17 12:50 ` Tejun Heo 2013-12-18 1:04 ` Rafael J. Wysocki 2013-12-18 12:07 ` [PATCH v3] " Tejun Heo 2 siblings, 2 replies; 54+ messages in thread From: Tejun Heo @ 2013-12-17 12:50 UTC (permalink / raw) To: Rafael J. Wysocki, Jens Axboe Cc: tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu Hello, Rafael, if you're okay with the workaround, I'll route it through libata/for-3.13-fixes. Thanks. ------- 8< ------- Freezable kthreads and workqueues are fundamentally problematic in that they effectively introduce a big kernel lock widely used in the kernel and have already been the culprit of several deadlock scenarios. This is the latest occurrence. During resume, libata rescans all the ports and revalidates all pre-existing devices. If it determines that a device has gone missing, the device is removed from the system which involves invalidating block device and flushing bdi while holding driver core layer locks. Unfortunately, this can race with the rest of device resume. Because freezable kthreads and workqueues are thawed after device resume is complete and block device removal depends on freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make progress, this can lead to deadlock - block device removal can't proceed because kthreads are frozen and kthreads can't be thawed because device resume is blocked behind block device removal. 839a8e8660b6 ("writeback: replace custom worker pool implementation with unbound workqueue") made this particular deadlock scenario more visible but the underlying problem has always been there - the original forker task and jbd2 are freezable too. 
In fact, this is highly likely just one of many possible deadlock scenarios given that freezer behaves as a big kernel lock and we don't have any debug mechanism around it. I believe the right thing to do is getting rid of freezable kthreads and workqueues. This is something fundamentally broken. For now, implement a funny workaround in libata - just avoid doing block device hot[un]plug while the system is frozen. Kernel engineering at its finest. :( v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built as a module. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Tomaž Šolc <tomaz.solc@tablix.org> Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801 Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Len Brown <len.brown@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@vger.kernel.org --- drivers/ata/libata-scsi.c | 19 +++++++++++++++++++ kernel/freezer.c | 6 ++++++ 2 files changed, 25 insertions(+) --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -3871,6 +3871,25 @@ void ata_scsi_hotplug(struct work_struct return; } + /* + * XXX - UGLY HACK + * + * The core suspend/resume path is fundamentally broken due to + * freezable kthreads and workqueue and may deadlock if a block + * device gets removed while resume is in progress. I don't know + * what the solution is short of removing freezable kthreads and + * workqueues altogether. + * + * The following is an ugly hack to avoid kicking off device + * removal while freezer is active. This is a joke but does avoid + * this particular deadlock scenario. 
+ * + * https://bugzilla.kernel.org/show_bug.cgi?id=62801 + * http://marc.info/?l=linux-kernel&m=138695698516487 + */ + while (pm_freezing) + msleep(100); + DPRINTK("ENTER\n"); mutex_lock(&ap->scsi_scan_mutex); --- a/kernel/freezer.c +++ b/kernel/freezer.c @@ -19,6 +19,12 @@ EXPORT_SYMBOL(system_freezing_cnt); bool pm_freezing; bool pm_nosig_freezing; +/* + * Temporary export for the deadlock workaround in ata_scsi_hotplug(). + * Remove once the hack becomes unnecessary. + */ +EXPORT_SYMBOL_GPL(pm_freezing); + /* protects freezing and frozen transitions */ static DEFINE_SPINLOCK(freezer_lock); ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2] libata, freezer: avoid block device removal while system is frozen 2013-12-17 12:50 ` [PATCH v2] " Tejun Heo @ 2013-12-18 1:04 ` Rafael J. Wysocki 2013-12-18 11:08 ` Tejun Heo 2013-12-18 12:07 ` [PATCH v3] " Tejun Heo 1 sibling, 1 reply; 54+ messages in thread From: Rafael J. Wysocki @ 2013-12-18 1:04 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Tuesday, December 17, 2013 07:50:42 AM Tejun Heo wrote: > Hello, > > Rafael, if you're okay with the workaround, I'll route it through > libata/for-3.13-fixes. > > Thanks. > ------- 8< ------- > Freezable kthreads and workqueues are fundamentally problematic in > that they effectively introduce a big kernel lock widely used in the > kernel and have already been the culprit of several deadlock > scenarios. This is the latest occurrence. > > During resume, libata rescans all the ports and revalidates all > pre-existing devices. If it determines that a device has gone > missing, the device is removed from the system which involves > invalidating block device and flushing bdi while holding driver core > layer locks. Unfortunately, this can race with the rest of device > resume. Because freezable kthreads and workqueues are thawed after > device resume is complete and block device removal depends on > freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make > progress, this can lead to deadlock - block device removal can't > proceed because kthreads are frozen and kthreads can't be thawed > because device resume is blocked behind block device removal. > > 839a8e8660b6 ("writeback: replace custom worker pool implementation > with unbound workqueue") made this particular deadlock scenario more > visible but the underlying problem has always been there - the > original forker task and jbd2 are freezable too. 
In fact, this is > highly likely just one of many possible deadlock scenarios given that > freezer behaves as a big kernel lock and we don't have any debug > mechanism around it. > > I believe the right thing to do is getting rid of freezable kthreads > and workqueues. This is something fundamentally broken. For now, > implement a funny workaround in libata - just avoid doing block device > hot[un]plug while the system is frozen. Kernel engineering at its > finest. :( > > v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built > as a module. > > Signed-off-by: Tejun Heo <tj@kernel.org> > Reported-by: Tomaž Šolc <tomaz.solc@tablix.org> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801 > Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Cc: Len Brown <len.brown@intel.com> > Cc: Oleg Nesterov <oleg@redhat.com> > Cc: stable@vger.kernel.org > --- > drivers/ata/libata-scsi.c | 19 +++++++++++++++++++ > kernel/freezer.c | 6 ++++++ > 2 files changed, 25 insertions(+) > > --- a/drivers/ata/libata-scsi.c > +++ b/drivers/ata/libata-scsi.c > @@ -3871,6 +3871,25 @@ void ata_scsi_hotplug(struct work_struct > return; > } > > + /* > + * XXX - UGLY HACK > + * > + * The core suspend/resume path is fundamentally broken due to > + * freezable kthreads and workqueue and may deadlock if a block > + * device gets removed while resume is in progress. I don't know > + * what the solution is short of removing freezable kthreads and > + * workqueues altogether. Do you mean the block device core or the SCSI core or something else? It would be good to clarify that here to avoid confusion. > + * The following is an ugly hack to avoid kicking off device > + * removal while freezer is active. This is a joke but does avoid > + * this particular deadlock scenario. 
> + * > + * https://bugzilla.kernel.org/show_bug.cgi?id=62801 > + * http://marc.info/?l=linux-kernel&m=138695698516487 > + */ > + while (pm_freezing) > + msleep(100); Why is the sleep time 100 ms exactly? And why does it matter? For example, what would change if it were 10 ms? > + > DPRINTK("ENTER\n"); > mutex_lock(&ap->scsi_scan_mutex); > > --- a/kernel/freezer.c > +++ b/kernel/freezer.c > @@ -19,6 +19,12 @@ EXPORT_SYMBOL(system_freezing_cnt); > bool pm_freezing; > bool pm_nosig_freezing; > > +/* > + * Temporary export for the deadlock workaround in ata_scsi_hotplug(). > + * Remove once the hack becomes unnecessary. > + */ > +EXPORT_SYMBOL_GPL(pm_freezing); > + > /* protects freezing and frozen transitions */ > static DEFINE_SPINLOCK(freezer_lock); > Thanks, Rafael ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2] libata, freezer: avoid block device removal while system is frozen 2013-12-18 1:04 ` Rafael J. Wysocki @ 2013-12-18 11:08 ` Tejun Heo 0 siblings, 0 replies; 54+ messages in thread From: Tejun Heo @ 2013-12-18 11:08 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu Hey, Rafael. On Wed, Dec 18, 2013 at 02:04:35AM +0100, Rafael J. Wysocki wrote: > > + * The core suspend/resume path is fundamentally broken due to > > + * freezable kthreads and workqueue and may deadlock if a block > > + * device gets removed while resume is in progress. I don't know > > + * what the solution is short of removing freezable kthreads and > > + * workqueues altogether. > > Do you mean the block device core or the SCSI core or something else? It would > be good to clarify that here to avoid confusion. Will clarify. > > + * The following is an ugly hack to avoid kicking off device > > + * removal while freezer is active. This is a joke but does avoid > > + * this particular deadlock scenario. > > + * > > + * https://bugzilla.kernel.org/show_bug.cgi?id=62801 > > + * http://marc.info/?l=linux-kernel&m=138695698516487 > > + */ > > + while (pm_freezing) > > + msleep(100); > > Why is the sleep time 100 ms exactly? And why does it matter? Just a number I pulled out of my ass. > For example, what would change if it were 10 ms? Yeah, 10ms is my favorite human-visible polling duration too (because it's slow enough not to cause overhead issues while fast enough to be mostly unnoticeable to humans). This one doesn't really matter because the operation's latency isn't something which the user would wait for actively. That said, yeah, why not, I'll change it to 10ms. Thanks. -- tejun ^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v3] libata, freezer: avoid block device removal while system is frozen 2013-12-17 12:50 ` [PATCH v2] " Tejun Heo 2013-12-18 1:04 ` Rafael J. Wysocki @ 2013-12-18 12:07 ` Tejun Heo 2013-12-18 22:08 ` Rafael J. Wysocki 2013-12-19 18:54 ` [PATCH v4] " Tejun Heo 1 sibling, 2 replies; 54+ messages in thread From: Tejun Heo @ 2013-12-18 12:07 UTC (permalink / raw) To: Rafael J. Wysocki, Jens Axboe Cc: tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu Freezable kthreads and workqueues are fundamentally problematic in that they effectively introduce a big kernel lock widely used in the kernel and have already been the culprit of several deadlock scenarios. This is the latest occurrence. During resume, libata rescans all the ports and revalidates all pre-existing devices. If it determines that a device has gone missing, the device is removed from the system which involves invalidating block device and flushing bdi while holding driver core layer locks. Unfortunately, this can race with the rest of device resume. Because freezable kthreads and workqueues are thawed after device resume is complete and block device removal depends on freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make progress, this can lead to deadlock - block device removal can't proceed because kthreads are frozen and kthreads can't be thawed because device resume is blocked behind block device removal. 839a8e8660b6 ("writeback: replace custom worker pool implementation with unbound workqueue") made this particular deadlock scenario more visible but the underlying problem has always been there - the original forker task and jbd2 are freezable too. In fact, this is highly likely just one of many possible deadlock scenarios given that freezer behaves as a big kernel lock and we don't have any debug mechanism around it. I believe the right thing to do is getting rid of freezable kthreads and workqueues. This is something fundamentally broken. 
For now, implement a funny workaround in libata - just avoid doing block device hot[un]plug while the system is frozen. Kernel engineering at its finest. :( v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built as a module. v3: Comment updated and polling interval changed to 10ms as suggested by Rafael. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Tomaž Šolc <tomaz.solc@tablix.org> Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801 Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Len Brown <len.brown@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@vger.kernel.org --- drivers/ata/libata-scsi.c | 19 +++++++++++++++++++ kernel/freezer.c | 6 ++++++ 2 files changed, 25 insertions(+) diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index db6dfcf..f519868 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -3871,6 +3871,25 @@ void ata_scsi_hotplug(struct work_struct *work) return; } + /* + * XXX - UGLY HACK + * + * The block layer suspend/resume path is fundamentally broken due + * to freezable kthreads and workqueue and may deadlock if a block + * device gets removed while resume is in progress. I don't know + * what the solution is short of removing freezable kthreads and + * workqueues altogether. + * + * The following is an ugly hack to avoid kicking off device + * removal while freezer is active. This is a joke but does avoid + * this particular deadlock scenario. 
+ * + * https://bugzilla.kernel.org/show_bug.cgi?id=62801 + * http://marc.info/?l=linux-kernel&m=138695698516487 + */ + while (pm_freezing) + msleep(10); + DPRINTK("ENTER\n"); mutex_lock(&ap->scsi_scan_mutex); diff --git a/kernel/freezer.c b/kernel/freezer.c index b462fa1..aa6a8aa 100644 --- a/kernel/freezer.c +++ b/kernel/freezer.c @@ -19,6 +19,12 @@ EXPORT_SYMBOL(system_freezing_cnt); bool pm_freezing; bool pm_nosig_freezing; +/* + * Temporary export for the deadlock workaround in ata_scsi_hotplug(). + * Remove once the hack becomes unnecessary. + */ +EXPORT_SYMBOL_GPL(pm_freezing); + /* protects freezing and frozen transitions */ static DEFINE_SPINLOCK(freezer_lock); ^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH v3] libata, freezer: avoid block device removal while system is frozen 2013-12-18 12:07 ` [PATCH v3] " Tejun Heo @ 2013-12-18 22:08 ` Rafael J. Wysocki 2013-12-19 17:24 ` Tejun Heo 2013-12-19 18:54 ` [PATCH v4] " Tejun Heo 1 sibling, 1 reply; 54+ messages in thread From: Rafael J. Wysocki @ 2013-12-18 22:08 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Wednesday, December 18, 2013 07:07:32 AM Tejun Heo wrote: > Freezable kthreads and workqueues are fundamentally problematic in > that they effectively introduce a big kernel lock widely used in the > kernel and have already been the culprit of several deadlock > scenarios. This is the latest occurrence. > > During resume, libata rescans all the ports and revalidates all > pre-existing devices. If it determines that a device has gone > missing, the device is removed from the system which involves > invalidating block device and flushing bdi while holding driver core > layer locks. Unfortunately, this can race with the rest of device > resume. Because freezable kthreads and workqueues are thawed after > device resume is complete and block device removal depends on > freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make > progress, this can lead to deadlock - block device removal can't > proceed because kthreads are frozen and kthreads can't be thawed > because device resume is blocked behind block device removal. > > 839a8e8660b6 ("writeback: replace custom worker pool implementation > with unbound workqueue") made this particular deadlock scenario more > visible but the underlying problem has always been there - the > original forker task and jbd2 are freezable too. In fact, this is > highly likely just one of many possible deadlock scenarios given that > freezer behaves as a big kernel lock and we don't have any debug > mechanism around it. 
>
> I believe the right thing to do is getting rid of freezable kthreads
> and workqueues.

I agree. It may be useful to block them over suspend/resume, but that
doesn't have to be done through the freezer.

> This is something fundamentally broken. For now,
> implement a funny workaround in libata - just avoid doing block device
> hot[un]plug while the system is frozen. Kernel engineering at its
> finest. :(
>
> v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
>     as a module.
>
> v3: Comment updated and polling interval changed to 10ms as suggested
>     by Rafael.

This one is fine by me FWIW.

Thanks!

> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
> Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: stable@vger.kernel.org
> ---
>  drivers/ata/libata-scsi.c | 19 +++++++++++++++++++
>  kernel/freezer.c          |  6 ++++++
>  2 files changed, 25 insertions(+)
>
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index db6dfcf..f519868 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -3871,6 +3871,25 @@ void ata_scsi_hotplug(struct work_struct *work)
>  		return;
>  	}
>
> +	/*
> +	 * XXX - UGLY HACK
> +	 *
> +	 * The block layer suspend/resume path is fundamentally broken due
> +	 * to freezable kthreads and workqueue and may deadlock if a block
> +	 * device gets removed while resume is in progress. I don't know
> +	 * what the solution is short of removing freezable kthreads and
> +	 * workqueues altogether.
> +	 *
> +	 * The following is an ugly hack to avoid kicking off device
> +	 * removal while freezer is active. This is a joke but does avoid
> +	 * this particular deadlock scenario.
> +	 *
> +	 * https://bugzilla.kernel.org/show_bug.cgi?id=62801
> +	 * http://marc.info/?l=linux-kernel&m=138695698516487
> +	 */
> +	while (pm_freezing)
> +		msleep(10);
> +
>  	DPRINTK("ENTER\n");
>  	mutex_lock(&ap->scsi_scan_mutex);
>
> diff --git a/kernel/freezer.c b/kernel/freezer.c
> index b462fa1..aa6a8aa 100644
> --- a/kernel/freezer.c
> +++ b/kernel/freezer.c
> @@ -19,6 +19,12 @@ EXPORT_SYMBOL(system_freezing_cnt);
>  bool pm_freezing;
>  bool pm_nosig_freezing;
>
> +/*
> + * Temporary export for the deadlock workaround in ata_scsi_hotplug().
> + * Remove once the hack becomes unnecessary.
> + */
> +EXPORT_SYMBOL_GPL(pm_freezing);
> +
>  /* protects freezing and frozen transitions */
>  static DEFINE_SPINLOCK(freezer_lock);
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 54+ messages in thread
* Re: [PATCH v3] libata, freezer: avoid block device removal while system is frozen 2013-12-18 22:08 ` Rafael J. Wysocki @ 2013-12-19 17:24 ` Tejun Heo 0 siblings, 0 replies; 54+ messages in thread From: Tejun Heo @ 2013-12-19 17:24 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu Hello, On Wed, Dec 18, 2013 at 11:08:41PM +0100, Rafael J. Wysocki wrote: > > This is something fundamentally broken. For now, > > implement a funny workaround in libata - just avoid doing block device > > hot[un]plug while the system is frozen. Kernel engineering at its > > finest. :( > > > > v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built > > as a module. > > > > v3: Comment updated and polling interval changed to 10ms as suggested > > by Rafael. > > This one is fine by me FWIW. Applied to libata/for-3.13-fixes. I liberally added your reviewed-by from the above response. Please let me know if I should remove it. I'll push it out to Linus after it spends a couple days in -next. Thanks! -- tejun ^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v4] libata, freezer: avoid block device removal while system is frozen 2013-12-18 12:07 ` [PATCH v3] " Tejun Heo 2013-12-18 22:08 ` Rafael J. Wysocki @ 2013-12-19 18:54 ` Tejun Heo 1 sibling, 0 replies; 54+ messages in thread From: Tejun Heo @ 2013-12-19 18:54 UTC (permalink / raw) To: Rafael J. Wysocki, Jens Axboe Cc: tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu Hello, again. Oops, it was missing CONFIG_FREEZER. Updated version committed. How did we even survive before kbuild test robot? :) Thanks. ------ 8< ------ >From 85fbd722ad0f5d64d1ad15888cd1eb2188bfb557 Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Wed, 18 Dec 2013 07:07:32 -0500 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Freezable kthreads and workqueues are fundamentally problematic in that they effectively introduce a big kernel lock widely used in the kernel and have already been the culprit of several deadlock scenarios. This is the latest occurrence. During resume, libata rescans all the ports and revalidates all pre-existing devices. If it determines that a device has gone missing, the device is removed from the system which involves invalidating block device and flushing bdi while holding driver core layer locks. Unfortunately, this can race with the rest of device resume. Because freezable kthreads and workqueues are thawed after device resume is complete and block device removal depends on freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make progress, this can lead to deadlock - block device removal can't proceed because kthreads are frozen and kthreads can't be thawed because device resume is blocked behind block device removal. 839a8e8660b6 ("writeback: replace custom worker pool implementation with unbound workqueue") made this particular deadlock scenario more visible but the underlying problem has always been there - the original forker task and jbd2 are freezable too. 
In fact, this is highly likely just one of many possible deadlock scenarios given that freezer behaves as a big kernel lock and we don't have any debug mechanism around it. I believe the right thing to do is getting rid of freezable kthreads and workqueues. This is something fundamentally broken. For now, implement a funny workaround in libata - just avoid doing block device hot[un]plug while the system is frozen. Kernel engineering at its finest. :( v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built as a module. v3: Comment updated and polling interval changed to 10ms as suggested by Rafael. v4: Add #ifdef CONFIG_FREEZER around the hack as pm_freezing is not defined when FREEZER is not configured thus breaking build. Reported by kbuild test robot. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Tomaž Šolc <tomaz.solc@tablix.org> Reviewed-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801 Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Len Brown <len.brown@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@vger.kernel.org Cc: kbuild test robot <fengguang.wu@intel.com> --- drivers/ata/libata-scsi.c | 21 +++++++++++++++++++++ kernel/freezer.c | 6 ++++++ 2 files changed, 27 insertions(+) diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index db6dfcf..176f629 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -3871,6 +3871,27 @@ void ata_scsi_hotplug(struct work_struct *work) return; } + /* + * XXX - UGLY HACK + * + * The block layer suspend/resume path is fundamentally broken due + * to freezable kthreads and workqueue and may deadlock if a block + * device gets removed while resume is in progress. I don't know + * what the solution is short of removing freezable kthreads and + * workqueues altogether. 
+	 *
+	 * The following is an ugly hack to avoid kicking off device
+	 * removal while freezer is active. This is a joke but does avoid
+	 * this particular deadlock scenario.
+	 *
+	 * https://bugzilla.kernel.org/show_bug.cgi?id=62801
+	 * http://marc.info/?l=linux-kernel&m=138695698516487
+	 */
+#ifdef CONFIG_FREEZER
+	while (pm_freezing)
+		msleep(10);
+#endif
+
 	DPRINTK("ENTER\n");
 	mutex_lock(&ap->scsi_scan_mutex);

diff --git a/kernel/freezer.c b/kernel/freezer.c
index b462fa1..aa6a8aa 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -19,6 +19,12 @@ EXPORT_SYMBOL(system_freezing_cnt);
 bool pm_freezing;
 bool pm_nosig_freezing;

+/*
+ * Temporary export for the deadlock workaround in ata_scsi_hotplug().
+ * Remove once the hack becomes unnecessary.
+ */
+EXPORT_SYMBOL_GPL(pm_freezing);
+
 /* protects freezing and frozen transitions */
 static DEFINE_SPINLOCK(freezer_lock);
--
1.8.4.2

^ permalink raw reply related	[flat|nested] 54+ messages in thread
* Re: Writeback threads and freezable 2013-12-13 17:49 Writeback threads and freezable Tejun Heo 2013-12-13 18:52 ` Tejun Heo @ 2013-12-14 1:53 ` Dave Chinner 2013-12-14 17:30 ` Greg Kroah-Hartman 2013-12-14 20:23 ` Tejun Heo 1 sibling, 2 replies; 54+ messages in thread From: Dave Chinner @ 2013-12-14 1:53 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Fri, Dec 13, 2013 at 12:49:32PM -0500, Tejun Heo wrote: > Hello, guys. > > This is discovered while investigating bug 62801 - "EliteBoot hangs at > dock, suspend, undock, resume". > > https://bugzilla.kernel.org/show_bug.cgi?id=62801 > > The laptop locks up during resume if undocked while suspended. The > dock contains a hard drive and the drive removal path and resume path > get stuck. This got bisected to 839a8e8660b6 ("writeback: replace > custom worker pool implementation with unbound workqueue") by the > reporter. The problem can be reproduced by just removing mounted > harddrive while a machine is suspended. I first thought it was some > dumb mistake but it turns out to be something fundamental. > > So, here's the lock up. > > * Resume starts and libata resume is kicked off. > > * libata EH determines that the device is gone. Eventually it invokes > device_del() through a work item. > > * device_del() tries to delete the block device which invokes > writeback_inodes_sb() on the mounted filesystem which in turn > schedules and flushes bdi work item. Note that at this point, the > kworker is holding multiple driver layer locks. That's the fundamental problem here - device removal asks the device to fsync the filesystem on top of the device that was just removed. The simple way to trigger this is to pull a device from underneath an active filesystem (e.g. user unplugs a USB device without first unmounting it). 
There are many years' worth of bug reports showing that this attempt by the device removal code to sync the filesystem leads to deadlocks. It's simply not a valid thing to do - just how is the filesystem supposed to sync to a non-existent device? I've raised this each time a user reports it over the past few years and never been able to convince anyone to fix the filesystem re-entrancy problem device removal causes. Syncing the filesystem will require taking locks that are held by IO in progress, and so can't make progress until the IO is completed, but that can't happen until the error handling completes the sync of the filesystem.... Preventing fs/io re-entrancy from contexts where we might be holding locks is the same reason we have GFP_NOFS and GFP_NOIO for memory allocation: re-entering a filesystem or IO subsystem whenever we are holding locks or serialised context in the fs/io path can deadlock the fs/io path. IOWs, syncing the filesystem from the device delete code is just plain wrong. That's what needs fixing - removing the cause of re-entrancy, not the workqueue or writeback code... > Ideas? Fix the device delete error handling not to re-enter the filesystem and IO path. It's just wrong. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Writeback threads and freezable 2013-12-14 1:53 ` Writeback threads and freezable Dave Chinner @ 2013-12-14 17:30 ` Greg Kroah-Hartman 2013-12-14 20:23 ` Tejun Heo 1 sibling, 0 replies; 54+ messages in thread From: Greg Kroah-Hartman @ 2013-12-14 17:30 UTC (permalink / raw) To: Dave Chinner Cc: Tejun Heo, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Fengguang Wu On Sat, Dec 14, 2013 at 12:53:43PM +1100, Dave Chinner wrote: > > Ideas? > > Fix the device delete error handling not to re-enter the > filesystem and IO path. It's just wrong. That sounds totally reasonable to me, what is preventing someone from making this change? thanks, greg k-h ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Writeback threads and freezable 2013-12-14 1:53 ` Writeback threads and freezable Dave Chinner 2013-12-14 17:30 ` Greg Kroah-Hartman @ 2013-12-14 20:23 ` Tejun Heo 2013-12-16 3:56 ` Dave Chinner 1 sibling, 1 reply; 54+ messages in thread From: Tejun Heo @ 2013-12-14 20:23 UTC (permalink / raw) To: Dave Chinner Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu Hello, Dave. On Sat, Dec 14, 2013 at 12:53:43PM +1100, Dave Chinner wrote: > That's the fundamental problem here - device removal asks the device > to fsync the filesystem on top of the device that was just removed. > The simple way to trigger this is to pull a device from underneath > an active filesystem (e.g. user unplugs a USB device without first > unmounting it). There are many years' worth of bug reports showing > that this attempt by the device removal code to sync the filesystem > leads to deadlocks. What are you suggesting? Implementing separate warm and hot unplug paths? That makes no sense whatsoever. Device hot unplug is just a sub operation of general device unplug which should be able to succeed whether the underlying device is failing IOs or not. Another thing to consider is that you want to exercise your error handling path as often as possible. IOs could still fail even during device warm unplugs. If anything atop the driver can't handle that, that's a bug and we want to fix them. The fact that we have bug reports going back years on the subject simply means that we've been traditionally lousy with error handling throughout the layers - those were the occasions where we fixed actual error handling bugs, hot unplug or not. Special casing hot unplug would just make existing error handling bugs more difficult to find. > It's simply not a valid thing to do - just how is the filesystem > supposed to sync to a non-existent device? By issuing all IOs and then handling the failures appropriately.
Exactly just as it would handle *ANY* io errors. This actually is the simplest thing to do to drain the pipe - pushing things down all the way and fail them at single point. What's the alternative? Implement quick exit path at each layer? That's just silly and error-prone. > I've raised this each time a user reports it over the past few years > and never been able to convince anyone to fix the filesystem > re-entrancy problem device removal causes. Syncing the filesystem Well, it isn't a good idea. > will require taking locks that are held by IO in progress, and so can't > make progress until the IO is completed, but that can't happen > until the error handling completes the sync of the filesystem.... IOs in progress shouldn't ever be stalled by filesystem sync. That's a massive layering violation. Where and how does that happen? > Preventing fs/io re-entrancy from contexts where we might be holding > locks is the same reason we have GFP_NOFS and GFP_NOIO for memory > allocation: re-entering a filesystem or IO subsystem whenever we are > holding locks or serialised context in the fs/io path can deadlock > the fs/io path. > > IOWs, syncing the filesystem from the device delete code is just > plain wrong. That's what needs fixing - removing the cause of > re-entrancy, not the workqueue or writeback code... No, the wrong thing there is having IO failures depending on filesystem sync. Nothing else. > > Ideas? > > Fix the device delete error handling not to re-enter the > filesystem and IO path. It's just wrong. Nope, bad idea. Thanks. -- tejun ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Writeback threads and freezable 2013-12-14 20:23 ` Tejun Heo @ 2013-12-16 3:56 ` Dave Chinner 2013-12-16 12:51 ` Tejun Heo 0 siblings, 1 reply; 54+ messages in thread From: Dave Chinner @ 2013-12-16 3:56 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Sat, Dec 14, 2013 at 03:23:24PM -0500, Tejun Heo wrote: > Hello, Dave. > > On Sat, Dec 14, 2013 at 12:53:43PM +1100, Dave Chinner wrote: > > That's the fundamental problem here - device removal asks the device > > to fsync the filesystem on top of the device that was just removed. > > The simple way to trigger this is to pull a device from underneath > > an active filesystem (e.g. user unplugs a USB device without first > > unmounting it). There are many years' worth of bug reports showing > > that this attempt by the device removal code to sync the filesystem > > leads to deadlocks. > > What are you suggesting? Implementing separate warm and hot unplug > paths? That makes no sense whatsoever. Device hot unplug is just a > sub operation of general device unplug which should be able to succeed > whether the underlying device is failing IOs or not. I don't care. Trying to issue IO from an IO error handling path where the device has just been removed is fundamentally broken. > Another thing to consider is that you want to exercise your error > handling path as often as possible. IOs could still fail even during > device warm unplugs. If anything atop the driver can't handle that, > that's a bug and we want to fix them. The fact that we have bug > reports going back years on the subject simply means that we've been > traditionally lousy with error handling throughout the layers - those > were the occasions where we fixed actual error handling bugs, hot > unplug or not. Special casing hot unplug would just make existing > error handling bugs more difficult to find.
This has nothing to do with existing filesystem error handling paths. We get plenty of testing of them without having to pull devices out of machines. Why do you think we have stuff like dm-flakey or forced shutdown ioctls for XFS? Indeed, what I'm asking for is for a notification so that we can *shut the filesystem down* straight away, rather than have to wait for an IO error in a critical metadata structure to trigger the shutdown for us. > > It's simply not a valid thing to do - just how is the filesystem > > supposed to sync to a non-existent device? > > By issuing all IOs and then handling the failures appropriately. Which bit of "filesystem might deadlock trying to issue IO" didn't you understand? > Exactly just as it would handle *ANY* io errors. This actually is the > simplest thing to do to drain the pipe - pushing things down all the > way and fail them at single point. What's the alternative? Implement > quick exit path at each layer? That's just silly and error-prone. > > > I've raised this each time a user reports it over the past few years > > and never been able to convince anyone to fix the filesystem > > re-entrancy problem device removal causes. Syncing the filesystem > > Well, it isn't a good idea. > > > will require taking locks that are held by IO in progress, and so can't > > make progress until the IO is completed, but that can't happen > > until the error handling completes the sync of the filesystem.... > > IOs in progress shouldn't ever be stalled by filesystem sync. That's > a massive layering violation. Where and how does that happen? IOs in progress get stalled by IO error handling. Then IO error handling tries to issue more IO via fsync_bdev(). Then the filesystem tries to lock a metadata buffer that is still under IO (i.e. in error handling) and stalls. That's the problem here - IO error handling is re-entering the filesystem inappropriately. If you have to invalidate the device due to errors, fsync_bdev() is not the function to call.
That's for *flushing* the filesystem. A shutdown is the only sane thing for an active filesystem to do when the device underneath it has been invalidated. Really, it gets worse - the block device is deleted and freed from the system while the filesystem on top of it still has an open reference to it. Why do you think we have all that nasty default BDI crap in the writeback path, Tejun? It's because the disk hot-unplug paths free the bdi the filesystem is using out from underneath it. What the hell happened to the usual rule of "don't free objects until the last user goes away"? What makes the error handling paths of the layers below the filesystem not subject to common sense? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Writeback threads and freezable 2013-12-16 3:56 ` Dave Chinner @ 2013-12-16 12:51 ` Tejun Heo 2013-12-16 12:56 ` Tejun Heo 0 siblings, 1 reply; 54+ messages in thread From: Tejun Heo @ 2013-12-16 12:51 UTC (permalink / raw) To: Dave Chinner Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Mon, Dec 16, 2013 at 02:56:52PM +1100, Dave Chinner wrote: > > What are you suggesting? Implementing separate warm and hot unplug > > paths? That makes no sense whatsoever. Device hot unplug is just a > > sub operation of general device unplug which should be able to succeed > > whether the underlying device is failing IOs or not. > > I don't care. Trying to issue IO from an IO error handling path > where the device has just been removed is fundamentally broken. What? Have you even read the original message? IO error handling path isn't issuing the IO here. The hot unplug operation is completely asynchronous to the IO path. What's deadlocking is not the filesystem and IO path but device driver layer and hot unplug path. IOs are not stalled. > Indeed, what I'm asking for is for a notification so that we can > *shut the filesystem down* straight away, rather than have to wait > for an IO error in a critical metadata structure to trigger the > shutdown for us. which has exactly zero to do with the issue at hand. Would you please stop derailing the discussion? Again, IOs are not stalled. That's not the issue here. > > > It's simply not a valid thing to do - just how is the filesystem > > > supposed to sync to a non-existent device? > > > > By issuing all IOs and then handling the failures appropriately. > > Which bit of "filesystem might deadlock trying to issue IO" didn't > you understand? smh, I give up. -- tejun ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Writeback threads and freezable 2013-12-16 12:51 ` Tejun Heo @ 2013-12-16 12:56 ` Tejun Heo 2013-12-18 0:35 ` Dave Chinner 0 siblings, 1 reply; 54+ messages in thread From: Tejun Heo @ 2013-12-16 12:56 UTC (permalink / raw) To: Dave Chinner Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Mon, Dec 16, 2013 at 07:51:24AM -0500, Tejun Heo wrote: > On Mon, Dec 16, 2013 at 02:56:52PM +1100, Dave Chinner wrote: > > > What are you suggesting? Implementing separate warm and hot unplug > > > paths? That makes no sense whatsoever. Device hot unplug is just a > > > sub operation of general device unplug which should be able to succeed > > > whether the underlying device is failing IOs or not. > > > > I don't care. Trying to issue IO from an an IO error handling path > > where the device has just been removed is fundamentally broken. > > What? Have you even read the original message? IO error handling > path isn't issuing the IO here. The hot unplug operation is > completely asynchronous to the IO path. What's dead locking is not > the filesystem and IO path but device driver layer and hot unplug > path. IOs are not stalled. In fact, this deadlock can be reproduced without hotunplug at all. If you initiate warm unplug and warm unplugging races with suspend/resume cycle, it'll behave exactly the same - the IOs from flushing in that scenario would succeed but IOs failing or not has *NOTHING* to do with this deadlock. It'd still happen. It's freezer behaving as a giant lock and other locks of course getting dragged into giant dependency loop. Can you please at least *try* to understand what's going on before throwing strong assertions? -- tejun ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Writeback threads and freezable 2013-12-16 12:56 ` Tejun Heo @ 2013-12-18 0:35 ` Dave Chinner 2013-12-18 11:43 ` Tejun Heo 0 siblings, 1 reply; 54+ messages in thread From: Dave Chinner @ 2013-12-18 0:35 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Mon, Dec 16, 2013 at 07:56:04AM -0500, Tejun Heo wrote: > On Mon, Dec 16, 2013 at 07:51:24AM -0500, Tejun Heo wrote: > > On Mon, Dec 16, 2013 at 02:56:52PM +1100, Dave Chinner wrote: > > > > What are you suggesting? Implementing separate warm and hot unplug > > > > paths? That makes no sense whatsoever. Device hot unplug is just a > > > > sub operation of general device unplug which should be able to succeed > > > > whether the underlying device is failing IOs or not. > > > > > > I don't care. Trying to issue IO from an an IO error handling path > > > where the device has just been removed is fundamentally broken. > > > > What? Have you even read the original message? IO error handling > > path isn't issuing the IO here. The hot unplug operation is > > completely asynchronous to the IO path. What's dead locking is not > > the filesystem and IO path but device driver layer and hot unplug > > path. IOs are not stalled. > > In fact, this deadlock can be reproduced without hotunplug at all. If > you initiate warm unplug and warm unplugging races with suspend/resume > cycle, it'll behave exactly the same - the IOs from flushing in that > scenario would succeed but IOs failing or not has *NOTHING* to do with > this deadlock. It'd still happen. It's freezer behaving as a giant > lock and other locks of course getting dragged into giant dependency > loop. Can you please at least *try* to understand what's going on > before throwing strong assertions? Sure I understand the problem. Part of that dependency loop is caused by filesystem re-entrancy from the hot unplug code. 
Regardless of this /specific freezer problem/, that filesystem re-entrancy path is *never OK* during a hot-unplug event and it needs to be fixed. Perhaps the function "invalidate_partition()" is badly named. To state the obvious, fsync != invalidation. What it does is:

1. sync filesystem
2. shrink the dcache
3. invalidates inodes and kills dirty inodes
4. invalidates block device (removes cached bdev pages)

Basically, the first step is "flush", the remainder is "invalidate". Indeed, step 3 throws away dirty inodes, so why on earth would we even bother with step 1 to try to clean them in the first place? IOWs, the flush is irrelevant in the hot-unplug case as it will fail to flush stuff, and then we just throw the stuff we failed to write away. But in attempting to flush all the dirty data and metadata, we can cause all sorts of other potential re-entrancy based deadlocks due to attempting to issue IO. Whether they be freezer based or through IO error handling triggering device removal or some other means, it is irrelevant - it is the flush that causes all the problems. We need to either get rid of the flush on device failure/hot-unplug, or turn it into a callout for the superblock to take an appropriate action (e.g. shutting down the filesystem) rather than trying to issue IO. i.e. allow the filesystem to take appropriate action of shutting down the filesystem and invalidating its caches. Indeed, in XFS there's several other caches that could contain dirty metadata that isn't invalidated by invalidate_partition(), and so unless the filesystem is shut down it can continue to try to issue IO on those buffers to the removed device until the filesystem is shutdown or unmounted. Seriously, Tejun, the assumption that invalidate_partition() knows enough about filesystems to safely "invalidate" them is just broken.
These days, filesystems often reference multiple block devices, and so the way hotplug currently treats them as "one device, one filesystem" is also fundamentally wrong. So there are many ways in which the hot-unplug code is broken in its use of invalidate_partition(), the least of which is the dependencies caused by re-entrancy. We really need a "sb->shutdown()" style callout as step one in the above process, not fsync_bdev(). Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Writeback threads and freezable 2013-12-18 0:35 ` Dave Chinner @ 2013-12-18 11:43 ` Tejun Heo 2013-12-18 22:14 ` Rafael J. Wysocki 2013-12-19 4:08 ` Dave Chinner 0 siblings, 2 replies; 54+ messages in thread From: Tejun Heo @ 2013-12-18 11:43 UTC (permalink / raw) To: Dave Chinner Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu Hello, Dave. On Wed, Dec 18, 2013 at 11:35:10AM +1100, Dave Chinner wrote: > Sure I understand the problem. Part of that dependency loop is > caused by filesystem re-entrancy from the hot unplug code. > Regardless of this /specific freezer problem/, that filesystem > re-entrancy path is *never OK* during a hot-unplug event and it > needs to be fixed. I don't think you do. The only relation between the two problems is that they're traveling the same code path. Again, the deadlock can be triggered *the same* with warm unplug && it *doesn't* need writeback or jbd - we can deadlock with any freezables and we have drivers which use freezables in IO issue path. I don't really mind the discussion branching off to a related topic but you've been consistently misunderstanding the issue at hand and derailing the larger discussion. If you want to discuss the issue that you brought up, that's completely fine but *PLEASE* stop getting confused and mixing up the two. > Perhaps the function "invalidate_partition()" is badly named. To > state the obvious, fsync != invalidation. What it does is: > > 1. sync filesystem > 2. shrink the dcache > 3. invalidates inodes and kills dirty inodes > 4. invalidates block device (removes cached bdev pages) > > Basically, the first step is "flush", the remainder is "invalidate". > > Indeed, step 3 throws away dirty inodes, so why on earth would we > even bother with step 1 to try to clean them in the first place? 
> IOWs, the flush is irrelevant in the hot-unplug case as it will > fail to flush stuff, and then we just throw the stuff we > failed to write away. > > But in attempting to flush all the dirty data and metadata, we can > cause all sorts of other potential re-entrancy based deadlocks due > to attempting to issue IO. Whether they be freezer based or through > IO error handling triggering device removal or some other means, it > is irrelevant - it is the flush that causes all the problems. Isn't the root cause there hotunplug reentering anything above it in the first place? The problem with your proposal is that filesystem isn't the only place where this could happen. Even with no filesystem involved, block device could still be dirty and IOs pending in whatever form - dirty bdi, bios queued in dm, requests queued in request_queue, whatever really - and if the hotunplug path reenters any of the higher layers in a way which blocks IO processing, it will deadlock. If knowing that the underlying device has gone away somehow helps filesystem, maybe we can expose that interface and avoid flushing after hotunplug but that merely hides the possible deadlock scenario that you're concerned about. Nothing is really solved. We can try to do the same thing at each layer and implement quick exit path for hot unplug all the way down to the driver but that kinda sounds complex and fragile to me. It's a lot larger surface to cover when the root cause is hotunplug allowed to reenter anything at all from IO path. This is especially true because hotunplug can trivially be made fully asynchronous in most cases. In terms of destruction of higher level objects, warm and hot unplugs can and should behave identically. In fact, it isn't possible to draw a strict line between warm and hot unplugs - what if the device gets detached while warm unplug is in progress? We'd be already traveling warm unplug path when the operating condition becomes identical to hot unplug.
> We need to either get rid of the flush on device failure/hot-unplug, > or turn it into a callout for the superblock to take an appropriate > action (e.g. shutting down the filesystem) rather than trying to > issue IO. i.e. allow the filesystem to take appropriate action of > shutting down the filesystem and invalidating its caches. There could be cases where some optimizations for hot unplug could be useful. Maybe suppressing pointless duplicate warning messages or whatnot but I'm highly doubtful anything will actually be fixed that way. We'll most likely just be making bugs less reproducible. > Indeed, in XFS there's several other caches that could contain dirty > metadata that isn't invalidated by invalidate_partition(), and so > unless the filesystem is shut down it can continue to try to issue > IO on those buffers to the removed device until the filesystem is > shutdown or unmounted. Do you mean xfs never gives up after IO failures? > Seriously, Tejun, the assumption that invalidate_partition() knows > enough about filesystems to safely "invalidate" them is just broken. > These days, filesystems often reference multiple block devices, and > so the way hotplug currently treats them as "one device, one > filesystem" is also fundamentally wrong. > > So there's many ways in which the hot-unplug code is broken in its > use of invalidate_partition(), the least of which is the > dependencies caused by re-entrancy. We really need a > "sb->shutdown()" style callout as step one in the above process, not > fsync_bdev(). If filesystems need an indication that the underlying device is no longer functional, please go ahead and add it, but please keep in mind all these are completely asynchronous. Nothing guarantees you that such events would happen in any specific order.
IOW, you can be at *ANY* point in your warm unplug path and the device is hot unplugged, which essentially forces all the code paths to be ready for the worst, and that's exactly why there isn't much effort in trying to separate out warm and hot unplug paths. Thanks. -- tejun ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Writeback threads and freezable 2013-12-18 11:43 ` Tejun Heo @ 2013-12-18 22:14 ` Rafael J. Wysocki 2013-12-19 4:08 ` Dave Chinner 1 sibling, 0 replies; 54+ messages in thread From: Rafael J. Wysocki @ 2013-12-18 22:14 UTC (permalink / raw) To: Tejun Heo Cc: Dave Chinner, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Wednesday, December 18, 2013 06:43:43 AM Tejun Heo wrote: [...] > If filesystems need an indication that the underlying device is no > longer functional, please go ahead and add it, but please keep in mind > all these are completely asynchronous. Nothing guarantees you that > such events would happen in any specific order. IOW, you can be at > *ANY* point in your warm unplug path and the device is hot unplugged, > which essentially forces all the code paths to be ready for the worst, > and that's exactly why there isn't much effort in trying to separate > out warm and hot unplug paths. Yes. Devices can go away at any point without notice. Even PCI devices that have never been assumed to be hot-removable. Any piece of code in the kernel needs to be prepared to deal with such situations. Thanks, Rafael
* Re: Writeback threads and freezable 2013-12-18 11:43 ` Tejun Heo 2013-12-18 22:14 ` Rafael J. Wysocki @ 2013-12-19 4:08 ` Dave Chinner 2013-12-19 16:24 ` Tejun Heo 1 sibling, 1 reply; 54+ messages in thread From: Dave Chinner @ 2013-12-19 4:08 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Wed, Dec 18, 2013 at 06:43:43AM -0500, Tejun Heo wrote: > Hello, Dave. > > On Wed, Dec 18, 2013 at 11:35:10AM +1100, Dave Chinner wrote: > > Perhaps the function "invalidate_partition()" is badly named. To > > state the obvious, fsync != invalidation. What it does is: > > > > 1. sync filesystem > > 2. shrink the dcache > > 3. invalidates inodes and kills dirty inodes > > 4. invalidates block device (removes cached bdev pages) > > > > Basically, the first step is "flush", the remainder is "invalidate". > > > > Indeed, step 3 throws away dirty inodes, so why on earth would we > > even bother with step 1 to try to clean them in the first place? > > IOWs, the flush is irrelevant in the hot-unplug case as it will > > fail to flush stuff, and then we just throw the stuff we > > failed to write away. > > > > But in attempting to flush all the dirty data and metadata, we can > > cause all sorts of other potential re-entrancy based deadlocks due > > to attempting to issue IO. Whether they be freezer based or through > > IO error handling triggering device removal or some other means, it > > is irrelevant - it is the flush that causes all the problems. > > Isn't the root cause there hotunplug reentering anything above it in > the first place. The problem with your proposal is that filesystem > isn't the only place where this could happen. 
Even with no filesystem > involved, block device could still be dirty and IOs pending in > whatever form - dirty bdi, bios queued in dm, requests queued in > request_queue, whatever really - and if the hotunplug path reenters > any of the higher layers in a way which blocks IO processing, it will > deadlock. Entirely possible. > If knowing that the underlying device has gone away somehow helps > filesystem, maybe we can expose that interface and avoid flushing > after hotunplug but that merely hides the possible deadlock scenario > that you're concerned about. Nothing is really solved. Except that a user of the block device has been informed that it is now gone and has been freed from under it. i.e. we can *immediately* inform the user that their mounted filesystem is now stuffed and suppress all the errors that are going to occur as a result of sync_filesystem() triggering IO failures all over the place and then having to react to that. Indeed, there is no guarantee that sync_filesystem will result in the filesystem being shut down - if the filesystem is clean then nothing will happen, and it won't be until the user modifies some metadata that a shutdown will be triggered. That could be a long time after the device has been removed.... > We can try to do the same thing at each layer and implement quick exit > path for hot unplug all the way down to the driver but that kinda > sounds complex and fragile to me. It's a lot larger surface to cover > when the root cause is hotunplug allowed to reenter anything at all > from IO path. This is especially true because hotunplug can trivially > be made fully asynchronous in most cases. In terms of destruction of > higher level objects, warm and hot unplugs can and should behave > identically. I don't see that there is a difference between a warm and hot unplug from a filesystem point of view - both result in the filesystem's backing device being deleted and freed, and in both cases we have to take the same action....
> > We need to either get rid of the flush on device failure/hot-unplug, > > or turn it into a callout for the superblock to take an appropriate > > action (e.g. shutting down the filesystem) rather than trying to > > issue IO. i.e. allow the filesystem to take appropriate action of > > shutting down the filesystem and invalidating its caches. > > There could be cases where some optimizations for hot unplug could be > useful. Maybe suppressing pointless duplicate warning messages or > whatnot but I'm highly doubtful anything will be actually fixed that > way. We'll be most likely making bugs just less reproducible. > > > Indeed, in XFS there's several other caches that could contain dirty > > metadata that isn't invalidated by invalidate_partition(), and so > > unless the filesystem is shut down it can continue to try to issue > > IO on those buffers to the removed device until the filesystem is > > shutdown or unmounted. > > Do you mean xfs never gives up after IO failures? There's this thing called a transient IO failure which we have to handle. e.g multipath taking several minutes to detect a path failure and fail over, whilst in the mean time IO errors are reported after a 30s timeout. So some types of async metadata write IO failures are simply rescheduled for a short time in the future. They'll either succeed, or continual failure will eventually trigger some kind of filesystem failure. If it's a synchronous write or a write that we cannot tolerate even transient errors on (e.g. journal writes), then we'll shut down the filesystem immediately. > > Seriously, Tejun, the assumption that invalidate_partition() knows > > enough about filesystems to safely "invalidate" them is just broken. > > These days, filesystems often reference multiple block devices, and > > so the way hotplug currently treats them as "one device, one > > filesystem" is also fundamentally wrong.
> > > > So there's many ways in which the hot-unplug code is broken in its > > use of invalidate_partition(), the least of which is the > > dependencies caused by re-entrancy. We really need a > > "sb->shutdown()" style callout as step one in the above process, not > > fsync_bdev(). > > If filesystems need an indication that the underlying device is no > longer functional, please go ahead and add it, but please keep in mind > all these are completely asynchronous. Nothing guarantees you that > such events would happen in any specific order. IOW, you can be at > *ANY* point in your warm unplug path and the device is hot unplugged, > which essentially forces all the code paths to be ready for the worst, > and that's exactly why there isn't much effort in trying to separate > out warm and hot unplug paths. I'm not concerned about the problems that might happen if you hot unplug during a warm unplug. All I care about is that when a device is invalidated, the filesystem on top of it can take appropriate action. Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: Writeback threads and freezable 2013-12-19 4:08 ` Dave Chinner @ 2013-12-19 16:24 ` Tejun Heo 2013-12-20 0:51 ` Dave Chinner 2013-12-20 14:00 ` Rafael J. Wysocki 0 siblings, 2 replies; 54+ messages in thread From: Tejun Heo @ 2013-12-19 16:24 UTC (permalink / raw) To: Dave Chinner Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu Yo, Dave. On Thu, Dec 19, 2013 at 03:08:21PM +1100, Dave Chinner wrote: > > If knowing that the underlying device has gone away somehow helps > > filesystem, maybe we can expose that interface and avoid flushing > > after hotunplug but that merely hides the possible deadlock scenario > > that you're concerned about. Nothing is really solved. > > Except that a user of the block device has been informed that it is > now gone and has been freed from under it. i.e. we can *immediately* > inform the user that their mounted filesystem is now stuffed and > suppress all the errors that are going to occur as a result of > sync_filesystem() triggering IO failures all over the place and then > having to react to that. Please note that there's no real "immediacy" in that it's inherently racy and that the extent of the usefulness of such notification can't reach much further than suppressing error messages. Even that benefit is kinda dubious. Don't we want to generate errors when a device is removed while dirty data / IOs are pending on it? I fail to see how "suppressing all the errors" would be a sane thing to do. Another thing is that I think it's actually healthier in terms of exercise of code paths to travel those error paths on hot unplugs which are relatively common than taking a different behavior on them. It'll inevitably lower our test coverage.
> Indeed, there is no guarantee that sync_filesystem will result in > the filesystem being shut down - if the filesystem is clean then > nothing will happen, and it won't be until the user modifies some > metadata that a shutdown will be triggered. That could be a long > time after the device has been removed.... I still fail to see why that is a problem. Filesystems should be able to handle hot unplug or IO failures at any point in a reasonable way, so what difference would having a notification make other than introducing yet another exception code path? > I don't see that there is a difference between a warm and hot unplug > from a filesystem point of view - both result in the filesystem's > backing device being deleted and freed, and in both cases we have to > take the same action.... Yeah, exactly, so what'd be the point of getting separate notification for hot unplug events? > > Do you mean xfs never gives up after IO failures? > > There's this thing called a transient IO failure which we have to > handle. e.g multipath taking several minutes to detect a path > failure and fail over, whilst in the mean time IO errors are > reported after a 30s timeout. So some types of async metadata write > IO failures are simply rescheduled for a short time in the future. > They'll either succeed, or continual failure will eventually trigger > some kind of filesystem failure. > > If it's a synchronous write or a write that we cannot tolerate even > transient errors on (e.g. journal writes), then we'll shut down the > filesystem immediately. Sure, filesystems should (be able to) react to different types of errors in different ways. We still have a long way to go to do that properly but that should be done through IO failures not some side channel one-off "hotunplug" happened call. Again, it doesn't solve anything. It just side steps one very specific case in a half-assed way.
> > If filesystems need an indication that the underlying device is no > > longer functional, please go ahead and add it, but please keep in mind > > all these are completely asynchronous. Nothing guarantees you that > > such events would happen in any specific order. IOW, you can be at > > *ANY* point in your warm unplug path and the device is hot unplugged, > > which essentially forces all the code paths to be ready for the worst, > > and that's exactly why there isn't much effort in trying to separate > > out warm and hot unplug paths. > > I'm not concerned about the problems that might happen if you hot > unplug during a warm unplug. All I care about is when a device is > invalidated the filesystem on top of it can take appropriate action. I can't follow your logic here. You started with a deadlock scenario where lower layer calls into upper layer while blocking its own operation, which apparently is a bug to be fixed in the lower layer as discussed above; otherwise, we'd be chasing the symptoms rather than plugging the source. Combined with the fact that you can't really prevent hotunplug happening during warmunplug (it doesn't even have to be hotunplug, there are other conditions which would match such IO failure pattern), this reduces the benefit of such notification to optimizations far from correctness issues. These are all logically connected; yet, you claim that you're not concerned about part of it and then continue to assert your original position. It doesn't compute. You proposed it as a fix for a deadlock issue, but it turns out your proposal can't fix the deadlock issue exactly because of the part you aren't concerned about and you continue to assert the original proposal. What's going on here? Thanks. -- tejun
* Re: Writeback threads and freezable 2013-12-19 16:24 ` Tejun Heo @ 2013-12-20 0:51 ` Dave Chinner 2013-12-20 14:51 ` Tejun Heo 2013-12-20 14:00 ` Rafael J. Wysocki 1 sibling, 1 reply; 54+ messages in thread From: Dave Chinner @ 2013-12-20 0:51 UTC (permalink / raw) To: Tejun Heo Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Thu, Dec 19, 2013 at 11:24:11AM -0500, Tejun Heo wrote: > Yo, Dave. > > On Thu, Dec 19, 2013 at 03:08:21PM +1100, Dave Chinner wrote: > > > If knowing that the underlying device has gone away somehow helps > > > filesystem, maybe we can expose that interface and avoid flushing > > > after hotunplug but that merely hides the possible deadlock scenario > > > that you're concerned about. Nothing is really solved. > > > > Except that a user of the block device has been informed that it is > > now gone and has been freed from under it. i.e. we can *immediately* > > inform the user that their mounted filesystem is now stuffed and > > suppress all the errors that are going to occur as a result of > > sync_filesystem() triggering IO failures all over the place and then > > having to react to that. > > Please note that there's no real "immediacy" in that it's inherently > racy and that the extent of the usefulness of such notification can't > reach much further than suppressing error messages. So says you. What about btrfs, which might use such a notification to start replicating metadata and data to other devices to maintain redundancy levels? Or even XFS, which might have a real-time device go away - that's not a fatal error for the filesystem, but we sure as hell want the user to know about it and be able to set a flag indicating that access to data in real-time files needs to be errored out at the .aio_write level rather than some time later when writeback fails with EIO and the application cannot capture the IO error.
Seriously, Tejun, you need to stop focussing on tearing apart micro-examples and consider the wider context of the issue that the examples are being used to demonstrate. The notifications being talked about here are already being delivered to userspace because there are OS management applications that need to know about devices going away. Filesystems are no different - there's all sorts of optimisations we can make if we get such a notification. Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: Writeback threads and freezable 2013-12-20 0:51 ` Dave Chinner @ 2013-12-20 14:51 ` Tejun Heo 0 siblings, 0 replies; 54+ messages in thread From: Tejun Heo @ 2013-12-20 14:51 UTC (permalink / raw) To: Dave Chinner Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu Hello, Dave. On Fri, Dec 20, 2013 at 11:51:14AM +1100, Dave Chinner wrote: > > Please note that there's no real "immediacy" in that it's inherently > > racy and that the extent of the usefulness of such notification can't > > reach much further than suppressing error messages. > > So says you. What about btrfs, which might use such a notification > to start replicating metadata and data to other devices to maintain > redundancy levels? Or even XFS, which might have a real-time device > go away - that's not a fatal error for the filesystem, but we sure > as hell want the user to know about it and be able to set a flag > indicating access to data in real-time files need to be errored out > at the .aio_write level rather than some time later when writeback > fails with EIO and the application cannot capture the IO error. Hmmm... but shouldn't such facilities also kick in on other fatal failure conditions? I have no fundamental objection against adding someway to distinguish / notify hotunplug events but am a bit worried that it has the potential to branch out the hotunplug path unnecessarily and it seems that such tendency has been shown amply in this thread too. 
As I've pointed out multiple times in the thread, the hotunplug path serves very well as the continual testbed for catastrophic failure cases for the whole stack from filesystems all the way down to low-level drivers and can be credited with a high ratio of the error handling improvements / bugfixes we made over the years, so even if we add something to distinguish hotunplug for upper layers, I think it should be something which at least doesn't encourage implementing a completely separate path and preferably something more generic. > Seriously, Tejun, you need to stop focussing on tearing apart > micro-examples and consider the wider context of the issue that the > examples are being used to demonstrate. The notifications being Dear Dave, would you please drop the attitude? You've been consistently wrong in this thread. Being aggressively assertive is okay. Being wrong is okay too. Continuing the combination of the two is not. That's actively harmful, especially when the one doing so commands respect in the community - it has a high chance of derailing the discussion and/or leading to the wrong conclusions. Should I reiterate the messed up confusions you showed at each turn of this very thread? Let's please stay technical. You apparently didn't have much idea how things work in the lower layers, which is completely fine. Just don't over-extend your assertiveness without substance. If you had just made your technical input from your side of the stack from the get-go, we could have been a lot more productive. I still think there's something to be extracted from this thread - how we should handle catastrophic failure scenarios and the shortcomings of the current shutdown procedure, both of which are far from perfect.
I'm not yet convinced by the idea of handling hot-unplug as something completely special for the reasons I've stated multiple times now and I'm curious about the specifics of what you have in mind because I don't really know what would be convenient from the filesystem side. > talked about here are already being delivered to userspace because > there are OS management applications that need to know about devices > going away. Filesystems are no different - there's all sorts of > optimisations we can make if we get such a notification. I'm not sure that analogy is too relevant. The notifications delivered to userland aren't different for hot or warm unplugs and they don't participate in any of the exception handling details, which is what I'm really worried about. Again, I'm not necessarily against the idea of flagging / notifying upper layers but please do keep in mind that hotunplug handling shouldn't deviate much from other parts of exception handling, which at least from the discussions we had in this thread doesn't seem to be what you had in mind. There are other exception cases which share many, if not most, characteristics making such approach a natural fit and hotunplug is one of the best tools we have in ensuring those paths are in a healthy shape. Thanks. -- tejun
* Re: Writeback threads and freezable 2013-12-19 16:24 ` Tejun Heo 2013-12-20 0:51 ` Dave Chinner @ 2013-12-20 14:00 ` Rafael J. Wysocki 1 sibling, 0 replies; 54+ messages in thread From: Rafael J. Wysocki @ 2013-12-20 14:00 UTC (permalink / raw) To: Tejun Heo Cc: Dave Chinner, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu On Thursday, December 19, 2013 11:24:11 AM Tejun Heo wrote: > Yo, Dave. > > On Thu, Dec 19, 2013 at 03:08:21PM +1100, Dave Chinner wrote: > > > If knowing that the underlying device has gone away somehow helps > > > filesystem, maybe we can expose that interface and avoid flushing > > > after hotunplug but that merely hides the possible deadlock scenario > > > that you're concerned about. Nothing is really solved. > > > > Except that a user of the block device has been informed that it is > > now gone and has been freed from under it. i.e. we can *immediately* > > inform the user that their mounted filesystem is now stuffed and > > suppress all the errors that are going to occur as a result of > > sync_filesystem() triggering IO failures all over the place and then > > having to react to that. > > Please note that there's no real "immediacy" in that it's inherently > racy and that the extent of the usefulness of such notification can't > reach much further than suppressing error messages. Even that benefit > is kinda dubious. Don't we want to generate errors when a device is > removed while dirty data / IOs are pending on it? I fail to see how > "suppressing all the errors" would be a sane thing to do. > > Another thing is that I think it's actually healthier in terms of > exercise of code paths to travel those error paths on hot unplugs > which are relatively common than taking a different behavior on them. > It'll inevitably lower our test coverage.
> > Indeed, there is no guarantee that sync_filesystem will result in > > the filesystem being shut down - if the filesystem is clean then > > nothing will happen, and it won't be until the user modifies some > > metadata that a shutdown will be triggered. That could be a long > > time after the device has been removed.... > > I still fail to see why that is a problem. Filesystems should be > able to handle hot unplug or IO failures at any point in a reasonable > way, so what difference would having a notification make other than > introducing yet another exception code path? > > > I don't see that there is a difference between a warm and hot unplug > > from a filesystem point of view - both result in the filesystem's > > backing device being deleted and freed, and in both cases we have to > > take the same action.... > > Yeah, exactly, so what'd be the point of getting separate notification > for hot unplug events? > > > > Do you mean xfs never gives up after IO failures? > > > > There's this thing called a transient IO failure which we have to > > handle. e.g multipath taking several minutes to detect a path > > failure and fail over, whilst in the mean time IO errors are > > reported after a 30s timeout. So some types of async metadata write > > IO failures are simply rescheduled for a short time in the future. > > They'll either succeed, or continual failure will eventually trigger > > some kind of filesystem failure. > > > > If it's a synchronous write or a write that we cannot tolerate even > > transient errors on (e.g. journal writes), then we'll shut down the > > filesystem immediately. > > Sure, filesystems should (be able to) react to different types of > errors in different ways. We still have a long way to go to do that > properly but that should be done through IO failures not some side > channel one-off "hotunplug" happened call. Again, it doesn't solve > anything. It just side steps one very specific case in a half-assed > way.
Another problem is to distinguish "hotunplug" from "power failure", for example, because it may not be entirely clear what happened to start with until well after we get a "these devices are gone" notification from the platform firmware. So even if something may be regarded as a "hotunplug" from the order of the Universe perspective, it still may look like a regular IO error at the device level until device_del() is called for that device (which may not happen for quite a while). [That said, there are situations in which we may want to wait until it is "safe" to eject stuff, or even we may want to allow subsystems to fail "offline" requests, like in the memory eject case. In those cases the hot-removal actually consists of two parts, an offline and a removal, where devices are supposed to be not in use after offline (which may fail). But that is a different story. :-)] Thanks, Rafael
end of thread, other threads:[~2013-12-20 14:51 UTC | newest] Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-12-13 17:49 Writeback threads and freezable Tejun Heo 2013-12-13 18:52 ` Tejun Heo 2013-12-13 20:40 ` [PATCH] libata, freezer: avoid block device removal while system is frozen Tejun Heo 2013-12-13 22:45 ` Nigel Cunningham 2013-12-13 23:07 ` Tejun Heo 2013-12-13 23:15 ` Nigel Cunningham 2013-12-14 1:55 ` Dave Chinner 2013-12-14 20:31 ` Tejun Heo 2013-12-14 20:36 ` Tejun Heo 2013-12-14 21:21 ` Nigel Cunningham 2013-12-17 2:35 ` Rafael J. Wysocki 2013-12-17 2:34 ` Rafael J. Wysocki 2013-12-17 12:34 ` Tejun Heo 2013-12-18 0:35 ` Rafael J. Wysocki 2013-12-18 11:17 ` Tejun Heo 2013-12-18 21:48 ` Rafael J. Wysocki 2013-12-18 21:39 ` Tejun Heo 2013-12-18 21:41 ` Tejun Heo 2013-12-18 22:04 ` Rafael J. Wysocki 2013-12-19 23:35 ` [PATCH wq/for-3.14 1/2] workqueue: update max_active clamping rules Tejun Heo 2013-12-20 1:26 ` Rafael J. Wysocki 2013-12-19 23:37 ` [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active() Tejun Heo 2013-12-20 1:31 ` Rafael J. Wysocki 2013-12-20 13:32 ` Tejun Heo 2013-12-20 13:56 ` Rafael J. Wysocki 2013-12-20 14:23 ` Tejun Heo 2013-12-16 12:12 ` [PATCH] libata, freezer: avoid block device removal while system is frozen Ming Lei 2013-12-16 12:45 ` Tejun Heo 2013-12-16 13:24 ` Ming Lei 2013-12-16 16:05 ` Tejun Heo 2013-12-17 2:38 ` Rafael J. Wysocki 2013-12-17 12:36 ` Tejun Heo 2013-12-18 0:23 ` Rafael J. Wysocki 2013-12-17 12:50 ` [PATCH v2] " Tejun Heo 2013-12-18 1:04 ` Rafael J. Wysocki 2013-12-18 11:08 ` Tejun Heo 2013-12-18 12:07 ` [PATCH v3] " Tejun Heo 2013-12-18 22:08 ` Rafael J. 
Wysocki 2013-12-19 17:24 ` Tejun Heo 2013-12-19 18:54 ` [PATCH v4] " Tejun Heo 2013-12-14 1:53 ` Writeback threads and freezable Dave Chinner 2013-12-14 17:30 ` Greg Kroah-Hartman 2013-12-14 20:23 ` Tejun Heo 2013-12-16 3:56 ` Dave Chinner 2013-12-16 12:51 ` Tejun Heo 2013-12-16 12:56 ` Tejun Heo 2013-12-18 0:35 ` Dave Chinner 2013-12-18 11:43 ` Tejun Heo 2013-12-18 22:14 ` Rafael J. Wysocki 2013-12-19 4:08 ` Dave Chinner 2013-12-19 16:24 ` Tejun Heo 2013-12-20 0:51 ` Dave Chinner 2013-12-20 14:51 ` Tejun Heo 2013-12-20 14:00 ` Rafael J. Wysocki