* Writeback threads and freezable
@ 2013-12-13 17:49 Tejun Heo
  2013-12-13 18:52 ` Tejun Heo
  2013-12-14  1:53 ` Writeback threads and freezable Dave Chinner
  0 siblings, 2 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-13 17:49 UTC (permalink / raw)
  To: Rafael J. Wysocki, Jens Axboe
  Cc: tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov,
	Greg Kroah-Hartman, Fengguang Wu

Hello, guys.

This was discovered while investigating bug 62801 - "EliteBook hangs at
dock, suspend, undock, resume".

 https://bugzilla.kernel.org/show_bug.cgi?id=62801

The laptop locks up during resume if undocked while suspended.  The
dock contains a hard drive and the drive removal path and resume path
get stuck.  This got bisected to 839a8e8660b6 ("writeback: replace
custom worker pool implementation with unbound workqueue") by the
reporter.  The problem can be reproduced by simply removing a mounted
hard drive while the machine is suspended.  I first thought it was some
dumb mistake but it turns out to be something fundamental.

So, here's the lock up.

* Resume starts and libata resume is kicked off.

* libata EH determines that the device is gone.  Eventually it invokes
  device_del() through a work item.

* device_del() tries to delete the block device, which invokes
  writeback_inodes_sb() on the mounted filesystem, which in turn
  schedules and flushes a bdi work item.  Note that at this point, the
  kworker is holding multiple driver layer locks.

Now, if the rest of dpm resume races with the above and gets stuck
behind one of the locks libata device removal path is holding, it
deadlocks because thaw_workqueues() is called only after dpm resume is
complete.  libata device removal waits for the scheduled writeback
work item to finish which in turn is waiting for the workqueue to be
thawed, and dpm resume can't make it to thaw_workqueues() because it's
stuck behind one of the mutexes held by the libata device removal
path.

I think the issue already existed before 839a8e8660b6 ("writeback:
replace custom worker pool implementation with unbound workqueue"),
although the patch seems to have exposed it a lot more.  The issue
probably was a lot less visible because, if the bdi had a flusher
spawned, kthread_stop() overrides the freezer, and, if not, the device
had been idle & clean for some time and was unlikely to require any
action on shutdown; but the forker thread was still freezable and I
think it should be possible to trigger some condition where forker
action is waited upon during the shutdown sequence.

The fundamental problem, really, is that device drivers make use of
kthreads and/or workqueues in their operation and freezable kthreads
and workqueues essentially behave as a giant lock.  If any
suspend/resume operation somehow ends up depending on a freezable
kthread or workqueue, which can happen through quite convoluted and
long dependency chain, there are possibilities for deadlock even if
those operations themselves aren't interlocked.  In this case, libata
resume finishes fine and device removal is run completely
asynchronously but the dependency chain is still completed through
driver layer locks.

I've always thought that using the freezer for kernel threads /
workqueues is something fundamentally undesirable, exactly because the
freezer acts as a giant lock and we don't have any meaningful debug
support around it.  Even in leaf drivers proper, where the dependency
isn't complicated, this has ample possibility to lead to surprise
lockups.

I'm unsure why bdi should be freezable.  It's not like corking things
at the bdi can reliably prevent the block layer from issuing further
commands down to drivers.  Maybe there are things which directly hook
into bdi and depend on it being frozen, I don't know, but it just seems
dangerous to inject a large locking dependency into the chain from such
a high level construct.

I'm not completely sure what to do.  We can make an exception for
bdi_wq and thaw it early but that breaks the guarantee that bdi is
frozen between device suspend and resume invocations and we might as
well just remove WQ_FREEZABLE from it.  I don't know.

Ideas?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Writeback threads and freezable
  2013-12-13 17:49 Writeback threads and freezable Tejun Heo
@ 2013-12-13 18:52 ` Tejun Heo
  2013-12-13 20:40   ` [PATCH] libata, freezer: avoid block device removal while system is frozen Tejun Heo
  2013-12-14  1:53 ` Writeback threads and freezable Dave Chinner
  1 sibling, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2013-12-13 18:52 UTC (permalink / raw)
  To: Rafael J. Wysocki, Jens Axboe
  Cc: tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov,
	Greg Kroah-Hartman, Fengguang Wu

On Fri, Dec 13, 2013 at 12:49:32PM -0500, Tejun Heo wrote:
> I'm not completely sure what to do.  We can make an exception for
> bdi_wq and thaw it early but that breaks the guarantee that bdi is
> frozen between device suspend and resume invocations and we might as
> well just remove WQ_FREEZABLE from it.  I don't know.

Ugh, so I removed WQ_FREEZABLE on bdi and the same deadlock now
happens through jbd2 which is a freezable kthread.  This whole kthread
freezer thing is rotten and fundamentally broken. :(

-- 
tejun

* [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-13 18:52 ` Tejun Heo
@ 2013-12-13 20:40   ` Tejun Heo
  2013-12-13 22:45     ` Nigel Cunningham
                       ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-13 20:40 UTC (permalink / raw)
  To: Rafael J. Wysocki, Jens Axboe
  Cc: tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov,
	Greg Kroah-Hartman, Fengguang Wu

Hello,

So, this is the laughable workaround that I came up with.  Seriously,
this is tragic.  :(

Thanks.

------- 8< -------
Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios.  This is the latest occurrence.

During resume, libata rescans all the ports and revalidates all
pre-existing devices.  If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks.  Unfortunately, this can race with the rest of device
resume.  Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.

839a8e8660b6 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too.  In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.

I believe the right thing to do is getting rid of freezable kthreads
and workqueues.  This is something fundamentally broken.  For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen.  Kernel engineering at its
finest.  :(

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
---
 drivers/ata/libata-scsi.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index ab58556..60a01d3 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3872,6 +3872,25 @@ void ata_scsi_hotplug(struct work_struct *work)
 		return;
 	}
 
+	/*
+	 * XXX - UGLY HACK
+	 *
+	 * The core suspend/resume path is fundamentally broken due to
+	 * freezable kthreads and workqueue and may deadlock if a block
+	 * device gets removed while resume is in progress.  I don't know
+	 * what the solution is short of removing freezable kthreads and
+	 * workqueues altogether.
+	 *
+	 * The following is an ugly hack to avoid kicking off device
+	 * removal while freezer is active.  This is a joke but does avoid
+	 * this particular deadlock scenario.
+	 *
+	 * https://bugzilla.kernel.org/show_bug.cgi?id=62801
+	 * http://marc.info/?l=linux-kernel&m=138695698516487
+	 */
+	while (pm_freezing)
+		msleep(100);
+
 	DPRINTK("ENTER\n");
 	mutex_lock(&ap->scsi_scan_mutex);
 

* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-13 20:40   ` [PATCH] libata, freezer: avoid block device removal while system is frozen Tejun Heo
@ 2013-12-13 22:45     ` Nigel Cunningham
  2013-12-13 23:07       ` Tejun Heo
  2013-12-17  2:38     ` Rafael J. Wysocki
  2013-12-17 12:50     ` [PATCH v2] " Tejun Heo
  2 siblings, 1 reply; 54+ messages in thread
From: Nigel Cunningham @ 2013-12-13 22:45 UTC (permalink / raw)
  To: Tejun Heo, Rafael J. Wysocki, Jens Axboe
  Cc: tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov,
	Greg Kroah-Hartman, Fengguang Wu

Hi Tejun.

Thanks for your work on this.

In your first email, in the first substantial paragraph (starting "Now, 
if the rest.."), you say "libata device removal waits for the scheduled 
writeback work item to finish". I wonder if that's the lynchpin. If we 
know the device is gone, why are we trying to write to it?

All pending I/O should have been flushed when suspend/hibernate started, 
and there's no point in trying to update metadata on a device we can't 
access, so there should be no writeback needed (and anything that does 
somehow get there should just be discarded since it will never succeed 
anyway).

Or am I missing something?

Having said the above, I agree that we shouldn't need to freeze kernel 
threads and workqueues themselves. I think we should be giving the 
producers of I/O the nous needed to avoid producing I/O during 
suspend/hibernate. But perhaps I'm missing something here, too.

Regards,

Nigel

* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-13 22:45     ` Nigel Cunningham
@ 2013-12-13 23:07       ` Tejun Heo
  2013-12-13 23:15         ` Nigel Cunningham
  2013-12-16 12:12         ` [PATCH] libata, freezer: avoid block device removal while system is frozen Ming Lei
  0 siblings, 2 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-13 23:07 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hello, Nigel.

On Sat, Dec 14, 2013 at 09:45:59AM +1100, Nigel Cunningham wrote:
> In your first email, in the first substantial paragraph (starting
> "Now, if the rest.."), you say "libata device removal waits for the
> scheduled writeback work item to finish". I wonder if that's the
> lynchpin. If we know the device is gone, why are we trying to write
> to it?

It's just a standard part of block device removal -
invalidate_partition(), bdi_wb_shutdown().

> All pending I/O should have been flushed when suspend/hibernate
> started, and there's no point in trying to update metadata on a

Frozen or not, it isn't guaranteed that the bdi wb queue is empty when
the system goes to suspend.  It's likely to be empty but there's no
guarantee.  Conversion to workqueue only makes the behavior more
deterministic.

> device we can't access, so there should be no writeback needed (and
> anything that does somehow get there should just be discarded since
> it will never succeed anyway).

Even if they'll never succeed, they still need to be issued and
drained; otherwise, we'll end up with leaked items and hung issuers.

> Having said the above, I agree that we shouldn't need to freeze
> kernel threads and workqueues themselves. I think we should be
> giving the producers of I/O the nous needed to avoid producing I/O
> during suspend/hibernate. But perhaps I'm missing something here,
> too.

I never understood that part.  Why do we need to control the
producers?  The chain between the producer and consumer is a long one
and no matter what we do with the producers, the consumers need to be
plugged all the same.  Why bother with the producers at all?  I think
that's where all this freezable kthread business started but I don't
understand what the benefit of that is.  Not only that, the freezer is
awfully inadequate in its role too.  There is a flurry of activities
which happen in the IO path without any thread involved, and many of
them can lead to issuance of new IO, so the only thing freezer is
achieving is making existing bugs less visible, which is a bad thing
especially for suspend/resume as the failure mode often doesn't yield
to easy debugging.

I asked the same question years ago and ISTR getting only fairly vague
answers, but this whole freezable kthread business is predictably
proving to be a continuous source of problems.  Let's at least find out
whether we need it and, if so, why.  Not some "I feel better knowing
things are calmer" type vagueness but the actual technical necessity of
it.

Thanks.

-- 
tejun

* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-13 23:07       ` Tejun Heo
@ 2013-12-13 23:15         ` Nigel Cunningham
  2013-12-14  1:55           ` Dave Chinner
  2013-12-14 20:31           ` Tejun Heo
  2013-12-16 12:12         ` [PATCH] libata, freezer: avoid block device removal while system is frozen Ming Lei
  1 sibling, 2 replies; 54+ messages in thread
From: Nigel Cunningham @ 2013-12-13 23:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hi again.

On 14/12/13 10:07, Tejun Heo wrote:
> Hello, Nigel.
>
> On Sat, Dec 14, 2013 at 09:45:59AM +1100, Nigel Cunningham wrote:
>> In your first email, in the first substantial paragraph (starting
>> "Now, if the rest.."), you say "libata device removal waits for the
>> scheduled writeback work item to finish". I wonder if that's the
>> lynchpin. If we know the device is gone, why are we trying to write
>> to it?
> It's just a standard part of block device removal -
> invalidate_partition(), bdi_wb_shutdown().
Mmm. But perhaps there needs to be some special code in there to handle 
the "we can't write to this device anymore" case?
>
>> All pending I/O should have been flushed when suspend/hibernate
>> started, and there's no point in trying to update metadata on a
> Frozen or not, it isn't guaranteed that bdi wb queue is empty when the
> system went to suspend.  They're likely to be empty but there's no
> guarantee.  Conversion to workqueue only makes the behavior more
> deterministic.
>
>> device we can't access, so there should be no writeback needed (and
>> anything that does somehow get there should just be discarded since
>> it will never succeed anyway).
> Even if they'll never succeed, they still need to be issued and
> drained; otherwise, we'll end up with leaked items and hung issuers.
Yeah - I get that, but drained needs to work differently if the device 
doesn't exist?
>> Having said the above, I agree that we shouldn't need to freeze
>> kernel threads and workqueues themselves. I think we should be
>> giving the producers of I/O the nous needed to avoid producing I/O
>> during suspend/hibernate. But perhaps I'm missing something here,
>> too.
> I never understood that part.  Why do we need to control the
> producers?  The chain between the producer and consumer is a long one
> and no matter what we do with the producers, the consumers need to be
> plugged all the same.  Why bother with the producers at all?  I think
> that's where all this freezable kthread business started but I don't
> understand what the benefit of that is.  Not only that, the freezer is
> awfully inadequate in its role too.  There is a flurry of activities
> which happen in the IO path without any thread involved, and many of
> them can lead to issuance of new IO, so the only thing freezer is
> achieving is making existing bugs less visible, which is a bad thing
> especially for suspend/resume as the failure mode often doesn't yield
> to easy debugging.
>
> I asked the same question years ago and ISTR getting only fairly vague
> answers, but this whole freezable kthread business is predictably
> proving to be a continuous source of problems.  Let's at least find out
> whether we need it and, if so, why.  Not some "I feel better knowing
> things are calmer" type vagueness but the actual technical necessity of
> it.
>
My understanding is that the point is ensuring that - particularly in 
the case of hibernation - we don't cause filesystem corruption by 
writing one thing while writing the image and then doing something else 
(without knowledge of what happened while the image was being written) 
while reading the image or after restoring it.

Regards,

Nigel


* Re: Writeback threads and freezable
  2013-12-13 17:49 Writeback threads and freezable Tejun Heo
  2013-12-13 18:52 ` Tejun Heo
@ 2013-12-14  1:53 ` Dave Chinner
  2013-12-14 17:30   ` Greg Kroah-Hartman
  2013-12-14 20:23   ` Tejun Heo
  1 sibling, 2 replies; 54+ messages in thread
From: Dave Chinner @ 2013-12-14  1:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Fri, Dec 13, 2013 at 12:49:32PM -0500, Tejun Heo wrote:
> Hello, guys.
> 
> This is discovered while investigating bug 62801 - "EliteBoot hangs at
> dock, suspend, undock, resume".
> 
>  https://bugzilla.kernel.org/show_bug.cgi?id=62801
> 
> The laptop locks up during resume if undocked while suspended.  The
> dock contains a hard drive and the drive removal path and resume path
> get stuck.  This got bisected to 839a8e8660b6 ("writeback: replace
> custom worker pool implementation with unbound workqueue") by the
> reporter.  The problem can be reproduced by just removing mounted
> harddrive while a machine is suspended.  I first thought it was some
> dumb mistake but it turns out to be something fundamental.
> 
> So, here's the lock up.
> 
> * Resume starts and libata resume is kicked off.
> 
> * libata EH determines that the device is gone.  Eventually it invokes
>   device_del() through a work item.
> 
> * device_del() tries to delete the block device which invokes
>   writeback_inodes_sb() on the mounted filesystem which in turn
>   schedules and flushes bdi work item.  Note that at this point, the
>   kworker is holding multiple driver layer locks.

That's the fundamental problem here - device removal asks the device
to fsync the filesystem on top of the device that was just removed.
The simple way to trigger this is to pull a device from underneath
an active filesystem (e.g. user unplugs a USB device without first
unmounting it). There are many years worth of bug reports showing
that this attempt by the device removal code to sync the filesystem
leads to deadlocks.

It's simply not a valid thing to do - just how is the filesystem
supposed to sync to a non-existent device?

I've raised this each time a user reports it over the past few years
and have never been able to convince anyone to fix the filesystem
re-entrancy problem device removal causes.  Syncing the filesystem
will require taking locks that are held by IO in progress, and so can't
make progress until the IO is completed, but that can't happen
until the error handling completes the sync of the filesystem....

Preventing fs/io re-entrancy from contexts where we might be holding
locks is the same reason we have GFP_NOFS and GFP_NOIO for memory
allocation: re-entering a filesystem or IO subsystem whenever we are
holding locks or serialised context in the fs/io path can deadlock
the fs/io path.

IOWs, syncing the filesystem from the device delete code is just
plain wrong. That's what needs fixing - removing the cause of
re-entrancy, not the workqueue or writeback code...

> Ideas?

Fix the device delete error handling not to re-enter the
filesystem and IO path. It's just wrong.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-13 23:15         ` Nigel Cunningham
@ 2013-12-14  1:55           ` Dave Chinner
  2013-12-14 20:31           ` Tejun Heo
  1 sibling, 0 replies; 54+ messages in thread
From: Dave Chinner @ 2013-12-14  1:55 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Tejun Heo, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Sat, Dec 14, 2013 at 10:15:21AM +1100, Nigel Cunningham wrote:
> Hi again.
> 
> On 14/12/13 10:07, Tejun Heo wrote:
> >Hello, Nigel.
> >
> >On Sat, Dec 14, 2013 at 09:45:59AM +1100, Nigel Cunningham wrote:
> >>In your first email, in the first substantial paragraph (starting
> >>"Now, if the rest.."), you say "libata device removal waits for the
> >>scheduled writeback work item to finish". I wonder if that's the
> >>lynchpin. If we know the device is gone, why are we trying to write
> >>to it?
> >It's just a standard part of block device removal -
> >invalidate_partition(), bdi_wb_shutdown().
> Mmm. But perhaps there needs to be some special code in there to
> handle the "we can't write to this device anymore" case?

I've been asking for that "special code" for years, Nigel... :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Writeback threads and freezable
  2013-12-14  1:53 ` Writeback threads and freezable Dave Chinner
@ 2013-12-14 17:30   ` Greg Kroah-Hartman
  2013-12-14 20:23   ` Tejun Heo
  1 sibling, 0 replies; 54+ messages in thread
From: Greg Kroah-Hartman @ 2013-12-14 17:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tejun Heo, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Fengguang Wu

On Sat, Dec 14, 2013 at 12:53:43PM +1100, Dave Chinner wrote:
> > Ideas?
> 
> Fix the device delete error handling not to re-enter the
> filesystem and IO path. It's just wrong.

That sounds totally reasonable to me, what is preventing someone from
making this change?

thanks,

greg k-h

* Re: Writeback threads and freezable
  2013-12-14  1:53 ` Writeback threads and freezable Dave Chinner
  2013-12-14 17:30   ` Greg Kroah-Hartman
@ 2013-12-14 20:23   ` Tejun Heo
  2013-12-16  3:56     ` Dave Chinner
  1 sibling, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2013-12-14 20:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hello, Dave.

On Sat, Dec 14, 2013 at 12:53:43PM +1100, Dave Chinner wrote:
> That's the fundamental problem here - device removal asks the device
> to fsync the filesystem on top of the device that was just removed.
> The simple way to trigger this is to pull a device from underneath
> an active filesystem (e.g. user unplugs a USB device without first
> unmounting it). There are many years worth of bug reports showing
> that this attempt by the device removal code to sync the filesystem
> leads to deadlocks.

What are you suggesting?  Implementing separate warm and hot unplug
paths?  That makes no sense whatsoever.  Device hot unplug is just a
sub-operation of general device unplug, which should be able to succeed
whether the underlying device is failing IOs or not.

Another thing to consider is that you want to exercise your error
handling path as often as possible.  IOs could still fail even during
device warm unplugs.  If anything atop the driver can't handle that,
that's a bug and we want to fix it.  The fact that we have bug
reports going back years on the subject simply means that we've been
traditionally lousy with error handling throughout the layers - those
were the occasions where we fixed actual error handling bugs, hot
unplug or not.  Special-casing hot unplug would just make existing
error handling bugs more difficult to find.

> It's simply not a valid thing to do - just how is the filesystem
> supposed to sync to a non-existent device?

By issuing all IOs and then handling the failures appropriately,
exactly as it would handle *ANY* IO errors.  This actually is the
simplest way to drain the pipe - pushing things down all the way and
failing them at a single point.  What's the alternative?  Implement a
quick exit path at each layer?  That's just silly and error-prone.

> I've raised this each time a user reports it over the past few years
> and never been able to convince any to fix the filesystem
> re-entrancy problem device removal causes. Syncing the filesystem

Well, it isn't a good idea.

> will require taking locks that are held by IO in progress, and so can't
> make progress until the IO is completed, but that can't happen
> until the error handling completes the sync of the filesystem....

IOs in progress shouldn't ever be stalled by filesystem sync.  That's
a massive layering violation.  Where and how does that happen?

> Preventing fs/io re-entrancy from contexts where we might be holding
> locks is the same reason we have GFP_NOFS and GFP_NOIO for memory
> allocation: re-entering a filesystem or IO subsystem whenever we are
> holding locks or serialised context in the fs/io path can deadlock
> the fs/io path.
>
> IOWs, syncing the filesystem from the device delete code is just
> plain wrong. That's what needs fixing - removing the cause of
> re-entrancy, not the workqueue or writeback code...

No, the wrong thing there is having IO failures depending on
filesystem sync.  Nothing else.

> > Ideas?
> 
> Fix the device delete error handling not to re-enter the
> filesystem and IO path. It's just wrong.

Nope, bad idea.

Thanks.

-- 
tejun

* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-13 23:15         ` Nigel Cunningham
  2013-12-14  1:55           ` Dave Chinner
@ 2013-12-14 20:31           ` Tejun Heo
  2013-12-14 20:36             ` Tejun Heo
  2013-12-17  2:34             ` Rafael J. Wysocki
  1 sibling, 2 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-14 20:31 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hello, Nigel.

On Sat, Dec 14, 2013 at 10:15:21AM +1100, Nigel Cunningham wrote:
> My understanding is that the point is ensuring that - particularly
> in the case of hibernation - we don't cause filesystem corruption by
> writing one thing while writing the image and then doing something
> else (without knowledge of what happened while the image was being
> written) while reading the image or after restoring it.

So, all this is about hibernation?  Does that mean that it's safe to
unfreeze before invoking resume?  ie. we currently do,

	freeze
	suspend devs
	resume devs
	unfreeze

If we can just do,

	freeze
	suspend devs
	unfreeze
	resume devs

it should be a lot better and would remove all locking dependency
chains in the resume path.

Thanks.

-- 
tejun

* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-14 20:31           ` Tejun Heo
@ 2013-12-14 20:36             ` Tejun Heo
  2013-12-14 21:21               ` Nigel Cunningham
  2013-12-17  2:34             ` Rafael J. Wysocki
  1 sibling, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2013-12-14 20:36 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Sat, Dec 14, 2013 at 03:31:21PM -0500, Tejun Heo wrote:
> So, all this is about hibernation?  Does that mean that it's safe to
> unfreeze before invoking resume?  ie. we currently do,
> 
> 	freeze
> 	suspend devs
> 	resume devs
> 	unfreeze
> 
> If we can just do,
> 
> 	freeze
> 	suspend devs
> 	unfreeze
> 	resume devs

Ummm... even better, does that mean, in the long term, we can decouple
freezing and device pm operations?  If this is just about hibernation,
there's no reason to wrap device driver pm ops with freezer, right?

Thanks.

-- 
tejun

* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-14 20:36             ` Tejun Heo
@ 2013-12-14 21:21               ` Nigel Cunningham
  2013-12-17  2:35                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 54+ messages in thread
From: Nigel Cunningham @ 2013-12-14 21:21 UTC (permalink / raw)
  To: Tejun Heo, Rafael J. Wysocki
  Cc: Jens Axboe, tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov,
	Greg Kroah-Hartman, Fengguang Wu

Hi.

On 15/12/13 07:36, Tejun Heo wrote:
> On Sat, Dec 14, 2013 at 03:31:21PM -0500, Tejun Heo wrote:
>> So, all this is about hibernation?  Does that mean that it's safe to
>> unfreeze before invoking resume?  ie. we currently do,
>>
>> 	freeze
>> 	suspend devs
>> 	resume devs
>> 	unfreeze
>>
>> If we can just do,
>>
>> 	freeze
>> 	suspend devs
>> 	unfreeze
>> 	resume devs
> Ummm... even better, does that mean, in the long term, we can decouple
> freezing and device pm operations?  If this is just about hibernation,
> there's no reason to wrap device driver pm ops with freezer, right?
>
> Thanks.
>
Rafael has spent far more time than me thinking about the semantics 
here, so I'd like to defer to him. Rafael?...

Regards,

Nigel

* Re: Writeback threads and freezable
  2013-12-14 20:23   ` Tejun Heo
@ 2013-12-16  3:56     ` Dave Chinner
  2013-12-16 12:51       ` Tejun Heo
  0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2013-12-16  3:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Sat, Dec 14, 2013 at 03:23:24PM -0500, Tejun Heo wrote:
> Hello, Dave.
> 
> On Sat, Dec 14, 2013 at 12:53:43PM +1100, Dave Chinner wrote:
> > That's the fundamental problem here - device removal asks the device
> > to fsync the filesystem on top of the device that was just removed.
> > The simple way to trigger this is to pull a device from underneath
> > an active filesystem (e.g. user unplugs a USB device without first
> > unmounting it). There are many years worth of bug reports showing
> > that this attempt by the device removal code to sync the filesystem
> > leads to deadlocks.
> 
> What are you suggesting?  Implementing separate warm and hot unplug
> paths?  That makes no sense whatsoever.  Device hot unplug is just a
> sub operation of general device unplug which should be able to succeed
> whether the underlying device is failing IOs or not.

I don't care.  Trying to issue IO from an IO error handling path
where the device has just been removed is fundamentally broken.

> Another thing to consider is that you want to exercise your error
> handling path as often as possible.  IOs could still fail even during
> device warm unplugs.  If anything atop the driver can't handle that,
> that's a bug and we want to fix them.  The fact that we have bug
> reports going back years on the subject simply means that we've been
> traditionally lousy with error handling throughout the layers - those
> were the occasions where we fixed actual error handling bugs, hot
> unplug or not.  Special casing hot unplug would just make existing
> error handling bugs more difficult to find.

This has nothing to do with existing filesystem error handling paths.
We get plenty of testing of them without having to pull devices out
of machines. Why do you think we have stuff like dm-flakey or forced
shutdown ioctls for XFS?

Indeed, what I'm asking for is for a notification so that we can
*shut the filesystem down* straight away, rather than have to wait
for an IO error in a critical metadata structure to trigger the
shutdown for us.

> > It's simply not a valid thing to do - just how is the filesystem
> > supposed to sync to a non-existent device?
> 
> By issuing all IOs and then handling the failures appropriately.

Which bit of "filesystem might deadlock trying to issue IO" didn't
you understand?

> Exactly just as it would handle *ANY* io errors.  This actually is the
> simplest thing to do to drain the pipe - pushing things down all the
> way and failing them at a single point.  What's the alternative?  Implement
> a quick exit path at each layer?  That's just silly and error-prone.
> 
> > I've raised this each time a user reports it over the past few years
> > and never been able to convince anyone to fix the filesystem
> > re-entrancy problem device removal causes. Syncing the filesystem
> 
> Well, it isn't a good idea.
> 
> > will require taking locks that are held by IO in progress, and so can't
> > make progress until the IO is completed, but that can't happen
> > until the error handling completes the sync of the filesystem....
> 
> IOs in progress shouldn't ever be stalled by filesystem sync.  That's
> a massive layering violation.  Where and how does that happen?

IOs in progress get stalled by IO error handling.

Then IO error handling tries to issue more IO via fsync_bdev().

Then the filesystem tries to lock a metadata buffer that is still
under IO (i.e. in error handling) and stalls.

That's the problem here - IO error handling is re-entering the
filesystem inappropriately.

If you have to invalidate the device due to errors, fsync_bdev() is
not the function to call. That's for *flushing* the filesystem. A
shutdown is the only sane thing for an active filesystem to do when
the device underneath it has been invalidated.

Really, it gets worse - the block device is deleted and freed from
the system while the filesystem on top of it still has an open
reference to it. Why do you think we have all that nasty default BDI
crap in the writeback path, Tejun? It's because the disk hot-unplug
paths free the bdi the filesystem is using out from underneath it.

What the hell happened to the usual rule of "don't free objects until
the last user goes away"? What makes the error handling paths of the
layers below the filesystem not subject to common sense?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-13 23:07       ` Tejun Heo
  2013-12-13 23:15         ` Nigel Cunningham
@ 2013-12-16 12:12         ` Ming Lei
  2013-12-16 12:45           ` Tejun Heo
  1 sibling, 1 reply; 54+ messages in thread
From: Ming Lei @ 2013-12-16 12:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, Linux Kernel Mailing List, Oleg Nesterov,
	Greg Kroah-Hartman, Fengguang Wu

On Sat, Dec 14, 2013 at 7:07 AM, Tejun Heo <tj@kernel.org> wrote:
>
>> All pending I/O should have been flushed when suspend/hibernate
>> started, and there's no point in trying to update metadata on a
>
> Frozen or not, it isn't guaranteed that the bdi wb queue is empty when the
> system goes to suspend.  It's likely to be empty but there's no
> guarantee.  Conversion to workqueue only makes the behavior more
> deterministic.

Looks like we should guarantee that the I/O queue is emptied before
system sleep; otherwise it may be a bug, since the battery may drain
and cause data loss, or block devices may be disconnected during sleep.

Thanks,
-- 
Ming Lei


* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-16 12:12         ` [PATCH] libata, freezer: avoid block device removal while system is frozen Ming Lei
@ 2013-12-16 12:45           ` Tejun Heo
  2013-12-16 13:24             ` Ming Lei
  0 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2013-12-16 12:45 UTC (permalink / raw)
  To: Ming Lei
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, Linux Kernel Mailing List, Oleg Nesterov,
	Greg Kroah-Hartman, Fengguang Wu

Hello,

On Mon, Dec 16, 2013 at 08:12:07PM +0800, Ming Lei wrote:
> Looks like we should guarantee that the I/O queue is emptied before
> system sleep; otherwise it may be a bug, since the battery may drain
> and cause data loss, or block devices may be disconnected during sleep.

It's a bit beside the point as that's what journaling is for.  Power
loss during suspend isn't all that different from power loss while
suspended.  Also, the filesystems are already synced before entering
suspend, so it shouldn't be an issue.  However, that doesn't guarantee
there aren't any pending bdi work items.  We *hope* to control the
sources with freezer but there's no guarantee about the coverage and
no way to verify that either.

Thanks.

-- 
tejun


* Re: Writeback threads and freezable
  2013-12-16  3:56     ` Dave Chinner
@ 2013-12-16 12:51       ` Tejun Heo
  2013-12-16 12:56         ` Tejun Heo
  0 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2013-12-16 12:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Mon, Dec 16, 2013 at 02:56:52PM +1100, Dave Chinner wrote:
> > What are you suggesting?  Implementing separate warm and hot unplug
> > paths?  That makes no sense whatsoever.  Device hot unplug is just a
> > sub operation of general device unplug which should be able to succeed
> > whether the underlying device is failing IOs or not.
> 
> I don't care. Trying to issue IO from an IO error handling path
> where the device has just been removed is fundamentally broken.

What?  Have you even read the original message?  IO error handling
path isn't issuing the IO here.  The hot unplug operation is
completely asynchronous to the IO path.  What's deadlocking is not
the filesystem and IO path but device driver layer and hot unplug
path.  IOs are not stalled.

> Indeed, what I'm asking for is for a notification so that we can
> *shut the filesystem down* straight away, rather than have to wait
> for an IO error in a critical metadata structure to trigger the
> shutdown for us.

which has exactly zero to do with the issue at hand.  Would you please
stop derailing the discussion?  Again, IOs are not stalled.  That's
not the issue here.

> > > It's simply not a valid thing to do - just how is the filesystem
> > > supposed to sync to a non-existent device?
> > 
> > By issuing all IOs and then handling the failures appropriately.
> 
> Which bit of "filesystem might deadlock trying to issue IO" didn't
> you understand?

smh, I give up.

-- 
tejun


* Re: Writeback threads and freezable
  2013-12-16 12:51       ` Tejun Heo
@ 2013-12-16 12:56         ` Tejun Heo
  2013-12-18  0:35           ` Dave Chinner
  0 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2013-12-16 12:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Mon, Dec 16, 2013 at 07:51:24AM -0500, Tejun Heo wrote:
> On Mon, Dec 16, 2013 at 02:56:52PM +1100, Dave Chinner wrote:
> > > What are you suggesting?  Implementing separate warm and hot unplug
> > > paths?  That makes no sense whatsoever.  Device hot unplug is just a
> > > sub operation of general device unplug which should be able to succeed
> > > whether the underlying device is failing IOs or not.
> > 
> > I don't care. Trying to issue IO from an IO error handling path
> > where the device has just been removed is fundamentally broken.
> 
> What?  Have you even read the original message?  IO error handling
> path isn't issuing the IO here.  The hot unplug operation is
> completely asynchronous to the IO path.  What's deadlocking is not
> the filesystem and IO path but device driver layer and hot unplug
> path.  IOs are not stalled.

In fact, this deadlock can be reproduced without hotunplug at all.  If
you initiate warm unplug and warm unplugging races with suspend/resume
cycle, it'll behave exactly the same - the IOs from flushing in that
scenario would succeed but IOs failing or not has *NOTHING* to do with
this deadlock.  It'd still happen.  It's the freezer behaving as a giant
lock and other locks of course getting dragged into a giant dependency
loop.  Can you please at least *try* to understand what's going on
before throwing strong assertions?

-- 
tejun


* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-16 12:45           ` Tejun Heo
@ 2013-12-16 13:24             ` Ming Lei
  2013-12-16 16:05               ` Tejun Heo
  0 siblings, 1 reply; 54+ messages in thread
From: Ming Lei @ 2013-12-16 13:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, Linux Kernel Mailing List, Oleg Nesterov,
	Greg Kroah-Hartman, Fengguang Wu

On Mon, Dec 16, 2013 at 8:45 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Mon, Dec 16, 2013 at 08:12:07PM +0800, Ming Lei wrote:
>> Looks like we should guarantee that the I/O queue is emptied before
>> system sleep; otherwise it may be a bug, since the battery may drain
>> and cause data loss, or block devices may be disconnected during sleep.
>
> It's a bit beside the point as that's what journaling is for.  Power
> loss during suspend isn't all that different from power loss while
> suspended.  Also, the filesystems are already synced before entering
> suspend, so it shouldn't be an issue.  However, that doesn't guarantee
> there aren't any pending bdi work items.  We *hope* to control the

You mean there are still some write I/O scheduled after processes are
frozen? by unfreezable kernel threads?

> sources with freezer but there's no guarantee about the coverage and
> no way to verify that either.

Thanks,
-- 
Ming Lei


* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-16 13:24             ` Ming Lei
@ 2013-12-16 16:05               ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-16 16:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, Linux Kernel Mailing List, Oleg Nesterov,
	Greg Kroah-Hartman, Fengguang Wu

Hello, Ming.

On Mon, Dec 16, 2013 at 09:24:40PM +0800, Ming Lei wrote:
> You mean there are still some write I/O scheduled after processes are
> frozen? by unfreezable kernel threads?

By unfreezable kernel threads, timer, bh, freezable kernel threads,
whatever really.  Please note that the freezer doesn't enforce any order
in how the target tasks are frozen.  There's no mechanism to guarantee
that the flusher is frozen after all other freezable tasks are frozen.
There isn't even a mechanism which guarantees the flusher flushes all its
queues before getting frozen.  There's no interlocking whatsoever.
The only thing which happens is that we flush all filesystems before
freezing the kernel threads, so the queues are *likely* to be empty.

I think trying to control all IO sources using the freezer is a
fundamentally flawed idea.  There are N sources which are difficult to
track down reliably while there is *single* queue all those have to go
through, which needs to be quiesced anyway.  The only thing which
makes sense is controlling the queue.

Maybe it really is necessary for hibernation.  If so, let's please
make it something tailored for that purpose - quiesce only the ones
which are actually relevant and in the places where it's necessary.
Not this "we stopped most of the system, whatever that means, and it
feels good" thing which doesn't really solve anything while
introducing this fuzzy wishful idea of mostly stopped system and giant
lock semantics all over the place.

Thanks.

-- 
tejun


* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-14 20:31           ` Tejun Heo
  2013-12-14 20:36             ` Tejun Heo
@ 2013-12-17  2:34             ` Rafael J. Wysocki
  2013-12-17 12:34               ` Tejun Heo
  1 sibling, 1 reply; 54+ messages in thread
From: Rafael J. Wysocki @ 2013-12-17  2:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman,
	Fengguang Wu

On Saturday, December 14, 2013 03:31:21 PM Tejun Heo wrote:
> Hello, Nigel.
> 
> On Sat, Dec 14, 2013 at 10:15:21AM +1100, Nigel Cunningham wrote:
> > My understanding is that the point is ensuring that - particularly
> > in the case of hibernation - we don't cause filesystem corruption by
> > writing one thing while writing the image and then doing something
> > else (without knowledge of what happened while the image was being
> > written) while reading the image or after restoring it.
> 
> So, all this is about hibernation?

No, it isn't.  [I guess it was originally, but it has not been the case
for a very long time.]  It is about preventing user space interactions (all of
the sysfs/ioctl/mmap/read/write/you-name-it thingies user space can do to
devices) while we're calling device suspend/resume routines.  The reason is
that otherwise all of them would have had to do a "oh, are we suspending by
the way?" check pretty much on every code path that can be triggered by
user space.

And I'd appreciate it if people didn't repeat the nonsense about possible
filesystem corruption.  The freezing of tasks doesn't prevent that from
happening (which is easy to demonstrate by artificially failing restore from
hibernation multiple times in a row).

> Does that mean that it's safe to
> unfreeze before invoking resume?

No, it isn't.

Thanks,
Rafael



* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-14 21:21               ` Nigel Cunningham
@ 2013-12-17  2:35                 ` Rafael J. Wysocki
  0 siblings, 0 replies; 54+ messages in thread
From: Rafael J. Wysocki @ 2013-12-17  2:35 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Tejun Heo, Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Sunday, December 15, 2013 08:21:36 AM Nigel Cunningham wrote:
> Hi.
> 
> On 15/12/13 07:36, Tejun Heo wrote:
> > On Sat, Dec 14, 2013 at 03:31:21PM -0500, Tejun Heo wrote:
> >> So, all this is about hibernation?  Does that mean that it's safe to
> >> unfreeze before invoking resume?  ie. we currently do,
> >>
> >> 	freeze
> >> 	suspend devs
> >> 	resume devs
> >> 	unfreeze
> >>
> >> If we can just do,
> >>
> >> 	freeze
> >> 	suspend devs
> >> 	unfreeze
> >> 	resume devs
> > Ummm... even better, does that mean, in the long term, we can decouple
> > freezing and device pm operations?  If this is just about hibernation,
> > there's no reason to wrap device driver pm ops with freezer, right?
> >
> > Thanks.
> >
> Rafael has spent far more time than me thinking about the semantics 
> here, so I'd like to defer to him. Rafael?...

Please see my reply to Tejun.

Thanks,
Rafael



* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-13 20:40   ` [PATCH] libata, freezer: avoid block device removal while system is frozen Tejun Heo
  2013-12-13 22:45     ` Nigel Cunningham
@ 2013-12-17  2:38     ` Rafael J. Wysocki
  2013-12-17 12:36       ` Tejun Heo
  2013-12-17 12:50     ` [PATCH v2] " Tejun Heo
  2 siblings, 1 reply; 54+ messages in thread
From: Rafael J. Wysocki @ 2013-12-17  2:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Friday, December 13, 2013 03:40:34 PM Tejun Heo wrote:
> Hello,
> 
> So, this is the laughable workaround that I came up with.  Seriously,
> this is tragic.  :(
> 
> Thanks.
> 
> ------- 8< -------
> Freezable kthreads and workqueues are fundamentally problematic in
> that they effectively introduce a big kernel lock widely used in the
> kernel and have already been the culprit of several deadlock
> scenarios.  This is the latest occurrence.

OK, so I'm too tired now to go through all that, but I'll look at it tomorrow.

The rule of thumb is to get rid of freezable kernel threads in the
first place, if at all possible, when they are causing problems.

Thanks!

> During resume, libata rescans all the ports and revalidates all
> pre-existing devices.  If it determines that a device has gone
> missing, the device is removed from the system which involves
> invalidating block device and flushing bdi while holding driver core
> layer locks.  Unfortunately, this can race with the rest of device
> resume.  Because freezable kthreads and workqueues are thawed after
> device resume is complete and block device removal depends on
> freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
> progress, this can lead to deadlock - block device removal can't
> proceed because kthreads are frozen and kthreads can't be thawed
> because device resume is blocked behind block device removal.
> 
> 839a8e8660b6 ("writeback: replace custom worker pool implementation
> with unbound workqueue") made this particular deadlock scenario more
> visible but the underlying problem has always been there - the
> original forker task and jbd2 are freezable too.  In fact, this is
> highly likely just one of many possible deadlock scenarios given that
> freezer behaves as a big kernel lock and we don't have any debug
> mechanism around it.
> 
> I believe the right thing to do is getting rid of freezable kthreads
> and workqueues.  This is something fundamentally broken.  For now,
> implement a funny workaround in libata - just avoid doing block device
> hot[un]plug while the system is frozen.  Kernel engineering at its
> finest.  :(
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
> Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: stable@vger.kernel.org
> ---
>  drivers/ata/libata-scsi.c |   19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index ab58556..60a01d3 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -3872,6 +3872,25 @@ void ata_scsi_hotplug(struct work_struct *work)
>  		return;
>  	}
>  
> +	/*
> +	 * XXX - UGLY HACK
> +	 *
> +	 * The core suspend/resume path is fundamentally broken due to
> +	 * freezable kthreads and workqueue and may deadlock if a block
> +	 * device gets removed while resume is in progress.  I don't know
> +	 * what the solution is short of removing freezable kthreads and
> +	 * workqueues altogether.
> +	 *
> +	 * The following is an ugly hack to avoid kicking off device
> +	 * removal while freezer is active.  This is a joke but does avoid
> +	 * this particular deadlock scenario.
> +	 *
> +	 * https://bugzilla.kernel.org/show_bug.cgi?id=62801
> +	 * http://marc.info/?l=linux-kernel&m=138695698516487
> +	 */
> +	while (pm_freezing)
> +		msleep(100);
> +
>  	DPRINTK("ENTER\n");
>  	mutex_lock(&ap->scsi_scan_mutex);
>  
> --
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-17  2:34             ` Rafael J. Wysocki
@ 2013-12-17 12:34               ` Tejun Heo
  2013-12-18  0:35                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2013-12-17 12:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman,
	Fengguang Wu

Hello, Rafael.

On Tue, Dec 17, 2013 at 03:34:00AM +0100, Rafael J. Wysocki wrote:
> No, it isn't.  [I guess it was originally, but it has not been the case
> for a very long time.]  It is about preventing user space interactions (all of

Heh... no wonder people are all so confused about this thing.

> the sysfs/ioctl/mmap/read/write/you-name-it thingies user space can do to
> devices) while we're calling device suspend/resume routines.  The reason is
> that otherwise all of them would have had to do a "oh, are we suspending by
> the way?" check pretty much on every code path that can be triggered by
> user space.

Freezing userland is fine.  I have no problem with that but up until
now the only use case that seems fundamentally valid to me is freezing
an IO processing kthread in a driver as a cheap way to implement
suspend/resume.  At this point, given the general level of confusion,
it seems to be costing more than benefiting.

> > Does that mean that it's safe to unfreeze before invoking resume?
> 
> No, it isn't.

So, are you saying it's really about giving device drivers easy way to
implement suspend/resume?  If that's the case, let's please make it
*way* more specific and clear - ie. things like helpers to implement
suspend/resume hooks trivially or whatnot.  Freezable kthreads (and
now workqueues) have been becoming a giant mess for a while now.

Thanks.

-- 
tejun


* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-17  2:38     ` Rafael J. Wysocki
@ 2013-12-17 12:36       ` Tejun Heo
  2013-12-18  0:23         ` Rafael J. Wysocki
  0 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2013-12-17 12:36 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hello, Rafael.

On Tue, Dec 17, 2013 at 03:38:05AM +0100, Rafael J. Wysocki wrote:
> On Friday, December 13, 2013 03:40:34 PM Tejun Heo wrote:
> > Hello,
> > 
> > So, this is the laughable workaround that I came up with.  Seriously,
> > this is tragic.  :(
> > 
> > Thanks.
> > 
> > ------- 8< -------
> > Freezable kthreads and workqueues are fundamentally problematic in
> > that they effectively introduce a big kernel lock widely used in the
> > kernel and have already been the culprit of several deadlock
> > scenarios.  This is the latest occurrence.
> 
> OK, so I'm too tired now to go through all that, but I'll look at it tomorrow.
> 
> The rule of thumb is to get rid of freezable kernel threads in the
> first place, if at all possible, when they are causing problems.

Yes, that'd be awesome.  In fact, getting rid of all kernel freezables
in non-low-level-drivers would be a great step forward.  That said, we
need something easily backportable, so I think we probably need this
bandaid as an immediate fix for now.

Thanks.

-- 
tejun


* [PATCH v2] libata, freezer: avoid block device removal while system is frozen
  2013-12-13 20:40   ` [PATCH] libata, freezer: avoid block device removal while system is frozen Tejun Heo
  2013-12-13 22:45     ` Nigel Cunningham
  2013-12-17  2:38     ` Rafael J. Wysocki
@ 2013-12-17 12:50     ` Tejun Heo
  2013-12-18  1:04       ` Rafael J. Wysocki
  2013-12-18 12:07       ` [PATCH v3] " Tejun Heo
  2 siblings, 2 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-17 12:50 UTC (permalink / raw)
  To: Rafael J. Wysocki, Jens Axboe
  Cc: tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov,
	Greg Kroah-Hartman, Fengguang Wu

Hello,

Rafael, if you're okay with the workaround, I'll route it through
libata/for-3.13-fixes.

Thanks.
------- 8< -------
Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios.  This is the latest occurrence.

During resume, libata rescans all the ports and revalidates all
pre-existing devices.  If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks.  Unfortunately, this can race with the rest of device
resume.  Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.

839a8e8660b6 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too.  In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.

I believe the right thing to do is getting rid of freezable kthreads
and workqueues.  This is something fundamentally broken.  For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen.  Kernel engineering at its
finest.  :(

v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
    as a module.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
---
 drivers/ata/libata-scsi.c |   19 +++++++++++++++++++
 kernel/freezer.c          |    6 ++++++
 2 files changed, 25 insertions(+)

--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3871,6 +3871,25 @@ void ata_scsi_hotplug(struct work_struct
 		return;
 	}
 
+	/*
+	 * XXX - UGLY HACK
+	 *
+	 * The core suspend/resume path is fundamentally broken due to
+	 * freezable kthreads and workqueue and may deadlock if a block
+	 * device gets removed while resume is in progress.  I don't know
+	 * what the solution is short of removing freezable kthreads and
+	 * workqueues altogether.
+	 *
+	 * The following is an ugly hack to avoid kicking off device
+	 * removal while freezer is active.  This is a joke but does avoid
+	 * this particular deadlock scenario.
+	 *
+	 * https://bugzilla.kernel.org/show_bug.cgi?id=62801
+	 * http://marc.info/?l=linux-kernel&m=138695698516487
+	 */
+	while (pm_freezing)
+		msleep(100);
+
 	DPRINTK("ENTER\n");
 	mutex_lock(&ap->scsi_scan_mutex);
 
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -19,6 +19,12 @@ EXPORT_SYMBOL(system_freezing_cnt);
 bool pm_freezing;
 bool pm_nosig_freezing;
 
+/*
+ * Temporary export for the deadlock workaround in ata_scsi_hotplug().
+ * Remove once the hack becomes unnecessary.
+ */
+EXPORT_SYMBOL_GPL(pm_freezing);
+
 /* protects freezing and frozen transitions */
 static DEFINE_SPINLOCK(freezer_lock);
 


* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-17 12:36       ` Tejun Heo
@ 2013-12-18  0:23         ` Rafael J. Wysocki
  0 siblings, 0 replies; 54+ messages in thread
From: Rafael J. Wysocki @ 2013-12-18  0:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Tuesday, December 17, 2013 07:36:43 AM Tejun Heo wrote:
> Hello, Rafael.
> 
> On Tue, Dec 17, 2013 at 03:38:05AM +0100, Rafael J. Wysocki wrote:
> > On Friday, December 13, 2013 03:40:34 PM Tejun Heo wrote:
> > > Hello,
> > > 
> > > So, this is the laughable workaround that I came up with.  Seriously,
> > > this is tragic.  :(
> > > 
> > > Thanks.
> > > 
> > > ------- 8< -------
> > > Freezable kthreads and workqueues are fundamentally problematic in
> > > that they effectively introduce a big kernel lock widely used in the
> > > kernel and have already been the culprit of several deadlock
> > > scenarios.  This is the latest occurrence.
> > 
> > OK, so I'm too tired now to go through all that, but I'll look at it tomorrow.
> > 
> > The rule of thumb is to get rid of freezable kernel threads in the
> > first place, if at all possible, when they are causing problems.
> 
> Yes, that'd be awesome.  In fact, getting rid of all kernel freezables
> in non-low-level-drivers would be a great step forward.

Agreed.

> That said, we need something easily backportable, so I think we probably need
> this bandaid as an immediate fix for now.

Well, we need something, but I have a couple of concerns about this particular
patch (I'll reply to it with comments shortly).

Thanks,
Rafael



* Re: Writeback threads and freezable
  2013-12-16 12:56         ` Tejun Heo
@ 2013-12-18  0:35           ` Dave Chinner
  2013-12-18 11:43             ` Tejun Heo
  0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2013-12-18  0:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Mon, Dec 16, 2013 at 07:56:04AM -0500, Tejun Heo wrote:
> On Mon, Dec 16, 2013 at 07:51:24AM -0500, Tejun Heo wrote:
> > On Mon, Dec 16, 2013 at 02:56:52PM +1100, Dave Chinner wrote:
> > > > What are you suggesting?  Implementing separate warm and hot unplug
> > > > paths?  That makes no sense whatsoever.  Device hot unplug is just a
> > > > sub operation of general device unplug which should be able to succeed
> > > > whether the underlying device is failing IOs or not.
> > > 
> > > I don't care. Trying to issue IO from an IO error handling path
> > > where the device has just been removed is fundamentally broken.
> > 
> > What?  Have you even read the original message?  The IO error
> > handling path isn't issuing the IO here.  The hot unplug operation is
> > completely asynchronous to the IO path.  What's deadlocking is not
> > the filesystem and IO path but the device driver layer and hot unplug
> > path.  IOs are not stalled.
> 
> In fact, this deadlock can be reproduced without hotunplug at all.  If
> you initiate warm unplug and it races with a suspend/resume cycle,
> it'll behave exactly the same - the IOs from flushing in that
> scenario would succeed, but whether the IOs fail or not has *NOTHING*
> to do with this deadlock.  It'd still happen.  It's the freezer
> behaving as a giant lock and other locks of course getting dragged
> into the giant dependency loop.  Can you please at least *try* to
> understand what's going on before throwing strong assertions?

Sure I understand the problem. Part of that dependency loop is
caused by filesystem re-entrancy from the hot unplug code.
Regardless of this /specific freezer problem/, that filesystem
re-entrancy path is *never OK* during a hot-unplug event and it
needs to be fixed.

Perhaps the function "invalidate_partition()" is badly named. To
state the obvious, fsync != invalidation. What it does is:

	1. syncs the filesystem
	2. shrinks the dcache
	3. invalidates inodes and kills dirty inodes
	4. invalidates the block device (removes cached bdev pages)

Basically, the first step is "flush", the remainder is "invalidate".

Indeed, step 3 throws away dirty inodes, so why on earth would we
even bother with step 1 to try to clean them in the first place?
IOWs, the flush is irrelevant in the hot-unplug case: it will fail to
write anything out, and then we throw away the data we failed to
write anyway.

But in attempting to flush all the dirty data and metadata, we can
cause all sorts of other potential re-entrancy based deadlocks due
to attempting to issue IO. Whether they be freezer based or through
IO error handling triggering device removal or some other means, it
is irrelevant - it is the flush that causes all the problems.

We need to either get rid of the flush on device failure/hot-unplug,
or turn it into a callout for the superblock to take an appropriate
action (e.g. shutting down the filesystem) rather than trying to
issue IO, i.e. allow the filesystem to take the appropriate action of
shutting itself down and invalidating its caches.

Indeed, in XFS there are several other caches that could contain
dirty metadata that isn't invalidated by invalidate_partition(), and
so unless the filesystem is shut down it can continue to try to issue
IO on those buffers to the removed device until the filesystem is
shut down or unmounted.

Seriously, Tejun, the assumption that invalidate_partition() knows
enough about filesystems to safely "invalidate" them is just broken.
These days, filesystems often reference multiple block devices, and
so the way hotplug currently treats them as "one device, one
filesystem" is also fundamentally wrong.

So there are many ways in which the hot-unplug code is broken in its
use of invalidate_partition(), the least of which is the
dependencies caused by re-entrancy.  We really need a
"sb->shutdown()" style callout as step one in the above process, not
fsync_bdev().

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-17 12:34               ` Tejun Heo
@ 2013-12-18  0:35                 ` Rafael J. Wysocki
  2013-12-18 11:17                   ` Tejun Heo
  0 siblings, 1 reply; 54+ messages in thread
From: Rafael J. Wysocki @ 2013-12-18  0:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman,
	Fengguang Wu

On Tuesday, December 17, 2013 07:34:57 AM Tejun Heo wrote:
> Hello, Rafael.
> 
> On Tue, Dec 17, 2013 at 03:34:00AM +0100, Rafael J. Wysocki wrote:
> > No, it isn't.  [I guess it was originally, but it has not been the case
> > for a very long time.]  It is about getting user space interactions (all of
> 
> Heh... no wonder people are all so confused about this thing.
> 
> > the sysfs/ioctl/mmap/read/write/you-name-it thingies user space can do to
> > devices) when we're calling device suspend/resume routines.  The reason is
> > that otherwise all of them would have had to do a "oh, are we suspending by
> > the way?" check pretty much on every code path that can be triggered by
> > user space.
> 
> Freezing userland is fine.

OK

So do I understand correctly that you're talking about kernel threads/workqueues
freezing below?

> I have no problem with that but up until
> now the only use case that seems fundamentally valid to me is freezing
> IO processing kthread in a driver as a cheap way to implement
> suspend/resume.  At this point, given the general level of confusion,
> it seems to be costing more than benefiting.

There are corner cases like the runtime PM workqueue that's freezable by
design.

> > > Does that mean that it's safe to unfreeze before invoking resume?
> > 
> > No, it isn't.
> 
> So, are you saying it's really about giving device drivers easy way to
> implement suspend/resume?

Well, that's a side effect rather than a recommended interface.  A *few* pieces
of code need to freeze kernel threads/workqueues, but they should know who they
are and they really really should know *why* they need that (the above-mentioned
runtime PM workqueue is one example).  The rest is just doing that because they
can, which may not be entirely reasonable (or because they did that in the past
and the original author is not responsive and everyone else does not dare to try
removing that).

> If that's the case, let's please make it
> *way* more specific and clear - ie. things like helpers to implement
> suspend/resume hooks trivially or whatnot.  Freezable kthreads (and
> now workqueues) have been becoming a giant mess for a while now.

There were a lot more of them to start with, really.  We removed quite a number
of try_to_freeze() instances from the kernel a few years ago, but apparently
people are taking shortcuts.

The rule should be to require patch submitters to put in comments explaining
why they need their kernel threads/workqueues to be freezable and generally
there should be no such things in drivers.

Thanks,
Rafael


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2] libata, freezer: avoid block device removal while system is frozen
  2013-12-17 12:50     ` [PATCH v2] " Tejun Heo
@ 2013-12-18  1:04       ` Rafael J. Wysocki
  2013-12-18 11:08         ` Tejun Heo
  2013-12-18 12:07       ` [PATCH v3] " Tejun Heo
  1 sibling, 1 reply; 54+ messages in thread
From: Rafael J. Wysocki @ 2013-12-18  1:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Tuesday, December 17, 2013 07:50:42 AM Tejun Heo wrote:
> Hello,
> 
> Rafael, if you're okay with the workaround, I'll route it through
> libata/for-3.13-fixes.
> 
> Thanks.
> ------- 8< -------
> Freezable kthreads and workqueues are fundamentally problematic in
> that they effectively introduce a big kernel lock widely used in the
> kernel and have already been the culprit of several deadlock
> scenarios.  This is the latest occurrence.
> 
> During resume, libata rescans all the ports and revalidates all
> pre-existing devices.  If it determines that a device has gone
> missing, the device is removed from the system which involves
> invalidating block device and flushing bdi while holding driver core
> layer locks.  Unfortunately, this can race with the rest of device
> resume.  Because freezable kthreads and workqueues are thawed after
> device resume is complete and block device removal depends on
> freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
> progress, this can lead to deadlock - block device removal can't
> proceed because kthreads are frozen and kthreads can't be thawed
> because device resume is blocked behind block device removal.
> 
> 839a8e8660b6 ("writeback: replace custom worker pool implementation
> with unbound workqueue") made this particular deadlock scenario more
> visible but the underlying problem has always been there - the
> original forker task and jbd2 are freezable too.  In fact, this is
> highly likely just one of many possible deadlock scenarios given that
> freezer behaves as a big kernel lock and we don't have any debug
> mechanism around it.
> 
> I believe the right thing to do is getting rid of freezable kthreads
> and workqueues.  This is something fundamentally broken.  For now,
> implement a funny workaround in libata - just avoid doing block device
> hot[un]plug while the system is frozen.  Kernel engineering at its
> finest.  :(
> 
> v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
>     as a module.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
> Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: stable@vger.kernel.org
> ---
>  drivers/ata/libata-scsi.c |   19 +++++++++++++++++++
>  kernel/freezer.c          |    6 ++++++
>  2 files changed, 25 insertions(+)
> 
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -3871,6 +3871,25 @@ void ata_scsi_hotplug(struct work_struct
>  		return;
>  	}
>  
> +	/*
> +	 * XXX - UGLY HACK
> +	 *
> +	 * The core suspend/resume path is fundamentally broken due to
> +	 * freezable kthreads and workqueue and may deadlock if a block
> +	 * device gets removed while resume is in progress.  I don't know
> +	 * what the solution is short of removing freezable kthreads and
> +	 * workqueues altogether.

Do you mean the block device core or the SCSI core or something else?  It would
be good to clarify that here to avoid confusion.

> +	 * The following is an ugly hack to avoid kicking off device
> +	 * removal while freezer is active.  This is a joke but does avoid
> +	 * this particular deadlock scenario.
> +	 *
> +	 * https://bugzilla.kernel.org/show_bug.cgi?id=62801
> +	 * http://marc.info/?l=linux-kernel&m=138695698516487
> +	 */
> +	while (pm_freezing)
> +		msleep(100);

Why is the sleep time 100 ms exactly?  And why does it matter?

For example, what would change if it were 10 ms?

> +
>  	DPRINTK("ENTER\n");
>  	mutex_lock(&ap->scsi_scan_mutex);
>  
> --- a/kernel/freezer.c
> +++ b/kernel/freezer.c
> @@ -19,6 +19,12 @@ EXPORT_SYMBOL(system_freezing_cnt);
>  bool pm_freezing;
>  bool pm_nosig_freezing;
>  
> +/*
> + * Temporary export for the deadlock workaround in ata_scsi_hotplug().
> + * Remove once the hack becomes unnecessary.
> + */
> +EXPORT_SYMBOL_GPL(pm_freezing);
> +
>  /* protects freezing and frozen transitions */
>  static DEFINE_SPINLOCK(freezer_lock);
>  

Thanks,
Rafael


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2] libata, freezer: avoid block device removal while system is frozen
  2013-12-18  1:04       ` Rafael J. Wysocki
@ 2013-12-18 11:08         ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-18 11:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hey, Rafael.

On Wed, Dec 18, 2013 at 02:04:35AM +0100, Rafael J. Wysocki wrote:
> > +	 * The core suspend/resume path is fundamentally broken due to
> > +	 * freezable kthreads and workqueues and may deadlock if a block
> > +	 * device gets removed while resume is in progress.  I don't know
> > +	 * what the solution is short of removing freezable kthreads and
> > +	 * workqueues altogether.
> 
> Do you mean the block device core or the SCSI core or something else?  It would
> be good to clarify that here to avoid confusion.

Will clarify.

> > +	 * The following is an ugly hack to avoid kicking off device
> > +	 * removal while freezer is active.  This is a joke but does avoid
> > +	 * this particular deadlock scenario.
> > +	 *
> > +	 * https://bugzilla.kernel.org/show_bug.cgi?id=62801
> > +	 * http://marc.info/?l=linux-kernel&m=138695698516487
> > +	 */
> > +	while (pm_freezing)
> > +		msleep(100);
> 
> Why is the sleep time 100 ms exactly?  And why does it matter?

Just a number I pulled out of my ass.

> For example, what would change if it were 10 ms?

Yeah, 10ms is my favorite human-visible polling duration too (because
it's slow enough not to cause overhead issues while fast enough to be
mostly unnoticeable to humans).  This one doesn't really matter
because the operation's latency isn't something which the user would
wait for actively.  That said, yeah, why not, I'll change it to 10ms.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-18  0:35                 ` Rafael J. Wysocki
@ 2013-12-18 11:17                   ` Tejun Heo
  2013-12-18 21:48                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2013-12-18 11:17 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman,
	Fengguang Wu

Hello, Rafael.

On Wed, Dec 18, 2013 at 01:35:13AM +0100, Rafael J. Wysocki wrote:
> So do I understand correctly that you're talking about kernel threads/workqueues
> freezing below?

Yeap, I'm strictly talking about kernel freezables.

> > So, are you saying it's really about giving device drivers easy way to
> > implement suspend/resume?
> 
> Well, that's a side effect rather than a recommended interface.  A *few* pieces
> of code need to freeze kernel threads/workqueues, but they should know who they
> are and they really really should know *why* they need that (the above-mentioned
> runtime PM workqueue is one example).  The rest is just doing that because they
> can, which may not be entirely reasonable (or because they did that in the past
> and the original author is not responsive and everyone else does not dare to try
> removing that).

I see.  In the long term, I think the right thing to do is making the
freezer interface more specific so that only the ones which actually
need it do so explicitly.  Right now, kernel freezables are
conceptually at a very high level - it's a global task attribute and a
major knob in workqueue.  I suppose most of that is historical but by
perpetuating the model we're encouraging misuse of freezer in large
swaths of the kernel.  Even in this specific case, both writeback and
jbd workers have no fundamental reason to be freezable and yet
they are, eventually developing into totally unnecessary deadlocks.

> There were a lot more of them to start with, really.  We removed quite a number
> of try_to_freeze() instances from the kernel a few years ago, but apparently
> people are taking shortcuts.

Great, so we're at least headed towards the right direction.

> The rule should be to require patch submitters to put in comments explaining
> why they need their kernel threads/workqueues to be freezable and generally
> there should be no such things in drivers.

I'm not so sure whether that's something which can be effectively
enforced given the current high visibility and confusion around
freezer.  I think the only way to get this under control is to weed out
the current spurious users actively, deprecate the existing interface
and switch the real ones to something a lot more specific.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Writeback threads and freezable
  2013-12-18  0:35           ` Dave Chinner
@ 2013-12-18 11:43             ` Tejun Heo
  2013-12-18 22:14               ` Rafael J. Wysocki
  2013-12-19  4:08               ` Dave Chinner
  0 siblings, 2 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-18 11:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hello, Dave.

On Wed, Dec 18, 2013 at 11:35:10AM +1100, Dave Chinner wrote:
> Sure I understand the problem. Part of that dependency loop is
> caused by filesystem re-entrancy from the hot unplug code.
> Regardless of this /specific freezer problem/, that filesystem
> re-entrancy path is *never OK* during a hot-unplug event and it
> needs to be fixed.

I don't think you do.  The only relation between the two problems is
that they're traveling the same code path.  Again, the deadlock can be
triggered *the same* with warm unplug and it *doesn't* need writeback
or jbd - we can deadlock with any freezables and we have drivers which
use freezables in the IO issue path.

I don't really mind the discussion branching off to a related topic
but you've been consistently misunderstanding the issue at hand and
derailing the larger discussion.  If you want to discuss the issue
that you brought up, that's completely fine but *PLEASE* stop getting
confused and mixing up the two.

> Perhaps the function "invalidate_partition()" is badly named. To
> state the obvious, fsync != invalidation. What it does is:
> 
> 	1. sync filesystem
> 	2. shrink the dcache
> 	3. invalidates inodes and kills dirty inodes
> 	4. invalidates block device (removes cached bdev pages)
> 
> Basically, the first step is "flush", the remainder is "invalidate".
> 
> Indeed, step 3 throws away dirty inodes, so why on earth would we
> even bother with step 1 to try to clean them in the first place?
> IOWs, the flush is irrelevant in the hot-unplug case as it will
> fail to flush stuff, and then we just throw the stuff we
> failed to write away.
>
> But in attempting to flush all the dirty data and metadata, we can
> cause all sorts of other potential re-entrancy based deadlocks due
> to attempting to issue IO. Whether they be freezer based or through
> IO error handling triggering device removal or some other means, it
> is irrelevant - it is the flush that causes all the problems.

Isn't the root cause there hotunplug reentering anything above it in
the first place?  The problem with your proposal is that the filesystem
isn't the only place where this could happen.  Even with no filesystem
involved, block device could still be dirty and IOs pending in
whatever form - dirty bdi, bios queued in dm, requests queued in
request_queue, whatever really - and if the hotunplug path reenters
any of the higher layers in a way which blocks IO processing, it will
deadlock.

If knowing that the underlying device has gone away somehow helps
filesystem, maybe we can expose that interface and avoid flushing
after hotunplug but that merely hides the possible deadlock scenario
that you're concerned about.  Nothing is really solved.

We can try to do the same thing at each layer and implement quick exit
path for hot unplug all the way down to the driver but that kinda
sounds complex and fragile to me.  It's a lot larger surface to cover
when the root cause is hotunplug allowed to reenter anything at all
from IO path.  This is especially true because hotunplug can trivially
be made fully asynchronous in most cases.  In terms of destruction of
higher level objects, warm and hot unplugs can and should behave
identically.

In fact, it isn't possible to draw a strict line between warm and hot
unplugs - what if the device gets detached while warm unplug is in
progress?  We'd be already traveling warm unplug path when the
operating condition becomes identical to hot unplug.

> We need to either get rid of the flush on device failure/hot-unplug,
> or turn it into a callout for the superblock to take an appropriate
> action (e.g. shutting down the filesystem) rather than trying to
> issue IO. i.e. allow the filesystem to take appropriate action of
> shutting down the filesystem and invalidating its caches.

There could be cases where some optimizations for hot unplug could be
useful.  Maybe suppressing pointless duplicate warning messages or
whatnot, but I'm highly doubtful anything will actually be fixed that
way.  We'll be most likely making bugs just less reproducible.

> Indeed, in XFS there's several other caches that could contain dirty
> metadata that isn't invalidated by invalidate_partition(), and so
> unless the filesystem is shut down it can continue to try to issue
> IO on those buffers to the removed device until the filesystem is
> shutdown or unmounted.

Do you mean xfs never gives up after IO failures?

> Seriously, Tejun, the assumption that invalidate_partition() knows
> enough about filesystems to safely "invalidate" them is just broken.
> These days, filesystems often reference multiple block devices, and
> so the way hotplug currently treats them as "one device, one
> filesystem" is also fundamentally wrong.
> 
> So there are many ways in which the hot-unplug code is broken in its
> use of invalidate_partition(), the least of which is the
> dependencies caused by re-entrancy. We really need a
> "sb->shutdown()" style callout as step one in the above process, not
> fsync_bdev().

If filesystems need an indication that the underlying device is no
longer functional, please go ahead and add it, but please keep in mind
all these are completely asynchronous.  Nothing guarantees you that
such events would happen in any specific order.  IOW, you can be at
*ANY* point in your warm unplug path when the device is hot unplugged,
which essentially forces all the code paths to be ready for the worst,
and that's exactly why there isn't much effort in trying to separate
out warm and hot unplug paths.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v3] libata, freezer: avoid block device removal while system is frozen
  2013-12-17 12:50     ` [PATCH v2] " Tejun Heo
  2013-12-18  1:04       ` Rafael J. Wysocki
@ 2013-12-18 12:07       ` Tejun Heo
  2013-12-18 22:08         ` Rafael J. Wysocki
  2013-12-19 18:54         ` [PATCH v4] " Tejun Heo
  1 sibling, 2 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-18 12:07 UTC (permalink / raw)
  To: Rafael J. Wysocki, Jens Axboe
  Cc: tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov,
	Greg Kroah-Hartman, Fengguang Wu

Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios.  This is the latest occurrence.

During resume, libata rescans all the ports and revalidates all
pre-existing devices.  If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks.  Unfortunately, this can race with the rest of device
resume.  Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.

839a8e8660b6 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too.  In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.

I believe the right thing to do is getting rid of freezable kthreads
and workqueues.  This is something fundamentally broken.  For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen.  Kernel engineering at its
finest.  :(

v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
    as a module.

v3: Comment updated and polling interval changed to 10ms as suggested
    by Rafael.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
---
 drivers/ata/libata-scsi.c |   19 +++++++++++++++++++
 kernel/freezer.c          |    6 ++++++
 2 files changed, 25 insertions(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index db6dfcf..f519868 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3871,6 +3871,25 @@ void ata_scsi_hotplug(struct work_struct *work)
 		return;
 	}
 
+	/*
+	 * XXX - UGLY HACK
+	 *
+	 * The block layer suspend/resume path is fundamentally broken due
+	 * to freezable kthreads and workqueues and may deadlock if a block
+	 * device gets removed while resume is in progress.  I don't know
+	 * what the solution is short of removing freezable kthreads and
+	 * workqueues altogether.
+	 *
+	 * The following is an ugly hack to avoid kicking off device
+	 * removal while freezer is active.  This is a joke but does avoid
+	 * this particular deadlock scenario.
+	 *
+	 * https://bugzilla.kernel.org/show_bug.cgi?id=62801
+	 * http://marc.info/?l=linux-kernel&m=138695698516487
+	 */
+	while (pm_freezing)
+		msleep(10);
+
 	DPRINTK("ENTER\n");
 	mutex_lock(&ap->scsi_scan_mutex);
 
diff --git a/kernel/freezer.c b/kernel/freezer.c
index b462fa1..aa6a8aa 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -19,6 +19,12 @@ EXPORT_SYMBOL(system_freezing_cnt);
 bool pm_freezing;
 bool pm_nosig_freezing;
 
+/*
+ * Temporary export for the deadlock workaround in ata_scsi_hotplug().
+ * Remove once the hack becomes unnecessary.
+ */
+EXPORT_SYMBOL_GPL(pm_freezing);
+
 /* protects freezing and frozen transitions */
 static DEFINE_SPINLOCK(freezer_lock);
 

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-18 21:48                     ` Rafael J. Wysocki
@ 2013-12-18 21:39                       ` Tejun Heo
  2013-12-18 21:41                         ` Tejun Heo
  0 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2013-12-18 21:39 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman,
	Fengguang Wu

Hello, Rafael.

On Wed, Dec 18, 2013 at 10:48:31PM +0100, Rafael J. Wysocki wrote:
> > I see.  In the long term, I think the right thing to do is making the
> > freezer interface more specific so that only the ones which actually
> > need it do so explicitly.  Right now, kernel freezables are
> > conceptually at a very high level - it's a global task attribute and a
> > major knob in workqueue.  I suppose most of that is historical but by
> > perpetuating the model we're encouraging misuse of freezer in large
> > swaths of the kernel.  Even in this specific case, both writeback and
> > jbd workers have no fundamental reason to be freezable and yet
> > they are, eventually developing into totally unnecessary deadlocks.
> 
> You're right, but I'm not sure how we can make the interface for workqueues
> more specific, for example.  I guess we can simply drop create_freezable_workqueue()
> so that whoever wants to create a freezable workqueue has to use the right
> combination of flags.  Can we make it more specific than that?
> 
> BTW, pm_start_workqueue(), which is a legitimate user, doesn't even use that macro. :-)

Yeah, we can just rip out the whole freezer support and let the
caller's pm notification hooks implement it by doing

	workqueue_set_max_active(wq, 0);
	flush_workqueue(wq);

when it needs to "freeze" the workqueue and then reverse it by doing
the following.

	workqueue_set_max_active(wq, WQ_DFL_ACTIVE);

It'll be a bit more code in the specific users but given the
specificity of the usage I think that's the appropriate way to do it.
It'll drop quite a bit of complexity from the core freezer and
workqueue code paths too.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-18 21:39                       ` Tejun Heo
@ 2013-12-18 21:41                         ` Tejun Heo
  2013-12-18 22:04                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2013-12-18 21:41 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman,
	Fengguang Wu

On Wed, Dec 18, 2013 at 04:39:36PM -0500, Tejun Heo wrote:
> Yeah, we can just rip out the whole freezer support and let the
> caller's pm notification hooks implement it by doing
> 
> 	workqueue_set_max_active(wq, 0);
> 	flush_workqueue(wq);

Oops, the above doesn't work but I can trivially add a new interface
for this.  Something which waits for the number of in-flight work
items to drop to the newly set max_active before returning.

	workqueue_set_max_active_and_wait(wq, 0);

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-18 11:17                   ` Tejun Heo
@ 2013-12-18 21:48                     ` Rafael J. Wysocki
  2013-12-18 21:39                       ` Tejun Heo
  0 siblings, 1 reply; 54+ messages in thread
From: Rafael J. Wysocki @ 2013-12-18 21:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman,
	Fengguang Wu

On Wednesday, December 18, 2013 06:17:26 AM Tejun Heo wrote:
> Hello, Rafael.
> 
> On Wed, Dec 18, 2013 at 01:35:13AM +0100, Rafael J. Wysocki wrote:
> > So do I understand correctly that you're talking about kernel threads/workqueues
> > freezing below?
> 
> Yeap, I'm strictly talking about kernel freezables.
> 
> > > So, are you saying it's really about giving device drivers easy way to
> > > implement suspend/resume?
> > 
> > Well, that's a side effect rather than a recommended interface.  A *few* pieces
> > of code need to freeze kernel threads/workqueues, but they should know who they
> > are and they really really should know *why* they need that (the above-mentioned
> > runtime PM workqueue is one example).  The rest is just doing that because they
> > can, which may not be entirely reasonable (or because they did that in the past
> > and the original author is not responsive and everyone else does not dare to try
> > removing that).
> 
> I see.  In the long term, I think the right thing to do is making the
> freezer interface more specific so that only the ones which actually
> need it do so explicitly.  Right now, kernel freezables are
> conceptually at a very high level - it's a global task attribute and a
> major knob in workqueue.  I suppose most of that is historical but by
> perpetuating the model we're encouraging misuse of freezer in large
> swaths of the kernel.  Even in this specific case, both writeback and
> jbd workers have no fundamental reason to be freezable and yet
> they are, eventually developing into totally unnecessary deadlocks.

You're right, but I'm not sure how we can make the interface for workqueues
more specific, for example.  I guess we can simply drop create_freezable_workqueue()
so that whoever wants to create a freezable workqueue has to use the right
combination of flags.  Can we make it more specific than that?

BTW, pm_start_workqueue(), which is a legitimate user, doesn't even use that macro. :-)

> > There were a lot more of those to start with really.  We removed quite a number
> > of try_to_freeze() instances from the kernel a few years ago, but apparently
> > people are taking shortcuts.
> 
> Great, so we're at least headed towards the right direction.
> 
> > The rule should be to require patch submitters to put in comments explaining
> > why they need their kernel threads/workqueues to be freezable and generally
> > there should be no such things in drivers.
> 
> I'm not so sure whether that's something which can be effectively
> enforced given the current high visibility and confusion around
> freezer.  I think the only way to get this under control is weed out
> the current spurious users actively, deprecate the existing interface
> and switch the real ones to something a lot more specific.

That'd be fine by me modulo the above remarks.

Thanks,
Rafael


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] libata, freezer: avoid block device removal while system is frozen
  2013-12-18 21:41                         ` Tejun Heo
@ 2013-12-18 22:04                           ` Rafael J. Wysocki
  2013-12-19 23:35                             ` [PATCH wq/for-3.14 1/2] workqueue: update max_active clamping rules Tejun Heo
  2013-12-19 23:37                             ` [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active() Tejun Heo
  0 siblings, 2 replies; 54+ messages in thread
From: Rafael J. Wysocki @ 2013-12-18 22:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman,
	Fengguang Wu

On Wednesday, December 18, 2013 04:41:28 PM Tejun Heo wrote:
> On Wed, Dec 18, 2013 at 04:39:36PM -0500, Tejun Heo wrote:
> > Yeah, we can just rip out the whole freezer support and let the
> > caller's pm notification hooks implement it by doing
> > 
> > 	workqueue_set_max_active(wq, 0);
> > 	flush_workqueue(wq);
> 
> Oops, the above doesn't work but I can trivially add a new interface
> for this.  Something which waits for the max_active to drop below the
> newly set level before returning.
> 
> 	workqueue_set_max_active_and_wait(wq, 0);

If you add that, I can convert the PM workqueue to using it and then
we can go through all of the existing users and make similar changes - or
just make them non-freezable if there's no good reason for freezing those
particular ones.

Thanks,
Rafael


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3] libata, freezer: avoid block device removal while system is frozen
  2013-12-18 12:07       ` [PATCH v3] " Tejun Heo
@ 2013-12-18 22:08         ` Rafael J. Wysocki
  2013-12-19 17:24           ` Tejun Heo
  2013-12-19 18:54         ` [PATCH v4] " Tejun Heo
  1 sibling, 1 reply; 54+ messages in thread
From: Rafael J. Wysocki @ 2013-12-18 22:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Wednesday, December 18, 2013 07:07:32 AM Tejun Heo wrote:
> Freezable kthreads and workqueues are fundamentally problematic in
> that they effectively introduce a big kernel lock widely used in the
> kernel and have already been the culprit of several deadlock
> scenarios.  This is the latest occurrence.
> 
> During resume, libata rescans all the ports and revalidates all
> pre-existing devices.  If it determines that a device has gone
> missing, the device is removed from the system which involves
> invalidating block device and flushing bdi while holding driver core
> layer locks.  Unfortunately, this can race with the rest of device
> resume.  Because freezable kthreads and workqueues are thawed after
> device resume is complete and block device removal depends on
> freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
> progress, this can lead to deadlock - block device removal can't
> proceed because kthreads are frozen and kthreads can't be thawed
> because device resume is blocked behind block device removal.
> 
> 839a8e8660b6 ("writeback: replace custom worker pool implementation
> with unbound workqueue") made this particular deadlock scenario more
> visible but the underlying problem has always been there - the
> original forker task and jbd2 are freezable too.  In fact, this is
> highly likely just one of many possible deadlock scenarios given that
> freezer behaves as a big kernel lock and we don't have any debug
> mechanism around it.
> 
> I believe the right thing to do is getting rid of freezable kthreads
> and workqueues.

I agree.  It may be useful to block them over suspend/resume, but that doesn't
have to be done through the freezer.

> This is something fundamentally broken.  For now,
> implement a funny workaround in libata - just avoid doing block device
> hot[un]plug while the system is frozen.  Kernel engineering at its
> finest.  :(
> 
> v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
>     as a module.
> 
> v3: Comment updated and polling interval changed to 10ms as suggested
>     by Rafael.

This one is fine by me FWIW.

Thanks!

> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
> Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: stable@vger.kernel.org
> ---
>  drivers/ata/libata-scsi.c |   19 +++++++++++++++++++
>  kernel/freezer.c          |    6 ++++++
>  2 files changed, 25 insertions(+)
> 
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index db6dfcf..f519868 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -3871,6 +3871,25 @@ void ata_scsi_hotplug(struct work_struct *work)
>  		return;
>  	}
>  
> +	/*
> +	 * XXX - UGLY HACK
> +	 *
> +	 * The block layer suspend/resume path is fundamentally broken due
> +	 * to freezable kthreads and workqueue and may deadlock if a block
> +	 * device gets removed while resume is in progress.  I don't know
> +	 * what the solution is short of removing freezable kthreads and
> +	 * workqueues altogether.
> +	 *
> +	 * The following is an ugly hack to avoid kicking off device
> +	 * removal while freezer is active.  This is a joke but does avoid
> +	 * this particular deadlock scenario.
> +	 *
> +	 * https://bugzilla.kernel.org/show_bug.cgi?id=62801
> +	 * http://marc.info/?l=linux-kernel&m=138695698516487
> +	 */
> +	while (pm_freezing)
> +		msleep(10);
> +
>  	DPRINTK("ENTER\n");
>  	mutex_lock(&ap->scsi_scan_mutex);
>  
> diff --git a/kernel/freezer.c b/kernel/freezer.c
> index b462fa1..aa6a8aa 100644
> --- a/kernel/freezer.c
> +++ b/kernel/freezer.c
> @@ -19,6 +19,12 @@ EXPORT_SYMBOL(system_freezing_cnt);
>  bool pm_freezing;
>  bool pm_nosig_freezing;
>  
> +/*
> + * Temporary export for the deadlock workaround in ata_scsi_hotplug().
> + * Remove once the hack becomes unnecessary.
> + */
> +EXPORT_SYMBOL_GPL(pm_freezing);
> +
>  /* protects freezing and frozen transitions */
>  static DEFINE_SPINLOCK(freezer_lock);
>  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Writeback threads and freezable
  2013-12-18 11:43             ` Tejun Heo
@ 2013-12-18 22:14               ` Rafael J. Wysocki
  2013-12-19  4:08               ` Dave Chinner
  1 sibling, 0 replies; 54+ messages in thread
From: Rafael J. Wysocki @ 2013-12-18 22:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Dave Chinner, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel,
	Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Wednesday, December 18, 2013 06:43:43 AM Tejun Heo wrote:

[...]

> If filesystems need an indication that the underlying device is no
> longer functional, please go ahead and add it, but please keep in mind
> all these are completely asynchronous.  Nothing guarantees you that
> such events would happen in any specific order.  IOW, you can be at
> *ANY* point in your warm unplug path and the device is hot unplugged,
> which essentially forces all the code paths to be ready for the worst,
> and that's exactly why there isn't much effort in trying to separate
> out warm and hot unplug paths.

Yes.  Devices can go away at any point without notice.  Even PCI devices
that have never been assumed to be hot-removable.  Any piece of code in the
kernel needs to be prepared to deal with such situations.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Writeback threads and freezable
  2013-12-18 11:43             ` Tejun Heo
  2013-12-18 22:14               ` Rafael J. Wysocki
@ 2013-12-19  4:08               ` Dave Chinner
  2013-12-19 16:24                 ` Tejun Heo
  1 sibling, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2013-12-19  4:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Wed, Dec 18, 2013 at 06:43:43AM -0500, Tejun Heo wrote:
> Hello, Dave.
> 
> On Wed, Dec 18, 2013 at 11:35:10AM +1100, Dave Chinner wrote:
> > Perhaps the function "invalidate_partition()" is badly named. To
> > state the obvious, fsync != invalidation. What it does is:
> > 
> > 	1. sync filesystem
> > 	2. shrink the dcache
> > 	3. invalidates inodes and kills dirty inodes
> > 	4. invalidates block device (removes cached bdev pages)
> > 
> > Basically, the first step is "flush", the remainder is "invalidate".
> > 
> > Indeed, step 3 throws away dirty inodes, so why on earth would we
> > even bother with step 1 to try to clean them in the first place?
> > IOWs, the flush is irrelevant in the hot-unplug case as it will
> > fail to flush stuff, and then we just throw the stuff we
> > failed to write away.
> >
> > But in attempting to flush all the dirty data and metadata, we can
> > cause all sorts of other potential re-entrancy based deadlocks due
> > to attempting to issue IO. Whether they be freezer based or through
> > IO error handling triggering device removal or some other means, it
> > is irrelevant - it is the flush that causes all the problems.
> 
> Isn't the root cause there hotunplug reentering anything above it in
> the first place.  The problem with your proposal is that filesystem
> isn't the only place where this could happen.  Even with no filesystem
> involved, block device could still be dirty and IOs pending in
> whatever form - dirty bdi, bios queued in dm, requests queued in
> request_queue, whatever really - and if the hotunplug path reenters
> any of the higher layers in a way which blocks IO processing, it will
> deadlock.

Entirely possible.

> If knowing that the underlying device has gone away somehow helps
> filesystem, maybe we can expose that interface and avoid flushing
> after hotunplug but that merely hides the possible deadlock scenario
> that you're concerned about.  Nothing is really solved.

Except that a user of the block device has been informed that it is
now gone and has been freed from under it. i.e. we can *immediately*
inform the user that their mounted filesystem is now stuffed and
suppress all the errors that are going to occur as a result of
sync_filesystem() triggering IO failures all over the place and then
having to react to that.

Indeed, there is no guarantee that sync_filesystem will result in
the filesystem being shut down - if the filesystem is clean then
nothing will happen, and it won't be until the user modifies some
metadata that a shutdown will be triggered. That could be a long
time after the device has been removed....

> We can try to do the same thing at each layer and implement quick exit
> path for hot unplug all the way down to the driver but that kinda
> sounds complex and fragile to me.  It's a lot larger surface to cover
> when the root cause is hotunplug allowed to reenter anything at all
> from IO path.  This is especially true because hotunplug can trivially
> be made fully asynchronous in most cases.  In terms of destruction of
> higher level objects, warm and hot unplugs can and should behave
> identical.

I don't see that there is a difference between a warm and hot unplug
from a filesystem point of view - both result in the filesystem's
backing device being deleted and freed, and in both cases we have to
take the same action....

> > We need to either get rid of the flush on device failure/hot-unplug,
> > or turn it into a callout for the superblock to take an appropriate
> > action (e.g. shutting down the filesystem) rather than trying to
> > issue IO. i.e. allow the filesystem to take appropriate action of
> > shutting down the filesystem and invalidating it's caches.
> 
> There could be cases where some optimizations for hot unplug could be
> useful.  Maybe suppressing pointless duplicate warning messages or
> whatnot but I'm highly doubtful anything will be actually fixed that
> way.  We'll be most likely making bugs just less reproducible.
> 
> > Indeed, in XFS there's several other caches that could contain dirty
> > metadata that isn't invalidated by invalidate_partition(), and so
> > unless the filesystem is shut down it can continue to try to issue
> > IO on those buffers to the removed device until the filesystem is
> > shutdown or unmounted.
> 
> Do you mean xfs never gives up after IO failures?

There's this thing called a transient IO failure which we have to
handle, e.g. multipath taking several minutes to detect a path
failure and fail over, whilst in the mean time IO errors are
reported after a 30s timeout. So some types of async metadata write
IO failures are simply rescheduled for a short time in the future.
They'll either succeed, or continual failure will eventually trigger
some kind of filesystem failure.

If it's a synchronous write or a write that we cannot tolerate even
transient errors on (e.g. journal writes), then we'll shut down the
filesystem immediately.

> > Seriously, Tejun, the assumption that invalidate_partition() knows
> > enough about filesystems to safely "invalidate" them is just broken.
> > These days, filesystems often reference multiple block devices, and
> > so the way hotplug currently treats them as "one device, one
> > filesystem" is also fundamentally wrong.
> > 
> > So there are many ways in which the hot-unplug code is broken in its
> > use of invalidate_partition(), the least of which is the
> > dependencies caused by re-entrancy. We really need a
> > "sb->shutdown()" style callout as step one in the above process, not
> > fsync_bdev().
> 
> If filesystems need an indication that the underlying device is no
> longer functional, please go ahead and add it, but please keep in mind
> all these are completely asynchronous.  Nothing guarantees you that
> such events would happen in any specific order.  IOW, you can be at
> *ANY* point in your warm unplug path and the device is hot unplugged,
> which essentially forces all the code paths to be ready for the worst,
> and that's exactly why there isn't much effort in trying to separate
> out warm and hot unplug paths.

I'm not concerned about the problems that might happen if you hot
unplug during a warm unplug. All I care about is when a device is
invalidated the filesystem on top of it can take appropriate action.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: Writeback threads and freezable
  2013-12-19  4:08               ` Dave Chinner
@ 2013-12-19 16:24                 ` Tejun Heo
  2013-12-20  0:51                   ` Dave Chinner
  2013-12-20 14:00                   ` Rafael J. Wysocki
  0 siblings, 2 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-19 16:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Yo, Dave.

On Thu, Dec 19, 2013 at 03:08:21PM +1100, Dave Chinner wrote:
> > If knowing that the underlying device has gone away somehow helps
> > filesystem, maybe we can expose that interface and avoid flushing
> > after hotunplug but that merely hides the possible deadlock scenario
> > that you're concerned about.  Nothing is really solved.
> 
> Except that a user of the block device has been informed that it is
> now gone and has been freed from under it. i.e. we can *immediately*
> inform the user that their mounted filesystem is now stuffed and
> suppress all the errors that are going to occur as a result of
> sync_filesystem() triggering IO failures all over the place and then
> having to react to that.

Please note that there's no real "immediacy" in that it's inherently
racy and that the extent of the usefulness of such notification can't
reach much further than suppressing error messages.  Even that benefit
is kinda dubious.  Don't we want to generate errors when a device is
removed while dirty data / IOs are pending on it?  I fail to see how
"suppressing all the errors" would be a sane thing to do.

Another thing is that I think it's actually healthier in terms of
exercise of code paths to travel those error paths on hot unplugs,
which are relatively common, than to take a different behavior on them.
It'll inevitably lower our test coverage.

> Indeed, there is no guarantee that sync_filesystem will result in
> the filesystem being shut down - if the filesystem is clean then
> nothing will happen, and it won't be until the user modifies some
> metadata that a shutdown will be triggered. That could be a long
> time after the device has been removed....

I still fail to see that why that is a problem.  Filesystems should be
able to handle hot unplug or IO failures at any point in a reasonable
way, so what difference would having a notification make other than
introducing yet another exception code path?

> I don't see that there is a difference between a warm and hot unplug
> from a filesystem point of view - both result in the filesystem's
> backing device being deleted and freed, and in both cases we have to
> take the same action....

Yeah, exactly, so what'd be the point of getting separate notification
for hot unplug events?

> > Do you mean xfs never gives up after IO failures?
> 
> There's this thing called a transient IO failure which we have to
> handle, e.g. multipath taking several minutes to detect a path
> failure and fail over, whilst in the mean time IO errors are
> reported after a 30s timeout. So some types of async metadata write
> IO failures are simply rescheduled for a short time in the future.
> They'll either succeed, or continual failure will eventually trigger
> some kind of filesystem failure.
> 
> If it's a synchronous write or a write that we cannot tolerate even
> transient errors on (e.g. journal writes), then we'll shut down the
> filesystem immediately.

Sure, filesystems should (be able to) react to different types of
errors in different ways.  We still have a long way to go to do that
properly but that should be done through IO failures not some side
channel one-off "hotunplug" happened call.  Again, it doesn't solve
anything.  It just side steps one very specific case in a half-assed
way.

> > If filesystems need an indication that the underlying device is no
> > longer functional, please go ahead and add it, but please keep in mind
> > all these are completely asynchronous.  Nothing guarantees you that
> > such events would happen in any specific order.  IOW, you can be at
> > *ANY* point in your warm unplug path and the device is hot unplugged,
> > which essentially forces all the code paths to be ready for the worst,
> > and that's exactly why there isn't much effort in trying to separate
> > out warm and hot unplug paths.
> 
> I'm not concerned about the problems that might happen if you hot
> unplug during a warm unplug. All I care about is when a device is
> invalidated the filesystem on top of it can take appropriate action.

I can't follow your logic here.  You started with a deadlock scenario
where lower layer calls into upper layer while blocking its own
operation, which apparently is a bug to be fixed in the lower layer as
discussed above; otherwise, we'd be chasing the symptoms rather than
plugging the source.  Combined with the fact that you can't really
prevent hotunplug happening during warmunplug (it doesn't even have to
be hotunplug, there are other conditions which would match such IO
failure pattern), this reduces the benefit of such notification to
optimizations far from correctness issues.

These are all logically connected; yet, you claim that you're not
concerned about part of it and then continue to assert your original
position.  It doesn't compute.  You proposed it as a fix for a
deadlock issue, but it turns out your proposal can't fix the deadlock
issue exactly because of the part you aren't concerned about, and yet
you continue to assert the original proposal.  What's going on here?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3] libata, freezer: avoid block device removal while system is frozen
  2013-12-18 22:08         ` Rafael J. Wysocki
@ 2013-12-19 17:24           ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-19 17:24 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hello,

On Wed, Dec 18, 2013 at 11:08:41PM +0100, Rafael J. Wysocki wrote:
> > This is something fundamentally broken.  For now,
> > implement a funny workaround in libata - just avoid doing block device
> > hot[un]plug while the system is frozen.  Kernel engineering at its
> > finest.  :(
> > 
> > v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
> >     as a module.
> > 
> > v3: Comment updated and polling interval changed to 10ms as suggested
> >     by Rafael.
> 
> This one is fine by my FWIW.

Applied to libata/for-3.13-fixes.  I liberally added your reviewed-by
from the above response.  Please let me know if I should remove it.
I'll push it out to Linus after it spends a couple days in -next.

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v4] libata, freezer: avoid block device removal while system is frozen
  2013-12-18 12:07       ` [PATCH v3] " Tejun Heo
  2013-12-18 22:08         ` Rafael J. Wysocki
@ 2013-12-19 18:54         ` Tejun Heo
  1 sibling, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-19 18:54 UTC (permalink / raw)
  To: Rafael J. Wysocki, Jens Axboe
  Cc: tomaz.solc, aaron.lu, linux-kernel, Oleg Nesterov,
	Greg Kroah-Hartman, Fengguang Wu

Hello, again.

Oops, it was missing CONFIG_FREEZER.  Updated version committed.  How
did we even survive before kbuild test robot?  :)

Thanks.

------ 8< ------
>From 85fbd722ad0f5d64d1ad15888cd1eb2188bfb557 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 18 Dec 2013 07:07:32 -0500
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios.  This is the latest occurrence.

During resume, libata rescans all the ports and revalidates all
pre-existing devices.  If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks.  Unfortunately, this can race with the rest of device
resume.  Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.

839a8e8660b6 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too.  In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.

I believe the right thing to do is getting rid of freezable kthreads
and workqueues.  This is something fundamentally broken.  For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen.  Kernel engineering at its
finest.  :(

v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
    as a module.

v3: Comment updated and polling interval changed to 10ms as suggested
    by Rafael.

v4: Add #ifdef CONFIG_FREEZER around the hack as pm_freezing is not
    defined when FREEZER is not configured thus breaking build.
    Reported by kbuild test robot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
Reviewed-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Cc: kbuild test robot <fengguang.wu@intel.com>
---
 drivers/ata/libata-scsi.c | 21 +++++++++++++++++++++
 kernel/freezer.c          |  6 ++++++
 2 files changed, 27 insertions(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index db6dfcf..176f629 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3871,6 +3871,27 @@ void ata_scsi_hotplug(struct work_struct *work)
 		return;
 	}
 
+	/*
+	 * XXX - UGLY HACK
+	 *
+	 * The block layer suspend/resume path is fundamentally broken due
+	 * to freezable kthreads and workqueue and may deadlock if a block
+	 * device gets removed while resume is in progress.  I don't know
+	 * what the solution is short of removing freezable kthreads and
+	 * workqueues altogether.
+	 *
+	 * The following is an ugly hack to avoid kicking off device
+	 * removal while freezer is active.  This is a joke but does avoid
+	 * this particular deadlock scenario.
+	 *
+	 * https://bugzilla.kernel.org/show_bug.cgi?id=62801
+	 * http://marc.info/?l=linux-kernel&m=138695698516487
+	 */
+#ifdef CONFIG_FREEZER
+	while (pm_freezing)
+		msleep(10);
+#endif
+
 	DPRINTK("ENTER\n");
 	mutex_lock(&ap->scsi_scan_mutex);
 
diff --git a/kernel/freezer.c b/kernel/freezer.c
index b462fa1..aa6a8aa 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -19,6 +19,12 @@ EXPORT_SYMBOL(system_freezing_cnt);
 bool pm_freezing;
 bool pm_nosig_freezing;
 
+/*
+ * Temporary export for the deadlock workaround in ata_scsi_hotplug().
+ * Remove once the hack becomes unnecessary.
+ */
+EXPORT_SYMBOL_GPL(pm_freezing);
+
 /* protects freezing and frozen transitions */
 static DEFINE_SPINLOCK(freezer_lock);
 
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH wq/for-3.14 1/2] workqueue: update max_active clamping rules
  2013-12-18 22:04                           ` Rafael J. Wysocki
@ 2013-12-19 23:35                             ` Tejun Heo
  2013-12-20  1:26                               ` Rafael J. Wysocki
  2013-12-19 23:37                             ` [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active() Tejun Heo
  1 sibling, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2013-12-19 23:35 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman,
	Fengguang Wu, Lai Jiangshan, David Howells

>From bdd220b2a1b86fee14a12b69fb0cadafe60a1dac Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Thu, 19 Dec 2013 18:33:09 -0500

@max_active handling is currently inconsistent.

* In alloc_workqueue(), 0 gets converted to the default and the value
  gets clamped to [1, lim].

* In workqueue_set_max_active(), 0 is not converted to the default and
  the value is clamped to [1, lim].

* When a workqueue is exposed through sysfs, input < 1 fails with
  -EINVAL; otherwise, the value is clamped to [1, lim].

* fscache exposes max_active through a sysctl and clamps the value in
  [1, lim].

We want to be able to express zero @max_active, but as it's a special
case and 0 is already used for the default, we don't want to use 0 for it.
Update @max_active clamping so that the following rules are followed.

* In both alloc_workqueue() and workqueue_set_max_active(), 0 is
  converted to the default, and a new special value WQ_FROZEN_ACTIVE
  becomes 0; otherwise, the value is clamped to [1, lim].

* In both sysfs and fscache sysctl, input < 1 fails with -EINVAL;
  otherwise, the value is clamped to [1, lim].

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: David Howells <dhowells@redhat.com>
---
 fs/fscache/main.c         | 10 +++++++---
 include/linux/workqueue.h |  1 +
 kernel/workqueue.c        |  6 +++++-
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/fscache/main.c b/fs/fscache/main.c
index 7c27907..9d5a716 100644
--- a/fs/fscache/main.c
+++ b/fs/fscache/main.c
@@ -62,9 +62,13 @@ static int fscache_max_active_sysctl(struct ctl_table *table, int write,
 	int ret;
 
 	ret = proc_dointvec(table, write, buffer, lenp, ppos);
-	if (ret == 0)
-		workqueue_set_max_active(*wqp, *datap);
-	return ret;
+	if (ret < 0)
+		return ret;
+	if (*datap < 1)
+		return -EINVAL;
+
+	workqueue_set_max_active(*wqp, *datap);
+	return 0;
 }
 
 ctl_table fscache_sysctls[] = {
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 594521b..334daa3 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -338,6 +338,7 @@ enum {
 	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
 	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
 
+	WQ_FROZEN_ACTIVE	= -1,	  /* special value for frozen wq */
 	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
 	WQ_MAX_UNBOUND_PER_CPU	= 4,	  /* 4 * #cpus for unbound wq */
 	WQ_DFL_ACTIVE		= WQ_MAX_ACTIVE / 2,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 987293d..6748fbf 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4136,6 +4136,11 @@ static int wq_clamp_max_active(int max_active, unsigned int flags,
 {
 	int lim = flags & WQ_UNBOUND ? WQ_UNBOUND_MAX_ACTIVE : WQ_MAX_ACTIVE;
 
+	if (max_active == 0)
+		return WQ_DFL_ACTIVE;
+	if (max_active == WQ_FROZEN_ACTIVE)
+		return 0;
+
 	if (max_active < 1 || max_active > lim)
 		pr_warn("workqueue: max_active %d requested for %s is out of range, clamping between %d and %d\n",
 			max_active, name, 1, lim);
@@ -4176,7 +4181,6 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
 	vsnprintf(wq->name, sizeof(wq->name), fmt, args);
 	va_end(args);
 
-	max_active = max_active ?: WQ_DFL_ACTIVE;
 	max_active = wq_clamp_max_active(max_active, flags, wq->name);
 
 	/* init wq */
-- 
1.8.4.2
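[Editorial note: for reference, the clamping rules introduced by the hunk above can be mirrored in a small standalone model. This is a hypothetical user-space sketch, not the kernel function itself; it drops the pr_warn() on out-of-range values and takes the limit as a parameter instead of deriving it from the WQ_UNBOUND flag.]

```c
#include <assert.h>

/* constants as defined in include/linux/workqueue.h after this patch */
#define WQ_FROZEN_ACTIVE	(-1)	/* special value for frozen wq */
#define WQ_MAX_ACTIVE		512
#define WQ_DFL_ACTIVE		(WQ_MAX_ACTIVE / 2)

/*
 * Standalone mirror of wq_clamp_max_active() with the updated rules:
 * 0 selects the default, WQ_FROZEN_ACTIVE maps to an effective limit
 * of 0, and everything else is clamped to [1, lim].
 */
static int clamp_max_active(int max_active, int lim)
{
	if (max_active == 0)
		return WQ_DFL_ACTIVE;
	if (max_active == WQ_FROZEN_ACTIVE)
		return 0;
	if (max_active < 1)
		return 1;
	if (max_active > lim)
		return lim;
	return max_active;
}
```

With 0 passing straight through to the default, __alloc_workqueue_key() no longer needs its own `max_active ?: WQ_DFL_ACTIVE` conversion, which is why the hunk above deletes that line.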


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active()
  2013-12-18 22:04                           ` Rafael J. Wysocki
  2013-12-19 23:35                             ` [PATCH wq/for-3.14 1/2] workqueue: update max_active clamping rules Tejun Heo
@ 2013-12-19 23:37                             ` Tejun Heo
  2013-12-20  1:31                               ` Rafael J. Wysocki
  1 sibling, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2013-12-19 23:37 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman,
	Fengguang Wu, Lai Jiangshan, David Howells

Hello, Rafael.

If this looks good to you, I'll commit it to wq/for-3.14 and we can at
least start to clean up things.

Thanks!

------- 8< -------
From 6b5182f3f193d0ff9296f53a4665a55b6477aa77 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Thu, 19 Dec 2013 18:33:12 -0500

workqueue_set_max_active() currently doesn't wait for the number of
in-flight work items to fall under the new @max_active.  This patch
adds a @drain parameter to workqueue_set_max_active(); if set, the
function sleeps until nr_active on each pool_workqueue of the target
workqueue drops below the current saved_max_active.

This is planned to replace freezable workqueues.  It has been
determined that kernel freezables - both kthreads and workqueues -
aren't necessary and just add confusion and unnecessary deadlocks.
There are only a handful of users which actually need to stop
processing for
system power events and they can be converted to use
workqueue_set_max_active(WQ_FROZEN_ACTIVE) instead of freezable
workqueues.  Ultimately, all other uses of freezables will be dropped
and the whole system freezer machinery will be excised.  Well, that's
the plan anyway.

The implementation is fairly straightforward.  As this is expected to
be used by only a handful of callers, most likely not concurrently, a
single
wait queue is used.  set_max_active drainers wait on it as necessary
and pwq_dec_nr_in_flight() triggers it if nr_active == max_active
after nr_active is decremented.  This unfortunately adds an unlikely
branch to the work item execution path, but it is extremely unlikely
to be noticeable, and I think it's worthwhile to avoid polling here as
there may be multiple calls to this function in succession during
suspend and some platforms use suspend quite frequently.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/20131218213936.GA8218@mtj.dyndns.org
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
---
 fs/fscache/main.c         |  2 +-
 include/linux/workqueue.h |  2 +-
 kernel/workqueue.c        | 58 ++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/fs/fscache/main.c b/fs/fscache/main.c
index 9d5a716..e2837f2 100644
--- a/fs/fscache/main.c
+++ b/fs/fscache/main.c
@@ -67,7 +67,7 @@ static int fscache_max_active_sysctl(struct ctl_table *table, int write,
 	if (*datap < 1)
 		return -EINVAL;
 
-	workqueue_set_max_active(*wqp, *datap);
+	workqueue_set_max_active(*wqp, *datap, false);
 	return 0;
 }
 
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 334daa3..8b9c628 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -488,7 +488,7 @@ extern bool cancel_delayed_work(struct delayed_work *dwork);
 extern bool cancel_delayed_work_sync(struct delayed_work *dwork);
 
 extern void workqueue_set_max_active(struct workqueue_struct *wq,
-				     int max_active);
+				     int max_active, bool drain);
 extern bool current_is_workqueue_rescuer(void);
 extern bool workqueue_congested(int cpu, struct workqueue_struct *wq);
 extern unsigned int work_busy(struct work_struct *work);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6748fbf..f18c35b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -293,6 +293,12 @@ static DEFINE_SPINLOCK(wq_mayday_lock);	/* protects wq->maydays list */
 static LIST_HEAD(workqueues);		/* PL: list of all workqueues */
 static bool workqueue_freezing;		/* PL: have wqs started freezing? */
 
+/*
+ * Wait for nr_active to drain after max_active adjustment.  This is a cold
+ * path and not expected to have many users.  A global waitq should do.
+ */
+static DECLARE_WAIT_QUEUE_HEAD(wq_nr_active_drain_waitq);
+
 /* the per-cpu worker pools */
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
 				     cpu_worker_pools);
@@ -1123,6 +1129,14 @@ static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color)
 	pwq->nr_in_flight[color]--;
 
 	pwq->nr_active--;
+
+	/*
+	 * This can happen only after max_active is lowered.  Tell the
+	 * waiters that draining might be complete.
+	 */
+	if (unlikely(pwq->nr_active == pwq->max_active))
+		wake_up_all(&wq_nr_active_drain_waitq);
+
 	if (!list_empty(&pwq->delayed_works)) {
 		/* one down, submit a delayed one */
 		if (pwq->nr_active < pwq->max_active)
@@ -3140,7 +3154,7 @@ static ssize_t max_active_store(struct device *dev,
 	if (sscanf(buf, "%d", &val) != 1 || val <= 0)
 		return -EINVAL;
 
-	workqueue_set_max_active(wq, val);
+	workqueue_set_max_active(wq, val, false);
 	return count;
 }
 static DEVICE_ATTR_RW(max_active);
@@ -4339,16 +4353,22 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
  * workqueue_set_max_active - adjust max_active of a workqueue
  * @wq: target workqueue
  * @max_active: new max_active value.
+ * @drain: wait until the actual level of concurrency becomes <= @max_active
  *
- * Set max_active of @wq to @max_active.
+ * Set max_active of @wq to @max_active.  If @drain is true, wait until the
+ * in-flight work items are drained to the new level.
  *
  * CONTEXT:
  * Don't call from IRQ context.
  */
-void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
+void workqueue_set_max_active(struct workqueue_struct *wq, int max_active,
+			      bool drain)
 {
+	DEFINE_WAIT(wait);
 	struct pool_workqueue *pwq;
 
+	might_sleep_if(drain);
+
 	/* disallow meddling with max_active for ordered workqueues */
 	if (WARN_ON(wq->flags & __WQ_ORDERED))
 		return;
@@ -4363,6 +4383,38 @@ void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
 		pwq_adjust_max_active(pwq);
 
 	mutex_unlock(&wq->mutex);
+
+	/*
+	 * If we have increased max_active, pwq_dec_nr_in_flight() might
+	 * not trigger for other instances of this function waiting for
+	 * drain.  Force them to check.
+	 */
+	wake_up_all(&wq_nr_active_drain_waitq);
+
+	if (!drain)
+		return;
+
+	/* let's wait for the drain to complete */
+restart:
+	mutex_lock(&wq->mutex);
+	prepare_to_wait(&wq_nr_active_drain_waitq, &wait, TASK_UNINTERRUPTIBLE);
+
+	for_each_pwq(pwq, wq) {
+		/*
+		 * nr_active should be monotonically decreasing as long as
+		 * it's over max_active, so no need to grab pool lock.
+		 * Also, test against saved_max_active in case multiple
+		 * instances and/or system freezer are racing.
+		 */
+		if (pwq->nr_active > wq->saved_max_active) {
+			mutex_unlock(&wq->mutex);
+			schedule();
+			goto restart;
+		}
+	}
+
+	finish_wait(&wq_nr_active_drain_waitq, &wait);
+	mutex_unlock(&wq->mutex);
 }
 EXPORT_SYMBOL_GPL(workqueue_set_max_active);
 
-- 
1.8.4.2
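[Editorial note: the wake-up condition in the pwq_dec_nr_in_flight() hunk is easy to get wrong, so here is a deliberately simplified, single-threaded model of the bookkeeping. This is a hypothetical sketch; the real patch sleeps on wq_nr_active_drain_waitq in a prepare_to_wait() loop rather than returning a would-block flag.]

```c
#include <assert.h>
#include <stdbool.h>

static int nr_active;		/* in-flight work items */
static int max_active;		/* current limit */
static int wakeups;		/* times the drain waitq would be woken */

/*
 * Mirrors the pwq_dec_nr_in_flight() hunk: waiters are woken only at
 * the moment the count falls back to the (lowered) limit, so the
 * branch stays cold in normal operation.
 */
static void dec_nr_in_flight(void)
{
	nr_active--;
	if (nr_active == max_active)
		wakeups++;
}

/*
 * Mirrors the drain side of workqueue_set_max_active(): after the
 * limit changes, report whether a drain=true caller would still have
 * to sleep waiting for completions.
 */
static bool set_max_active_would_block(int new_max)
{
	max_active = new_max;
	return nr_active > max_active;
}
```

Lowering the limit to 0 with three items in flight leaves the caller blocked until the third completion, and that third completion is the only one that fires a wakeup.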



* Re: Writeback threads and freezable
  2013-12-19 16:24                 ` Tejun Heo
@ 2013-12-20  0:51                   ` Dave Chinner
  2013-12-20 14:51                     ` Tejun Heo
  2013-12-20 14:00                   ` Rafael J. Wysocki
  1 sibling, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2013-12-20  0:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Thu, Dec 19, 2013 at 11:24:11AM -0500, Tejun Heo wrote:
> Yo, Dave.
> 
> On Thu, Dec 19, 2013 at 03:08:21PM +1100, Dave Chinner wrote:
> > > If knowing that the underlying device has gone away somehow helps
> > > filesystem, maybe we can expose that interface and avoid flushing
> > > after hotunplug but that merely hides the possible deadlock scenario
> > > that you're concerned about.  Nothing is really solved.
> > 
> > Except that a user of the block device has been informed that it is
> > now gone and has been freed from under it. i.e. we can *immediately*
> > inform the user that their mounted filesystem is now stuffed and
> > suppress all the errors that are going to occur as a result of
> > sync_filesystem() triggering IO failures all over the place and then
> > having to react to that.
> 
> Please note that there's no real "immediacy" in that it's inherently
> racy and that the extent of the usefulness of such notification can't
> reach much further than suppressing error messages.

So says you. What about btrfs, which might use such a notification
to start replicating metadata and data to other devices to maintain
redundancy levels? Or even XFS, which might have a real-time device
go away - that's not a fatal error for the filesystem, but we sure
as hell want the user to know about it and be able to set a flag
indicating that access to data in real-time files needs to be errored out
at the .aio_write level rather than some time later when writeback
fails with EIO and the application cannot capture the IO error.

Seriously, Tejun, you need to stop focussing on tearing apart
micro-examples and consider the wider context of the issue that the
examples are being used to demonstrate.  The notifications being
talked about here are already being delivered to userspace because
there are OS management applications that need to know about devices
going away. Filesystems are no different - there's all sorts of
optimisations we can make if we get such a notification.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH wq/for-3.14 1/2] workqueue: update max_active clamping rules
  2013-12-19 23:35                             ` [PATCH wq/for-3.14 1/2] workqueue: update max_active clamping rules Tejun Heo
@ 2013-12-20  1:26                               ` Rafael J. Wysocki
  0 siblings, 0 replies; 54+ messages in thread
From: Rafael J. Wysocki @ 2013-12-20  1:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman,
	Fengguang Wu, Lai Jiangshan, David Howells

On Thursday, December 19, 2013 06:35:26 PM Tejun Heo wrote:
> From bdd220b2a1b86fee14a12b69fb0cadafe60a1dac Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Thu, 19 Dec 2013 18:33:09 -0500
> 
> @max_active handling is currently inconsistent.
> 
> * In alloc_workqueue(), 0 gets converted to the default and the value
>   gets clamped to [1, lim].
> 
> * In workqueue_set_max_active(), 0 is not converted to the default and
>   the value is clamped to [1, lim].
> 
> * When a workqueue is exposed through sysfs, input < 1 fails with
>   -EINVAL; otherwise, the value is clamped to [1, lim].
> 
> * fscache exposes max_active through a sysctl and clamps the value in
>   [1, lim].
> 
> We want to be able to express zero @max_active but, as it's a special
> case and 0 is already used for the default, we don't want to use 0 for it.
> Update @max_active clamping so that the following rules are followed.
> 
> * In both alloc_workqueue() and workqueue_set_max_active(), 0 is
>   converted to the default, and a new special value WQ_FROZEN_ACTIVE
>   becomes 0; otherwise, the value is clamped to [1, lim].
> 
> * In both sysfs and fscache sysctl, input < 1 fails with -EINVAL;
>   otherwise, the value is clamped to [1, lim].
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Cc: David Howells <dhowells@redhat.com>

Well, this one looks good to me:

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  fs/fscache/main.c         | 10 +++++++---
>  include/linux/workqueue.h |  1 +
>  kernel/workqueue.c        |  6 +++++-
>  3 files changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/fscache/main.c b/fs/fscache/main.c
> index 7c27907..9d5a716 100644
> --- a/fs/fscache/main.c
> +++ b/fs/fscache/main.c
> @@ -62,9 +62,13 @@ static int fscache_max_active_sysctl(struct ctl_table *table, int write,
>  	int ret;
>  
>  	ret = proc_dointvec(table, write, buffer, lenp, ppos);
> -	if (ret == 0)
> -		workqueue_set_max_active(*wqp, *datap);
> -	return ret;
> +	if (ret < 0)
> +		return ret;
> +	if (*datap < 1)
> +		return -EINVAL;
> +
> +	workqueue_set_max_active(*wqp, *datap);
> +	return 0;
>  }
>  
>  ctl_table fscache_sysctls[] = {
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 594521b..334daa3 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -338,6 +338,7 @@ enum {
>  	__WQ_DRAINING		= 1 << 16, /* internal: workqueue is draining */
>  	__WQ_ORDERED		= 1 << 17, /* internal: workqueue is ordered */
>  
> +	WQ_FROZEN_ACTIVE	= -1,	  /* special value for frozen wq */
>  	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
>  	WQ_MAX_UNBOUND_PER_CPU	= 4,	  /* 4 * #cpus for unbound wq */
>  	WQ_DFL_ACTIVE		= WQ_MAX_ACTIVE / 2,
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 987293d..6748fbf 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4136,6 +4136,11 @@ static int wq_clamp_max_active(int max_active, unsigned int flags,
>  {
>  	int lim = flags & WQ_UNBOUND ? WQ_UNBOUND_MAX_ACTIVE : WQ_MAX_ACTIVE;
>  
> +	if (max_active == 0)
> +		return WQ_DFL_ACTIVE;
> +	if (max_active == WQ_FROZEN_ACTIVE)
> +		return 0;
> +
>  	if (max_active < 1 || max_active > lim)
>  		pr_warn("workqueue: max_active %d requested for %s is out of range, clamping between %d and %d\n",
>  			max_active, name, 1, lim);
> @@ -4176,7 +4181,6 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
>  	vsnprintf(wq->name, sizeof(wq->name), fmt, args);
>  	va_end(args);
>  
> -	max_active = max_active ?: WQ_DFL_ACTIVE;
>  	max_active = wq_clamp_max_active(max_active, flags, wq->name);
>  
>  	/* init wq */
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active()
  2013-12-19 23:37                             ` [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active() Tejun Heo
@ 2013-12-20  1:31                               ` Rafael J. Wysocki
  2013-12-20 13:32                                 ` Tejun Heo
  0 siblings, 1 reply; 54+ messages in thread
From: Rafael J. Wysocki @ 2013-12-20  1:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman,
	Fengguang Wu, Lai Jiangshan, David Howells

On Thursday, December 19, 2013 06:37:20 PM Tejun Heo wrote:
> Hello, Rafael.

Hi,

> If this looks good to you, I'll commit it to wq/for-3.14 and we can at
> least start to clean up things.

Yes, it does.

So with that I need to do workqueue_set_max_active(wq, WQ_FROZEN_ACTIVE) during
suspend and then workqueue_set_max_active(wq, WQ_DFL_ACTIVE) during resume on
my workqueue, right?

Rafael


> ------- 8< -------
> From 6b5182f3f193d0ff9296f53a4665a55b6477aa77 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Thu, 19 Dec 2013 18:33:12 -0500
> 
> workqueue_set_max_active() currently doesn't wait for the number of
> in-flight work items to fall under the new @max_active.  This patch
> adds a @drain parameter to workqueue_set_max_active(); if set, the
> function sleeps until nr_active on each pool_workqueue of the target
> workqueue drops below the current saved_max_active.
> 
> This is planned to replace freezable workqueues.  It has been
> determined that kernel freezables - both kthreads and workqueues -
> aren't necessary and just add confusion and unnecessary deadlocks.
> There are only a handful of users which actually need to stop
> processing for
> system power events and they can be converted to use
> workqueue_set_max_active(WQ_FROZEN_ACTIVE) instead of freezable
> workqueues.  Ultimately, all other uses of freezables will be dropped
> and the whole system freezer machinery will be excised.  Well, that's
> the plan anyway.
> 
> The implementation is fairly straightforward.  As this is expected to
> be used by only a handful of callers, most likely not concurrently, a
> single
> wait queue is used.  set_max_active drainers wait on it as necessary
> and pwq_dec_nr_in_flight() triggers it if nr_active == max_active
> after nr_active is decremented.  This unfortunately adds an unlikely
> branch to the work item execution path, but it is extremely unlikely
> to be noticeable, and I think it's worthwhile to avoid polling here as
> there may be multiple calls to this function in succession during
> suspend and some platforms use suspend quite frequently.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Link: http://lkml.kernel.org/g/20131218213936.GA8218@mtj.dyndns.org
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> ---
>  fs/fscache/main.c         |  2 +-
>  include/linux/workqueue.h |  2 +-
>  kernel/workqueue.c        | 58 ++++++++++++++++++++++++++++++++++++++++++++---
>  3 files changed, 57 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/fscache/main.c b/fs/fscache/main.c
> index 9d5a716..e2837f2 100644
> --- a/fs/fscache/main.c
> +++ b/fs/fscache/main.c
> @@ -67,7 +67,7 @@ static int fscache_max_active_sysctl(struct ctl_table *table, int write,
>  	if (*datap < 1)
>  		return -EINVAL;
>  
> -	workqueue_set_max_active(*wqp, *datap);
> +	workqueue_set_max_active(*wqp, *datap, false);
>  	return 0;
>  }
>  
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 334daa3..8b9c628 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -488,7 +488,7 @@ extern bool cancel_delayed_work(struct delayed_work *dwork);
>  extern bool cancel_delayed_work_sync(struct delayed_work *dwork);
>  
>  extern void workqueue_set_max_active(struct workqueue_struct *wq,
> -				     int max_active);
> +				     int max_active, bool drain);
>  extern bool current_is_workqueue_rescuer(void);
>  extern bool workqueue_congested(int cpu, struct workqueue_struct *wq);
>  extern unsigned int work_busy(struct work_struct *work);
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 6748fbf..f18c35b 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -293,6 +293,12 @@ static DEFINE_SPINLOCK(wq_mayday_lock);	/* protects wq->maydays list */
>  static LIST_HEAD(workqueues);		/* PL: list of all workqueues */
>  static bool workqueue_freezing;		/* PL: have wqs started freezing? */
>  
> +/*
> + * Wait for nr_active to drain after max_active adjustment.  This is a cold
> + * path and not expected to have many users.  A global waitq should do.
> + */
> +static DECLARE_WAIT_QUEUE_HEAD(wq_nr_active_drain_waitq);
> +
>  /* the per-cpu worker pools */
>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
>  				     cpu_worker_pools);
> @@ -1123,6 +1129,14 @@ static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color)
>  	pwq->nr_in_flight[color]--;
>  
>  	pwq->nr_active--;
> +
> +	/*
> +	 * This can happen only after max_active is lowered.  Tell the
> +	 * waiters that draining might be complete.
> +	 */
> +	if (unlikely(pwq->nr_active == pwq->max_active))
> +		wake_up_all(&wq_nr_active_drain_waitq);
> +
>  	if (!list_empty(&pwq->delayed_works)) {
>  		/* one down, submit a delayed one */
>  		if (pwq->nr_active < pwq->max_active)
> @@ -3140,7 +3154,7 @@ static ssize_t max_active_store(struct device *dev,
>  	if (sscanf(buf, "%d", &val) != 1 || val <= 0)
>  		return -EINVAL;
>  
> -	workqueue_set_max_active(wq, val);
> +	workqueue_set_max_active(wq, val, false);
>  	return count;
>  }
>  static DEVICE_ATTR_RW(max_active);
> @@ -4339,16 +4353,22 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
>   * workqueue_set_max_active - adjust max_active of a workqueue
>   * @wq: target workqueue
>   * @max_active: new max_active value.
> + * @drain: wait until the actual level of concurrency becomes <= @max_active
>   *
> - * Set max_active of @wq to @max_active.
> + * Set max_active of @wq to @max_active.  If @drain is true, wait until the
> + * in-flight work items are drained to the new level.
>   *
>   * CONTEXT:
>   * Don't call from IRQ context.
>   */
> -void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
> +void workqueue_set_max_active(struct workqueue_struct *wq, int max_active,
> +			      bool drain)
>  {
> +	DEFINE_WAIT(wait);
>  	struct pool_workqueue *pwq;
>  
> +	might_sleep_if(drain);
> +
>  	/* disallow meddling with max_active for ordered workqueues */
>  	if (WARN_ON(wq->flags & __WQ_ORDERED))
>  		return;
> @@ -4363,6 +4383,38 @@ void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
>  		pwq_adjust_max_active(pwq);
>  
>  	mutex_unlock(&wq->mutex);
> +
> +	/*
> +	 * If we have increased max_active, pwq_dec_nr_in_flight() might
> +	 * not trigger for other instances of this function waiting for
> +	 * drain.  Force them to check.
> +	 */
> +	wake_up_all(&wq_nr_active_drain_waitq);
> +
> +	if (!drain)
> +		return;
> +
> +	/* let's wait for the drain to complete */
> +restart:
> +	mutex_lock(&wq->mutex);
> +	prepare_to_wait(&wq_nr_active_drain_waitq, &wait, TASK_UNINTERRUPTIBLE);
> +
> +	for_each_pwq(pwq, wq) {
> +		/*
> +		 * nr_active should be monotonically decreasing as long as
> +		 * it's over max_active, so no need to grab pool lock.
> +		 * Also, test against saved_max_active in case multiple
> +		 * instances and/or system freezer are racing.
> +		 */
> +		if (pwq->nr_active > wq->saved_max_active) {
> +			mutex_unlock(&wq->mutex);
> +			schedule();
> +			goto restart;
> +		}
> +	}
> +
> +	finish_wait(&wq_nr_active_drain_waitq, &wait);
> +	mutex_unlock(&wq->mutex);
>  }
>  EXPORT_SYMBOL_GPL(workqueue_set_max_active);
>  
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active()
  2013-12-20  1:31                               ` Rafael J. Wysocki
@ 2013-12-20 13:32                                 ` Tejun Heo
  2013-12-20 13:56                                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2013-12-20 13:32 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman,
	Fengguang Wu, Lai Jiangshan, David Howells

Hello, Rafael.

On Fri, Dec 20, 2013 at 02:31:37AM +0100, Rafael J. Wysocki wrote:
> > If this looks good to you, I'll commit it to wq/for-3.14 and we can at
> > least start to clean up things.
> 
> Yes, it does.
> 
> So with that I need to do workqueue_set_max_active(wq, WQ_FROZEN_ACTIVE) during
> suspend and then workqueue_set_max_active(wq, WQ_DFL_ACTIVE) during resume on
> my workqueue, right?

Yeah, you can also do workqueue_set_max_active(wq, 0) during resume to
restore the default value, which is what's used by alloc_workqueue()
anyway.
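
[Editorial note: put together, the driver-side pattern being discussed would look roughly like the sketch below. This is hypothetical driver code against a stub; workqueue_set_max_active() with the @drain argument only exists with these two patches applied, and per Tejun's follow-up later in this thread the suspend-side call wants drain=true.]

```c
#include <assert.h>
#include <stdbool.h>

#define WQ_FROZEN_ACTIVE	(-1)	/* from patch 1/2 */

struct workqueue_struct { int unused; };

/* stub standing in for the patched kernel API; records its arguments */
static int last_max_active;
static bool last_drain;
static void workqueue_set_max_active(struct workqueue_struct *wq,
				     int max_active, bool drain)
{
	(void)wq;
	last_max_active = max_active;
	last_drain = drain;
}

static struct workqueue_struct my_wq;

/* suspend: freeze the workqueue and wait for in-flight items to drain */
static void my_driver_suspend(void)
{
	workqueue_set_max_active(&my_wq, WQ_FROZEN_ACTIVE, true);
}

/* resume: 0 restores the default max_active, no draining needed */
static void my_driver_resume(void)
{
	workqueue_set_max_active(&my_wq, 0, false);
}
```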

I'll apply the two patches to wq/for-3.14 once David reviews them or
if you wanna take them through the pm tree, that's fine too.  I don't
have anything scheduled for wq for the next merge window.

Thanks.

-- 
tejun


* Re: [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active()
  2013-12-20 13:32                                 ` Tejun Heo
@ 2013-12-20 13:56                                   ` Rafael J. Wysocki
  2013-12-20 14:23                                     ` Tejun Heo
  0 siblings, 1 reply; 54+ messages in thread
From: Rafael J. Wysocki @ 2013-12-20 13:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman,
	Fengguang Wu, Lai Jiangshan, David Howells

On Friday, December 20, 2013 08:32:20 AM Tejun Heo wrote:
> Hello, Rafael.
> 
> On Fri, Dec 20, 2013 at 02:31:37AM +0100, Rafael J. Wysocki wrote:
> > > If this looks good to you, I'll commit it to wq/for-3.14 and we can at
> > > least start to clean up things.
> > 
> > Yes, it does.
> > 
> > So with that I need to do workqueue_set_max_active(wq, WQ_FROZEN_ACTIVE) during
> > suspend and then workqueue_set_max_active(wq, WQ_DFL_ACTIVE) during resume on
> > my workqueue, right?
> 
> Yeah, you can also do workqueue_set_max_active(wq, 0) during resume to
> restore the default value, which is what's used by alloc_workqueue()
> anyway.

OK

> I'll apply the two patches to wq/for-3.14 once David reviews them or
> if you wanna take them through the pm tree, that's fine too.  I don't
> have anything scheduled for wq for the next merge window.

I have some material queued up, so I can take them once they've been reviewed.

Thanks,
Rafael



* Re: Writeback threads and freezable
  2013-12-19 16:24                 ` Tejun Heo
  2013-12-20  0:51                   ` Dave Chinner
@ 2013-12-20 14:00                   ` Rafael J. Wysocki
  1 sibling, 0 replies; 54+ messages in thread
From: Rafael J. Wysocki @ 2013-12-20 14:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Dave Chinner, Jens Axboe, tomaz.solc, aaron.lu, linux-kernel,
	Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

On Thursday, December 19, 2013 11:24:11 AM Tejun Heo wrote:
> Yo, Dave.
> 
> On Thu, Dec 19, 2013 at 03:08:21PM +1100, Dave Chinner wrote:
> > > If knowing that the underlying device has gone away somehow helps
> > > filesystem, maybe we can expose that interface and avoid flushing
> > > after hotunplug but that merely hides the possible deadlock scenario
> > > that you're concerned about.  Nothing is really solved.
> > 
> > Except that a user of the block device has been informed that it is
> > now gone and has been freed from under it. i.e. we can *immediately*
> > inform the user that their mounted filesystem is now stuffed and
> > suppress all the errors that are going to occur as a result of
> > sync_filesystem() triggering IO failures all over the place and then
> > having to react to that.
> 
> Please note that there's no real "immediacy" in that it's inherently
> racy and that the extent of the usefulness of such notification can't
> reach much further than suppressing error messages.  Even that benefit
> is kinda dubious.  Don't we want to generate errors when a device is
> removed while dirty data / IOs are pending on it?  I fail to see how
> "supressing all the errors" would be a sane thing to do.
> 
> Another thing is that I think it's actually healthier in terms of
> excercise of code paths to travel those error paths on hot unplugs
> which are relatively common than taking a different behavior on them.
> It'll inevitably lower our test coverage.
> 
> > Indeed, there is no guarantee that sync_filesystem will result in
> > the filesystem being shut down - if the filesystem is clean then
> > nothing will happen, and it won't be until the user modifies some
> > metadata that a shutdown will be triggered. That could be a long
> > time after the device has been removed....
> 
> I still fail to see that why that is a problem.  Filesystems should be
> able to handle hot unplug or IO failures at any point in a reasonable
> way, so what difference would having a notification make other than
> introducing yet another exception code path?
> 
> > I don't see that there is a difference between a warm and hot unplug
> > from a filesystem point of view - both result in the filesystem's
> > backing device being deleted and freed, and in both cases we have to
> > take the same action....
> 
> Yeah, exactly, so what'd be the point of getting separate notification
> for hot unplug events?
> 
> > > Do you mean xfs never gives up after IO failures?
> > 
> > There's this thing called a transient IO failure which we have to
> > handle. e.g multipath taking several minutes to detect a path
> > failure and fail over, whilst in the mean time IO errors are
> > reported after a 30s timeout. So some types of async metadata write
> > IO failures are simply rescheduled for a short time in the future.
> > They'll either succeed, or continual failure will eventually trigger
> > some kind of filesystem failure.
> > 
> > If it's a synchronous write or a write that we cannot tolerate even
> > transient errors on (e.g. journal writes), then we'll shut down the
> > filesystem immediately.
> 
> Sure, filesystems should (be able to) react to different types of
> errors in different ways.  We still have a long way to go to do that
> properly but that should be done through IO failures not some side
> channel one-off "hotunplug" happened call.  Again, it doesn't solve
> anything.  It just side steps one very specific case in a half-assed
> way.

Another problem is distinguishing "hotunplug" from "power failure", for
example, because it may not be entirely clear what happened to start with
until well after we get a "these devices are gone" notification from the
platform firmware.

So even if something may be regarded as a "hotunplug" from the order of the
Universe perspective, it still may look like a regular IO error at the device
level until device_del() is called for that device (which may not happen for
quite a while).

[That said, there are situations in which we may want to wait until it is
"safe" to eject stuff, or even we may want to allow subsystems to fail
"offline" requests, like in the memory eject case.  In those cases the
hot-removal actually consists of two parts, an offline and a removal,
where devices are supposed to be not in use after offline (which may fail).
But that is a different story. :-)]

Thanks,
Rafael


* Re: [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active()
  2013-12-20 13:56                                   ` Rafael J. Wysocki
@ 2013-12-20 14:23                                     ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-20 14:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Rafael J. Wysocki, Jens Axboe, tomaz.solc,
	aaron.lu, linux-kernel, Oleg Nesterov, Greg Kroah-Hartman,
	Fengguang Wu, Lai Jiangshan, David Howells

On Fri, Dec 20, 2013 at 02:56:09PM +0100, Rafael J. Wysocki wrote:
> On Friday, December 20, 2013 08:32:20 AM Tejun Heo wrote:
> > Hello, Rafael.
> > 
> > On Fri, Dec 20, 2013 at 02:31:37AM +0100, Rafael J. Wysocki wrote:
> > > > If this looks good to you, I'll commit it to wq/for-3.14 and we can at
> > > > least start to clean up things.
> > > 
> > > Yes, it does.
> > > 
> > > So with that I need to do workqueue_set_max_active(wq, WQ_FROZEN_ACTIVE) during
> > > suspend and then workqueue_set_max_active(wq, WQ_DFL_ACTIVE) during resume on
> > > my workqueue, right?
> > 
> > Yeah, you can also do workqueue_set_max_active(wq, 0) during resume to
> > restore the default value, which is what's used by alloc_workqueue()
> > anyway.

Oops, it should be obvious but just in case.  The call to use during
suspend is workqueue_set_max_active(wq, WQ_FROZEN_ACTIVE, true). :)

-- 
tejun


* Re: Writeback threads and freezable
  2013-12-20  0:51                   ` Dave Chinner
@ 2013-12-20 14:51                     ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-20 14:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rafael J. Wysocki, Jens Axboe, tomaz.solc, aaron.lu,
	linux-kernel, Oleg Nesterov, Greg Kroah-Hartman, Fengguang Wu

Hello, Dave.

On Fri, Dec 20, 2013 at 11:51:14AM +1100, Dave Chinner wrote:
> > Please note that there's no real "immediacy" in that it's inherently
> > racy and that the extent of the usefulness of such notification can't
> > reach much further than suppressing error messages.
> 
> So says you. What about btrfs, which might use such a notification
> to start replicating metadata and data to other devices to maintain
> redundancy levels? Or even XFS, which might have a real-time device
> go away - that's not a fatal error for the filesystem, but we sure
> as hell want the user to know about it and be able to set a flag
> indicating access to data in real-time files need to be errored out
> at the .aio_write level rather than some time later when writeback
> fails with EIO and the application cannot capture the IO error.

Hmmm... but shouldn't such facilities also kick in on other fatal
failure conditions?  I have no fundamental objection against adding
some way to distinguish / notify hotunplug events, but I am a bit
worried that it has the potential to branch out the hotunplug path
unnecessarily, and that tendency has been amply on display in this
thread too.

As I've pointed out multiple times in the thread, the hotunplug path
serves very well as the continual testbed for catastrophic failure
cases for the whole stack, from filesystems all the way down to
low-level drivers, and it accounts for a high ratio of the error
handling improvements / bugfixes we've made over the years.  So even
if we add something to distinguish hotunplug for upper layers, it
should be something which at least doesn't encourage implementing a
completely separate path, and preferably something more generic.

> Seriously, Tejun, you need to stop focussing on tearing apart
> micro-examples and consider the wider context of the issue that the
> examples are being used to demonstrate.  The notifications being

Dear Dave, would you please drop the attitude?  You've been
consistently wrong in this thread.  Being aggressively assertive is
okay.  Being wrong is okay too.  Continuing the combination of the two
is not.  That's actively harmful, especially when the one doing so
commands respect in the community - it has a high chance of derailing
the discussion and/or leading to the wrong conclusions.  Should I
reiterate the confusion you showed at each turn of this very thread?

Let's please stay technical.  You apparently didn't have much idea how
things work in the lower layers, which is completely fine.  Just don't
over-extend your assertiveness without substance.  If you had just
made your technical input from your side of the stack from the
get-go, we could have been a lot more productive.

I still think there's something to be extracted from this thread - how
we should handle catastrophic failure scenarios, and the shortcomings
of the current shutdown procedure, both of which are far from perfect.
I'm not yet convinced by the idea of handling hot-unplug as something
completely special, for the reasons I've stated multiple times now,
and I'm curious about the specifics of what you have in mind because I
don't really know what would be convenient from the filesystem side.

> talked about here are already being delivered to userspace because
> there are OS management applications that need to know about devices
> going away. Filesystems are no different - there's all sorts of
> optimisations we can make if we get such a notification.

I'm not sure that analogy is too relevant.  The notifications
delivered to userland aren't different for hot or warm unplugs, and
they don't participate in any of the exception handling details, which
is what I'm really worried about.  Again, I'm not necessarily against
the idea of flagging / notifying upper layers, but please do keep in
mind that hotunplug handling shouldn't deviate much from other parts
of exception handling, which, at least from the discussions we had in
this thread, doesn't seem to be what you had in mind.  There are other
exception cases which share many, if not most, characteristics, making
such an approach a natural fit, and hotunplug is one of the best tools
we have for ensuring those paths stay in a healthy shape.

Thanks.

-- 
tejun


end of thread, other threads:[~2013-12-20 14:51 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-13 17:49 Writeback threads and freezable Tejun Heo
2013-12-13 18:52 ` Tejun Heo
2013-12-13 20:40   ` [PATCH] libata, freezer: avoid block device removal while system is frozen Tejun Heo
2013-12-13 22:45     ` Nigel Cunningham
2013-12-13 23:07       ` Tejun Heo
2013-12-13 23:15         ` Nigel Cunningham
2013-12-14  1:55           ` Dave Chinner
2013-12-14 20:31           ` Tejun Heo
2013-12-14 20:36             ` Tejun Heo
2013-12-14 21:21               ` Nigel Cunningham
2013-12-17  2:35                 ` Rafael J. Wysocki
2013-12-17  2:34             ` Rafael J. Wysocki
2013-12-17 12:34               ` Tejun Heo
2013-12-18  0:35                 ` Rafael J. Wysocki
2013-12-18 11:17                   ` Tejun Heo
2013-12-18 21:48                     ` Rafael J. Wysocki
2013-12-18 21:39                       ` Tejun Heo
2013-12-18 21:41                         ` Tejun Heo
2013-12-18 22:04                           ` Rafael J. Wysocki
2013-12-19 23:35                             ` [PATCH wq/for-3.14 1/2] workqueue: update max_active clamping rules Tejun Heo
2013-12-20  1:26                               ` Rafael J. Wysocki
2013-12-19 23:37                             ` [PATCH wq/for-3.14 2/2] workqueue: implement @drain for workqueue_set_max_active() Tejun Heo
2013-12-20  1:31                               ` Rafael J. Wysocki
2013-12-20 13:32                                 ` Tejun Heo
2013-12-20 13:56                                   ` Rafael J. Wysocki
2013-12-20 14:23                                     ` Tejun Heo
2013-12-16 12:12         ` [PATCH] libata, freezer: avoid block device removal while system is frozen Ming Lei
2013-12-16 12:45           ` Tejun Heo
2013-12-16 13:24             ` Ming Lei
2013-12-16 16:05               ` Tejun Heo
2013-12-17  2:38     ` Rafael J. Wysocki
2013-12-17 12:36       ` Tejun Heo
2013-12-18  0:23         ` Rafael J. Wysocki
2013-12-17 12:50     ` [PATCH v2] " Tejun Heo
2013-12-18  1:04       ` Rafael J. Wysocki
2013-12-18 11:08         ` Tejun Heo
2013-12-18 12:07       ` [PATCH v3] " Tejun Heo
2013-12-18 22:08         ` Rafael J. Wysocki
2013-12-19 17:24           ` Tejun Heo
2013-12-19 18:54         ` [PATCH v4] " Tejun Heo
2013-12-14  1:53 ` Writeback threads and freezable Dave Chinner
2013-12-14 17:30   ` Greg Kroah-Hartman
2013-12-14 20:23   ` Tejun Heo
2013-12-16  3:56     ` Dave Chinner
2013-12-16 12:51       ` Tejun Heo
2013-12-16 12:56         ` Tejun Heo
2013-12-18  0:35           ` Dave Chinner
2013-12-18 11:43             ` Tejun Heo
2013-12-18 22:14               ` Rafael J. Wysocki
2013-12-19  4:08               ` Dave Chinner
2013-12-19 16:24                 ` Tejun Heo
2013-12-20  0:51                   ` Dave Chinner
2013-12-20 14:51                     ` Tejun Heo
2013-12-20 14:00                   ` Rafael J. Wysocki
