linux-kernel.vger.kernel.org archive mirror
From: Dave Chinner <david@fromorbit.com>
To: Tejun Heo <tj@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>, Jens Axboe <axboe@kernel.dk>,
	tomaz.solc@tablix.org, aaron.lu@intel.com,
	linux-kernel@vger.kernel.org, Oleg Nesterov <oleg@redhat.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Fengguang Wu <fengguang.wu@intel.com>
Subject: Re: Writeback threads and freezable
Date: Thu, 19 Dec 2013 15:08:21 +1100
Message-ID: <20131219040821.GW31386@dastard>
In-Reply-To: <20131218114343.GA4324@htj.dyndns.org>

On Wed, Dec 18, 2013 at 06:43:43AM -0500, Tejun Heo wrote:
> Hello, Dave.
> 
> On Wed, Dec 18, 2013 at 11:35:10AM +1100, Dave Chinner wrote:
> > Perhaps the function "invalidate_partition()" is badly named. To
> > state the obvious, fsync != invalidation. What it does is:
> > 
> > 	1. sync filesystem
> > 	2. shrink the dcache
> > 	3. invalidates inodes and kills dirty inodes
> > 	4. invalidates block device (removes cached bdev pages)
> > 
> > Basically, the first step is "flush", the remainder is "invalidate".
> > 
> > Indeed, step 3 throws away dirty inodes, so why on earth would we
> > even bother with step 1 to try to clean them in the first place?
> > IOWs, the flush is irrelevant in the hot-unplug case as it will
> > fail to flush stuff, and then we just throw the stuff we
> > failed to write away.
> >
> > But in attempting to flush all the dirty data and metadata, we can
> > cause all sorts of other potential re-entrancy based deadlocks due
> > to attempting to issue IO. Whether they be freezer based or through
> > IO error handling triggering device removal or some other means, it
> > is irrelevant - it is the flush that causes all the problems.
> 
> Isn't the root cause there hotunplug reentering anything above it in
> the first place?  The problem with your proposal is that the filesystem
> isn't the only place where this could happen.  Even with no filesystem
> involved, block device could still be dirty and IOs pending in
> whatever form - dirty bdi, bios queued in dm, requests queued in
> request_queue, whatever really - and if the hotunplug path reenters
> any of the higher layers in a way which blocks IO processing, it will
> deadlock.

Entirely possible.

> If knowing that the underlying device has gone away somehow helps
> filesystem, maybe we can expose that interface and avoid flushing
> after hotunplug but that merely hides the possible deadlock scenario
> that you're concerned about.  Nothing is really solved.

Except that a user of the block device has been informed that it is
now gone and has been freed from under it. i.e. we can *immediately*
inform the user that their mounted filesystem is now stuffed and
suppress all the errors that are going to occur as a result of
sync_filesystem() triggering IO failures all over the place and then
having to react to that.

Indeed, there is no guarantee that sync_filesystem will result in
the filesystem being shut down - if the filesystem is clean then
nothing will happen, and it won't be until the user modifies some
metadata that a shutdown will be triggered. That could be a long
time after the device has been removed....
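
To illustrate the difference, here is a toy user-space model, not
kernel code. Every name in it (toy_fs, toy_sync_filesystem,
toy_shutdown_callout) is hypothetical, and treating a single failed
flush as an immediate shutdown is a simplifying assumption. It only
shows that a flush on a clean filesystem issues no IO and so never
notices the missing device, whereas a direct callout does:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a mounted filesystem; all names are hypothetical. */
struct toy_fs {
	bool device_present;	/* false once the device is unplugged */
	int  dirty_objects;	/* dirty data/metadata awaiting writeback */
	bool shut_down;		/* true once the fs has stopped issuing IO */
};

/* Flush-based path: IO is only issued when something is dirty. */
static void toy_sync_filesystem(struct toy_fs *fs)
{
	assert(fs);
	if (fs->dirty_objects == 0)
		return;			/* clean: no IO, so no error is seen */
	if (!fs->device_present) {
		fs->shut_down = true;	/* assumed: failed IO forces shutdown */
		return;
	}
	fs->dirty_objects = 0;		/* writeback succeeded */
}

/* Callout-based path: the filesystem is told the device is gone. */
static void toy_shutdown_callout(struct toy_fs *fs)
{
	assert(fs);
	fs->shut_down = true;
	fs->dirty_objects = 0;		/* cannot be written back; discard */
}
```

With a clean toy_fs whose device is gone, toy_sync_filesystem() leaves
shut_down false; only toy_shutdown_callout() marks it shut down
straight away.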

> We can try to do the same thing at each layer and implement quick exit
> path for hot unplug all the way down to the driver but that kinda
> sounds complex and fragile to me.  It's a lot larger surface to cover
> when the root cause is hotunplug allowed to reenter anything at all
> from IO path.  This is especially true because hotunplug can trivially
> be made fully asynchronous in most cases.  In terms of destruction of
> higher level objects, warm and hot unplugs can and should behave
> identically.

I don't see that there is a difference between a warm and hot unplug
from a filesystem point of view - both result in the filesystem's
backing device being deleted and freed, and in both cases we have to
take the same action....

> > We need to either get rid of the flush on device failure/hot-unplug,
> > or turn it into a callout for the superblock to take an appropriate
> > action (e.g. shutting down the filesystem) rather than trying to
> > issue IO. i.e. allow the filesystem to take appropriate action of
> > shutting down the filesystem and invalidating its caches.
> 
> There could be cases where some optimizations for hot unplug could be
> useful.  Maybe suppressing pointless duplicate warning messages or
> whatnot but I'm highly doubtful anything will be actually fixed that
> way.  We'll be most likely making bugs just less reproducible.
> 
> > Indeed, in XFS there are several other caches that could contain dirty
> > metadata that isn't invalidated by invalidate_partition(), and so
> > unless the filesystem is shut down it can continue to try to issue
> > IO on those buffers to the removed device until the filesystem is
> > shut down or unmounted.
> 
> Do you mean xfs never gives up after IO failures?

There's this thing called a transient IO failure which we have to
handle, e.g. multipath taking several minutes to detect a path
failure and fail over, whilst in the meantime IO errors are
reported after a 30s timeout. So some types of async metadata write
IO failures are simply rescheduled for a short time in the future.
They'll either succeed, or continual failure will eventually trigger
some kind of filesystem failure.

If it's a synchronous write or a write that we cannot tolerate even
transient errors on (e.g. journal writes), then we'll shut down the
filesystem immediately.
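
Roughly, that policy can be sketched as below. This is a minimal
illustration and not the actual XFS code: the enum names, the
on_write_error() helper and the MAX_RETRIES limit are all assumptions
made up for the example.

```c
#include <assert.h>

/* Hypothetical write classes; these are not XFS's actual types. */
enum write_kind { WRITE_ASYNC_METADATA, WRITE_SYNC, WRITE_JOURNAL };

/* What to do after a write has failed. */
enum io_action { IO_RETRY_LATER, IO_SHUTDOWN };

#define MAX_RETRIES 5	/* assumed limit, purely illustrative */

/*
 * Decide how to react to a failed write. 'failures' counts the
 * failures seen so far for this write, including the current one.
 */
static enum io_action on_write_error(enum write_kind kind, int failures)
{
	assert(failures >= 1);

	/* Sync and journal writes cannot tolerate even transient errors. */
	if (kind != WRITE_ASYNC_METADATA)
		return IO_SHUTDOWN;

	/* Async metadata: reschedule until failures become continual. */
	return failures <= MAX_RETRIES ? IO_RETRY_LATER : IO_SHUTDOWN;
}
```

The point of the retry branch is that a 30s timeout during a multipath
failover looks exactly like a dead device from here, so the first few
async failures cannot be treated as fatal.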

> > Seriously, Tejun, the assumption that invalidate_partition() knows
> > enough about filesystems to safely "invalidate" them is just broken.
> > These days, filesystems often reference multiple block devices, and
> > so the way hotplug currently treats them as "one device, one
> > filesystem" is also fundamentally wrong.
> > 
> > So there are many ways in which the hot-unplug code is broken in its
> > use of invalidate_partition(), the least of which is the
> > dependencies caused by re-entrancy. We really need a
> > "sb->shutdown()" style callout as step one in the above process, not
> > fsync_bdev().
> 
> If filesystems need an indication that the underlying device is no
> longer functional, please go ahead and add it, but please keep in mind
> all these are completely asynchronous.  Nothing guarantees you that
> such events would happen in any specific order.  IOW, you can be at
> *ANY* point in your warm unplug path and the device is hot unplugged,
> which essentially forces all the code paths to be ready for the worst,
> and that's exactly why there isn't much effort in trying to separate
> out warm and hot unplug paths.

I'm not concerned about the problems that might happen if you hot
unplug during a warm unplug. All I care about is when a device is
invalidated the filesystem on top of it can take appropriate action.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
