From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752972Ab3LRAfu (ORCPT ); Tue, 17 Dec 2013 19:35:50 -0500 Received: from ipmail05.adl6.internode.on.net ([150.101.137.143]:13551 "EHLO ipmail05.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751810Ab3LRAft (ORCPT ); Tue, 17 Dec 2013 19:35:49 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AlMIAPbssFJ5LHyk/2dsb2JhbABZgwqDPLBthVCBHxd0giUBAQEEJxMcIxAIAw4KCSUPBSUDIROIA8llFxaOfAeDI4ETBJgVkhWDPyiBLCM Date: Wed, 18 Dec 2013 11:35:10 +1100 From: Dave Chinner To: Tejun Heo Cc: "Rafael J. Wysocki" , Jens Axboe , tomaz.solc@tablix.org, aaron.lu@intel.com, linux-kernel@vger.kernel.org, Oleg Nesterov , Greg Kroah-Hartman , Fengguang Wu Subject: Re: Writeback threads and freezable Message-ID: <20131218003510.GD20579@dastard> References: <20131213174932.GA27070@htj.dyndns.org> <20131214015343.GP31386@dastard> <20131214202324.GA4020@htj.dyndns.org> <20131216035652.GY31386@dastard> <20131216125124.GC32509@htj.dyndns.org> <20131216125604.GD32509@htj.dyndns.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131216125604.GD32509@htj.dyndns.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Dec 16, 2013 at 07:56:04AM -0500, Tejun Heo wrote: > On Mon, Dec 16, 2013 at 07:51:24AM -0500, Tejun Heo wrote: > > On Mon, Dec 16, 2013 at 02:56:52PM +1100, Dave Chinner wrote: > > > > What are you suggesting? Implementing separate warm and hot unplug > > > > paths? That makes no sense whatsoever. Device hot unplug is just a > > > > sub operation of general device unplug which should be able to succeed > > > > whether the underlying device is failing IOs or not. > > > > > > I don't care. Trying to issue IO from an an IO error handling path > > > where the device has just been removed is fundamentally broken. > > > > What? Have you even read the original message? IO error handling > > path isn't issuing the IO here. The hot unplug operation is > > completely asynchronous to the IO path. What's dead locking is not > > the filesystem and IO path but device driver layer and hot unplug > > path. IOs are not stalled. > > In fact, this deadlock can be reproduced without hotunplug at all. If > you initiate warm unplug and warm unplugging races with suspend/resume > cycle, it'll behave exactly the same - the IOs from flushing in that > scenario would succeed but IOs failing or not has *NOTHING* to do with > this deadlock. It'd still happen. It's freezer behaving as a giant > lock and other locks of course getting dragged into giant dependency > loop. Can you please at least *try* to understand what's going on > before throwing strong assertions? Sure I understand the problem. Part of that dependency loop is caused by filesystem re-entrancy from the hot unplug code. Regardless of this /specific freezer problem/, that filesystem re-entrancy path is *never OK* during a hot-unplug event and it needs to be fixed. Perhaps the function "invalidate_partition()" is badly named. To state the obvious, fsync != invalidation. What it does is: 1. sync filesystem 2. shrink the dcache 3. invalidates inodes and kills dirty inodes 4. invalidates block device (removes cached bdev pages) Basically, the first step is "flush", the remainder is "invalidate". Indeed, step 3 throws away dirty inodes, so why on earth would we even bother with step 1 to try to clean them in the first place? IOWs, the flush is irrelevant in the hot-unplug case as it will fail to flush stuff, and then we just throw the stuff we failed to write away. But in attempting to flush all the dirty data and metadata, we can cause all sorts of other potential re-entrancy based deadlocks due to attempting to issue IO. Whether they be freezer based or through IO error handling triggering device removal or some other means, it is irrelevant - it is the flush that causes all the problems. We need to either get rid of the flush on device failure/hot-unplug, or turn it into a callout for the superblock to take an appropriate action (e.g. shutting down the filesystem) rather than trying to issue IO. i.e. allow the filesystem to take appropriate action of shutting down the filesystem and invalidating it's caches. Indeed, in XFS there's several other caches that could contain dirty metadata that isn't invalidated by invalidate_partition(), and so unless the filesystem is shut down it can continue to try to issue IO on those buffers to the removed device until the filesystem is shutdown or unmounted. Seriously, Tejun, the assumption that invalidate_partition() knows enough about filesystems to safely "invalidate" them is just broken. These days, filesystems often reference multiple block devices, and so the way hotplug currently treats them as "one device, one filesystem" is also fundamentally wrong. So there's many ways in which the hot-unplug code is broken in it's use of invalidate_partition(), the least of which is the dependencies caused by re-entrancy. We really need a "sb->shutdown()" style callout as step one in the above process, not fsync_bdev(). Cheers, Dave. -- Dave Chinner david@fromorbit.com