Date: Thu, 12 Apr 2018 01:34:45 -0400
From: "Theodore Y. Ts'o"
To: Andres Freund
Cc: Andreas Dilger, 20180410184356.GD3563@thunk.org,
    Ext4 Developers List, Linux FS Devel,
    Jeff Layton, "Joshua D. Drake"
Subject: Re: fsync() errors is unsafe and risks data loss
Message-ID: <20180412053445.GP2801@thunk.org>
References: <20180410220726.vunhvwuzxi5bm6e5@alap3.anarazel.de>
 <190CF56C-C03D-4504-8B35-5DB479801513@dilger.ca>
 <20180412021752.2wykkutkmzh4ikbf@alap3.anarazel.de>
In-Reply-To: <20180412021752.2wykkutkmzh4ikbf@alap3.anarazel.de>

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:
> > If there is background data writeback *without an open file descriptor*,
> > there is no mechanism for the kernel to return an error to any application
> > which may exist, or may not ever come back.
>
> And that's *horrible*. If I cp a file, and writeback fails in the
> background, and I then cat that file before restarting, I should be able
> to see that that failed. Instead of returning something bogus.

If there is no open file descriptor, and in many cases no process
(because it has already exited), it may be horrible, but what the h*ll
else do you expect the OS to do?

The solution we use at Google is that we watch for I/O errors using a
completely different process that is responsible for monitoring machine
health.  It used to scrape dmesg, but we now arrange to have I/O errors
sent via a netlink channel to the machine health monitoring daemon.  If
it detects errors on a particular hard drive, it tells the cluster file
system to stop using that disk, and to reconstruct from erasure code all
of the data chunks on that disk onto other disks in the cluster.  We
then run a series of disk diagnostics to make sure we find all of the
bad sectors (very often, where there is one bad sector, there are
several more waiting to be found), and then afterwards put the disk back
into service.  (A rough, illustrative sketch of a dmesg-scraping monitor
of this sort appears below, after the quoted text.)

By making it a separate health monitoring process, we can have HDD
experts write much more sophisticated code that can ask the disk
firmware for more information (e.g., SMART, the grown defect list), do
much more careful scrubbing of the disk media, etc., before returning
the disk to service.

> > Everyone already has too much work to do, so you need to find someone
> > who has an interest in fixing this (IMHO very peculiar) use case.  If
> > PG developers want to add a tunable "keep dirty pages in RAM on IO
> > failure", I don't think that it would be too hard for someone to do.
> > It might be harder to convince some of the kernel maintainers to
> > accept it, and I've been on the losing side of that battle more than
> > once.  However, like everything you don't pay for, you can't require
> > someone else to do this for you.  It wouldn't hurt to see if Jeff
> > Layton, who wrote the errseq patches, would be interested to work on
> > something like this.
>
> I don't think this is that PG specific, as explained above.
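To make the shape of that approach concrete, here is a minimal,
illustrative sketch of an out-of-band monitor in the spirit of the older
dmesg-scraping scheme (not the netlink channel, which is internal).
None of this is the actual Google implementation: the error-message
regex, the drain_disk() hook, and the poll interval are assumptions made
up for the example.

#!/usr/bin/env python3
# Illustrative sketch of an out-of-band disk health monitor that
# scrapes the kernel log for I/O errors.  The regex, the drain_disk()
# hook, and the poll interval are assumptions for the example only.
import re
import subprocess
import time

# Kernel I/O error lines typically look like:
#   blk_update_request: I/O error, dev sdb, sector 123456
IO_ERROR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")

def drain_disk(dev):
    # Placeholder: tell the cluster file system to stop using this disk
    # and reconstruct its data chunks (from erasure code) on other disks.
    print(f"draining /dev/{dev}: rebuild its chunks elsewhere")

def poll_dmesg(already_failed):
    # Scrape the current kernel log; a real monitor would track offsets
    # or read a structured source instead of re-reading everything.
    out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    for m in IO_ERROR_RE.finditer(out):
        dev = m.group(1)
        if dev not in already_failed:
            already_failed.add(dev)
            drain_disk(dev)

if __name__ == "__main__":
    failed = set()
    while True:
        poll_dmesg(failed)
        time.sleep(60)   # arbitrary poll interval for the sketch

The point of the sketch is only where the logic lives: outside the
application, in a machine-level monitor that can take a whole disk out
of service.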
The reality is that recovering from disk errors is tricky business, and
I very much doubt most userspace applications, including distro package
managers, are going to want to engineer for trying to detect and recover
from disk errors.  If that were true, then Red Hat and/or SuSE, which
have kernel engineers, would have implemented everything on your wish
list.  They haven't, and that should tell you something.

The other reality is that once a disk starts developing errors, you will
probably need to take the disk off-line, scrub it to find any other
media errors, and there's a good chance you'll need to rewrite bad
sectors (including some which are on top of file system metadata, so you
will probably have to run fsck or reformat the whole file system).  I
certainly don't think it's realistic to expect each and every userspace
program to add that kind of sophistication.  If you have tens or
hundreds of thousands of disk drives, you will need to do something
automated, but I claim that you really don't want to smush all of that
detailed exception handling and HDD repair technology into each database
or cluster file system component.  It really needs to be done in a
separate health-monitor and machine-level management system.

Regards,

						- Ted
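P.S.  To make the off-line scrub step above a bit more concrete, here is
a rough, illustrative sketch of what such a pass might look like using
standard tools (smartctl from smartmontools, badblocks from e2fsprogs).
The default device, the output parsing, and the pass/fail decision are
simplifying assumptions for the example, not a description of any real
health-monitor.

#!/usr/bin/env python3
# Illustrative sketch: the kind of off-line scrub a separate health
# monitor might run once a disk has been drained.  Uses smartctl
# (smartmontools) and badblocks (e2fsprogs); must run as root.
import subprocess
import sys

def run(cmd):
    print("+", " ".join(cmd))
    return subprocess.run(cmd, capture_output=True, text=True)

def scrub(dev):
    # Ask the drive to run its own extended (long) self-test, which
    # reads every sector and updates the SMART self-test log.
    run(["smartctl", "-t", "long", dev])

    # Host-side read-only surface scan; bad block numbers are printed
    # one per line on stdout, progress (-s) goes to stderr.
    scan = run(["badblocks", "-sv", dev])
    bad_blocks = [l for l in scan.stdout.splitlines() if l.strip().isdigit()]

    # Overall SMART health self-assessment ("PASSED" or "FAILED").
    health = run(["smartctl", "-H", dev])

    if bad_blocks or "FAILED" in health.stdout:
        print(f"{dev}: {len(bad_blocks)} bad blocks found; keep the disk")
        print("out of service, rewrite/reallocate the bad sectors, and run")
        print("fsck (or reformat) before even thinking about reusing it.")
    else:
        print(f"{dev}: scrub looks clean; candidate to return to service.")

if __name__ == "__main__":
    scrub(sys.argv[1] if len(sys.argv) > 1 else "/dev/sdb")  # assumed device

A real scrub pass would do considerably more (check the grown defect
list, rewrite bad sectors to force reallocation, re-run the self-test),
but the division of labor is the same: this belongs in the
health-monitor, not in each application.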