Date: Thu, 12 Apr 2018 01:34:45 -0400
From: "Theodore Y. Ts'o"
To: Andres Freund
Cc: Andreas Dilger, 20180410184356.GD3563@thunk.org,
    Ext4 Developers List, Linux FS Devel,
    Jeff Layton, "Joshua D. Drake"
Subject: Re: fsync() errors is unsafe and risks data loss
Message-ID: <20180412053445.GP2801@thunk.org>
References: <20180410220726.vunhvwuzxi5bm6e5@alap3.anarazel.de>
 <190CF56C-C03D-4504-8B35-5DB479801513@dilger.ca>
 <20180412021752.2wykkutkmzh4ikbf@alap3.anarazel.de>
In-Reply-To: <20180412021752.2wykkutkmzh4ikbf@alap3.anarazel.de>

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:
> > If there is background data writeback *without an open file descriptor*,
> > there is no mechanism for the kernel to return an error to any application
> > which may exist, or may not ever come back.
>
> And that's *horrible*. If I cp a file, and writeback fails in the
> background, and I then cat that file before restarting, I should be able
> to see that that failed. Instead of returning something bogus.

If there is no open file descriptor, and in many cases no process
(because it has already exited), it may be horrible, but what the h*ll
else do you expect the OS to do?

The solution we use at Google is that we watch for I/O errors using a
completely different process that is responsible for monitoring machine
health.  It used to scrape dmesg, but we now arrange to have I/O errors
sent via a netlink channel to the machine health monitoring daemon.  If
it detects errors on a particular hard drive, it tells the cluster file
system to stop using that disk, and to reconstruct from erasure code all
of the data chunks on that disk onto other disks in the cluster.  We
then run a series of disk diagnostics to make sure we find all of the
bad sectors (very often, where there is one bad sector, there are
several more waiting to be found), and then afterwards put the disk back
into service.  (A rough, illustrative sketch of a dmesg-scraping monitor
of this sort appears below, after the quoted text.)

By making it a separate health monitoring process, we can have HDD
experts write much more sophisticated code that can ask the disk
firmware for more information (e.g., SMART, the grown defect list), do
much more careful scrubbing of the disk media, etc., before returning
the disk to service.

> > Everyone already has too much work to do, so you need to find someone
> > who has an interest in fixing this (IMHO very peculiar) use case.  If
> > PG developers want to add a tunable "keep dirty pages in RAM on IO
> > failure", I don't think that it would be too hard for someone to do.
> > It might be harder to convince some of the kernel maintainers to
> > accept it, and I've been on the losing side of that battle more than
> > once.  However, like everything you don't pay for, you can't require
> > someone else to do this for you.  It wouldn't hurt to see if Jeff
> > Layton, who wrote the errseq patches, would be interested to work on
> > something like this.
>
> I don't think this is that PG specific, as explained above.
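To make the shape of that approach concrete, here is a minimal,
illustrative sketch of an out-of-band monitor in the spirit of the older
dmesg-scraping scheme (not the netlink channel, which is internal).
None of this is the actual Google implementation: the error-message
regex, the drain_disk() hook, and the poll interval are assumptions made
up for the example.

#!/usr/bin/env python3
# Illustrative sketch of an out-of-band disk health monitor that
# scrapes the kernel log for I/O errors.  The regex, the drain_disk()
# hook, and the poll interval are assumptions for the example only.
import re
import subprocess
import time

# Kernel I/O error lines typically look like:
#   blk_update_request: I/O error, dev sdb, sector 123456
IO_ERROR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")

def drain_disk(dev):
    # Placeholder: tell the cluster file system to stop using this disk
    # and reconstruct its data chunks (from erasure code) on other disks.
    print(f"draining /dev/{dev}: rebuild its chunks elsewhere")

def poll_dmesg(already_failed):
    # Scrape the current kernel log; a real monitor would track offsets
    # or read a structured source instead of re-reading everything.
    out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    for m in IO_ERROR_RE.finditer(out):
        dev = m.group(1)
        if dev not in already_failed:
            already_failed.add(dev)
            drain_disk(dev)

if __name__ == "__main__":
    failed = set()
    while True:
        poll_dmesg(failed)
        time.sleep(60)   # arbitrary poll interval for the sketch

The point of the sketch is only where the logic lives: outside the
application, in a machine-level monitor that can take a whole disk out
of service.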
The reality is that recovering from disk errors is tricky business, and
I very much doubt most userspace applications, including distro package
managers, are going to want to engineer for trying to detect and recover
from disk errors.  If that were true, then Red Hat and/or SuSE, which
have kernel engineers, would have implemented everything on your wish
list.  They haven't, and that should tell you something.

The other reality is that once a disk starts developing errors, you will
probably need to take the disk off-line, scrub it to find any other
media errors, and there's a good chance you'll need to rewrite bad
sectors (including some which are on top of file system metadata, so you
will probably have to run fsck or reformat the whole file system).  I
certainly don't think it's realistic to expect each and every userspace
program to add that kind of sophistication.  If you have tens or
hundreds of thousands of disk drives, you will need to do something
automated, but I claim that you really don't want to smush all of that
detailed exception handling and HDD repair technology into each database
or cluster file system component.  It really needs to be done in a
separate health-monitor and machine-level management system.

Regards,

						- Ted
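P.S.  To make the off-line scrub step above a bit more concrete, here is
a rough, illustrative sketch of what such a pass might look like using
standard tools (smartctl from smartmontools, badblocks from e2fsprogs).
The default device, the output parsing, and the pass/fail decision are
simplifying assumptions for the example, not a description of any real
health-monitor.

#!/usr/bin/env python3
# Illustrative sketch: the kind of off-line scrub a separate health
# monitor might run once a disk has been drained.  Uses smartctl
# (smartmontools) and badblocks (e2fsprogs); must run as root.
import subprocess
import sys

def run(cmd):
    print("+", " ".join(cmd))
    return subprocess.run(cmd, capture_output=True, text=True)

def scrub(dev):
    # Ask the drive to run its own extended (long) self-test, which
    # reads every sector and updates the SMART self-test log.
    run(["smartctl", "-t", "long", dev])

    # Host-side read-only surface scan; bad block numbers are printed
    # one per line on stdout, progress (-s) goes to stderr.
    scan = run(["badblocks", "-sv", dev])
    bad_blocks = [l for l in scan.stdout.splitlines() if l.strip().isdigit()]

    # Overall SMART health self-assessment ("PASSED" or "FAILED").
    health = run(["smartctl", "-H", dev])

    if bad_blocks or "FAILED" in health.stdout:
        print(f"{dev}: {len(bad_blocks)} bad blocks found; keep the disk")
        print("out of service, rewrite/reallocate the bad sectors, and run")
        print("fsck (or reformat) before even thinking about reusing it.")
    else:
        print(f"{dev}: scrub looks clean; candidate to return to service.")

if __name__ == "__main__":
    scrub(sys.argv[1] if len(sys.argv) > 1 else "/dev/sdb")  # assumed device

A real scrub pass would do considerably more (check the grown defect
list, rewrite bad sectors to force reallocation, re-run the self-test),
but the division of labor is the same: this belongs in the
health-monitor, not in each application.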