Date: Thu, 19 Apr 2018 11:57:43 -0700
From: "andres@anarazel.de"
To: Trond Myklebust
Cc: "willy@infradead.org", "lsf-pc@lists.linux-foundation.org", "david@fromorbit.com", "jlayton@kernel.org", "linux-fsdevel@vger.kernel.org"
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling
Message-ID: <20180419185743.eaeurq3ou2ouveug@alap3.anarazel.de>
References: <1523963281.4779.21.camel@kernel.org> <20180417225309.GA27893@dastard> <1524067210.27056.28.camel@kernel.org> <20180419004411.GG27893@dastard> <1524102468.38378.12.camel@hammer.space> <20180419015723.GC16782@bombadil.infradead.org> <1524103952.38378.23.camel@hammer.space>
In-Reply-To: <1524103952.38378.23.camel@hammer.space>

Hi,

On 2018-04-19 02:12:33 +0000, Trond Myklebust wrote:
> On Wed, 2018-04-18 at 18:57 -0700, Matthew Wilcox wrote:
> > On Thu, Apr 19, 2018 at 01:47:49AM +0000, Trond Myklebust wrote:
> > > If the main use case is something like Postgresql, where you care
> > > about just one or two critical files, rather than monitoring the
> > > entire filesystem, could we perhaps use a dedicated mmap() mode?
> > > It should be possible to throw up a bitmap that displays the exact
> > > blocks or pages that are affected, once the file has been damaged.
> >
> > Perhaps we need to have a quick summary of the postgres problem ...
> > they're not concerned with "one or two files", otherwise they could
> > just keep those files open and the wb_err mechanism would work fine.
> > The problem is that they have too many files to keep open in their
> > checkpointer process, and when they come along and open the files,
> > they don't see the error.
>
> I thought I understood that there were at least two issues here:
>
> 1) Monitoring lots of files to figure out which ones may have an error.
> 2) Drilling down to see what might be wrong with an individual file.
>
> Unless you are in a situation where you can have millions of files all
> go wrong at the same time, it would seem that the former is the
> operation that needs to scale. Once you're talking about large numbers
> of files all getting errors, it would appear that an fsck-like
> recovery would be necessary. Am I wrong?

Well, the correctness issue really only centers around 1). Currently there are scenarios (some made less likely, some more likely, by the errseq_t changes) where we don't notice IO errors. The result can either be that we wrongly report back to the client that a "COMMIT;" was successful even though it wasn't persisted, or that we throw away journal data because we think a checkpoint was successful even though it wasn't. To fix the correctness issue we really only need 1).

That said, it'd obviously be nice to be able to report a decent error pointing to the individual files affected, and a more descriptive error message than "PANIC: An IO error occurred somewhere. Perhaps look in the kernel logs?" wouldn't hurt either.
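To make the problematic pattern concrete, here's a minimal sketch (not actual PostgreSQL code; the path is made up, and you'd have to inject a writeback failure underneath, e.g. with dm-error, to actually see anything go wrong) of how an error can get lost when the process that dirtied a file is not the one that later opens and fsyncs it:

/*
 * Sketch only: one process dirties a file through one fd and closes it,
 * writeback fails some time later, and a different process (the
 * checkpointer) opens a fresh fd and calls fsync().  Depending on the
 * kernel's error-reporting semantics, that fsync() can return 0 even
 * though the data never reached disk.  "relfile" is a made-up path.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* "backend": dirty the page cache through one fd, then close it */
	int wfd = open("relfile", O_WRONLY | O_CREAT, 0600);
	if (wfd < 0 || write(wfd, "data", 4) != 4)
		return 1;
	close(wfd);		/* no fsync here, by design */

	/* ... writeback of "relfile" fails somewhere in this window ... */

	/* "checkpointer": much later, open a new fd and fsync it */
	int cfd = open("relfile", O_WRONLY);
	if (cfd < 0)
		return 1;
	if (fsync(cfd) == 0)
		printf("fsync reported success - checkpoint considered durable\n");
	else
		printf("fsync failed: %s\n", strerror(errno));
	close(cfd);
	return 0;
}

Whether the second fsync() reports the earlier failure depends on the kernel version and error-reporting semantics; the point is just that the fd that did the writing and the fd that does the syncing are different, with a potentially long gap between them.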
To give a short overview of how PostgreSQL issues fsync and does the surrounding buffer management:

1) There's a traditional journal (WAL), addressed by LSN. Every modification needs to first be in the WAL before buffers (and thus on-disk data) can be modified.

2) There's a postgres-internal buffer cache. Pages are tagged with the WAL LSN that needs to be flushed to disk before the data can be written back.

3) Reads and writes between the OS and the buffer cache are done using buffered IO. There are valid reasons to change that, but it'll require new infrastructure. Each process has a limited-size path -> fd cache.

4) Buffers are written out by:
   - the checkpointing process during checkpoints
   - the background writer, which attempts to keep some "victim" buffers clean
   - backends (client-connection associated), when they have to reuse dirty victim buffers

   Whenever such a writeout happens, information about the file containing that dirty buffer is forwarded to the checkpointer. The checkpointer keeps track of each file that'll need to be fsynced in a hashmap. It's worth noting that each table / index / whatnot is a separate file, and large relations are segmented into 1GB segments, so it's pretty common to have tens of thousands of files in a larger database.

5) During checkpointing, which is paced in most cases and often will be configured to take 30-60min, all buffers from before the start of the checkpoint are written out. We'll issue SYNC_FILE_RANGE_WRITE requests occasionally to keep the amount of dirty kernel buffers under control. After that we fsync each of the dirty files. Once that and some other boring internal stuff has succeeded, we'll issue a checkpoint record and allow discarding WAL from before the checkpoint.

Because we cannot realistically keep each of the files open between 4) and the end of 5), and because the fds used in 4) are not the same as the ones in 5) (different processes), we currently aren't guaranteed notification of writeback failures.

Realistically we're not going to do much file-specific handling in case there are errors. Either a retry is going to fix the issue (oops, right now, because the error has been "eaten"), or we're doing a crash-recovery cycle from the WAL (oops, because we don't necessarily know an error occurred).

It's worth noting that for us syncfs() is better than nothing, but it's not perfect. It's pretty common to have temporary files (sort spool files, temporary tables, ...) on the same filesystem as the persistent database, so syncfs() has the potential to flush out a lot of unnecessary dirty data. Note that it'd be very unlikely for the temp data files to be moved to DIO - it's *good* that the kernel manages the amount of dirty / cached data. It has a heck of a lot more knowledge about how much memory pressure the system is under than postgres ever will have.

One reason we've been concerned about DIO, besides some architectural issues inside PG, is along similar lines: a lot of people use databases as part of their stack without focusing on them, which usually means the database will be largely untuned. With buffered IO that's not so bad; the kernel will dynamically adapt to some extent. With DIO the consequences of a mistuned buffer cache size or the like are much worse. DIO is good for critical databases maintained by dedicated people, not so good outside of that.

Matthew, I'm not sure what kind of summary you had in mind. Please let me know if you want more detail in any of the areas, happy to expand.

Greetings,

Andres Freund
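PS: In case a concrete reference helps, below is a heavily simplified sketch of the flush sequence from 5) above - buffered writes plus occasional SYNC_FILE_RANGE_WRITE hints during the checkpoint, followed by an fsync() of every dirty file at the end. The segment paths, the fixed-size list and the error handling are made up for illustration; this is not the actual PostgreSQL code.

/*
 * Simplified sketch of the checkpoint flush pattern: during the paced
 * write phase we nudge writeback along with sync_file_range(), and only
 * at the end of the checkpoint do we fsync() every file that had dirty
 * buffers written out.  That final fsync() pass is where writeback
 * errors must surface for the checkpoint to be trustworthy.
 */
#define _GNU_SOURCE		/* for sync_file_range() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NSEGS 2

/* files that received dirty-buffer writeouts since the last checkpoint */
static const char *dirty_segs[NSEGS] = {
	"base/16384/16385",
	"base/16384/16385.1",
};

int main(void)
{
	/* write phase: buffered writes, occasionally start writeback */
	for (int i = 0; i < NSEGS; i++)
	{
		int fd = open(dirty_segs[i], O_WRONLY);
		if (fd < 0)
			continue;	/* real code would track and report this */
		/* ... pwrite() dirty buffers belonging to this segment ... */
		/* ask the kernel to start writeback, without waiting for it */
		sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
		close(fd);
	}

	/* ... pacing, more writes, possibly spread over many minutes ... */

	/* sync phase: the only point where errors are guaranteed to matter */
	for (int i = 0; i < NSEGS; i++)
	{
		int fd = open(dirty_segs[i], O_WRONLY);
		if (fd < 0 || fsync(fd) != 0)
		{
			/* a lost writeback error here means a false "success" */
			perror(dirty_segs[i]);
			exit(1);
		}
		close(fd);
	}

	/* only now is it safe to log the checkpoint record and recycle WAL */
	return 0;
}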