Re: [RFC] Better page cache error handling

From: Andreas Dilger <adilger@dilger.ca>
To: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	linux-kernel@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
	Kent Overstreet <kent.overstreet@gmail.com>
Subject: Re: [RFC] Better page cache error handling
Date: Wed, 24 Feb 2021 16:41:26 -0700
Message-ID: <DC74377C-DFFD-4E26-90AB-213577DB3081@dilger.ca>
In-Reply-To: <20210224134115.GP2858050@casper.infradead.org>

On Feb 24, 2021, at 6:41 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Wed, Feb 24, 2021 at 01:38:48PM +0100, Jan Kara wrote:
>>> We allocate a page and try to read it.  29 threads pile up waiting
>>> for the page lock in filemap_update_page().  The error returned by the
>>> original I/O is shared between all 29 waiters as well as being returned
>>> to the requesting thread.  The next request for index.html will send
>>> another I/O, and more waiters will pile up trying to get the page lock,
>>> but at no time will more than 30 threads be waiting for the I/O to fail.
>> 
>> Interesting idea. It certainly improves current behavior. I just wonder
>> whether this isn't a partial solution to a problem and a full solution of
>> it would have to go in a different direction? I mean it just seems
>> wrong that each reader (let's assume they just won't overlap) has to retry
>> the failed IO and wait for the HW to figure out it's not going to work.
>> Shouldn't we cache the error state with the page? And I understand that we
>> then also have to deal with the problem how to invalidate the error state
>> when the block might eventually become readable (for stuff like temporary
>> IO failures). That would need some signalling from the driver to the page
>> cache, maybe in a form of some error recovery sequence counter or something
>> like that. For stuff like iSCSI, multipath, or NBD it could be doable I
>> believe...
> 
> That felt like a larger change than I wanted to make.  I already have
> a few big projects on my plate!
> 
> Also, it's not clear to me that the host can necessarily figure out when
> a device has fixed an error -- certainly for the three cases you list
> it can be done.  I think we'd want a timer to indicate that it's worth
> retrying instead of returning the error.
> 
> Anyway, that seems like a lot of data to cram into a struct page.  So I
> think my proposal is still worth pursuing while waiting for someone to
> come up with a perfect solution.

Since you would know that the page is bad at this point (not uptodate,
does not contain valid data), you could potentially re-use some other
fields in struct page, or potentially store something in the page itself?
That would avoid bloating struct page with fields that are only rarely
needed.  Userspace shouldn't be able to read the page while it is not
marked uptodate, but it could still overwrite it, so you wouldn't want
to store any kind of complex data structure there.  You _could_ store a
magic, an error value, and a timeout that are only valid if !uptodate
(and cleared if the page were totally overwritten by userspace).
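
As a rough illustration of the idea, a userspace sketch in C might look
like the following.  Every name here (struct fake_page, struct
page_err_record, PAGE_ERR_MAGIC, page_record_error(),
page_cached_error()) is hypothetical and not an existing kernel
interface, and the timestamps are abstract ticks rather than jiffies.
It only shows a magic, an error value, and a retry deadline being kept
in the data area of a page that is not uptodate; locking and the
details of partial writes are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical marker identifying a stored error record. */
#define PAGE_ERR_MAGIC 0x50474552u

/* Stand-in for a page cache page: one flag plus the data area. */
struct fake_page {
	bool uptodate;
	unsigned char data[4096];
};

/* Small, flat record; deliberately no pointers or complex structure. */
struct page_err_record {
	uint32_t magic;       /* PAGE_ERR_MAGIC if the record is valid */
	int32_t  error;       /* negative errno from the failed read */
	uint64_t retry_after; /* tick at which the read should be retried */
};

/* Store the error in the page contents; only legal while !uptodate. */
static void page_record_error(struct fake_page *page, int error,
			      uint64_t now, uint64_t timeout)
{
	struct page_err_record rec = {
		.magic = PAGE_ERR_MAGIC,
		.error = error,
		.retry_after = now + timeout,
	};

	assert(!page->uptodate);
	memcpy(page->data, &rec, sizeof(rec));
}

/*
 * Return the cached (negative) error if a valid, unexpired record is
 * present, or 0 if the caller should go ahead and issue the read.
 */
static int page_cached_error(const struct fake_page *page, uint64_t now)
{
	struct page_err_record rec;

	if (page->uptodate)
		return 0;	/* page holds real data, not a record */

	memcpy(&rec, page->data, sizeof(rec));
	if (rec.magic != PAGE_ERR_MAGIC)
		return 0;	/* never recorded, or overwritten */
	if (now >= rec.retry_after)
		return 0;	/* timeout expired, worth retrying */

	return rec.error;
}
```

A reader that finds the page !uptodate would call page_cached_error()
first and only send a new I/O when it returns 0.  The magic is just a
cheap sanity check: a partial write over the start of the page would
normally destroy it, but this sketch does nothing about a write that
leaves it intact.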

Yes, it's nasty, but better than growing struct page, and better than
blocking userspace threads for tens of minutes when a block is bad.

Cheers, Andreas