Re: [LSFMM] RDMA data corruption potential during FS writeback

From: John Hubbard <jhubbard@nvidia.com>
To: Matthew Wilcox <willy@infradead.org>,
	Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>,
	Christopher Lameter <cl@linux.com>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	Linux MM <linux-mm@kvack.org>, Michal Hocko <mhocko@kernel.org>
Subject: Re: [LSFMM] RDMA data corruption potential during FS writeback
Date: Wed, 23 May 2018 16:03:25 -0700	[thread overview]
Message-ID: <414dceeb-1cd1-61e4-2e5b-31f06242aba3@nvidia.com> (raw)
In-Reply-To: <20180521143830.GA25109@bombadil.infradead.org>

On 05/21/2018 07:38 AM, Matthew Wilcox wrote:
> On Fri, May 18, 2018 at 08:51:38PM -0700, Dan Williams wrote:
-------------8<------------------------------------------------
> 
> Suggestion 1:
> 
> in get_user_pages_fast(), mark the page as dirty, but don't tag the radix
> tree entry as dirty.  Then vmscan() won't find it when it's looking to
> write out dirty pages.  Only mark it as dirty in the radix tree once we
> call set_page_dirty_lock().
> 
> Suggestion 2:
> 
> in get_user_pages_fast(), replace the page in the radix tree with a special
> entry that means "page under io".  In set_page_dirty_lock(), replace the
> "page under io" entry with the struct page pointer.

This second one feels a simpler to me. If no one sees huge problems with this,
I can put this together and try it out, because I have a few nicely reproducible
bugs that I can test this on.

But with either approach, a quick question first: will this do the right thing
for the other two use cases below?

    a) ftruncate

    b) deleting the inode and dropping all references to it (only the 
       get_user_pages reference remains)

...or is some other way to sneak in and try_to_free_buffers() on a 
page in this state?

Also, just to be sure I'm on the same page, is it accurate to claim that we
would then have the following updated guidelines for device drivers and
user space?

1. You can safely DMA to file-backed memory that you've pinned via
get_user_pages (with the usual caveats about getting the pages you think
you're getting), if you are careful to avoid truncating or deleting the 
file out from under get_user_pages.

In other words, this pattern is supported:

    get_user_pages (on file-backed memory from a persistent storage filesystem)
    ...do some DMA
    set_page_dirty_lock
    put_page

2. Furthermore, even if you are less careful, you still won't crash the kernel,
The worst that could happen is to corrupt your data, due to interrupting the
writeback.

The possibility of data corruption is bad, but it's also arguably both
self-inflicted and avoidable. Anyway, even so, it's an improvement: the bugs
I'm seeing would definitely get fixed with this.

> 
> Both of these suggestions have trouble with simultaneous sub-page IOs to the
> same page.  Do we care?  I suspect we might as pages get larger (see also:
> supporting THP pages in the page cache).
> 

I don't *think* we care. At least, no examples occur to me where this would
cause a problem.

thanks,
-- 
John Hubbard
NVIDIA