From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Thu, 13 Dec 2018 11:51:19 +1100 From: Dave Chinner To: Jerome Glisse Cc: Jan Kara , John Hubbard , Matthew Wilcox , Dan Williams , John Hubbard , Andrew Morton , Linux MM , tom@talpey.com, Al Viro , benve@cisco.com, Christoph Hellwig , Christopher Lameter , "Dalessandro, Dennis" , Doug Ledford , Jason Gunthorpe , Michal Hocko , mike.marciniszyn@intel.com, rcampbell@nvidia.com, Linux Kernel Mailing List , linux-fsdevel Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions Message-ID: <20181213005119.GD29416@dastard> References: <20181205014441.GA3045@redhat.com> <59ca5c4b-fd5b-1fc6-f891-c7986d91908e@nvidia.com> <7b4733be-13d3-c790-ff1b-ac51b505e9a6@nvidia.com> <20181207191620.GD3293@redhat.com> <3c4d46c0-aced-f96f-1bf3-725d02f11b60@nvidia.com> <20181208022445.GA7024@redhat.com> <20181210102846.GC29289@quack2.suse.cz> <20181212150319.GA3432@redhat.com> <20181212214641.GB29416@dastard> <20181212215931.GG5037@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20181212215931.GG5037@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote: > On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote: > > On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote: > > > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote: > > > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote: > > > > So this approach doesn't look like a win to me over using counter in struct > > > > page and I'd rather try looking into squeezing HMM public page usage of > > > > struct page so that we can fit that gup counter there as well. I know that > > > > it may be easier said than done... > > > > > > So i want back to the drawing board and first i would like to ascertain > > > that we all agree on what the objectives are: > > > > > > [O1] Avoid write back from a page still being written by either a > > > device or some direct I/O or any other existing user of GUP. IOWs, you need to mark pages being written to by a GUP as PageWriteback, so all attempts to write the page will block on wait_on_page_writeback() before trying to write the dirty page. > > > This would avoid possible file system corruption. This isn't a filesystem corruption vector. At worst, it could cause torn data writes due to updating the page while it is under IO. We have a name for this: "stable pages". This is designed to prevent updates to pages via mmap writes from causing corruption of things like MD RAID due to modification of the data during RAID parity calculations. Hence we have wait_for_stable_page() calls in all ->page_mkwrite implementations so that new mmap writes block until writeback IO is complete on the devices that require stable pages to prevent corruption. IOWs, we already deal with this "delay new modification while writeback is in progress" problem in the mmap/filesystem world and have infrastructure to handle it. And the ->page_mkwrite code already deals with it. > > > > > > [O2] Avoid crash when set_page_dirty() is call on a page that is > > > considered clean by core mm (buffer head have been remove and > > > with some file system this turns into an ugly mess). > > > > I think that's wrong. This isn't an "avoid a crash" case, this is a > > "prevent data and/or filesystem corruption" case. The primary goal > > we have here is removing our exposure to potential corruption, which > > has the secondary effect of avoiding the crash/panics that currently > > occur as a result of inconsistent page/filesystem state. > > This is O1 avoid corruption is O1 It's "avoid a specific instance of data corruption", not a general mechanism for avoiding data/filesystem corruption. Calling set_page_dirty() on a file backed page which has not been correctly prepared can cause data corruption, filesystem coruption and shutdowns, etc because we have dirty data over a region that is not correctly mapped. Yes, it can also cause a crash (because we really, really suck at validation and error handling in generic code paths), but there's so, so much more that can go wrong than crash the kernel when we do stupid shit like this. > > i.e. The goal is to have ->page_mkwrite() called on the clean page > > /before/ the file-backed page is marked dirty, and hence we don't > > expose ourselves to potential corruption or crashes that are a > > result of inappropriately calling set_page_dirty() on clean > > file-backed pages. > > Yes and this would be handle by put_user_page ie: No, put_user_page() is too late - it's after the DMA has completed, but we have to ensure the file has backing store allocated and the pages are in the correct state /before/ the DMA is done. Think ENOSPC - that has to be handled before we do the DMA, not after. Before the DMA it is a recoverable error, after the DMA it is data loss/corruption failure. > put_user_page(struct page *page, bool dirty) > { > if (!PageAnon(page)) { > if (dirty) { > // Do the whole dance ie page_mkwrite and all before > // calling set_page_dirty() > } > ... > } > ... > } Essentially, doing this would require a whole new "dirty a page" infrastructure because it is in the IO path, not the page fault path. And, for hardware that does it's own page faults for DMA, this whole post-DMA page setup is broken because the pages will have already gone through ->page_mkwrite() and be set up correctly already. > > > For [O2] i believe we can handle that case in the put_user_page() > > > function to properly dirty the page without causing filesystem > > > freak out. > > > > I'm pretty sure you can't call ->page_mkwrite() from > > put_user_page(), so I don't think this is workable at all. > > Hu why ? i can not think of any reason whike you could not. User of It's not a fault path, you can't safely lock pages, you can't take fault-path only locks in the IO path (mmap_sem inversion problems), etc. /me has a nagging feeling this was all explained in a previous discussions of this patchset... Cheers, Dave. -- Dave Chinner david@fromorbit.com