From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 189F07F for ; Mon, 26 Sep 2022 00:34:32 +0000 (UTC) Received: from dread.disaster.area (pa49-181-106-210.pa.nsw.optusnet.com.au [49.181.106.210]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id C06F01101031; Mon, 26 Sep 2022 10:34:31 +1000 (AEST) Received: from dave by dread.disaster.area with local (Exim 4.92.3) (envelope-from ) id 1occ4s-00CB1p-34; Mon, 26 Sep 2022 10:34:30 +1000 Date: Mon, 26 Sep 2022 10:34:30 +1000 From: Dave Chinner To: Jason Gunthorpe Cc: Dan Williams , akpm@linux-foundation.org, Matthew Wilcox , Jan Kara , "Darrick J. Wong" , Christoph Hellwig , John Hubbard , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org Subject: Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path Message-ID: <20220926003430.GB3600936@dread.disaster.area> References: <20220918225731.GG3600936@dread.disaster.area> <632894c4738d8_2a6ded294a@dwillia2-xfh.jf.intel.com.notmuch> <20220919212959.GL3600936@dread.disaster.area> <6329ee04c9272_2a6ded294bf@dwillia2-xfh.jf.intel.com.notmuch> <20220921221416.GT3600936@dread.disaster.area> <20220923001846.GX3600936@dread.disaster.area> <632d00a491d0d_4a67429488@dwillia2-xfh.jf.intel.com.notmuch> <20220923021012.GZ3600936@dread.disaster.area> Precedence: bulk X-Mailing-List: nvdimm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.4 cv=e9dl9Yl/ c=1 sm=1 tr=0 ts=6330f398 a=j6JUzzrSC7wlfFge/rmVbg==:117 a=j6JUzzrSC7wlfFge/rmVbg==:17 a=kj9zAlcOel0A:10 a=xOM3xZuef0cA:10 a=VwQbUJbxAAAA:8 a=Ikd4Dj_1AAAA:8 a=7-415B0cAAAA:8 a=KjmkecRBzVTVnQtmf_MA:9 a=CjuIK1q_8ugA:10 a=AjGcO6oz07-iQ99wixmX:22 a=biEYGPWJfzWAr4FL6Ov7:22 On Fri, Sep 23, 2022 at 09:39:39AM -0300, Jason Gunthorpe wrote: > On Fri, Sep 23, 2022 at 12:10:12PM +1000, Dave Chinner wrote: > > > > Jason mentioned a scenario here: > > > > > > https://lore.kernel.org/all/YyuoE8BgImRXVkkO@nvidia.com/ > > > > > > Multi-thread process where thread1 does open(O_DIRECT)+mmap()+read() and > > > thread2 does memunmap()+close() while the read() is inflight. > > > > And, ah, what production application does this and expects to be > > able to process the result of the read() operation without getting a > > SEGV? > > The read() will do GUP and get a pined page, next the memunmap()/close > will release the inode the VMA was holding open. The read() FD is NOT > a DAX FD. > > We are now UAFing the DAX storage. There is no SEGV. Yes, but what happens *after* the read(). The userspace application now tries to access the mmap() region to access the data that was read, only to find that it's been unmapped. That triggers a SEGV, yes? IOWs, there's nothing *useful* a user application can do with a pattern like this. All it provides is a vector for UAF of the DAX storage. Now replace the read() with write(), and tell me why this can't cause data corruption and/or fatal filesystem corruption that can take the entire system down..... > It is not about sane applications, it is about kernel security against > hostile userspace. Turning this into a UAF doesn't provide any security at all. It makes this measurable worse from a security POV as it provides a channel for data leakage (read() case) or system unstability or compromise (the write() case). > > i.e. The underlying problem here is that memunmap() frees the VMA > > while there are still active task-based references to the pages in > > that VMA. IOWs, the VMA should not be torn down until the O_DIRECT > > read has released all the references to the pages mapped into the > > task address space. > > This is Jan's suggestion, I think we are still far from being able to > do that for O_DIRECT paths. > > Even if you fix the close() this way, doesn't truncate still have the > same problem? It sure does. Also fallocate(). The deja vu is strong right now. If something truncate()s a file, the only safe thing for an application that is using fsdax to directly access the underying storage is to unmap the file and remap it once the layout change operation has completed. We've been doing this safely with pNFS for remote RDMA-based direct access to the storage hardware for years. We have the layout lease infrastructure already there for it... I've pointed this out every time this conversation comes up. We have a solution for this problem pretty much ready to go - it just needs a UAPI to be defined for it. i.e. nothing has changed in the past 5 years - we have the same problems, we have the same solutions ready to be hooked up and used.... > At the end of the day the rule is a DAX page must not be re-used until > its refcount is 0. At some point the FS should wait for. The page is *not owned by DAX*. How many times do I have to say that FSDAX != DAX. The *storage media* must not be reused until the filesystem says it can be reused. And for that to work, nothing is allowed to keep an anonymous, non-filesystem reference to the storage media. It has nothing to do with struct page reference counts, and everything to do with ensuring that filesystem objects are correctly referenced while the storage media is in direct use by an application. I gave up on FSDAX years ago because nobody was listening to me. Here we are again, years down the track, with exactly the same issues as we had years ago, with exactly the same people repeating the same arguments for and against fixing the page reference problems. I don't have time to repeat history all over again, so I'm going to walk away from this train-wreck again so I can maintain some semblence of my remaining sanity.... -Dave. -- Dave Chinner david@fromorbit.com