From: Amir Goldstein <amir73il@gmail.com> To: Dave Chinner <david@fromorbit.com> Cc: "Darrick J . Wong" <darrick.wong@oracle.com>, Christoph Hellwig <hch@lst.de>, Matthew Wilcox <willy@infradead.org>, linux-xfs <linux-xfs@vger.kernel.org>, linux-fsdevel <linux-fsdevel@vger.kernel.org>, Ext4 <linux-ext4@vger.kernel.org>, Lukas Czerner <lczerner@redhat.com>, Theodore Tso <tytso@mit.edu>, Jan Kara <jack@suse.cz> Subject: Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload Date: Mon, 8 Apr 2019 12:02:34 +0300 [thread overview] Message-ID: <CAOQ4uxgD4ErSUtbu0xqb5dSm_tM4J92qt6=hGH8GRc5KNGqP9A@mail.gmail.com> (raw) In-Reply-To: <20190407232728.GF26298@dastard> On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote: > > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote: > > > > This patch improves performance of mixed random rw workload > > > > on xfs without relaxing the atomic buffered read/write guaranty > > > > that xfs has always provided. > > > > > > > > We achieve that by calling generic_file_read_iter() twice. > > > > Once with a discard iterator to warm up page cache before taking > > > > the shared ilock and once again under shared ilock. > > > > > > This will race with thing like truncate, hole punching, etc that > > > serialise IO and invalidate the page cache for data integrity > > > reasons under the IOLOCK. These rely on there being no IO to the > > > inode in progress at all to work correctly, which this patch > > > violates. IOWs, while this is fast, it is not safe and so not a > > > viable approach to solving the problem. > > > > > > > This statement leaves me wondering, if ext4 does not takes > > i_rwsem on generic_file_read_iter(), how does ext4 (or any other > > fs for that matter) guaranty buffered read synchronization with > > truncate, hole punching etc? > > The answer in ext4 case is i_mmap_sem, which is read locked > > in the page fault handler. > > Nope, the i_mmap_sem is for serialisation of /page faults/ against > truncate, holepunching, etc. Completely irrelevant to the read() > path. > I'm at lost here. Why are page faults completely irrelevant to read() path? Aren't full pages supposed to be faulted in on read() after truncate_pagecache_range()? And aren't partial pages supposed to be partially zeroed and uptodate after truncate_pagecache_range()? > > And xfs does the same type of synchronization with MMAPLOCK, > > so while my patch may not be safe, I cannot follow why from your > > explanation, so please explain if I am missing something. > > mmap_sem inversions require independent locks for IO path and page > faults - the MMAPLOCK does not protect anything in the > read()/write() IO path. > [...] > > All you see is this: > > truncate: read() > > IOLOCK_EXCL > flush relevant cached data > truncate page cache > pre-read page cache between > new eof and old eof > IOLOCK_SHARED > <blocks> > start transaction > ILOCK_EXCL > update isize > remove extents > .... > commit xactn > IOLOCK unlock > <gets lock> > sees beyond EOF, returns 0 > > > So you see the read() doing the right thing (detect EOF, returning > short read). Great. > > But what I see is uptodate pages containing stale data being left in > the page cache beyond EOF. That is th eproblem here - truncate must > not leave stale pages beyond EOF behind - it's the landmine that > causes future things to go wrong. > > e.g. now the app does post-eof preallocation so the range those > pages are cached over are allocated as unwritten - the filesystem > will do this without even looking at the page cache because it's > beyond EOF. Now we extend the file past those cached pages, and > iomap_zero() sees the range as unwritten and so does not write zeros > to the blocks between the old EOF and the new EOF. Now the app reads > from that range (say it does a sub-page write, triggering a page > cache RMW cycle). the read goes to instantiate the page cache page, > finds a page already in the cache that is uptodate, and uses it > without zeroing or reading from disk. > > And now we have stale data exposure and/or data corruption. > > If can come up with quite a few scenarios where this particular > "populate cache after invalidation" race can cause similar problems > for XFS. Hole punch and most of the other fallocate extent > manipulations have the same serialisation requirements - no IO over > the range of the operation can be *initiated* between the /start/ of > the page cache invalidation and the end of the specific extent > manipulation operation. > > So how does ext4 avoid this problem on truncate? > > History lesson: truncate in Linux (and hence ext4) has traditionally > been serialised by the hacky post-page-lock checks that are strewn > all through the page cache and mm/ subsystem. i.e. every time you > look up and lock a page cache page, you have to check the > page->mapping and page->index to ensure that the lookup-and-lock > hasn't raced with truncate. This only works because truncate > requires the inode size to be updated before invalidating the page > cache - that's the "hacky" part of it. > > IOWs, the burden of detecting truncate races is strewn throughout > the mm/ subsystem, rather than being the responisibility of the > filesystem. This is made worse by the fact this mechanism simply > doesn't work for hole punching because there is no file size change > to indicate that the page lookup is racing with an in-progress > invalidation. > > That means the mm/ and page cache code is unable to detect hole > punch races, and so the serialisation of invalidation vs page cache > instantiation has to be done in the filesystem. And no Linux native > filesystem had the infrastructure for such serialisation because > they never had to implement anything to ensure truncate was > serialised against new and in-progress IO. > > The result of this is that, AFAICT, ext4 does not protect against > read() vs hole punch races - it's hole punching code it does: > > Hole Punch: read(): > > inode_lock() > inode_dio_wait(inode); > down_write(i_mmap_sem) > truncate_pagecache_range() > ext4_file_iter_read() > ext4_map_blocks() > down_read(i_data_sem) > <gets mapping> > <populates page cache over hole> > <reads stale data into cache> > ..... > down_write(i_data_sem) > remove extents > > IOWs, ext4 is safe against truncate because of the > change-inode-size-before-invalidation hacks, but the lack of > serialise buffered reads means that hole punch and other similar > fallocate based extent manipulations can race against reads.... > Adding some ext4 folks to comment on the above. Could it be that those races were already addressed by Lukas' work: https://lore.kernel.org/patchwork/cover/371861/ Thanks, Amir.
next prev parent reply other threads:[~2019-04-08 9:02 UTC|newest] Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-04-04 16:57 Amir Goldstein 2019-04-04 21:17 ` Dave Chinner 2019-04-05 14:02 ` Amir Goldstein 2019-04-07 23:27 ` Dave Chinner 2019-04-08 9:02 ` Amir Goldstein [this message] 2019-04-08 14:11 ` Jan Kara 2019-04-08 17:41 ` Amir Goldstein 2019-04-09 8:26 ` Jan Kara 2019-04-08 11:03 ` Jan Kara 2019-04-22 10:55 ` Boaz Harrosh 2019-04-08 10:33 ` Jan Kara 2019-04-08 16:37 ` Davidlohr Bueso 2019-04-11 1:11 ` Dave Chinner 2019-04-16 12:22 ` Dave Chinner 2019-04-18 3:10 ` Dave Chinner 2019-04-18 18:21 ` Davidlohr Bueso 2019-04-20 23:54 ` Dave Chinner 2019-05-03 4:17 ` Dave Chinner 2019-05-03 5:17 ` Dave Chinner
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to='CAOQ4uxgD4ErSUtbu0xqb5dSm_tM4J92qt6=hGH8GRc5KNGqP9A@mail.gmail.com' \ --to=amir73il@gmail.com \ --cc=darrick.wong@oracle.com \ --cc=david@fromorbit.com \ --cc=hch@lst.de \ --cc=jack@suse.cz \ --cc=lczerner@redhat.com \ --cc=linux-ext4@vger.kernel.org \ --cc=linux-fsdevel@vger.kernel.org \ --cc=linux-xfs@vger.kernel.org \ --cc=tytso@mit.edu \ --cc=willy@infradead.org \ --subject='Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).