From: Jan Kara <email@example.com>
To: "Darrick J. Wong" <firstname.lastname@example.org>
Cc: Jan Kara <email@example.com>, firstname.lastname@example.org,
	Christoph Hellwig <email@example.com>,
	Dave Chinner <firstname.lastname@example.org>, email@example.com,
	Chao Yu <firstname.lastname@example.org>,
	Damien Le Moal <email@example.com>,
	"Darrick J. Wong" <firstname.lastname@example.org>,
	Jaegeuk Kim <email@example.com>,
	Jeff Layton <firstname.lastname@example.org>,
	Johannes Thumshirn <email@example.com>,
	firstname.lastname@example.org, email@example.com,
	firstname.lastname@example.org, email@example.com,
	firstname.lastname@example.org,
	Miklos Szeredi <email@example.com>,
	Steve French <firstname.lastname@example.org>,
	Ted Tso <email@example.com>,
	Matthew Wilcox <firstname.lastname@example.org>
Subject: Re: [PATCH 03/13] mm: Protect operations adding pages to page cache with invalidate_lock
Date: Wed, 26 May 2021 12:00:27 +0200
Message-ID: <20210526100027.GA30369@quack2.suse.cz>
In-Reply-To: <20210525210149.GO202121@locust>

On Tue 25-05-21 14:01:49, Darrick J. Wong wrote:
> On Tue, May 25, 2021 at 03:50:40PM +0200, Jan Kara wrote:
> > Currently, serializing operations such as page fault, read, or readahead
> > against hole punching is rather difficult. The basic race scheme is
> > like:
> >
> > fallocate(FALLOC_FL_PUNCH_HOLE)              read / fault / ..
> >   truncate_inode_pages_range()
> >                                                <create pages in page
> >                                                 cache here>
> >   <update fs block mapping and free blocks>
> >
> > Now the problem is in this way read / page fault / readahead can
> > instantiate pages in page cache with potentially stale data (if blocks
> > get quickly reused). Avoiding this race is not simple - page locks do
> > not work because we want to make sure there are *no* pages in given
> > range. inode->i_rwsem does not work because page fault happens under
> > mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> > the performance for mixed read-write workloads suffer.
> >
> > So create a new rw_semaphore in the address_space - invalidate_lock -
> > that protects adding of pages to page cache for page faults / reads /
> > readahead.
> >
> > Signed-off-by: Jan Kara <email@example.com>
> > ---
> >  Documentation/filesystems/locking.rst | 64 ++++++++++++++++++--------
> >  fs/inode.c                            |  2 +
> >  include/linux/fs.h                    |  6 +++
> >  mm/filemap.c                          | 65 ++++++++++++++++++++++-----
> >  mm/readahead.c                        |  2 +
> >  mm/rmap.c                             | 37 +++++++--------
> >  mm/truncate.c                         |  3 +-
> >  7 files changed, 129 insertions(+), 50 deletions(-)
> >
> > diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> > index 4ed2b22bd0a8..af425bef55d3 100644
> > --- a/Documentation/filesystems/locking.rst
> > +++ b/Documentation/filesystems/locking.rst
> > @@ -271,19 +271,19 @@ prototypes::
> >  locking rules:
> >  	All except set_page_dirty and freepage may block
> >
> > -======================  ======================== =========
> > -ops                     PageLocked(page)         i_rwsem
> > -======================  ======================== =========
> > +======================  ======================== =========  ===============
> > +ops                     PageLocked(page)         i_rwsem    invalidate_lock
> > +======================  ======================== =========  ===============
> >  writepage:              yes, unlocks (see below)
> > -readpage:               yes, unlocks
> > +readpage:               yes, unlocks                        shared
> >  writepages:
> >  set_page_dirty          no
> > -readahead:              yes, unlocks
> > -readpages:              no
> > +readahead:              yes, unlocks                        shared
> > +readpages:              no                                  shared
> >  write_begin:            locks the page           exclusive
> >  write_end:              yes, unlocks             exclusive
> >  bmap:
> > -invalidatepage:         yes
> > +invalidatepage:         yes                                 exclusive
> >  releasepage:            yes
> >  freepage:               yes
> >  direct_IO:
> > @@ -378,7 +378,10 @@ keep it that way and don't breed new callers.
> >  ->invalidatepage() is called when the filesystem must attempt to drop
> >  some or all of the buffers from the page when it is being truncated. It
> >  returns zero on success. If ->invalidatepage is zero, the kernel uses
> > -block_invalidatepage() instead.
> > +block_invalidatepage() instead. The filesystem should exclusively acquire
>
> s/should/must/ ? It's not really optional to lock out invalidations
> anymore now that the page cache synchronizes on invalidate_lock, right?

Right, updated.

> > +invalidate_lock before invalidating page cache in truncate / hole punch path
> > +(and thus calling into ->invalidatepage) to block races between page cache
> > +invalidation and page cache filling functions (fault, read, ...).
> >
> >  ->releasepage() is called when the kernel is about to try to drop the
> >  buffers from the page in preparation for freeing it. It returns zero to
> > @@ -573,6 +576,27 @@ in sys_read() and friends.
> >  the lease within the individual filesystem to record the result of the
> >  operation
> >
> > +->fallocate implementation must be really careful to maintain page cache
> > +consistency when punching holes or performing other operations that invalidate
> > +page cache contents. Usually the filesystem needs to call
> > +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> > +However the filesystem usually also needs to update its internal (and on disk)
> > +view of file offset -> disk block mapping. Until this update is finished, the
> > +filesystem needs to block page faults and reads from reloading now-stale page
> > +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> > +and acquires it in shared mode in paths loading pages from disk
> > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > +responsible for taking this lock in its fallocate implementation and generally
> > +whenever the page cache contents needs to be invalidated because a block is
> > +moving from under a page.
> > +
> > +->copy_file_range and ->remap_file_range implementations need to serialize
> > +against modifications of file data while the operation is running. For
> > +blocking changes through write(2) and similar operations inode->i_rwsem can be
> > +used. For blocking changes through memory mapping, the filesystem can use
> > +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite
> > +implementation.
>
> Once this patch lands, will there be any filesystems that can skip
> taking invalidate_lock in ->page_mkwrite and not have problems? Now
> that the address_space has an invalidation lock, everyone is strongly
> incentivized to use it unless they have yet another layer of locks to do
> more or less the same thing, right?

Well, I assume btrfs will want to keep their special extent tree locking
and thus invalidate_lock is not necessary for it strictly speaking. Also
filesystems supporting only read, write, mmap, truncate (such as udf,
reiserfs, ...) do not really need invalidate_lock (they usually don't
bother with any page_mkwrite helper in fact). So there are going to be
exceptions. I want to add invalidate_lock locking around truncate handling
for these filesystems as well to make locking rules simpler and to be able
to add assertions into VFS helpers. I didn't plan to do this for
.page_mkwrite as there it might actually hurt performance noticeably.

								Honza
-- 
Jan Kara <firstname.lastname@example.org>
SUSE Labs, CR