From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	ceph-devel@vger.kernel.org, Chao Yu <yuchao0@huawei.com>,
	Damien Le Moal <damien.lemoal@wdc.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Jaegeuk Kim <jaegeuk@kernel.org>,
	Jeff Layton <jlayton@kernel.org>,
	Johannes Thumshirn <jth@kernel.org>,
	linux-cifs@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-f2fs-devel@lists.sourceforge.net, linux-mm@kvack.org,
	linux-xfs@vger.kernel.org, Miklos Szeredi <miklos@szeredi.hu>,
	Steve French <sfrench@samba.org>, Ted Tso <tytso@mit.edu>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock
Date: Wed, 19 May 2021 08:36:37 +1000	[thread overview]
Message-ID: <20210518223637.GJ2893@dread.disaster.area> (raw)
In-Reply-To: <20210514161730.GL9675@magnolia>

On Fri, May 14, 2021 at 09:17:30AM -0700, Darrick J. Wong wrote:
> On Fri, May 14, 2021 at 09:19:45AM +1000, Dave Chinner wrote:
> > On Thu, May 13, 2021 at 11:52:52AM -0700, Darrick J. Wong wrote:
> > > On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote:
> > > > On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> > > > > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > > > > > +->fallocate implementation must be really careful to maintain page cache
> > > > > > +consistency when punching holes or performing other operations that invalidate
> > > > > > +page cache contents. Usually the filesystem needs to call
> > > > > > +truncate_inode_pages_range() to invalidate the relevant range of the page
> > > > > > +cache. However, the filesystem usually also needs to update its internal (and
> > > > > > +on-disk) view of the file offset -> disk block mapping. Until this update is
> > > > > > +finished, the filesystem needs to block page faults and reads from reloading
> > > > > > +now-stale page cache contents from the disk. VFS provides
> > > > > > +mapping->invalidate_lock for this and acquires it in shared mode in paths
> > > > > > +loading pages from disk (filemap_fault(), filemap_read(), readahead paths).
> > > > > > +The filesystem is responsible for taking this lock in its fallocate
> > > > > > +implementation and generally whenever the page cache contents need to be
> > > > > > +invalidated because a block is moving from under a page.
> > > > > > +
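
FWIW, the pattern that paragraph describes boils down to something
like the sketch below. foofs_punch_hole() is made up, and a real
implementation also has to write back dirty pages and handle errors,
but it shows where the lock sits relative to
truncate_inode_pages_range():

static long foofs_punch_hole(struct inode *inode, loff_t offset,
                             loff_t len)
{
        struct address_space *mapping = inode->i_mapping;

        inode_lock(inode);
        /* Block faults and reads from repopulating the range... */
        down_write(&mapping->invalidate_lock);

        /* ...throw away the cached pages... */
        truncate_inode_pages_range(mapping, offset, offset + len - 1);

        /* ...and only then update the offset -> disk block mapping. */
        /* <punch the blocks out of the extent map here> */

        up_write(&mapping->invalidate_lock);
        inode_unlock(inode);
        return 0;
}

The fault and read paths take the same lock shared, so they cannot
repopulate the range until the extent map update is done.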
> > > > > > +->copy_file_range and ->remap_file_range implementations need to serialize
> > > > > > +against modifications of file data while the operation is running. For
> > > > > > +blocking changes through write(2) and similar operations, inode->i_rwsem can
> > > > > > +be used. For blocking changes through memory mappings, the filesystem can use
> > > > > > +mapping->invalidate_lock, provided it also acquires it in its ->page_mkwrite
> > > > > > +implementation.
> > > > > 
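
For the mmap side, a ->page_mkwrite that honours this would look
roughly like the sketch below - foofs_page_mkwrite() is made up, with
the actual page dirtying elided behind filemap_page_mkwrite():

static vm_fault_t foofs_page_mkwrite(struct vm_fault *vmf)
{
        struct inode *inode = file_inode(vmf->vma->vm_file);
        struct address_space *mapping = inode->i_mapping;
        vm_fault_t ret;

        sb_start_pagefault(inode->i_sb);
        /* Shared mode: blocks while a remap holds it exclusive. */
        down_read(&mapping->invalidate_lock);
        ret = filemap_page_mkwrite(vmf);
        up_read(&mapping->invalidate_lock);
        sb_end_pagefault(inode->i_sb);
        return ret;
}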
> > > > > Question: What is the locking order when acquiring the invalidate_lock
> > > > > of two different files?  Is it the same as i_rwsem (increasing order of
> > > > > the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
> > > > > being hoisted here (increasing order of i_ino)?
> > > > > 
> > > > > The reason I ask is that remap_file_range has to do that, but I don't
> > > > > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> > > > > calls in xfs_ilock2_io_mmap in this series.
> > > > 
> > > > Good question. Technically, I don't think there's a real need to establish
> > > > a single ordering because locks among different filesystems are never going
> > > > to be acquired together (effectively each lock type is local per sb and we
> > > > are free to define an ordering for each lock type differently). But to
> > > > maintain some sanity I guess having the same locking order for a doublelock
> > > > of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses
> > > > by-ino ordering? So that we don't have to consider two different orders in
> > > > xfs_lock_two_inodes()...
> > > 
> > > I imagine Dave will chime in on this, but I suspect the reason is
> > > hysterical raisins^Wreasons.
> > 
> > These are the locking rules that XFS has used pretty much forever.
> > Locking by inode number always guarantees the same locking order of
> > two inodes in the same filesystem, regardless of the specific
> > in-memory instances of the two inodes.
> > 
> > e.g. if we lock based on the inode structure address, in one
> > instance we could get A -> B, then B gets recycled and
> > reallocated, and then we get B -> A as the locking order for the
> > same two inodes.
> > 
> > That, IMNSHO, is utterly crazy because with non-deterministic inode
> > lock ordering like this you can't make consistent locking rules for
> > locking the physical inode cluster buffers underlying the inodes in
> > the situation where they also need to be locked.
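
To make that concrete for the invalidate_lock doublelock Darrick is
asking about, by-ino ordering would look something like this.
lock_two_mappings() is invented for illustration, and the subclass
argument to down_write_nested() is only there to keep lockdep happy:

/* Take two mappings' invalidate_locks in stable by-ino order. */
static void lock_two_mappings(struct inode *a, struct inode *b)
{
        if (a->i_ino > b->i_ino)
                swap(a, b);
        down_write(&a->i_mapping->invalidate_lock);
        if (a != b)
                down_write_nested(&b->i_mapping->invalidate_lock, 1);
}

i_ino doesn't change when the in-memory inode is recycled, so the
order is the same no matter which instances of the two inodes we
happen to hold.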
> 
> <nod> That's protected by the ILOCK, correct?
> 
> > We've been down this path before more than a decade ago when the
> > powers that be decreed that inode locking order is to be "by
> > structure address" rather than inode number, because "inode number
> > is not unique across multiple superblocks".
> > 
> > I'm not sure that there is anywhere that locks multiple inodes
> > across different superblocks, but here we are again....
> 
> Hm.  Are there situations where one would want to lock multiple
> /mappings/ across different superblocks?  The remapping code doesn't
> allow cross-super operations, so ... pipes and splice, maybe?  I don't
> remember that code well enough to say for sure.

Hmmmm. Doing read IO into a buffer that is mmap()d from another
file, and we take a page fault on it inside the read IO path? We're
copying from a page in one mapping and taking a fault in another
mapping and hence taking the invalidate_lock to populate the page
cache for the second mapping...
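
The userspace side of that is trivial - something like this sketch
(error handling omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd_a = open("file_a", O_RDONLY);
        int fd_b = open("file_b", O_RDWR);
        char *buf = mmap(NULL, 65536, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd_b, 0);

        /*
         * The copyout from A's page cache faults on buf, and the
         * fault takes B's mapping->invalidate_lock shared to
         * populate B's page cache - two mappings' locks involved
         * in a single read(2).
         */
        read(fd_a, buf, 65536);
        return 0;
}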

I haven't looked closely enough at where the invalidate_lock is held
in the read path to determine if this is an issue, but if it is, then
it's also a potential deadlock scenario...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

