From: Jan Kara <jack@suse.cz>
To: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>,
	ceph-devel@vger.kernel.org, Chao Yu <yuchao0@huawei.com>,
	Damien Le Moal <damien.lemoal@wdc.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Jaegeuk Kim <jaegeuk@kernel.org>,
	Jeff Layton <jlayton@kernel.org>,
	Johannes Thumshirn <jth@kernel.org>,
	linux-cifs@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-f2fs-devel@lists.sourceforge.net, linux-mm@kvack.org,
	linux-xfs@vger.kernel.org, Miklos Szeredi <miklos@szeredi.hu>,
	Steve French <sfrench@samba.org>, Ted Tso <tytso@mit.edu>
Subject: Re: [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock
Date: Thu, 13 May 2021 21:01:14 +0200
Message-ID: <20210513190114.GJ2734@quack2.suse.cz>
In-Reply-To: <YJvo1bGG1tG+gtgC@casper.infradead.org>

On Wed 12-05-21 15:40:21, Matthew Wilcox wrote:
> On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > Currently, serializing operations such as page fault, read, or readahead
> > against hole punching is rather difficult. The basic race scheme is
> > like:
> > 
> > fallocate(FALLOC_FL_PUNCH_HOLE)			read / fault / ..
> >   truncate_inode_pages_range()
> > 						  <create pages in page
> > 						   cache here>
> >   <update fs block mapping and free blocks>
> > 
> > Now the problem is in this way read / page fault / readahead can
> > instantiate pages in page cache with potentially stale data (if blocks
> > get quickly reused). Avoiding this race is not simple - page locks do
> > not work because we want to make sure there are *no* pages in given
> > range. inode->i_rwsem does not work because page fault happens under
> > mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> > the performance for mixed read-write workloads suffer.
> > 
> > So create a new rw_semaphore in the address_space - invalidate_lock -
> > that protects adding of pages to page cache for page faults / reads /
> > readahead.
> 
> Remind me (or, rather, add to the documentation) why we have to hold the
> invalidate_lock during the call to readpage / readahead, and we don't just
> hold it around the call to add_to_page_cache / add_to_page_cache_locked
> / add_to_page_cache_lru ?  I appreciate that ->readpages is still going
> to suck, but we're down to just three implementations of ->readpages now
> (9p, cifs & nfs).

There's a comment in filemap_create_page() trying to explain this. We need
to protect against cases like the following: a filesystem with 1k
blocksize, where file F has a page at index 0 with an uptodate buffer at
0-1k and the rest not uptodate. All blocks underlying the page are
allocated. Now let a read at offset 1k race with a hole punch at offset 1k,
length 1k.

read()					hole punch
...
  filemap_read()
    filemap_get_pages()
      - page found in the page cache but !Uptodate
      filemap_update_page()
					  locks everything
					  truncate_inode_pages_range()
					    lock_page(page)
					    do_invalidatepage()
					    unlock_page(page)
        locks page
          filemap_read_page()
            ->readpage()
              block underlying offset 1k
	      still allocated -> map buffer
					  free block under offset 1k
	      submit IO -> corrupted data

If you think I should expand it to explain more details, please tell me.
Or maybe I can put a more detailed discussion like the above into the
changelog?

> Also, could I trouble you to run the comments through 'fmt' (or
> equivalent)?  It's easier to read if you're not kissing right up on 80
> columns.

Sure, will do.

> > +++ b/fs/inode.c
> > @@ -190,6 +190,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
> >  	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> >  	mapping->private_data = NULL;
> >  	mapping->writeback_index = 0;
> > +	init_rwsem(&mapping->invalidate_lock);
> > +	lockdep_set_class(&mapping->invalidate_lock,
> > +			  &sb->s_type->invalidate_lock_key);
> 
> Why not:
> 
> 	__init_rwsem(&mapping->invalidate_lock, "mapping.invalidate_lock",
> 			&sb->s_type->invalidate_lock_key);

I replicated what we do for i_rwsem but you're right, this is better.
Updated.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
