From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	linux-kernel@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Jan Kara <jack@suse.com>, Matthew Wilcox <willy@linux.intel.com>,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Dan Williams <dan.j.williams@intel.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	linux-nvdimm@lists.01.org,
	Matthew Wilcox <matthew.r.wilcox@intel.com>
Subject: Re: [PATCH v2 1/2] dax: dax_pfn_mkwrite() truncate race check
Date: Fri, 16 Oct 2015 09:55:44 +0200	[thread overview]
Message-ID: <20151016075544.GB22182@quack.suse.cz> (raw)
In-Reply-To: <20151014225316.GQ31326@dastard>

On Thu 15-10-15 09:53:16, Dave Chinner wrote:
> On Wed, Oct 14, 2015 at 10:40:25AM +0200, Jan Kara wrote:
> > On Wed 14-10-15 16:25:50, Dave Chinner wrote:
> > > On Tue, Oct 13, 2015 at 04:25:36PM -0600, Ross Zwisler wrote:
> > > > Update dax_pfn_mkwrite() so that it validates i_size before returning.
> > > > This is necessary to ensure that the page fault has not raced with truncate
> > > > and is now pointing to a region beyond the end of the current file.
> > > > 
> > > > This change is based on a similar outstanding patch for XFS from Dave
> > > > Chinner entitled "xfs: add ->pfn_mkwrite support for DAX".
> > > > 
> > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > ---
> > > >  fs/dax.c | 13 +++++++++++--
> > > >  1 file changed, 11 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > index 131fd35a..82be6e4 100644
> > > > --- a/fs/dax.c
> > > > +++ b/fs/dax.c
> > > > @@ -693,12 +693,21 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > > >   */
> > > >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > >  {
> > > > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > > > +	struct inode *inode = file_inode(vma->vm_file);
> > > > +	struct super_block *sb = inode->i_sb;
> > > > +	int ret = VM_FAULT_NOPAGE;
> > > > +	loff_t size;
> > > >  
> > > >  	sb_start_pagefault(sb);
> > > >  	file_update_time(vma->vm_file);
> > > > +
> > > > +	/* check that the faulting page hasn't raced with truncate */
> > > > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > > +	if (vmf->pgoff >= size)
> > > > +		ret = VM_FAULT_SIGBUS;
> > > > +
> > > >  	sb_end_pagefault(sb);
> > > > -	return VM_FAULT_NOPAGE;
> > > > +	return ret;
> > > 
> > > This is still racy, as the read of the inode size is not serialised
> > > against or ordered by any locks held by truncate. The check in XFS
> > > is serialised against truncate by the XFS_MMAPLOCK and the generic
> > > DAX code does not have such a mechanism to rely on. Hence I'd
> > > suggest that the correct thing to do here is remove
> > > dax_pfn_mkwrite() and force filesystems to implement their own
> > > truncate-safe variants of ->pfn_mkwrite.
> > 
> > Well, the i_size check is just an optimization anyway. See below the
> > discussion regarding the hole punch.
> > 
> > > And on further thought, I'm not sure that what I did with XFS is
> > > at all correct when considering hole punching. That is, to get a PFN
> > > based fault, we've already had to guarantee the file offset has
> > > blocks allocated and mapped them. Then:
> > > 
> > > process 1				process 2
> > > pfn_mkwrite @ X				punch hole @ X
> > >  xfs_filemap_pfn_mkwrite		XFS_MMAPLOCK_EXCL
> > >   XFS_MMAPLOCK_SHARED
> > >     <blocks>
> > > 					invalidate mapping @ X
> > > 					remove blocks @ X
> > > 					....
> > > 					unlock XFS_MMAPLOCK_EXCL
> > >    checks file size
> > >   unlock XFS_MMAPLOCK_SHARED
> > >   return VM_FAULT_NOPAGE
> > > 
> > > And so process 1 continues with an invalid page mapping over the
> > > hole process 2 just punched in the file. That's a data
> > > exposure/corruption problem once the underlying blocks get
> > > reallocated to some other file.
> > > 
> > > Unless I'm missing something here, this gets ugly real fast.
> > 
> > So how I convinced myself that this is not a problem is:
> > 
> > When process 2 invalidates mapping, it will also modify page tables to
> > unmap freed pages. Then the pte_same() check in mm/memory.c:wp_pfn_shared()
> > will fail, we bail out,
> 
> So this comes in through handle_pte_fault()? And if !pte_same(),
> we return 0:
> 
> __handle_mm_fault
>   handle_pte_fault
>     do_wp_page
>       wp_pfn_shared()
>         ->pfn_mkwrite
>           xfs_filemap_pfn_mkwrite
> 	    return VM_FAULT_NOPAGE
>         !pte_same
>         return 0;
>       return 0;
>     return 0;
>   return 0;
> return 0;
> 
> and so we return all the way back out to __do_page_fault() with a
> return value of zero, which then thinks the page fault succeeded...

Yes. Except "succeeded" is a bit of a misleading term here. Let's say it was
handled without a fatal error.
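
For reference, the bail-out you traced lives in mm/memory.c:wp_pfn_shared().
A simplified sketch of its logic (from memory, not the verbatim source)
looks like:

	static int wp_pfn_shared(struct mm_struct *mm,
			struct vm_area_struct *vma, unsigned long address,
			pte_t *page_table, pmd_t *pmd, spinlock_t *ptl,
			pte_t orig_pte)
	{
		if ((vma->vm_flags & VM_SHARED) && vma->vm_ops &&
		    vma->vm_ops->pfn_mkwrite) {
			struct vm_fault vmf = {
				.pgoff = linear_page_index(vma, address),
				.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
			};
			int ret;

			/* Drop the pte lock across the fs callback */
			pte_unmap_unlock(page_table, ptl);
			ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
			if (ret & VM_FAULT_ERROR)
				return ret;
			/* Retake the lock and recheck the pte */
			page_table = pte_offset_map_lock(mm, pmd, address,
							 &ptl);
			if (!pte_same(*page_table, orig_pte)) {
				/* Raced with invalidation: return 0 and
				 * let the CPU refault */
				pte_unmap_unlock(page_table, ptl);
				return 0;
			}
		}
		/* pte unchanged, safe to just make it writable */
		return wp_page_reuse(mm, vma, address, page_table, ptl,
				     orig_pte, NULL, 0, 0);
	}

That bare "return 0" in the !pte_same() branch is the zero that propagates
all the way back out to __do_page_fault() in your trace.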

> > the CPU will retry the memory access, which faults again, and now we go
> > down the right path. So ->pfn_mkwrite() has to be prepared for the pfn
> > under it to not be there anymore, and current implementations don't
> > really care, so we are fine.
> 
> Ok, that's what I was missing - the CPU refaulting when retrying the
> write and that provides the retry path.
> 
> IOWs, we're relying on pte_same() to catch the truncate/hole
> punch invalidation of the pfn mapping, but that can only work if the
> filesystem ->pfn_mkwrite implementation first serialises the page
> fault against the pte invalidation that the hole punch/truncation
> does.
> 
> Right?

Well, faults and invalidation are already serialized by the pte lock. The
sequence looks like this:

page fault
	...
	pte_lock
	check page tables, see a valid pfn but we need to set the write bit
	pte_unlock
	->pfn_mkwrite()
	pte_lock
	if (page table entry changed from when we last looked)
		bail out
	finish fault
	pte_unlock

truncate
	...
	for each page removed
		for each mapping of that page
			pte_lock
			clear all page table entries for this page
			pte_unlock
  
So the check that the pte entry is still valid after we called
->pfn_mkwrite() and reacquired the pte lock is all that is needed to make
sure we don't map an invalid pfn (an already truncated one) into the page
tables.
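
(For completeness: in the kernel, the pte clearing in the truncate sequence
above is driven by truncate_pagecache() in mm/truncate.c, which goes
roughly like this -- a sketch, not the verbatim source:

	void truncate_pagecache(struct inode *inode, loff_t newsize)
	{
		struct address_space *mapping = inode->i_mapping;
		loff_t holebegin = round_up(newsize, PAGE_SIZE);

		/* First pass is mostly an efficiency measure so that
		 * truncate_inode_pages() sees fewer mapped pages */
		unmap_mapping_range(mapping, holebegin, 0, 1);
		truncate_inode_pages(mapping, newsize);
		/* Second pass is needed for correctness, to zap ptes
		 * re-established while the pages were being freed */
		unmap_mapping_range(mapping, holebegin, 0, 1);
	}

It is that pte zapping which makes the later pte_same() check fail.)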

For read faults we have to be more careful, since there the filesystem
looks up the pfn that is then inserted into the page tables, so *there* we
need the truncate vs fault serialization so that the fs doesn't ask the mm
to insert a pfn that got truncated from the file in the meantime.
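
E.g. XFS provides that serialization by wrapping the read fault handler in
its mmap lock. Something like the following -- paraphrased from the current
xfs_filemap_fault(), details from memory:

	static int xfs_filemap_fault(struct vm_area_struct *vma,
				     struct vm_fault *vmf)
	{
		struct inode *inode = file_inode(vma->vm_file);
		int ret;

		/* Truncate / hole punch take XFS_MMAPLOCK_EXCL, so holding
		 * the shared lock keeps the block mapping stable while we
		 * look up the pfn and insert it into the page tables */
		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
		if (IS_DAX(inode))
			ret = __dax_fault(vma, vmf, xfs_get_blocks_direct,
					  NULL);
		else
			ret = filemap_fault(vma, vmf);
		xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);

		return ret;
	}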

> My head hurts now.

Yup.

> > Trying to generalize when we are safe against such races: Unless we modify
> > page tables or inode allocation information, we don't care about races with
> > truncate() / punch_hole() so even the original dax_pfn_mkwrite() code was
> > actually correct. But it's really subtle...
> 
> I'd call it "completely undocumented", not "subtle". :/

Yeah, we probably need a writeup about the whole truncate / punch hole vs
fault vs munmap serialization, somewhere in Documentation/... I'll try to
find time to write that down sometime next week.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
