All of lore.kernel.org
 help / color / mirror / Atom feed
* set_page_dirty vs truncate
@ 2020-12-18 16:05 Matthew Wilcox
  2020-12-18 22:03 ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2020-12-18 16:05 UTC (permalink / raw)
  To: linux-fsdevel

A number of implementations of ->set_page_dirty check whether the page
has been truncated (ie page->mapping has become NULL since entering
set_page_dirty()).  Several other implementations assume that they can do
page->mapping->host to get to the inode.  So either some implementations
are doing unnecessary checks or others are vulnerable to a NULL pointer
dereference if truncate() races with set_page_dirty().

I'm touching ->set_page_dirty() anyway as part of the page folio
conversion.  I'm thinking about passing in the mapping so there's no
need to look at page->mapping.

The comments on set_page_dirty() and set_page_dirty_lock() suggests
there's no consistency in whether truncation is blocked or not; we're
only guaranteed that the inode itself won't go away.  But maybe the
comments are stale.


There're also some filesystems which always return false from
set_page_dirty() and others which check for PageSwapCache, which surely
can't happen.  I'm also confused by the ones which set PageUptodate.
And several should just use __set_page_dirty_no_writeback().

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: set_page_dirty vs truncate
  2020-12-18 16:05 set_page_dirty vs truncate Matthew Wilcox
@ 2020-12-18 22:03 ` Matthew Wilcox
  2020-12-19  5:18     ` Matthew Wilcox
  2020-12-21 14:12   ` Jan Kara
  0 siblings, 2 replies; 12+ messages in thread
From: Matthew Wilcox @ 2020-12-18 22:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-ext4, Jan Kara, Theodore Ts'o, Andreas Dilger

On Fri, Dec 18, 2020 at 04:05:31PM +0000, Matthew Wilcox wrote:
> A number of implementations of ->set_page_dirty check whether the page
> has been truncated (ie page->mapping has become NULL since entering
> set_page_dirty()).  Several other implementations assume that they can do
> page->mapping->host to get to the inode.  So either some implementations
> are doing unnecessary checks or others are vulnerable to a NULL pointer
> dereference if truncate() races with set_page_dirty().
> 
> I'm touching ->set_page_dirty() anyway as part of the page folio
> conversion.  I'm thinking about passing in the mapping so there's no
> need to look at page->mapping.
> 
> The comments on set_page_dirty() and set_page_dirty_lock() suggests
> there's no consistency in whether truncation is blocked or not; we're
> only guaranteed that the inode itself won't go away.  But maybe the
> comments are stale.

The comments are, I believe, not stale.  Here's some syzbot
reports which indicate that ext4 is seeing races between set_page_dirty()
and truncate():

 https://groups.google.com/g/syzkaller-lts-bugs/c/s9fHu162zhQ/m/Phnf6ucaAwAJ

The reproducer includes calls to ftruncate(), so that would suggest
that's what's going on.

I would suggest just deleting this line:

        WARN_ON_ONCE(!page_has_buffers(page));

I'm not sure what value the other WARN_ON_ONCE adds.  Maybe just replace
ext4_set_page_dirty with __set_page_dirty_buffers in the aops?  I'd defer
to an ext4 expert on this ...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: set_page_dirty vs truncate
  2020-12-18 22:03 ` Matthew Wilcox
@ 2020-12-19  5:18     ` Matthew Wilcox
  2020-12-21 14:12   ` Jan Kara
  1 sibling, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2020-12-19  5:18 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dominique Martinet, v9fs-developer, Steve French, linux-cifs,
	Miklos Szeredi, Jeff Dike, Richard Weinberger, linux-um,
	Dave Kleikamp, jfs-discussion, Trond Myklebust, Anna Schumaker,
	linux-nfs, Anton Altaparmakov, linux-ntfs-dev, Mike Marshall,
	Martin Brandenburg, devel, Hans de Goede

On Fri, Dec 18, 2020 at 10:03:16PM +0000, Matthew Wilcox wrote:
> On Fri, Dec 18, 2020 at 04:05:31PM +0000, Matthew Wilcox wrote:
> > A number of implementations of ->set_page_dirty check whether the page
> > has been truncated (ie page->mapping has become NULL since entering
> > set_page_dirty()).  Several other implementations assume that they can do
> > page->mapping->host to get to the inode.  So either some implementations
> > are doing unnecessary checks or others are vulnerable to a NULL pointer
> > dereference if truncate() races with set_page_dirty().
> > 
> > I'm touching ->set_page_dirty() anyway as part of the page folio
> > conversion.  I'm thinking about passing in the mapping so there's no
> > need to look at page->mapping.
> > 
> > The comments on set_page_dirty() and set_page_dirty_lock() suggests
> > there's no consistency in whether truncation is blocked or not; we're
> > only guaranteed that the inode itself won't go away.  But maybe the
> > comments are stale.
> 
> The comments are, I believe, not stale.  Here's some syzbot
> reports which indicate that ext4 is seeing races between set_page_dirty()
> and truncate():
> 
>  https://groups.google.com/g/syzkaller-lts-bugs/c/s9fHu162zhQ/m/Phnf6ucaAwAJ
> 
> The reproducer includes calls to ftruncate(), so that would suggest
> that's what's going on.

Hmmm ... looks like __set_page_dirty_nobuffers() has a similar problem:

{
        lock_page_memcg(page);
        if (!TestSetPageDirty(page)) {
                struct address_space *mapping = page_mapping(page);
                unsigned long flags;

                if (!mapping) {
                        unlock_page_memcg(page);
                        return 1;
                }

                xa_lock_irqsave(&mapping->i_pages, flags);
                BUG_ON(page_mapping(page) != mapping);

sure, we check that the page wasn't truncated between set_page_dirty()
and the call to TestSetPageDirty(), but we can truncate dirty pages
with no problem.  So between the call to TestSetPageDirty() and
the call to xa_lock_irqsave(), the page can be truncated, and the
BUG_ON should fire.

I haven't been able to find any examples of this, but maybe it's just a very
narrow race.  Does anyone recognise this signature?  Adding the filesystems
which use __set_page_dirty_nobuffers() directly without extra locking.

$ git grep set_page_dirty.*=.*__set_page_dirty_nobuffers
fs/9p/vfs_addr.c:       .set_page_dirty = __set_page_dirty_nobuffers,
fs/cifs/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
fs/cifs/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
fs/fuse/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
fs/hostfs/hostfs_kern.c:        .set_page_dirty = __set_page_dirty_nobuffers,
fs/jfs/jfs_metapage.c:  .set_page_dirty = __set_page_dirty_nobuffers,
fs/nfs/file.c:  .set_page_dirty = __set_page_dirty_nobuffers,
fs/ntfs/aops.c: .set_page_dirty = __set_page_dirty_nobuffers,   /* Set the page dirty
fs/orangefs/inode.c:    .set_page_dirty = __set_page_dirty_nobuffers,
fs/vboxsf/file.c:       .set_page_dirty = __set_page_dirty_nobuffers,


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: set_page_dirty vs truncate
@ 2020-12-19  5:18     ` Matthew Wilcox
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2020-12-19  5:18 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Martin Brandenburg, linux-cifs, jfs-discussion, Miklos Szeredi,
	Dave Kleikamp, Richard Weinberger, Dominique Martinet, linux-um,
	linux-nfs, Trond Myklebust, Steve French, linux-ntfs-dev,
	Hans de Goede, devel, Anna Schumaker, v9fs-developer, Jeff Dike,
	Anton Altaparmakov, Mike Marshall

On Fri, Dec 18, 2020 at 10:03:16PM +0000, Matthew Wilcox wrote:
> On Fri, Dec 18, 2020 at 04:05:31PM +0000, Matthew Wilcox wrote:
> > A number of implementations of ->set_page_dirty check whether the page
> > has been truncated (ie page->mapping has become NULL since entering
> > set_page_dirty()).  Several other implementations assume that they can do
> > page->mapping->host to get to the inode.  So either some implementations
> > are doing unnecessary checks or others are vulnerable to a NULL pointer
> > dereference if truncate() races with set_page_dirty().
> > 
> > I'm touching ->set_page_dirty() anyway as part of the page folio
> > conversion.  I'm thinking about passing in the mapping so there's no
> > need to look at page->mapping.
> > 
> > The comments on set_page_dirty() and set_page_dirty_lock() suggests
> > there's no consistency in whether truncation is blocked or not; we're
> > only guaranteed that the inode itself won't go away.  But maybe the
> > comments are stale.
> 
> The comments are, I believe, not stale.  Here's some syzbot
> reports which indicate that ext4 is seeing races between set_page_dirty()
> and truncate():
> 
>  https://groups.google.com/g/syzkaller-lts-bugs/c/s9fHu162zhQ/m/Phnf6ucaAwAJ
> 
> The reproducer includes calls to ftruncate(), so that would suggest
> that's what's going on.

Hmmm ... looks like __set_page_dirty_nobuffers() has a similar problem:

{
        lock_page_memcg(page);
        if (!TestSetPageDirty(page)) {
                struct address_space *mapping = page_mapping(page);
                unsigned long flags;

                if (!mapping) {
                        unlock_page_memcg(page);
                        return 1;
                }

                xa_lock_irqsave(&mapping->i_pages, flags);
                BUG_ON(page_mapping(page) != mapping);

sure, we check that the page wasn't truncated between set_page_dirty()
and the call to TestSetPageDirty(), but we can truncate dirty pages
with no problem.  So between the call to TestSetPageDirty() and
the call to xa_lock_irqsave(), the page can be truncated, and the
BUG_ON should fire.

I haven't been able to find any examples of this, but maybe it's just a very
narrow race.  Does anyone recognise this signature?  Adding the filesystems
which use __set_page_dirty_nobuffers() directly without extra locking.

$ git grep set_page_dirty.*=.*__set_page_dirty_nobuffers
fs/9p/vfs_addr.c:       .set_page_dirty = __set_page_dirty_nobuffers,
fs/cifs/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
fs/cifs/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
fs/fuse/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
fs/hostfs/hostfs_kern.c:        .set_page_dirty = __set_page_dirty_nobuffers,
fs/jfs/jfs_metapage.c:  .set_page_dirty = __set_page_dirty_nobuffers,
fs/nfs/file.c:  .set_page_dirty = __set_page_dirty_nobuffers,
fs/ntfs/aops.c: .set_page_dirty = __set_page_dirty_nobuffers,   /* Set the page dirty
fs/orangefs/inode.c:    .set_page_dirty = __set_page_dirty_nobuffers,
fs/vboxsf/file.c:       .set_page_dirty = __set_page_dirty_nobuffers,


_______________________________________________
linux-um mailing list
linux-um@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-um


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: set_page_dirty vs truncate
  2020-12-19  5:18     ` Matthew Wilcox
@ 2020-12-19  6:10       ` John Hubbard
  -1 siblings, 0 replies; 12+ messages in thread
From: John Hubbard @ 2020-12-19  6:10 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: Dominique Martinet, v9fs-developer, Steve French, linux-cifs,
	Miklos Szeredi, Jeff Dike, Richard Weinberger, linux-um,
	Dave Kleikamp, jfs-discussion, Trond Myklebust, Anna Schumaker,
	linux-nfs, Anton Altaparmakov, linux-ntfs-dev, Mike Marshall,
	Martin Brandenburg, devel, Hans de Goede

On 12/18/20 9:18 PM, Matthew Wilcox wrote:
> On Fri, Dec 18, 2020 at 10:03:16PM +0000, Matthew Wilcox wrote:
>> On Fri, Dec 18, 2020 at 04:05:31PM +0000, Matthew Wilcox wrote:
>>> A number of implementations of ->set_page_dirty check whether the page
>>> has been truncated (ie page->mapping has become NULL since entering
>>> set_page_dirty()).  Several other implementations assume that they can do
>>> page->mapping->host to get to the inode.  So either some implementations
>>> are doing unnecessary checks or others are vulnerable to a NULL pointer
>>> dereference if truncate() races with set_page_dirty().
>>>
>>> I'm touching ->set_page_dirty() anyway as part of the page folio
>>> conversion.  I'm thinking about passing in the mapping so there's no
>>> need to look at page->mapping.
>>>
>>> The comments on set_page_dirty() and set_page_dirty_lock() suggests
>>> there's no consistency in whether truncation is blocked or not; we're
>>> only guaranteed that the inode itself won't go away.  But maybe the
>>> comments are stale.
>>
>> The comments are, I believe, not stale.  Here's some syzbot
>> reports which indicate that ext4 is seeing races between set_page_dirty()
>> and truncate():
>>
>>   https://groups.google.com/g/syzkaller-lts-bugs/c/s9fHu162zhQ/m/Phnf6ucaAwAJ
>>
>> The reproducer includes calls to ftruncate(), so that would suggest
>> that's what's going on.
> 
> Hmmm ... looks like __set_page_dirty_nobuffers() has a similar problem:
> 
> {
>          lock_page_memcg(page);
>          if (!TestSetPageDirty(page)) {
>                  struct address_space *mapping = page_mapping(page);
>                  unsigned long flags;
> 
>                  if (!mapping) {
>                          unlock_page_memcg(page);
>                          return 1;
>                  }
> 
>                  xa_lock_irqsave(&mapping->i_pages, flags);
>                  BUG_ON(page_mapping(page) != mapping);
> 
> sure, we check that the page wasn't truncated between set_page_dirty()
> and the call to TestSetPageDirty(), but we can truncate dirty pages
> with no problem.  So between the call to TestSetPageDirty() and
> the call to xa_lock_irqsave(), the page can be truncated, and the
> BUG_ON should fire.
> 
> I haven't been able to find any examples of this, but maybe it's just a very
> narrow race.  Does anyone recognise this signature?  Adding the filesystems
> which use __set_page_dirty_nobuffers() directly without extra locking.


That sounds like the same *kind* of failure that Jan Kara and I were
seeing on live systems[1], that led eventually to the gup-to-pup
conversion exercise.

That crash happened due to calling set_page_dirty() on pages that had no
buffers on them [2]. And that sounds like *exactly* the same thing as
calling __set_page_dirty_nobuffers() without extra locking. So I'd
expect that it's Just Wrong To Do, for the same reasons as Jan spells
out very clearly in [1].

Hope that helps.


[1] https://www.spinics.net/lists/linux-mm/msg142700.html

[2] which triggered this assertion:

#define page_buffers(page)					\
	({							\
		BUG_ON(!PagePrivate(page));			\
		((struct buffer_head *)page_private(page));	\
	})


> 
> $ git grep set_page_dirty.*=.*__set_page_dirty_nobuffers
> fs/9p/vfs_addr.c:       .set_page_dirty = __set_page_dirty_nobuffers,
> fs/cifs/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
> fs/cifs/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
> fs/fuse/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
> fs/hostfs/hostfs_kern.c:        .set_page_dirty = __set_page_dirty_nobuffers,
> fs/jfs/jfs_metapage.c:  .set_page_dirty = __set_page_dirty_nobuffers,
> fs/nfs/file.c:  .set_page_dirty = __set_page_dirty_nobuffers,
> fs/ntfs/aops.c: .set_page_dirty = __set_page_dirty_nobuffers,   /* Set the page dirty
> fs/orangefs/inode.c:    .set_page_dirty = __set_page_dirty_nobuffers,
> fs/vboxsf/file.c:       .set_page_dirty = __set_page_dirty_nobuffers,
> 

...wow, long list of these.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: set_page_dirty vs truncate
@ 2020-12-19  6:10       ` John Hubbard
  0 siblings, 0 replies; 12+ messages in thread
From: John Hubbard @ 2020-12-19  6:10 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: Martin Brandenburg, linux-cifs, jfs-discussion, Miklos Szeredi,
	Dave Kleikamp, Richard Weinberger, Dominique Martinet, linux-um,
	linux-nfs, Trond Myklebust, Steve French, linux-ntfs-dev,
	Hans de Goede, devel, Anna Schumaker, v9fs-developer, Jeff Dike,
	Anton Altaparmakov, Mike Marshall

On 12/18/20 9:18 PM, Matthew Wilcox wrote:
> On Fri, Dec 18, 2020 at 10:03:16PM +0000, Matthew Wilcox wrote:
>> On Fri, Dec 18, 2020 at 04:05:31PM +0000, Matthew Wilcox wrote:
>>> A number of implementations of ->set_page_dirty check whether the page
>>> has been truncated (ie page->mapping has become NULL since entering
>>> set_page_dirty()).  Several other implementations assume that they can do
>>> page->mapping->host to get to the inode.  So either some implementations
>>> are doing unnecessary checks or others are vulnerable to a NULL pointer
>>> dereference if truncate() races with set_page_dirty().
>>>
>>> I'm touching ->set_page_dirty() anyway as part of the page folio
>>> conversion.  I'm thinking about passing in the mapping so there's no
>>> need to look at page->mapping.
>>>
>>> The comments on set_page_dirty() and set_page_dirty_lock() suggests
>>> there's no consistency in whether truncation is blocked or not; we're
>>> only guaranteed that the inode itself won't go away.  But maybe the
>>> comments are stale.
>>
>> The comments are, I believe, not stale.  Here's some syzbot
>> reports which indicate that ext4 is seeing races between set_page_dirty()
>> and truncate():
>>
>>   https://groups.google.com/g/syzkaller-lts-bugs/c/s9fHu162zhQ/m/Phnf6ucaAwAJ
>>
>> The reproducer includes calls to ftruncate(), so that would suggest
>> that's what's going on.
> 
> Hmmm ... looks like __set_page_dirty_nobuffers() has a similar problem:
> 
> {
>          lock_page_memcg(page);
>          if (!TestSetPageDirty(page)) {
>                  struct address_space *mapping = page_mapping(page);
>                  unsigned long flags;
> 
>                  if (!mapping) {
>                          unlock_page_memcg(page);
>                          return 1;
>                  }
> 
>                  xa_lock_irqsave(&mapping->i_pages, flags);
>                  BUG_ON(page_mapping(page) != mapping);
> 
> sure, we check that the page wasn't truncated between set_page_dirty()
> and the call to TestSetPageDirty(), but we can truncate dirty pages
> with no problem.  So between the call to TestSetPageDirty() and
> the call to xa_lock_irqsave(), the page can be truncated, and the
> BUG_ON should fire.
> 
> I haven't been able to find any examples of this, but maybe it's just a very
> narrow race.  Does anyone recognise this signature?  Adding the filesystems
> which use __set_page_dirty_nobuffers() directly without extra locking.


That sounds like the same *kind* of failure that Jan Kara and I were
seeing on live systems[1], that led eventually to the gup-to-pup
conversion exercise.

That crash happened due to calling set_page_dirty() on pages that had no
buffers on them [2]. And that sounds like *exactly* the same thing as
calling __set_page_dirty_nobuffers() without extra locking. So I'd
expect that it's Just Wrong To Do, for the same reasons as Jan spells
out very clearly in [1].

Hope that helps.


[1] https://www.spinics.net/lists/linux-mm/msg142700.html

[2] which triggered this assertion:

#define page_buffers(page)					\
	({							\
		BUG_ON(!PagePrivate(page));			\
		((struct buffer_head *)page_private(page));	\
	})


> 
> $ git grep set_page_dirty.*=.*__set_page_dirty_nobuffers
> fs/9p/vfs_addr.c:       .set_page_dirty = __set_page_dirty_nobuffers,
> fs/cifs/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
> fs/cifs/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
> fs/fuse/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
> fs/hostfs/hostfs_kern.c:        .set_page_dirty = __set_page_dirty_nobuffers,
> fs/jfs/jfs_metapage.c:  .set_page_dirty = __set_page_dirty_nobuffers,
> fs/nfs/file.c:  .set_page_dirty = __set_page_dirty_nobuffers,
> fs/ntfs/aops.c: .set_page_dirty = __set_page_dirty_nobuffers,   /* Set the page dirty
> fs/orangefs/inode.c:    .set_page_dirty = __set_page_dirty_nobuffers,
> fs/vboxsf/file.c:       .set_page_dirty = __set_page_dirty_nobuffers,
> 

...wow, long list of these.

thanks,
-- 
John Hubbard
NVIDIA

_______________________________________________
linux-um mailing list
linux-um@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-um


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: set_page_dirty vs truncate
  2020-12-19  6:10       ` John Hubbard
@ 2020-12-19  6:50         ` Matthew Wilcox
  -1 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2020-12-19  6:50 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-fsdevel, Dominique Martinet, v9fs-developer, Steve French,
	linux-cifs, Miklos Szeredi, Jeff Dike, Richard Weinberger,
	linux-um, Dave Kleikamp, jfs-discussion, Trond Myklebust,
	Anna Schumaker, linux-nfs, Anton Altaparmakov, linux-ntfs-dev,
	Mike Marshall, Martin Brandenburg, devel, Hans de Goede

On Fri, Dec 18, 2020 at 10:10:01PM -0800, John Hubbard wrote:
> On 12/18/20 9:18 PM, Matthew Wilcox wrote:
> > On Fri, Dec 18, 2020 at 10:03:16PM +0000, Matthew Wilcox wrote:
> > > On Fri, Dec 18, 2020 at 04:05:31PM +0000, Matthew Wilcox wrote:
> > > > A number of implementations of ->set_page_dirty check whether the page
> > > > has been truncated (ie page->mapping has become NULL since entering
> > > > set_page_dirty()).  Several other implementations assume that they can do
> > > > page->mapping->host to get to the inode.  So either some implementations
> > > > are doing unnecessary checks or others are vulnerable to a NULL pointer
> > > > dereference if truncate() races with set_page_dirty().
> > > > 
> > > > I'm touching ->set_page_dirty() anyway as part of the page folio
> > > > conversion.  I'm thinking about passing in the mapping so there's no
> > > > need to look at page->mapping.
> > > > 
> > > > The comments on set_page_dirty() and set_page_dirty_lock() suggests
> > > > there's no consistency in whether truncation is blocked or not; we're
> > > > only guaranteed that the inode itself won't go away.  But maybe the
> > > > comments are stale.
> > > 
> > > The comments are, I believe, not stale.  Here's some syzbot
> > > reports which indicate that ext4 is seeing races between set_page_dirty()
> > > and truncate():
> > > 
> > >   https://groups.google.com/g/syzkaller-lts-bugs/c/s9fHu162zhQ/m/Phnf6ucaAwAJ
> > > 
> > > The reproducer includes calls to ftruncate(), so that would suggest
> > > that's what's going on.
> > 
> > Hmmm ... looks like __set_page_dirty_nobuffers() has a similar problem:
> > 
> > {
> >          lock_page_memcg(page);
> >          if (!TestSetPageDirty(page)) {
> >                  struct address_space *mapping = page_mapping(page);
> >                  unsigned long flags;
> > 
> >                  if (!mapping) {
> >                          unlock_page_memcg(page);
> >                          return 1;
> >                  }
> > 
> >                  xa_lock_irqsave(&mapping->i_pages, flags);
> >                  BUG_ON(page_mapping(page) != mapping);
> > 
> > sure, we check that the page wasn't truncated between set_page_dirty()
> > and the call to TestSetPageDirty(), but we can truncate dirty pages
> > with no problem.  So between the call to TestSetPageDirty() and
> > the call to xa_lock_irqsave(), the page can be truncated, and the
> > BUG_ON should fire.
> > 
> > I haven't been able to find any examples of this, but maybe it's just a very
> > narrow race.  Does anyone recognise this signature?  Adding the filesystems
> > which use __set_page_dirty_nobuffers() directly without extra locking.
> 
> 
> That sounds like the same *kind* of failure that Jan Kara and I were
> seeing on live systems[1], that led eventually to the gup-to-pup
> conversion exercise.
> 
> That crash happened due to calling set_page_dirty() on pages that had no
> buffers on them [2]. And that sounds like *exactly* the same thing as
> calling __set_page_dirty_nobuffers() without extra locking. So I'd
> expect that it's Just Wrong To Do, for the same reasons as Jan spells
> out very clearly in [1].

Interesting.  It's a bit different, *but* Jan's race might be what's
causing this symptom.  The reason is that the backtrace contains
set_page_dirty_lock() which holds the page lock.  So there can't be
a truncation race because truncate holds the page lock when calling
->invalidatepage.

That said, the syzbot reproducer doesn't have any O_DIRECT in it
either.  So maybe this is some other race?

> Hope that helps.
> 
> 
> [1] https://www.spinics.net/lists/linux-mm/msg142700.html
> 
> [2] which triggered this assertion:
> 
> #define page_buffers(page)					\
> 	({							\
> 		BUG_ON(!PagePrivate(page));			\
> 		((struct buffer_head *)page_private(page));	\
> 	})
> 
> 
> > 
> > $ git grep set_page_dirty.*=.*__set_page_dirty_nobuffers
> > fs/9p/vfs_addr.c:       .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/cifs/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/cifs/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/fuse/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/hostfs/hostfs_kern.c:        .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/jfs/jfs_metapage.c:  .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/nfs/file.c:  .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/ntfs/aops.c: .set_page_dirty = __set_page_dirty_nobuffers,   /* Set the page dirty
> > fs/orangefs/inode.c:    .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/vboxsf/file.c:       .set_page_dirty = __set_page_dirty_nobuffers,
> > 
> 
> ...wow, long list of these.
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: set_page_dirty vs truncate
@ 2020-12-19  6:50         ` Matthew Wilcox
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2020-12-19  6:50 UTC (permalink / raw)
  To: John Hubbard
  Cc: Martin Brandenburg, linux-cifs, jfs-discussion, Miklos Szeredi,
	Dave Kleikamp, Richard Weinberger, Dominique Martinet, linux-um,
	linux-nfs, Trond Myklebust, Steve French, linux-ntfs-dev,
	Hans de Goede, devel, Anna Schumaker, linux-fsdevel,
	v9fs-developer, Jeff Dike, Anton Altaparmakov, Mike Marshall

On Fri, Dec 18, 2020 at 10:10:01PM -0800, John Hubbard wrote:
> On 12/18/20 9:18 PM, Matthew Wilcox wrote:
> > On Fri, Dec 18, 2020 at 10:03:16PM +0000, Matthew Wilcox wrote:
> > > On Fri, Dec 18, 2020 at 04:05:31PM +0000, Matthew Wilcox wrote:
> > > > A number of implementations of ->set_page_dirty check whether the page
> > > > has been truncated (ie page->mapping has become NULL since entering
> > > > set_page_dirty()).  Several other implementations assume that they can do
> > > > page->mapping->host to get to the inode.  So either some implementations
> > > > are doing unnecessary checks or others are vulnerable to a NULL pointer
> > > > dereference if truncate() races with set_page_dirty().
> > > > 
> > > > I'm touching ->set_page_dirty() anyway as part of the page folio
> > > > conversion.  I'm thinking about passing in the mapping so there's no
> > > > need to look at page->mapping.
> > > > 
> > > > The comments on set_page_dirty() and set_page_dirty_lock() suggests
> > > > there's no consistency in whether truncation is blocked or not; we're
> > > > only guaranteed that the inode itself won't go away.  But maybe the
> > > > comments are stale.
> > > 
> > > The comments are, I believe, not stale.  Here's some syzbot
> > > reports which indicate that ext4 is seeing races between set_page_dirty()
> > > and truncate():
> > > 
> > >   https://groups.google.com/g/syzkaller-lts-bugs/c/s9fHu162zhQ/m/Phnf6ucaAwAJ
> > > 
> > > The reproducer includes calls to ftruncate(), so that would suggest
> > > that's what's going on.
> > 
> > Hmmm ... looks like __set_page_dirty_nobuffers() has a similar problem:
> > 
> > {
> >          lock_page_memcg(page);
> >          if (!TestSetPageDirty(page)) {
> >                  struct address_space *mapping = page_mapping(page);
> >                  unsigned long flags;
> > 
> >                  if (!mapping) {
> >                          unlock_page_memcg(page);
> >                          return 1;
> >                  }
> > 
> >                  xa_lock_irqsave(&mapping->i_pages, flags);
> >                  BUG_ON(page_mapping(page) != mapping);
> > 
> > sure, we check that the page wasn't truncated between set_page_dirty()
> > and the call to TestSetPageDirty(), but we can truncate dirty pages
> > with no problem.  So between the call to TestSetPageDirty() and
> > the call to xa_lock_irqsave(), the page can be truncated, and the
> > BUG_ON should fire.
> > 
> > I haven't been able to find any examples of this, but maybe it's just a very
> > narrow race.  Does anyone recognise this signature?  Adding the filesystems
> > which use __set_page_dirty_nobuffers() directly without extra locking.
> 
> 
> That sounds like the same *kind* of failure that Jan Kara and I were
> seeing on live systems[1], that led eventually to the gup-to-pup
> conversion exercise.
> 
> That crash happened due to calling set_page_dirty() on pages that had no
> buffers on them [2]. And that sounds like *exactly* the same thing as
> calling __set_page_dirty_nobuffers() without extra locking. So I'd
> expect that it's Just Wrong To Do, for the same reasons as Jan spells
> out very clearly in [1].

Interesting.  It's a bit different, *but* Jan's race might be what's
causing this symptom.  The reason is that the backtrace contains
set_page_dirty_lock() which holds the page lock.  So there can't be
a truncation race because truncate holds the page lock when calling
->invalidatepage.

That said, the syzbot reproducer doesn't have any O_DIRECT in it
either.  So maybe this is some other race?

> Hope that helps.
> 
> 
> [1] https://www.spinics.net/lists/linux-mm/msg142700.html
> 
> [2] which triggered this assertion:
> 
> #define page_buffers(page)					\
> 	({							\
> 		BUG_ON(!PagePrivate(page));			\
> 		((struct buffer_head *)page_private(page));	\
> 	})
> 
> 
> > 
> > $ git grep set_page_dirty.*=.*__set_page_dirty_nobuffers
> > fs/9p/vfs_addr.c:       .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/cifs/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/cifs/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/fuse/file.c: .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/hostfs/hostfs_kern.c:        .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/jfs/jfs_metapage.c:  .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/nfs/file.c:  .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/ntfs/aops.c: .set_page_dirty = __set_page_dirty_nobuffers,   /* Set the page dirty
> > fs/orangefs/inode.c:    .set_page_dirty = __set_page_dirty_nobuffers,
> > fs/vboxsf/file.c:       .set_page_dirty = __set_page_dirty_nobuffers,
> > 
> 
> ...wow, long list of these.
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA

_______________________________________________
linux-um mailing list
linux-um@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-um


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: set_page_dirty vs truncate
  2020-12-19  6:50         ` Matthew Wilcox
@ 2020-12-19  7:04           ` John Hubbard
  -1 siblings, 0 replies; 12+ messages in thread
From: John Hubbard @ 2020-12-19  7:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, Dominique Martinet, v9fs-developer, Steve French,
	linux-cifs, Miklos Szeredi, Jeff Dike, Richard Weinberger,
	linux-um, Dave Kleikamp, jfs-discussion, Trond Myklebust,
	Anna Schumaker, linux-nfs, Anton Altaparmakov, linux-ntfs-dev,
	Mike Marshall, Martin Brandenburg, devel, Hans de Goede

On 12/18/20 10:50 PM, Matthew Wilcox wrote:
...
>>> Hmmm ... looks like __set_page_dirty_nobuffers() has a similar problem:
>>>
>>> {
>>>           lock_page_memcg(page);
>>>           if (!TestSetPageDirty(page)) {
>>>                   struct address_space *mapping = page_mapping(page);
>>>                   unsigned long flags;
>>>
>>>                   if (!mapping) {
>>>                           unlock_page_memcg(page);
>>>                           return 1;
>>>                   }
>>>
>>>                   xa_lock_irqsave(&mapping->i_pages, flags);
>>>                   BUG_ON(page_mapping(page) != mapping);
>>>
>>> sure, we check that the page wasn't truncated between set_page_dirty()
>>> and the call to TestSetPageDirty(), but we can truncate dirty pages
>>> with no problem.  So between the call to TestSetPageDirty() and
>>> the call to xa_lock_irqsave(), the page can be truncated, and the
>>> BUG_ON should fire.
>>>
>>> I haven't been able to find any examples of this, but maybe it's just a very
>>> narrow race.  Does anyone recognise this signature?  Adding the filesystems
>>> which use __set_page_dirty_nobuffers() directly without extra locking.
>>
>>
>> That sounds like the same *kind* of failure that Jan Kara and I were
>> seeing on live systems[1], that led eventually to the gup-to-pup
>> conversion exercise.
>>
>> That crash happened due to calling set_page_dirty() on pages that had no
>> buffers on them [2]. And that sounds like *exactly* the same thing as
>> calling __set_page_dirty_nobuffers() without extra locking. So I'd
>> expect that it's Just Wrong To Do, for the same reasons as Jan spells
>> out very clearly in [1].
> 
> Interesting.  It's a bit different, *but* Jan's race might be what's
> causing this symptom.  The reason is that the backtrace contains
> set_page_dirty_lock() which holds the page lock.  So there can't be
> a truncation race because truncate holds the page lock when calling
> ->invalidatepage.
> 
> That said, the syzbot reproducer doesn't have any O_DIRECT in it
> either.  So maybe this is some other race?

Jan's race can be also be reproduced *without* O_DIRECT. I first saw
it via a program that just did these steps on a normal ext4 filesystem:

a) pin ext4 file-backed pages, via get_user_pages(). Actually the way
it got here was due to using what *looked* like anonymous RAM to the
program, but was really file-backed RAM, because the admin had it
set up to mount ext4 on /tmp, instead of using tmpfs, to "save RAM",
but I digress. :)

b) wait a while, optionally do some DMA on the pages from a GPU, drink
coffee...

c) call set_pages_dirty()

d) unpin the pages

e) BUG_ON() in page_buffers().


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: set_page_dirty vs truncate
@ 2020-12-19  7:04           ` John Hubbard
  0 siblings, 0 replies; 12+ messages in thread
From: John Hubbard @ 2020-12-19  7:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Martin Brandenburg, linux-cifs, jfs-discussion, Miklos Szeredi,
	Dave Kleikamp, Richard Weinberger, Dominique Martinet, linux-um,
	linux-nfs, Trond Myklebust, Steve French, linux-ntfs-dev,
	Hans de Goede, devel, Anna Schumaker, linux-fsdevel,
	v9fs-developer, Jeff Dike, Anton Altaparmakov, Mike Marshall

On 12/18/20 10:50 PM, Matthew Wilcox wrote:
...
>>> Hmmm ... looks like __set_page_dirty_nobuffers() has a similar problem:
>>>
>>> {
>>>           lock_page_memcg(page);
>>>           if (!TestSetPageDirty(page)) {
>>>                   struct address_space *mapping = page_mapping(page);
>>>                   unsigned long flags;
>>>
>>>                   if (!mapping) {
>>>                           unlock_page_memcg(page);
>>>                           return 1;
>>>                   }
>>>
>>>                   xa_lock_irqsave(&mapping->i_pages, flags);
>>>                   BUG_ON(page_mapping(page) != mapping);
>>>
>>> sure, we check that the page wasn't truncated between set_page_dirty()
>>> and the call to TestSetPageDirty(), but we can truncate dirty pages
>>> with no problem.  So between the call to TestSetPageDirty() and
>>> the call to xa_lock_irqsave(), the page can be truncated, and the
>>> BUG_ON should fire.
>>>
>>> I haven't been able to find any examples of this, but maybe it's just a very
>>> narrow race.  Does anyone recognise this signature?  Adding the filesystems
>>> which use __set_page_dirty_nobuffers() directly without extra locking.
>>
>>
>> That sounds like the same *kind* of failure that Jan Kara and I were
>> seeing on live systems[1], that led eventually to the gup-to-pup
>> conversion exercise.
>>
>> That crash happened due to calling set_page_dirty() on pages that had no
>> buffers on them [2]. And that sounds like *exactly* the same thing as
>> calling __set_page_dirty_nobuffers() without extra locking. So I'd
>> expect that it's Just Wrong To Do, for the same reasons as Jan spells
>> out very clearly in [1].
> 
> Interesting.  It's a bit different, *but* Jan's race might be what's
> causing this symptom.  The reason is that the backtrace contains
> set_page_dirty_lock() which holds the page lock.  So there can't be
> a truncation race because truncate holds the page lock when calling
> ->invalidatepage.
> 
> That said, the syzbot reproducer doesn't have any O_DIRECT in it
> either.  So maybe this is some other race?

Jan's race can be also be reproduced *without* O_DIRECT. I first saw
it via a program that just did these steps on a normal ext4 filesystem:

a) pin ext4 file-backed pages, via get_user_pages(). Actually the way
it got here was due to using what *looked* like anonymous RAM to the
program, but was really file-backed RAM, because the admin had it
set up to mount ext4 on /tmp, instead of using tmpfs, to "save RAM",
but I digress. :)

b) wait a while, optionally do some DMA on the pages from a GPU, drink
coffee...

c) call set_pages_dirty()

d) unpin the pages

e) BUG_ON() in page_buffers().


thanks,
-- 
John Hubbard
NVIDIA

_______________________________________________
linux-um mailing list
linux-um@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-um


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: set_page_dirty vs truncate
  2020-12-18 22:03 ` Matthew Wilcox
  2020-12-19  5:18     ` Matthew Wilcox
@ 2020-12-21 14:12   ` Jan Kara
  2020-12-21 14:58     ` Matthew Wilcox
  1 sibling, 1 reply; 12+ messages in thread
From: Jan Kara @ 2020-12-21 14:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-ext4, Jan Kara, Theodore Ts'o, Andreas Dilger

On Fri 18-12-20 22:03:16, Matthew Wilcox wrote:
> On Fri, Dec 18, 2020 at 04:05:31PM +0000, Matthew Wilcox wrote:
> > A number of implementations of ->set_page_dirty check whether the page
> > has been truncated (ie page->mapping has become NULL since entering
> > set_page_dirty()).  Several other implementations assume that they can do
> > page->mapping->host to get to the inode.  So either some implementations
> > are doing unnecessary checks or others are vulnerable to a NULL pointer
> > dereference if truncate() races with set_page_dirty().
> > 
> > I'm touching ->set_page_dirty() anyway as part of the page folio
> > conversion.  I'm thinking about passing in the mapping so there's no
> > need to look at page->mapping.
> > 
> > The comments on set_page_dirty() and set_page_dirty_lock() suggests
> > there's no consistency in whether truncation is blocked or not; we're
> > only guaranteed that the inode itself won't go away.  But maybe the
> > comments are stale.
> 
> The comments are, I believe, not stale.  Here's some syzbot
> reports which indicate that ext4 is seeing races between set_page_dirty()
> and truncate():
> 
>  https://groups.google.com/g/syzkaller-lts-bugs/c/s9fHu162zhQ/m/Phnf6ucaAwAJ
> 
> The reproducer includes calls to ftruncate(), so that would suggest
> that's what's going on.
> 
> I would suggest just deleting this line:
> 
>         WARN_ON_ONCE(!page_has_buffers(page));
> 
> I'm not sure what value the other WARN_ON_ONCE adds.  Maybe just replace
> ext4_set_page_dirty with __set_page_dirty_buffers in the aops?  I'd defer
> to an ext4 expert on this ...

Please no. We've added this WARN_ON_ONCE() in 6dcc693bc57 ("ext4: warn when
page is dirtied without buffers") to catch problems with page pinning
earlier so that we get more diagnostic information before we actually BUG_ON()
in the writeback code ;).

To give more context: The question in which states we can see a page in
set_page_dirty() is actually filesystem dependent. Filesystems such as
ext4, xfs, btrfs expect to have full control over page dirtying because for
them it's a question of fs consistency (due to journalling requirements,
delayed allocation accounting etc.). Generally they expect the page can be
dirtied only through ->page_mkwrite() or through ->write_iter() and lock
things accordingly to maintain consistency. Except there's stuff like GUP
which breaks these assumptions - GUP users will trigger ->page_mkwrite()
but page can be writeprotected and cleaned long before GUP user modifies
page data and calls set_page_dirty(). Which is the main point why we came
up with pin_user_pages() so that MM / filesystems can detect there are page
references which can potentially modify & dirty a page and can count with
it (the "count with it" part is still missing, I have some clear ideas how
to do it but didn't get to it yet). And the syzkaller reproducer you
reference above is exactly one of the paths using GUP (actually already
pin_user_pages() these days) that can get fs into inconsistent state.

But overall even with GUP woes fixed up, set_page_dirty() called by a PUP
user could still see already truncated page. So it has to deal with it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: set_page_dirty vs truncate
  2020-12-21 14:12   ` Jan Kara
@ 2020-12-21 14:58     ` Matthew Wilcox
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2020-12-21 14:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-ext4, Jan Kara, Theodore Ts'o, Andreas Dilger

On Mon, Dec 21, 2020 at 03:12:57PM +0100, Jan Kara wrote:
> But overall even with GUP woes fixed up, set_page_dirty() called by a PUP
> user could still see already truncated page. So it has to deal with it.

Thanks!  That was really helpful.  We have a number of currently-buggy
filesystems which assume they can do inode = page->mapping->host without
checking that page->mapping is not NULL.

Anyway, since I'm changing the set_page_dirty signature for folios,
this feels like the right time to pass in the page's mapping.
__set_page_dirty() rechecks the mapping under the i_pages lock, so we
won't do anything inappropriate if the page has been truncated.

You can find the whole thing at
https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/folio

but the important bit is:

-       /* Set a page dirty.  Return true if this dirtied it */
-       int (*set_page_dirty)(struct page *page);
+       /* Set a folio dirty.  Return true if this dirtied it */
+       bool (*set_page_dirty)(struct address_space *, struct folio *);

I'm kind of tempted to rename it to ->dirty_folio(), but I'm also fine
with leaving it this way.


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-12-21 15:17 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-18 16:05 set_page_dirty vs truncate Matthew Wilcox
2020-12-18 22:03 ` Matthew Wilcox
2020-12-19  5:18   ` Matthew Wilcox
2020-12-19  5:18     ` Matthew Wilcox
2020-12-19  6:10     ` John Hubbard
2020-12-19  6:10       ` John Hubbard
2020-12-19  6:50       ` Matthew Wilcox
2020-12-19  6:50         ` Matthew Wilcox
2020-12-19  7:04         ` John Hubbard
2020-12-19  7:04           ` John Hubbard
2020-12-21 14:12   ` Jan Kara
2020-12-21 14:58     ` Matthew Wilcox

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.