linux-cifs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
       [not found]                 ` <20190921140731.GQ1131@ZenIV.linux.org.uk>
@ 2019-09-24  2:52                   ` Al Viro
  2019-09-24 13:30                     ` Josef Bacik
       [not found]                     ` <CAHk-=wiJ1eY7y6r_cFNRPCqD+BJZS7eJeQFO6OrXxRFjDAipsQ@mail.gmail.com>
  0 siblings, 2 replies; 8+ messages in thread
From: Al Viro @ 2019-09-24  2:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: zhengbin (A), Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

[btrfs and cifs folks Cc'd]

On Sat, Sep 21, 2019 at 03:07:31PM +0100, Al Viro wrote:

> No "take cursors out of the list" parts yet.

Argh...  The things turned interesting.  The tricky part is
where do we handle switching cursors away from something
that gets moved.

What I hoped for was "just do it in simple_rename()".  Which is
almost OK; there are 3 problematic cases.  One is shmem -
there we have a special ->rename(), which handles things
like RENAME_EXCHANGE et.al.  Fair enough - some of that
might've been moved into simple_rename(), but some (whiteouts)
won't be that easy.  Fair enough - we can make kicking the
cursors outs a helper called by simple_rename() and by that.
Exchange case is going to cause a bit of headache (the
pathological case is when the entries being exchanged are
next to each other in the same directory), but it's not
that bad.

Two other cases, though, might be serious trouble.  Those are
btrfs new_simple_dir() and this in cifs_root_iget():
        if (rc && tcon->pipe) {
                cifs_dbg(FYI, "ipc connection - fake read inode\n");
                spin_lock(&inode->i_lock);
                inode->i_mode |= S_IFDIR;
                set_nlink(inode, 2);
                inode->i_op = &cifs_ipc_inode_ops;
                inode->i_fop = &simple_dir_operations;
                inode->i_uid = cifs_sb->mnt_uid;
                inode->i_gid = cifs_sb->mnt_gid;
                spin_unlock(&inode->i_lock);
	}
The trouble is, it looks like d_splice_alias() from a lookup elsewhere
might find an alias of some subdirectory in those.  And in that case
we'll end up with a child of those (dcache_readdir-using) directories
being ripped out and moved elsewhere.  With no calls of ->rename() in
sight, of course, *AND* with only shared lock on the parent.  The
last part is really nasty.  And not just for hanging cursors off the
dentries they point to - it's a problem for dcache_readdir() itself
even in the mainline and with all the lockless crap reverted.

We pass next->d_name.name to dir_emit() (i.e. potentially to
copy_to_user()).  And we have no warranty that it's not a long
(== separately allocated) name, that will be freed while
copy_to_user() is in progress.  Sure, it'll get an RCU delay
before freeing, but that doesn't help us at all.

I'm not familiar with those areas in btrfs or cifs; could somebody
explain what's going on there and can we indeed end up finding aliases
to those suckers?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-24  2:52                   ` [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel Al Viro
@ 2019-09-24 13:30                     ` Josef Bacik
  2019-09-24 14:51                       ` Al Viro
       [not found]                     ` <CAHk-=wiJ1eY7y6r_cFNRPCqD+BJZS7eJeQFO6OrXxRFjDAipsQ@mail.gmail.com>
  1 sibling, 1 reply; 8+ messages in thread
From: Josef Bacik @ 2019-09-24 13:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

On Tue, Sep 24, 2019 at 03:52:15AM +0100, Al Viro wrote:
> [btrfs and cifs folks Cc'd]
> 
> On Sat, Sep 21, 2019 at 03:07:31PM +0100, Al Viro wrote:
> 
> > No "take cursors out of the list" parts yet.
> 
> Argh...  The things turned interesting.  The tricky part is
> where do we handle switching cursors away from something
> that gets moved.
> 
> What I hoped for was "just do it in simple_rename()".  Which is
> almost OK; there are 3 problematic cases.  One is shmem -
> there we have a special ->rename(), which handles things
> like RENAME_EXCHANGE et.al.  Fair enough - some of that
> might've been moved into simple_rename(), but some (whiteouts)
> won't be that easy.  Fair enough - we can make kicking the
> cursors outs a helper called by simple_rename() and by that.
> Exchange case is going to cause a bit of headache (the
> pathological case is when the entries being exchanged are
> next to each other in the same directory), but it's not
> that bad.
> 
> Two other cases, though, might be serious trouble.  Those are
> btrfs new_simple_dir() and this in cifs_root_iget():
>         if (rc && tcon->pipe) {
>                 cifs_dbg(FYI, "ipc connection - fake read inode\n");
>                 spin_lock(&inode->i_lock);
>                 inode->i_mode |= S_IFDIR;
>                 set_nlink(inode, 2);
>                 inode->i_op = &cifs_ipc_inode_ops;
>                 inode->i_fop = &simple_dir_operations;
>                 inode->i_uid = cifs_sb->mnt_uid;
>                 inode->i_gid = cifs_sb->mnt_gid;
>                 spin_unlock(&inode->i_lock);
> 	}
> The trouble is, it looks like d_splice_alias() from a lookup elsewhere
> might find an alias of some subdirectory in those.  And in that case
> we'll end up with a child of those (dcache_readdir-using) directories
> being ripped out and moved elsewhere.  With no calls of ->rename() in
> sight, of course, *AND* with only shared lock on the parent.  The
> last part is really nasty.  And not just for hanging cursors off the
> dentries they point to - it's a problem for dcache_readdir() itself
> even in the mainline and with all the lockless crap reverted.
> 
> We pass next->d_name.name to dir_emit() (i.e. potentially to
> copy_to_user()).  And we have no warranty that it's not a long
> (== separately allocated) name, that will be freed while
> copy_to_user() is in progress.  Sure, it'll get an RCU delay
> before freeing, but that doesn't help us at all.
> 
> I'm not familiar with those areas in btrfs or cifs; could somebody
> explain what's going on there and can we indeed end up finding aliases
> to those suckers?

We can't for the btrfs case.  This is used for the case where we have a link to
a subvolume but the root has disappeared already, so we add in that dummy inode.
We completely drop the dcache from that root downards when we drop the
subvolume, so we're not going to find aliases underneath those things.  Is that
what you're asking?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-24 13:30                     ` Josef Bacik
@ 2019-09-24 14:51                       ` Al Viro
  2019-09-24 15:01                         ` Josef Bacik
  0 siblings, 1 reply; 8+ messages in thread
From: Al Viro @ 2019-09-24 14:51 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Linus Torvalds, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

On Tue, Sep 24, 2019 at 09:30:26AM -0400, Josef Bacik wrote:

> > We pass next->d_name.name to dir_emit() (i.e. potentially to
> > copy_to_user()).  And we have no warranty that it's not a long
> > (== separately allocated) name, that will be freed while
> > copy_to_user() is in progress.  Sure, it'll get an RCU delay
> > before freeing, but that doesn't help us at all.
> > 
> > I'm not familiar with those areas in btrfs or cifs; could somebody
> > explain what's going on there and can we indeed end up finding aliases
> > to those suckers?
> 
> We can't for the btrfs case.  This is used for the case where we have a link to
> a subvolume but the root has disappeared already, so we add in that dummy inode.
> We completely drop the dcache from that root downards when we drop the
> subvolume, so we're not going to find aliases underneath those things.  Is that
> what you're asking?  Thanks,

Umm...  Completely drop, you say?  What happens if something had been opened
in there at that time?

Could you give a bit more background?  How much of that subvolume remains
and what does btrfs_lookup() have to work with?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-24 14:51                       ` Al Viro
@ 2019-09-24 15:01                         ` Josef Bacik
  2019-09-24 15:11                           ` Al Viro
  0 siblings, 1 reply; 8+ messages in thread
From: Josef Bacik @ 2019-09-24 15:01 UTC (permalink / raw)
  To: Al Viro
  Cc: Josef Bacik, Linus Torvalds, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

On Tue, Sep 24, 2019 at 03:51:04PM +0100, Al Viro wrote:
> On Tue, Sep 24, 2019 at 09:30:26AM -0400, Josef Bacik wrote:
> 
> > > We pass next->d_name.name to dir_emit() (i.e. potentially to
> > > copy_to_user()).  And we have no warranty that it's not a long
> > > (== separately allocated) name, that will be freed while
> > > copy_to_user() is in progress.  Sure, it'll get an RCU delay
> > > before freeing, but that doesn't help us at all.
> > > 
> > > I'm not familiar with those areas in btrfs or cifs; could somebody
> > > explain what's going on there and can we indeed end up finding aliases
> > > to those suckers?
> > 
> > We can't for the btrfs case.  This is used for the case where we have a link to
> > a subvolume but the root has disappeared already, so we add in that dummy inode.
> > We completely drop the dcache from that root downards when we drop the
> > subvolume, so we're not going to find aliases underneath those things.  Is that
> > what you're asking?  Thanks,
> 
> Umm...  Completely drop, you say?  What happens if something had been opened
> in there at that time?
> 
> Could you give a bit more background?  How much of that subvolume remains
> and what does btrfs_lookup() have to work with?

Sorry I mis-read the code a little bit.  This is purely for the subvolume link
directories.  We haven't wandered down into this directory yet.  If the
subvolume is being deleted and we still have the fake directory entry for it
then we just populate it with this dummy inode and then we can't lookup anything
underneath it.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-24 15:01                         ` Josef Bacik
@ 2019-09-24 15:11                           ` Al Viro
  2019-09-24 15:26                             ` Josef Bacik
  0 siblings, 1 reply; 8+ messages in thread
From: Al Viro @ 2019-09-24 15:11 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Linus Torvalds, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

On Tue, Sep 24, 2019 at 11:01:45AM -0400, Josef Bacik wrote:

> Sorry I mis-read the code a little bit.  This is purely for the subvolume link
> directories.  We haven't wandered down into this directory yet.  If the
> subvolume is being deleted and we still have the fake directory entry for it
> then we just populate it with this dummy inode and then we can't lookup anything
> underneath it.  Thanks,

Umm...  OK, I guess my question would be better stated a bit differently: we
have a directory inode, with btrfs_lookup() for lookups in it *and* with
dcache_readdir() called when you try to do getdents(2) on that thing.
How does that work?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-24 15:11                           ` Al Viro
@ 2019-09-24 15:26                             ` Josef Bacik
  2019-09-24 16:33                               ` Al Viro
  0 siblings, 1 reply; 8+ messages in thread
From: Josef Bacik @ 2019-09-24 15:26 UTC (permalink / raw)
  To: Al Viro
  Cc: Josef Bacik, Linus Torvalds, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

On Tue, Sep 24, 2019 at 04:11:07PM +0100, Al Viro wrote:
> On Tue, Sep 24, 2019 at 11:01:45AM -0400, Josef Bacik wrote:
> 
> > Sorry I mis-read the code a little bit.  This is purely for the subvolume link
> > directories.  We haven't wandered down into this directory yet.  If the
> > subvolume is being deleted and we still have the fake directory entry for it
> > then we just populate it with this dummy inode and then we can't lookup anything
> > underneath it.  Thanks,
> 
> Umm...  OK, I guess my question would be better stated a bit differently: we
> have a directory inode, with btrfs_lookup() for lookups in it *and* with
> dcache_readdir() called when you try to do getdents(2) on that thing.
> How does that work?

Sorry I hadn't read through the context.  We won't end up with things under this
directory.  The lookup will try to look up into the subvolume, see that it's
empty, and just return nothing.  There should never be any entries that end up
under this dummy entry.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
  2019-09-24 15:26                             ` Josef Bacik
@ 2019-09-24 16:33                               ` Al Viro
  0 siblings, 0 replies; 8+ messages in thread
From: Al Viro @ 2019-09-24 16:33 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Linus Torvalds, zhengbin (A),
	Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

On Tue, Sep 24, 2019 at 11:26:28AM -0400, Josef Bacik wrote:
> On Tue, Sep 24, 2019 at 04:11:07PM +0100, Al Viro wrote:
> > On Tue, Sep 24, 2019 at 11:01:45AM -0400, Josef Bacik wrote:
> > 
> > > Sorry I mis-read the code a little bit.  This is purely for the subvolume link
> > > directories.  We haven't wandered down into this directory yet.  If the
> > > subvolume is being deleted and we still have the fake directory entry for it
> > > then we just populate it with this dummy inode and then we can't lookup anything
> > > underneath it.  Thanks,
> > 
> > Umm...  OK, I guess my question would be better stated a bit differently: we
> > have a directory inode, with btrfs_lookup() for lookups in it *and* with
> > dcache_readdir() called when you try to do getdents(2) on that thing.
> > How does that work?
> 
> Sorry I hadn't read through the context.  We won't end up with things under this
> directory.  The lookup will try to look up into the subvolume, see that it's
> empty, and just return nothing.  There should never be any entries that end up
> under this dummy entry.  Thanks,

Er...  Then why not use simple_lookup() in there?  Confused...

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel
       [not found]                     ` <CAHk-=wiJ1eY7y6r_cFNRPCqD+BJZS7eJeQFO6OrXxRFjDAipsQ@mail.gmail.com>
@ 2019-09-29  5:29                       ` Al Viro
  0 siblings, 0 replies; 8+ messages in thread
From: Al Viro @ 2019-09-29  5:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: zhengbin, Jan Kara, Andrew Morton, linux-fsdevel, zhangyi (F),
	renxudong1, Hou Tao, linux-btrfs, Yan, Zheng, linux-cifs,
	Steve French

On Tue, Sep 24, 2019 at 09:55:06AM -0700, Linus Torvalds wrote:
> [ Sorry for html, I'm driving around ]
> 
> On Mon, Sep 23, 2019, 19:52 Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> >
> > Argh...  The things turned interesting.  The tricky part is
> > where do we handle switching cursors away from something
> > that gets moved.
> >
> 
> I forget why we do that. Remind me?
> 
> Anyway, I think to a first approximation, we should probably first just see
> if the "remove cursors from the end" change already effectively makes most
> of the regression go away.
> 
> Because with that change, if you just readdir() things with a sufficiently
> big buffer, you'll never have the cursor hang around over several system
> calls.
> 
> Before, you'd do a readdir, add the cursor, return to user space, then do
> another readdir just to see the EOF, and then do the close().
> 
> So it used to leave the cursor in place over those two final system calls,
> and now it's all gone.
> 
> I'm sure that on the big systems you can still trigger the whole d_lock
> contention on the parent, but I wonder if it's all that much of a problem
> any more in practice with your other change..

If nothing else, it's DoSable right now.  And getting rid of that would
reduce the memory footprint, while we are at it.

In any case, it looks like btrfs really wants an empty directory there,
i.e. the right thing to do would be simple_lookup() for ->lookup.

CIFS is potentially trickier.  AFAICS, what's going on is
	* Windows has a strange export, called IPC$.  Looks like it
was (still is?) the place where they export their named pipes.  From
what I'd been able to figure out, it's always there and allows for
some other uses - it can be used to get the list of exports.  Whether
the actual named pipes are seen there these days... no idea.
	* there seems to be nothing to prevent a server (modified
samba, for example) from exporting whatever it wants under that
name.
	* IF it can be non-empty, mounting it will end up with
root directory where we can do lookups for whatever is there.
getdents() on that root will show what's currently in dcache
(== had been looked up and still has not been evicted by
memory pressure).  Mainline can get seriously buggered if
dcache_readdir() steps into something that is going away.  With the
patches in this series that's no longer a problem.  HOWEVER, if
lookup in one subdirectory picks an alias for another subdirectory
of root, we will be really screwed - shared lock on parent
won't stop d_splice_alias() from moving the alias, and that can
bloody well lead to freeing the original name.  From under
copy_to_user()...  And grabbing a reference to dentry obviously
doesn't prevent that - dentry itself won't get freed, but
external name very well might be.

Again, I don't know CIFS well enough to tell how IPC$ is really
used.  If it doesn't normally contain anything but named pipes,
we can simply have cifs_lookup() refuse to return subdirectories
on such mounts, with solves any problems with d_splice_alias().
If it's always empty - better yet.  If the impressions above
(and that's all those are) are correct, we might have a problem
in mainline with bogus/malicious servers.

That's independent from what we do with the cursors.  I asked
around a bit; no luck so far.  It would be really nice if
some CIFS folks could give a braindump on that thing...

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-09-29  5:29 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAHk-=whpKgNTxjrenAed2sNkegrpCCPkV77_pWKbqo+c7apCOw@mail.gmail.com>
     [not found] ` <20190914170146.GT1131@ZenIV.linux.org.uk>
     [not found]   ` <CAHk-=wiPv+yo86GpA+Gd_et0KS2Cydk4gSbEj3p4S4tEb1roKw@mail.gmail.com>
     [not found]     ` <20190914200412.GU1131@ZenIV.linux.org.uk>
     [not found]       ` <CAHk-=whpoQ_hX2KeqjQs3DeX6Wb4Tmb8BkHa5zr-Xu=S55+ORg@mail.gmail.com>
     [not found]         ` <20190915005046.GV1131@ZenIV.linux.org.uk>
     [not found]           ` <CAHk-=wjcZBB2GpGP-cxXppzW=M0EuFnSLoTXHyqJ4BtffYrCXw@mail.gmail.com>
     [not found]             ` <20190915160236.GW1131@ZenIV.linux.org.uk>
     [not found]               ` <CAHk-=whjNE+_oSBP_o_9mquUKsJn4gomL2f0MM79gxk_SkYLRw@mail.gmail.com>
     [not found]                 ` <20190921140731.GQ1131@ZenIV.linux.org.uk>
2019-09-24  2:52                   ` [PATCH] Re: Possible FS race condition between iterate_dir and d_alloc_parallel Al Viro
2019-09-24 13:30                     ` Josef Bacik
2019-09-24 14:51                       ` Al Viro
2019-09-24 15:01                         ` Josef Bacik
2019-09-24 15:11                           ` Al Viro
2019-09-24 15:26                             ` Josef Bacik
2019-09-24 16:33                               ` Al Viro
     [not found]                     ` <CAHk-=wiJ1eY7y6r_cFNRPCqD+BJZS7eJeQFO6OrXxRFjDAipsQ@mail.gmail.com>
2019-09-29  5:29                       ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).