[git pull] vfs and fs fixes

All of lore.kernel.org
 help / color / mirror / Atom feed

* [git pull] vfs and fs fixes
@ 2012-04-17  5:25 Al Viro
  2012-04-17 15:01 ` Linus Torvalds
  2012-04-19  3:23 ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 41+ messages in thread
From: Al Viro @ 2012-04-17  5:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

	A bunch of endianness fixes plus a patch from bfields untangling
dependencies between vfs and nfsd trees; in principle, we could keep it
in nfsd tree (along with a bunch of followups that definitely belong there),
but Miklos' stuff in fs/namei.c steps fairly close to it and overlayfs
and unionfs series - even closer, so that would create serious PITA for
both, whichever tree it would sit in.

	Speaking of endianness stuff, I'm rather tempted to slap
ccflags-y += -D__CHECK_ENDIAN__ in fs/Makefile, if not making it
default for the entire tree; nfsd regressions I've caught make one
hell of a pile and we'd obviously benefit from having that kind of
stuff caught earlier...

Please, pull from the usual place -
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus

Shortlog:
Al Viro (12):
      nfsd: fix b0rken error value for setattr on read-only mount
      nfsd: fix error values returned by nfsd4_lockt() when nfsd_open() fails
      nfsd: fix endianness breakage in TEST_STATEID handling
      nfsd: fix error value on allocation failure in nfsd4_decode_test_stateid()
      nfsd: fix compose_entry_fh() failure exits
      ext4: fix endianness breakage in ext4_split_extent_at()
      btrfs: btrfs_root_readonly() broken on big-endian
      ocfs2: ->l_next_free_req breakage on big-endian
      ocfs: ->rl_used breakage on big-endian
      ocfs2: ->rl_count endianness breakage
      ocfs2: ->e_leaf_clusters endianness breakage
      lockd: fix the endianness bug

J. Bruce Fields (1):
      vfs: take i_mutex on renamed file

Diffstat:
 Documentation/filesystems/directory-locking |   11 ++++++-----
 fs/btrfs/ctree.h                            |    2 +-
 fs/ext4/extents.c                           |    2 +-
 fs/lockd/clnt4xdr.c                         |    2 +-
 fs/lockd/clntxdr.c                          |    2 +-
 fs/namei.c                                  |    3 +++
 fs/nfsd/nfs3xdr.c                           |   22 ++++++++--------------
 fs/nfsd/nfs4proc.c                          |    7 ++++---
 fs/nfsd/nfs4state.c                         |   23 +++++++++--------------
 fs/nfsd/nfs4xdr.c                           |    4 ++--
 fs/ocfs2/alloc.c                            |    2 +-
 fs/ocfs2/refcounttree.c                     |   12 ++++++------
 fs/ocfs2/suballoc.c                         |    4 ++--
 include/linux/fs.h                          |    9 ++++++---
 14 files changed, 51 insertions(+), 54 deletions(-)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17  5:25 [git pull] vfs and fs fixes Al Viro
@ 2012-04-17 15:01 ` Linus Torvalds
  2012-04-17 16:22   ` J. Bruce Fields
  2012-04-17 18:01   ` Al Viro
  2012-04-19  3:23 ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2012-04-17 15:01 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, linux-fsdevel

On Mon, Apr 16, 2012 at 10:25 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>        A bunch of endianness fixes plus a patch from bfields untangling
> dependencies between vfs and nfsd trees; in principle, we could keep it
> in nfsd tree (along with a bunch of followups that definitely belong there),
> but Miklos' stuff in fs/namei.c steps fairly close to it and overlayfs
> and unionfs series - even closer, so that would create serious PITA for
> both, whichever tree it would sit in.

Why is that double mutex taking in vfs_rename_other() safe from ABBA?

We aren't guaranteed to hold the s_vfs_rename_mutex, since the parent
directories may be the same.

And yes, we hold the i_mutex on that shared parent, but the inodes may
exist (hardlinked) in another directory, so another rename could be
doing the i_mutex in the reverse order.

Maybe there is some reason why that double lock is safe, but I don't
see it, and I want it clearly documented. So I'm not pulling this.

                        Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17 15:01 ` Linus Torvalds
@ 2012-04-17 16:22   ` J. Bruce Fields
  2012-04-17 16:33       ` Linus Torvalds
  2012-04-17 18:01   ` Al Viro
  1 sibling, 1 reply; 41+ messages in thread
From: J. Bruce Fields @ 2012-04-17 16:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-kernel, linux-fsdevel

On Tue, Apr 17, 2012 at 08:01:48AM -0700, Linus Torvalds wrote:
> On Mon, Apr 16, 2012 at 10:25 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >        A bunch of endianness fixes plus a patch from bfields untangling
> > dependencies between vfs and nfsd trees; in principle, we could keep it
> > in nfsd tree (along with a bunch of followups that definitely belong there),
> > but Miklos' stuff in fs/namei.c steps fairly close to it and overlayfs
> > and unionfs series - even closer, so that would create serious PITA for
> > both, whichever tree it would sit in.
> 
> Why is that double mutex taking in vfs_rename_other() safe from ABBA?
> 
> We aren't guaranteed to hold the s_vfs_rename_mutex, since the parent
> directories may be the same.
> 
> And yes, we hold the i_mutex on that shared parent, but the inodes may
> exist (hardlinked) in another directory, so another rename could be
> doing the i_mutex in the reverse order.
> 
> Maybe there is some reason why that double lock is safe, but I don't
> see it, and I want it clearly documented. So I'm not pulling this.

Ugh, no, I think you're right:

	rename A/a->A/b
	rename B/b->B/b

where A/a and B/a are the same file, and A/b and B/b are the same file,
can result in the first rename holding the lock on A and a and waiting
on b, and the second holding the lock on B and b and waiting on a.

--b.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17 16:22   ` J. Bruce Fields
@ 2012-04-17 16:33       ` Linus Torvalds
  0 siblings, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2012-04-17 16:33 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Al Viro, linux-kernel, linux-fsdevel

On Tue, Apr 17, 2012 at 9:22 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
>
> Ugh, no, I think you're right:
>
>        rename A/a->A/b
>        rename B/b->B/b
>
> where A/a and B/a are the same file, and A/b and B/b are the same file,
> can result in the first rename holding the lock on A and a and waiting
> on b, and the second holding the lock on B and b and waiting on a.

In fact I don't think you need even that much. Just a simple

  touch a
  ln a b
  mv a b

looks like it should deadlock on itself, no? source and dest inodes
will be the same, so the mutex_lock() will just deadlock without even
any ABBA race.

(I didn't really check - maybe there is some reason that doesn't happen).

                  Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
@ 2012-04-17 16:33       ` Linus Torvalds
  0 siblings, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2012-04-17 16:33 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Al Viro, linux-kernel, linux-fsdevel

On Tue, Apr 17, 2012 at 9:22 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
>
> Ugh, no, I think you're right:
>
>        rename A/a->A/b
>        rename B/b->B/b
>
> where A/a and B/a are the same file, and A/b and B/b are the same file,
> can result in the first rename holding the lock on A and a and waiting
> on b, and the second holding the lock on B and b and waiting on a.

In fact I don't think you need even that much. Just a simple

  touch a
  ln a b
  mv a b

looks like it should deadlock on itself, no? source and dest inodes
will be the same, so the mutex_lock() will just deadlock without even
any ABBA race.

(I didn't really check - maybe there is some reason that doesn't happen).

                  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17 16:33       ` Linus Torvalds
@ 2012-04-17 17:06         ` J. Bruce Fields
  -1 siblings, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2012-04-17 17:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-kernel, linux-fsdevel

On Tue, Apr 17, 2012 at 09:33:06AM -0700, Linus Torvalds wrote:
> On Tue, Apr 17, 2012 at 9:22 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> >
> > Ugh, no, I think you're right:
> >
> >        rename A/a->A/b
> >        rename B/b->B/b
> >
> > where A/a and B/a are the same file, and A/b and B/b are the same file,
> > can result in the first rename holding the lock on A and a and waiting
> > on b, and the second holding the lock on B and b and waiting on a.
> 
> In fact I don't think you need even that much. Just a simple
> 
>   touch a
>   ln a b
>   mv a b
> 
> looks like it should deadlock on itself, no? source and dest inodes
> will be the same, so the mutex_lock() will just deadlock without even
> any ABBA race.
> 
> (I didn't really check - maybe there is some reason that doesn't happen).

Yeah, rename has that funny exception that makes the above a no-op, so I
think that's safe.

But the patch is still wrong; back to the drawing board.

Maybe a paper bag over my head will help my concentration....

--b.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
@ 2012-04-17 17:06         ` J. Bruce Fields
  0 siblings, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2012-04-17 17:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-kernel, linux-fsdevel

On Tue, Apr 17, 2012 at 09:33:06AM -0700, Linus Torvalds wrote:
> On Tue, Apr 17, 2012 at 9:22 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> >
> > Ugh, no, I think you're right:
> >
> >        rename A/a->A/b
> >        rename B/b->B/b
> >
> > where A/a and B/a are the same file, and A/b and B/b are the same file,
> > can result in the first rename holding the lock on A and a and waiting
> > on b, and the second holding the lock on B and b and waiting on a.
> 
> In fact I don't think you need even that much. Just a simple
> 
>   touch a
>   ln a b
>   mv a b
> 
> looks like it should deadlock on itself, no? source and dest inodes
> will be the same, so the mutex_lock() will just deadlock without even
> any ABBA race.
> 
> (I didn't really check - maybe there is some reason that doesn't happen).

Yeah, rename has that funny exception that makes the above a no-op, so I
think that's safe.

But the patch is still wrong; back to the drawing board.

Maybe a paper bag over my head will help my concentration....

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17 16:33       ` Linus Torvalds
  (?)
  (?)
@ 2012-04-17 17:59       ` Al Viro
  -1 siblings, 0 replies; 41+ messages in thread
From: Al Viro @ 2012-04-17 17:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: J. Bruce Fields, linux-kernel, linux-fsdevel

On Tue, Apr 17, 2012 at 09:33:06AM -0700, Linus Torvalds wrote:
> On Tue, Apr 17, 2012 at 9:22 AM, J. Bruce Fields <bfields@fieldses.org> wrote:

> In fact I don't think you need even that much. Just a simple
> 
>   touch a
>   ln a b
>   mv a b
> 
> looks like it should deadlock on itself, no? source and dest inodes
> will be the same, so the mutex_lock() will just deadlock without even
> any ABBA race.

No, this one will bail out much earlier.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17 15:01 ` Linus Torvalds
  2012-04-17 16:22   ` J. Bruce Fields
@ 2012-04-17 18:01   ` Al Viro
  2012-04-17 18:28     ` Al Viro
  1 sibling, 1 reply; 41+ messages in thread
From: Al Viro @ 2012-04-17 18:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

On Tue, Apr 17, 2012 at 08:01:48AM -0700, Linus Torvalds wrote:
> On Mon, Apr 16, 2012 at 10:25 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > ? ? ? ?A bunch of endianness fixes plus a patch from bfields untangling
> > dependencies between vfs and nfsd trees; in principle, we could keep it
> > in nfsd tree (along with a bunch of followups that definitely belong there),
> > but Miklos' stuff in fs/namei.c steps fairly close to it and overlayfs
> > and unionfs series - even closer, so that would create serious PITA for
> > both, whichever tree it would sit in.
> 
> Why is that double mutex taking in vfs_rename_other() safe from ABBA?
> 
> We aren't guaranteed to hold the s_vfs_rename_mutex, since the parent
> directories may be the same.
> 
> And yes, we hold the i_mutex on that shared parent, but the inodes may
> exist (hardlinked) in another directory, so another rename could be
> doing the i_mutex in the reverse order.
> 
> Maybe there is some reason why that double lock is safe, but I don't
> see it, and I want it clearly documented. So I'm not pulling this.

It isn't.  Hell knows - I wonder if taking s_vfs_rename_mutex in all cases
in lock_rename() would be the right thing to do; it would remove the
problem, but might cost us too much contention...

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17 18:01   ` Al Viro
@ 2012-04-17 18:28     ` Al Viro
  2012-04-17 21:14       ` J. Bruce Fields
  0 siblings, 1 reply; 41+ messages in thread
From: Al Viro @ 2012-04-17 18:28 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

On Tue, Apr 17, 2012 at 07:01:29PM +0100, Al Viro wrote:

> It isn't.  Hell knows - I wonder if taking s_vfs_rename_mutex in all cases
> in lock_rename() would be the right thing to do; it would remove the
> problem, but might cost us too much contention...

Actually, it's even worse.  ext4_move_extents() locks a _pair_ of
->i_mutex (having checked that both are non-directories first).  In
i_ino order.  So the only plausible ordering would be
	* directories by tree order (with s_vfs_rename_mutex held to
stabilize the tree topology)
	* non-directories after all directories, ordered in some consistent
way.  Which would have to be by inumber if we want to leave ext4 code
as-is.

Bruce: for now I'm dropping that patch.  We _might_ take ext4
mutex_inode_double_lock() into fs/namei.c and have it used by
vfs_rename_other(), but I'm not convinced that this is the right
thing to do.  Is there any other sane way to deal with nfsd problem?
i_mutex is already used for more things than I'd like...

Linus: I've removed that sucker from the queue; it should propagate to
git.kernel.org in a few.  Could you pull the rest?  Same place, the stats
are below:

Shortlog:
Al Viro (12):
      nfsd: fix b0rken error value for setattr on read-only mount
      nfsd: fix error values returned by nfsd4_lockt() when nfsd_open() fails
      nfsd: fix endianness breakage in TEST_STATEID handling
      nfsd: fix error value on allocation failure in nfsd4_decode_test_stateid()
      nfsd: fix compose_entry_fh() failure exits
      ext4: fix endianness breakage in ext4_split_extent_at()
      btrfs: btrfs_root_readonly() broken on big-endian
      ocfs2: ->l_next_free_req breakage on big-endian
      ocfs: ->rl_used breakage on big-endian
      ocfs2: ->rl_count endianness breakage
      ocfs2: ->e_leaf_clusters endianness breakage
      lockd: fix the endianness bug

Diffstat:
 fs/btrfs/ctree.h        |    2 +-
 fs/ext4/extents.c       |    2 +-
 fs/lockd/clnt4xdr.c     |    2 +-
 fs/lockd/clntxdr.c      |    2 +-
 fs/nfsd/nfs3xdr.c       |   22 ++++++++--------------
 fs/nfsd/nfs4proc.c      |    7 ++++---
 fs/nfsd/nfs4state.c     |   23 +++++++++--------------
 fs/nfsd/nfs4xdr.c       |    4 ++--
 fs/ocfs2/alloc.c        |    2 +-
 fs/ocfs2/refcounttree.c |   12 ++++++------
 fs/ocfs2/suballoc.c     |    4 ++--
 11 files changed, 36 insertions(+), 46 deletions(-)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17 18:28     ` Al Viro
@ 2012-04-17 21:14       ` J. Bruce Fields
  2012-04-17 22:08         ` Linus Torvalds
  0 siblings, 1 reply; 41+ messages in thread
From: J. Bruce Fields @ 2012-04-17 21:14 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Tue, Apr 17, 2012 at 07:28:26PM +0100, Al Viro wrote:
> On Tue, Apr 17, 2012 at 07:01:29PM +0100, Al Viro wrote:
> 
> > It isn't.  Hell knows - I wonder if taking s_vfs_rename_mutex in all cases
> > in lock_rename() would be the right thing to do; it would remove the
> > problem, but might cost us too much contention...
> 
> Actually, it's even worse.  ext4_move_extents() locks a _pair_ of
> ->i_mutex (having checked that both are non-directories first).  In
> i_ino order.  So the only plausible ordering would be
> 	* directories by tree order (with s_vfs_rename_mutex held to
> stabilize the tree topology)
> 	* non-directories after all directories, ordered in some consistent
> way.  Which would have to be by inumber if we want to leave ext4 code
> as-is.
> 
> Bruce: for now I'm dropping that patch.  We _might_ take ext4
> mutex_inode_double_lock() into fs/namei.c and have it used by
> vfs_rename_other(), but I'm not convinced that this is the right
> thing to do.  Is there any other sane way to deal with nfsd problem?
> i_mutex is already used for more things than I'd like...

I don't want to give out a delegation while a rename, link, unlink, or
setattr of an inode is in progress.  All but rename are covered by the
i_mutex.

I'm happy just failing the delegation in case of conflict.

Maybe instead I could continue using the i_mutex but handle rename some
other way; e.g. in delegation code:

	if (!mutex_trylock(inode->i_mutex))
		return -EAGAIN;
	if (atomic_read(inode->i_renames_in_progress))
		return -EAGAIN;

and add an

	atomic_inc(inode->i_renames_in_progress);
	atomic_dec(inode->i_renames_in_progress);

pair around rename.

Or I could increment that counter for all the conflicting operations and
rely on it instead of the i_mutex.  I was trying to avoid adding
something like that (an inc, a dec, another error path) to every
operation.  And hoping to avoid adding another field to struct inode.
Oh well.

--b.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17 21:14       ` J. Bruce Fields
@ 2012-04-17 22:08         ` Linus Torvalds
  2012-04-17 23:44           ` Al Viro
  2012-04-18  0:47           ` J. Bruce Fields
  0 siblings, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2012-04-17 22:08 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Al Viro, linux-kernel, linux-fsdevel

On Tue, Apr 17, 2012 at 2:14 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Tue, Apr 17, 2012 at 07:28:26PM +0100, Al Viro wrote:>
> Maybe instead I could continue using the i_mutex but handle rename some
> other way; e.g. in delegation code:
>
>        if (!mutex_trylock(inode->i_mutex))
>                return -EAGAIN;
>        if (atomic_read(inode->i_renames_in_progress))
>                return -EAGAIN;
>
> and add an
>
>        atomic_inc(inode->i_renames_in_progress);
>        atomic_dec(inode->i_renames_in_progress);
>
> pair around rename.

Please don't make up your own locking. Plus it's broken anyway, since
a rename could come in directly after your atomic_read (and this is
*why* people shouldn't make up their own locks - they are invariably
broken).

> Or I could increment that counter for all the conflicting operations and
> rely on it instead of the i_mutex.  I was trying to avoid adding
> something like that (an inc, a dec, another error path) to every
> operation.  And hoping to avoid adding another field to struct inode.
> Oh well.

We could just say that we can do a double inode lock, but then
standardize on the order. And the only sane order is comparing inode
pointers, not inode numbers like ext4 apparently does.

With a standard order, I don't think it would be at all wrong to just
take the inode lock on rename.

                        Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17 22:08         ` Linus Torvalds
@ 2012-04-17 23:44           ` Al Viro
  2012-04-18  0:49             ` J. Bruce Fields
                               ` (3 more replies)
  2012-04-18  0:47           ` J. Bruce Fields
  1 sibling, 4 replies; 41+ messages in thread
From: Al Viro @ 2012-04-17 23:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: J. Bruce Fields, linux-kernel, linux-fsdevel

On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > Or I could increment that counter for all the conflicting operations and
> > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > something like that (an inc, a dec, another error path) to every
> > operation. ?And hoping to avoid adding another field to struct inode.
> > Oh well.
> 
> We could just say that we can do a double inode lock, but then
> standardize on the order. And the only sane order is comparing inode
> pointers, not inode numbers like ext4 apparently does.
> 
> With a standard order, I don't think it would be at all wrong to just
> take the inode lock on rename.

In principle, yes, but have you tried to grep for i_mutex?  Note that
we have *another* place where multiple ->i_mutex might be held on
non-directories (and unless I'm missing something, ext4 move_extent.c
stuff doesn't play well with it): quota writes.  Which can, AFAICS,
happen while write(2) is holding ->i_mutex on a regular file.  So
it's not _that_ easy - we want something like "and quota file is goes
last", since there we don't get to change the locking order - the first
->i_mutex is taken too far outside.

I really don't like how messy i_mutex had become these days.  Right now
I'm staring at 700-odd lines all over the place where it's taken/released
and it's a wastebucket lock - used to protect random bits and scraps, with a
lot of filesystems, etc. using it for purposes of their own ;-/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17 22:08         ` Linus Torvalds
  2012-04-17 23:44           ` Al Viro
@ 2012-04-18  0:47           ` J. Bruce Fields
  1 sibling, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2012-04-18  0:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-kernel, linux-fsdevel

On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> On Tue, Apr 17, 2012 at 2:14 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > On Tue, Apr 17, 2012 at 07:28:26PM +0100, Al Viro wrote:>
> > Maybe instead I could continue using the i_mutex but handle rename some
> > other way; e.g. in delegation code:
> >
> >        if (!mutex_trylock(inode->i_mutex))
> >                return -EAGAIN;
> >        if (atomic_read(inode->i_renames_in_progress))
> >                return -EAGAIN;
> >
> > and add an
> >
> >        atomic_inc(inode->i_renames_in_progress);
> >        atomic_dec(inode->i_renames_in_progress);
> >
> > pair around rename.
> 
> Please don't make up your own locking. Plus it's broken anyway, since
> a rename could come in directly after your atomic_read (and this is
> *why* people shouldn't make up their own locks - they are invariably
> broken).

Doh, yes, sounds like a good rule.  (I was misremembering some previous
attempt at this--which admittedly may just have failed in some more
complicated way.)

--b.

> > Or I could increment that counter for all the conflicting operations and
> > rely on it instead of the i_mutex.  I was trying to avoid adding
> > something like that (an inc, a dec, another error path) to every
> > operation.  And hoping to avoid adding another field to struct inode.
> > Oh well.
> 
> We could just say that we can do a double inode lock, but then
> standardize on the order. And the only sane order is comparing inode
> pointers, not inode numbers like ext4 apparently does.
> 
> With a standard order, I don't think it would be at all wrong to just
> take the inode lock on rename.
> 
>                         Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17 23:44           ` Al Viro
@ 2012-04-18  0:49             ` J. Bruce Fields
  2012-04-18  0:56             ` Linus Torvalds
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2012-04-18  0:49 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Wed, Apr 18, 2012 at 12:44:24AM +0100, Al Viro wrote:
> On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > Or I could increment that counter for all the conflicting operations and
> > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > something like that (an inc, a dec, another error path) to every
> > > operation. ?And hoping to avoid adding another field to struct inode.
> > > Oh well.
> > 
> > We could just say that we can do a double inode lock, but then
> > standardize on the order. And the only sane order is comparing inode
> > pointers, not inode numbers like ext4 apparently does.
> > 
> > With a standard order, I don't think it would be at all wrong to just
> > take the inode lock on rename.

I'll take a stab at that.  But:

> In principle, yes, but have you tried to grep for i_mutex?  Note that
> we have *another* place where multiple ->i_mutex might be held on
> non-directories (and unless I'm missing something, ext4 move_extent.c
> stuff doesn't play well with it): quota writes.  Which can, AFAICS,
> happen while write(2) is holding ->i_mutex on a regular file.  So
> it's not _that_ easy - we want something like "and quota file is goes
> last", since there we don't get to change the locking order - the first
> ->i_mutex is taken too far outside.
> 
> I really don't like how messy i_mutex had become these days.  Right now
> I'm staring at 700-odd lines all over the place where it's taken/released
> and it's a wastebucket lock - used to protect random bits and scraps, with a
> lot of filesystems, etc. using it for purposes of their own ;-/

I can understand not wanting the i_mutex to have another use.

--b.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17 23:44           ` Al Viro
  2012-04-18  0:49             ` J. Bruce Fields
@ 2012-04-18  0:56             ` Linus Torvalds
  2012-04-18 21:52             ` J. Bruce Fields
  2012-04-20 11:15             ` [git pull] vfs and fs fixes Jan Kara
  3 siblings, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2012-04-18  0:56 UTC (permalink / raw)
  To: Al Viro; +Cc: J. Bruce Fields, linux-kernel, linux-fsdevel

On Tue, Apr 17, 2012 at 4:44 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> In principle, yes, but have you tried to grep for i_mutex?

No, I hadn't, and right you are. That's a mess.

It's not the i_mutex hits themselves that look scary, but there's 75
hits on just the *nested* locking. That's way more than I would have
guessed without the grep.

The ext4 ones look pretty simple. But yeah, there's a lot of other
noise there...

                    Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17 23:44           ` Al Viro
  2012-04-18  0:49             ` J. Bruce Fields
  2012-04-18  0:56             ` Linus Torvalds
@ 2012-04-18 21:52             ` J. Bruce Fields
  2012-04-25 15:20               ` J. Bruce Fields
                                 ` (5 more replies)
  2012-04-20 11:15             ` [git pull] vfs and fs fixes Jan Kara
  3 siblings, 6 replies; 41+ messages in thread
From: J. Bruce Fields @ 2012-04-18 21:52 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Wed, Apr 18, 2012 at 12:44:24AM +0100, Al Viro wrote:
> On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > Or I could increment that counter for all the conflicting operations and
> > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > something like that (an inc, a dec, another error path) to every
> > > operation. ?And hoping to avoid adding another field to struct inode.
> > > Oh well.
> > 
> > We could just say that we can do a double inode lock, but then
> > standardize on the order. And the only sane order is comparing inode
> > pointers, not inode numbers like ext4 apparently does.
> > 
> > With a standard order, I don't think it would be at all wrong to just
> > take the inode lock on rename.
> 
> In principle, yes, but have you tried to grep for i_mutex?  Note that
> we have *another* place where multiple ->i_mutex might be held on
> non-directories (and unless I'm missing something, ext4 move_extent.c
> stuff doesn't play well with it): quota writes.  Which can, AFAICS,
> happen while write(2) is holding ->i_mutex on a regular file.  So
> it's not _that_ easy - we want something like "and quota file is goes
> last"

So the idea would be to always take the i_mutex on non-quota files
before taking it on quota files?

I tried pulling the ext4 thing into fs/inode.c, modifying the order to
do that, and then doing the rename change on top of that.

One thing I don't understand is how that interacts with
quota_on/quota_off.  How do we decide the right lock ordering if files
can go back and forth between being quota files?

--b.

> , since there we don't get to change the locking order - the first
> ->i_mutex is taken too far outside.
> 
> I really don't like how messy i_mutex had become these days.  Right now
> I'm staring at 700-odd lines all over the place where it's taken/released
> and it's a wastebucket lock - used to protect random bits and scraps, with a
> lot of filesystems, etc. using it for purposes of their own ;-/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17  5:25 [git pull] vfs and fs fixes Al Viro
  2012-04-17 15:01 ` Linus Torvalds
@ 2012-04-19  3:23 ` Benjamin Herrenschmidt
  2012-04-19 14:50   ` Ted Ts'o
  1 sibling, 1 reply; 41+ messages in thread
From: Benjamin Herrenschmidt @ 2012-04-19  3:23 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, Anton Blanchard

>       ext4: fix endianness breakage in ext4_split_extent_at()

That smells like something we'd want in -stable no ?

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-19  3:23 ` Benjamin Herrenschmidt
@ 2012-04-19 14:50   ` Ted Ts'o
  2012-04-24 17:40     ` Greg KH
  0 siblings, 1 reply; 41+ messages in thread
From: Ted Ts'o @ 2012-04-19 14:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Al Viro, Linus Torvalds, linux-kernel, linux-fsdevel, Anton Blanchard

On Thu, Apr 19, 2012 at 01:23:18PM +1000, Benjamin Herrenschmidt wrote:
> >       ext4: fix endianness breakage in ext4_split_extent_at()
> 
> That smells like something we'd want in -stable no ?

Yes, commit af1584f5 is something that should definitely go into
-stable.


					- Ted

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-17 23:44           ` Al Viro
                               ` (2 preceding siblings ...)
  2012-04-18 21:52             ` J. Bruce Fields
@ 2012-04-20 11:15             ` Jan Kara
  2012-04-24 19:52               ` J. Bruce Fields
  3 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2012-04-20 11:15 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, J. Bruce Fields, linux-kernel, linux-fsdevel

On Wed 18-04-12 00:44:24, Al Viro wrote:
> On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > Or I could increment that counter for all the conflicting operations and
> > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > something like that (an inc, a dec, another error path) to every
> > > operation. ?And hoping to avoid adding another field to struct inode.
> > > Oh well.
> > 
> > We could just say that we can do a double inode lock, but then
> > standardize on the order. And the only sane order is comparing inode
> > pointers, not inode numbers like ext4 apparently does.
> > 
> > With a standard order, I don't think it would be at all wrong to just
> > take the inode lock on rename.
> 
> In principle, yes, but have you tried to grep for i_mutex?  Note that
> we have *another* place where multiple ->i_mutex might be held on
> non-directories (and unless I'm missing something, ext4 move_extent.c
> stuff doesn't play well with it): quota writes.  Which can, AFAICS,
> happen while write(2) is holding ->i_mutex on a regular file.  So
> it's not _that_ easy - we want something like "and quota file is goes
> last", since there we don't get to change the locking order - the first
> ->i_mutex is taken too far outside.
  Hum, I think I could just do away with quota file i_mutex being special.
It's used for two purposes:
  1) When quota is being turned on/off, we want to set/clear inode immutable
flag, truncate page cache, etc. But we should be able push this locking
outside of quota locks.
  2) Inside filesystems when quota file is written to. Quota writes are
serialized by quota code anyway and noone else has any bussiness with quota
files (they are marked as immutable to avoid mistakes) so there i_mutex is
not really needed.

> I really don't like how messy i_mutex had become these days.  Right now
> I'm staring at 700-odd lines all over the place where it's taken/released
> and it's a wastebucket lock - used to protect random bits and scraps, with a
> lot of filesystems, etc. using it for purposes of their own ;-/

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-19 14:50   ` Ted Ts'o
@ 2012-04-24 17:40     ` Greg KH
  2012-04-24 17:45       ` Al Viro
  0 siblings, 1 reply; 41+ messages in thread
From: Greg KH @ 2012-04-24 17:40 UTC (permalink / raw)
  To: Ted Ts'o, Benjamin Herrenschmidt, Al Viro, Linus Torvalds,
	linux-kernel, linux-fsdevel, Anton Blanchard

On Thu, Apr 19, 2012 at 10:50:07AM -0400, Ted Ts'o wrote:
> On Thu, Apr 19, 2012 at 01:23:18PM +1000, Benjamin Herrenschmidt wrote:
> > >       ext4: fix endianness breakage in ext4_split_extent_at()
> > 
> > That smells like something we'd want in -stable no ?
> 
> Yes, commit af1584f5 is something that should definitely go into
> -stable.

Now queued up, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-24 17:40     ` Greg KH
@ 2012-04-24 17:45       ` Al Viro
  2012-04-24 17:59         ` Greg KH
  0 siblings, 1 reply; 41+ messages in thread
From: Al Viro @ 2012-04-24 17:45 UTC (permalink / raw)
  To: Greg KH
  Cc: Ted Ts'o, Benjamin Herrenschmidt, Linus Torvalds,
	linux-kernel, linux-fsdevel, Anton Blanchard

On Tue, Apr 24, 2012 at 10:40:06AM -0700, Greg KH wrote:
> On Thu, Apr 19, 2012 at 10:50:07AM -0400, Ted Ts'o wrote:
> > On Thu, Apr 19, 2012 at 01:23:18PM +1000, Benjamin Herrenschmidt wrote:
> > > >       ext4: fix endianness breakage in ext4_split_extent_at()
> > > 
> > > That smells like something we'd want in -stable no ?
> > 
> > Yes, commit af1584f5 is something that should definitely go into
> > -stable.
> 
> Now queued up, thanks.

Actually, the rest of commits from that pull also should go there; forgot
to add Cc: stable, my apologies...

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-24 17:45       ` Al Viro
@ 2012-04-24 17:59         ` Greg KH
  2012-04-24 18:04           ` Al Viro
  0 siblings, 1 reply; 41+ messages in thread
From: Greg KH @ 2012-04-24 17:59 UTC (permalink / raw)
  To: Al Viro
  Cc: Ted Ts'o, Benjamin Herrenschmidt, Linus Torvalds,
	linux-kernel, linux-fsdevel, Anton Blanchard

On Tue, Apr 24, 2012 at 06:45:02PM +0100, Al Viro wrote:
> On Tue, Apr 24, 2012 at 10:40:06AM -0700, Greg KH wrote:
> > On Thu, Apr 19, 2012 at 10:50:07AM -0400, Ted Ts'o wrote:
> > > On Thu, Apr 19, 2012 at 01:23:18PM +1000, Benjamin Herrenschmidt wrote:
> > > > >       ext4: fix endianness breakage in ext4_split_extent_at()
> > > > 
> > > > That smells like something we'd want in -stable no ?
> > > 
> > > Yes, commit af1584f5 is something that should definitely go into
> > > -stable.
> > 
> > Now queued up, thanks.
> 
> Actually, the rest of commits from that pull also should go there; forgot
> to add Cc: stable, my apologies...

So that would be commits
96f6f98501196d46ce52c2697dd758d9300c63f5..e847469bf77a1d339274074ed068d461f0c872bc
inclusive?

Or a different range?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-24 17:59         ` Greg KH
@ 2012-04-24 18:04           ` Al Viro
  2012-04-24 20:37             ` Greg KH
  0 siblings, 1 reply; 41+ messages in thread
From: Al Viro @ 2012-04-24 18:04 UTC (permalink / raw)
  To: Greg KH
  Cc: Ted Ts'o, Benjamin Herrenschmidt, Linus Torvalds,
	linux-kernel, linux-fsdevel, Anton Blanchard

On Tue, Apr 24, 2012 at 10:59:11AM -0700, Greg KH wrote:

> So that would be commits
> 96f6f98501196d46ce52c2697dd758d9300c63f5..e847469bf77a1d339274074ed068d461f0c872bc
> inclusive?

Yes.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-20 11:15             ` [git pull] vfs and fs fixes Jan Kara
@ 2012-04-24 19:52               ` J. Bruce Fields
  2012-04-24 22:23                 ` Jan Kara
  0 siblings, 1 reply; 41+ messages in thread
From: J. Bruce Fields @ 2012-04-24 19:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: Al Viro, Linus Torvalds, linux-kernel, linux-fsdevel

On Fri, Apr 20, 2012 at 01:15:17PM +0200, Jan Kara wrote:
> On Wed 18-04-12 00:44:24, Al Viro wrote:
> > On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > > Or I could increment that counter for all the conflicting operations and
> > > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > > something like that (an inc, a dec, another error path) to every
> > > > operation. ?And hoping to avoid adding another field to struct inode.
> > > > Oh well.
> > > 
> > > We could just say that we can do a double inode lock, but then
> > > standardize on the order. And the only sane order is comparing inode
> > > pointers, not inode numbers like ext4 apparently does.
> > > 
> > > With a standard order, I don't think it would be at all wrong to just
> > > take the inode lock on rename.
> > 
> > In principle, yes, but have you tried to grep for i_mutex?  Note that
> > we have *another* place where multiple ->i_mutex might be held on
> > non-directories (and unless I'm missing something, ext4 move_extent.c
> > stuff doesn't play well with it): quota writes.  Which can, AFAICS,
> > happen while write(2) is holding ->i_mutex on a regular file.  So
> > it's not _that_ easy - we want something like "and quota file is goes
> > last", since there we don't get to change the locking order - the first
> > ->i_mutex is taken too far outside.
>   Hum, I think I could just do away with quota file i_mutex being special.
> It's used for two purposes:
>   1) When quota is being turned on/off, we want to set/clear inode immutable
> flag, truncate page cache, etc. But we should be able push this locking
> outside of quota locks.
>   2) Inside filesystems when quota file is written to. Quota writes are
> serialized by quota code anyway and noone else has any bussiness with quota
> files (they are marked as immutable to avoid mistakes) so there i_mutex is
> not really needed.

Grepping for I_MUTEX_QUOTA shows hits in ext4, reiserfs, and gfs2.  The
former two are in code called from the quota code (through the
->quota_write method).  But the gfs2 code appears to be called directly
from gfs2's write code.

--b.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-24 18:04           ` Al Viro
@ 2012-04-24 20:37             ` Greg KH
  0 siblings, 0 replies; 41+ messages in thread
From: Greg KH @ 2012-04-24 20:37 UTC (permalink / raw)
  To: Al Viro
  Cc: Ted Ts'o, Benjamin Herrenschmidt, Linus Torvalds,
	linux-kernel, linux-fsdevel, Anton Blanchard

On Tue, Apr 24, 2012 at 07:04:07PM +0100, Al Viro wrote:
> On Tue, Apr 24, 2012 at 10:59:11AM -0700, Greg KH wrote:
> 
> > So that would be commits
> > 96f6f98501196d46ce52c2697dd758d9300c63f5..e847469bf77a1d339274074ed068d461f0c872bc
> > inclusive?
> 
> Yes.

Wonderful, now queued up, thanks for letting me know.

greg k-h

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-24 19:52               ` J. Bruce Fields
@ 2012-04-24 22:23                 ` Jan Kara
  2012-04-25 11:29                   ` J. Bruce Fields
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2012-04-24 22:23 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jan Kara, Al Viro, Linus Torvalds, linux-kernel, linux-fsdevel

On Tue 24-04-12 15:52:36, J. Bruce Fields wrote:
> On Fri, Apr 20, 2012 at 01:15:17PM +0200, Jan Kara wrote:
> > On Wed 18-04-12 00:44:24, Al Viro wrote:
> > > On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > > > Or I could increment that counter for all the conflicting operations and
> > > > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > > > something like that (an inc, a dec, another error path) to every
> > > > > operation. ?And hoping to avoid adding another field to struct inode.
> > > > > Oh well.
> > > > 
> > > > We could just say that we can do a double inode lock, but then
> > > > standardize on the order. And the only sane order is comparing inode
> > > > pointers, not inode numbers like ext4 apparently does.
> > > > 
> > > > With a standard order, I don't think it would be at all wrong to just
> > > > take the inode lock on rename.
> > > 
> > > In principle, yes, but have you tried to grep for i_mutex?  Note that
> > > we have *another* place where multiple ->i_mutex might be held on
> > > non-directories (and unless I'm missing something, ext4 move_extent.c
> > > stuff doesn't play well with it): quota writes.  Which can, AFAICS,
> > > happen while write(2) is holding ->i_mutex on a regular file.  So
> > > it's not _that_ easy - we want something like "and quota file is goes
> > > last", since there we don't get to change the locking order - the first
> > > ->i_mutex is taken too far outside.
> >   Hum, I think I could just do away with quota file i_mutex being special.
> > It's used for two purposes:
> >   1) When quota is being turned on/off, we want to set/clear inode immutable
> > flag, truncate page cache, etc. But we should be able push this locking
> > outside of quota locks.
> >   2) Inside filesystems when quota file is written to. Quota writes are
> > serialized by quota code anyway and noone else has any bussiness with quota
> > files (they are marked as immutable to avoid mistakes) so there i_mutex is
> > not really needed.
> 
> Grepping for I_MUTEX_QUOTA shows hits in ext4, reiserfs, and gfs2.  The
> former two are in code called from the quota code (through the
> ->quota_write method).  But the gfs2 code appears to be called directly
> from gfs2's write code.
  Ah, gfs2 doesn't use generic quota code so whatever it does is it's own
invention. For ext4 and reiserfs I could get rid of I_MUTEX_QUOTA as I
wrote.

								Honza

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-24 22:23                 ` Jan Kara
@ 2012-04-25 11:29                   ` J. Bruce Fields
  2012-04-25 16:26                     ` Jan Kara
  0 siblings, 1 reply; 41+ messages in thread
From: J. Bruce Fields @ 2012-04-25 11:29 UTC (permalink / raw)
  To: Jan Kara; +Cc: Al Viro, Linus Torvalds, linux-kernel, linux-fsdevel

On Wed, Apr 25, 2012 at 12:23:12AM +0200, Jan Kara wrote:
> On Tue 24-04-12 15:52:36, J. Bruce Fields wrote:
> > On Fri, Apr 20, 2012 at 01:15:17PM +0200, Jan Kara wrote:
> > > On Wed 18-04-12 00:44:24, Al Viro wrote:
> > > > On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > > > > Or I could increment that counter for all the conflicting operations and
> > > > > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > > > > something like that (an inc, a dec, another error path) to every
> > > > > > operation. ?And hoping to avoid adding another field to struct inode.
> > > > > > Oh well.
> > > > > 
> > > > > We could just say that we can do a double inode lock, but then
> > > > > standardize on the order. And the only sane order is comparing inode
> > > > > pointers, not inode numbers like ext4 apparently does.
> > > > > 
> > > > > With a standard order, I don't think it would be at all wrong to just
> > > > > take the inode lock on rename.
> > > > 
> > > > In principle, yes, but have you tried to grep for i_mutex?  Note that
> > > > we have *another* place where multiple ->i_mutex might be held on
> > > > non-directories (and unless I'm missing something, ext4 move_extent.c
> > > > stuff doesn't play well with it): quota writes.  Which can, AFAICS,
> > > > happen while write(2) is holding ->i_mutex on a regular file.  So
> > > > it's not _that_ easy - we want something like "and quota file is goes
> > > > last", since there we don't get to change the locking order - the first
> > > > ->i_mutex is taken too far outside.
> > >   Hum, I think I could just do away with quota file i_mutex being special.
> > > It's used for two purposes:
> > >   1) When quota is being turned on/off, we want to set/clear inode immutable
> > > flag, truncate page cache, etc. But we should be able push this locking
> > > outside of quota locks.
> > >   2) Inside filesystems when quota file is written to. Quota writes are
> > > serialized by quota code anyway and noone else has any bussiness with quota
> > > files (they are marked as immutable to avoid mistakes) so there i_mutex is
> > > not really needed.
> > 
> > Grepping for I_MUTEX_QUOTA shows hits in ext4, reiserfs, and gfs2.  The
> > former two are in code called from the quota code (through the
> > ->quota_write method).  But the gfs2 code appears to be called directly
> > from gfs2's write code.
>   Ah, gfs2 doesn't use generic quota code so whatever it does is it's own
> invention. For ext4 and reiserfs I could get rid of I_MUTEX_QUOTA as I
> wrote.

So, just the appended?

But unfortunately as long as that's left in gfs2 we're still stuck
trying to order quota files after other files when we take two
non-directory i_mutexes elsewhere.

--b.

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index e1025c7..1a6fb52 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1441,7 +1441,6 @@ static ssize_t ext2_quota_write(struct super_block *sb, int type,
 	struct buffer_head tmp_bh;
 	struct buffer_head *bh;
 
-	mutex_lock_nested(&inode->i_mutex, I_MUTEX_QUOTA);
 	while (towrite > 0) {
 		tocopy = sb->s_blocksize - offset < towrite ?
 				sb->s_blocksize - offset : towrite;
@@ -1471,16 +1470,13 @@ static ssize_t ext2_quota_write(struct super_block *sb, int type,
 		blk++;
 	}
 out:
-	if (len == towrite) {
-		mutex_unlock(&inode->i_mutex);
+	if (len == towrite)
 		return err;
-	}
 	if (inode->i_size < off+len-towrite)
 		i_size_write(inode, off+len-towrite);
 	inode->i_version++;
 	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 	mark_inode_dirty(inode);
-	mutex_unlock(&inode->i_mutex);
 	return len - towrite;
 }
 
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index cf0b592..7c08c93 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -3000,7 +3000,6 @@ static ssize_t ext3_quota_write(struct super_block *sb, int type,
 			(unsigned long long)off, (unsigned long long)len);
 		return -EIO;
 	}
-	mutex_lock_nested(&inode->i_mutex, I_MUTEX_QUOTA);
 	bh = ext3_bread(handle, inode, blk, 1, &err);
 	if (!bh)
 		goto out;
@@ -3024,10 +3023,8 @@ static ssize_t ext3_quota_write(struct super_block *sb, int type,
 	}
 	brelse(bh);
 out:
-	if (err) {
-		mutex_unlock(&inode->i_mutex);
+	if (err)
 		return err;
-	}
 	if (inode->i_size < off + len) {
 		i_size_write(inode, off + len);
 		EXT3_I(inode)->i_disksize = inode->i_size;
@@ -3035,7 +3032,6 @@ out:
 	inode->i_version++;
 	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 	ext3_mark_inode_dirty(handle, inode);
-	mutex_unlock(&inode->i_mutex);
 	return len;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ceebaf8..97938db 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4760,7 +4760,6 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type,
 		return -EIO;
 	}
 
-	mutex_lock_nested(&inode->i_mutex, I_MUTEX_QUOTA);
 	bh = ext4_bread(handle, inode, blk, 1, &err);
 	if (!bh)
 		goto out;
@@ -4776,16 +4775,13 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type,
 	err = ext4_handle_dirty_metadata(handle, NULL, bh);
 	brelse(bh);
 out:
-	if (err) {
-		mutex_unlock(&inode->i_mutex);
+	if (err)
 		return err;
-	}
 	if (inode->i_size < off + len) {
 		i_size_write(inode, off + len);
 		EXT4_I(inode)->i_disksize = inode->i_size;
 		ext4_mark_inode_dirty(handle, inode);
 	}
-	mutex_unlock(&inode->i_mutex);
 	return len;
 }
 
diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
index 8b7616e..c07b7d7 100644
--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -2270,7 +2270,6 @@ static ssize_t reiserfs_quota_write(struct super_block *sb, int type,
 			(unsigned long long)off, (unsigned long long)len);
 		return -EIO;
 	}
-	mutex_lock_nested(&inode->i_mutex, I_MUTEX_QUOTA);
 	while (towrite > 0) {
 		tocopy = sb->s_blocksize - offset < towrite ?
 		    sb->s_blocksize - offset : towrite;
@@ -2302,16 +2301,13 @@ static ssize_t reiserfs_quota_write(struct super_block *sb, int type,
 		blk++;
 	}
 out:
-	if (len == towrite) {
-		mutex_unlock(&inode->i_mutex);
+	if (len == towrite)
 		return err;
-	}
 	if (inode->i_size < off + len - towrite)
 		i_size_write(inode, off + len - towrite);
 	inode->i_version++;
 	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 	mark_inode_dirty(inode);
-	mutex_unlock(&inode->i_mutex);
 	return len - towrite;
 }
 

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-18 21:52             ` J. Bruce Fields
@ 2012-04-25 15:20               ` J. Bruce Fields
  2012-04-25 15:22               ` [PATCH 1/5] vfs: fix outdated i_mutex_lock_class documentation bfields
                                 ` (4 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2012-04-25 15:20 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Wed, Apr 18, 2012 at 05:52:38PM -0400, J. Bruce Fields wrote:
> On Wed, Apr 18, 2012 at 12:44:24AM +0100, Al Viro wrote:
> > On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > > Or I could increment that counter for all the conflicting operations and
> > > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > > something like that (an inc, a dec, another error path) to every
> > > > operation. ?And hoping to avoid adding another field to struct inode.
> > > > Oh well.
> > > 
> > > We could just say that we can do a double inode lock, but then
> > > standardize on the order. And the only sane order is comparing inode
> > > pointers, not inode numbers like ext4 apparently does.
> > > 
> > > With a standard order, I don't think it would be at all wrong to just
> > > take the inode lock on rename.
> > 
> > In principle, yes, but have you tried to grep for i_mutex?  Note that
> > we have *another* place where multiple ->i_mutex might be held on
> > non-directories (and unless I'm missing something, ext4 move_extent.c
> > stuff doesn't play well with it): quota writes.  Which can, AFAICS,
> > happen while write(2) is holding ->i_mutex on a regular file.  So
> > it's not _that_ easy - we want something like "and quota file is goes
> > last"
> 
> So the idea would be to always take the i_mutex on non-quota files
> before taking it on quota files?
> 
> I tried pulling the ext4 thing into fs/inode.c, modifying the order to
> do that, and then doing the rename change on top of that.

Patches follow, with the ordering change at the end.

And a documentation fix that I suppose could go in whenever.

--b.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/5] vfs: fix outdated i_mutex_lock_class documentation
  2012-04-18 21:52             ` J. Bruce Fields
  2012-04-25 15:20               ` J. Bruce Fields
@ 2012-04-25 15:22               ` bfields
  2012-04-25 15:22               ` [PATCH 2/5] vfs: pull ext4's double-i_mutex-locking into common code bfields
                                 ` (3 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: bfields @ 2012-04-25 15:22 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 include/linux/fs.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8de6755..f31e45c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -860,7 +860,8 @@ static inline int inode_unhashed(struct inode *inode)
  * 0: the object of the current VFS operation
  * 1: parent
  * 2: child/target
- * 3: quota file
+ * 3: xattr
+ * 4: quota file
  *
  * The locking order between these classes is
  * parent -> child -> normal -> xattr -> quota
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 2/5] vfs: pull ext4's double-i_mutex-locking into common code
  2012-04-18 21:52             ` J. Bruce Fields
  2012-04-25 15:20               ` J. Bruce Fields
  2012-04-25 15:22               ` [PATCH 1/5] vfs: fix outdated i_mutex_lock_class documentation bfields
@ 2012-04-25 15:22               ` bfields
  2012-04-25 15:22               ` [PATCH 3/5] vfs: don't use PARENT/CHILD lock classes for non-directories bfields
                                 ` (2 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: bfields @ 2012-04-25 15:22 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

We want to do this elsewhere as well.

Also, compare pointers instead of inode numbers.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/ext4/move_extent.c |   21 ++-------------------
 fs/inode.c            |   36 ++++++++++++++++++++++++++++++++++++
 include/linux/fs.h    |    3 +++
 3 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index c5826c6..b87d94a 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -1086,19 +1086,7 @@ mext_inode_double_lock(struct inode *inode1, struct inode *inode2)
 	if (ret < 0)
 		goto out;
 
-	if (inode1 == inode2) {
-		mutex_lock(&inode1->i_mutex);
-		goto out;
-	}
-
-	if (inode1->i_ino < inode2->i_ino) {
-		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
-	} else {
-		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
-	}
-
+	lock_two_nondirectories(inode1, inode2);
 out:
 	return ret;
 }
@@ -1123,12 +1111,7 @@ mext_inode_double_unlock(struct inode *inode1, struct inode *inode2)
 	if (ret < 0)
 		goto out;
 
-	if (inode1)
-		mutex_unlock(&inode1->i_mutex);
-
-	if (inode2 && inode2 != inode1)
-		mutex_unlock(&inode2->i_mutex);
-
+	unlock_two_nondirectories(inode1, inode2);
 out:
 	return ret;
 }
diff --git a/fs/inode.c b/fs/inode.c
index 9f4f5fe..9718f1c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -962,6 +962,42 @@ void unlock_new_inode(struct inode *inode)
 EXPORT_SYMBOL(unlock_new_inode);
 
 /**
+ * lock_two_nondirectories - take two i_mutexes on non-directory objects
+ * @inode1: first inode to lock; must be non-NULL
+ * @inode2: second inode to lock; optional, may equal first or be NULL.
+ */
+void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
+{
+	if (inode1 == inode2 || inode2 == NULL)
+		mutex_lock(&inode1->i_mutex);
+	else if (inode1 < inode2) {
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+
+	} else {
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
+	}
+}
+EXPORT_SYMBOL(lock_two_nondirectories);
+
+/**
+ * lock_two_nondirectories - release locks from lock_two_nondirectories()
+ * @inode1: first inode to unlock
+ * @inode2: second inode to lock
+ *
+ * Arguments must be same as those given to corresponding
+ * lock_two_nondirectories() call.
+ */
+void unlock_two_nondirectories(struct inode *inode1, struct inode *inode2)
+{
+	mutex_unlock(&inode1->i_mutex);
+	if (inode2 && inode2 != inode1)
+		mutex_unlock(&inode2->i_mutex);
+}
+EXPORT_SYMBOL(unlock_two_nondirectories);
+
+/**
  * iget5_locked - obtain an inode from a mounted file system
  * @sb:		super block of file system
  * @hashval:	hash value (usually inode number) to get
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f31e45c..0bee727 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -875,6 +875,9 @@ enum inode_i_mutex_lock_class
 	I_MUTEX_QUOTA
 };
 
+void lock_two_nondirectories(struct inode *, struct inode*);
+void unlock_two_nondirectories(struct inode *, struct inode*);
+
 /*
  * NOTE: in a 32bit arch with a preemptable kernel and
  * an UP compile the i_size_read/write must be atomic
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 3/5] vfs: don't use PARENT/CHILD lock classes for non-directories
  2012-04-18 21:52             ` J. Bruce Fields
                                 ` (2 preceding siblings ...)
  2012-04-25 15:22               ` [PATCH 2/5] vfs: pull ext4's double-i_mutex-locking into common code bfields
@ 2012-04-25 15:22               ` bfields
  2012-04-25 15:22               ` [PATCH 4/5] vfs: take i_mutex on renamed file bfields
  2012-04-25 15:22               ` [PATCH 5/5] vfs: change nondirectory i_mutex ordering to fix quota deadlock bfields
  5 siblings, 0 replies; 41+ messages in thread
From: bfields @ 2012-04-25 15:22 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

Reserve I_MUTEX_PARENT and I_MUTEX_CHILD for locking of actual
directories.

(Since I_MUTEX_QUOTA is now used just any time we need to lock a second
non-directory object, not necessarily a quota file, perhaps it should
get some more generic name, like I_MUTEX_NONDIR2.)

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/inode.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 9718f1c..487c924 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -971,12 +971,12 @@ void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
 	if (inode1 == inode2 || inode2 == NULL)
 		mutex_lock(&inode1->i_mutex);
 	else if (inode1 < inode2) {
-		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+		mutex_lock(&inode1->i_mutex);
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_QUOTA);
 
 	} else {
-		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
+		mutex_lock(&inode2->i_mutex);
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_QUOTA);
 	}
 }
 EXPORT_SYMBOL(lock_two_nondirectories);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 4/5] vfs: take i_mutex on renamed file
  2012-04-18 21:52             ` J. Bruce Fields
                                 ` (3 preceding siblings ...)
  2012-04-25 15:22               ` [PATCH 3/5] vfs: don't use PARENT/CHILD lock classes for non-directories bfields
@ 2012-04-25 15:22               ` bfields
  2012-04-25 15:22               ` [PATCH 5/5] vfs: change nondirectory i_mutex ordering to fix quota deadlock bfields
  5 siblings, 0 replies; 41+ messages in thread
From: bfields @ 2012-04-25 15:22 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

A read delegation is used by NFSv4 as a guarantee that a client can
perform local read opens without informing the server.

The open operation takes the last component of the pathname as an
argument, thus is also a lookup operation, and giving the client the
above guarantee means informing the client before we allow anything that
would change the set of names pointing to the inode.

Therefore, we need to break delegations on rename, link, and unlink.

We also need to prevent new delegations from being acquired while one of
these operations is in progress.

We could add some completely new locking for that purpose, but it's
simpler to use the i_mutex, since that's already taken by all the
operations we care about.

The single exception is rename.  So, modify rename to take the i_mutex
on the file that is being renamed.

Also fix up lockdep and Documentation/filesystems/directory-locking to
reflect the change.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 Documentation/filesystems/directory-locking |   30 ++++++++++++++++++--------
 fs/namei.c                                  |    7 ++---
 2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/Documentation/filesystems/directory-locking b/Documentation/filesystems/directory-locking
index ff7b611..9e8a629 100644
--- a/Documentation/filesystems/directory-locking
+++ b/Documentation/filesystems/directory-locking
@@ -2,6 +2,10 @@
 kinds of locks - per-inode (->i_mutex) and per-filesystem
 (->s_vfs_rename_mutex).
 
+	When taking the i_mutex on multiple non-directory objects, we
+always acquire the locks in order by increasing address.  We'll call
+that "inode pointer" order in the following.
+
 	For our purposes all operations fall in 5 classes:
 
 1) read access.  Locking rules: caller locks directory we are accessing.
@@ -12,8 +16,9 @@ kinds of locks - per-inode (->i_mutex) and per-filesystem
 locks victim and calls the method.
 
 4) rename() that is _not_ cross-directory.  Locking rules: caller locks
-the parent, finds source and target, if target already exists - locks it
-and then calls the method.
+the parent and finds source and target.  If source and target both
+exist, they are locked in inode pointer order.  Otherwise lock just
+source.  Then call method.
 
 5) link creation.  Locking rules:
 	* lock parent
@@ -30,7 +35,8 @@ rules:
 		fail with -ENOTEMPTY
 	* if new parent is equal to or is a descendent of source
 		fail with -ELOOP
-	* if target exists - lock it.
+	* If target exists, lock both source and target, in inode
+	  pointer order.  Otherwise lock just source.
 	* call the method.
 
 
@@ -56,9 +62,11 @@ objects - A < B iff A is an ancestor of B.
     renames will be blocked on filesystem lock and we don't start changing
     the order until we had acquired all locks).
 
-(3) any operation holds at most one lock on non-directory object and
-    that lock is acquired after all other locks.  (Proof: see descriptions
-    of operations).
+(3) locks on non-directory objects are acquired only after locks on
+    directory objects, and are acquired in inode pointer order.
+    (Proof: all operations but renames take lock on at most one
+    non-directory object, except renames, which take locks on source and
+    target in inode pointer order.)
 
 	Now consider the minimal deadlock.  Each process is blocked on
 attempt to acquire some lock and already holds at least one lock.  Let's
@@ -66,9 +74,13 @@ consider the set of contended locks.  First of all, filesystem lock is
 not contended, since any process blocked on it is not holding any locks.
 Thus all processes are blocked on ->i_mutex.
 
-	Non-directory objects are not contended due to (3).  Thus link
-creation can't be a part of deadlock - it can't be blocked on source
-and it means that it doesn't hold any locks.
+	By (3), any process holding a non-directory lock can only be
+waiting on another non-directory lock with a larger address.  Therefore
+the process holding the "largest" such lock can always make progress, and
+non-directory objects are not included in the set of contended locks.
+
+	Thus link creation can't be a part of deadlock - it can't be
+blocked on source and it means that it doesn't hold any locks.
 
 	Any contended object is either held by cross-directory rename or
 has a child that is also contended.  Indeed, suppose that it is held by
diff --git a/fs/namei.c b/fs/namei.c
index 0062dd1..ee5fdd3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3165,6 +3165,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
 			    struct inode *new_dir, struct dentry *new_dentry)
 {
 	struct inode *target = new_dentry->d_inode;
+	struct inode *source = old_dentry->d_inode;
 	int error;
 
 	error = security_inode_rename(old_dir, old_dentry, new_dir, new_dentry);
@@ -3172,8 +3173,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
 		return error;
 
 	dget(new_dentry);
-	if (target)
-		mutex_lock(&target->i_mutex);
+	lock_two_nondirectories(source, target);
 
 	error = -EBUSY;
 	if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
@@ -3188,8 +3188,7 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
 	if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
 		d_move(old_dentry, new_dentry);
 out:
-	if (target)
-		mutex_unlock(&target->i_mutex);
+	unlock_two_nondirectories(source, target);
 	dput(new_dentry);
 	return error;
 }
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 5/5] vfs: change nondirectory i_mutex ordering to fix quota deadlock
  2012-04-18 21:52             ` J. Bruce Fields
                                 ` (4 preceding siblings ...)
  2012-04-25 15:22               ` [PATCH 4/5] vfs: take i_mutex on renamed file bfields
@ 2012-04-25 15:22               ` bfields
  2012-04-25 15:28                 ` J. Bruce Fields
  5 siblings, 1 reply; 41+ messages in thread
From: bfields @ 2012-04-25 15:22 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, J. Bruce Fields

From: "J. Bruce Fields" <bfields@redhat.com>

A write can take an i_mutex on a quota file while holding the i_mutex on
the file being written to.

And both rename and fs/ext4/move_extent.c:mext_inode_double_lock() can
also take the i_mutex on two regular files.

Either of those could take locks in opposite order from a quota file
write, and end up deadlocked.

Changing the locking order in the quota-update-while-writing case looks
hard.  So, instead, change the order in the mext_inode_double_lock()
case so that the i_mutex is always taken on a quota file after being
taken on a file that isn't a quota file.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 Documentation/filesystems/directory-locking |   25 +++++++++++++++----------
 fs/inode.c                                  |   13 ++++++++++++-
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/Documentation/filesystems/directory-locking b/Documentation/filesystems/directory-locking
index 9e8a629..022d94f 100644
--- a/Documentation/filesystems/directory-locking
+++ b/Documentation/filesystems/directory-locking
@@ -3,8 +3,12 @@ kinds of locks - per-inode (->i_mutex) and per-filesystem
 (->s_vfs_rename_mutex).
 
 	When taking the i_mutex on multiple non-directory objects, we
-always acquire the locks in order by increasing address.  We'll call
-that "inode pointer" order in the following.
+always acquire them in the following order (which we'll call "the usual
+order" in the following):
+
+	* non-IS_NOQUOTA inodes before IS_NOQUOTA inodes
+	* within each category, inodes with smaller addresses before
+	  inodes with larger addresses
 
 	For our purposes all operations fall in 5 classes:
 
@@ -17,7 +21,7 @@ locks victim and calls the method.
 
 4) rename() that is _not_ cross-directory.  Locking rules: caller locks
 the parent and finds source and target.  If source and target both
-exist, they are locked in inode pointer order.  Otherwise lock just
+exist, they are locked in the usual order.  Otherwise lock just
 source.  Then call method.
 
 5) link creation.  Locking rules:
@@ -35,8 +39,8 @@ rules:
 		fail with -ENOTEMPTY
 	* if new parent is equal to or is a descendent of source
 		fail with -ELOOP
-	* If target exists, lock both source and target, in inode
-	  pointer order.  Otherwise lock just source.
+	* If target exists, lock both source and target, in the
+	  usual order.  Otherwise lock just source.
 	* call the method.
 
 
@@ -63,10 +67,10 @@ objects - A < B iff A is an ancestor of B.
     the order until we had acquired all locks).
 
 (3) locks on non-directory objects are acquired only after locks on
-    directory objects, and are acquired in inode pointer order.
+    directory objects, and are acquired in the usual order.
     (Proof: all operations but renames take lock on at most one
     non-directory object, except renames, which take locks on source and
-    target in inode pointer order.)
+    target in the usual order.)
 
 	Now consider the minimal deadlock.  Each process is blocked on
 attempt to acquire some lock and already holds at least one lock.  Let's
@@ -75,9 +79,10 @@ not contended, since any process blocked on it is not holding any locks.
 Thus all processes are blocked on ->i_mutex.
 
 	By (3), any process holding a non-directory lock can only be
-waiting on another non-directory lock with a larger address.  Therefore
-the process holding the "largest" such lock can always make progress, and
-non-directory objects are not included in the set of contended locks.
+waiting on another non-directory that is "larger" in the usual order.
+Therefore the process holding the "largest" such lock can always make
+progress, and non-directory objects are not included in the set of
+contended locks.
 
 	Thus link creation can't be a part of deadlock - it can't be
 blocked on source and it means that it doesn't hold any locks.
diff --git a/fs/inode.c b/fs/inode.c
index 487c924..13d23b6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -961,6 +961,17 @@ void unlock_new_inode(struct inode *inode)
 }
 EXPORT_SYMBOL(unlock_new_inode);
 
+/*
+ * We order !IS_NOQUOTA files before ISNOQUOTA files, and by pointer
+ * within each category.
+ */
+static bool nondir_mutex_ordered(struct inode *inode1, struct inode *inode2)
+{
+	if (IS_NOQUOTA(inode1) == IS_NOQUOTA(inode2))
+		return inode1 < inode2;
+	return IS_NOQUOTA(inode2);
+}
+
 /**
  * lock_two_nondirectories - take two i_mutexes on non-directory objects
  * @inode1: first inode to lock; must be non-NULL
@@ -970,7 +981,7 @@ void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
 {
 	if (inode1 == inode2 || inode2 == NULL)
 		mutex_lock(&inode1->i_mutex);
-	else if (inode1 < inode2) {
+	else if (nondir_mutex_ordered(inode1, inode2)) {
 		mutex_lock(&inode1->i_mutex);
 		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_QUOTA);
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/5] vfs: change nondirectory i_mutex ordering to fix quota deadlock
  2012-04-25 15:22               ` [PATCH 5/5] vfs: change nondirectory i_mutex ordering to fix quota deadlock bfields
@ 2012-04-25 15:28                 ` J. Bruce Fields
  2012-04-25 19:53                   ` Jan Kara
  0 siblings, 1 reply; 41+ messages in thread
From: J. Bruce Fields @ 2012-04-25 15:28 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, J. Bruce Fields

On Wed, Apr 25, 2012 at 11:22:09AM -0400, bfields@fieldses.org wrote:
> diff --git a/fs/inode.c b/fs/inode.c
> index 487c924..13d23b6 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -961,6 +961,17 @@ void unlock_new_inode(struct inode *inode)
>  }
>  EXPORT_SYMBOL(unlock_new_inode);
>  
> +/*
> + * We order !IS_NOQUOTA files before ISNOQUOTA files, and by pointer
> + * within each category.
> + */
> +static bool nondir_mutex_ordered(struct inode *inode1, struct inode *inode2)
> +{
> +	if (IS_NOQUOTA(inode1) == IS_NOQUOTA(inode2))
> +		return inode1 < inode2;
> +	return IS_NOQUOTA(inode2);
> +}

This seems kind of awful.  Is it what you were thinking of originally,
Al, and could we live with it?

> +
>  /**
>   * lock_two_nondirectories - take two i_mutexes on non-directory objects
>   * @inode1: first inode to lock; must be non-NULL
> @@ -970,7 +981,7 @@ void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
>  {
>  	if (inode1 == inode2 || inode2 == NULL)
>  		mutex_lock(&inode1->i_mutex);
> -	else if (inode1 < inode2) {
> +	else if (nondir_mutex_ordered(inode1, inode2)) {
>  		mutex_lock(&inode1->i_mutex);
>  		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_QUOTA);

But I still don't see how to stop this code racing with S_NOQUOTA being
toggled.

--b.

>  
> -- 
> 1.7.5.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-25 11:29                   ` J. Bruce Fields
@ 2012-04-25 16:26                     ` Jan Kara
  2012-04-25 16:47                       ` Steven Whitehouse
  2012-04-25 17:11                       ` J. Bruce Fields
  0 siblings, 2 replies; 41+ messages in thread
From: Jan Kara @ 2012-04-25 16:26 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jan Kara, Al Viro, Linus Torvalds, linux-kernel, linux-fsdevel,
	Steven Whitehouse

On Wed 25-04-12 07:29:30, J. Bruce Fields wrote:
> On Wed, Apr 25, 2012 at 12:23:12AM +0200, Jan Kara wrote:
> > On Tue 24-04-12 15:52:36, J. Bruce Fields wrote:
> > > On Fri, Apr 20, 2012 at 01:15:17PM +0200, Jan Kara wrote:
> > > > On Wed 18-04-12 00:44:24, Al Viro wrote:
> > > > > On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > > > > > Or I could increment that counter for all the conflicting operations and
> > > > > > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > > > > > something like that (an inc, a dec, another error path) to every
> > > > > > > operation. ?And hoping to avoid adding another field to struct inode.
> > > > > > > Oh well.
> > > > > > 
> > > > > > We could just say that we can do a double inode lock, but then
> > > > > > standardize on the order. And the only sane order is comparing inode
> > > > > > pointers, not inode numbers like ext4 apparently does.
> > > > > > 
> > > > > > With a standard order, I don't think it would be at all wrong to just
> > > > > > take the inode lock on rename.
> > > > > 
> > > > > In principle, yes, but have you tried to grep for i_mutex?  Note that
> > > > > we have *another* place where multiple ->i_mutex might be held on
> > > > > non-directories (and unless I'm missing something, ext4 move_extent.c
> > > > > stuff doesn't play well with it): quota writes.  Which can, AFAICS,
> > > > > happen while write(2) is holding ->i_mutex on a regular file.  So
> > > > > it's not _that_ easy - we want something like "and quota file is goes
> > > > > last", since there we don't get to change the locking order - the first
> > > > > ->i_mutex is taken too far outside.
> > > >   Hum, I think I could just do away with quota file i_mutex being special.
> > > > It's used for two purposes:
> > > >   1) When quota is being turned on/off, we want to set/clear inode immutable
> > > > flag, truncate page cache, etc. But we should be able push this locking
> > > > outside of quota locks.
> > > >   2) Inside filesystems when quota file is written to. Quota writes are
> > > > serialized by quota code anyway and noone else has any bussiness with quota
> > > > files (they are marked as immutable to avoid mistakes) so there i_mutex is
> > > > not really needed.
> > > 
> > > Grepping for I_MUTEX_QUOTA shows hits in ext4, reiserfs, and gfs2.  The
> > > former two are in code called from the quota code (through the
> > > ->quota_write method).  But the gfs2 code appears to be called directly
> > > from gfs2's write code.
> >   Ah, gfs2 doesn't use generic quota code so whatever it does is it's own
> > invention. For ext4 and reiserfs I could get rid of I_MUTEX_QUOTA as I
> > wrote.
> 
> So, just the appended?
  Yup, that's the easier part. We also use the mutex in quota code itself
(fs/quota/dquot.c). That's somewhat harder to solve but still relatively
simple.

> But unfortunately as long as that's left in gfs2 we're still stuck
> trying to order quota files after other files when we take two
> non-directory i_mutexes elsewhere.
  As far as GFS2 is concerned, I'm not sure what it uses i_mutex in quota
code for.  In any case it should be possible to replace that usage by some
GFS2 internal lock to get rid of the last usage of I_MUTEX_QUOTA... Stephen?

								Honza

> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index e1025c7..1a6fb52 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -1441,7 +1441,6 @@ static ssize_t ext2_quota_write(struct super_block *sb, int type,
>  	struct buffer_head tmp_bh;
>  	struct buffer_head *bh;
>  
> -	mutex_lock_nested(&inode->i_mutex, I_MUTEX_QUOTA);
>  	while (towrite > 0) {
>  		tocopy = sb->s_blocksize - offset < towrite ?
>  				sb->s_blocksize - offset : towrite;
> @@ -1471,16 +1470,13 @@ static ssize_t ext2_quota_write(struct super_block *sb, int type,
>  		blk++;
>  	}
>  out:
> -	if (len == towrite) {
> -		mutex_unlock(&inode->i_mutex);
> +	if (len == towrite)
>  		return err;
> -	}
>  	if (inode->i_size < off+len-towrite)
>  		i_size_write(inode, off+len-towrite);
>  	inode->i_version++;
>  	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
>  	mark_inode_dirty(inode);
> -	mutex_unlock(&inode->i_mutex);
>  	return len - towrite;
>  }
>  
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index cf0b592..7c08c93 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -3000,7 +3000,6 @@ static ssize_t ext3_quota_write(struct super_block *sb, int type,
>  			(unsigned long long)off, (unsigned long long)len);
>  		return -EIO;
>  	}
> -	mutex_lock_nested(&inode->i_mutex, I_MUTEX_QUOTA);
>  	bh = ext3_bread(handle, inode, blk, 1, &err);
>  	if (!bh)
>  		goto out;
> @@ -3024,10 +3023,8 @@ static ssize_t ext3_quota_write(struct super_block *sb, int type,
>  	}
>  	brelse(bh);
>  out:
> -	if (err) {
> -		mutex_unlock(&inode->i_mutex);
> +	if (err)
>  		return err;
> -	}
>  	if (inode->i_size < off + len) {
>  		i_size_write(inode, off + len);
>  		EXT3_I(inode)->i_disksize = inode->i_size;
> @@ -3035,7 +3032,6 @@ out:
>  	inode->i_version++;
>  	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
>  	ext3_mark_inode_dirty(handle, inode);
> -	mutex_unlock(&inode->i_mutex);
>  	return len;
>  }
>  
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index ceebaf8..97938db 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4760,7 +4760,6 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type,
>  		return -EIO;
>  	}
>  
> -	mutex_lock_nested(&inode->i_mutex, I_MUTEX_QUOTA);
>  	bh = ext4_bread(handle, inode, blk, 1, &err);
>  	if (!bh)
>  		goto out;
> @@ -4776,16 +4775,13 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type,
>  	err = ext4_handle_dirty_metadata(handle, NULL, bh);
>  	brelse(bh);
>  out:
> -	if (err) {
> -		mutex_unlock(&inode->i_mutex);
> +	if (err)
>  		return err;
> -	}
>  	if (inode->i_size < off + len) {
>  		i_size_write(inode, off + len);
>  		EXT4_I(inode)->i_disksize = inode->i_size;
>  		ext4_mark_inode_dirty(handle, inode);
>  	}
> -	mutex_unlock(&inode->i_mutex);
>  	return len;
>  }
>  
> diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
> index 8b7616e..c07b7d7 100644
> --- a/fs/reiserfs/super.c
> +++ b/fs/reiserfs/super.c
> @@ -2270,7 +2270,6 @@ static ssize_t reiserfs_quota_write(struct super_block *sb, int type,
>  			(unsigned long long)off, (unsigned long long)len);
>  		return -EIO;
>  	}
> -	mutex_lock_nested(&inode->i_mutex, I_MUTEX_QUOTA);
>  	while (towrite > 0) {
>  		tocopy = sb->s_blocksize - offset < towrite ?
>  		    sb->s_blocksize - offset : towrite;
> @@ -2302,16 +2301,13 @@ static ssize_t reiserfs_quota_write(struct super_block *sb, int type,
>  		blk++;
>  	}
>  out:
> -	if (len == towrite) {
> -		mutex_unlock(&inode->i_mutex);
> +	if (len == towrite)
>  		return err;
> -	}
>  	if (inode->i_size < off + len - towrite)
>  		i_size_write(inode, off + len - towrite);
>  	inode->i_version++;
>  	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
>  	mark_inode_dirty(inode);
> -	mutex_unlock(&inode->i_mutex);
>  	return len - towrite;
>  }
>  

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-25 16:26                     ` Jan Kara
@ 2012-04-25 16:47                       ` Steven Whitehouse
  2012-04-25 17:11                       ` J. Bruce Fields
  1 sibling, 0 replies; 41+ messages in thread
From: Steven Whitehouse @ 2012-04-25 16:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: J. Bruce Fields, Al Viro, Linus Torvalds, linux-kernel, linux-fsdevel

Hi,

On Wed, 2012-04-25 at 18:26 +0200, Jan Kara wrote:
> On Wed 25-04-12 07:29:30, J. Bruce Fields wrote:
> > On Wed, Apr 25, 2012 at 12:23:12AM +0200, Jan Kara wrote:
> > > On Tue 24-04-12 15:52:36, J. Bruce Fields wrote:
> > > > On Fri, Apr 20, 2012 at 01:15:17PM +0200, Jan Kara wrote:
> > > > > On Wed 18-04-12 00:44:24, Al Viro wrote:
> > > > > > On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > > > > > > Or I could increment that counter for all the conflicting operations and
> > > > > > > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > > > > > > something like that (an inc, a dec, another error path) to every
> > > > > > > > operation. ?And hoping to avoid adding another field to struct inode.
> > > > > > > > Oh well.
> > > > > > > 
> > > > > > > We could just say that we can do a double inode lock, but then
> > > > > > > standardize on the order. And the only sane order is comparing inode
> > > > > > > pointers, not inode numbers like ext4 apparently does.
> > > > > > > 
> > > > > > > With a standard order, I don't think it would be at all wrong to just
> > > > > > > take the inode lock on rename.
> > > > > > 
> > > > > > In principle, yes, but have you tried to grep for i_mutex?  Note that
> > > > > > we have *another* place where multiple ->i_mutex might be held on
> > > > > > non-directories (and unless I'm missing something, ext4 move_extent.c
> > > > > > stuff doesn't play well with it): quota writes.  Which can, AFAICS,
> > > > > > happen while write(2) is holding ->i_mutex on a regular file.  So
> > > > > > it's not _that_ easy - we want something like "and quota file is goes
> > > > > > last", since there we don't get to change the locking order - the first
> > > > > > ->i_mutex is taken too far outside.
> > > > >   Hum, I think I could just do away with quota file i_mutex being special.
> > > > > It's used for two purposes:
> > > > >   1) When quota is being turned on/off, we want to set/clear inode immutable
> > > > > flag, truncate page cache, etc. But we should be able push this locking
> > > > > outside of quota locks.
> > > > >   2) Inside filesystems when quota file is written to. Quota writes are
> > > > > serialized by quota code anyway and noone else has any bussiness with quota
> > > > > files (they are marked as immutable to avoid mistakes) so there i_mutex is
> > > > > not really needed.
> > > > 
> > > > Grepping for I_MUTEX_QUOTA shows hits in ext4, reiserfs, and gfs2.  The
> > > > former two are in code called from the quota code (through the
> > > > ->quota_write method).  But the gfs2 code appears to be called directly
> > > > from gfs2's write code.
> > >   Ah, gfs2 doesn't use generic quota code so whatever it does is it's own
> > > invention. For ext4 and reiserfs I could get rid of I_MUTEX_QUOTA as I
> > > wrote.
> > 
> > So, just the appended?
>   Yup, that's the easier part. We also use the mutex in quota code itself
> (fs/quota/dquot.c). That's somewhat harder to solve but still relatively
> simple.
> 
> > But unfortunately as long as that's left in gfs2 we're still stuck
> > trying to order quota files after other files when we take two
> > non-directory i_mutexes elsewhere.
>   As far as GFS2 is concerned, I'm not sure what it uses i_mutex in quota
> code for.  In any case it should be possible to replace that usage by some
> GFS2 internal lock to get rid of the last usage of I_MUTEX_QUOTA... Stephen?
> 
> 								Honza
> 
I'll have a look and see what we can do. I'm not sure off the top of my
head how easy it will be to eliminate this lock,

Steve.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [git pull] vfs and fs fixes
  2012-04-25 16:26                     ` Jan Kara
  2012-04-25 16:47                       ` Steven Whitehouse
@ 2012-04-25 17:11                       ` J. Bruce Fields
  1 sibling, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2012-04-25 17:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Al Viro, Linus Torvalds, linux-kernel, linux-fsdevel, Steven Whitehouse

On Wed, Apr 25, 2012 at 06:26:40PM +0200, Jan Kara wrote:
> On Wed 25-04-12 07:29:30, J. Bruce Fields wrote:
> > On Wed, Apr 25, 2012 at 12:23:12AM +0200, Jan Kara wrote:
> > > On Tue 24-04-12 15:52:36, J. Bruce Fields wrote:
> > > > On Fri, Apr 20, 2012 at 01:15:17PM +0200, Jan Kara wrote:
> > > > > On Wed 18-04-12 00:44:24, Al Viro wrote:
> > > > > > On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > > > > > > Or I could increment that counter for all the conflicting operations and
> > > > > > > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > > > > > > something like that (an inc, a dec, another error path) to every
> > > > > > > > operation. ?And hoping to avoid adding another field to struct inode.
> > > > > > > > Oh well.
> > > > > > > 
> > > > > > > We could just say that we can do a double inode lock, but then
> > > > > > > standardize on the order. And the only sane order is comparing inode
> > > > > > > pointers, not inode numbers like ext4 apparently does.
> > > > > > > 
> > > > > > > With a standard order, I don't think it would be at all wrong to just
> > > > > > > take the inode lock on rename.
> > > > > > 
> > > > > > In principle, yes, but have you tried to grep for i_mutex?  Note that
> > > > > > we have *another* place where multiple ->i_mutex might be held on
> > > > > > non-directories (and unless I'm missing something, ext4 move_extent.c
> > > > > > stuff doesn't play well with it): quota writes.  Which can, AFAICS,
> > > > > > happen while write(2) is holding ->i_mutex on a regular file.  So
> > > > > > it's not _that_ easy - we want something like "and quota file is goes
> > > > > > last", since there we don't get to change the locking order - the first
> > > > > > ->i_mutex is taken too far outside.
> > > > >   Hum, I think I could just do away with quota file i_mutex being special.
> > > > > It's used for two purposes:
> > > > >   1) When quota is being turned on/off, we want to set/clear inode immutable
> > > > > flag, truncate page cache, etc. But we should be able push this locking
> > > > > outside of quota locks.
> > > > >   2) Inside filesystems when quota file is written to. Quota writes are
> > > > > serialized by quota code anyway and noone else has any bussiness with quota
> > > > > files (they are marked as immutable to avoid mistakes) so there i_mutex is
> > > > > not really needed.
> > > > 
> > > > Grepping for I_MUTEX_QUOTA shows hits in ext4, reiserfs, and gfs2.  The
> > > > former two are in code called from the quota code (through the
> > > > ->quota_write method).  But the gfs2 code appears to be called directly
> > > > from gfs2's write code.
> > >   Ah, gfs2 doesn't use generic quota code so whatever it does is it's own
> > > invention. For ext4 and reiserfs I could get rid of I_MUTEX_QUOTA as I
> > > wrote.
> > 
> > So, just the appended?
>   Yup, that's the easier part. We also use the mutex in quota code itself
> (fs/quota/dquot.c). That's somewhat harder to solve but still relatively
> simple.

Yeah, OK, I stared at that part but wasn't completely sure what you
meant to do there.

--b.

> > But unfortunately as long as that's left in gfs2 we're still stuck
> > trying to order quota files after other files when we take two
> > non-directory i_mutexes elsewhere.
>   As far as GFS2 is concerned, I'm not sure what it uses i_mutex in quota
> code for.  In any case it should be possible to replace that usage by some
> GFS2 internal lock to get rid of the last usage of I_MUTEX_QUOTA... Stephen?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/5] vfs: change nondirectory i_mutex ordering to fix quota deadlock
  2012-04-25 15:28                 ` J. Bruce Fields
@ 2012-04-25 19:53                   ` Jan Kara
  2012-04-25 19:58                     ` J. Bruce Fields
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2012-04-25 19:53 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Al Viro, Linus Torvalds, linux-kernel, linux-fsdevel, J. Bruce Fields

On Wed 25-04-12 11:28:58, J. Bruce Fields wrote:
> On Wed, Apr 25, 2012 at 11:22:09AM -0400, bfields@fieldses.org wrote:
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 487c924..13d23b6 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -961,6 +961,17 @@ void unlock_new_inode(struct inode *inode)
> >  }
> >  EXPORT_SYMBOL(unlock_new_inode);
> >  
> > +/*
> > + * We order !IS_NOQUOTA files before ISNOQUOTA files, and by pointer
> > + * within each category.
> > + */
> > +static bool nondir_mutex_ordered(struct inode *inode1, struct inode *inode2)
> > +{
> > +	if (IS_NOQUOTA(inode1) == IS_NOQUOTA(inode2))
> > +		return inode1 < inode2;
> > +	return IS_NOQUOTA(inode2);
> > +}
> 
> This seems kind of awful.  Is it what you were thinking of originally,
> Al, and could we live with it?
  Yeah, it's pretty ugly and also racy. I'm just now testing patches that
would get rid of I_MUTEX_QUOTA usage for filesystems (except GFS2) and
quota code. GFS2 could be certainly dealt with as well (at least by
introducing a new GFS2 internal lock) so this ugly code can go away.

								Honza


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/5] vfs: change nondirectory i_mutex ordering to fix quota deadlock
  2012-04-25 19:53                   ` Jan Kara
@ 2012-04-25 19:58                     ` J. Bruce Fields
  0 siblings, 0 replies; 41+ messages in thread
From: J. Bruce Fields @ 2012-04-25 19:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Al Viro, Linus Torvalds, linux-kernel, linux-fsdevel,
	J. Bruce Fields, switeho

On Wed, Apr 25, 2012 at 09:53:43PM +0200, Jan Kara wrote:
> On Wed 25-04-12 11:28:58, J. Bruce Fields wrote:
> > On Wed, Apr 25, 2012 at 11:22:09AM -0400, bfields@fieldses.org wrote:
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index 487c924..13d23b6 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -961,6 +961,17 @@ void unlock_new_inode(struct inode *inode)
> > >  }
> > >  EXPORT_SYMBOL(unlock_new_inode);
> > >  
> > > +/*
> > > + * We order !IS_NOQUOTA files before ISNOQUOTA files, and by pointer
> > > + * within each category.
> > > + */
> > > +static bool nondir_mutex_ordered(struct inode *inode1, struct inode *inode2)
> > > +{
> > > +	if (IS_NOQUOTA(inode1) == IS_NOQUOTA(inode2))
> > > +		return inode1 < inode2;
> > > +	return IS_NOQUOTA(inode2);
> > > +}
> > 
> > This seems kind of awful.  Is it what you were thinking of originally,
> > Al, and could we live with it?
>   Yeah, it's pretty ugly and also racy. I'm just now testing patches that
> would get rid of I_MUTEX_QUOTA usage for filesystems (except GFS2) and
> quota code. GFS2 could be certainly dealt with as well (at least by
> introducing a new GFS2 internal lock) so this ugly code can go away.

That would be great, thanks!

--b.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [git pull] vfs and fs fixes
@ 2013-09-18 22:52 Al Viro
  0 siblings, 0 replies; 41+ messages in thread
From: Al Viro @ 2013-09-18 22:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

atomic_open-related fixes (Miklos' series, with EEXIST-related parts replaced
with fix in fs/namei.c:atomic_open() instead of messing with the instances)
+ race fix in autofs + leak on failure exit in 9p.  Please, pull from the
usual place -
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus

Shortlog:
Al Viro (3):
      autofs4: close the races around autofs4_notify_daemon()
      atomic_open: take care of EEXIST in no-open case with O_CREAT|O_EXCL in fs/namei.c
      9p: don't forget to destroy inode cache if fscache registration fails

Miklos Szeredi (5):
      vfs: improve i_op->atomic_open() documentation
      cifs: fix filp leak in cifs_atomic_open()
      gfs2: set FILE_CREATED
      nfs: set FILE_CREATED
      vfs: don't set FILE_CREATED before calling ->atomic_open()

Diffstat:
 Documentation/filesystems/vfs.txt |   14 +++++++-------
 fs/9p/v9fs.c                      |    7 ++++---
 fs/9p/vfs_inode_dotl.c            |    8 +-------
 fs/autofs4/waitq.c                |   13 +++----------
 fs/cifs/dir.c                     |    1 +
 fs/gfs2/inode.c                   |    4 +++-
 fs/namei.c                        |   34 ++++++++++++++++++++++------------
 fs/nfs/dir.c                      |    3 +++
 fs/open.c                         |   21 ++++++++++++++++++---
 9 files changed, 62 insertions(+), 43 deletions(-)

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2013-09-18 22:53 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-17  5:25 [git pull] vfs and fs fixes Al Viro
2012-04-17 15:01 ` Linus Torvalds
2012-04-17 16:22   ` J. Bruce Fields
2012-04-17 16:33     ` Linus Torvalds
2012-04-17 16:33       ` Linus Torvalds
2012-04-17 17:06       ` J. Bruce Fields
2012-04-17 17:06         ` J. Bruce Fields
2012-04-17 17:59       ` Al Viro
2012-04-17 18:01   ` Al Viro
2012-04-17 18:28     ` Al Viro
2012-04-17 21:14       ` J. Bruce Fields
2012-04-17 22:08         ` Linus Torvalds
2012-04-17 23:44           ` Al Viro
2012-04-18  0:49             ` J. Bruce Fields
2012-04-18  0:56             ` Linus Torvalds
2012-04-18 21:52             ` J. Bruce Fields
2012-04-25 15:20               ` J. Bruce Fields
2012-04-25 15:22               ` [PATCH 1/5] vfs: fix outdated i_mutex_lock_class documentation bfields
2012-04-25 15:22               ` [PATCH 2/5] vfs: pull ext4's double-i_mutex-locking into common code bfields
2012-04-25 15:22               ` [PATCH 3/5] vfs: don't use PARENT/CHILD lock classes for non-directories bfields
2012-04-25 15:22               ` [PATCH 4/5] vfs: take i_mutex on renamed file bfields
2012-04-25 15:22               ` [PATCH 5/5] vfs: change nondirectory i_mutex ordering to fix quota deadlock bfields
2012-04-25 15:28                 ` J. Bruce Fields
2012-04-25 19:53                   ` Jan Kara
2012-04-25 19:58                     ` J. Bruce Fields
2012-04-20 11:15             ` [git pull] vfs and fs fixes Jan Kara
2012-04-24 19:52               ` J. Bruce Fields
2012-04-24 22:23                 ` Jan Kara
2012-04-25 11:29                   ` J. Bruce Fields
2012-04-25 16:26                     ` Jan Kara
2012-04-25 16:47                       ` Steven Whitehouse
2012-04-25 17:11                       ` J. Bruce Fields
2012-04-18  0:47           ` J. Bruce Fields
2012-04-19  3:23 ` Benjamin Herrenschmidt
2012-04-19 14:50   ` Ted Ts'o
2012-04-24 17:40     ` Greg KH
2012-04-24 17:45       ` Al Viro
2012-04-24 17:59         ` Greg KH
2012-04-24 18:04           ` Al Viro
2012-04-24 20:37             ` Greg KH
2013-09-18 22:52 Al Viro

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.