From: Joel Becker <Joel.Becker@oracle.com> To: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org>, Louis Rilling <louis.rilling@kerlabs.com>, linux-kernel@vger.kernel.org, cluster-devel@redhat.com, swhiteho <swhiteho@redhat.com> Subject: Re: [PATCH] configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item() Date: Tue, 27 Jan 2009 19:41:14 -0800 [thread overview] Message-ID: <20090128034114.GC7244@mail.oracle.com> (raw) In-Reply-To: <1232973009.4863.76.camel@laptop> On Mon, Jan 26, 2009 at 01:30:09PM +0100, Peter Zijlstra wrote: > I don't think I was suggesting that. All you need is to serialize any > mkdir/creat against the rmdir of the youngest non-default group, and you > can do that by holding su_mutex. > > In rmdir, you already own all the i_mutex instances you need to uncouple > the whole tree, all you need to do is validate that its indeed empty -- > you don't need i_mutex's for that, because you're holding su_mutex, and > any concurrent mkdir/creat will be blocking on that. > > If you find it empty, just mark everybody DEAD, drop su_mutex and > decouple. All concurrent mkdir/creat thingies that were blocking will > now bail because their parent is found DEAD. Right, that's what I was talking about when I said: > > Now look in detach_groups(). We drop the groups children before > > marking them DEAD. Louis' plan, I think, is to perhaps mark a group > > DEAD, disconnect it from the vfs, and then operate on its children. In > > this fashion, perhaps we can unlock the trailing lock like a normal VFS > > operation. > > This will require some serious auditing, however, because now > > vfs functions can get into the vfs objects behind us. And more vfs > > changes affect us. Whereas the current locking relies on the vfs's > > parent->child lock ordering only, something that isn't likely to be > > changed. That is, the vfs has already walked past this directory, dropped i_mutex, and is in a child default group holding its i_mutex. It wants to mkdir(2) down there. You're saying that, if mkdir(2) holds su_mutex higher up, it can check S_DEAD and compare with us, and that's exactly the scheme I mentioned in the first of the quoted paragraphs (Louis proposed it a while back). Thus, though we don't hold i_mutex on that child, it will eventually either a) have gotten su_mutex first, and will cause rmdir to ENOTEMPTY or b) have gotten su_mutex second, will see S_DEAD, and return -ENOENT. As far as preventing mkdir(2), I don't see why it wouldn't work. The issue is in that second quoted paragraph. We know that the VFS can look up our children if we're not holding i_mutex. In fact, cached_lookup() can find them without i_mutex. Now, we know that mkdir(2) and rmdir(2) will block at su_mutex. But what about all other file operations, both on the child directories *and* the attribute files? For attribute files, we prevent access at creation time with a flag. We can trust the flag because we hold i_mutex. This might hold anyway because we're holding that toplevel i_mutex. At teardown time, though, we know they can't be found because of i_mutex. Now we don't have that protection for processes that are farther down the tree. But the bigger issue is just the plain regular operations on our directories. An example is ->readdir(). We currently lock out ->readdir() during rmdir(2) with our holding of i_mutex. However, if we are not holding i_mutex, vfs_readdir() can call into our ->readdir() right as we're tearing things down. We may not have gotten to S_DEAD yet in the rmdir(2) path, and the two will race. We can't take su_mutex in ->readdir(), because that sort of solution effectively says "we have to take su_mutex for all operations", and we end up serializing all operations on a subsustem. Now, I can think of a way to make ->readdir() safe without su_mutex. But what other operation is next? How do I know I have them all? How do I notice when someone adds a new operation or code path to the generic code that I have to protect against? With i_mutex, I *know* that everyone has agreed that is the gatekeeper. Without it... *That's* the big worry. That's what I'm worried about being sure of. I'd love to hear a solution that we know will work, or at least move towards one. Joel PS: And we haven't even talked about configfs_depend_item() yet. -- Life's Little Instruction Book #252 "Take good care of those you love." Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127
WARNING: multiple messages have this Message-ID (diff)
From: Joel Becker <Joel.Becker@oracle.com> To: cluster-devel.redhat.com Subject: [Cluster-devel] Re: [PATCH] configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item() Date: Tue, 27 Jan 2009 19:41:14 -0800 [thread overview] Message-ID: <20090128034114.GC7244@mail.oracle.com> (raw) In-Reply-To: <1232973009.4863.76.camel@laptop> On Mon, Jan 26, 2009 at 01:30:09PM +0100, Peter Zijlstra wrote: > I don't think I was suggesting that. All you need is to serialize any > mkdir/creat against the rmdir of the youngest non-default group, and you > can do that by holding su_mutex. > > In rmdir, you already own all the i_mutex instances you need to uncouple > the whole tree, all you need to do is validate that its indeed empty -- > you don't need i_mutex's for that, because you're holding su_mutex, and > any concurrent mkdir/creat will be blocking on that. > > If you find it empty, just mark everybody DEAD, drop su_mutex and > decouple. All concurrent mkdir/creat thingies that were blocking will > now bail because their parent is found DEAD. Right, that's what I was talking about when I said: > > Now look in detach_groups(). We drop the groups children before > > marking them DEAD. Louis' plan, I think, is to perhaps mark a group > > DEAD, disconnect it from the vfs, and then operate on its children. In > > this fashion, perhaps we can unlock the trailing lock like a normal VFS > > operation. > > This will require some serious auditing, however, because now > > vfs functions can get into the vfs objects behind us. And more vfs > > changes affect us. Whereas the current locking relies on the vfs's > > parent->child lock ordering only, something that isn't likely to be > > changed. That is, the vfs has already walked past this directory, dropped i_mutex, and is in a child default group holding its i_mutex. It wants to mkdir(2) down there. You're saying that, if mkdir(2) holds su_mutex higher up, it can check S_DEAD and compare with us, and that's exactly the scheme I mentioned in the first of the quoted paragraphs (Louis proposed it a while back). Thus, though we don't hold i_mutex on that child, it will eventually either a) have gotten su_mutex first, and will cause rmdir to ENOTEMPTY or b) have gotten su_mutex second, will see S_DEAD, and return -ENOENT. As far as preventing mkdir(2), I don't see why it wouldn't work. The issue is in that second quoted paragraph. We know that the VFS can look up our children if we're not holding i_mutex. In fact, cached_lookup() can find them without i_mutex. Now, we know that mkdir(2) and rmdir(2) will block at su_mutex. But what about all other file operations, both on the child directories *and* the attribute files? For attribute files, we prevent access at creation time with a flag. We can trust the flag because we hold i_mutex. This might hold anyway because we're holding that toplevel i_mutex. At teardown time, though, we know they can't be found because of i_mutex. Now we don't have that protection for processes that are farther down the tree. But the bigger issue is just the plain regular operations on our directories. An example is ->readdir(). We currently lock out ->readdir() during rmdir(2) with our holding of i_mutex. However, if we are not holding i_mutex, vfs_readdir() can call into our ->readdir() right as we're tearing things down. We may not have gotten to S_DEAD yet in the rmdir(2) path, and the two will race. We can't take su_mutex in ->readdir(), because that sort of solution effectively says "we have to take su_mutex for all operations", and we end up serializing all operations on a subsustem. Now, I can think of a way to make ->readdir() safe without su_mutex. But what other operation is next? How do I know I have them all? How do I notice when someone adds a new operation or code path to the generic code that I have to protect against? With i_mutex, I *know* that everyone has agreed that is the gatekeeper. Without it... *That's* the big worry. That's what I'm worried about being sure of. I'd love to hear a solution that we know will work, or at least move towards one. Joel PS: And we haven't even talked about configfs_depend_item() yet. -- Life's Little Instruction Book #252 "Take good care of those you love." Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127
next prev parent reply other threads:[~2009-01-28 3:42 UTC|newest] Thread overview: 47+ messages / expand[flat|nested] mbox.gz Atom feed top 2008-12-11 14:20 configfs, dlm_controld & lockdep Steven Whitehouse 2008-12-11 14:20 ` [Cluster-devel] " Steven Whitehouse 2008-12-11 14:44 ` Louis Rilling 2008-12-11 17:34 ` Joel Becker 2008-12-11 17:34 ` [Cluster-devel] " Joel Becker 2008-12-12 10:06 ` Louis Rilling 2008-12-12 15:29 ` [PATCH] configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item() Louis Rilling 2008-12-17 21:40 ` Andrew Morton 2008-12-17 21:40 ` [Cluster-devel] " Andrew Morton 2008-12-17 22:03 ` Joel Becker 2008-12-17 22:03 ` [Cluster-devel] " Joel Becker 2008-12-17 22:09 ` Andrew Morton 2008-12-17 22:09 ` [Cluster-devel] " Andrew Morton 2008-12-18 7:26 ` Peter Zijlstra 2008-12-18 9:27 ` Joel Becker 2008-12-18 9:27 ` [Cluster-devel] " Joel Becker 2008-12-18 11:15 ` Louis Rilling 2008-12-18 18:00 ` Make lockdep happy with configfs Louis Rilling 2009-01-26 11:51 ` Louis Rilling 2009-01-28 3:44 ` Joel Becker 2009-01-28 3:44 ` [Cluster-devel] " Joel Becker 2008-12-18 18:00 ` [PATCH 1/2] configfs: Silence lockdep on mkdir() and rmdir() Louis Rilling 2009-01-28 3:55 ` Joel Becker 2009-01-28 3:55 ` [Cluster-devel] " Joel Becker 2009-01-28 10:38 ` Louis Rilling 2008-12-18 18:00 ` [PATCH 2/2] configfs: Rework configfs_depend_item() locking and make lockdep happy Louis Rilling 2009-01-28 4:13 ` Joel Becker 2009-01-28 4:13 ` [Cluster-devel] " Joel Becker 2009-01-28 10:32 ` Louis Rilling 2008-12-18 11:26 ` [PATCH] configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item() Steven Whitehouse 2008-12-18 11:26 ` [Cluster-devel] " Steven Whitehouse 2008-12-18 11:48 ` Louis Rilling 2008-12-18 11:56 ` Peter Zijlstra 2008-12-18 12:28 ` Peter Zijlstra 2008-12-18 22:58 ` Joel Becker 2008-12-18 22:58 ` [Cluster-devel] " Joel Becker 2008-12-19 10:29 ` Louis Rilling 2009-01-26 12:30 ` Peter Zijlstra 2009-01-26 13:24 ` Louis Rilling 2009-01-26 13:41 ` Peter Zijlstra 2009-01-26 14:00 ` Louis Rilling 2009-01-26 14:19 ` Peter Zijlstra 2009-01-26 14:55 ` Louis Rilling 2009-01-28 3:05 ` Joel Becker 2009-01-28 3:05 ` [Cluster-devel] " Joel Becker 2009-01-28 3:41 ` Joel Becker [this message] 2009-01-28 3:41 ` Joel Becker
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20090128034114.GC7244@mail.oracle.com \ --to=joel.becker@oracle.com \ --cc=akpm@linux-foundation.org \ --cc=cluster-devel@redhat.com \ --cc=linux-kernel@vger.kernel.org \ --cc=louis.rilling@kerlabs.com \ --cc=peterz@infradead.org \ --cc=swhiteho@redhat.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.