From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933419AbdC3K1T (ORCPT ); Thu, 30 Mar 2017 06:27:19 -0400 Received: from mail-wr0-f196.google.com ([209.85.128.196]:33500 "EHLO mail-wr0-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933024AbdC3K1Q (ORCPT ); Thu, 30 Mar 2017 06:27:16 -0400 From: Nicolai Stange To: Johannes Berg Cc: Nicolai Stange , linux-kernel , "Paul E.McKenney" , gregkh Subject: Re: deadlock in synchronize_srcu() in debugfs? References: <1490280886.2766.4.camel@sipsolutions.net> <87o9ws6m4s.fsf@gmail.com> <1490614617.3393.4.camel@sipsolutions.net> <87lgrnmd8b.fsf@gmail.com> <1490860547.32041.1.camel@sipsolutions.net> Date: Thu, 30 Mar 2017 12:27:07 +0200 In-Reply-To: <1490860547.32041.1.camel@sipsolutions.net> (Johannes Berg's message of "Thu, 30 Mar 2017 09:55:47 +0200") Message-ID: <87inmroy9w.fsf@d147-183.mpimet.mpg.de> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org So, please correct me if I'm wrong, there are two problems with indefinitely blocking debugfs files' fops: 1. The one which actually hung your system: An indefinitely blocking debugfs_remove() while holding a lock. Other tasks attempting to grab that same lock get stuck as well. 2. The other one you've found, namely that the locking granularity is too coarse: a debugfs_remove() would get blocked by unrelated files' pending fops. AFAICS, the first one can't get resolved by simply refining the blocking granularity: a debugfs_remove() on the indefinitely blocking file would still block as well. But: On Thu, Mar 30 2017, Johannes Berg wrote: > On Thu, 2017-03-30 at 09:32 +0200, Nicolai Stange wrote: >> >> I wonder if holding the RTNL during the debugfs file removal is >> really needed. I'll try to have a look in the next couple of days. > > Yes, I'm pretty much convinced that it is. I considered doing a > deferred debugfs_remove() by holding the object around, but then I > can't be sure that I can later re-add a new object with the same > directory name, Ah, I see. What about making debugfs provide separate - debugfs_remove_start(): unlink file (c.f. __debugfs_remove()), can be called under a lock and - debugfs_remove_wait(): the synchronize_srcu(), must not get called under any lock ? This would solve 1.). It would still be nice to detect those situations, i.e. calls to debugfs_remove() with some lock being held, though. In short, I'd make calling debugfs_remove() with any lock being held illegal. What do you think? > Half the thread here was about that - it's not easily doable because > you'd have to teach lockdep about the special SRCU semantics first. > Since it doesn't even seem to do read/write locks properly that's > probably a massive undertaking. I haven't looked into lockdep yet. So there is no way to ask lockdep "is there any lock held?" from debugfs_remove() before doing the synchonize_srcu()? > I also doubt that it's useful, because even if we did flag this sort of > situation, it can occur across very different drivers - for example the > netronome driver using rtnl_lock() inside its debugfs files, and > mac80211 removing a completely unrelated debugfs file within > rtnl_lock(). I'm proposing to convert the latter to a debugfs_remove_start()/debugfs_remove_wait() pair. >> > Similarly, nobody should be blocking in debugfs files, like we did >> > in ours, but also smsdvb_stats_read(), crtc_crc_open() look like >> > they could block for quite a while. >> >> Blocking in the debugfs files' fops shall be fine by itself, that's >> why SRCU is used for the removal stuff. > > No, it isn't fine at all now! "Shall" in the sense of "it's a requirement" and if it isn't fulfilled, it must be fixed. So I do agree with you here. > If I have a debugfs file - like the one I > had - that could block for external events (firmware, in my case), then > *any* other debugfs_remove() in the whole system would also block > indefinitely. That's a major problem! Indeed. > Ultimately, I'm not sure I see why one couldn't just have a > reader/writer lock per *file*, which would be the ultimate granularity > to solve this. Obviously, a blocking file has to be aborted before > being removed itself, but there's nothing that says that you can't > remove any other file - even from the same directory - while this one > is in a blocking read. When I did this, per-file reader/writer locks actuallt came to my mind first. The problem here is that debugfs_use_file_start() must grab the lock first and check whether the file has been deleted in the meanwhile. But as it stands, there's nothing that would guarantee the existence of the lock at the time it's to be taken. Anyways, I'll have to think a little about possible solutions to mitigate problem 2.) Thanks, Nicolai