From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S933390AbdC3Hzx (ORCPT <rfc822;w@1wt.eu>);
        Thu, 30 Mar 2017 03:55:53 -0400
Received: from s3.sipsolutions.net ([5.9.151.49]:58220 "EHLO sipsolutions.net"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S932790AbdC3Hzv (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Thu, 30 Mar 2017 03:55:51 -0400
Message-ID: <1490860547.32041.1.camel@sipsolutions.net>
Subject: Re: deadlock in synchronize_srcu() in debugfs?
From: Johannes Berg <johannes@sipsolutions.net>
To: Nicolai Stange <nicstange@gmail.com>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
        "Paul E.McKenney" <paulmck@linux.vnet.ibm.com>,
        gregkh <gregkh@linuxfoundation.org>
Date: Thu, 30 Mar 2017 09:55:47 +0200
In-Reply-To: <87lgrnmd8b.fsf@gmail.com> (sfid-20170330_093222_584439_BA8321D0)
References: <1490280886.2766.4.camel@sipsolutions.net>
         <87o9ws6m4s.fsf@gmail.com> <1490614617.3393.4.camel@sipsolutions.net>
         <87lgrnmd8b.fsf@gmail.com> (sfid-20170330_093222_584439_BA8321D0)
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.22.4-1 
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 2017-03-30 at 09:32 +0200, Nicolai Stange wrote:
> 
> I wonder if holding the RTNL during the debugfs file removal is
> really needed. I'll try to have a look in the next couple of days.

Yes, I'm pretty much convinced that it is. I considered doing a
deferred debugfs_remove() by holding the object around, but then I
can't be sure that I can later re-add a new object with the same
directory name, so I have much more complexity - I'm not even sure that
can be solved at all, *perhaps* by renaming in debugfs first, but
that's major new complexity.

Enough complexity that I'm considering just removing debugfs usage
entirely and invent new mechanisms, or use sysfs, or something else
instead.

> Summarizing, the problem is the call to the indefinitely blocking
> srcu_synchronize() while having a lock held? I'll see whether I can
> ask lockdep if any lock is held and spit out a warning then.

Half the thread here was about that - it's not easily doable because
you'd have to teach lockdep about the special SRCU semantics first.
Since it doesn't even seem to do read/write locks properly that's
probably a massive undertaking.

I also doubt that it's useful, because even if we did flag this sort of
situation, it can occur across very different drivers - for example the
netronome driver using rtnl_lock() inside its debugfs files, and
mac80211 removing a completely unrelated debugfs file within
rtnl_lock().

> 
> > Similarly, nobody should be blocking in debugfs files, like we did
> > in ours, but also smsdvb_stats_read(), crtc_crc_open() look like
> > they could block for quite a while.
> 
> Blocking in the debugfs files' fops shall be fine by itself, that's
> why SRCU is used for the removal stuff.

No, it isn't fine at all now! If I have a debugfs file - like the one I
had - that could block for external events (firmware, in my case), then
*any* other debugfs_remove() in the whole system would also block
indefinitely. That's a major problem! The two other files listed above
can also block waiting for external events, afaict. I'm told that there
are more files in other wireless drivers too.

Basically, even with SRCU being used, you cannot have blocking files.
You have to treat everything as O_NONBLOCK, because if you don't a
completely unrelated debugfs_remove() will block until the file
produces data. IMHO that's completely unacceptable.

> Yes, there's only one global srcu_struct for debugfs. So far this
> hasn't been a problem and if I understand things correctly, it's also
> not the problem at hand? If it really becomes an issue, we can very
> well introduce per directory srcu_structs as you suggested.

No, this is exactly the problem. If I have one blocking file, and
remove any completely unrelated file elsewhere in the system, I need to
wait for the blocking file to have produced data. That just doesn't
scale.

Using a separate SRCU subsystem per directory will go some way, but it
would still be useful to have lockdep annotations there.

Ultimately, I'm not sure I see why one couldn't just have a
reader/writer lock per *file*, which would be the ultimate granularity
to solve this. Obviously, a blocking file has to be aborted before
being removed itself, but there's nothing that says that you can't
remove any other file - even from the same directory - while this one
is in a blocking read.

> Let me have a look
> - whether holding the RTNL lock while removing the debugfs files is
>   actually needed and
> - whether there is an easy way to spot similar scenarios and emit
>   a warning for them.
> 
> If this doesn't solve the problem, I'll have to think of a different
> way to fix this...

It solves - imho with unnecessary hardship on the caller of
debugfs_remove() - only half of the problem, namely the real deadlock.
It does nothing for the "blocking debugfs file" live-lock where forward
progress can be made as soon as the file has some data available -
which in practice in my scenario never happened as the data producer
was completely idle.

> > (*) before removing first first we'd obviously wake up and thereby
> > more or less terminate the readers first
> 
> With the current implementation, I can't see an easy way to identify
> the tasks blocking on a particular debugfs file. But maybe this is
> resolvable and the way to go here...

No, this is unrelated. If I write a blocking debugfs file, then *of
course* I need to take care that before remove it I wake up all the
readers. But that's easy because I control how that file produces data,
so I can wake up the waitq and tell my own code inside the debugfs read
to return 0 (EOF).

This would be entirely inappropriate as a general solution because
you'd have to kill the userspace process or something...

johannes