Re: [PATCH] drivers/base: use a worker for sysfs unbind

From: Daniel Vetter <daniel.vetter@ffwll.ch>
To: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	dri-devel <dri-devel@lists.freedesktop.org>,
	Ramalingam C <ramalingam.c@intel.com>,
	Greg KH <gregkh@linuxfoundation.org>,
	Daniel Vetter <daniel.vetter@intel.com>
Subject: Re: [PATCH] drivers/base: use a worker for sysfs unbind
Date: Thu, 13 Dec 2018 13:36:22 +0100
Message-ID: <CAKMK7uF7noCEgwE0QYZWQFx-OPxipAF1MojUZ8KTo_SXfQW8+w@mail.gmail.com>
In-Reply-To: <CAJZ5v0iWshem3kuurF53gutVJ8jFm_caAbetK2CiSCpyc6ReeQ@mail.gmail.com>

On Thu, Dec 13, 2018 at 11:23 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Thu, Dec 13, 2018 at 10:58 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Thu, Dec 13, 2018 at 10:38:14AM +0100, Rafael J. Wysocki wrote:
> > > On Mon, Dec 10, 2018 at 9:47 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > >
> > > > Drivers might want to remove some sysfs files, which needs the same
> > > > locks and ends up angering lockdep. Relevant snippet of the stack
> > > > trace:
> > > >
> > > >   kernfs_remove_by_name_ns+0x3b/0x80
> > > >   bus_remove_driver+0x92/0xa0
> > > >   acpi_video_unregister+0x24/0x40
> > > >   i915_driver_unload+0x42/0x130 [i915]
> > > >   i915_pci_remove+0x19/0x30 [i915]
> > > >   pci_device_remove+0x36/0xb0
> > > >   device_release_driver_internal+0x185/0x250
> > > >   unbind_store+0xaf/0x180
> > > >   kernfs_fop_write+0x104/0x190
> > >
> > > Is the acpi_bus_unregister_driver() in acpi_video_unregister() the
> > > source of the lockdep unhappiness?
> >
> > Yeah I guess I cut out too much of the lockdep splat. It complains about
> > kernfs_fop_write and kernfs_remove_by_name_ns acquiring the same lock
> > class. It's ofc not the same lock, so no real deadlock. Getting the
> > device_release_driver outside of the callchain under kernfs_fop_write,
> > which this patch does, "fixes" it. For "fixes" = shut up lockdep.
>
> OK, so the problem really is that the operation is started via sysfs
> which means that this code is running under a lock already.
>
> Which lock does lockdep complain about, exactly?

mutex_lock(&of->mutex);

> > Other options:
> > - Annotate the recursion with the usual lockdep annotations. Potentially
> >   results in lockdep not catching real deadlocks (you can still have other
> >   loops closing the deadlock, maybe through some subsystem/bus lock).
> >
> > - Rewrite kernfs_fop_write to drop the lock (optionally, for callbacks
> >   that know what they're doing), which should be fine if we refcount
> >   everything properly (bus, driver & device).
> >
> > - Also note that probably the same bug exists on the bind sysfs interface,
> >   but we don't use that, so I don't care :-)
> >
> > - Most of these issues are never visible in normal usage, since normally
> >   driver bind/unbind is done from a kthread or module load/unload, neither
> >   of which is running in the context of that kernfs mutex kernfs_fop_write
> >   holds. That's why I think the task work is the best solution, since it
> >   changes the locking context of the unbind sysfs to match the locking
> >   context of module unload and hotunplug.
>
> I think that using a task work here makes sense.  There is a drawback,
> which is that the original sysfs write will not wait for the driver to
> actually be released before returning to user space AFAICS, but that
> probably isn't a big deal.

This would happen with a normal work_struct, which runs on some other
thread eventually. That added asynchronous execution uncovered lots
of bugs in our CI (fbcon isn't solid, let's put it that way). Hence
the task work, which will be run before the syscall returns to
userspace, but outside of any other locking context. Task work was
originally created to avoid a locking inversion on the final fput,
where the same "must complete before returning to userspace, but
outside of any other locking context" issue was causing trouble.
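
To make the ordering concrete, here is a rough userspace model of it.
This is not kernel code and not the patch itself: every model_* name
and both flags are made up for illustration, and the single global list
stands in for what would be per-task state. It only shows the two
properties that matter here, namely that the deferred unbind runs after
the store callback has dropped the file mutex, and that it has run by
the time the write returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; none of these names are kernel APIs. */
struct model_work {
	struct model_work *next;
	void (*func)(struct model_work *work);
};

static struct model_work *pending;	/* would be per-task state */
static bool file_mutex_held;		/* stands in for of->mutex */
static bool unbind_done;
static bool unbind_ran_under_mutex;

/* Queue a callback to run when the "syscall" is about to return. */
static void model_work_add(struct model_work *work)
{
	work->next = pending;
	pending = work;
}

/* Run and drain everything that was queued during the "syscall". */
static void model_run_pending(void)
{
	while (pending) {
		struct model_work *work = pending;

		pending = work->next;
		work->func(work);
	}
}

/* The deferred unbind: records which locking context it ran in. */
static void model_unbind(struct model_work *work)
{
	(void)work;
	unbind_ran_under_mutex = file_mutex_held;
	unbind_done = true;
}

static struct model_work unbind_work = { .func = model_unbind };

/* The store callback: runs under the file mutex and only queues work. */
static void model_store(void)
{
	file_mutex_held = true;
	model_work_add(&unbind_work);
	file_mutex_held = false;
}

/* The write path: store first, then pending work, then "userspace". */
void model_write_syscall(void)
{
	model_store();
	model_run_pending();
}
```

After one call to model_write_syscall(), unbind_done is set and
unbind_ran_under_mutex is still false. A plain work_struct would
correspond to running model_run_pending() from some other thread at
some later point, which is where the asynchronous behaviour came from.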

> Also please note that the patch changes the code flow slightly,
> because passing a non-NULL parent pointer to
> device_release_driver_internal() potentially has side effects, but
> that should not be a big deal either.

I can keep the old behaviour exactly, but afaict the non-NULL parent just
takes care of the parent bus locking for us, instead of hand-rolling
it in the caller. But if I missed something, I can easily undo that
part.

> > Unfortunately that trick doesn't work for the bind sysfs file, since
> > that way we can't thread the errno value back to userspace.
>
> Right.  That is unless we wait for the operation to complete and check
> the error left behind by it.  That should be doable, but somewhat
> complicated.

For real deadlocks this doesn't fix anything, it just hides them from
lockdep. Cross-release lockdep would still complain. If we want to fix
the bind side _and_ keep reporting the errno from the driver's bind
function, then we need to rework kernfs and add a callback which
doesn't hold the mutex. Should be doable, just a pile more work.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch