>> >>> On Thu, Apr 19, 2012 at 5:47 PM, Dave Airlie wrote:
>>> > On Thu, Apr 19, 2012 at 5:41 PM, Andy Whitcroft wrote:
>>> >> On Thu, Apr 19, 2012 at 05:30:03PM +0100, Dave Airlie wrote:
>>> >>> On Thu, Apr 19, 2012 at 5:22 PM, Andy Whitcroft wrote:
>>> >>> > We have been carrying a (rather poor) patch for an issue we identified in
>>> >>> > the DRM driver.  This issue is triggered when a DRM device is initialising
>>> >>> > and userspace attempts to open it, typically in response to the sysfs
>>> >>> > device added event.  Basically we allocate the minor numbers making
>>> >>> > the device available, and then call the drm load callback.  Until this
>>> >>> > completes the device is really not ready and these early opens typically
>>> >>> > lead to oopses.
>>> >>> >
>>> >>> > We have been using the following patch to avoid this by marking the minors
>>> >>> > as in error until the load method has completed.  This avoids the early
>>> >>> > open by simply erroring out the opens with EAGAIN.  Obviously we should
>>> >>> > be delaying the open until the load method completes.
>>> >>> >
>>> >>> > I include the existing patch for completeness (it is not really ready for
>>> >>> > merging) to illustrate the issue.  I think it is logical that the wait
>>> >>> > should simply be delayed until the load has completed.  I am proposing
>>> >>> > to include a wait queue associated with the idr cache for the drm minors
>>> >>> > which we can use to allow open callers to wait_event_interruptible() on.
>>> >>> > I'll be putting together a prototype shortly and will follow up with it.
>>> >>> >
>>> >>> > Thoughts?
>>> >>>
>>> >>> Couldn't we just delay registering things until the driver is ready to
>>> >>> accept an open?
>>> >>>
>>> >>> Granted the midlayer of drm doesn't make that easy,
>>> >>
>>> >> It seems that we need the dri minor allocated before we hit the load
>>> >> function as things are done right now.
>>> >>
>>> >>> thanks for sending this out, it keeps falling off my radar, I don't
>>> >>> think I've ever seen this reported on RHEL/Fedora, which makes me
>>> >>> wonder what we are doing that makes us lucky.
>>> >>
>>> >> We never hit it until we started doing things earlier and quicker.  I first
>>> >> found it in the prettification of boot so we were keen to get plymouth
>>> >> running as soon as possible.  That led to random panics and me finding
>>> >> this bug.  The window is tiny as far as I know and it tends to be specific
>>> >> machines and specific package combinations which trigger it reliably.
>>> >>
>>> >> I suspect that a proper fix would allow delaying the registration as you
>>> >> suggest but in the interim a wait would at least avoid the issues we are
>>> >> seeing.  I will see how awful it looks.
>>> >
>>> > Just to confirm it's the drm_sysfs_device_add that causes the race we care about.
>>> >
>>> > it needs to happen after the driver is happy. Since it calls
>>> > device_register and that is what triggers udev magic to load the
>>> > userspace.
>>> >
>>> > If you have a userspace app banging on a static device node that might
>>> > need another set of fun fixes.
>>>
>>> Okay the sysfs add and the idr_replace are the things we need to delay.
>>
>> Since you can still get at things with a static node, it seems like
>> locking is the real issue here?  Is there no mutex we can take across
>> init to block any openers until we're done?
>
> well the idr replace should be the thing that matters, since before
> that openers get -ENODEV, after it they end up success.
> we may need a lock around that once we fix the logic.

Here's my pre-dinner hack, contains random rtl change as well, plz ignore.

now for dinner.

Dave.
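
For anyone following the ordering argument, here is a rough userspace sketch of "load first, publish second". It is not the patch mentioned above and not DRM code: every toy_* name is hypothetical, a plain array stands in for the idr, and there is no locking at all, so it only illustrates the idea that openers see -ENODEV until the publish step has run.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a device; not a real DRM structure. */
struct toy_device {
    int loaded;            /* set once the driver's load step has finished */
};

#define TOY_MAX_MINORS 4

/* A NULL slot means "minor reserved, device not yet published". */
struct toy_device *toy_minors[TOY_MAX_MINORS];

/* Models the driver's load callback: the device is unusable before this. */
int toy_load(struct toy_device *dev)
{
    dev->loaded = 1;
    return 0;
}

/* Models the publish step (the role idr_replace plays in this thread). */
int toy_publish(int minor, struct toy_device *dev)
{
    if (minor < 0 || minor >= TOY_MAX_MINORS)
        return -EINVAL;
    toy_minors[minor] = dev;
    return 0;
}

/* Models open(): fail with -ENODEV until the slot has been published. */
int toy_open(int minor)
{
    if (minor < 0 || minor >= TOY_MAX_MINORS)
        return -ENODEV;
    if (toy_minors[minor] == NULL)
        return -ENODEV;
    /* With load-before-publish ordering a visible device is always loaded. */
    return toy_minors[minor]->loaded ? 0 : -EIO;
}

/* Registration in the proposed order: load first, publish second. */
int toy_register(int minor, struct toy_device *dev)
{
    int ret = toy_load(dev);

    if (ret)
        return ret;
    return toy_publish(minor, dev);
}
```

An open on a reserved but unpublished minor returns -ENODEV, and the same open succeeds once toy_register() has returned. A real version would still need the lock discussed above around the publish step and the lookup in open.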