linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: kobject: delayed kobject release: help find buggy drivers
       [not found] <20130903185924.046EE660E91@gitolite.kernel.org>
@ 2013-09-03 20:44 ` Dave Jones
  2013-09-03 20:55   ` Greg KH
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Jones @ 2013-09-03 20:44 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: rmk+kernel, gregkh

On Tue, Sep 03, 2013 at 06:59:24PM +0000, Linux Kernel wrote:
 > Gitweb:     http://git.kernel.org/linus/;a=commit;h=c817a67ecba7c3c2aaa104796d78f160af60920d
 > Commit:     c817a67ecba7c3c2aaa104796d78f160af60920d
 > Parent:     7c42721fe0c58a848849b43ff558cf2fb86aa35a
 > Author:     Russell King <rmk+kernel@arm.linux.org.uk>
 > AuthorDate: Thu Jun 27 15:06:14 2013 +0100
 > Committer:  Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 > CommitDate: Thu Jul 25 15:39:04 2013 -0700
 > 
 >     kobject: delayed kobject release: help find buggy drivers
 >     
 >     Implement debugging for kobject release functions.  kobjects are
 >     reference counted, so the drop of the last reference to them is not
 >     predictable. However, the common case is for the last reference to be
 >     the kobject's removal from a subsystem, which results in the release
 >     function being immediately called.
 >     
 >     This can hide subtle bugs, which can occur when another thread holds a
 >     reference to the kobject at the same time that a kobject is removed.
 >     This results in the release method being delayed.
 >     
 >     In order to make these kinds of problems more visible, the following
 >     patch implements a delayed release; this has the effect that the
 >     release function will be out of order with respect to the removal of
 >     the kobject in the same manner that it would be if a reference was
 >     being held.
 >     
 >     This provides us with an easy way to allow driver writers to debug
 >     their drivers and fix otherwise hidden problems.
 >     
 >     Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
 >     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

With this enabled, I get a bunch of scrolling oopses immediately after
exiting the bootloader. It happens so early I can't even capture them
over usb-serial, or earlyprintk=dbgp.

And for whatever reason, the printk path while oopsing ignores boot_delay parameter,
so I can't even use that.

any ideas ?

	Dave


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-03 20:44 ` kobject: delayed kobject release: help find buggy drivers Dave Jones
@ 2013-09-03 20:55   ` Greg KH
  2013-09-04  5:23     ` Rusty Russell
  0 siblings, 1 reply; 17+ messages in thread
From: Greg KH @ 2013-09-03 20:55 UTC (permalink / raw)
  To: Dave Jones, Rusty Russell, Linux Kernel Mailing List, rmk+kernel

On Tue, Sep 03, 2013 at 04:44:42PM -0400, Dave Jones wrote:
> On Tue, Sep 03, 2013 at 06:59:24PM +0000, Linux Kernel wrote:
>  > Gitweb:     http://git.kernel.org/linus/;a=commit;h=c817a67ecba7c3c2aaa104796d78f160af60920d
>  > Commit:     c817a67ecba7c3c2aaa104796d78f160af60920d
>  > Parent:     7c42721fe0c58a848849b43ff558cf2fb86aa35a
>  > Author:     Russell King <rmk+kernel@arm.linux.org.uk>
>  > AuthorDate: Thu Jun 27 15:06:14 2013 +0100
>  > Committer:  Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>  > CommitDate: Thu Jul 25 15:39:04 2013 -0700
>  > 
>  >     kobject: delayed kobject release: help find buggy drivers
>  >     
>  >     Implement debugging for kobject release functions.  kobjects are
>  >     reference counted, so the drop of the last reference to them is not
>  >     predictable. However, the common case is for the last reference to be
>  >     the kobject's removal from a subsystem, which results in the release
>  >     function being immediately called.
>  >     
>  >     This can hide subtle bugs, which can occur when another thread holds a
>  >     reference to the kobject at the same time that a kobject is removed.
>  >     This results in the release method being delayed.
>  >     
>  >     In order to make these kinds of problems more visible, the following
>  >     patch implements a delayed release; this has the effect that the
>  >     release function will be out of order with respect to the removal of
>  >     the kobject in the same manner that it would be if a reference was
>  >     being held.
>  >     
>  >     This provides us with an easy way to allow driver writers to debug
>  >     their drivers and fix otherwise hidden problems.
>  >     
>  >     Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
>  >     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> 
> With this enabled, I get a bunch of scrolling oopses immediately after
> exiting the bootloader. It happens so early I can't even capture them
> over usb-serial, or earlyprintk=dbgp.
> 
> And for whatever reason, the printk path while oopsing ignores boot_delay parameter,
> so I can't even use that.
> 
> any ideas ?

The first big bug found with this was with the module kobject code, and
a fix for that should be going in through Rusty's tree to Linus for this
merge window (right Rusty?)

After that, things should settle down and boot properly, although a few
bugs have been reported already to their respective subsystem owners, I
don't know the status of them yet.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-03 20:55   ` Greg KH
@ 2013-09-04  5:23     ` Rusty Russell
  2013-09-05  4:13       ` Dave Jones
  0 siblings, 1 reply; 17+ messages in thread
From: Rusty Russell @ 2013-09-04  5:23 UTC (permalink / raw)
  To: Greg KH, Dave Jones, Linux Kernel Mailing List, rmk+kernel

Greg KH <gregkh@linuxfoundation.org> writes:
> On Tue, Sep 03, 2013 at 04:44:42PM -0400, Dave Jones wrote:
>> On Tue, Sep 03, 2013 at 06:59:24PM +0000, Linux Kernel wrote:
>>  > Gitweb:     http://git.kernel.org/linus/;a=commit;h=c817a67ecba7c3c2aaa104796d78f160af60920d
>>  > Commit:     c817a67ecba7c3c2aaa104796d78f160af60920d
>>  > Parent:     7c42721fe0c58a848849b43ff558cf2fb86aa35a
>>  > Author:     Russell King <rmk+kernel@arm.linux.org.uk>
>>  > AuthorDate: Thu Jun 27 15:06:14 2013 +0100
>>  > Committer:  Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>  > CommitDate: Thu Jul 25 15:39:04 2013 -0700
>>  > 
>>  >     kobject: delayed kobject release: help find buggy drivers
>>  >     
>>  >     Implement debugging for kobject release functions.  kobjects are
>>  >     reference counted, so the drop of the last reference to them is not
>>  >     predictable. However, the common case is for the last reference to be
>>  >     the kobject's removal from a subsystem, which results in the release
>>  >     function being immediately called.
>>  >     
>>  >     This can hide subtle bugs, which can occur when another thread holds a
>>  >     reference to the kobject at the same time that a kobject is removed.
>>  >     This results in the release method being delayed.
>>  >     
>>  >     In order to make these kinds of problems more visible, the following
>>  >     patch implements a delayed release; this has the effect that the
>>  >     release function will be out of order with respect to the removal of
>>  >     the kobject in the same manner that it would be if a reference was
>>  >     being held.
>>  >     
>>  >     This provides us with an easy way to allow driver writers to debug
>>  >     their drivers and fix otherwise hidden problems.
>>  >     
>>  >     Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
>>  >     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>> 
>> With this enabled, I get a bunch of scrolling oopses immediately after
>> exiting the bootloader. It happens so early I can't even capture them
>> over usb-serial, or earlyprintk=dbgp.
>> 
>> And for whatever reason, the printk path while oopsing ignores boot_delay parameter,
>> so I can't even use that.
>> 
>> any ideas ?
>
> The first big bug found with this was with the module kobject code, and
> a fix for that should be going in through Rusty's tree to Linus for this
> merge window (right Rusty?)

Yep, just sent pull request.  The fix was the final commit there...

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-04  5:23     ` Rusty Russell
@ 2013-09-05  4:13       ` Dave Jones
  2013-09-05  4:57         ` Greg KH
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Jones @ 2013-09-05  4:13 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Greg KH, Linux Kernel Mailing List, rmk+kernel

On Wed, Sep 04, 2013 at 02:53:11PM +0930, Rusty Russell wrote:
 
 > >> With this enabled, I get a bunch of scrolling oopses immediately after
 > >> exiting the bootloader. It happens so early I can't even capture them
 > >> over usb-serial, or earlyprintk=dbgp.
 > >> 
 > >> And for whatever reason, the printk path while oopsing ignores boot_delay parameter,
 > >> so I can't even use that.
 > >> 
 > >> any ideas ?
 > >
 > > The first big bug found with this was with the module kobject code, and
 > > a fix for that should be going in through Rusty's tree to Linus for this
 > > merge window (right Rusty?)
 > 
 > Yep, just sent pull request.  The fix was the final commit there...

Crap, that isn't the bug I'm hitting.

	Dave


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-05  4:13       ` Dave Jones
@ 2013-09-05  4:57         ` Greg KH
  2013-09-05  5:00           ` Dave Jones
  2013-09-05  5:19           ` Dave Jones
  0 siblings, 2 replies; 17+ messages in thread
From: Greg KH @ 2013-09-05  4:57 UTC (permalink / raw)
  To: Dave Jones, Rusty Russell, Linux Kernel Mailing List, rmk+kernel

On Thu, Sep 05, 2013 at 12:13:38AM -0400, Dave Jones wrote:
> On Wed, Sep 04, 2013 at 02:53:11PM +0930, Rusty Russell wrote:
>  
>  > >> With this enabled, I get a bunch of scrolling oopses immediately after
>  > >> exiting the bootloader. It happens so early I can't even capture them
>  > >> over usb-serial, or earlyprintk=dbgp.
>  > >> 
>  > >> And for whatever reason, the printk path while oopsing ignores boot_delay parameter,
>  > >> so I can't even use that.
>  > >> 
>  > >> any ideas ?
>  > >
>  > > The first big bug found with this was with the module kobject code, and
>  > > a fix for that should be going in through Rusty's tree to Linus for this
>  > > merge window (right Rusty?)
>  > 
>  > Yep, just sent pull request.  The fix was the final commit there...
> 
> Crap, that isn't the bug I'm hitting.

Any chance to get a oops traceback?

greg k-h

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-05  4:57         ` Greg KH
@ 2013-09-05  5:00           ` Dave Jones
  2013-09-05  5:19           ` Dave Jones
  1 sibling, 0 replies; 17+ messages in thread
From: Dave Jones @ 2013-09-05  5:00 UTC (permalink / raw)
  To: Greg KH; +Cc: Rusty Russell, Linux Kernel Mailing List, rmk+kernel

On Wed, Sep 04, 2013 at 09:57:45PM -0700, Greg KH wrote:
 > On Thu, Sep 05, 2013 at 12:13:38AM -0400, Dave Jones wrote:
 > > On Wed, Sep 04, 2013 at 02:53:11PM +0930, Rusty Russell wrote:
 > >  
 > >  > >> With this enabled, I get a bunch of scrolling oopses immediately after
 > >  > >> exiting the bootloader. It happens so early I can't even capture them
 > >  > >> over usb-serial, or earlyprintk=dbgp.
 > >  > >> 
 > >  > >> And for whatever reason, the printk path while oopsing ignores boot_delay parameter,
 > >  > >> so I can't even use that.
 > >  > >> 
 > >  > >> any ideas ?
 > >  > >
 > >  > > The first big bug found with this was with the module kobject code, and
 > >  > > a fix for that should be going in through Rusty's tree to Linus for this
 > >  > > merge window (right Rusty?)
 > >  > 
 > >  > Yep, just sent pull request.  The fix was the final commit there...
 > > 
 > > Crap, that isn't the bug I'm hitting.
 > 
 > Any chance to get a oops traceback?

Trying to figure out where to stick a while (1) to stop it scrolling away.
The end of dump_stack apparently wasn't enough.

	Dave


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-05  4:57         ` Greg KH
  2013-09-05  5:00           ` Dave Jones
@ 2013-09-05  5:19           ` Dave Jones
  2013-09-05 15:34             ` Greg KH
  2013-09-05 17:35             ` Russell King - ARM Linux
  1 sibling, 2 replies; 17+ messages in thread
From: Dave Jones @ 2013-09-05  5:19 UTC (permalink / raw)
  To: Greg KH; +Cc: Rusty Russell, Linux Kernel Mailing List, rmk+kernel

On Wed, Sep 04, 2013 at 09:57:45PM -0700, Greg KH wrote:
 > On Thu, Sep 05, 2013 at 12:13:38AM -0400, Dave Jones wrote:
 > > On Wed, Sep 04, 2013 at 02:53:11PM +0930, Rusty Russell wrote:
 > >  
 > >  > >> With this enabled, I get a bunch of scrolling oopses immediately after
 > >  > >> exiting the bootloader. It happens so early I can't even capture them
 > >  > >> over usb-serial, or earlyprintk=dbgp.
 > >  > >> 
 > >  > >> And for whatever reason, the printk path while oopsing ignores boot_delay parameter,
 > >  > >> so I can't even use that.
 > >  > >> 
 > >  > >> any ideas ?
 > >  > >
 > >  > > The first big bug found with this was with the module kobject code, and
 > >  > > a fix for that should be going in through Rusty's tree to Linus for this
 > >  > > merge window (right Rusty?)
 > >  > 
 > >  > Yep, just sent pull request.  The fix was the final commit there...
 > > 
 > > Crap, that isn't the bug I'm hitting.
 > 
 > Any chance to get a oops traceback?

Got it. (I think I was getting tracebacks from multiple cpus, hence the spewing).
Adding a check for tainted() = infinite loop to show_backtrace() combined with boot_delay
gave me a really long oops that I had to grab video to record.

RIP: <null>

Trace:
<IRQ>
run_timer_softirq
__do_softirq
irq_exit
smp_apic_timer_interrupt
apic_timer_interrupt
<EOI>
vprintk_emit
dev_vprintk_emit
dev_vprint_emit
__dev_printk
_dev_info
ahci_print_info
ahci_init_one
local_pci_probe
pci_device_probe
driver_probe_device
__driver_attach
bus_for_each_dev
driver_attach
bus_add_driver
driver_register
__pci_register_driver
ahci_pci_driver_init
do_one_initcall
...

	Dave


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-05  5:19           ` Dave Jones
@ 2013-09-05 15:34             ` Greg KH
  2013-09-05 15:47               ` Dave Jones
  2013-09-05 17:35             ` Russell King - ARM Linux
  1 sibling, 1 reply; 17+ messages in thread
From: Greg KH @ 2013-09-05 15:34 UTC (permalink / raw)
  To: Dave Jones, Rusty Russell, Linux Kernel Mailing List, rmk+kernel

On Thu, Sep 05, 2013 at 01:19:45AM -0400, Dave Jones wrote:
> On Wed, Sep 04, 2013 at 09:57:45PM -0700, Greg KH wrote:
>  > On Thu, Sep 05, 2013 at 12:13:38AM -0400, Dave Jones wrote:
>  > > On Wed, Sep 04, 2013 at 02:53:11PM +0930, Rusty Russell wrote:
>  > >  
>  > >  > >> With this enabled, I get a bunch of scrolling oopses immediately after
>  > >  > >> exiting the bootloader. It happens so early I can't even capture them
>  > >  > >> over usb-serial, or earlyprintk=dbgp.
>  > >  > >> 
>  > >  > >> And for whatever reason, the printk path while oopsing ignores boot_delay parameter,
>  > >  > >> so I can't even use that.
>  > >  > >> 
>  > >  > >> any ideas ?
>  > >  > >
>  > >  > > The first big bug found with this was with the module kobject code, and
>  > >  > > a fix for that should be going in through Rusty's tree to Linus for this
>  > >  > > merge window (right Rusty?)
>  > >  > 
>  > >  > Yep, just sent pull request.  The fix was the final commit there...
>  > > 
>  > > Crap, that isn't the bug I'm hitting.
>  > 
>  > Any chance to get a oops traceback?
> 
> Got it. (I think I was getting tracebacks from multiple cpus, hence the spewing).
> Adding a check for tainted() = infinite loop to show_backtrace() combined with boot_delay
> gave me a really long oops that I had to grab video to record.

Odd, but thanks for going through all of that.

> RIP: <null>
> 
> Trace:
> <IRQ>
> run_timer_softirq
> __do_softirq
> irq_exit
> smp_apic_timer_interrupt
> apic_timer_interrupt
> <EOI>
> vprintk_emit
> dev_vprintk_emit
> dev_vprint_emit
> __dev_printk
> _dev_info
> ahci_print_info
> ahci_init_one
> local_pci_probe
> pci_device_probe
> driver_probe_device
> __driver_attach
> bus_for_each_dev
> driver_attach
> bus_add_driver
> driver_register
> __pci_register_driver
> ahci_pci_driver_init

So we are loading a ahci driver here.  Any hint as to which
driver/device this is?  Is it a device that fails for this driver, and
then falls back to another more "specific" one?

And dieing in ahci_print_info()?  That's very strange.

Russell (not Rusty), any thoughts about this one?  You've been able to
debug a bunch of these tracebacks recently really well.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-05 15:34             ` Greg KH
@ 2013-09-05 15:47               ` Dave Jones
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Jones @ 2013-09-05 15:47 UTC (permalink / raw)
  To: Greg KH; +Cc: Rusty Russell, Linux Kernel Mailing List, rmk+kernel

On Thu, Sep 05, 2013 at 08:34:57AM -0700, Greg KH wrote:
 
 > > Trace:
 > > <IRQ>
 > > run_timer_softirq
 > > __do_softirq
 > > irq_exit
 > > smp_apic_timer_interrupt
 > > apic_timer_interrupt
 > > <EOI>
 > > vprintk_emit
 > > dev_vprintk_emit
 > > dev_vprint_emit
 > > __dev_printk
 > > _dev_info
 > > ahci_print_info
 > > ahci_init_one
 > > local_pci_probe
 > > pci_device_probe
 > > driver_probe_device
 > > __driver_attach
 > > bus_for_each_dev
 > > driver_attach
 > > bus_add_driver
 > > driver_register
 > > __pci_register_driver
 > > ahci_pci_driver_init
 > 
 > So we are loading a ahci driver here.  Any hint as to which
 > driver/device this is?  Is it a device that fails for this driver, and
 > then falls back to another more "specific" one?

It's "ahci.ko" (though it's built in).
For a boring onboard intel sata controller...
00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 04)

 
When I disable the debug option, usually it looks like..

[    6.382970] ahci 0000:00:1f.2: version 3.0
[    6.384753] ahci 0000:00:1f.2: irq 43 for MSI/MSI-X
[    6.385039] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps 0x6 impl SATA mode
[    6.385067] ahci 0000:00:1f.2: flags: 64bit ncq pm led clo pio slum part ems apst 
[    6.385101] ahci 0000:00:1f.2: setting latency timer to 64
[    6.402536] scsi0 : ahci
[    6.403669] scsi1 : ahci
[    6.404208] scsi2 : ahci
[    6.404756] scsi3 : ahci
[    6.405342] scsi4 : ahci
[    6.405951] scsi5 : ahci
[    6.406345] ata1: DUMMY
[    6.406364] ata2: SATA max UDMA/133 abar m2048@0xc043b000 port 0xc043b180 irq 43
[    6.406393] ata3: SATA max UDMA/133 abar m2048@0xc043b000 port 0xc043b200 irq 43
[    6.406416] ata4: DUMMY
[    6.406427] ata5: DUMMY
[    6.406438] ata6: DUMMY

	Dave


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-05  5:19           ` Dave Jones
  2013-09-05 15:34             ` Greg KH
@ 2013-09-05 17:35             ` Russell King - ARM Linux
  2013-09-05 18:13               ` Dave Jones
  2013-09-05 19:29               ` Dave Jones
  1 sibling, 2 replies; 17+ messages in thread
From: Russell King - ARM Linux @ 2013-09-05 17:35 UTC (permalink / raw)
  To: Dave Jones; +Cc: Greg KH, Rusty Russell, Linux Kernel Mailing List

On Thu, Sep 05, 2013 at 01:19:45AM -0400, Dave Jones wrote:
> Got it. (I think I was getting tracebacks from multiple cpus, hence
> the spewing). Adding a check for tainted() = infinite loop to
> show_backtrace() combined with boot_delay gave me a really long oops
> that I had to grab video to record.

How complete is this trace?  If accurate, I suspect that this is an oops
caused by the timer expiring, but the memory associated with the timer
has already been overwritten.  That would make it particularly hard to
debug.

Given that we run the delayed release 1 second after the object release,
I'm thinking that this has nothing to do with AHCI.

So, there's not a lot to go on here.

I'm wondering if there's any milage to allocating the kobject release
information inside kobject_add_internal(), so that we can at least run
the timer list and delayed work queues safely, and add debugging to
print the kobject pointer before we end up oopsing.  If we also add a
magic value to the beginning of the kobject, we can check whether it
looks like it may still be valid before we try and treat it as a kobject,
which may save us from oopsing on boot.

With the kobject debugging already there, it should be possible to match
the pointer value with previous debugging from kobject_add_internal()
and work out what kobject causes this.

Maybe the patch below will allow Dave to boot to a prompt and capture
the full dmesg - I've only just cooked up this patch so it may need
a bit of work to get it to build.  Please ensure that kobject debugging
is enabled, so we can translate from kobject addresses to names.

Dave - thanks for your patience and effort in trying to track this down.
Hopefully the patch below will make it easier.

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index de6dcbcc..30821f1 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -58,7 +58,13 @@ enum kobject_action {
 	KOBJ_MAX
 };
 
+struct kobject_release;
+
 struct kobject {
+#ifdef CONFIG_DEBUG_KOBJECT_RELEASE
+	unsigned int		magic;
+	struct kobject_release	*release;
+#endif
 	const char		*name;
 	struct list_head	entry;
 	struct kobject		*parent;
@@ -66,9 +72,6 @@ struct kobject {
 	struct kobj_type	*ktype;
 	struct sysfs_dirent	*sd;
 	struct kref		kref;
-#ifdef CONFIG_DEBUG_KOBJECT_RELEASE
-	struct delayed_work	release;
-#endif
 	unsigned int state_initialized:1;
 	unsigned int state_in_sysfs:1;
 	unsigned int state_add_uevent_sent:1;
diff --git a/lib/kobject.c b/lib/kobject.c
index 1d46c15..e5b5998 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -153,6 +153,13 @@ static void kobject_init_internal(struct kobject *kobj)
 }
 
 
+#ifdef CONFIG_DEBUG_KOBJECT_RELEASE
+#define KOBJ_MAGIC	('k' | 'o' << 8 | 'b' << 16 | 'j' << 24)
+struct kobject_release {
+	struct delayed_work work;
+	struct kobject *kobj;
+};
+#endif
 static int kobject_add_internal(struct kobject *kobj)
 {
 	int error = 0;
@@ -169,6 +176,14 @@ static int kobject_add_internal(struct kobject *kobj)
 
 	parent = kobject_get(kobj->parent);
 
+#ifdef CONFIG_DEBUG_KOBJECT_RELEASE
+	kobj->release = kmalloc(sizeof(*kobj->release), GFP_KERNEL);
+	if (!kobj->release)
+		return -ENOMEM;
+	kobj->release->kobj = kobj;
+	kobj->magic = KOBJ_MAGIC;
+#endif
+
 	/* join kset if set, use it as parent if we do not already have one */
 	if (kobj->kset) {
 		if (!parent)
@@ -198,6 +213,9 @@ static int kobject_add_internal(struct kobject *kobj)
 			WARN(1, "%s failed for %s (error: %d parent: %s)\n",
 			     __func__, kobject_name(kobj), error,
 			     parent ? kobject_name(parent) : "'none'");
+		kfree(kobj->release);
+		kobj->release = NULL;
+		kobj->magic = 0;
 	} else
 		kobj->state_in_sysfs = 1;
 
@@ -583,8 +601,20 @@ static void kobject_cleanup(struct kobject *kobj)
 #ifdef CONFIG_DEBUG_KOBJECT_RELEASE
 static void kobject_delayed_cleanup(struct work_struct *work)
 {
-	kobject_cleanup(container_of(to_delayed_work(work),
-				     struct kobject, release));
+	struct kobject_release *release = container_of(to_delayed_work(work),
+						struct kobject_release, work);
+	struct kobject *kobj = release->kobj;
+
+	kfree(release);
+	if (kobj->magic == KOBJ_MAGIC) {
+		pr_debug("kobject: '%s' (%p): %s\n",
+			 kobject_name(kobj), kobj, __func__);
+		kobject_cleanup(kobj);
+	} else {
+		pr_err("kobject %p has been corrupted (magic 0x%08x).  Please enable kobject debugging for full debug.\n",
+			kobj, kobj->magic);
+		/* We end up leaking the kobject */
+	}
 }
 #endif
 
@@ -594,8 +624,8 @@ static void kobject_release(struct kref *kref)
 #ifdef CONFIG_DEBUG_KOBJECT_RELEASE
 	pr_debug("kobject: '%s' (%p): %s, parent %p (delayed)\n",
 		 kobject_name(kobj), kobj, __func__, kobj->parent);
-	INIT_DELAYED_WORK(&kobj->release, kobject_delayed_cleanup);
-	schedule_delayed_work(&kobj->release, HZ);
+	INIT_DELAYED_WORK(&kobj->release->work, kobject_delayed_cleanup);
+	schedule_delayed_work(&kobj->release->work, HZ);
 #else
 	kobject_cleanup(kobj);
 #endif


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-05 17:35             ` Russell King - ARM Linux
@ 2013-09-05 18:13               ` Dave Jones
  2013-09-05 19:29               ` Dave Jones
  1 sibling, 0 replies; 17+ messages in thread
From: Dave Jones @ 2013-09-05 18:13 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Greg KH, Rusty Russell, Linux Kernel Mailing List

On Thu, Sep 05, 2013 at 06:35:40PM +0100, Russell King - ARM Linux wrote:
 > On Thu, Sep 05, 2013 at 01:19:45AM -0400, Dave Jones wrote:
 > > Got it. (I think I was getting tracebacks from multiple cpus, hence
 > > the spewing). Adding a check for tainted() = infinite loop to
 > > show_backtrace() combined with boot_delay gave me a really long oops
 > > that I had to grab video to record.
 > 
 > How complete is this trace?  If accurate, I suspect that this is an oops
 > caused by the timer expiring, but the memory associated with the timer
 > has already been overwritten.  That would make it particularly hard to
 > debug.

There were some '?' symbols in the trace too which I omitted last night.
(offsets left out, because I'm hand-transcribing from video) 

<IRQ>
? call_timer_fn
? call_timer_fn
run_timer_softirq
__do_softirq
irq_exit
smp_apic_timer_interrupt
apic_timer_interrupt
<EOI>
? touch_nmi_watchdog
? touch_nmi_watchdog
vprintk_emit
? create_syslog_header
dev_vprintk_emit
? dump_trace
? trace_hardirqs_off_caller
? trace_hardirqs_off_caller
dev_print_emit
? trace_hardirqs_off_caller
__dev_printk
_dev_info
? trace_hardirqs_off
ahci_print_info
ahci_init_one
local_pci_probe
pci_device_probe
driver_probe_device
__driver_attach
? __device_attach
bus_for_each_dev
driver_attach
bus_add_driver
? ata_sff_init
driver_register
__pci_register_driver
ahci_pci_driver_init
do_one_initcall
? parse_args
kernel_init_freeable
? do_early_param
? rest_init
kernel_init
ret_from_fork
? rest_init

 > Maybe the patch below will allow Dave to boot to a prompt and capture
 > the full dmesg - I've only just cooked up this patch so it may need
 > a bit of work to get it to build.  Please ensure that kobject debugging
 > is enabled, so we can translate from kobject addresses to names.
 > 
 > Dave - thanks for your patience and effort in trying to track this down.
 > Hopefully the patch below will make it easier.

building now.

Thanks,

	Dave


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-05 17:35             ` Russell King - ARM Linux
  2013-09-05 18:13               ` Dave Jones
@ 2013-09-05 19:29               ` Dave Jones
  2013-09-05 21:11                 ` Dave Jones
  1 sibling, 1 reply; 17+ messages in thread
From: Dave Jones @ 2013-09-05 19:29 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Greg KH, Rusty Russell, Linux Kernel Mailing List

On Thu, Sep 05, 2013 at 06:35:40PM +0100, Russell King - ARM Linux wrote:
 
 > Dave - thanks for your patience and effort in trying to track this down.
 > Hopefully the patch below will make it easier.
 
I'm not sure if this is happening because I enabled DEBUG_KOBJECT this time around
or not, but now I'm seeing a different trace. (same hang though)

serial8250: ttyS0 at I/O 0x3f8 (ieq =4, base_baud = 115200) is 16550A
Oops
RIP: kobject_release
trace:

kobject_put
cdev_del
tty_unregister_device
uart_remove_one_port
serial8250_register_8250_port
pciserial_init_ports
pciserial_init_one
local_pci_probe
pci_device_probe
driver_probe_device
__driver_attach
? __device_attach
bus_for_each_dev
driver_attach
bus_add_driver
? early_serial_setup
driver_register
? early_serial_setup
__pci_register_driver
serial_pci_driver_init
do_one_initcall


Trying without serial console next..


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-05 19:29               ` Dave Jones
@ 2013-09-05 21:11                 ` Dave Jones
  2013-09-05 21:26                   ` Dave Jones
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Jones @ 2013-09-05 21:11 UTC (permalink / raw)
  To: Russell King - ARM Linux, Greg KH, Rusty Russell,
	Linux Kernel Mailing List

On Thu, Sep 05, 2013 at 03:29:17PM -0400, Dave Jones wrote:
 > On Thu, Sep 05, 2013 at 06:35:40PM +0100, Russell King - ARM Linux wrote:
 >  
 >  > Dave - thanks for your patience and effort in trying to track this down.
 >  > Hopefully the patch below will make it easier.
 >  
 > I'm not sure if this is happening because I enabled DEBUG_KOBJECT this time around
 > or not, but now I'm seeing a different trace. (same hang though)
 > 
 > serial8250: ttyS0 at I/O 0x3f8 (ieq =4, base_baud = 115200) is 16550A
 > Oops
 > RIP: kobject_release
 > trace:
 > 
 > kobject_put
 > cdev_del
 > tty_unregister_device
 > uart_remove_one_port
 > serial8250_register_8250_port
 > pciserial_init_ports
 > pciserial_init_one
 > local_pci_probe
 > pci_device_probe
 > driver_probe_device
 > __driver_attach
 > ? __device_attach
 > bus_for_each_dev
 > driver_attach
 > bus_add_driver
 > ? early_serial_setup
 > driver_register
 > ? early_serial_setup
 > __pci_register_driver
 > serial_pci_driver_init
 > do_one_initcall
 > 
 > 
 > Trying without serial console next..

rebuilt with all serial turned off.

no luck, then it oopses somewhere else. I'm suspecting something isn't
right with that debug patch, as the next trace is also in kobject_release

	Dave

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-05 21:11                 ` Dave Jones
@ 2013-09-05 21:26                   ` Dave Jones
  2013-09-05 21:44                     ` Russell King - ARM Linux
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Jones @ 2013-09-05 21:26 UTC (permalink / raw)
  To: Russell King - ARM Linux, Greg KH, Rusty Russell,
	Linux Kernel Mailing List

On Thu, Sep 05, 2013 at 05:11:13PM -0400, Dave Jones wrote:
 
 >  > Trying without serial console next..
 > 
 > rebuilt with all serial turned off.
 > 
 > no luck, then it oopses somewhere else. I'm suspecting something isn't
 > right with that debug patch, as the next trace is also in kobject_release

I managed to get to userspace on one boot, and got this.

kobject ffff88023d93f518 has been corrupted (magic 0x6b6b6b6b).  Please enable kobject debugging for full debug.

That looks like SLAB_POISON. Incompatibility between the two options ?

For some reason, even though I have DEBUG_KOBJECT on, I didn't get extra messages output.

	Dave


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-05 21:26                   ` Dave Jones
@ 2013-09-05 21:44                     ` Russell King - ARM Linux
  2013-09-06  1:11                       ` Dave Jones
  0 siblings, 1 reply; 17+ messages in thread
From: Russell King - ARM Linux @ 2013-09-05 21:44 UTC (permalink / raw)
  To: Dave Jones; +Cc: Greg KH, Rusty Russell, Linux Kernel Mailing List

On Thu, Sep 05, 2013 at 05:26:06PM -0400, Dave Jones wrote:
> On Thu, Sep 05, 2013 at 05:11:13PM -0400, Dave Jones wrote:
>  >  > Trying without serial console next..
>  > 
>  > rebuilt with all serial turned off.
>  > 
>  > no luck, then it oopses somewhere else. I'm suspecting something isn't
>  > right with that debug patch, as the next trace is also in kobject_release

You're right about that - I had assumed that it was necessary for all
kobjects to be 'added' before they're released, but that is not so.
Mea culpa.  They just need to be initialised - the problem with putting
a kmalloc into kobject_init() is that if it fails, we have no way to
report that failure...

Updated patch attached... though it sounds like you got it working
anyway.

> I managed to get to userspace on one boot, and got this.
> 
> kobject ffff88023d93f518 has been corrupted (magic 0x6b6b6b6b).  Please
> enable kobject debugging for full debug.

Okay, so this is definitely a case that someone has kfree'd the kobject
without waiting for the ->release function to be called.

> That looks like SLAB_POISON. Incompatibility between the two options ?
> 
> For some reason, even though I have DEBUG_KOBJECT on, I didn't get
> extra messages output.

Hmm.

ifeq ($(CONFIG_DEBUG_KOBJECT),y)
CFLAGS_kobject.o += -DDEBUG
CFLAGS_kobject_uevent.o += -DDEBUG
endif

should enable the pr_debug()'s in lib/kobject.c... which should at least
appear in the dmesg log.  Being debug level, of course, they won't appear
during normal kernel boot unless 'debug' is passed on the kernel command
line.

It seems to work for me - but... produces rather a lot of debug messages,
so you may also wish to ensure that you have LOG_BUF_SHIFT set to
something large.

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index de6dcbcc..30821f1 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -58,7 +58,13 @@ enum kobject_action {
 	KOBJ_MAX
 };
 
+struct kobject_release;
+
 struct kobject {
+#ifdef CONFIG_DEBUG_KOBJECT_RELEASE
+	unsigned int		magic;
+	struct kobject_release	*release;
+#endif
 	const char		*name;
 	struct list_head	entry;
 	struct kobject		*parent;
@@ -66,9 +72,6 @@ struct kobject {
 	struct kobj_type	*ktype;
 	struct sysfs_dirent	*sd;
 	struct kref		kref;
-#ifdef CONFIG_DEBUG_KOBJECT_RELEASE
-	struct delayed_work	release;
-#endif
 	unsigned int state_initialized:1;
 	unsigned int state_in_sysfs:1;
 	unsigned int state_add_uevent_sent:1;
diff --git a/lib/kobject.c b/lib/kobject.c
index 1d46c15..a700d4f 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -153,6 +153,13 @@ static void kobject_init_internal(struct kobject *kobj)
 }
 
 
+#ifdef CONFIG_DEBUG_KOBJECT_RELEASE
+#define KOBJ_MAGIC	('k' | 'o' << 8 | 'b' << 16 | 'j' << 24)
+struct kobject_release {
+	struct delayed_work work;
+	struct kobject *kobj;
+};
+#endif
 static int kobject_add_internal(struct kobject *kobj)
 {
 	int error = 0;
@@ -283,6 +290,15 @@ void kobject_init(struct kobject *kobj, struct kobj_type *ktype)
 		       "object, something is seriously wrong.\n", kobj);
 		dump_stack();
 	}
+#ifdef CONFIG_DEBUG_KOBJECT_RELEASE
+	kobj->release = kmalloc(sizeof(*kobj->release), GFP_KERNEL);
+	WARN_ON(!kobj->release);
+	if (kobj->release) {
+		kobj->release->kobj = kobj;
+		kobj->magic = KOBJ_MAGIC;
+	}
+#endif
+
 
 	kobject_init_internal(kobj);
 	kobj->ktype = ktype;
@@ -583,8 +599,20 @@ static void kobject_cleanup(struct kobject *kobj)
 #ifdef CONFIG_DEBUG_KOBJECT_RELEASE
 static void kobject_delayed_cleanup(struct work_struct *work)
 {
-	kobject_cleanup(container_of(to_delayed_work(work),
-				     struct kobject, release));
+	struct kobject_release *release = container_of(to_delayed_work(work),
+						struct kobject_release, work);
+	struct kobject *kobj = release->kobj;
+
+	kfree(release);
+	if (kobj->magic == KOBJ_MAGIC) {
+		pr_debug("kobject: '%s' (%p): %s\n",
+			 kobject_name(kobj), kobj, __func__);
+		kobject_cleanup(kobj);
+	} else {
+		pr_err("kobject %p has been corrupted (magic 0x%08x).  Please enable kobject debugging for full debug.\n",
+			kobj, kobj->magic);
+		/* We end up leaking the kobject */
+	}
 }
 #endif
 
@@ -594,8 +622,8 @@ static void kobject_release(struct kref *kref)
 #ifdef CONFIG_DEBUG_KOBJECT_RELEASE
 	pr_debug("kobject: '%s' (%p): %s, parent %p (delayed)\n",
 		 kobject_name(kobj), kobj, __func__, kobj->parent);
-	INIT_DELAYED_WORK(&kobj->release, kobject_delayed_cleanup);
-	schedule_delayed_work(&kobj->release, HZ);
+	INIT_DELAYED_WORK(&kobj->release->work, kobject_delayed_cleanup);
+	schedule_delayed_work(&kobj->release->work, HZ);
 #else
 	kobject_cleanup(kobj);
 #endif


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-05 21:44                     ` Russell King - ARM Linux
@ 2013-09-06  1:11                       ` Dave Jones
  2013-09-06  7:59                         ` Russell King - ARM Linux
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Jones @ 2013-09-06  1:11 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Greg KH, Rusty Russell, Linux Kernel Mailing List

On Thu, Sep 05, 2013 at 10:44:34PM +0100, Russell King - ARM Linux wrote:
 > On Thu, Sep 05, 2013 at 05:26:06PM -0400, Dave Jones wrote:
 > > On Thu, Sep 05, 2013 at 05:11:13PM -0400, Dave Jones wrote:
 > >  >  > Trying without serial console next..
 > >  > 
 > >  > rebuilt with all serial turned off.
 > >  > 
 > >  > no luck, then it oopses somewhere else. I'm suspecting something isn't
 > >  > right with that debug patch, as the next trace is also in kobject_release
 > 
 > You're right about that - I had assumed that it was necessary for all
 > kobjects to be 'added' before they're released, but that is not so.
 > Mea culpa.  They just need to be initialised - the problem with putting
 > a kmalloc into kobject_init() is that if it fails, we have no way to
 > report that failure...
 > 
 > Updated patch attached... though it sounds like you got it working
 > anyway.
 > 
 > > I managed to get to userspace on one boot, and got this.
 > > 
 > > kobject ffff88023d93f518 has been corrupted (magic 0x6b6b6b6b).  Please
 > > enable kobject debugging for full debug.
 > 
 > Okay, so this is definitely a case that someone has kfree'd the kobject
 > without waiting for the ->release function to be called.
 > 
 > > That looks like SLAB_POISON. Incompatibility between the two options ?
 > > 
 > > For some reason, even though I have DEBUG_KOBJECT on, I didn't get
 > > extra messages output.
 > 
 > Hmm.
 > 
 > ifeq ($(CONFIG_DEBUG_KOBJECT),y)
 > CFLAGS_kobject.o += -DDEBUG
 > CFLAGS_kobject_uevent.o += -DDEBUG
 > endif
 > 
 > should enable the pr_debug()'s in lib/kobject.c... which should at least
 > appear in the dmesg log.  Being debug level, of course, they won't appear
 > during normal kernel boot unless 'debug' is passed on the kernel command
 > line.
 > 
 > It seems to work for me - but... produces rather a lot of debug messages,
 > so you may also wish to ensure that you have LOG_BUF_SHIFT set to
 > something large.

As discussed on irc, with this updated patch it hangs *really* early in boot. 

With earlyprintk=vga I can see some of the trace..

kobject_init
firmware_map_add_entry
firmware_map_add_early
e820_reserve_resources
setup_arch
start_kernel
x86_64_start_reservations
x86_64_start_kernel

<RIP> kmem_cache_alloc_trace

	Dave


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kobject: delayed kobject release: help find buggy drivers
  2013-09-06  1:11                       ` Dave Jones
@ 2013-09-06  7:59                         ` Russell King - ARM Linux
  0 siblings, 0 replies; 17+ messages in thread
From: Russell King - ARM Linux @ 2013-09-06  7:59 UTC (permalink / raw)
  To: Dave Jones; +Cc: Greg KH, Rusty Russell, Linux Kernel Mailing List

On Thu, Sep 05, 2013 at 09:11:37PM -0400, Dave Jones wrote:
> As discussed on irc, with this updated patch it hangs *really* early in boot. 
> 
> With earlyprintk=vga I can see some of the trace..
> 
> kobject_init
> firmware_map_add_entry
> firmware_map_add_early
> e820_reserve_resources
> setup_arch
> start_kernel
> x86_64_start_reservations
> x86_64_start_kernel
> 
> <RIP> kmem_cache_alloc_trace

Argh, so kobject_init() gets used before the memory allocators are setup.
This is starting to look like we can't separate the delayed_work from
the kobject struct to provide a layer of isolation.  I'm running out of
ideas here.

Do we have a way to tell if the memory allocators have been initialised?
I can't see anything that would, with the exception that the scheduler
initialisation requires memory allocators are up and running, and it has
its own scheduler_running state, but that's buried in the scheduler.

system_state is useless for this because it gets set to the running state
after device initialisation, and just before we drop into userspace.

We could go back to doing the allocation at kobject_add() time, that seemed
to be marginally successful for you - but I'm fairly certain it won't get
all kobjects that are later released (because its possible to init a
kobject and then put it once to discard it.)

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2013-09-06  7:59 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20130903185924.046EE660E91@gitolite.kernel.org>
2013-09-03 20:44 ` kobject: delayed kobject release: help find buggy drivers Dave Jones
2013-09-03 20:55   ` Greg KH
2013-09-04  5:23     ` Rusty Russell
2013-09-05  4:13       ` Dave Jones
2013-09-05  4:57         ` Greg KH
2013-09-05  5:00           ` Dave Jones
2013-09-05  5:19           ` Dave Jones
2013-09-05 15:34             ` Greg KH
2013-09-05 15:47               ` Dave Jones
2013-09-05 17:35             ` Russell King - ARM Linux
2013-09-05 18:13               ` Dave Jones
2013-09-05 19:29               ` Dave Jones
2013-09-05 21:11                 ` Dave Jones
2013-09-05 21:26                   ` Dave Jones
2013-09-05 21:44                     ` Russell King - ARM Linux
2013-09-06  1:11                       ` Dave Jones
2013-09-06  7:59                         ` Russell King - ARM Linux

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).