All of lore.kernel.org
 help / color / mirror / Atom feed
* Top 10 kernel oopses for the week ending January 5th, 2008
@ 2008-01-05 21:06 Arjan van de Ven
  2008-01-05 21:26 ` Al Viro
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Arjan van de Ven @ 2008-01-05 21:06 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Linus Torvalds, Andrew Morton, NetDev

The http://www.kerneloops.org website collects kernel oops and
warning reports from various mailing lists and bugzillas as well as
with a client users can install to auto-submit oopses.
Below is a top 10 list of the oopses collected in the last 7 days.
(Reports prior to 2.6.23 have been omitted in collecting the top 10)

This week, a total of 49 oopses and warnings have been reported,
compared to 53 reports in the previous week.


Rank 1: __ieee80211_rx
	Warning at net/mac80211/rx.c:1672
	Reported 6 times (11 total reports)
	Same issue that was ranked 2nd last week
	Johannes has diagnosed this as a driver bug in the iwlwifi drivers
	More info: http://www.kerneloops.org/search.php?search=__ieee80211_rx

Rank 2: elv_next_request
	kernel page fault
	Reported 6 times (7 total reports)
	Seems to be related to fast modprobe/rmmod cycles
	More info: http://www.kerneloops.org/search.php?search=elv_next_request

Rank 3: d_splice_alias
	NULL pointer deref
	Reported 3 times
	Happens in the isofs code
	Only seen in 2.6.24-rc5-mm1
	More info: http://www.kerneloops.org/search.php?search=d_splice_alias

Rank 4: remove_proc_entry
	Was also ranked 4th last week
	Only in tainted oopses
	Reported 3 times (12 total reports)
	More info: http://www.kerneloops.org/search.php?search=remove_proc_entry

Rank 5: __d_path
	In the sys_getcwd system call
	Only reported for 2.6.23.x, by one user
	Reported 2 times
	More info: http://www.kerneloops.org/search.php?search=__d_path

Rank 6: device_release
	Was ranked 8th last week
	Same reports as last week, but now entered into bugzilla.kernel.org
	Reported 2 times (8 total reports)
	More info: http://www.kerneloops.org/search.php?search=device_release

Rank 7: pgd_alloc
	Has only been seen on machines tainted with the nvidia module
	Reported 2 times
	More info: http://www.kerneloops.org/search.php?search=pgd_alloc

Rank 8: evdev_disconnect
	kernel page fault
	Reported 2 times (10 total reports)
	Previously seen in older kernels including 2.6.21 but as far back as 2.6.16
	More info: http://www.kerneloops.org/search.php?search=evdev_disconnect

Rank 9: mutex_lock
	kernel null pointer due to rfcomm_tty_close sysfs interaction
	Reported 2 times (9 total reports)
	Ranked 9th last week as well
	More info: http://www.kerneloops.org/search.php?search=mutex_lock

Rank 10: lock_acquire
	WARNING: at kernel/lockdep.c:2658 check_flags()
	Reported 2 times (8 total reports)
	Seems related to __atomic_notifier_call_chain
	More info: http://www.kerneloops.org/search.php?search=lock_acquire


kerneloops.org news
* There is now a UI client so that if your kernel has an oops, you'll get asked for permission
   to submit this (rather than having to manually edit a config file as before)
* If you run the Gentoo distribution, please install the client using "emerge kerneloops"
* If you run the Fedora 8 (or rawhide) distribution, please install the rpm you can download
   from the http://www.kerneloops.org homepage
* If you run Debian unstable, please install the client via apt-get install kerneloops

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-05 21:06 Top 10 kernel oopses for the week ending January 5th, 2008 Arjan van de Ven
@ 2008-01-05 21:26 ` Al Viro
  2008-01-05 21:39 ` Al Viro
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Al Viro @ 2008-01-05 21:26 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, NetDev

On Sat, Jan 05, 2008 at 01:06:17PM -0800, Arjan van de Ven wrote:
> Rank 3: d_splice_alias
> 	NULL pointer deref
> 	Reported 3 times
> 	Happens in the isofs code
> 	Only seen in 2.6.24-rc5-mm1
> 	More info: http://www.kerneloops.org/search.php?search=d_splice_alias

in -rc6-mm1 as well, fixes on l-k...
 
> Rank 8: evdev_disconnect
> 	kernel page fault
> 	Reported 2 times (10 total reports)
> 	Previously seen in older kernels including 2.6.21 but as far back as 
> 	2.6.16
> 	More info: 
> 	http://www.kerneloops.org/search.php?search=evdev_disconnect

Practically certain to be fixed by 6addb1d6de1968b84852f54561cc9a999909b5a9;
it's list_for_each_entry() walking from the entry that just got list_del()
and kernels without that sucker have no protection of the list in question
whatsoever.
 
> Rank 9: mutex_lock
> 	kernel null pointer due to rfcomm_tty_close sysfs interaction
> 	Reported 2 times (9 total reports)
> 	Ranked 9th last week as well
> 	More info: http://www.kerneloops.org/search.php?search=mutex_lock

pissfs locking problems; AFAICS, its sysfs_get_dentry() walking into
already unlinked sysfs directory node on the way to target; no exclusion
against that is there and it's not trivial to add.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-05 21:06 Top 10 kernel oopses for the week ending January 5th, 2008 Arjan van de Ven
  2008-01-05 21:26 ` Al Viro
@ 2008-01-05 21:39 ` Al Viro
  2008-01-07 17:44   ` J. Bruce Fields
  2008-01-06  3:30 ` Andi Kleen
  2008-01-09 14:12 ` Johannes Berg
  3 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2008-01-05 21:39 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, NetDev

On Sat, Jan 05, 2008 at 01:06:17PM -0800, Arjan van de Ven wrote:
> The http://www.kerneloops.org website collects kernel oops and
> warning reports from various mailing lists and bugzillas as well as
> with a client users can install to auto-submit oopses.
> Below is a top 10 list of the oopses collected in the last 7 days.
> (Reports prior to 2.6.23 have been omitted in collecting the top 10)
> 
> This week, a total of 49 oopses and warnings have been reported,
> compared to 53 reports in the previous week.

FWIW, people moaning about the lack of entry-level kernel work would
do well by decoding those to the level of "this place in this function,
called from <here>, with so-and-so variable being <this>" and posting
the results.  As skills go, it's far more useful than "how to trim
the trailing whitespace" and the rest of checkpatch.pl-inspired crap
that got so popular lately...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-05 21:06 Top 10 kernel oopses for the week ending January 5th, 2008 Arjan van de Ven
  2008-01-05 21:26 ` Al Viro
  2008-01-05 21:39 ` Al Viro
@ 2008-01-06  3:30 ` Andi Kleen
  2008-01-06  3:31   ` Arjan van de Ven
  2008-01-09 14:12 ` Johannes Berg
  3 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2008-01-06  3:30 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, NetDev

Arjan van de Ven <arjan@linux.intel.com> writes:
>
> Rank 4: remove_proc_entry
> 	Was also ranked 4th last week
> 	Only in tainted oopses
> 	Reported 3 times (12 total reports)
> 	More info: http://www.kerneloops.org/search.php?search=remove_proc_entry

Likely a broken module_exit() function that corrupts the list. To track
these down it might be useful to keep a list of recently unloaded modules
and dump these too in the oops module list with a special flag.

-Andi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-06  3:30 ` Andi Kleen
@ 2008-01-06  3:31   ` Arjan van de Ven
  2008-01-06  3:50     ` Andi Kleen
  0 siblings, 1 reply; 26+ messages in thread
From: Arjan van de Ven @ 2008-01-06  3:31 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, NetDev

Andi Kleen wrote:
> Arjan van de Ven <arjan@linux.intel.com> writes:
>> Rank 4: remove_proc_entry
>> 	Was also ranked 4th last week
>> 	Only in tainted oopses
>> 	Reported 3 times (12 total reports)
>> 	More info: http://www.kerneloops.org/search.php?search=remove_proc_entry
> 
> Likely a broken module_exit() function that corrupts the list. To track
> these down it might be useful to keep a list of recently unloaded modules
> and dump these too in the oops module list with a special flag.
> 
I suspect that simply printing the currently unloading module will catch 90%+ already;
I'll look into adding this, it's a very good idea.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-06  3:31   ` Arjan van de Ven
@ 2008-01-06  3:50     ` Andi Kleen
  0 siblings, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2008-01-06  3:50 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, Linux Kernel Mailing List, Linus Torvalds,
	Andrew Morton, NetDev

On Sat, Jan 05, 2008 at 07:31:29PM -0800, Arjan van de Ven wrote:
> Andi Kleen wrote:
> >Arjan van de Ven <arjan@linux.intel.com> writes:
> >>Rank 4: remove_proc_entry
> >>	Was also ranked 4th last week
> >>	Only in tainted oopses
> >>	Reported 3 times (12 total reports)
> >>	More info: 
> >>	http://www.kerneloops.org/search.php?search=remove_proc_entry
> >
> >Likely a broken module_exit() function that corrupts the list. To track
> >these down it might be useful to keep a list of recently unloaded modules
> >and dump these too in the oops module list with a special flag.
> >
> I suspect that simply printing the currently unloading module will catch 
> 90%+ already;

There are a few cases where the corruption is only visible next time
someone accesses the list. So a small ring buffer might make sense.

-Andi


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-05 21:39 ` Al Viro
@ 2008-01-07 17:44   ` J. Bruce Fields
  2008-01-08  1:19     ` Kevin Winchester
  0 siblings, 1 reply; 26+ messages in thread
From: J. Bruce Fields @ 2008-01-07 17:44 UTC (permalink / raw)
  To: Al Viro
  Cc: Arjan van de Ven, Linux Kernel Mailing List, Linus Torvalds,
	Andrew Morton, NetDev

On Sat, Jan 05, 2008 at 09:39:35PM +0000, Al Viro wrote:
> On Sat, Jan 05, 2008 at 01:06:17PM -0800, Arjan van de Ven wrote:
> > The http://www.kerneloops.org website collects kernel oops and
> > warning reports from various mailing lists and bugzillas as well as
> > with a client users can install to auto-submit oopses.
> > Below is a top 10 list of the oopses collected in the last 7 days.
> > (Reports prior to 2.6.23 have been omitted in collecting the top 10)
> > 
> > This week, a total of 49 oopses and warnings have been reported,
> > compared to 53 reports in the previous week.
> 
> FWIW, people moaning about the lack of entry-level kernel work would
> do well by decoding those to the level of "this place in this function,
> called from <here>, with so-and-so variable being <this>" and posting
> the results.  As skills go, it's far more useful than "how to trim
> the trailing whitespace" and the rest of checkpatch.pl-inspired crap
> that got so popular lately...

Is there any good basic documentation on this to point people at?

--b.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-07 17:44   ` J. Bruce Fields
@ 2008-01-08  1:19     ` Kevin Winchester
  2008-01-08  3:26       ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Kevin Winchester @ 2008-01-08  1:19 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Al Viro, Arjan van de Ven, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, NetDev

J. Bruce Fields wrote:
> On Sat, Jan 05, 2008 at 09:39:35PM +0000, Al Viro wrote:
>> On Sat, Jan 05, 2008 at 01:06:17PM -0800, Arjan van de Ven wrote:
>>> The http://www.kerneloops.org website collects kernel oops and 
>>> warning reports from various mailing lists and bugzillas as well 
>>> as with a client users can install to auto-submit oopses. Below 
>>> is a top 10 list of the oopses collected in the last 7 days. 
>>> (Reports prior to 2.6.23 have been omitted in collecting the top 
>>> 10)
>>> 
>>> This week, a total of 49 oopses and warnings have been reported,
>>>  compared to 53 reports in the previous week.
>> FWIW, people moaning about the lack of entry-level kernel work 
>> would do well by decoding those to the level of "this place in this
>>  function, called from <here>, with so-and-so variable being
>> <this>" and posting the results.  As skills go, it's far more
>> useful than "how to trim the trailing whitespace" and the rest of 
>> checkpatch.pl-inspired crap that got so popular lately...
> 
> Is there any good basic documentation on this to point people at?
> 

I would second this question.  I see people "decode" oops on lkml often enough, but I've never been entirely sure how its done.  Is it somewhere in Documentation?

-- 
Kevin Winchester

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-08  1:19     ` Kevin Winchester
@ 2008-01-08  3:26       ` Linus Torvalds
  2008-01-08  5:59         ` Al Viro
                           ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Linus Torvalds @ 2008-01-08  3:26 UTC (permalink / raw)
  To: Kevin Winchester
  Cc: J. Bruce Fields, Al Viro, Arjan van de Ven,
	Linux Kernel Mailing List, Andrew Morton, NetDev



On Mon, 7 Jan 2008, Kevin Winchester wrote:

> J. Bruce Fields wrote:
> > 
> > Is there any good basic documentation on this to point people at?
> 
> I would second this question.  I see people "decode" oops on lkml often 
> enough, but I've never been entirely sure how its done.  Is it somewhere 
> in Documentation?

It's actually not necessarily at all that trivial, unless you have a deep 
understanding of the code generated for the architecture in question (and 
even then, some oopses take more time to figure out than others, thanks 
to inlining and tailcalls etc).

If the oops happened with a kernel you generated yourself, it's usually 
rather easy. Especially if you said "y" to the "generate debugging info" 
question at configuration time. Because, in that case, you really just do 
a simple

	gdb vmlinux

and then you can do (for example) something like setting a breakpoint at 
the EIP that was reported for the oops, and it will tell you what line it 
came from.

However, if you don't have the exact binary - which is the common case for 
random oopses reported on lkml - you will generally have to disassemble 
the hex sequence given in the oops (the "Code:" line), and try to match it 
up against the source code to try to figure out what is going on.

Even just the disassembly is not entirely trivial, since the oops will 
give you the eip that it happened at, but you often want to also 
disassemble *backwards* in order to get more of a context (the "Code:" 
line will mark the particular EIP that starts the oopsing instruction by 
enclosing it in <xx>, but with non-constant instruction lengths, you need 
to use a bit of trial-and-error to figure it out.

I usually just compile a small program like

	const char array[]="\xnn\xnn\xnn...";

	int main(int argc, char **argv)
	{
		printf("%p\n", array);
		*(int *)0=0;
	}

and run it under gdb, and then when it gets the SIGSEGV (due to the 
obvious NULL pointer dereference), I can just ask gdb to disassemble 
around the array that contains the code[] stuff. Try a few offsets, to see 
when the disassembly makes sense (and gives the reported EIP as the 
beginning of one of the disassembled instructions).

(You can do it other and smarter ways too, I'm not claiming that's a 
particularly good way to do it, and the old "ksymoops" program used to do 
a pretty good job of this, but I'm used to that particular idiotic way 
myself, since it's how I've basically always done it)

After that, you still need to try to match up the assembly code with the 
source code and figure out what variables the register contents actually 
are all about. You can often try to do a

	make the/affected/file.s

to generate the asm file in your own tree - the register allocation can be 
totally different due to different compilers and different options (and 
things like the fact that maybe the source tree you do this on doesn't 
match the oops report exactly), but it's usually a good starting point to 
compare the disassembly from gdb with the *.s file output from the 
compiler.

Quite often, it's all very obvious (you see some constant or other simple 
pattern). But if you're not used to the assembly format, you'll spend a 
lot of brainpower just trying to figure that part out even for the obvious 
stuff, which is why it's a good thing if you are very comfortable indeed 
with the assembly language of that particular platform.

It's not really all that hard. But the first few times you see those 
oopses, it all looks mostly like just line noise. So it definitely takes 
some practice to do it well.

Anyway, let's take an example, from

	http://lkml.org/lkml/2008/1/1/189

where the most obviously relevant parts are:

	BUG: unable to handle kernel paging request at virtual address 00100100
	EIP:    0060:[<f8819668>] 
	EIP is at evdev_disconnect+0x65/0x9e

	eax: 00000000   ebx: 000ffcf0   ecx: c1926760   edx: 00000033
	esi: f7415600   edi: f741564c   ebp: f7415654   esp: c1967e68
	Call Trace:
		[<c03454b2>] input_unregister_device+0x6f/0xff
		[<c03c6eb6>] klist_release+0x27/0x30
		[<c029178a>] kref_put+0x5f/0x6c
	..
	Code: 5e 4c 81 eb 10 04 00 00 eb 21 8d 83 08 04 00 00 b9 06 00 02 
	      00 ba 1d 00 00 00 e8 6a 93 95 c7 8b 9b 10 04 00 00 81 eb 10 
	      04 00 00 <8b> 83 10 04 00 00 0f 18 00 90 8d 83 10 04 00 00 
	      39 f8 75 cb 8d

so here let's do the above silly C program:

	const char array[]="\x5e\x4c\x81\xeb\x10\x04\x00\x00\xeb\x21..

and running it under gdb gives:

	0x8048500

	Program received signal SIGSEGV, Segmentation fault.
	0x080483f7 in main () at test.c:14
	14              *(int*)0=0;

and now I can just try

	x/20i 0x8048500

and it turns out that already gives a reasonable disassembly. The first 
few instructions are bogus: they're really part of the previous 
instruction, but it looks pretty sane around the actual problem spot, 
which is "array+43" (there are 42 bytes of code before the EIP one, and 20 
bytes after):

	0x8048500 <array>:      pop    %esi
	0x8048501 <array+1>:    dec    %esp
	0x8048502 <array+2>:    sub    $0x410,%ebx
	0x8048508 <array+8>:    jmp    0x804852b <array+43>
	0x804850a <array+10>:   lea    0x408(%ebx),%eax
	0x8048510 <array+16>:   mov    $0x20006,%ecx
	0x8048515 <array+21>:   mov    $0x1d,%edx
	0x804851a <array+26>:   call   0xcf9a1889
	0x804851f <array+31>:   mov    0x410(%ebx),%ebx
	0x8048525 <array+37>:   sub    $0x410,%ebx
	0x804852b <array+43>:   mov    0x410(%ebx),%eax
	0x8048531 <array+49>:   prefetchnta (%eax)
	0x8048534 <array+52>:   nop
	0x8048535 <array+53>:   lea    0x410(%ebx),%eax
	0x804853b <array+59>:   cmp    %edi,%eax
	0x804853d <array+61>:   jne    0x804850a <array+10>
	0x804853f <array+63>:   lea    (%eax),%eax
	.. 

so now we know that the faulting instruction was that

	mov    0x410(%ebx),%eax

and we can also see that this also matches the address that caused the 
oops (ebx=000ffcf0, so 0x410(%ebx) is 00100100, which matches the "unable 
to handle kernel paging request" message).

(Now, people used to kernel oopses will also recognize 00100100 as the 
LIST_POISON1, so this is all about dereferencing the ->next pointer of a 
list entry that has been removed from the list, but that's a whole 
separate level of kernel knowledge).

Anyway, you can now do

	make drivers/input/evdev.s

and see if you can find that kind of code sequence in there. You can use 
the "EIP: evdev_disconnect+0x65/0x9e" thing as a hint: if your compiler 
setup isn't too different, it's likely to be roughly two thirds into that 
evdev_disconnect function (but inlining really can mean that it's 
somewhere else entirely in the source tree!)

The rest left as an exercise for the reader.

		Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-08  3:26       ` Linus Torvalds
@ 2008-01-08  5:59         ` Al Viro
  2008-01-08  7:33           ` Jarek Poplawski
  2008-01-10  4:13           ` Neil Brown
  2008-01-08 16:14         ` Randy Dunlap
  2008-01-08 17:08         ` Andi Kleen
  2 siblings, 2 replies; 26+ messages in thread
From: Al Viro @ 2008-01-08  5:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kevin Winchester, J. Bruce Fields, Arjan van de Ven,
	Linux Kernel Mailing List, Andrew Morton, NetDev, gregkh, neilb

On Mon, Jan 07, 2008 at 07:26:12PM -0800, Linus Torvalds wrote:
 
> I usually just compile a small program like
> 
> 	const char array[]="\xnn\xnn\xnn...";
> 
> 	int main(int argc, char **argv)
> 	{
> 		printf("%p\n", array);
> 		*(int *)0=0;
> 	}
Heh.  I prefer
char main[] = {.....};
for the same thing, with gdb a.out and no running at all.

FWIW, I'm going to go through Arjan's collection and post blow-by-blow
analysis of some of those suckers.  Tonight, probably...

Let's take e.g. http://www.kerneloops.org/raw.php?rawid=2618

RIP: 0010:[<ffffffff803b49a1>]  [<ffffffff803b49a1>] kref_put+0x31/0x80
RSP: 0000:ffff81007ffe5df0  EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000034333545 RCX: ffffffff80606270
RDX: 0000000000000040 RSI: ffffffff803b38b0 RDI: 0000000034333545
RBP: ffff81007ffe5e00 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffff8094c430 R11: 0000000000000000 R12: ffffffff803b38b0
R13: ffff81011ede44d8 R14: ffffffff804d7d50 R15: ffff81011ff210f0
FS:  0000000002024870(0000) GS:ffff81011ff0dd00(0000)
...
Call Trace:
 [<ffffffff803b37e9>] kobject_put+0x19/0x20
 [<ffffffff803b389b>] kobject_del+0x2b/0x40
 [<ffffffff804d7d50>] delayed_delete+0x0/0xb0
 [<ffffffff804d7db9>] delayed_delete+0x69/0xb0
 [<ffffffff80249775>] run_workqueue+0x175/0x210
 [<ffffffff8024a411>] worker_thread+0x71/0xb0
 [<ffffffff8024d9e0>] autoremove_wake_function+0x0/0x40
 [<ffffffff8024a3a0>] worker_thread+0x0/0xb0
 [<ffffffff8024d5fd>] kthread+0x4d/0x80
 [<ffffffff8020c4b8>] child_rip+0xa/0x12
 [<ffffffff8020bbcf>] restore_args+0x0/0x30
 [<ffffffff8024d5b0>] kthread+0x0/0x80
 [<ffffffff8020c4ae>] child_rip+0x0/0x12

Code: f0 ff 0b 0f 94 c0 31 d2 84 c0 74 0b 48 89 df 41 ff d4 ba 01

What do we have here?  It barfs in kref_put() called from kobject_put().
It's -rc6-mm1 and I don't have -mm at hand.  Let's see if we can make
any sense out of it from the mainline - it might be a good idea for the
first pass, unless there are some clear indications to the contrary.

kref_put() is fairly low-level (and deep in call chain, here).  And it's
pretty small:
int kref_put(struct kref *kref, void (*release)(struct kref *kref))
{
        WARN_ON(release == NULL);
        WARN_ON(release == (void (*)(struct kref *))kfree);

        if (atomic_dec_and_test(&kref->refcount)) {
                release(kref);
                return 1;
        }
        return 0;
}

Poking around on the site (I'm not familiar enough with it, so bear with
me) gives a link to posting and an important detail missed in that page:

<1>Unable to handle kernel paging request at 0000000034333545 RIP:
[<ffffffff803b49a1>] kref_put+0x31/0x80

Now, that's interesting - we barf on dereferencing a pointer that (a) has
upper 32 bits zero and (b) doesn't have a bit 7 set in any byte.  Smells
like ASCII data misinterpreted as a pointer.  Conversion to ASCII gives
"E534", which sure as hell does look like a fragment of some string.

OK, so where does that happen?  In this case we have only one candidate,
really - atomic_dec_and_test(&kref->refcount).  Before it we only compare
argument with constants, after it we pass argument to another function.

Now, looking at the registers (see above) we notice that this address had
come from rbx.  Let's try objdump -d lib/kref.o and see what we've got there
for kref_put():

  4a:   55                      push   %rbp
  4b:   48 85 f6                test   %rsi,%rsi
  4e:   48 c7 c1 00 00 00 00    mov    $0x0,%rcx
  55:   ba 36 00 00 00          mov    $0x36,%edx
  5a:   48 89 e5                mov    %rsp,%rbp
  5d:   41 54                   push   %r12
  5f:   49 89 f4                mov    %rsi,%r12
  62:   53                      push   %rbx
  63:   48 89 fb                mov    %rdi,%rbx
  66:   74 15                   je     7d <kref_put+0x33>
  68:   48 81 fe 00 00 00 00    cmp    $0x0,%rsi
  6f:   75 26                   jne    97 <kref_put+0x4d>
  71:   48 c7 c1 00 00 00 00    mov    $0x0,%rcx
  78:   ba 37 00 00 00          mov    $0x37,%edx
  7d:   48 c7 c6 00 00 00 00    mov    $0x0,%rsi
  84:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
  8b:   31 c0                   xor    %eax,%eax
  8d:   e8 00 00 00 00          callq  92 <kref_put+0x48>
  92:   e8 00 00 00 00          callq  97 <kref_put+0x4d>
  97:   f0 ff 0b                lock decl (%rbx)
  9a:   0f 94 c0                sete   %al
  9d:   31 d2                   xor    %edx,%edx
  9f:   84 c0                   test   %al,%al
  a1:   74 0b                   je     ae <kref_put+0x64>
  a3:   48 89 df                mov    %rbx,%rdi
  a6:   41 ff d4                callq  *%r12
  a9:   ba 01 00 00 00          mov    $0x1,%edx
  ae:   5b                      pop    %rbx
  af:   41 5c                   pop    %r12
  b1:   c9                      leaveq 
  b2:   89 d0                   mov    %edx,%eax
  b4:   c3                      retq   

It's not necessary identical to what that kernel had; still, not bad for
a starting point.  We are even lucky enough to find 'f0 ff' immediately
in there:
  97:   f0 ff 0b                lock decl (%rbx)
Next instructions also match - actually, better than I expected.  So.
We even have register allocation matching the reported kernel and it
all makes sense - this is the instruction where it had puked, the address
had, indeed, come from rbx and it's locked decrement of *rbx followed
by some testing and conditional jump.  Just what one would expect
from atomic_dec_and_test().

IOW, we have &kref->refcount equal to 0x0000000034333545.  What's the
value of kref itself?  We could look into definition of struct kref,
or just notice that release(kref) should be right after that, so
we could see what gets passed to it.
  a3:   48 89 df                mov    %rbx,%rdi
  a6:   41 ff d4                callq  *%r12
is clearly it.  So what we are passing is rbx itself, so that offset got to
be zero.

All right, so we have kref_put() getting 0x0000000034333545 instead of a
pointer in the first argument.  It's called from kobject_put() and unless
-mm has changes in there, there's no need to guess where in kobject_put()
that had been:
void kobject_put(struct kobject * kobj)
{
        if (kobj)
                kref_put(&kobj->kref, kobject_release);
}

Aha.  Now, we need to work back from &kobj->kref to kobj.  Again, assuming
that -mm doesn't change struct kobject in a way that would affect the
offset, we have
struct kobject {
        const char              * k_name;
        struct kref             kref;
        struct list_head        entry;
...
Looks like it's not too likely, unless somebody had been deliberately
rearranging fields (to fit cachelines better, etc.).  We'll need to
verify that, of course, but for now it's a good starting assumption.
Very well, we have one pointer in front of it.  It's an amd64, so
it's an 8 byte field and no alignment padding is to be expected.
IOW, kobj is 0x0000000034333545 - 8, i.e. 0x000000003433353D.  What's _that_
in ASCII?  "=534".  OK, that makes even more sense for a part of some
string...

Let's check; time to take a look at that patch, after all
 struct kobject {
-       const char              * k_name;
+       const char              *name;
        struct kref             kref;

OK, not a problem.  While we are at it, kobject_put() is unchanged by
the patch.  Now, back to trace.  kobject_put() is called from kobject_del().
Now, that one does differ from mainline:

void kobject_del(struct kobject * kobj)
{
	if (!kobj)
		return;

	sysfs_remove_dir(kobj);
	kobj->state_in_sysfs = 0;
	kobj_kset_leave(kobj);
	kobject_put(kobj->parent);
	kobj->parent = NULL;
}

Humm...  So we have kobj->parent containing crap.  What about the caller?
It's from drivers/md/md.c:
static void delayed_delete(struct work_struct *ws)
{
        mdk_rdev_t *rdev = container_of(ws, mdk_rdev_t, del_work);
        kobject_del(&rdev->kobj);
}
and it's only used in
static void unbind_rdev_from_array(mdk_rdev_t * rdev)
{
        char b[BDEVNAME_SIZE];
        if (!rdev->mddev) {
                MD_BUG();
                return;
        }
        bd_release_from_disk(rdev->bdev, rdev->mddev->gendisk);
        list_del_init(&rdev->same_set);
        printk(KERN_INFO "md: unbind<%s>\n", bdevname(rdev->bdev,b));
        rdev->mddev = NULL;
        sysfs_remove_link(&rdev->kobj, "block");

        /* We need to delay this, otherwise we can deadlock when
         * writing to 'remove' to "dev/state"
         */
        INIT_WORK(&rdev->del_work, delayed_delete);
        schedule_work(&rdev->del_work);
}

Well, that takes care of the rest of trace - we have used INIT_WORK to set
rdev->del_work up, scheduled it for execution and eventually the callback
had been called (asynchronously to us).

So what do we have so far?
	unbind_rdev_from_array(rdev);
had been called and rdev->kobj.parent turned to contain a crap value
(0x000000003433353D) instead of a pointer.  That's about all we can
get out of trace.

Now, let's see what could have triggered it.  Curious... Looking through
diff shows an interesting bit:
-       rdev->kobj.parent = &mddev->kobj;
-       if ((err = kobject_add(&rdev->kobj)))
+       if ((err = kobject_add(&rdev->kobj, &mddev->kobj, "dev-%s", b)))
                goto fail;
in bind_rdev_to_array().  At the first sight the changes in kobject_add() 
seem to match that.  And nothing else in md.c seems to be setting it to
anything non-NULL.  Very well, so it's one of the following:

	* unbind_rdev_from_array() called on rdev that didn't pass through
bind_rdev_to_array().
	* unbind_rdev_from_array() called on rdev that bailed out from
bind_rdev_to_array() before that point.
	* mddev value in the above is crap.  That's bloody unlikely, BTW -
kobject_add() would increment the refcount of rdev->kobj.parent (or we would
be in far more trouble, since it would not match kobject_del() and _that_
would hurt a lot more than just md.c).  So &mddev->kobj would better be
something saner when it went through that point.
	* *rdev got corrupted in between.

Actually, looking at the callers of unbind_rdev_from_array()...  We always
follow it with export_rdev().  Which does (presumably) final kobject_put()
on &rdev->kobj, freeing rdev itself.

What guarantees that it doesn't happen before we get to callback?  AFAICS,
nothing whatsoever...

And if it does happen, we'll get rdev happily freed (by rdev_free(), as
->release() of &rdev->kobj) by the time we get to delayed_delete().  Which
explains what's going on just fine.

BTW, -mm changes in kobject.c explain WTF it doesn't trigger in mainline -
there we managed to get away with that since kobject_add() bumped refcount
of kobject by one and kobject_del() decremented it.  That masked the bug.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-08  5:59         ` Al Viro
@ 2008-01-08  7:33           ` Jarek Poplawski
  2008-01-10  4:13           ` Neil Brown
  1 sibling, 0 replies; 26+ messages in thread
From: Jarek Poplawski @ 2008-01-08  7:33 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Al Viro, Linus Torvalds, Kevin Winchester, J. Bruce Fields,
	Linux Kernel Mailing List, Andrew Morton, NetDev, gregkh, neilb

On 08-01-2008 06:59, Al Viro wrote:
> On Mon, Jan 07, 2008 at 07:26:12PM -0800, Linus Torvalds wrote:
>  
>> I usually just compile a small program like
>>
>> 	const char array[]="\xnn\xnn\xnn...";
>>
>> 	int main(int argc, char **argv)
>> 	{
>> 		printf("%p\n", array);
>> 		*(int *)0=0;
>> 	}
> Heh.  I prefer
> char main[] = {.....};
> for the same thing, with gdb a.out and no running at all.
...

IMHO, it would be really wasteful if Arian havn't published these
and maybe a few more such advices and links on this site, not
necessarily waiting for html-ized or howto-ized versions!

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-08  3:26       ` Linus Torvalds
  2008-01-08  5:59         ` Al Viro
@ 2008-01-08 16:14         ` Randy Dunlap
  2008-01-08 17:42           ` Arjan van de Ven
  2008-01-08 17:08         ` Andi Kleen
  2 siblings, 1 reply; 26+ messages in thread
From: Randy Dunlap @ 2008-01-08 16:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kevin Winchester, J. Bruce Fields, Al Viro, Arjan van de Ven,
	Linux Kernel Mailing List, Andrew Morton, NetDev

On Mon, 7 Jan 2008 19:26:12 -0800 (PST) Linus Torvalds wrote:

> On Mon, 7 Jan 2008, Kevin Winchester wrote:
> 
> > J. Bruce Fields wrote:
> > > 
> > > Is there any good basic documentation on this to point people at?
> > 
> > I would second this question.  I see people "decode" oops on lkml often 
> > enough, but I've never been entirely sure how its done.  Is it somewhere 
> > in Documentation?
> 
> It's actually not necessarily at all that trivial, unless you have a deep 
> understanding of the code generated for the architecture in question (and 
> even then, some oopses take more time to figure out than others, thanks 
> to inlining and tailcalls etc).
> 
> If the oops happened with a kernel you generated yourself, it's usually 
> rather easy. Especially if you said "y" to the "generate debugging info" 
> question at configuration time. Because, in that case, you really just do 
> a simple
> 
> 	gdb vmlinux
> 
> and then you can do (for example) something like setting a breakpoint at 
> the EIP that was reported for the oops, and it will tell you what line it 
> came from.
> 
> However, if you don't have the exact binary - which is the common case for 
> random oopses reported on lkml - you will generally have to disassemble 
> the hex sequence given in the oops (the "Code:" line), and try to match it 
> up against the source code to try to figure out what is going on.
> 
> Even just the disassembly is not entirely trivial, since the oops will 
> give you the eip that it happened at, but you often want to also 
> disassemble *backwards* in order to get more of a context (the "Code:" 
> line will mark the particular EIP that starts the oopsing instruction by 
> enclosing it in <xx>, but with non-constant instruction lengths, you need 
> to use a bit of trial-and-error to figure it out.
> 
> I usually just compile a small program like
> 
> 	const char array[]="\xnn\xnn\xnn...";
> 
> 	int main(int argc, char **argv)
> 	{
> 		printf("%p\n", array);
> 		*(int *)0=0;
> 	}
> 
> and run it under gdb, and then when it gets the SIGSEGV (due to the 
> obvious NULL pointer dereference), I can just ask gdb to disassemble 
> around the array that contains the code[] stuff. Try a few offsets, to see 
> when the disassembly makes sense (and gives the reported EIP as the 
> beginning of one of the disassembled instructions).
> 
> (You can do it other and smarter ways too, I'm not claiming that's a 
> particularly good way to do it, and the old "ksymoops" program used to do 
> a pretty good job of this, but I'm used to that particular idiotic way 
> myself, since it's how I've basically always done it)

One other way to do it (at least for x86-32/64) is to use
$kerneltree/scripts/decodecode.  It may work on other $arches also,
but I haven't tested it on others.

---
~Randy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-08  3:26       ` Linus Torvalds
  2008-01-08  5:59         ` Al Viro
  2008-01-08 16:14         ` Randy Dunlap
@ 2008-01-08 17:08         ` Andi Kleen
  2 siblings, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2008-01-08 17:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kevin Winchester, J. Bruce Fields, Al Viro, Arjan van de Ven,
	Linux Kernel Mailing List, Andrew Morton, NetDev

Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> I usually just compile a small program like

Just use scripts/decodecode and cat the Code line into that.

> particularly good way to do it, and the old "ksymoops" program used to do 
> a pretty good job of this, but I'm used to that particular idiotic way 
> myself, since it's how I've basically always done it)
>
> After that, you still need to try to match up the assembly code with the 
> source code and figure out what variables the register contents actually 
> are all about. You can often try to do a
>
> 	make the/affected/file.s


IMHO better is  

make the/file/xyz.lst        

which gives you a listing with binary data in there which can be
grepped for.

But you should install a very recent binutils because older objdump -S
couldn't deal with unit-at-a-time compilers.

-Andi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-08 16:14         ` Randy Dunlap
@ 2008-01-08 17:42           ` Arjan van de Ven
  2008-01-08 18:08             ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Arjan van de Ven @ 2008-01-08 17:42 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Linus Torvalds, Kevin Winchester, J. Bruce Fields, Al Viro,
	Linux Kernel Mailing List, Andrew Morton, NetDev

Randy Dunlap wrote:
>>
>> (You can do it other and smarter ways too, I'm not claiming that's a 
>> particularly good way to do it, and the old "ksymoops" program used to do 
>> a pretty good job of this, but I'm used to that particular idiotic way 
>> myself, since it's how I've basically always done it)
> 
> One other way to do it (at least for x86-32/64) is to use
> $kerneltree/scripts/decodecode.  It may work on other $arches also,
> but I haven't tested it on others.

I've made life easier for those using the www.kerneloops.org website;
at least for x86 oopses the website now does this for you and shows
the decoded Code: line in the raw oops data:

http://www.kerneloops.org/raw.php?rawid=2716

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-08 17:42           ` Arjan van de Ven
@ 2008-01-08 18:08             ` Linus Torvalds
  2008-01-08 18:16               ` Arjan van de Ven
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-01-08 18:08 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Randy Dunlap, Kevin Winchester, J. Bruce Fields, Al Viro,
	Linux Kernel Mailing List, Andrew Morton, NetDev



On Tue, 8 Jan 2008, Arjan van de Ven wrote:
> 
> I've made life easier for those using the www.kerneloops.org website;
> at least for x86 oopses the website now does this for you and shows
> the decoded Code: line in the raw oops data:
> 
> http://www.kerneloops.org/raw.php?rawid=2716

Cool.

One thing I wonder about - could you separate out the bug-ons and warnings 
from the oopses? They really are different issues, and an oops with 
register information etc is very different from a BUG() with line numbers, 
which in turn is very different from a WARN_ON().

Right now, it says

	Oopses reported for kernel 2.6.24-rc7


	7 oopses reported

	hfsplus_releasepage	3
	enqueue_task		1
	lock_acquire		1
	__hfs_brec_find		1
	__ieee80211_rx		1

and in fact three of those five entries are really WARN_ON's. It would be 
nicer if it would look more along the lines of

	Backtraces reported for kernel 2.6.24-rc7


	4 oopses reported

	hfsplus_releasepage     3
	__hfs_brec_find         1


	3 warnings repored

	enqueue_task            1
	lock_acquire            1
	__ieee80211_rx          1

because those things really don't have the same kind of impact at all, and 
tend to be very different to debug (a "BUG_ON()" is perhaps somewhat 
closer to an oops, but a WARN_ON() is definitely in a class of its own).

On that "Code:" side, it seems there is still some problem with oops 
parsing. See for example:

	http://www.kerneloops.org/raw.php?rawid=1521&msgid=http://mid.gmane.org/20071017154655.GA13394@elte.hu

and notice how the Code: never made it into the raw message (and thus 
there is also no instruction disassembly).

			Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-08 18:08             ` Linus Torvalds
@ 2008-01-08 18:16               ` Arjan van de Ven
  2008-01-08 18:27                 ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Arjan van de Ven @ 2008-01-08 18:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Randy Dunlap, Kevin Winchester, J. Bruce Fields, Al Viro,
	Linux Kernel Mailing List, Andrew Morton, NetDev

Linus Torvalds wrote:
> Cool.
> 
> One thing I wonder about - could you separate out the bug-ons and warnings 
> from the oopses? They really are different issues, and an oops with 
> register information etc is very different from a BUG() with line numbers, 
> which in turn is very different from a WARN_ON().


> and in fact three of those five entries are really WARN_ON's. It would be 
> nicer if it would look more along the lines of
> 
> 	Backtraces reported for kernel 2.6.24-rc7
> 
> 
> 	4 oopses reported
> 
> 	hfsplus_releasepage     3
> 	__hfs_brec_find         1
> 
> 
> 	3 warnings repored
> 
> 	enqueue_task            1
> 	lock_acquire            1
> 	__ieee80211_rx          1
> 
> because those things really don't have the same kind of impact at all, and 
> tend to be very different to debug (a "BUG_ON()" is perhaps somewhat 
> closer to an oops, but a WARN_ON() is definitely in a class of its own).

the database has the information so it's just a matter of slightly different php code ;)
Before I do that... do you want the BUG's separate, part of the warnings or part of the oopses?
(I rather make this change once ;)

> 
> On that "Code:" side, it seems there is still some problem with oops 
> parsing. See for example:
> 
> 	http://www.kerneloops.org/raw.php?rawid=1521&msgid=http://mid.gmane.org/20071017154655.GA13394@elte.hu
> 
> and notice how the Code: never made it into the raw message (and thus 
> there is also no instruction disassembly).

ok I'll fix this; I can fix this for all new entries at least, fixing retroactive is going to be
near impossible I suspect.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-08 18:16               ` Arjan van de Ven
@ 2008-01-08 18:27                 ` Linus Torvalds
  2008-01-08 19:05                   ` Arjan van de Ven
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-01-08 18:27 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Randy Dunlap, Kevin Winchester, J. Bruce Fields, Al Viro,
	Linux Kernel Mailing List, Andrew Morton, NetDev



On Tue, 8 Jan 2008, Arjan van de Ven wrote:
> 
> the database has the information so it's just a matter of slightly different
> php code ;)
> Before I do that... do you want the BUG's separate, part of the warnings or
> part of the oopses?
> (I rather make this change once ;)

I'd like them all separate, they tend to be very different and contain 
different information.

Put the warnings last, as the least important. Oopses at the top, since 
they tend to be the ones that are less expected.

> > and notice how the Code: never made it into the raw message (and thus there
> > is also no instruction disassembly).
> 
> ok I'll fix this; I can fix this for all new entries at least, fixing
> retroactive is going to be near impossible I suspect.

Oh well..

		Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-08 18:27                 ` Linus Torvalds
@ 2008-01-08 19:05                   ` Arjan van de Ven
  2008-01-08 19:31                     ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Arjan van de Ven @ 2008-01-08 19:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Randy Dunlap, Kevin Winchester, J. Bruce Fields, Al Viro,
	Linux Kernel Mailing List, Andrew Morton, NetDev

Linus Torvalds wrote:
> 
> On Tue, 8 Jan 2008, Arjan van de Ven wrote:
>> the database has the information so it's just a matter of slightly different
>> php code ;)
>> Before I do that... do you want the BUG's separate, part of the warnings or
>> part of the oopses?
>> (I rather make this change once ;)
> 
> I'd like them all separate, they tend to be very different and contain 
> different information.
> 
> Put the warnings last, as the least important. Oopses at the top, since 
> they tend to be the ones that are less expected.

ok done; I had to fizzle a bit because some things aren't *exactly* a BUG() statement
but I track them anyway (things like the "sleeping in invalid context" check), so I
had to somewhat arbitrarily assign categories for those. I might fine tune these over time
some; if you or someone else sees problems with categorization please let me know


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-08 19:05                   ` Arjan van de Ven
@ 2008-01-08 19:31                     ` Linus Torvalds
  2008-01-08 22:56                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2008-01-08 19:31 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Randy Dunlap, Kevin Winchester, J. Bruce Fields, Al Viro,
	Linux Kernel Mailing List, Andrew Morton, NetDev,
	Rafael J. Wysocki



On Tue, 8 Jan 2008, Arjan van de Ven wrote:
> 
> ok done; I had to fizzle a bit because some things aren't *exactly* a 
> BUG() statement but I track them anyway (things like the "sleeping in 
> invalid context" check), so I had to somewhat arbitrarily assign 
> categories for those. I might fine tune these over time some; if you or 
> someone else sees problems with categorization please let me know

Looking good. I wonder if we could also have some way to cross-ref these 
things with the regression list (notably try to get pointers to them in 
the regression list). 

Right now, the regression list keeps track of the things that are closed, 
but kerneloops.org obviously doesn't see that, so it's not obvious which 
oopses are "uninteresting" since they've been fixed.

			Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-08 19:31                     ` Linus Torvalds
@ 2008-01-08 22:56                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 26+ messages in thread
From: Rafael J. Wysocki @ 2008-01-08 22:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Randy Dunlap, Kevin Winchester,
	J. Bruce Fields, Al Viro, Linux Kernel Mailing List,
	Andrew Morton, NetDev

On Tuesday, 8 of January 2008, Linus Torvalds wrote:
> 
> On Tue, 8 Jan 2008, Arjan van de Ven wrote:
> > 
> > ok done; I had to fizzle a bit because some things aren't *exactly* a 
> > BUG() statement but I track them anyway (things like the "sleeping in 
> > invalid context" check), so I had to somewhat arbitrarily assign 
> > categories for those. I might fine tune these over time some; if you or 
> > someone else sees problems with categorization please let me know
> 
> Looking good. I wonder if we could also have some way to cross-ref these 
> things with the regression list (notably try to get pointers to them in 
> the regression list). 

I'm thinking about that, but haven't invented any automated solution yet.
I only can manually add references to kerneloops.org, for now.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-05 21:06 Top 10 kernel oopses for the week ending January 5th, 2008 Arjan van de Ven
                   ` (2 preceding siblings ...)
  2008-01-06  3:30 ` Andi Kleen
@ 2008-01-09 14:12 ` Johannes Berg
  2008-01-09 15:28   ` Arjan van de Ven
  3 siblings, 1 reply; 26+ messages in thread
From: Johannes Berg @ 2008-01-09 14:12 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, NetDev

[-- Attachment #1: Type: text/plain, Size: 670 bytes --]


> Rank 1: __ieee80211_rx
> 	Warning at net/mac80211/rx.c:1672
> 	Reported 6 times (11 total reports)
> 	Same issue that was ranked 2nd last week
> 	Johannes has diagnosed this as a driver bug in the iwlwifi drivers
> 	More info: http://www.kerneloops.org/search.php?search=__ieee80211_rx

Note that because we don't get the module list for WARN_ON, we don't
actually know whether all of these instances are from the iwlwifi
drivers. A few other drivers suffer from the same problem. In one of
these cases, iwlwifi was contained in the stack trace, but in the common
case that isn't happening because packet processing is delayed to a
tasklet.

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-09 14:12 ` Johannes Berg
@ 2008-01-09 15:28   ` Arjan van de Ven
  0 siblings, 0 replies; 26+ messages in thread
From: Arjan van de Ven @ 2008-01-09 15:28 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, NetDev

Johannes Berg wrote:
>> Rank 1: __ieee80211_rx
>> 	Warning at net/mac80211/rx.c:1672
>> 	Reported 6 times (11 total reports)
>> 	Same issue that was ranked 2nd last week
>> 	Johannes has diagnosed this as a driver bug in the iwlwifi drivers
>> 	More info: http://www.kerneloops.org/search.php?search=__ieee80211_rx
> 
> Note that because we don't get the module list for WARN_ON, we don't
> actually know whether all of these instances are from the iwlwifi
> drivers. A few other drivers suffer from the same problem. In one of
> these cases, iwlwifi was contained in the stack trace, but in the common
> case that isn't happening because packet processing is delayed to a
> tasklet.
> 

and fwiw a patch to get this added to WARN_ON was posted by my last week to fix this;
once this goes into 2.6.25-rc this annoyance/hinderance in debugging will be fixed.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-08  5:59         ` Al Viro
  2008-01-08  7:33           ` Jarek Poplawski
@ 2008-01-10  4:13           ` Neil Brown
  2008-01-10  5:53               ` Al Viro
  1 sibling, 1 reply; 26+ messages in thread
From: Neil Brown @ 2008-01-10  4:13 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Kevin Winchester, J. Bruce Fields,
	Arjan van de Ven, Linux Kernel Mailing List, Andrew Morton,
	NetDev, gregkh

On Tuesday January 8, viro@ZenIV.linux.org.uk wrote:
> 
> FWIW, I'm going to go through Arjan's collection and post blow-by-blow
> analysis of some of those suckers.  Tonight, probably...
> 
> Let's take e.g. http://www.kerneloops.org/raw.php?rawid=2618

Thanks for that analysis.
...
> 
> Humm...  So we have kobj->parent containing crap.  What about the caller?
> It's from drivers/md/md.c:
> static void delayed_delete(struct work_struct *ws)

This is a good argument for sticking "md_" at the from of all my
function names, even if they are static.  I'm fairly sure I looked at
that trace:

> Call Trace:
>  [<ffffffff803b37e9>] kobject_put+0x19/0x20
>  [<ffffffff803b389b>] kobject_del+0x2b/0x40
>  [<ffffffff804d7d50>] delayed_delete+0x0/0xb0
>  [<ffffffff804d7db9>] delayed_delete+0x69/0xb0
>  [<ffffffff80249775>] run_workqueue+0x175/0x210
>  [<ffffffff8024a411>] worker_thread+0x71/0xb0
>  [<ffffffff8024d9e0>] autoremove_wake_function+0x0/0x40
>  [<ffffffff8024a3a0>] worker_thread+0x0/0xb0
>  [<ffffffff8024d5fd>] kthread+0x4d/0x80
>  [<ffffffff8020c4b8>] child_rip+0xa/0x12
>  [<ffffffff8020bbcf>] restore_args+0x0/0x30
>  [<ffffffff8024d5b0>] kthread+0x0/0x80
>  [<ffffffff8020c4ae>] child_rip+0x0/0x12

but as it doesn't mention 'md' or 'nfs' I moved on.  My bad.

> 
> What guarantees that it doesn't happen before we get to callback?  AFAICS,
> nothing whatsoever...

Yes, that's bad isn't it :-)

I think I should be using sysfs_schedule_callback here.  That makes the 
required 'get' and 'put' calls.... but it can fail with -ENOMEM.  I
wonder what I do if -ENOMEM???  Maybe I'll just continue to roll my
one :-( 

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-10  4:13           ` Neil Brown
@ 2008-01-10  5:53               ` Al Viro
  0 siblings, 0 replies; 26+ messages in thread
From: Al Viro @ 2008-01-10  5:53 UTC (permalink / raw)
  To: Neil Brown
  Cc: Linus Torvalds, Kevin Winchester, J. Bruce Fields,
	Arjan van de Ven, Linux Kernel Mailing List, Andrew Morton,
	NetDev, gregkh

On Thu, Jan 10, 2008 at 03:13:48PM +1100, Neil Brown wrote:
> > What guarantees that it doesn't happen before we get to callback?  AFAICS,
> > nothing whatsoever...
> 
> Yes, that's bad isn't it :-)
> 
> I think I should be using sysfs_schedule_callback here.  That makes the 
> required 'get' and 'put' calls.... but it can fail with -ENOMEM.  I
> wonder what I do if -ENOMEM???  Maybe I'll just continue to roll my
> one :-( 

How about this instead (completely untested)

	* split failure exits
	* switch to kick_rdev_from_array()
	* fold unbind_rdev_from_array() into it (no other callers anymore)
	* take export_rdev() into failure case in bind_rdev_to_array()
	* in kick_rdev_from_array() do what export_rdev() does sans
kobject_put() and do that before schedule_work().  Take kobject_put() into
delayed_delete().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/drivers/md/md.c b/drivers/md/md.c
index cef9ebd..116cc5a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1344,6 +1344,39 @@ static int match_mddev_units(mddev_t *mddev1, mddev_t *mddev2)
 
 static LIST_HEAD(pending_raid_disks);
 
+static void unlock_rdev(mdk_rdev_t *rdev)
+{
+	struct block_device *bdev = rdev->bdev;
+	rdev->bdev = NULL;
+	if (!bdev)
+		MD_BUG();
+	bd_release(bdev);
+	blkdev_put(bdev);
+}
+
+void md_autodetect_dev(dev_t dev);
+
+static void __export_rdev(mdk_rdev_t * rdev)
+{
+	char b[BDEVNAME_SIZE];
+	printk(KERN_INFO "md: export_rdev(%s)\n",
+		bdevname(rdev->bdev,b));
+	if (rdev->mddev)
+		MD_BUG();
+	free_disk_sb(rdev);
+	list_del_init(&rdev->same_set);
+#ifndef MODULE
+	md_autodetect_dev(rdev->bdev->bd_dev);
+#endif
+	unlock_rdev(rdev);
+}
+
+static void export_rdev(mdk_rdev_t * rdev)
+{
+	__export_rdev(rdev);
+	kobject_put(&rdev->kobj);
+}
+
 static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 {
 	char b[BDEVNAME_SIZE];
@@ -1353,7 +1386,8 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 
 	if (rdev->mddev) {
 		MD_BUG();
-		return -EINVAL;
+		err = -EINVAL;
+		goto out;
 	}
 	/* make sure rdev->size exceeds mddev->size */
 	if (rdev->size && (mddev->size == 0 || rdev->size < mddev->size)) {
@@ -1362,8 +1396,10 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 			 * If mddev->level <= 0, then we don't care
 			 * about aligning sizes (e.g. linear)
 			 */
-			if (mddev->level > 0)
-				return -ENOSPC;
+			if (mddev->level > 0) {
+				err = -ENOSPC;
+				goto out;
+			}
 		} else
 			mddev->size = rdev->size;
 	}
@@ -1379,12 +1415,16 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 			choice++;
 		rdev->desc_nr = choice;
 	} else {
-		if (find_rdev_nr(mddev, rdev->desc_nr))
-			return -EBUSY;
+		if (find_rdev_nr(mddev, rdev->desc_nr)) {
+			err = -EBUSY;
+			goto out;
+		}
 	}
 	bdevname(rdev->bdev,b);
-	if (kobject_set_name(&rdev->kobj, "dev-%s", b) < 0)
-		return -ENOMEM;
+	if (kobject_set_name(&rdev->kobj, "dev-%s", b) < 0) {
+		err = -ENOMEM;
+		goto out;
+	}
 	while ( (s=strchr(rdev->kobj.k_name, '/')) != NULL)
 		*s = '!';
 			
@@ -1407,9 +1447,11 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 	bd_claim_by_disk(rdev->bdev, rdev, mddev->gendisk);
 	return 0;
 
- fail:
+fail:
 	printk(KERN_WARNING "md: failed to register dev-%s for %s\n",
 	       b, mdname(mddev));
+out:
+	export_rdev(rdev);
 	return err;
 }
 
@@ -1417,19 +1459,22 @@ static void delayed_delete(struct work_struct *ws)
 {
 	mdk_rdev_t *rdev = container_of(ws, mdk_rdev_t, del_work);
 	kobject_del(&rdev->kobj);
+	kobject_put(&rdev->kobj);
 }
 
-static void unbind_rdev_from_array(mdk_rdev_t * rdev)
+static void kick_rdev_from_array(mdk_rdev_t * rdev)
 {
 	char b[BDEVNAME_SIZE];
 	if (!rdev->mddev) {
 		MD_BUG();
+		export_rdev(rdev);
 		return;
 	}
 	bd_release_from_disk(rdev->bdev, rdev->mddev->gendisk);
 	list_del_init(&rdev->same_set);
 	printk(KERN_INFO "md: unbind<%s>\n", bdevname(rdev->bdev,b));
 	rdev->mddev = NULL;
+	__export_rdev(rdev);
 	sysfs_remove_link(&rdev->kobj, "block");
 
 	/* We need to delay this, otherwise we can deadlock when
@@ -1467,40 +1512,6 @@ static int lock_rdev(mdk_rdev_t *rdev, dev_t dev)
 	return err;
 }
 
-static void unlock_rdev(mdk_rdev_t *rdev)
-{
-	struct block_device *bdev = rdev->bdev;
-	rdev->bdev = NULL;
-	if (!bdev)
-		MD_BUG();
-	bd_release(bdev);
-	blkdev_put(bdev);
-}
-
-void md_autodetect_dev(dev_t dev);
-
-static void export_rdev(mdk_rdev_t * rdev)
-{
-	char b[BDEVNAME_SIZE];
-	printk(KERN_INFO "md: export_rdev(%s)\n",
-		bdevname(rdev->bdev,b));
-	if (rdev->mddev)
-		MD_BUG();
-	free_disk_sb(rdev);
-	list_del_init(&rdev->same_set);
-#ifndef MODULE
-	md_autodetect_dev(rdev->bdev->bd_dev);
-#endif
-	unlock_rdev(rdev);
-	kobject_put(&rdev->kobj);
-}
-
-static void kick_rdev_from_array(mdk_rdev_t * rdev)
-{
-	unbind_rdev_from_array(rdev);
-	export_rdev(rdev);
-}
-
 static void export_array(mddev_t *mddev)
 {
 	struct list_head *tmp;
@@ -2576,8 +2587,10 @@ new_dev_store(mddev_t *mddev, const char *buf, size_t len)
 						       mdk_rdev_t, same_set);
 			err = super_types[mddev->major_version]
 				.load_super(rdev, rdev0, mddev->minor_version);
-			if (err < 0)
-				goto out;
+			if (err < 0) {
+				export_rdev(rdev);
+				return err;
+			}
 		}
 	} else
 		rdev = md_import_device(dev, -1, -1);
@@ -2585,9 +2598,6 @@ new_dev_store(mddev_t *mddev, const char *buf, size_t len)
 	if (IS_ERR(rdev))
 		return PTR_ERR(rdev);
 	err = bind_rdev_to_array(rdev, mddev);
- out:
-	if (err)
-		export_rdev(rdev);
 	return err ? err : len;
 }
 
@@ -3637,8 +3647,7 @@ static void autorun_devices(int part)
 			printk(KERN_INFO "md: created %s\n", mdname(mddev));
 			ITERATE_RDEV_GENERIC(candidates,rdev,tmp) {
 				list_del_init(&rdev->same_set);
-				if (bind_rdev_to_array(rdev, mddev))
-					export_rdev(rdev);
+				bind_rdev_to_array(rdev, mddev);
 			}
 			autorun_array(mddev);
 			mddev_unlock(mddev);
@@ -3807,7 +3816,6 @@ static int add_new_disk(mddev_t * mddev, mdu_disk_info_t *info)
 		return -EOVERFLOW;
 
 	if (!mddev->raid_disks) {
-		int err;
 		/* expecting a device which has a superblock */
 		rdev = md_import_device(dev, mddev->major_version, mddev->minor_version);
 		if (IS_ERR(rdev)) {
@@ -3830,10 +3838,7 @@ static int add_new_disk(mddev_t * mddev, mdu_disk_info_t *info)
 				return -EINVAL;
 			}
 		}
-		err = bind_rdev_to_array(rdev, mddev);
-		if (err)
-			export_rdev(rdev);
-		return err;
+		return bind_rdev_to_array(rdev, mddev);
 	}
 
 	/*
@@ -3887,10 +3892,8 @@ static int add_new_disk(mddev_t * mddev, mdu_disk_info_t *info)
 				validate_super(mddev, rdev);
 			err = mddev->pers->hot_add_disk(mddev, rdev);
 			if (err)
-				unbind_rdev_from_array(rdev);
+				kick_rdev_from_array(rdev);
 		}
-		if (err)
-			export_rdev(rdev);
 
 		md_update_sb(mddev, 1);
 		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
@@ -3908,7 +3911,6 @@ static int add_new_disk(mddev_t * mddev, mdu_disk_info_t *info)
 	}
 
 	if (!(info->state & (1<<MD_DISK_FAULTY))) {
-		int err;
 		rdev = md_import_device (dev, -1, 0);
 		if (IS_ERR(rdev)) {
 			printk(KERN_WARNING 
@@ -3938,11 +3940,7 @@ static int add_new_disk(mddev_t * mddev, mdu_disk_info_t *info)
 			rdev->sb_offset = calc_dev_sboffset(rdev->bdev);
 		rdev->size = calc_dev_size(rdev, mddev->chunk_size);
 
-		err = bind_rdev_to_array(rdev, mddev);
-		if (err) {
-			export_rdev(rdev);
-			return err;
-		}
+		return bind_rdev_to_array(rdev, mddev);
 	}
 
 	return 0;
@@ -4018,15 +4016,15 @@ static int hot_add_disk(mddev_t * mddev, dev_t dev)
 		printk(KERN_WARNING 
 			"md: can not hot-add faulty %s disk to %s!\n",
 			bdevname(rdev->bdev,b), mdname(mddev));
-		err = -EINVAL;
-		goto abort_export;
+		export_rdev(rdev);
+		return -EINVAL;
 	}
 	clear_bit(In_sync, &rdev->flags);
 	rdev->desc_nr = -1;
 	rdev->saved_raid_disk = -1;
 	err = bind_rdev_to_array(rdev, mddev);
 	if (err)
-		goto abort_export;
+		return err;
 
 	/*
 	 * The rest should better be atomic, we can have disk failures
@@ -4036,8 +4034,8 @@ static int hot_add_disk(mddev_t * mddev, dev_t dev)
 	if (rdev->desc_nr == mddev->max_disks) {
 		printk(KERN_WARNING "%s: can not hot-add to full array!\n",
 			mdname(mddev));
-		err = -EBUSY;
-		goto abort_unbind_export;
+		kick_rdev_from_array(rdev);
+		return -EBUSY;
 	}
 
 	rdev->raid_disk = -1;
@@ -4052,13 +4050,6 @@ static int hot_add_disk(mddev_t * mddev, dev_t dev)
 	md_wakeup_thread(mddev->thread);
 	md_new_event(mddev);
 	return 0;
-
-abort_unbind_export:
-	unbind_rdev_from_array(rdev);
-
-abort_export:
-	export_rdev(rdev);
-	return err;
 }
 
 static int set_bitmap_file(mddev_t *mddev, int fd)

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
@ 2008-01-10  5:53               ` Al Viro
  0 siblings, 0 replies; 26+ messages in thread
From: Al Viro @ 2008-01-10  5:53 UTC (permalink / raw)
  To: Neil Brown
  Cc: Linus Torvalds, Kevin Winchester, J. Bruce Fields,
	Arjan van de Ven, Linux Kernel Mailing List, Andrew Morton,
	NetDev, gregkh

On Thu, Jan 10, 2008 at 03:13:48PM +1100, Neil Brown wrote:
> > What guarantees that it doesn't happen before we get to callback?  AFAICS,
> > nothing whatsoever...
> 
> Yes, that's bad isn't it :-)
> 
> I think I should be using sysfs_schedule_callback here.  That makes the 
> required 'get' and 'put' calls.... but it can fail with -ENOMEM.  I
> wonder what I do if -ENOMEM???  Maybe I'll just continue to roll my
> one :-( 

How about this instead (completely untested)

	* split failure exits
	* switch to kick_rdev_from_array()
	* fold unbind_rdev_from_array() into it (no other callers anymore)
	* take export_rdev() into failure case in bind_rdev_to_array()
	* in kick_rdev_from_array() do what export_rdev() does sans
kobject_put() and do that before schedule_work().  Take kobject_put() into
delayed_delete().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/drivers/md/md.c b/drivers/md/md.c
index cef9ebd..116cc5a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1344,6 +1344,39 @@ static int match_mddev_units(mddev_t *mddev1, mddev_t *mddev2)
 
 static LIST_HEAD(pending_raid_disks);
 
+static void unlock_rdev(mdk_rdev_t *rdev)
+{
+	struct block_device *bdev = rdev->bdev;
+	rdev->bdev = NULL;
+	if (!bdev)
+		MD_BUG();
+	bd_release(bdev);
+	blkdev_put(bdev);
+}
+
+void md_autodetect_dev(dev_t dev);
+
+static void __export_rdev(mdk_rdev_t * rdev)
+{
+	char b[BDEVNAME_SIZE];
+	printk(KERN_INFO "md: export_rdev(%s)\n",
+		bdevname(rdev->bdev,b));
+	if (rdev->mddev)
+		MD_BUG();
+	free_disk_sb(rdev);
+	list_del_init(&rdev->same_set);
+#ifndef MODULE
+	md_autodetect_dev(rdev->bdev->bd_dev);
+#endif
+	unlock_rdev(rdev);
+}
+
+static void export_rdev(mdk_rdev_t * rdev)
+{
+	__export_rdev(rdev);
+	kobject_put(&rdev->kobj);
+}
+
 static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 {
 	char b[BDEVNAME_SIZE];
@@ -1353,7 +1386,8 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 
 	if (rdev->mddev) {
 		MD_BUG();
-		return -EINVAL;
+		err = -EINVAL;
+		goto out;
 	}
 	/* make sure rdev->size exceeds mddev->size */
 	if (rdev->size && (mddev->size == 0 || rdev->size < mddev->size)) {
@@ -1362,8 +1396,10 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 			 * If mddev->level <= 0, then we don't care
 			 * about aligning sizes (e.g. linear)
 			 */
-			if (mddev->level > 0)
-				return -ENOSPC;
+			if (mddev->level > 0) {
+				err = -ENOSPC;
+				goto out;
+			}
 		} else
 			mddev->size = rdev->size;
 	}
@@ -1379,12 +1415,16 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 			choice++;
 		rdev->desc_nr = choice;
 	} else {
-		if (find_rdev_nr(mddev, rdev->desc_nr))
-			return -EBUSY;
+		if (find_rdev_nr(mddev, rdev->desc_nr)) {
+			err = -EBUSY;
+			goto out;
+		}
 	}
 	bdevname(rdev->bdev,b);
-	if (kobject_set_name(&rdev->kobj, "dev-%s", b) < 0)
-		return -ENOMEM;
+	if (kobject_set_name(&rdev->kobj, "dev-%s", b) < 0) {
+		err = -ENOMEM;
+		goto out;
+	}
 	while ( (s=strchr(rdev->kobj.k_name, '/')) != NULL)
 		*s = '!';
 			
@@ -1407,9 +1447,11 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 	bd_claim_by_disk(rdev->bdev, rdev, mddev->gendisk);
 	return 0;
 
- fail:
+fail:
 	printk(KERN_WARNING "md: failed to register dev-%s for %s\n",
 	       b, mdname(mddev));
+out:
+	export_rdev(rdev);
 	return err;
 }
 
@@ -1417,19 +1459,22 @@ static void delayed_delete(struct work_struct *ws)
 {
 	mdk_rdev_t *rdev = container_of(ws, mdk_rdev_t, del_work);
 	kobject_del(&rdev->kobj);
+	kobject_put(&rdev->kobj);
 }
 
-static void unbind_rdev_from_array(mdk_rdev_t * rdev)
+static void kick_rdev_from_array(mdk_rdev_t * rdev)
 {
 	char b[BDEVNAME_SIZE];
 	if (!rdev->mddev) {
 		MD_BUG();
+		export_rdev(rdev);
 		return;
 	}
 	bd_release_from_disk(rdev->bdev, rdev->mddev->gendisk);
 	list_del_init(&rdev->same_set);
 	printk(KERN_INFO "md: unbind<%s>\n", bdevname(rdev->bdev,b));
 	rdev->mddev = NULL;
+	__export_rdev(rdev);
 	sysfs_remove_link(&rdev->kobj, "block");
 
 	/* We need to delay this, otherwise we can deadlock when
@@ -1467,40 +1512,6 @@ static int lock_rdev(mdk_rdev_t *rdev, dev_t dev)
 	return err;
 }
 
-static void unlock_rdev(mdk_rdev_t *rdev)
-{
-	struct block_device *bdev = rdev->bdev;
-	rdev->bdev = NULL;
-	if (!bdev)
-		MD_BUG();
-	bd_release(bdev);
-	blkdev_put(bdev);
-}
-
-void md_autodetect_dev(dev_t dev);
-
-static void export_rdev(mdk_rdev_t * rdev)
-{
-	char b[BDEVNAME_SIZE];
-	printk(KERN_INFO "md: export_rdev(%s)\n",
-		bdevname(rdev->bdev,b));
-	if (rdev->mddev)
-		MD_BUG();
-	free_disk_sb(rdev);
-	list_del_init(&rdev->same_set);
-#ifndef MODULE
-	md_autodetect_dev(rdev->bdev->bd_dev);
-#endif
-	unlock_rdev(rdev);
-	kobject_put(&rdev->kobj);
-}
-
-static void kick_rdev_from_array(mdk_rdev_t * rdev)
-{
-	unbind_rdev_from_array(rdev);
-	export_rdev(rdev);
-}
-
 static void export_array(mddev_t *mddev)
 {
 	struct list_head *tmp;
@@ -2576,8 +2587,10 @@ new_dev_store(mddev_t *mddev, const char *buf, size_t len)
 						       mdk_rdev_t, same_set);
 			err = super_types[mddev->major_version]
 				.load_super(rdev, rdev0, mddev->minor_version);
-			if (err < 0)
-				goto out;
+			if (err < 0) {
+				export_rdev(rdev);
+				return err;
+			}
 		}
 	} else
 		rdev = md_import_device(dev, -1, -1);
@@ -2585,9 +2598,6 @@ new_dev_store(mddev_t *mddev, const char *buf, size_t len)
 	if (IS_ERR(rdev))
 		return PTR_ERR(rdev);
 	err = bind_rdev_to_array(rdev, mddev);
- out:
-	if (err)
-		export_rdev(rdev);
 	return err ? err : len;
 }
 
@@ -3637,8 +3647,7 @@ static void autorun_devices(int part)
 			printk(KERN_INFO "md: created %s\n", mdname(mddev));
 			ITERATE_RDEV_GENERIC(candidates,rdev,tmp) {
 				list_del_init(&rdev->same_set);
-				if (bind_rdev_to_array(rdev, mddev))
-					export_rdev(rdev);
+				bind_rdev_to_array(rdev, mddev);
 			}
 			autorun_array(mddev);
 			mddev_unlock(mddev);
@@ -3807,7 +3816,6 @@ static int add_new_disk(mddev_t * mddev, mdu_disk_info_t *info)
 		return -EOVERFLOW;
 
 	if (!mddev->raid_disks) {
-		int err;
 		/* expecting a device which has a superblock */
 		rdev = md_import_device(dev, mddev->major_version, mddev->minor_version);
 		if (IS_ERR(rdev)) {
@@ -3830,10 +3838,7 @@ static int add_new_disk(mddev_t * mddev, mdu_disk_info_t *info)
 				return -EINVAL;
 			}
 		}
-		err = bind_rdev_to_array(rdev, mddev);
-		if (err)
-			export_rdev(rdev);
-		return err;
+		return bind_rdev_to_array(rdev, mddev);
 	}
 
 	/*
@@ -3887,10 +3892,8 @@ static int add_new_disk(mddev_t * mddev, mdu_disk_info_t *info)
 				validate_super(mddev, rdev);
 			err = mddev->pers->hot_add_disk(mddev, rdev);
 			if (err)
-				unbind_rdev_from_array(rdev);
+				kick_rdev_from_array(rdev);
 		}
-		if (err)
-			export_rdev(rdev);
 
 		md_update_sb(mddev, 1);
 		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
@@ -3908,7 +3911,6 @@ static int add_new_disk(mddev_t * mddev, mdu_disk_info_t *info)
 	}
 
 	if (!(info->state & (1<<MD_DISK_FAULTY))) {
-		int err;
 		rdev = md_import_device (dev, -1, 0);
 		if (IS_ERR(rdev)) {
 			printk(KERN_WARNING 
@@ -3938,11 +3940,7 @@ static int add_new_disk(mddev_t * mddev, mdu_disk_info_t *info)
 			rdev->sb_offset = calc_dev_sboffset(rdev->bdev);
 		rdev->size = calc_dev_size(rdev, mddev->chunk_size);
 
-		err = bind_rdev_to_array(rdev, mddev);
-		if (err) {
-			export_rdev(rdev);
-			return err;
-		}
+		return bind_rdev_to_array(rdev, mddev);
 	}
 
 	return 0;
@@ -4018,15 +4016,15 @@ static int hot_add_disk(mddev_t * mddev, dev_t dev)
 		printk(KERN_WARNING 
 			"md: can not hot-add faulty %s disk to %s!\n",
 			bdevname(rdev->bdev,b), mdname(mddev));
-		err = -EINVAL;
-		goto abort_export;
+		export_rdev(rdev);
+		return -EINVAL;
 	}
 	clear_bit(In_sync, &rdev->flags);
 	rdev->desc_nr = -1;
 	rdev->saved_raid_disk = -1;
 	err = bind_rdev_to_array(rdev, mddev);
 	if (err)
-		goto abort_export;
+		return err;
 
 	/*
 	 * The rest should better be atomic, we can have disk failures
@@ -4036,8 +4034,8 @@ static int hot_add_disk(mddev_t * mddev, dev_t dev)
 	if (rdev->desc_nr == mddev->max_disks) {
 		printk(KERN_WARNING "%s: can not hot-add to full array!\n",
 			mdname(mddev));
-		err = -EBUSY;
-		goto abort_unbind_export;
+		kick_rdev_from_array(rdev);
+		return -EBUSY;
 	}
 
 	rdev->raid_disk = -1;
@@ -4052,13 +4050,6 @@ static int hot_add_disk(mddev_t * mddev, dev_t dev)
 	md_wakeup_thread(mddev->thread);
 	md_new_event(mddev);
 	return 0;
-
-abort_unbind_export:
-	unbind_rdev_from_array(rdev);

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
  2008-01-10  5:53               ` Al Viro
  (?)
@ 2008-01-14  1:36               ` Neil Brown
  -1 siblings, 0 replies; 26+ messages in thread
From: Neil Brown @ 2008-01-14  1:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Kevin Winchester, J. Bruce Fields,
	Arjan van de Ven, Linux Kernel Mailing List, Andrew Morton,
	NetDev, gregkh

On Thursday January 10, viro@ZenIV.linux.org.uk wrote:
> On Thu, Jan 10, 2008 at 03:13:48PM +1100, Neil Brown wrote:
> > > What guarantees that it doesn't happen before we get to callback?  AFAICS,
> > > nothing whatsoever...
> > 
> > Yes, that's bad isn't it :-)
> > 
> > I think I should be using sysfs_schedule_callback here.  That makes the 
> > required 'get' and 'put' calls.... but it can fail with -ENOMEM.  I
> > wonder what I do if -ENOMEM???  Maybe I'll just continue to roll my
> > one :-( 
> 
> How about this instead (completely untested)
> 
> 	* split failure exits
> 	* switch to kick_rdev_from_array()
> 	* fold unbind_rdev_from_array() into it (no other callers anymore)
> 	* take export_rdev() into failure case in bind_rdev_to_array()
> 	* in kick_rdev_from_array() do what export_rdev() does sans
> kobject_put() and do that before schedule_work().  Take kobject_put() into
> delayed_delete().

While there are probably some good ideas in there, I think fixing this
particular bug is much simpler.  Just take a reference to the object
before scheduling the worker, and drop it when the worker has done
its work.

I have a closer look at the idea of no required export_rdev after a
failed bind_rdev_to_array.  On the surface it does seem to make the
code nicer.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2008-01-14  1:36 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-05 21:06 Top 10 kernel oopses for the week ending January 5th, 2008 Arjan van de Ven
2008-01-05 21:26 ` Al Viro
2008-01-05 21:39 ` Al Viro
2008-01-07 17:44   ` J. Bruce Fields
2008-01-08  1:19     ` Kevin Winchester
2008-01-08  3:26       ` Linus Torvalds
2008-01-08  5:59         ` Al Viro
2008-01-08  7:33           ` Jarek Poplawski
2008-01-10  4:13           ` Neil Brown
2008-01-10  5:53             ` Al Viro
2008-01-10  5:53               ` Al Viro
2008-01-14  1:36               ` Neil Brown
2008-01-08 16:14         ` Randy Dunlap
2008-01-08 17:42           ` Arjan van de Ven
2008-01-08 18:08             ` Linus Torvalds
2008-01-08 18:16               ` Arjan van de Ven
2008-01-08 18:27                 ` Linus Torvalds
2008-01-08 19:05                   ` Arjan van de Ven
2008-01-08 19:31                     ` Linus Torvalds
2008-01-08 22:56                       ` Rafael J. Wysocki
2008-01-08 17:08         ` Andi Kleen
2008-01-06  3:30 ` Andi Kleen
2008-01-06  3:31   ` Arjan van de Ven
2008-01-06  3:50     ` Andi Kleen
2008-01-09 14:12 ` Johannes Berg
2008-01-09 15:28   ` Arjan van de Ven

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.