From: Ayaz Abdulla <aabdulla@nvidia.com>
To: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Jeff Garzik <jeff@garzik.org>, Adrian Bunk <bunk@stusta.de>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: Linux 2.6.21-rc5
Date: Mon, 26 Mar 2007 03:17:22 -0500
Message-ID: <46078192.6020307@nvidia.com>
In-Reply-To: <20070326083146.GA11666@elte.hu>

This issue might be resolved with the patch provided in the following 
bug report: http://bugzilla.kernel.org/show_bug.cgi?id=8058

Please try out the patch in the bug report without your patch and see if 
the issue reproduces.

Ayaz


Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> 
>>There's various fixes here, ranging from some architecture updates 
>>(ia64, ARM, MIPS, SH, Sparc64) to KVM, networking and network drivers.
> 
> 
> here's a new v2.6.20 -> v2.6.21 forcedeth.c regression:
> 
> in the last week or so i've been seeing sporadic under-load forcedeth.c 
> crashes (see the full oops further below):
> 
>  eth1: too many iterations (6) in nv_nic_irq.
>  Unable to handle kernel NULL pointer dereference at 0000000000000088 RIP: 
>  [<ffffffff80404587>] nv_tx_done+0xf4/0x1cf
> 
> this is line 1906 of drivers/net/forcedeth.c:
> 
>     np->stats.tx_bytes += np->get_tx_ctx->skb->len;
> 
> struct sk_buff's len field is at offset 0x88, so np->get_tx_ctx->skb is 
> NULL. That is an 'impossible' scenario for tx descriptors here - the tx 
> ring descriptors are always set up with a valid skb (and a valid dma 
> address), and their completion is serialized via np->lock.
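
The reasoning above (fault address equals the offset of the field, therefore the base pointer was NULL) can be sketched in user space. This is only an illustration: `struct fake_skb`, `len_access_address()` and `base_from_fault()` are hypothetical names, and the padding is chosen so that `len` lands at 0x88 to match the oops; the real struct sk_buff layout depends on kernel version and config.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sk_buff: only the position of `len`
 * matters here. The padding size is an assumption made to match the
 * fault address 0x88 reported in the oops. */
struct fake_skb {
	unsigned char earlier_fields[0x88]; /* placeholder for preceding members */
	unsigned int len;
};

/* Address a load of `skb->len` would touch for a given pointer value. */
static uintptr_t len_access_address(uintptr_t skb_ptr)
{
	return skb_ptr + offsetof(struct fake_skb, len);
}

/* Work backwards from a fault address to the pointer that was dereferenced. */
static uintptr_t base_from_fault(uintptr_t fault_addr)
{
	assert(fault_addr >= offsetof(struct fake_skb, len));
	return fault_addr - offsetof(struct fake_skb, len);
}
```

With a NULL pointer, `len_access_address(0)` gives 0x88, the CR2 value in the oops; conversely `base_from_fault(0x88)` gives 0, which is the RAX value at the time of the fault.
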
> 
> these crashes are almost instant on the .21-rc5-rt kernel, but extremely 
> sporadic on the upstream kernel and needed very high networking loads to 
> trigger. Today i found a good way to trigger it almost instantly on 
> upstream kernels too: apply the debug patch attached further below and 
> do:
> 
> 	echo 100 > /proc/sys/kernel/panic
> 
> that will inject 100 artificial 'too many iterations' failures and 
> provoke a TX timeout, and that TX timeout then crashes. (i've used a 
> dual-core Athlon64 system in this test)
> 
> my first quick guess was to extend np->priv locking to the whole of 
> nv_start_xmit/nv_start_xmit_optimized - while that appeared to make the 
> crash a bit less likely, it did not prevent it. So there must be some 
> other, more fundamental problem left as well. At first glance the SMP 
> locking looks OK, so maybe the ring indices are messed up somehow and we 
> got into a 'ring head bites the tail' scenario?
> 
> i can provide more info if needed.
> 
> 	Ingo
> 
> -------------->
> eth1: too many iterations (6) in nv_nic_irq.
> Unable to handle kernel NULL pointer dereference at 0000000000000088 RIP: 
>  [<ffffffff80404587>] nv_tx_done+0xf4/0x1cf
> PGD 34d03067 PUD 34d02067 PMD 0 
> Oops: 0000 [1] PREEMPT SMP 
> CPU 1 
> Modules linked in:
> Pid: 0, comm: swapper Not tainted 2.6.21-rc5 #8
> RIP: 0010:[<ffffffff80404587>]  [<ffffffff80404587>] nv_tx_done+0xf4/0x1cf
> RSP: 0018:ffff81003ff6be40  EFLAGS: 00010206
> RAX: 0000000000000000 RBX: ffff810002e26700 RCX: 0000000000000001
> RDX: 0000000000000042 RSI: 000000003ef00cbe RDI: ffff81003fbeb070
> RBP: ffff81003ff6be60 R08: ffff810002e26a00 R09: 0000000000000003
> R10: ffff81003ff4e100 R11: ffff810001e283f8 R12: 000000003ef00cbe
> R13: ffff810002e26000 R14: ffff810002e28fc0 R15: 0000000000000000
> FS:  00002b6cb57f1db0(0000) GS:ffff81003ff4ad40(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000088 CR3: 0000000034c87000 CR4: 00000000000006e0
> Process swapper (pid: 0, threadinfo ffff81003ff64000, task ffff81003ff4e100)
> Stack:  ffff810002e26700 0000000000000032 ffffc2000001a000 ffff810002e26000
>  ffff81003ff6bea0 ffffffff80406dae ffff810002e26700 ffff810002e26700
>  ffff810002e26000 00000000000000ff ffffc2000001a000 ffffffff80749080
> Call Trace:
>  <IRQ>  [<ffffffff80406dae>] nv_nic_irq+0x76/0x261
>  [<ffffffff8040961e>] nv_do_nic_poll+0x200/0x284
>  [<ffffffff8040941e>] nv_do_nic_poll+0x0/0x284
>  [<ffffffff80241995>] run_timer_softirq+0x167/0x1dd
>  [<ffffffff8023de45>] __do_softirq+0x5b/0xc9
>  [<ffffffff8020af0c>] call_softirq+0x1c/0x28
>  [<ffffffff8020c2b4>] do_softirq+0x31/0x84
>  [<ffffffff8023db16>] irq_exit+0x3f/0x50
>  [<ffffffff802190c2>] smp_apic_timer_interrupt+0x49/0x5b
>  [<ffffffff802087fb>] default_idle+0x0/0x44
>  [<ffffffff8020a9b6>] apic_timer_interrupt+0x66/0x70
>  <EOI>  [<ffffffff8020882a>] default_idle+0x2f/0x44
>  [<ffffffff8020804c>] enter_idle+0x22/0x24
>  [<ffffffff802088d0>] cpu_idle+0x91/0xd4
>  [<ffffffff80218572>] start_secondary+0x2e3/0x2f5
> 
> ---
>  drivers/net/forcedeth.c |   20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> Index: linux/drivers/net/forcedeth.c
> ===================================================================
> --- linux.orig/drivers/net/forcedeth.c
> +++ linux/drivers/net/forcedeth.c
> @@ -2908,6 +2908,10 @@ static irqreturn_t nv_nic_irq(int foo, v
>  			spin_unlock(&np->lock);
>  			break;
>  		}
> +		if (panic_timeout > 0) {
> +			panic_timeout--;
> +			i = max_interrupt_work+1;
> +		}
>  		if (unlikely(i > max_interrupt_work)) {
>  			spin_lock(&np->lock);
>  			/* disable interrupts on the nic */
> @@ -3026,6 +3030,10 @@ static irqreturn_t nv_nic_irq_optimized(
>  			break;
>  		}
>  
> +		if (panic_timeout > 0) {
> +			panic_timeout--;
> +			i = max_interrupt_work+1;
> +		}
>  		if (unlikely(i > max_interrupt_work)) {
>  			spin_lock(&np->lock);
>  			/* disable interrupts on the nic */
> @@ -3076,6 +3084,10 @@ static irqreturn_t nv_nic_irq_tx(int foo
>  			dprintk(KERN_DEBUG "%s: received irq with events 0x%x. Probably TX fail.\n",
>  						dev->name, events);
>  		}
> +		if (panic_timeout > 0) {
> +			panic_timeout--;
> +			i = max_interrupt_work+1;
> +		}
>  		if (unlikely(i > max_interrupt_work)) {
>  			spin_lock_irqsave(&np->lock, flags);
>  			/* disable interrupts on the nic */
> @@ -3191,6 +3203,10 @@ static irqreturn_t nv_nic_irq_rx(int foo
>  			}
>  		}
>  
> +		if (panic_timeout > 0) {
> +			panic_timeout--;
> +			i = max_interrupt_work+1;
> +		}
>  		if (unlikely(i > max_interrupt_work)) {
>  			spin_lock_irqsave(&np->lock, flags);
>  			/* disable interrupts on the nic */
> @@ -3264,6 +3280,10 @@ static irqreturn_t nv_nic_irq_other(int 
>  			printk(KERN_DEBUG "%s: received irq with unknown events 0x%x. Please report\n",
>  						dev->name, events);
>  		}
> +		if (panic_timeout > 0) {
> +			panic_timeout--;
> +			i = max_interrupt_work+1;
> +		}
>  		if (unlikely(i > max_interrupt_work)) {
>  			spin_lock_irqsave(&np->lock, flags);
>  			/* disable interrupts on the nic */
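
For readers following along, the fault-injection trick in the quoted debug patch can be modelled in user space roughly as below. This is a sketch under assumptions, not driver code: `irq_loop()`, `inject_budget` and the value 5 for `max_interrupt_work` are made up for illustration, and `inject_budget` plays the role that `panic_timeout` plays in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins: the limit value is an assumption, and
 * inject_budget takes the place of panic_timeout from the debug patch. */
static int max_interrupt_work = 5;
static int inject_budget;

/* Models the shape of the irq handler loop. Returns true if the
 * "too many iterations" path was taken. */
static bool irq_loop(int pending_events)
{
	for (int i = 0; ; i++) {
		if (pending_events == 0)
			break;              /* nothing left to service */
		pending_events--;           /* service one event */

		if (inject_budget > 0) {    /* same shape as the debug hunk */
			inject_budget--;
			i = max_interrupt_work + 1;
		}
		if (i > max_interrupt_work)
			return true;        /* the driver disables nic interrupts here */
	}
	return false;
}
```

Each call consumes at most one unit of the budget, so writing 100 to the counter forces the overflow path on the next 100 passes through the handler, regardless of real load. Without injection the path is only reached when more than `max_interrupt_work` events are serviced in one pass.
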
