linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* ucc_geth broken in 2.6.32 by 864fdf884e82bacbe8ca5e93bd43393a61d2e2b4
@ 2009-12-23 17:40 Lennart Sorensen
  2009-12-23 18:04 ` Anton Vorontsov
  0 siblings, 1 reply; 8+ messages in thread
From: Lennart Sorensen @ 2009-12-23 17:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: linuxppc-dev, netdev, leoli, Len Sorensen, Anton Vorontsov

We use the ucc_geth for 6 ports (4 100Mbit and 2 Gbit ports) on an
mpc8360e.  Up to 2.6.31 this worked fine.  2.6.32 on the other hand
crashes very quickly after boot.

I managed to see the same crash when I was selectively trying to add newer
ucc_geth patches to the 2.6.31 kernel a couple of months ago, and the same
patch that caused a crash then seems suspect.  If I revert the patch the
system runs completely stable.  Amusingly, the excact error message the
patch claims to fix is in fact the error it causes to happen in my case.

So preferably 864fdf884e82bacbe8ca5e93bd43393a61d2e2b4 could be reverted
unless someone can fix it.  I can't even make sense of why it is supposed
to improve anything, it certainly seems like a very unsafe change
to make.  Removing locking and disabling of interrupts before poking at
phy settings and such doesn't seem like a minor change and doesn't seem
that safe either.

Now I must add that I run with the xenomai/adeos-ipipe patches as well,
which do change interrupt handling a little, but so far this has worked
fine with the previous code and only the current code is broken for us.
I could try to build without the patch, although I would loose some major
functionality on the box doing so, and I would not be surprised if it
still fails since I believe I tried that already with 2.6.31+selected
git patches before without xenomai patched in and it still failed,
but I am only about 99% sure of that.

With the patch I get:
NETDEV WATCHDOG: eth2 (ucc_geth): transmit queue 0 timed out
------------[ cut here ]------------
Badness at c02729a8 [verbose debug info unavailable]
NIP: c02729a8 LR: c02729a8 CTR: c01b6088
REGS: c0451c40 TRAP: 0700   Not tainted  (2.6.32-trunk-8360e)
MSR: 00029032 <EE,ME,CE,IR,DR>  CR: 42042024  XER: 20000000
TASK = c041f3e8[0] 'swapper' THREAD: c0450000
GPR00: c02729a8 c0451cf0 c041f3e8 00000045 00002ae9 ffffffff c01b6afc c0422c48
GPR08: c042fde8 00000002 00000003 00010000 22042024 1001af90 1fffd000 00000000
GPR16: 00000000 c038c6d8 00000001 00200200 00000000 c0465eec c0465cec c0465aec
GPR24: c0450000 c04658ec c0423c2c df0e01c0 c0480000 df0e0000 c0423c2c 00000000
NIP [c02729a8] dev_watchdog+0x280/0x290
LR [c02729a8] dev_watchdog+0x280/0x290
Call Trace:
[c0451cf0] [c02729a8] dev_watchdog+0x280/0x290 (unreliable)
[c0451d50] [c00377c4] run_timer_softirq+0x164/0x224
[c0451da0] [c0032a38] __do_softirq+0xb8/0x13c
[c0451df0] [c00065cc] do_softirq+0xa0/0xac
[c0451e00] [c003280c] irq_exit+0x7c/0x9c
[c0451e10] [c00640c4] __ipipe_sync_stage+0x248/0x24c
[c0451e50] [c0064374] ipipe_suspend_domain+0xa0/0xf4
[c0451e70] [c00644a4] __ipipe_walk_pipeline+0xdc/0x120
[c0451e90] [c000af28] __ipipe_handle_irq+0x164/0x168
[c0451ec0] [c000b03c] __ipipe_grab_irq+0x3c/0xa4
[c0451ed0] [c0014814] __ipipe_ret_from_except+0x0/0xc
--- Exception: 501 at cpu_idle+0xe0/0xf0
    LR = cpu_idle+0xe0/0xf0
[c0451f90] [c000970c] cpu_idle+0x68/0xf0 (unreliable)
[c0451fb0] [c0003f30] rest_init+0x5c/0x6c
[c0451fc0] [c03f07ac] start_kernel+0x27c/0x2e0
[c0451ff0] [00003438] 0x3438
Instruction dump:
7d2903a6 4bfffea8 38810008 7fa3eb78 38a00040 4bfe9c81 7fe6fb78 7fa4eb78
7c651b78 3c60c03c 38631774 480b7d2d <0fe00000> 38000001 901c2fb0 4bffff94
warning: `zebra' uses 32-bit capabilities (legacy support in use)
PHY: 0:03 - Link is Up - 1000/Full
PHY: 0:09 - Link is Up - 100/Full
PHY: 0:02 - Link is Up - 100/Full
PHY: 0:0f - Link is Up - 100/Full
PHY: 0:17 - Link is Up - 100/Full

When reverted I get a stable running system with no errors.

Which port happens to fail first varies, but one always does and then
the system almost always crashes soon after.

-- 
Len Sorensen

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: ucc_geth broken in 2.6.32 by 864fdf884e82bacbe8ca5e93bd43393a61d2e2b4
  2009-12-23 17:40 ucc_geth broken in 2.6.32 by 864fdf884e82bacbe8ca5e93bd43393a61d2e2b4 Lennart Sorensen
@ 2009-12-23 18:04 ` Anton Vorontsov
  2009-12-23 20:09   ` Lennart Sorensen
  0 siblings, 1 reply; 8+ messages in thread
From: Anton Vorontsov @ 2009-12-23 18:04 UTC (permalink / raw)
  To: Lennart Sorensen; +Cc: linux-kernel, linuxppc-dev, netdev, leoli

On Wed, Dec 23, 2009 at 12:40:19PM -0500, Lennart Sorensen wrote:
> We use the ucc_geth for 6 ports (4 100Mbit and 2 Gbit ports) on an
> mpc8360e.  Up to 2.6.31 this worked fine.  2.6.32 on the other hand
> crashes very quickly after boot.

Hm. Just curious, what CPU revision you use? So that I could try
to reproduce the issue... I have MPC8360E-MDS boards, rev 2.0 and
rev rev 2.1 CPUs, Marvell PHYs. I also have MPC8360E-RDK (rev 2.1).
And I didn't see any issues with this patch.

> I managed to see the same crash when I was selectively trying to add newer
> ucc_geth patches to the 2.6.31 kernel a couple of months ago, and the same
> patch that caused a crash then seems suspect.  If I revert the patch the
> system runs completely stable.  Amusingly, the excact error message the
> patch claims to fix is in fact the error it causes to happen in my case.
> 
> So preferably 864fdf884e82bacbe8ca5e93bd43393a61d2e2b4 could be reverted

I don't see any point in reverting, because it will surely break my boards.
So, we need to fix this, not just hide.

> unless someone can fix it.

Well, I'm ready to help you with debugging.

> Now I must add that I run with the xenomai/adeos-ipipe patches as well,
> which do change interrupt handling a little,

It could be that it takes too long to stop the UCC, and xenomai
makes the timings worse, so the watchdog triggers.

Can you try the following patch?

diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
index afaf088..2f73e3f 100644
--- a/drivers/net/ucc_geth.c
+++ b/drivers/net/ucc_geth.c
@@ -1563,6 +1563,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
 
 static void ugeth_quiesce(struct ucc_geth_private *ugeth)
 {
+	netif_device_detach(ugeth->ndev);
+
 	/* Wait for and prevent any further xmits. */
 	netif_tx_disable(ugeth->ndev);
 
@@ -1577,7 +1579,7 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
 {
 	napi_enable(&ugeth->napi);
 	enable_irq(ugeth->ug_info->uf_info.irq);
-	netif_tx_wake_all_queues(ugeth->ndev);
+	netif_device_attach(ugeth->ndev);
 }
 
 /* Called every time the controller might need to be made

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: ucc_geth broken in 2.6.32 by 864fdf884e82bacbe8ca5e93bd43393a61d2e2b4
  2009-12-23 18:04 ` Anton Vorontsov
@ 2009-12-23 20:09   ` Lennart Sorensen
  2009-12-23 20:15     ` Lennart Sorensen
  2009-12-23 20:22     ` Anton Vorontsov
  0 siblings, 2 replies; 8+ messages in thread
From: Lennart Sorensen @ 2009-12-23 20:09 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Lennart Sorensen, linux-kernel, linuxppc-dev, netdev, leoli

On Wed, Dec 23, 2009 at 09:04:15PM +0300, Anton Vorontsov wrote:
> On Wed, Dec 23, 2009 at 12:40:19PM -0500, Lennart Sorensen wrote:
> > We use the ucc_geth for 6 ports (4 100Mbit and 2 Gbit ports) on an
> > mpc8360e.  Up to 2.6.31 this worked fine.  2.6.32 on the other hand
> > crashes very quickly after boot.
> 
> Hm. Just curious, what CPU revision you use? So that I could try
> to reproduce the issue... I have MPC8360E-MDS boards, rev 2.0 and
> rev rev 2.1 CPUs, Marvell PHYs. I also have MPC8360E-RDK (rev 2.1).
> And I didn't see any issues with this patch.

Hmm, I have an MDS around, so I could probably try and see if it is
reproduceable there.

CPU:   e300c1, MPC8360EA, Rev: 2.1 at 533.333 MHz, CSB: 266.667 MHz
BOARD: 12-86-0016-001 [00000000]                                   
UPMA:  Configured for Compact Flash PIO Mode                        
I2C:   ready                                                       
SPI:   ready                                                       
DRAM:  512 MB                                                      
MTEST: Testing data bus                                            
MTEST: Testing address bus                                         
MTEST: PASS                                                        
BCSR:  Ver A000B                                                   
FLASH: 32 MB

The PHY we use is KSZ8001LS for the 100Mbit ports, and on the port
that usually fails first is a marvel 88E1111 running fixed gigabit link
(converts to serdes before connecting to a switch chip).

I am actually surprised this code should matter at all given out of the
6 ethernet ports only two actually go to external ports, the rest are
fixed link.

If relevant here are some chunks from the dtb related to ethernet:

        aliases {
                ethernet0 = &enet0;
                ethernet1 = &enet1;
                ethernet2 = &enet2;
                ethernet3 = &enet3;
                ethernet4 = &enet4;
                ethernet5 = &enet5;
                serial0 = &serial0;
                serial1 = &serial1;
        };

                mdio {
                        compatible = "virtual,mdio-gpio";
                        #address-cells = <1>;
                        #size-cells = <0>;
                        /* MDC = PG6 MDIO = PG7 */
                        gpios = <&qe_pio_g 6 0 &qe_pio_g 7 0>;

                        phy0: ethernet-phy@00 {
                                reg = <0x18>;
                                device_type = "ethernet-phy";
                        };
                        phy1: ethernet-phy@01 {
                                reg = <0x06>;
                                device_type = "ethernet-phy";
                        };
                        phy2: ethernet-phy@02 {
                                reg = <0x0F>;
                                device_type = "ethernet-phy";
                        };
                        phy3: ethernet-phy@03 {
                                reg = <0x17>;
                                device_type = "ethernet-phy";
                        };
                };

                /* UCC5 Linux fe-cm-1 */
                enet0: ethernet@2400 {
                        device_type = "network";
                        compatible = "ucc_geth";
                        cell-index = <0x5>;
                        reg = <0x2400 0x200>;
                        interrupts = <0x28>;
                        interrupt-parent = <&qeic>;
                        local-mac-address = [ 00 00 00 00 00 00 ]; /* [U-BOOT] */
                        rx-clock-name = "clk16";
                        tx-clock-name = "clk14";
                        phy-handle = <&phy2>;
                        phy-connection-type = "mii";
                        pio-handle = <&ucc5>;
                };
                /* UCC 6 Linux fe-em-1 */
                enet1: ethernet@3400 {
                        device_type = "network";
                        compatible = "ucc_geth";
                        cell-index = <0x6>;
                        reg = <0x3400 0x200>;
                        interrupts = <0x29>;
                        interrupt-parent = <&qeic>;
                        local-mac-address = [ 00 00 00 00 00 00 ]; /* [U-BOOT] */
                        rx-clock-name = "clk7";
                        tx-clock-name = "clk8";
                        phy-handle = <&phy3>;
                        phy-connection-type = "mii";
                        pio-handle = <&ucc6>;
                };
                /* UCC 1 & 3  1000BT Data Plane dp0 */
                enet2: ethernet@2000 {
                        device_type = "network";
                        compatible = "ucc_geth";
                        cell-index = <0x1>;
                        reg = <0x2000 0x200>;
                        interrupts = <0x20>;
                        interrupt-parent = <&qeic>;
                        local-mac-address = [ 00 00 00 00 00 00 ]; /* [U-BOOT] */
                        rx-clock-name = "none";
                        tx-clock-name = "clk9";
                        fixed-link = <0x03 1 1000 0 0>; /* P3 on data plane switch on SM */
                        phy-connection-type = "gmii";
                        pio-handle = <&ucc1>;
                };
                /* UCC 7 100BT Data Plane dp1 */
                enet3: ethernet@2600 {
                        device_type = "network";
                        compatible = "ucc_geth";
                        cell-index = <0x7>;
                        reg = <0x2600 0x200>;
                        interrupts = <0x2a>;
                        interrupt-parent = <&qeic>;
                        local-mac-address = [ 00 00 00 00 00 00 ]; /* [U-BOOT] */
                        rx-clock-name = "clk20";
                        tx-clock-name = "clk19";
                        fixed-link = <0x02 1 100 0 0>; /* P2 on data-plane switch on SM */
                        phy-connection-type = "mii";
                        pio-handle = <&ucc7>;
                };
                /* UCC 2 & 4 1000 BT Control Plane cp0 */
                enet4: ethernet@3000 {
                        device_type = "network";
                        compatible = "ucc_geth";
                        cell-index = <0x2>;
                        reg = <0x3000 0x200>;
                        interrupts = <0x21>;
                        interrupt-parent = <&qeic>;
                        local-mac-address = [ 00 00 00 00 00 00 ]; /* [U-BOOT] */
                        rx-clock-name = "none";
                        tx-clock-name = "clk4";
                        tx-thread-num = <2>; /* Reduced to allow all enet to come up */
                        fixed-link = <10 1 1000 0 0>; /* P10 on control-plane switch on CM */
                        phy-connection-type = "gmii";
                        pio-handle = <&ucc2>;
                };
                /* UCC 8 100 BT Control Plane cp1 */
                enet5: ethernet@3600 {
                        device_type = "network";
                        compatible = "ucc_geth";
                        cell-index = <0x8>;
                        reg = <0x3600 0x200>;
                        interrupts = <0x2b>;
                        interrupt-parent = <&qeic>;
                        local-mac-address = [ 00 00 00 00 00 00 ]; /* [U-BOOT] */
                        rx-clock-name = "clk5";
                        tx-clock-name = "clk21";
                        fixed-link = <9 1 100 0 0>; /* P9 on control-plane switch on CM */
                        phy-connection-type = "mii";
                        pio-handle = <&ucc8>;
                };

> I don't see any point in reverting, because it will surely break my boards.
> So, we need to fix this, not just hide.

I agree with that.

> > unless someone can fix it.
> 
> Well, I'm ready to help you with debugging.

Excellent.

> > Now I must add that I run with the xenomai/adeos-ipipe patches as well,
> > which do change interrupt handling a little,
> 
> It could be that it takes too long to stop the UCC, and xenomai
> makes the timings worse, so the watchdog triggers.
> 
> Can you try the following patch?

Sure thing.

> diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
> index afaf088..2f73e3f 100644
> --- a/drivers/net/ucc_geth.c
> +++ b/drivers/net/ucc_geth.c
> @@ -1563,6 +1563,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
>  
>  static void ugeth_quiesce(struct ucc_geth_private *ugeth)
>  {
> +	netif_device_detach(ugeth->ndev);
> +
>  	/* Wait for and prevent any further xmits. */
>  	netif_tx_disable(ugeth->ndev);
>  
> @@ -1577,7 +1579,7 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
>  {
>  	napi_enable(&ugeth->napi);
>  	enable_irq(ugeth->ug_info->uf_info.irq);
> -	netif_tx_wake_all_queues(ugeth->ndev);
> +	netif_device_attach(ugeth->ndev);
>  }
>  
>  /* Called every time the controller might need to be made

So there result is:

Unable to handle kernel paging request for data at address 0x00000058
Faulting instruction address: 0xc024f2fc
Oops: Kernel access of bad area, sig: 11 [#1]
RC8360 CM
Modules linked in: rclibapi xeno_native max6369_wdt ucc_geth_driver spi_mpc8xxx ltc4215 lm75
NIP: c024f2fc LR: e30aa0a4 CTR: c024f2e8
REGS: df857ca0 TRAP: 0300   Not tainted  (2.6.32-trunk-8360e)
MSR: 00009032 <EE,ME,IR,DR>  CR: 44042088  XER: 00000000
DAR: 00000058, DSISR: 20000000
TASK = df848c90[4] 'events/0' THREAD: df856000
GPR00: e30aa0a4 df857d50 df848c90 00000000 00000640 00000001 c0428df4 dfa40b80
GPR08: 000000c8 e30ad2b8 df084360 c024f2e8 44042082 1001af90 e30ad2b8 00000000
GPR16: 00000048 00000001 00000000 00000000 df08436c df08440c 00000190 df08455c
GPR24: df0844ec df0842c0 df084000 180005ea dfa40b80 00000000 df0842c0 00000000
NIP [c024f2fc] skb_recycle_check+0x14/0x100
LR [e30aa0a4] ucc_geth_poll+0xd8/0x4e0 [ucc_geth_driver]
Call Trace:
[df857d50] [c000b03c] __ipipe_grab_irq+0x3c/0xa4 (unreliable)
[df857d60] [e30aa0a4] ucc_geth_poll+0xd8/0x4e0 [ucc_geth_driver]
[df857dd0] [c0258cf8] net_rx_action+0xf8/0x1b8
[df857e10] [c0032a38] __do_softirq+0xb8/0x13c
[df857e60] [c00065cc] do_softirq+0xa0/0xac
[df857e70] [c003280c] irq_exit+0x7c/0x9c
[df857e80] [c000b1f4] __ipipe_do_IRQ+0x50/0x68
[df857ea0] [c0064098] __ipipe_sync_stage+0x21c/0x24c
[df857ee0] [c0064a4c] __ipipe_unstall_root+0x5c/0x60
[df857ef0] [c005fb4c] enable_irq+0x88/0xc4
[df857f10] [e30a73d0] adjust_link+0x1fc/0x2f0 [ucc_geth_driver]
[df857f40] [c0209c0c] phy_state_machine+0x3a4/0x610
[df857f60] [c003fec4] worker_thread+0x124/0x1a4
[df857fc0] [c0044310] kthread+0x78/0x7c
[df857ff0] [c0013c30] kernel_thread+0x4c/0x68
Instruction dump:
4e800020 4800fcb5 4bffff50 4802782d 4bffffc8 48076dd9 4bffff6c 9421fff0
7c0802a6 93e1000c 7c7f1b78 90010014 <80030058> 2f800000 409e00cc 81630068
Kernel panic - not syncing: Fatal exception in interrupt
Rebooting in 180 seconds..

So it seems as soon as it touched the phy settings, it blew up.

Now perhaps the ipipe interrupt handling code changes the behaviour
enough that it makes the spinlock_irqsave necesary and that is why it
worked before.  Maybe ipipe is doing something wrong, given ipipe only
supports 2.6.30 for powerpc and I personally ported it to 2.6.31 and
then 2.6.32.  Everything else works fine though as long as I revert this
particular patch.

-- 
Len Sorensen

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: ucc_geth broken in 2.6.32 by 864fdf884e82bacbe8ca5e93bd43393a61d2e2b4
  2009-12-23 20:09   ` Lennart Sorensen
@ 2009-12-23 20:15     ` Lennart Sorensen
  2009-12-23 20:22     ` Anton Vorontsov
  1 sibling, 0 replies; 8+ messages in thread
From: Lennart Sorensen @ 2009-12-23 20:15 UTC (permalink / raw)
  To: Lennart Sorensen
  Cc: Anton Vorontsov, linux-kernel, linuxppc-dev, netdev, leoli

On Wed, Dec 23, 2009 at 03:09:48PM -0500, Lennart Sorensen wrote:
> On Wed, Dec 23, 2009 at 09:04:15PM +0300, Anton Vorontsov wrote:
> > On Wed, Dec 23, 2009 at 12:40:19PM -0500, Lennart Sorensen wrote:
> > > We use the ucc_geth for 6 ports (4 100Mbit and 2 Gbit ports) on an
> > > mpc8360e.  Up to 2.6.31 this worked fine.  2.6.32 on the other hand
> > > crashes very quickly after boot.
> > 
> > Hm. Just curious, what CPU revision you use? So that I could try
> > to reproduce the issue... I have MPC8360E-MDS boards, rev 2.0 and
> > rev rev 2.1 CPUs, Marvell PHYs. I also have MPC8360E-RDK (rev 2.1).
> > And I didn't see any issues with this patch.
> 
> Hmm, I have an MDS around, so I could probably try and see if it is
> reproduceable there.
> 
> CPU:   e300c1, MPC8360EA, Rev: 2.1 at 533.333 MHz, CSB: 266.667 MHz
> BOARD: 12-86-0016-001 [00000000]                                   
> UPMA:  Configured for Compact Flash PIO Mode                        
> I2C:   ready                                                       
> SPI:   ready                                                       
> DRAM:  512 MB                                                      
> MTEST: Testing data bus                                            
> MTEST: Testing address bus                                         
> MTEST: PASS                                                        
> BCSR:  Ver A000B                                                   
> FLASH: 32 MB
> 
> The PHY we use is KSZ8001LS for the 100Mbit ports, and on the port
> that usually fails first is a marvel 88E1111 running fixed gigabit link
> (converts to serdes before connecting to a switch chip).
> 
> I am actually surprised this code should matter at all given out of the
> 6 ethernet ports only two actually go to external ports, the rest are
> fixed link.
> 
> If relevant here are some chunks from the dtb related to ethernet:
> 
>         aliases {
>                 ethernet0 = &enet0;
>                 ethernet1 = &enet1;
>                 ethernet2 = &enet2;
>                 ethernet3 = &enet3;
>                 ethernet4 = &enet4;
>                 ethernet5 = &enet5;
>                 serial0 = &serial0;
>                 serial1 = &serial1;
>         };
> 
>                 mdio {
>                         compatible = "virtual,mdio-gpio";
>                         #address-cells = <1>;
>                         #size-cells = <0>;
>                         /* MDC = PG6 MDIO = PG7 */
>                         gpios = <&qe_pio_g 6 0 &qe_pio_g 7 0>;
> 
>                         phy0: ethernet-phy@00 {
>                                 reg = <0x18>;
>                                 device_type = "ethernet-phy";
>                         };
>                         phy1: ethernet-phy@01 {
>                                 reg = <0x06>;
>                                 device_type = "ethernet-phy";
>                         };
>                         phy2: ethernet-phy@02 {
>                                 reg = <0x0F>;
>                                 device_type = "ethernet-phy";
>                         };
>                         phy3: ethernet-phy@03 {
>                                 reg = <0x17>;
>                                 device_type = "ethernet-phy";
>                         };
>                 };
> 
>                 /* UCC5 Linux fe-cm-1 */
>                 enet0: ethernet@2400 {
>                         device_type = "network";
>                         compatible = "ucc_geth";
>                         cell-index = <0x5>;
>                         reg = <0x2400 0x200>;
>                         interrupts = <0x28>;
>                         interrupt-parent = <&qeic>;
>                         local-mac-address = [ 00 00 00 00 00 00 ]; /* [U-BOOT] */
>                         rx-clock-name = "clk16";
>                         tx-clock-name = "clk14";
>                         phy-handle = <&phy2>;
>                         phy-connection-type = "mii";
>                         pio-handle = <&ucc5>;
>                 };
>                 /* UCC 6 Linux fe-em-1 */
>                 enet1: ethernet@3400 {
>                         device_type = "network";
>                         compatible = "ucc_geth";
>                         cell-index = <0x6>;
>                         reg = <0x3400 0x200>;
>                         interrupts = <0x29>;
>                         interrupt-parent = <&qeic>;
>                         local-mac-address = [ 00 00 00 00 00 00 ]; /* [U-BOOT] */
>                         rx-clock-name = "clk7";
>                         tx-clock-name = "clk8";
>                         phy-handle = <&phy3>;
>                         phy-connection-type = "mii";
>                         pio-handle = <&ucc6>;
>                 };
>                 /* UCC 1 & 3  1000BT Data Plane dp0 */
>                 enet2: ethernet@2000 {
>                         device_type = "network";
>                         compatible = "ucc_geth";
>                         cell-index = <0x1>;
>                         reg = <0x2000 0x200>;
>                         interrupts = <0x20>;
>                         interrupt-parent = <&qeic>;
>                         local-mac-address = [ 00 00 00 00 00 00 ]; /* [U-BOOT] */
>                         rx-clock-name = "none";
>                         tx-clock-name = "clk9";
>                         fixed-link = <0x03 1 1000 0 0>; /* P3 on data plane switch on SM */
>                         phy-connection-type = "gmii";
>                         pio-handle = <&ucc1>;
>                 };
>                 /* UCC 7 100BT Data Plane dp1 */
>                 enet3: ethernet@2600 {
>                         device_type = "network";
>                         compatible = "ucc_geth";
>                         cell-index = <0x7>;
>                         reg = <0x2600 0x200>;
>                         interrupts = <0x2a>;
>                         interrupt-parent = <&qeic>;
>                         local-mac-address = [ 00 00 00 00 00 00 ]; /* [U-BOOT] */
>                         rx-clock-name = "clk20";
>                         tx-clock-name = "clk19";
>                         fixed-link = <0x02 1 100 0 0>; /* P2 on data-plane switch on SM */
>                         phy-connection-type = "mii";
>                         pio-handle = <&ucc7>;
>                 };
>                 /* UCC 2 & 4 1000 BT Control Plane cp0 */
>                 enet4: ethernet@3000 {
>                         device_type = "network";
>                         compatible = "ucc_geth";
>                         cell-index = <0x2>;
>                         reg = <0x3000 0x200>;
>                         interrupts = <0x21>;
>                         interrupt-parent = <&qeic>;
>                         local-mac-address = [ 00 00 00 00 00 00 ]; /* [U-BOOT] */
>                         rx-clock-name = "none";
>                         tx-clock-name = "clk4";
>                         tx-thread-num = <2>; /* Reduced to allow all enet to come up */
>                         fixed-link = <10 1 1000 0 0>; /* P10 on control-plane switch on CM */
>                         phy-connection-type = "gmii";
>                         pio-handle = <&ucc2>;
>                 };
>                 /* UCC 8 100 BT Control Plane cp1 */
>                 enet5: ethernet@3600 {
>                         device_type = "network";
>                         compatible = "ucc_geth";
>                         cell-index = <0x8>;
>                         reg = <0x3600 0x200>;
>                         interrupts = <0x2b>;
>                         interrupt-parent = <&qeic>;
>                         local-mac-address = [ 00 00 00 00 00 00 ]; /* [U-BOOT] */
>                         rx-clock-name = "clk5";
>                         tx-clock-name = "clk21";
>                         fixed-link = <9 1 100 0 0>; /* P9 on control-plane switch on CM */
>                         phy-connection-type = "mii";
>                         pio-handle = <&ucc8>;
>                 };
> 
> > I don't see any point in reverting, because it will surely break my boards.
> > So, we need to fix this, not just hide.
> 
> I agree with that.
> 
> > > unless someone can fix it.
> > 
> > Well, I'm ready to help you with debugging.
> 
> Excellent.
> 
> > > Now I must add that I run with the xenomai/adeos-ipipe patches as well,
> > > which do change interrupt handling a little,
> > 
> > It could be that it takes too long to stop the UCC, and xenomai
> > makes the timings worse, so the watchdog triggers.
> > 
> > Can you try the following patch?
> 
> Sure thing.
> 
> > diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
> > index afaf088..2f73e3f 100644
> > --- a/drivers/net/ucc_geth.c
> > +++ b/drivers/net/ucc_geth.c
> > @@ -1563,6 +1563,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
> >  
> >  static void ugeth_quiesce(struct ucc_geth_private *ugeth)
> >  {
> > +	netif_device_detach(ugeth->ndev);
> > +
> >  	/* Wait for and prevent any further xmits. */
> >  	netif_tx_disable(ugeth->ndev);
> >  
> > @@ -1577,7 +1579,7 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
> >  {
> >  	napi_enable(&ugeth->napi);
> >  	enable_irq(ugeth->ug_info->uf_info.irq);
> > -	netif_tx_wake_all_queues(ugeth->ndev);
> > +	netif_device_attach(ugeth->ndev);
> >  }
> >  
> >  /* Called every time the controller might need to be made
> 
> So there result is:
> 
> Unable to handle kernel paging request for data at address 0x00000058
> Faulting instruction address: 0xc024f2fc
> Oops: Kernel access of bad area, sig: 11 [#1]
> RC8360 CM
> Modules linked in: rclibapi xeno_native max6369_wdt ucc_geth_driver spi_mpc8xxx ltc4215 lm75
> NIP: c024f2fc LR: e30aa0a4 CTR: c024f2e8
> REGS: df857ca0 TRAP: 0300   Not tainted  (2.6.32-trunk-8360e)
> MSR: 00009032 <EE,ME,IR,DR>  CR: 44042088  XER: 00000000
> DAR: 00000058, DSISR: 20000000
> TASK = df848c90[4] 'events/0' THREAD: df856000
> GPR00: e30aa0a4 df857d50 df848c90 00000000 00000640 00000001 c0428df4 dfa40b80
> GPR08: 000000c8 e30ad2b8 df084360 c024f2e8 44042082 1001af90 e30ad2b8 00000000
> GPR16: 00000048 00000001 00000000 00000000 df08436c df08440c 00000190 df08455c
> GPR24: df0844ec df0842c0 df084000 180005ea dfa40b80 00000000 df0842c0 00000000
> NIP [c024f2fc] skb_recycle_check+0x14/0x100
> LR [e30aa0a4] ucc_geth_poll+0xd8/0x4e0 [ucc_geth_driver]
> Call Trace:
> [df857d50] [c000b03c] __ipipe_grab_irq+0x3c/0xa4 (unreliable)
> [df857d60] [e30aa0a4] ucc_geth_poll+0xd8/0x4e0 [ucc_geth_driver]
> [df857dd0] [c0258cf8] net_rx_action+0xf8/0x1b8
> [df857e10] [c0032a38] __do_softirq+0xb8/0x13c
> [df857e60] [c00065cc] do_softirq+0xa0/0xac
> [df857e70] [c003280c] irq_exit+0x7c/0x9c
> [df857e80] [c000b1f4] __ipipe_do_IRQ+0x50/0x68
> [df857ea0] [c0064098] __ipipe_sync_stage+0x21c/0x24c
> [df857ee0] [c0064a4c] __ipipe_unstall_root+0x5c/0x60
> [df857ef0] [c005fb4c] enable_irq+0x88/0xc4
> [df857f10] [e30a73d0] adjust_link+0x1fc/0x2f0 [ucc_geth_driver]
> [df857f40] [c0209c0c] phy_state_machine+0x3a4/0x610
> [df857f60] [c003fec4] worker_thread+0x124/0x1a4
> [df857fc0] [c0044310] kthread+0x78/0x7c
> [df857ff0] [c0013c30] kernel_thread+0x4c/0x68
> Instruction dump:
> 4e800020 4800fcb5 4bffff50 4802782d 4bffffc8 48076dd9 4bffff6c 9421fff0
> 7c0802a6 93e1000c 7c7f1b78 90010014 <80030058> 2f800000 409e00cc 81630068
> Kernel panic - not syncing: Fatal exception in interrupt
> Rebooting in 180 seconds..
> 
> So it seems as soon as it touched the phy settings, it blew up.
> 
> Now perhaps the ipipe interrupt handling code changes the behaviour
> enough that it makes the spinlock_irqsave necesary and that is why it
> worked before.  Maybe ipipe is doing something wrong, given ipipe only
> supports 2.6.30 for powerpc and I personally ported it to 2.6.31 and
> then 2.6.32.  Everything else works fine though as long as I revert this
> particular patch.

Oh and I will be off on vacation for about a week by tomorrow, so if I
don't fix it today, I will be a bit slow at getting back to you
on this.  Not that it is bothering any real users yet (they are using
2.6.30 for now).

-- 
Len Sorensen

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: ucc_geth broken in 2.6.32 by 864fdf884e82bacbe8ca5e93bd43393a61d2e2b4
  2009-12-23 20:09   ` Lennart Sorensen
  2009-12-23 20:15     ` Lennart Sorensen
@ 2009-12-23 20:22     ` Anton Vorontsov
  2009-12-23 22:10       ` Lennart Sorensen
  1 sibling, 1 reply; 8+ messages in thread
From: Anton Vorontsov @ 2009-12-23 20:22 UTC (permalink / raw)
  To: Lennart Sorensen; +Cc: linux-kernel, linuxppc-dev, netdev, leoli

On Wed, Dec 23, 2009 at 03:09:48PM -0500, Lennart Sorensen wrote:
[...]
> So there result is:
> 
> Unable to handle kernel paging request for data at address 0x00000058
> Faulting instruction address: 0xc024f2fc
> Oops: Kernel access of bad area, sig: 11 [#1]
> RC8360 CM
> Modules linked in: rclibapi xeno_native max6369_wdt ucc_geth_driver spi_mpc8xxx ltc4215 lm75
> NIP: c024f2fc LR: e30aa0a4 CTR: c024f2e8
> REGS: df857ca0 TRAP: 0300   Not tainted  (2.6.32-trunk-8360e)
> MSR: 00009032 <EE,ME,IR,DR>  CR: 44042088  XER: 00000000
> DAR: 00000058, DSISR: 20000000
> TASK = df848c90[4] 'events/0' THREAD: df856000
> GPR00: e30aa0a4 df857d50 df848c90 00000000 00000640 00000001 c0428df4 dfa40b80
> GPR08: 000000c8 e30ad2b8 df084360 c024f2e8 44042082 1001af90 e30ad2b8 00000000
> GPR16: 00000048 00000001 00000000 00000000 df08436c df08440c 00000190 df08455c
> GPR24: df0844ec df0842c0 df084000 180005ea dfa40b80 00000000 df0842c0 00000000
> NIP [c024f2fc] skb_recycle_check+0x14/0x100
> LR [e30aa0a4] ucc_geth_poll+0xd8/0x4e0 [ucc_geth_driver]
> Call Trace:
> [df857d50] [c000b03c] __ipipe_grab_irq+0x3c/0xa4 (unreliable)
> [df857d60] [e30aa0a4] ucc_geth_poll+0xd8/0x4e0 [ucc_geth_driver]

This I can reproduce. It seems it's a long standing bug that
becomes easily reproducible with quiesce/activate sequence.
The driver doesn't handle empty queue correctly, i.e. it ignores
the empty queue check if netdev queue is stopped, which makes no
sense.

Can you try this patch in addition to previous (i.e. both should
be applied)?

Thanks!

diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
index 2f73e3f..b22de51 100644
--- a/drivers/net/ucc_geth.c
+++ b/drivers/net/ucc_geth.c
@@ -3275,7 +3275,7 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
 		/* Handle the transmitted buffer and release */
 		/* the BD to be used with the current frame  */
 
-		if ((bd == ugeth->txBd[txQ]) && (netif_queue_stopped(dev) == 0))
+		if (bd == ugeth->txBd[txQ]) /* queue empty? */
 			break;
 
 		dev->stats.tx_packets++;

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: ucc_geth broken in 2.6.32 by 864fdf884e82bacbe8ca5e93bd43393a61d2e2b4
  2009-12-23 20:22     ` Anton Vorontsov
@ 2009-12-23 22:10       ` Lennart Sorensen
  2009-12-23 22:21         ` Anton Vorontsov
  0 siblings, 1 reply; 8+ messages in thread
From: Lennart Sorensen @ 2009-12-23 22:10 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Lennart Sorensen, linux-kernel, linuxppc-dev, netdev, leoli

On Wed, Dec 23, 2009 at 11:22:26PM +0300, Anton Vorontsov wrote:
> On Wed, Dec 23, 2009 at 03:09:48PM -0500, Lennart Sorensen wrote:
> [...]
> > So there result is:
> > 
> > Unable to handle kernel paging request for data at address 0x00000058
> > Faulting instruction address: 0xc024f2fc
> > Oops: Kernel access of bad area, sig: 11 [#1]
> > RC8360 CM
> > Modules linked in: rclibapi xeno_native max6369_wdt ucc_geth_driver spi_mpc8xxx ltc4215 lm75
> > NIP: c024f2fc LR: e30aa0a4 CTR: c024f2e8
> > REGS: df857ca0 TRAP: 0300   Not tainted  (2.6.32-trunk-8360e)
> > MSR: 00009032 <EE,ME,IR,DR>  CR: 44042088  XER: 00000000
> > DAR: 00000058, DSISR: 20000000
> > TASK = df848c90[4] 'events/0' THREAD: df856000
> > GPR00: e30aa0a4 df857d50 df848c90 00000000 00000640 00000001 c0428df4 dfa40b80
> > GPR08: 000000c8 e30ad2b8 df084360 c024f2e8 44042082 1001af90 e30ad2b8 00000000
> > GPR16: 00000048 00000001 00000000 00000000 df08436c df08440c 00000190 df08455c
> > GPR24: df0844ec df0842c0 df084000 180005ea dfa40b80 00000000 df0842c0 00000000
> > NIP [c024f2fc] skb_recycle_check+0x14/0x100
> > LR [e30aa0a4] ucc_geth_poll+0xd8/0x4e0 [ucc_geth_driver]
> > Call Trace:
> > [df857d50] [c000b03c] __ipipe_grab_irq+0x3c/0xa4 (unreliable)
> > [df857d60] [e30aa0a4] ucc_geth_poll+0xd8/0x4e0 [ucc_geth_driver]
> 
> This I can reproduce. It seems it's a long standing bug that
> becomes easily reproducible with quiesce/activate sequence.
> The driver doesn't handle empty queue correctly, i.e. it ignores
> the empty queue check if netdev queue is stopped, which makes no
> sense.
> 
> Can you try this patch in addition to previous (i.e. both should
> be applied)?
> 
> Thanks!
> 
> diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
> index 2f73e3f..b22de51 100644
> --- a/drivers/net/ucc_geth.c
> +++ b/drivers/net/ucc_geth.c
> @@ -3275,7 +3275,7 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
>  		/* Handle the transmitted buffer and release */
>  		/* the BD to be used with the current frame  */
>  
> -		if ((bd == ugeth->txBd[txQ]) && (netif_queue_stopped(dev) == 0))
> +		if (bd == ugeth->txBd[txQ]) /* queue empty? */
>  			break;
>  
>  		dev->stats.tx_packets++;

That seems to be it.  It works now.  No more crashes.

Those two patches together seem to do the trick.  I really hope they
can go into 2.6.32-stable then, since this is a regression over 2.6.31
and is hopefully an obvious fix.

Now if only my mdio-gpio bitbang one line fix would be accepted.

-- 
Len Sorensen

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: ucc_geth broken in 2.6.32 by 864fdf884e82bacbe8ca5e93bd43393a61d2e2b4
  2009-12-23 22:10       ` Lennart Sorensen
@ 2009-12-23 22:21         ` Anton Vorontsov
  2009-12-23 22:23           ` Lennart Sorensen
  0 siblings, 1 reply; 8+ messages in thread
From: Anton Vorontsov @ 2009-12-23 22:21 UTC (permalink / raw)
  To: Lennart Sorensen; +Cc: linux-kernel, linuxppc-dev, netdev, leoli

On Wed, Dec 23, 2009 at 05:10:47PM -0500, Lennart Sorensen wrote:
[...]
> That seems to be it.  It works now.  No more crashes.
> Those two patches together seem to do the trick.

Great!

> I really hope they
> can go into 2.6.32-stable then, since this is a regression over 2.6.31
> and is hopefully an obvious fix.

Yep, will try to get them there.

Thanks a lot for helping to track this down,

-- 
Anton Vorontsov
email: cbouatmailru@gmail.com
irc://irc.freenode.net/bd2

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: ucc_geth broken in 2.6.32 by 864fdf884e82bacbe8ca5e93bd43393a61d2e2b4
  2009-12-23 22:21         ` Anton Vorontsov
@ 2009-12-23 22:23           ` Lennart Sorensen
  0 siblings, 0 replies; 8+ messages in thread
From: Lennart Sorensen @ 2009-12-23 22:23 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Lennart Sorensen, linux-kernel, linuxppc-dev, netdev, leoli

On Thu, Dec 24, 2009 at 01:21:16AM +0300, Anton Vorontsov wrote:
> On Wed, Dec 23, 2009 at 05:10:47PM -0500, Lennart Sorensen wrote:
> [...]
> > That seems to be it.  It works now.  No more crashes.
> > Those two patches together seem to do the trick.
> 
> Great!
> 
> > I really hope they
> > can go into 2.6.32-stable then, since this is a regression over 2.6.31
> > and is hopefully an obvious fix.
> 
> Yep, will try to get them there.
> 
> Thanks a lot for helping to track this down,

Thanks for fixing it so fast.  I only knew that patch broke it, not why
it broke it.

-- 
Len Sorensen

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2009-12-23 22:23 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-12-23 17:40 ucc_geth broken in 2.6.32 by 864fdf884e82bacbe8ca5e93bd43393a61d2e2b4 Lennart Sorensen
2009-12-23 18:04 ` Anton Vorontsov
2009-12-23 20:09   ` Lennart Sorensen
2009-12-23 20:15     ` Lennart Sorensen
2009-12-23 20:22     ` Anton Vorontsov
2009-12-23 22:10       ` Lennart Sorensen
2009-12-23 22:21         ` Anton Vorontsov
2009-12-23 22:23           ` Lennart Sorensen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).