All of lore.kernel.org
 help / color / mirror / Atom feed
* [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
@ 2017-04-07 18:17 Ville Syrjälä
  2017-04-07 18:38 ` Eric Dumazet
  2017-04-21 11:40 ` Ville Syrjälä
  0 siblings, 2 replies; 21+ messages in thread
From: Ville Syrjälä @ 2017-04-07 18:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, netdev, linux-kernel

Hi,

My old P3 laptop started to die on me in the middle of larger compile
jobs (using distcc) after v4.11-rc<something>. I bisected the problem
to 617f01211baf ("8139too: use napi_complete_done()").

Unfortunately I wasn't able to capture a full oops as the machine doesn't
have serial and ramoops failed me. I did get one partial oops on vgacon
which showed rtl8139_poll() being involved (EIP was around
_raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
bisect result.

So maybe some kind of nasty thing going between the hard irq and
softirq? Perhaps UP related? I tried to stare at the locking around
rtl8139_poll() for a while but it looked mostly sane to me.

-- 
Ville Syrjälä
Intel OTC

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-04-07 18:17 [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()") Ville Syrjälä
@ 2017-04-07 18:38 ` Eric Dumazet
  2017-04-07 18:42   ` David Miller
                     ` (2 more replies)
  2017-04-21 11:40 ` Ville Syrjälä
  1 sibling, 3 replies; 21+ messages in thread
From: Eric Dumazet @ 2017-04-07 18:38 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Eric Dumazet, David S. Miller, netdev, linux-kernel

On Fri, 2017-04-07 at 21:17 +0300, Ville Syrjälä wrote:
> Hi,
> 
> My old P3 laptop started to die on me in the middle of larger compile
> jobs (using distcc) after v4.11-rc<something>. I bisected the problem
> to 617f01211baf ("8139too: use napi_complete_done()").
> 
> Unfortunately I wasn't able to capture a full oops as the machine doesn't
> have serial and ramoops failed me. I did get one partial oops on vgacon
> which showed rtl8139_poll() being involved (EIP was around
> _raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
> bisect result.
> 
> So maybe some kind of nasty thing going between the hard irq and
> softirq? Perhaps UP related? I tried to stare at the locking around
> rtl8139_poll() for a while but it looked mostly sane to me.
> 

Thanks a lot for the detective work, I am so sorry for this !

Could you try the following patch ?

I do not really see what could be wrong, the code should run just fine
on UP.

Thanks.

diff --git a/drivers/net/ethernet/realtek/8139too.c b/drivers/net/ethernet/realtek/8139too.c
index 89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673 100644
--- a/drivers/net/ethernet/realtek/8139too.c
+++ b/drivers/net/ethernet/realtek/8139too.c
@@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, int budget)
 	if (likely(RTL_R16(IntrStatus) & RxAckBits))
 		work_done += rtl8139_rx(dev, tp, budget);
 
-	if (work_done < budget && napi_complete_done(napi, work_done)) {
+	if (work_done < budget) {
 		unsigned long flags;
 
 		spin_lock_irqsave(&tp->lock, flags);
-		RTL_W16_F(IntrMask, rtl8139_intr_mask);
+		if (napi_complete_done(napi, work_done))
+			RTL_W16_F(IntrMask, rtl8139_intr_mask);
 		spin_unlock_irqrestore(&tp->lock, flags);
 	}
 	spin_unlock(&tp->rx_lock);

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-04-07 18:38 ` Eric Dumazet
@ 2017-04-07 18:42   ` David Miller
  2017-04-08 10:23     ` Francois Romieu
  2017-04-10 12:11   ` Ville Syrjälä
  2017-06-19 14:44   ` [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()") Michal Kubecek
  2 siblings, 1 reply; 21+ messages in thread
From: David Miller @ 2017-04-07 18:42 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ville.syrjala, edumazet, netdev, linux-kernel

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 07 Apr 2017 11:38:49 -0700

> I do not really see what could be wrong, the code should run just fine
> on UP.

One theory is that the interrupt masking isn't working properly
and interrupts are still arriving and hitting the NAPI state even
when we are actively polling NAPI.

And this problem was masked by the locking done here.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-04-07 18:42   ` David Miller
@ 2017-04-08 10:23     ` Francois Romieu
  2017-04-10 14:22       ` tedheadster
  0 siblings, 1 reply; 21+ messages in thread
From: Francois Romieu @ 2017-04-08 10:23 UTC (permalink / raw)
  To: ville.syrjala; +Cc: David Miller, eric.dumazet, edumazet, netdev, linux-kernel

David Miller <davem@davemloft.net> :
[...]
> One theory is that the interrupt masking isn't working properly
> and interrupts are still arriving and hitting the NAPI state even
> when we are actively polling NAPI.
> 
> And this problem was masked by the locking done here.

Yes.

Ville, can you rule out irq sharing between the 8139 and some other
device ? It's a candidate for unexpected interrupt handler invocation
with older pc, even with properly working hardware.

-- 
Ueimor

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-04-07 18:38 ` Eric Dumazet
  2017-04-07 18:42   ` David Miller
@ 2017-04-10 12:11   ` Ville Syrjälä
  2017-09-18 19:46     ` Ville Syrjälä
  2017-06-19 14:44   ` [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()") Michal Kubecek
  2 siblings, 1 reply; 21+ messages in thread
From: Ville Syrjälä @ 2017-04-10 12:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Eric Dumazet, David S. Miller, netdev, linux-kernel

On Fri, Apr 07, 2017 at 11:38:49AM -0700, Eric Dumazet wrote:
> On Fri, 2017-04-07 at 21:17 +0300, Ville Syrjälä wrote:
> > Hi,
> > 
> > My old P3 laptop started to die on me in the middle of larger compile
> > jobs (using distcc) after v4.11-rc<something>. I bisected the problem
> > to 617f01211baf ("8139too: use napi_complete_done()").
> > 
> > Unfortunately I wasn't able to capture a full oops as the machine doesn't
> > have serial and ramoops failed me. I did get one partial oops on vgacon
> > which showed rtl8139_poll() being involved (EIP was around
> > _raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
> > bisect result.
> > 
> > So maybe some kind of nasty thing going between the hard irq and
> > softirq? Perhaps UP related? I tried to stare at the locking around
> > rtl8139_poll() for a while but it looked mostly sane to me.
> > 
> 
> Thanks a lot for the detective work, I am so sorry for this !
> 
> Could you try the following patch ?
> 
> I do not really see what could be wrong, the code should run just fine
> on UP.
> 
> Thanks.
> 
> diff --git a/drivers/net/ethernet/realtek/8139too.c b/drivers/net/ethernet/realtek/8139too.c
> index 89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673 100644
> --- a/drivers/net/ethernet/realtek/8139too.c
> +++ b/drivers/net/ethernet/realtek/8139too.c
> @@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, int budget)
>  	if (likely(RTL_R16(IntrStatus) & RxAckBits))
>  		work_done += rtl8139_rx(dev, tp, budget);
>  
> -	if (work_done < budget && napi_complete_done(napi, work_done)) {
> +	if (work_done < budget) {
>  		unsigned long flags;
>  
>  		spin_lock_irqsave(&tp->lock, flags);
> -		RTL_W16_F(IntrMask, rtl8139_intr_mask);
> +		if (napi_complete_done(napi, work_done))
> +			RTL_W16_F(IntrMask, rtl8139_intr_mask);
>  		spin_unlock_irqrestore(&tp->lock, flags);
>  	}
>  	spin_unlock(&tp->rx_lock);
> 
> 

Yep, that patch does appear to make it stable again.

Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>

-- 
Ville Syrjälä
Intel OTC

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-04-08 10:23     ` Francois Romieu
@ 2017-04-10 14:22       ` tedheadster
  2017-04-10 14:55         ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: tedheadster @ 2017-04-10 14:22 UTC (permalink / raw)
  To: Francois Romieu
  Cc: ville.syrjala, David Miller, eric.dumazet, edumazet, netdev,
	linux-kernel

On Sat, Apr 8, 2017 at 6:23 AM, Francois Romieu <romieu@fr.zoreil.com> wrote:
> David Miller <davem@davemloft.net> :
> [...]
>> One theory is that the interrupt masking isn't working properly
>> and interrupts are still arriving and hitting the NAPI state even
>> when we are actively polling NAPI.
>>
>> And this problem was masked by the locking done here.
>
> Yes.
>
> Ville, can you rule out irq sharing between the 8139 and some other
> device ? It's a candidate for unexpected interrupt handler invocation
> with older pc, even with properly working hardware.
>

Eric,
  If napi_complete_done() calls could affect drivers on older
hardware, I can test the following:

drivers/net/ethernet/3com/typhoon.c
drivers/net/ethernet/amd/pcnet32.c
drivers/net/ethernet/broadcom/tg3.c
drivers/net/ethernet/dec/tulip/interrupt.c
drivers/net/ethernet/intel/e100.c
drivers/net/ethernet/intel/e1000/e1000_main.c
drivers/net/ethernet/smsc/epic100.c
drivers/net/ethernet/via/via-rhine.c

- Matthew

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-04-10 14:22       ` tedheadster
@ 2017-04-10 14:55         ` Eric Dumazet
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2017-04-10 14:55 UTC (permalink / raw)
  To: whiteheadm
  Cc: Francois Romieu, ville.syrjala, David Miller, Eric Dumazet,
	netdev, linux-kernel

On Mon, Apr 10, 2017 at 7:22 AM, tedheadster <tedheadster@gmail.com> wrote:
> On Sat, Apr 8, 2017 at 6:23 AM, Francois Romieu <romieu@fr.zoreil.com> wrote:
>> David Miller <davem@davemloft.net> :
>> [...]
>>> One theory is that the interrupt masking isn't working properly
>>> and interrupts are still arriving and hitting the NAPI state even
>>> when we are actively polling NAPI.
>>>
>>> And this problem was masked by the locking done here.
>>
>> Yes.
>>
>> Ville, can you rule out irq sharing between the 8139 and some other
>> device ? It's a candidate for unexpected interrupt handler invocation
>> with older pc, even with properly working hardware.
>>
>
> Eric,
>   If napi_complete_done() calls could affect drivers on older
> hardware, I can test the following:
>
> drivers/net/ethernet/3com/typhoon.c
> drivers/net/ethernet/amd/pcnet32.c
> drivers/net/ethernet/broadcom/tg3.c
> drivers/net/ethernet/dec/tulip/interrupt.c
> drivers/net/ethernet/intel/e100.c
> drivers/net/ethernet/intel/e1000/e1000_main.c
> drivers/net/ethernet/smsc/epic100.c
> drivers/net/ethernet/via/via-rhine.c

That would be great, thanks Matthew.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-04-07 18:17 [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()") Ville Syrjälä
  2017-04-07 18:38 ` Eric Dumazet
@ 2017-04-21 11:40 ` Ville Syrjälä
  2017-04-21 13:29   ` Eric Dumazet
  1 sibling, 1 reply; 21+ messages in thread
From: Ville Syrjälä @ 2017-04-21 11:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, netdev, linux-kernel

BTW I've also been getting some lockdep grief from r8169 netpoll stuff
recently. Not sure if it might be related to these changes or not, but
I don't remember seeing this sort of stuff until quite recently.

[  251.911044] ======================================================
[  251.911044] [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
[  251.911045] 4.11.0-rc6-elk+ #101 Not tainted
[  251.911045] ------------------------------------------------------
[  251.911046] kms_plane_blink/2132 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[  251.911046]  (&syncp->seq#2){+.-...}, at: [<c14398a3>] netpoll_poll_dev+0xb3/0x2a0
[  251.911048] 
[  251.911048] and this task is already holding:
[  251.911049]  (target_list_lock){-.....}, at: [<f858693f>] write_msg+0x3f/0xd0 [netconsole]
[  251.911050] which would create a new lock dependency:
[  251.911050]  (target_list_lock){-.....} -> (&syncp->seq#2){+.-...}
[  251.911053] 
[  251.911053] but this new dependency connects a HARDIRQ-irq-safe lock:
[  251.911054]  (target_list_lock){-.....}
[  251.911055] ... which became HARDIRQ-irq-safe at:
[  251.911055]   __lock_acquire+0x6ec/0x1440
[  251.911056]   lock_acquire+0x7e/0x1a0
[  251.911056]   _raw_spin_lock_irqsave+0x2b/0x40
[  251.911056]   write_msg+0x3f/0xd0 [netconsole]
[  251.911057]   console_unlock+0x41c/0x570
[  251.911057]   vprintk_emit+0x26d/0x350
[  251.911058]   vprintk_default+0x12/0x20
[  251.911058]   vprintk_func+0x1c/0x50
[  251.911058]   printk+0xe/0x10
[  251.911059]   drm_printk+0x5a/0x60 [drm]
[  251.911059]   intel_get_hpd_pins+0x87/0xa0 [i915]
[  251.911060]   i9xx_hpd_irq_handler+0x96/0xe0 [i915]
[  251.911060]   i965_irq_handler+0x2c8/0x2f0 [i915]
[  251.911060]   __handle_irq_event_percpu+0x38/0x360
[  251.911061]   handle_irq_event_percpu+0x19/0x50
[  251.911061]   handle_irq_event+0x29/0x50
[  251.911062]   handle_edge_irq+0x65/0x110
[  251.911062]   handle_irq+0x9f/0xc0
[  251.911062]   do_IRQ+0x4f/0x110
[  251.911063]   common_interrupt+0x36/0x3c
[  251.911063]   cpuidle_enter_state+0xcc/0x390
[  251.911064]   cpuidle_enter+0xf/0x20
[  251.911064]   call_cpuidle+0x1c/0x40
[  251.911064]   do_idle+0x164/0x1d0
[  251.911065]   cpu_startup_entry+0x1d/0x20
[  251.911065]   start_secondary+0x104/0x170
[  251.911066]   startup_32_smp+0x16b/0x16d
[  251.911066] 
[  251.911066] to a HARDIRQ-irq-unsafe lock:
[  251.911067]  (&syncp->seq#2){+.-...}
[  251.911068] ... which became HARDIRQ-irq-unsafe at:
[  251.911068] ...  __lock_acquire+0x5d7/0x1440
[  251.911069]   lock_acquire+0x7e/0x1a0
[  251.911069]   rtl8169_poll+0x474/0x620 [r8169]
[  251.911070]   net_rx_action+0x1d0/0x3f0
[  251.911070]   __do_softirq+0x17d/0x460
[  251.911070]   do_softirq_own_stack+0x1d/0x30
[  251.911071]   irq_exit+0xa5/0xb0
[  251.911071]   do_IRQ+0x58/0x110
[  251.911072]   common_interrupt+0x36/0x3c
[  251.911072]   cpuidle_enter_state+0xcc/0x390
[  251.911072]   cpuidle_enter+0xf/0x20
[  251.911073]   call_cpuidle+0x1c/0x40
[  251.911073]   do_idle+0x164/0x1d0
[  251.911073]   cpu_startup_entry+0x1d/0x20
[  251.911074]   rest_init+0x10c/0x120
[  251.911074]   start_kernel+0x30d/0x312
[  251.911075]   i386_start_kernel+0x85/0x89
[  251.911075]   startup_32_smp+0x16b/0x16d
[  251.911075] 
[  251.911076] other info that might help us debug this:
[  251.911076] 
[  251.911077]  Possible interrupt unsafe locking scenario:
[  251.911077] 
[  251.911077]        CPU0                    CPU1
[  251.911078]        ----                    ----
[  251.911078]   lock(&syncp->seq#2);
[  251.911079]                                local_irq_disable();
[  251.911080]                                lock(target_list_lock);
[  251.911081]                                lock(&syncp->seq#2);
[  251.911082]   <Interrupt>
[  251.911082]     lock(target_list_lock);
[  251.911083] 
[  251.911084]  *** DEADLOCK ***
[  251.911084] 
[  251.911084] 2 locks held by kms_plane_blink/2132:
[  251.911085]  #0:  (console_lock){+.+.+.}, at: [<c10ad3b4>] vprintk_emit+0x264/0x350
[  251.911086]  #1:  (target_list_lock){-.....}, at: [<f858693f>] write_msg+0x3f/0xd0 [netconsole]
[  251.911088] 
[  251.911088] the dependencies between HARDIRQ-irq-safe lock and the holding lock:
[  251.911089] -> (target_list_lock){-.....} ops: 482 {
[  251.911090]    IN-HARDIRQ-W at:
[  251.911091]                        __lock_acquire+0x6ec/0x1440
[  251.911092]                        lock_acquire+0x7e/0x1a0
[  251.911092]                        _raw_spin_lock_irqsave+0x2b/0x40
[  251.911092]                        write_msg+0x3f/0xd0 [netconsole]
[  251.911093]                        console_unlock+0x41c/0x570
[  251.911093]                        vprintk_emit+0x26d/0x350
[  251.911094]                        vprintk_default+0x12/0x20
[  251.911094]                        vprintk_func+0x1c/0x50
[  251.911095]                        printk+0xe/0x10
[  251.911095]                        drm_printk+0x5a/0x60 [drm]
[  251.911096]                        intel_get_hpd_pins+0x87/0xa0 [i915]
[  251.911096]                        i9xx_hpd_irq_handler+0x96/0xe0 [i915]
[  251.911097]                        i965_irq_handler+0x2c8/0x2f0 [i915]
[  251.911097]                        __handle_irq_event_percpu+0x38/0x360
[  251.911098]                        handle_irq_event_percpu+0x19/0x50
[  251.911098]                        handle_irq_event+0x29/0x50
[  251.911098]                        handle_edge_irq+0x65/0x110
[  251.911099]                        handle_irq+0x9f/0xc0
[  251.911099]                        do_IRQ+0x4f/0x110
[  251.911100]                        common_interrupt+0x36/0x3c
[  251.911100]                        cpuidle_enter_state+0xcc/0x390
[  251.911101]                        cpuidle_enter+0xf/0x20
[  251.911101]                        call_cpuidle+0x1c/0x40
[  251.911102]                        do_idle+0x164/0x1d0
[  251.911102]                        cpu_startup_entry+0x1d/0x20
[  251.911102]                        start_secondary+0x104/0x170
[  251.911103]                        startup_32_smp+0x16b/0x16d
[  251.911103]    INITIAL USE at:
[  251.911104]                       __lock_acquire+0x221/0x1440
[  251.911104]                       lock_acquire+0x7e/0x1a0
[  251.911105]                       _raw_spin_lock_irqsave+0x2b/0x40
[  251.911105]                       make_netconsole_target+0x80/0xab [netconsole]
[  251.911106]                       configfs_mkdir+0x312/0x3b0 [configfs]
[  251.911106]                       vfs_mkdir+0xd2/0x1b0
[  251.911107]                       SyS_mkdir+0x57/0xc0
[  251.911107]                       do_fast_syscall_32+0x80/0x430
[  251.911108]                       entry_SYSENTER_32+0x4c/0x7b
[  251.911108]  }
[  251.911108]  ... key      at: [<f85882f0>] target_list_lock+0x10/0xffffed20 [netconsole]
[  251.911109]  ... acquired at:
[  251.911109]    check_irq_usage+0x42/0xb0
[  251.911110]    __lock_acquire+0xfbf/0x1440
[  251.911110]    lock_acquire+0x7e/0x1a0
[  251.911110]    rtl8169_poll+0x474/0x620 [r8169]
[  251.911111]    netpoll_poll_dev+0xb3/0x2a0
[  251.911111]    netpoll_send_skb_on_dev+0x180/0x240
[  251.911112]    netpoll_send_udp+0x292/0x440
[  251.911112]    write_msg+0x9c/0xd0 [netconsole]
[  251.911112]    console_unlock+0x41c/0x570
[  251.911113]    vprintk_emit+0x26d/0x350
[  251.911113]    printk_emit+0x1b/0x1d
[  251.911114]    devkmsg_write+0xeb/0x140
[  251.911114]    __vfs_write+0xb0/0x110
[  251.911114]    vfs_write+0xa1/0x1e0
[  251.911115]    SyS_write+0x3d/0x90
[  251.911115]    do_fast_syscall_32+0x80/0x430
[  251.911116]    entry_SYSENTER_32+0x4c/0x7b
[  251.911116] 
[  251.911116] 
[  251.911117] the dependencies between the lock to be acquired and HARDIRQ-irq-unsafe lock:
[  251.911117] -> (&syncp->seq#2){+.-...} ops: 7439 {
[  251.911119]    HARDIRQ-ON-W at:
[  251.911120]                        __lock_acquire+0x5d7/0x1440
[  251.911120]                        lock_acquire+0x7e/0x1a0
[  251.911121]                        rtl8169_poll+0x474/0x620 [r8169]
[  251.911121]                        net_rx_action+0x1d0/0x3f0
[  251.911122]                        __do_softirq+0x17d/0x460
[  251.911122]                        do_softirq_own_stack+0x1d/0x30
[  251.911122]                        irq_exit+0xa5/0xb0
[  251.911123]                        do_IRQ+0x58/0x110
[  251.911123]                        common_interrupt+0x36/0x3c
[  251.911124]                        cpuidle_enter_state+0xcc/0x390
[  251.911124]                        cpuidle_enter+0xf/0x20
[  251.911125]                        call_cpuidle+0x1c/0x40
[  251.911125]                        do_idle+0x164/0x1d0
[  251.911126]                        cpu_startup_entry+0x1d/0x20
[  251.911126]                        rest_init+0x10c/0x120
[  251.911127]                        start_kernel+0x30d/0x312
[  251.911127]                        i386_start_kernel+0x85/0x89
[  251.911127]                        startup_32_smp+0x16b/0x16d
[  251.911128]    IN-SOFTIRQ-W at:
[  251.911129]                        __lock_acquire+0x5b5/0x1440
[  251.911129]                        lock_acquire+0x7e/0x1a0
[  251.911129]                        rtl8169_poll+0x474/0x620 [r8169]
[  251.911130]                        net_rx_action+0x1d0/0x3f0
[  251.911130]                        __do_softirq+0x17d/0x460
[  251.911131]                        do_softirq_own_stack+0x1d/0x30
[  251.911131]                        irq_exit+0xa5/0xb0
[  251.911132]                        do_IRQ+0x58/0x110
[  251.911132]                        common_interrupt+0x36/0x3c
[  251.911133]                        cpuidle_enter_state+0xcc/0x390
[  251.911133]                        cpuidle_enter+0xf/0x20
[  251.911133]                        call_cpuidle+0x1c/0x40
[  251.911134]                        do_idle+0x164/0x1d0
[  251.911134]                        cpu_startup_entry+0x1d/0x20
[  251.911135]                        rest_init+0x10c/0x120
[  251.911135]                        start_kernel+0x30d/0x312
[  251.911136]                        i386_start_kernel+0x85/0x89
[  251.911136]                        startup_32_smp+0x16b/0x16d
[  251.911136]    INITIAL USE at:
[  251.911137]                       __lock_acquire+0x221/0x1440
[  251.911138]                       lock_acquire+0x7e/0x1a0
[  251.911138]                       rtl8169_get_stats64+0x81/0x320 [r8169]
[  251.911139]                       dev_get_stats+0x30/0xa0
[  251.911139]                       rtnl_fill_stats+0x30/0x110
[  251.911139]                       rtnl_fill_ifinfo+0x4be/0xd80
[  251.911140]                       rtmsg_ifinfo_build_skb+0x45/0xc0
[  251.911140]                       rtmsg_ifinfo.part.33+0xd/0x30
[  251.911141]                       rtmsg_ifinfo+0x1b/0x20
[  251.911141]                       register_netdevice+0x495/0x5e0
[  251.911142]                       register_netdev+0x12/0x20
[  251.911142]                       rtl_init_one+0x86d/0x1060 [r8169]
[  251.911143]                       pci_device_probe+0x8b/0x100
[  251.911143]                       driver_probe_device+0x1ad/0x2a0
[  251.911144]                       __driver_attach+0x89/0x90
[  251.911144]                       bus_for_each_dev+0x4f/0x80
[  251.911144]                       driver_attach+0x14/0x20
[  251.911145]                 
[  251.911146] Lost 63 message(s)!

-- 
Ville Syrjälä
Intel OTC

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-04-21 11:40 ` Ville Syrjälä
@ 2017-04-21 13:29   ` Eric Dumazet
  2017-04-21 15:09     ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2017-04-21 13:29 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Eric Dumazet, David S. Miller, netdev, linux-kernel

On Fri, 2017-04-21 at 14:40 +0300, Ville Syrjälä wrote:
> BTW I've also been getting some lockdep grief from r8169 netpoll stuff
> recently. Not sure if it might be related to these changes or not, but
> I don't remember seeing this sort of stuff until quite recently.
> 
> [  251.911044] ======================================================
> [  251.911044] [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> [  251.911045] 4.11.0-rc6-elk+ #101 Not tainted
> [  251.911045] ------------------------------------------------------
> [  251.911046] kms_plane_blink/2132 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> [  251.911046]  (&syncp->seq#2){+.-...}, at: [<c14398a3>] netpoll_poll_dev+0xb3/0x2a0
> [  251.911048] 
> [  251.911048] and this task is already holding:
> [  251.911049]  (target_list_lock){-.....}, at: [<f858693f>] write_msg+0x3f/0xd0 [netconsole]
> [  251.911050] which would create a new lock dependency:
> [  251.911050]  (target_list_lock){-.....} -> (&syncp->seq#2){+.-...}
> [  251.911053] 
> [  251.911053] but this new dependency connects a HARDIRQ-irq-safe lock:
> [  251.911054]  (target_list_lock){-.....}
> [  251.911055] ... which became HARDIRQ-irq-safe at:
> [  251.911055]   __lock_acquire+0x6ec/0x1440
> [  251.911056]   lock_acquire+0x7e/0x1a0
> [  251.911056]   _raw_spin_lock_irqsave+0x2b/0x40
> [  251.911056]   write_msg+0x3f/0xd0 [netconsole]
> [  251.911057]   console_unlock+0x41c/0x570
> [  251.911057]   vprintk_emit+0x26d/0x350
> [  251.911058]   vprintk_default+0x12/0x20
> [  251.911058]   vprintk_func+0x1c/0x50
> [  251.911058]   printk+0xe/0x10
> [  251.911059]   drm_printk+0x5a/0x60 [drm]
> [  251.911059]   intel_get_hpd_pins+0x87/0xa0 [i915]
> [  251.911060]   i9xx_hpd_irq_handler+0x96/0xe0 [i915]
> [  251.911060]   i965_irq_handler+0x2c8/0x2f0 [i915]
> [  251.911060]   __handle_irq_event_percpu+0x38/0x360
> [  251.911061]   handle_irq_event_percpu+0x19/0x50
> [  251.911061]   handle_irq_event+0x29/0x50
> [  251.911062]   handle_edge_irq+0x65/0x110
> [  251.911062]   handle_irq+0x9f/0xc0
> [  251.911062]   do_IRQ+0x4f/0x110
> [  251.911063]   common_interrupt+0x36/0x3c
> [  251.911063]   cpuidle_enter_state+0xcc/0x390
> [  251.911064]   cpuidle_enter+0xf/0x20
> [  251.911064]   call_cpuidle+0x1c/0x40
> [  251.911064]   do_idle+0x164/0x1d0
> [  251.911065]   cpu_startup_entry+0x1d/0x20
> [  251.911065]   start_secondary+0x104/0x170
> [  251.911066]   startup_32_smp+0x16b/0x16d
> [  251.911066] 
> [  251.911066] to a HARDIRQ-irq-unsafe lock:
> [  251.911067]  (&syncp->seq#2){+.-...}
> [  251.911068] ... which became HARDIRQ-irq-unsafe at:
> [  251.911068] ...  __lock_acquire+0x5d7/0x1440
> [  251.911069]   lock_acquire+0x7e/0x1a0
> [  251.911069]   rtl8169_poll+0x474/0x620 [r8169]
> [  251.911070]   net_rx_action+0x1d0/0x3f0
> [  251.911070]   __do_softirq+0x17d/0x460
> [  251.911070]   do_softirq_own_stack+0x1d/0x30
> [  251.911071]   irq_exit+0xa5/0xb0
> [  251.911071]   do_IRQ+0x58/0x110
> [  251.911072]   common_interrupt+0x36/0x3c
> [  251.911072]   cpuidle_enter_state+0xcc/0x390
> [  251.911072]   cpuidle_enter+0xf/0x20
> [  251.911073]   call_cpuidle+0x1c/0x40
> [  251.911073]   do_idle+0x164/0x1d0
> [  251.911073]   cpu_startup_entry+0x1d/0x20
> [  251.911074]   rest_init+0x10c/0x120
> [  251.911074]   start_kernel+0x30d/0x312
> [  251.911075]   i386_start_kernel+0x85/0x89
> [  251.911075]   startup_32_smp+0x16b/0x16d
> [  251.911075] 
> [  251.911076] other info that might help us debug this:
> [  251.911076] 
> [  251.911077]  Possible interrupt unsafe locking scenario:
> [  251.911077] 
> [  251.911077]        CPU0                    CPU1
> [  251.911078]        ----                    ----
> [  251.911078]   lock(&syncp->seq#2);
> [  251.911079]                                local_irq_disable();
> [  251.911080]                                lock(target_list_lock);
> [  251.911081]                                lock(&syncp->seq#2);
> [  251.911082]   <Interrupt>
> [  251.911082]     lock(target_list_lock);
> [  251.911083] 
> [  251.911084]  *** DEADLOCK ***
> [  251.911084] 
> [  251.911084] 2 locks held by kms_plane_blink/2132:
> [  251.911085]  #0:  (console_lock){+.+.+.}, at: [<c10ad3b4>] vprintk_emit+0x264/0x350
> [  251.911086]  #1:  (target_list_lock){-.....}, at: [<f858693f>] write_msg+0x3f/0xd0 [netconsole]
> [  251.911088] 
> [  251.911088] the dependencies between HARDIRQ-irq-safe lock and the holding lock:
> [  251.911089] -> (target_list_lock){-.....} ops: 482 {
> [  251.911090]    IN-HARDIRQ-W at:
> [  251.911091]                        __lock_acquire+0x6ec/0x1440
> [  251.911092]                        lock_acquire+0x7e/0x1a0
> [  251.911092]                        _raw_spin_lock_irqsave+0x2b/0x40
> [  251.911092]                        write_msg+0x3f/0xd0 [netconsole]
> [  251.911093]                        console_unlock+0x41c/0x570
> [  251.911093]                        vprintk_emit+0x26d/0x350
> [  251.911094]                        vprintk_default+0x12/0x20
> [  251.911094]                        vprintk_func+0x1c/0x50
> [  251.911095]                        printk+0xe/0x10
> [  251.911095]                        drm_printk+0x5a/0x60 [drm]
> [  251.911096]                        intel_get_hpd_pins+0x87/0xa0 [i915]
> [  251.911096]                        i9xx_hpd_irq_handler+0x96/0xe0 [i915]
> [  251.911097]                        i965_irq_handler+0x2c8/0x2f0 [i915]
> [  251.911097]                        __handle_irq_event_percpu+0x38/0x360
> [  251.911098]                        handle_irq_event_percpu+0x19/0x50
> [  251.911098]                        handle_irq_event+0x29/0x50
> [  251.911098]                        handle_edge_irq+0x65/0x110
> [  251.911099]                        handle_irq+0x9f/0xc0
> [  251.911099]                        do_IRQ+0x4f/0x110
> [  251.911100]                        common_interrupt+0x36/0x3c
> [  251.911100]                        cpuidle_enter_state+0xcc/0x390
> [  251.911101]                        cpuidle_enter+0xf/0x20
> [  251.911101]                        call_cpuidle+0x1c/0x40
> [  251.911102]                        do_idle+0x164/0x1d0
> [  251.911102]                        cpu_startup_entry+0x1d/0x20
> [  251.911102]                        start_secondary+0x104/0x170
> [  251.911103]                        startup_32_smp+0x16b/0x16d
> [  251.911103]    INITIAL USE at:
> [  251.911104]                       __lock_acquire+0x221/0x1440
> [  251.911104]                       lock_acquire+0x7e/0x1a0
> [  251.911105]                       _raw_spin_lock_irqsave+0x2b/0x40
> [  251.911105]                       make_netconsole_target+0x80/0xab [netconsole]
> [  251.911106]                       configfs_mkdir+0x312/0x3b0 [configfs]
> [  251.911106]                       vfs_mkdir+0xd2/0x1b0
> [  251.911107]                       SyS_mkdir+0x57/0xc0
> [  251.911107]                       do_fast_syscall_32+0x80/0x430
> [  251.911108]                       entry_SYSENTER_32+0x4c/0x7b
> [  251.911108]  }
> [  251.911108]  ... key      at: [<f85882f0>] target_list_lock+0x10/0xffffed20 [netconsole]
> [  251.911109]  ... acquired at:
> [  251.911109]    check_irq_usage+0x42/0xb0
> [  251.911110]    __lock_acquire+0xfbf/0x1440
> [  251.911110]    lock_acquire+0x7e/0x1a0
> [  251.911110]    rtl8169_poll+0x474/0x620 [r8169]
> [  251.911111]    netpoll_poll_dev+0xb3/0x2a0
> [  251.911111]    netpoll_send_skb_on_dev+0x180/0x240
> [  251.911112]    netpoll_send_udp+0x292/0x440
> [  251.911112]    write_msg+0x9c/0xd0 [netconsole]
> [  251.911112]    console_unlock+0x41c/0x570
> [  251.911113]    vprintk_emit+0x26d/0x350
> [  251.911113]    printk_emit+0x1b/0x1d
> [  251.911114]    devkmsg_write+0xeb/0x140
> [  251.911114]    __vfs_write+0xb0/0x110
> [  251.911114]    vfs_write+0xa1/0x1e0
> [  251.911115]    SyS_write+0x3d/0x90
> [  251.911115]    do_fast_syscall_32+0x80/0x430
> [  251.911116]    entry_SYSENTER_32+0x4c/0x7b
> [  251.911116] 
> [  251.911116] 
> [  251.911117] the dependencies between the lock to be acquired and HARDIRQ-irq-unsafe lock:
> [  251.911117] -> (&syncp->seq#2){+.-...} ops: 7439 {
> [  251.911119]    HARDIRQ-ON-W at:
> [  251.911120]                        __lock_acquire+0x5d7/0x1440
> [  251.911120]                        lock_acquire+0x7e/0x1a0
> [  251.911121]                        rtl8169_poll+0x474/0x620 [r8169]
> [  251.911121]                        net_rx_action+0x1d0/0x3f0
> [  251.911122]                        __do_softirq+0x17d/0x460
> [  251.911122]                        do_softirq_own_stack+0x1d/0x30
> [  251.911122]                        irq_exit+0xa5/0xb0
> [  251.911123]                        do_IRQ+0x58/0x110
> [  251.911123]                        common_interrupt+0x36/0x3c
> [  251.911124]                        cpuidle_enter_state+0xcc/0x390
> [  251.911124]                        cpuidle_enter+0xf/0x20
> [  251.911125]                        call_cpuidle+0x1c/0x40
> [  251.911125]                        do_idle+0x164/0x1d0
> [  251.911126]                        cpu_startup_entry+0x1d/0x20
> [  251.911126]                        rest_init+0x10c/0x120
> [  251.911127]                        start_kernel+0x30d/0x312
> [  251.911127]                        i386_start_kernel+0x85/0x89
> [  251.911127]                        startup_32_smp+0x16b/0x16d
> [  251.911128]    IN-SOFTIRQ-W at:
> [  251.911129]                        __lock_acquire+0x5b5/0x1440
> [  251.911129]                        lock_acquire+0x7e/0x1a0
> [  251.911129]                        rtl8169_poll+0x474/0x620 [r8169]
> [  251.911130]                        net_rx_action+0x1d0/0x3f0
> [  251.911130]                        __do_softirq+0x17d/0x460
> [  251.911131]                        do_softirq_own_stack+0x1d/0x30
> [  251.911131]                        irq_exit+0xa5/0xb0
> [  251.911132]                        do_IRQ+0x58/0x110
> [  251.911132]                        common_interrupt+0x36/0x3c
> [  251.911133]                        cpuidle_enter_state+0xcc/0x390
> [  251.911133]                        cpuidle_enter+0xf/0x20
> [  251.911133]                        call_cpuidle+0x1c/0x40
> [  251.911134]                        do_idle+0x164/0x1d0
> [  251.911134]                        cpu_startup_entry+0x1d/0x20
> [  251.911135]                        rest_init+0x10c/0x120
> [  251.911135]                        start_kernel+0x30d/0x312
> [  251.911136]                        i386_start_kernel+0x85/0x89
> [  251.911136]                        startup_32_smp+0x16b/0x16d
> [  251.911136]    INITIAL USE at:
> [  251.911137]                       __lock_acquire+0x221/0x1440
> [  251.911138]                       lock_acquire+0x7e/0x1a0
> [  251.911138]                       rtl8169_get_stats64+0x81/0x320 [r8169]
> [  251.911139]                       dev_get_stats+0x30/0xa0
> [  251.911139]                       rtnl_fill_stats+0x30/0x110
> [  251.911139]                       rtnl_fill_ifinfo+0x4be/0xd80
> [  251.911140]                       rtmsg_ifinfo_build_skb+0x45/0xc0
> [  251.911140]                       rtmsg_ifinfo.part.33+0xd/0x30
> [  251.911141]                       rtmsg_ifinfo+0x1b/0x20
> [  251.911141]                       register_netdevice+0x495/0x5e0
> [  251.911142]                       register_netdev+0x12/0x20
> [  251.911142]                       rtl_init_one+0x86d/0x1060 [r8169]
> [  251.911143]                       pci_device_probe+0x8b/0x100
> [  251.911143]                       driver_probe_device+0x1ad/0x2a0
> [  251.911144]                       __driver_attach+0x89/0x90
> [  251.911144]                       bus_for_each_dev+0x4f/0x80
> [  251.911144]                       driver_attach+0x14/0x20
> [  251.911145]                 
> [  251.911146] Lost 63 message(s)!
> 

Thanks for this report.

Interesting to see how many drivers got the netpoll stuff wrong :/

Can you try :

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 81f18a8335276495a59fa93219c4607c2b8a47aa..74e4c72c331d5a6cc5b653970ef4133c8ddf9999 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7668,7 +7668,7 @@ static void rtl8169_netpoll(struct net_device *dev)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
 
-	rtl8169_interrupt(tp->pci_dev->irq, dev);
+	napi_schedule(&tp->napi);
 }
 #endif
 

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-04-21 13:29   ` Eric Dumazet
@ 2017-04-21 15:09     ` Eric Dumazet
  2017-04-21 15:34       ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2017-04-21 15:09 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Eric Dumazet, David S. Miller, netdev, linux-kernel

On Fri, 2017-04-21 at 06:29 -0700, Eric Dumazet wrote:

> Thanks for this report.
> 
> Interesting to see how many drivers got the netpoll stuff wrong :/
> 
> Can you try :
> 
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 81f18a8335276495a59fa93219c4607c2b8a47aa..74e4c72c331d5a6cc5b653970ef4133c8ddf9999 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -7668,7 +7668,7 @@ static void rtl8169_netpoll(struct net_device *dev)
>  {
>  	struct rtl8169_private *tp = netdev_priv(dev);
>  
> -	rtl8169_interrupt(tp->pci_dev->irq, dev);
> +	napi_schedule(&tp->napi);

The problem is more likely that netconsole handling can call rtl_tx()
from hard irq context, while standard NAPI poll calls it from BH

Meaning that the following sequence triggers a lockdep warning.

	u64_stats_update_begin(&tp->tx_stats.syncp);
	tp->tx_stats.packets++;
	tp->tx_stats.bytes += tx_skb->skb->len;
	u64_stats_update_end(&tp->tx_stats.syncp);

Lockdep does not know that poll_napi() ( called from netpoll_poll_dev())
uses an cmpxchg() to make sure that there is no race.

I am not sure how we can teach lockdep to not splat in this case.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-04-21 15:09     ` Eric Dumazet
@ 2017-04-21 15:34       ` Eric Dumazet
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2017-04-21 15:34 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Eric Dumazet, David S. Miller, netdev, linux-kernel

On Fri, 2017-04-21 at 08:09 -0700, Eric Dumazet wrote:
> On Fri, 2017-04-21 at 06:29 -0700, Eric Dumazet wrote:
> 
> > Thanks for this report.
> > 
> > Interesting to see how many drivers got the netpoll stuff wrong :/
> > 
> > Can you try :
> > 
> > diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> > index 81f18a8335276495a59fa93219c4607c2b8a47aa..74e4c72c331d5a6cc5b653970ef4133c8ddf9999 100644
> > --- a/drivers/net/ethernet/realtek/r8169.c
> > +++ b/drivers/net/ethernet/realtek/r8169.c
> > @@ -7668,7 +7668,7 @@ static void rtl8169_netpoll(struct net_device *dev)
> >  {
> >  	struct rtl8169_private *tp = netdev_priv(dev);
> >  
> > -	rtl8169_interrupt(tp->pci_dev->irq, dev);
> > +	napi_schedule(&tp->napi);
> 
> The problem is more likely that netconsole handling can call rtl_tx()
> from hard irq context, while standard NAPI poll calls it from BH
> 
> Meaning that the following sequence triggers a lockdep warning.
> 
> 	u64_stats_update_begin(&tp->tx_stats.syncp);
> 	tp->tx_stats.packets++;
> 	tp->tx_stats.bytes += tx_skb->skb->len;
> 	u64_stats_update_end(&tp->tx_stats.syncp);
> 
> Lockdep does not know that poll_napi() ( called from netpoll_poll_dev())
> uses an cmpxchg() to make sure that there is no race.
> 
> I am not sure how we can teach lockdep to not splat in this case.

Well, could you try the following patch ?
Thanks !

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 0a8f2817ea60f2172eb28177473a4879f85bd18a..f64f812b86029b772bb245e51cdc2263adc4e6ea 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7313,10 +7313,10 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 		rtl8169_unmap_tx_skb(&tp->pci_dev->dev, tx_skb,
 				     tp->TxDescArray + entry);
 		if (status & LastFrag) {
-			u64_stats_update_begin(&tp->tx_stats.syncp);
+			u64_stats_update_begin_raw(&tp->tx_stats.syncp);
 			tp->tx_stats.packets++;
 			tp->tx_stats.bytes += tx_skb->skb->len;
-			u64_stats_update_end(&tp->tx_stats.syncp);
+			u64_stats_update_end_raw(&tp->tx_stats.syncp);
 			dev_kfree_skb_any(tx_skb->skb);
 			tx_skb->skb = NULL;
 		}

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-04-07 18:38 ` Eric Dumazet
  2017-04-07 18:42   ` David Miller
  2017-04-10 12:11   ` Ville Syrjälä
@ 2017-06-19 14:44   ` Michal Kubecek
  2017-07-05 19:53     ` Ville Syrjälä
  2 siblings, 1 reply; 21+ messages in thread
From: Michal Kubecek @ 2017-06-19 14:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ville Syrjälä,
	Eric Dumazet, David S. Miller, netdev, linux-kernel,
	Francois Romieu, whiteheadm

On Fri, Apr 07, 2017 at 11:38:49AM -0700, Eric Dumazet wrote:
> On Fri, 2017-04-07 at 21:17 +0300, Ville Syrjälä wrote:
> > Hi,
> > 
> > My old P3 laptop started to die on me in the middle of larger compile
> > jobs (using distcc) after v4.11-rc<something>. I bisected the problem
> > to 617f01211baf ("8139too: use napi_complete_done()").
> > 
> > Unfortunately I wasn't able to capture a full oops as the machine doesn't
> > have serial and ramoops failed me. I did get one partial oops on vgacon
> > which showed rtl8139_poll() being involved (EIP was around
> > _raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
> > bisect result.
> > 
> > So maybe some kind of nasty thing going between the hard irq and
> > softirq? Perhaps UP related? I tried to stare at the locking around
> > rtl8139_poll() for a while but it looked mostly sane to me.
> > 
> 
> Thanks a lot for the detective work, I am so sorry for this !
> 
> Could you try the following patch ?
> 
> I do not really see what could be wrong, the code should run just fine
> on UP.
> 
> Thanks.
> 
> diff --git a/drivers/net/ethernet/realtek/8139too.c b/drivers/net/ethernet/realtek/8139too.c
> index 89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673 100644
> --- a/drivers/net/ethernet/realtek/8139too.c
> +++ b/drivers/net/ethernet/realtek/8139too.c
> @@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, int budget)
>  	if (likely(RTL_R16(IntrStatus) & RxAckBits))
>  		work_done += rtl8139_rx(dev, tp, budget);
>  
> -	if (work_done < budget && napi_complete_done(napi, work_done)) {
> +	if (work_done < budget) {
>  		unsigned long flags;
>  
>  		spin_lock_irqsave(&tp->lock, flags);
> -		RTL_W16_F(IntrMask, rtl8139_intr_mask);
> +		if (napi_complete_done(napi, work_done))
> +			RTL_W16_F(IntrMask, rtl8139_intr_mask);
>  		spin_unlock_irqrestore(&tp->lock, flags);
>  	}
>  	spin_unlock(&tp->rx_lock);

Eric,

we have a bugreport of what seems to be the same problem:

  https://bugzilla.suse.com/show_bug.cgi?id=1042208

Do you plan to submit the patch above or is the conclusion that this is
rather a hardware problem?

Michal Kubecek

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-06-19 14:44   ` [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()") Michal Kubecek
@ 2017-07-05 19:53     ` Ville Syrjälä
  0 siblings, 0 replies; 21+ messages in thread
From: Ville Syrjälä @ 2017-07-05 19:53 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: Eric Dumazet, Eric Dumazet, David S. Miller, netdev,
	linux-kernel, Francois Romieu, whiteheadm

On Mon, Jun 19, 2017 at 04:44:45PM +0200, Michal Kubecek wrote:
> On Fri, Apr 07, 2017 at 11:38:49AM -0700, Eric Dumazet wrote:
> > On Fri, 2017-04-07 at 21:17 +0300, Ville Syrjälä wrote:
> > > Hi,
> > > 
> > > My old P3 laptop started to die on me in the middle of larger compile
> > > jobs (using distcc) after v4.11-rc<something>. I bisected the problem
> > > to 617f01211baf ("8139too: use napi_complete_done()").
> > > 
> > > Unfortunately I wasn't able to capture a full oops as the machine doesn't
> > > have serial and ramoops failed me. I did get one partial oops on vgacon
> > > which showed rtl8139_poll() being involved (EIP was around
> > > _raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
> > > bisect result.
> > > 
> > > So maybe some kind of nasty thing going between the hard irq and
> > > softirq? Perhaps UP related? I tried to stare at the locking around
> > > rtl8139_poll() for a while but it looked mostly sane to me.
> > > 
> > 
> > Thanks a lot for the detective work, I am so sorry for this !
> > 
> > Could you try the following patch ?
> > 
> > I do not really see what could be wrong, the code should run just fine
> > on UP.
> > 
> > Thanks.
> > 
> > diff --git a/drivers/net/ethernet/realtek/8139too.c b/drivers/net/ethernet/realtek/8139too.c
> > index 89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673 100644
> > --- a/drivers/net/ethernet/realtek/8139too.c
> > +++ b/drivers/net/ethernet/realtek/8139too.c
> > @@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, int budget)
> >  	if (likely(RTL_R16(IntrStatus) & RxAckBits))
> >  		work_done += rtl8139_rx(dev, tp, budget);
> >  
> > -	if (work_done < budget && napi_complete_done(napi, work_done)) {
> > +	if (work_done < budget) {
> >  		unsigned long flags;
> >  
> >  		spin_lock_irqsave(&tp->lock, flags);
> > -		RTL_W16_F(IntrMask, rtl8139_intr_mask);
> > +		if (napi_complete_done(napi, work_done))
> > +			RTL_W16_F(IntrMask, rtl8139_intr_mask);
> >  		spin_unlock_irqrestore(&tp->lock, flags);
> >  	}
> >  	spin_unlock(&tp->rx_lock);
> 
> Eric,
> 
> we have a bugreport of what seems to be the same problem:
> 
>   https://bugzilla.suse.com/show_bug.cgi?id=1042208
> 
> Do you plan to submit the patch above or is the conclusion that this is
> rather a hardware problem?

Could someone please push this patch forward? My machine just died again
with a fresh kernel because I forgot that I still need this patch.

-- 
Ville Syrjälä
Intel OTC

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-04-10 12:11   ` Ville Syrjälä
@ 2017-09-18 19:46     ` Ville Syrjälä
  2017-09-18 19:52       ` Eric Dumazet
  2017-09-18 20:03       ` [PATCH net] 8139too: revisit napi_complete_done() usage Eric Dumazet
  0 siblings, 2 replies; 21+ messages in thread
From: Ville Syrjälä @ 2017-09-18 19:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Eric Dumazet, David S. Miller, netdev, linux-kernel

On Mon, Apr 10, 2017 at 03:11:02PM +0300, Ville Syrjälä wrote:
> On Fri, Apr 07, 2017 at 11:38:49AM -0700, Eric Dumazet wrote:
> > On Fri, 2017-04-07 at 21:17 +0300, Ville Syrjälä wrote:
> > > Hi,
> > > 
> > > My old P3 laptop started to die on me in the middle of larger compile
> > > jobs (using distcc) after v4.11-rc<something>. I bisected the problem
> > > to 617f01211baf ("8139too: use napi_complete_done()").
> > > 
> > > Unfortunately I wasn't able to capture a full oops as the machine doesn't
> > > have serial and ramoops failed me. I did get one partial oops on vgacon
> > > which showed rtl8139_poll() being involved (EIP was around
> > > _raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
> > > bisect result.
> > > 
> > > So maybe some kind of nasty thing going between the hard irq and
> > > softirq? Perhaps UP related? I tried to stare at the locking around
> > > rtl8139_poll() for a while but it looked mostly sane to me.
> > > 
> > 
> > Thanks a lot for the detective work, I am so sorry for this !
> > 
> > Could you try the following patch ?
> > 
> > I do not really see what could be wrong, the code should run just fine
> > on UP.
> > 
> > Thanks.
> > 
> > diff --git a/drivers/net/ethernet/realtek/8139too.c b/drivers/net/ethernet/realtek/8139too.c
> > index 89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673 100644
> > --- a/drivers/net/ethernet/realtek/8139too.c
> > +++ b/drivers/net/ethernet/realtek/8139too.c
> > @@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, int budget)
> >  	if (likely(RTL_R16(IntrStatus) & RxAckBits))
> >  		work_done += rtl8139_rx(dev, tp, budget);
> >  
> > -	if (work_done < budget && napi_complete_done(napi, work_done)) {
> > +	if (work_done < budget) {
> >  		unsigned long flags;
> >  
> >  		spin_lock_irqsave(&tp->lock, flags);
> > -		RTL_W16_F(IntrMask, rtl8139_intr_mask);
> > +		if (napi_complete_done(napi, work_done))
> > +			RTL_W16_F(IntrMask, rtl8139_intr_mask);
> >  		spin_unlock_irqrestore(&tp->lock, flags);
> >  	}
> >  	spin_unlock(&tp->rx_lock);
> > 
> > 
> 
> Yep, that patch does appear to make it stable again.
> 
> Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>

And five months later I'm still waiting for this patch to land...

-- 
Ville Syrjälä
Intel OTC

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-09-18 19:46     ` Ville Syrjälä
@ 2017-09-18 19:52       ` Eric Dumazet
  2017-09-19 12:45         ` Ville Syrjälä
  2017-09-18 20:03       ` [PATCH net] 8139too: revisit napi_complete_done() usage Eric Dumazet
  1 sibling, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2017-09-18 19:52 UTC (permalink / raw)
  To: Ville Syrjälä; +Cc: Eric Dumazet, David S. Miller, netdev, LKML

On Mon, Sep 18, 2017 at 12:46 PM, Ville Syrjälä
<ville.syrjala@linux.intel.com> wrote:

> And five months later I'm still waiting for this patch to land...

Sure, while I was on vacation, I received thousands of emails that I
simply have not read.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH net] 8139too: revisit napi_complete_done() usage
  2017-09-18 19:46     ` Ville Syrjälä
  2017-09-18 19:52       ` Eric Dumazet
@ 2017-09-18 20:03       ` Eric Dumazet
  2017-09-19  3:57         ` David Miller
  1 sibling, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2017-09-18 20:03 UTC (permalink / raw)
  To: Ville Syrjälä; +Cc: Eric Dumazet, David S. Miller, netdev

From: Eric Dumazet <edumazet@google.com>

It seems we have to be more careful in napi_complete_done()
use. This patch is not a revert, as it seems we can
avoid bug that Ville reported by moving the napi_complete_done()
test in the spinlock section.

Many thanks to Ville for detective work and all tests.

Fixes: 617f01211baf ("8139too: use napi_complete_done()")
Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
 drivers/net/ethernet/realtek/8139too.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/realtek/8139too.c b/drivers/net/ethernet/realtek/8139too.c
index 89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673 100644
--- a/drivers/net/ethernet/realtek/8139too.c
+++ b/drivers/net/ethernet/realtek/8139too.c
@@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, int budget)
 	if (likely(RTL_R16(IntrStatus) & RxAckBits))
 		work_done += rtl8139_rx(dev, tp, budget);
 
-	if (work_done < budget && napi_complete_done(napi, work_done)) {
+	if (work_done < budget) {
 		unsigned long flags;
 
 		spin_lock_irqsave(&tp->lock, flags);
-		RTL_W16_F(IntrMask, rtl8139_intr_mask);
+		if (napi_complete_done(napi, work_done))
+			RTL_W16_F(IntrMask, rtl8139_intr_mask);
 		spin_unlock_irqrestore(&tp->lock, flags);
 	}
 	spin_unlock(&tp->rx_lock);

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH net] 8139too: revisit napi_complete_done() usage
  2017-09-18 20:03       ` [PATCH net] 8139too: revisit napi_complete_done() usage Eric Dumazet
@ 2017-09-19  3:57         ` David Miller
  2018-01-22 17:27           ` tedheadster
  0 siblings, 1 reply; 21+ messages in thread
From: David Miller @ 2017-09-19  3:57 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ville.syrjala, edumazet, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 18 Sep 2017 13:03:43 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> It seems we have to be more careful in napi_complete_done()
> use. This patch is not a revert, as it seems we can
> avoid bug that Ville reported by moving the napi_complete_done()
> test in the spinlock section.
> 
> Many thanks to Ville for detective work and all tests.
> 
> Fixes: 617f01211baf ("8139too: use napi_complete_done()")
> Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>

Applied and queued up for -stable.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-09-18 19:52       ` Eric Dumazet
@ 2017-09-19 12:45         ` Ville Syrjälä
  2017-09-19 13:51           ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Ville Syrjälä @ 2017-09-19 12:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Eric Dumazet, David S. Miller, netdev, LKML

On Mon, Sep 18, 2017 at 12:52:15PM -0700, Eric Dumazet wrote:
> On Mon, Sep 18, 2017 at 12:46 PM, Ville Syrjälä
> <ville.syrjala@linux.intel.com> wrote:
> 
> > And five months later I'm still waiting for this patch to land...
> 
> Sure, while I was on vacation, I received thousands of emails that I
> simply have not read.

Ah that explains where my previous ping went. Pardon the snarky comment,
and thanks for taking care of this.

-- 
Ville Syrjälä
Intel OTC

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
  2017-09-19 12:45         ` Ville Syrjälä
@ 2017-09-19 13:51           ` Eric Dumazet
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2017-09-19 13:51 UTC (permalink / raw)
  To: Ville Syrjälä; +Cc: Eric Dumazet, David S. Miller, netdev, LKML

On Tue, 2017-09-19 at 15:45 +0300, Ville Syrjälä wrote:
> On Mon, Sep 18, 2017 at 12:52:15PM -0700, Eric Dumazet wrote:
> > On Mon, Sep 18, 2017 at 12:46 PM, Ville Syrjälä
> > <ville.syrjala@linux.intel.com> wrote:
> > 
> > > And five months later I'm still waiting for this patch to land...
> > 
> > Sure, while I was on vacation, I received thousands of emails that I
> > simply have not read.
> 
> Ah that explains where my previous ping went. Pardon the snarky comment,
> and thanks for taking care of this.
> 

No problem, this was my fault not following this bug.

Thanks a lot !

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] 8139too: revisit napi_complete_done() usage
  2017-09-19  3:57         ` David Miller
@ 2018-01-22 17:27           ` tedheadster
  2018-01-22 19:03             ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: tedheadster @ 2018-01-22 17:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: ville.syrjala, netdev

On Mon, Sep 18, 2017 at 11:57 PM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 18 Sep 2017 13:03:43 -0700
>
>> From: Eric Dumazet <edumazet@google.com>
>>
>> It seems we have to be more careful in napi_complete_done()
>> use. This patch is not a revert, as it seems we can
>> avoid bug that Ville reported by moving the napi_complete_done()
>> test in the spinlock section.
>>
>> Many thanks to Ville for detective work and all tests.
>>
>> Fixes: 617f01211baf ("8139too: use napi_complete_done()")
>> Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
>> Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
>
> Applied and queued up for -stable.

Eric,
  sorry to bring up this old thread, but I had a question. Do we have
to surround most usage of napi_complete_done() with a spinlock, or was
this problem restricted to just the 8139too driver?

- Matthew Whitehead

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] 8139too: revisit napi_complete_done() usage
  2018-01-22 17:27           ` tedheadster
@ 2018-01-22 19:03             ` Eric Dumazet
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2018-01-22 19:03 UTC (permalink / raw)
  To: whiteheadm; +Cc: ville.syrjala, netdev

On Mon, 2018-01-22 at 12:27 -0500, tedheadster wrote:
> On Mon, Sep 18, 2017 at 11:57 PM, David Miller <davem@davemloft.net> wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Mon, 18 Sep 2017 13:03:43 -0700
> > 
> > > From: Eric Dumazet <edumazet@google.com>
> > > 
> > > It seems we have to be more careful in napi_complete_done()
> > > use. This patch is not a revert, as it seems we can
> > > avoid bug that Ville reported by moving the napi_complete_done()
> > > test in the spinlock section.
> > > 
> > > Many thanks to Ville for detective work and all tests.
> > > 
> > > Fixes: 617f01211baf ("8139too: use napi_complete_done()")
> > > Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > 
> > Applied and queued up for -stable.
> 
> Eric,
>   sorry to bring up this old thread, but I had a question. Do we have
> to surround most usage of napi_complete_done() with a spinlock, or was
> this problem restricted to just the 8139too driver?


I am not aware of any other NIC driver having a problem there.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2018-01-22 19:03 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-07 18:17 [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()") Ville Syrjälä
2017-04-07 18:38 ` Eric Dumazet
2017-04-07 18:42   ` David Miller
2017-04-08 10:23     ` Francois Romieu
2017-04-10 14:22       ` tedheadster
2017-04-10 14:55         ` Eric Dumazet
2017-04-10 12:11   ` Ville Syrjälä
2017-09-18 19:46     ` Ville Syrjälä
2017-09-18 19:52       ` Eric Dumazet
2017-09-19 12:45         ` Ville Syrjälä
2017-09-19 13:51           ` Eric Dumazet
2017-09-18 20:03       ` [PATCH net] 8139too: revisit napi_complete_done() usage Eric Dumazet
2017-09-19  3:57         ` David Miller
2018-01-22 17:27           ` tedheadster
2018-01-22 19:03             ` Eric Dumazet
2017-06-19 14:44   ` [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()") Michal Kubecek
2017-07-05 19:53     ` Ville Syrjälä
2017-04-21 11:40 ` Ville Syrjälä
2017-04-21 13:29   ` Eric Dumazet
2017-04-21 15:09     ` Eric Dumazet
2017-04-21 15:34       ` Eric Dumazet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.