Re: [Kgdb-bugreport] [PATCH 2.6.20-rc7] 8139too KGDBoE fix


* Re: [Kgdb-bugreport] [PATCH 2.6.20-rc7] 8139too KGDBoE fix
       [not found] <1172746367.2515.31.camel@xenon>
@ 2007-03-01 16:22 ` Sergei Shtylyov
  0 siblings, 0 replies; 13+ messages in thread
From: Sergei Shtylyov @ 2007-03-01 16:22 UTC (permalink / raw)
  To: komal; +Cc: kgdb-bugreport, netdev, Mark Huth

komal wrote:
> Hi all,

> 	As the discussion was going on about the effects of trapping the
> netpoll queue during KGDBoE debugging, I tried avoiding it. So in
> eth_pre_exception_handler() I did not set net_poll_trap to 1 and did not
> reset it back to 0 in eth_post_exception_handler() 

> file drivers/net/kgdboe.c

> static void eth_pre_exception_handler(void)
> {
>         /* Increment the module count when the debugger is active */
>         if (!kgdb_connected)
>                 try_module_get(THIS_MODULE);
> //      netpoll_set_trap(1);
> }
> 
> static void eth_post_exception_handler(void)
> {
>         /* decrement the module count when the debugger detaches */
>         if (!kgdb_connected)
>                 module_put(THIS_MODULE);
> //      netpoll_set_trap(0);
> }

    I'm afraid that was the wrong thing to do. We were talking only about 
disabling the CONFIG_NETPOLL_TRAP option.
   (BTW, I don't see how CONFIG_NETPOLL_RX could actually influence anything -- 
looks like it may just be removed completely.)

> 	As i started testing KGDBoe, 1st time I did 
> 	(gdb)info threads
> 	and it worked, but after that I set breakpoint
> 	(gdb)break link_path_walk
> 	after this any of the commands were not working and the test machine
> went in hang state. 
> 	To provide more information, I am working on an i386/x86_64 Athlon box,
> using 2.6.17 kernel and 100Mbps, full-duplex, VIA Rhine network card.

    This driver also seems prone to TX queue overwrites since its TX ring size 
is only 16.

> -Regards
>  Komal Nawandar

WBR, Sergei

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Kgdb-bugreport] [PATCH 2.6.20-rc7] 8139too KGDBoE fix
  2007-03-14 14:04         ` Sergei Shtylyov
@ 2007-03-14 21:40           ` Sergei Shtylyov
  0 siblings, 0 replies; 13+ messages in thread
From: Sergei Shtylyov @ 2007-03-14 21:40 UTC (permalink / raw)
  To: netdev; +Cc: Amit S. Kale, Mithlesh Thukral, Vitaly Wool, Mark Huth

Hello, I wrote:

>>> This thread came up on kgdb-bugreport mailing list. Could you please 
>>> suggest us what's the correct way of fixing this problem?

>>> 1. When running a kgdb on RTL8139 ethernet interface: 8139too driver 
>>> prints too many "Out-of-sync dirty pointer" messages on console and 
>>> gdb can't connect to kgdb stub. These messages can be suppressed, 
>>> though it still results in connection failures frequently. 

>>> 2. Here is how kgdb uses polling mechanism for communication to gdb.  
>>> kgdb calls netpoll_set_trap(1) just before entering a loop where it 
>>> communicates to gdb. It calls netpoll_set_trap(0) after it is done 
>>> and wants to resume a kernel. The communication to gdb goes through 
>>> netpoll_poll (which calls kgdb rx_hook) and netpoll_send_udp functions.

>>> 3. A queue for an interface may have been stopped by its driver by 
>>> calling netif_stop_queue. After this if kgdb attempts to enter 
>>> communication with gdb, it'll call netpoll_set_trap(1), after which 
>>> the queue can't be started again. This is a potential deadlock 
>>> situation. Is there a way out of this?

    No way other than to at least "emulate" the queue controls...

>>> 4. Is it necessary to call netpoll_set_trap(1) at all before entering 
>>> gdb communication loop? Even if a driver stops the queue in middle of 
>>> the communication netpoll_poll and netpoll_send_udp calls can recover 
>>> from that by calling driver's interrupt and poll routines. Is this a 
>>> valid statement?

    It seems that having queue control work as usual is dangerous while 
KGDB is active: it leads to wake_softirqd() being called, which seems 
undesirable (there has been a report of an eventual lockup while trying to 
take the runqueue lock).

>>    I'd like to return to this again (having received no feedback)...
>>    The idea is to change how CONFIG_NETPOLL_TRAP is implemented: 
>> instead of
>> completely bypassing queue locking after netpoll_set_trap(1) has been 
>> called, how about we set and check some other flag (internal to 
>> netpoll) telling it that the queue is frozen, i.e. watch the queue 
>> state using a separate mechanism when traffic trapping is engaged?  
>> This certainly 

>    Well, this certainly won't work, as the bit should be tied to struct 
> net_device.

    Well, I hadn't yet discovered npinfo member of net_device before saying 
that. :-)

>>> Thanks a lot.
>>> -Amit

WBR, Sergei


* Re: [Kgdb-bugreport] [PATCH 2.6.20-rc7] 8139too KGDBoE fix
  2007-03-14 13:42       ` Sergei Shtylyov
@ 2007-03-14 14:04         ` Sergei Shtylyov
  2007-03-14 21:40           ` Sergei Shtylyov
  0 siblings, 1 reply; 13+ messages in thread
From: Sergei Shtylyov @ 2007-03-14 14:04 UTC (permalink / raw)
  To: netdev; +Cc: Amit S. Kale, Mithlesh Thukral, Vitaly Wool, Mark Huth

Hello, I wrote:

>> This thread came up on kgdb-bugreport mailing list. Could you please 
>> suggest us what's the correct way of fixing this problem?

>> 1. When running a kgdb on RTL8139 ethernet interface: 8139too driver 
>> prints too many "Out-of-sync dirty pointer" messages on console and 
>> gdb can't connect to kgdb stub. These messages can be suppressed, 
>> though it still results in connection failures frequently. 

>> 2. Here is how kgdb uses polling mechanism for communication to gdb.  
>> kgdb calls netpoll_set_trap(1) just before entering a loop where it 
>> communicates to gdb. It calls netpoll_set_trap(0) after it is done and 
>> wants to resume a kernel. The communication to gdb goes through 
>> netpoll_poll (which calls kgdb rx_hook) and netpoll_send_udp functions.

>> 3. A queue for an interface may have been stopped by its driver by 
>> calling netif_stop_queue. After this if kgdb attempts to enter 
>> communication with gdb, it'll call netpoll_set_trap(1), after which 
>> the queue can't be started again. This is a potential deadlock 
>> situation. Is there a way out of this?

>> 4. Is it necessary to call netpoll_set_trap(1) at all before entering 
>> gdb communication loop? Even if a driver stops the queue in middle of 
>> the communication netpoll_poll and netpoll_send_udp calls can recover 
>> from that by calling driver's interrupt and poll routines. Is this a 
>> valid statement?

>    I'd like to return to this again (having received no feedback)...
>    The idea is to change how CONFIG_NETPOLL_TRAP is implemented: instead of
> completely bypassing queue locking after netpoll_set_trap(1) has been 
> called, how about we set and check some other flag (internal to netpoll) 
> telling it that the queue is frozen, i.e. watch the queue state using a 
> separate mechanism when traffic trapping is engaged?  This certainly 

    Well, this certainly won't work, as the bit should be tied to struct 
net_device.  The first idea was more sound: just set/reset the 
__LINK_STATE_XOFF flag without calling __netif_schedule(), i.e. remove the 
#ifdef from netif_stop_queue() and replace the return statement in 
netif_wake_queue() with clear_bit(__LINK_STATE_XOFF, &dev->state).
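
    The change described above can be modelled in userspace roughly as 
follows. This is only a sketch under stated assumptions, not kernel code: 
sketch_dev, sketch_trap and the sketch_*() functions are hypothetical 
stand-ins for struct net_device, netpoll_trap() and the netif_*() helpers, 
and the counter stands in for __netif_schedule(). It assumes that the normal 
wake path clears the flag and schedules the device only if the flag was set.

```c
#include <stdbool.h>

#define SKETCH_XOFF 0x1UL       /* models the __LINK_STATE_XOFF bit */

/* Hypothetical stand-in for the relevant part of struct net_device. */
struct sketch_dev {
	unsigned long state;
	int schedule_calls;     /* counts modelled __netif_schedule() calls */
};

bool sketch_trap;               /* models netpoll_trap() returning non-zero */

/* Stop path with the #ifdef removed: the flag is always set. */
void sketch_stop_queue(struct sketch_dev *dev)
{
	dev->state |= SKETCH_XOFF;
}

bool sketch_queue_stopped(const struct sketch_dev *dev)
{
	return (dev->state & SKETCH_XOFF) != 0;
}

/* Wake path: while trapping, only clear the flag and skip the schedule. */
void sketch_wake_queue(struct sketch_dev *dev)
{
	bool was_stopped = sketch_queue_stopped(dev);

	dev->state &= ~SKETCH_XOFF;
	if (sketch_trap)
		return;
	if (was_stopped)
		dev->schedule_calls++;
}
```

    In this model the stopped state stays visible to a polling sender while 
trapping is engaged, yet nothing is scheduled until trapping ends.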

> would avoid TX queue overflows in drivers while also avoiding any 
> dev->state changes and even worse evil __netif_schedule() call, i.e. 
> things that CONFIG_NETPOLL_TRAP is currently trying to avoid, AFAIU...

    I think I'll submit a patch -- netpoll traffic trapping is pretty broken 
as it is now.

>> Thanks a lot.
>> -Amit

WBR, Sergei


* Re: [Kgdb-bugreport] [PATCH 2.6.20-rc7] 8139too KGDBoE fix
  2007-02-23  7:08     ` Amit S. Kale
  2007-02-23 18:10       ` Mark Huth
@ 2007-03-14 13:42       ` Sergei Shtylyov
  2007-03-14 14:04         ` Sergei Shtylyov
  1 sibling, 1 reply; 13+ messages in thread
From: Sergei Shtylyov @ 2007-03-14 13:42 UTC (permalink / raw)
  To: Amit S. Kale; +Cc: netdev, Mithlesh Thukral, Vitaly Wool, Mark Huth

Hello.

Amit S. Kale wrote:

> This thread came up on kgdb-bugreport mailing list. Could you please suggest 
> us what's the correct way of fixing this problem?

> 1. When running a kgdb on RTL8139 ethernet interface: 8139too driver prints 
> too many "Out-of-sync dirty pointer" messages on console and gdb can't 
> connect to kgdb stub. These messages can be suppressed, though it still 
> results in connection failures frequently. 

> 2. Here is how kgdb uses polling mechanism for communication to gdb.  kgdb 
> calls netpoll_set_trap(1) just before entering a loop where it communicates 
> to gdb. It calls netpoll_set_trap(0) after it is done and wants to resume a 
> kernel. The communication to gdb goes through netpoll_poll (which calls kgdb 
> rx_hook) and netpoll_send_udp functions.

> 3. A queue for an interface may have been stopped by its driver by calling 
> netif_stop_queue. After this if kgdb attempts to enter communication with 
> gdb, it'll call netpoll_set_trap(1), after which the queue can't be started 
> again. This is a potential deadlock situation. Is there a way out of this?

> 4. Is it necessary to call netpoll_set_trap(1) at all before entering gdb 
> communication loop? Even if a driver stops the queue in middle of the 
> communication netpoll_poll and netpoll_send_udp calls can recover from that 
> by calling driver's interrupt and poll routines. Is this a valid statement?

    I'd like to return to this again (having received no feedback)...
    The idea is to change how CONFIG_NETPOLL_TRAP is implemented: instead of
completely bypassing queue locking after netpoll_set_trap(1) has been called, 
how about we set and check some other flag (internal to netpoll) telling it 
that the queue is frozen, i.e. watch the queue state using a separate 
mechanism when traffic trapping is engaged?  This would certainly avoid TX 
queue overflows in drivers while also avoiding any dev->state changes and the 
even more evil __netif_schedule() call, i.e. the things that 
CONFIG_NETPOLL_TRAP is currently trying to avoid, AFAIU...

> Thanks a lot.
> -Amit

WBR, Sergei


* Re: [Kgdb-bugreport] [PATCH 2.6.20-rc7] 8139too KGDBoE fix
  2007-02-23 19:04         ` Stephen Hemminger
  2007-02-23 19:09           ` Sergei Shtylyov
@ 2007-02-23 20:34           ` Mark Huth
  1 sibling, 0 replies; 13+ messages in thread
From: Mark Huth @ 2007-02-23 20:34 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Amit S. Kale, netdev, Sergei Shtylyov, Mithlesh Thukral, Vitaly Wool

Stephen Hemminger wrote:
> On Fri, 23 Feb 2007 11:10:40 -0700
> Mark Huth <mhuth@mvista.com> wrote:
>
>   
>> Amit S. Kale wrote:
>>     
>>> Hi Net Gurus,
>>>
>>> This thread came up on kgdb-bugreport mailing list. Could you please suggest 
>>> us what's the correct way of fixing this problem?
>>>
>>> 1. When running a kgdb on RTL8139 ethernet interface: 8139too driver prints 
>>> too many "Out-of-sync dirty pointer" messages on console and gdb can't 
>>> connect to kgdb stub. These messages can be suppressed, though it still 
>>> results in connection failures frequently. 
>>>   
>>>       
>> We think this comes from calling the driver while the queue is stopped.  
>> Drivers should not do horrible things when hard start is called with the 
>> queue stopped, but unfortunately, at this time, at least some drivers 
>> do  explode or complain under that condition.
>>     
>
> The kernel is built on a set of assumptions about calling context. Your
> out of tree code is violating one of them. Why not check for stopped queue
> and do some action to try and clear it, that is what netconsole does.
>   
Yes, of course.  This is just an incidental thing that happens because 
of the real problem, which is the use of CONFIG_NETPOLL_TRAP in the 
netif_stop/wake_queue routines.  Information about the necessity of that 
code would be appreciated, because when that option is selected, the 
queue management interface is squashed: the device driver thinks the 
queue is stopped, but the flag for that does not get changed, and 
device drivers then either panic or complain.

AFAIK, NETPOLL_RX is not used at all, and NETPOLL_TRAP is only used in 
netdevice.h to turn off the transmit flow control/queue management 
function.  Netpoll already bypasses the actual queue, but it does try to 
honor the queue state.  However, KGDBOE breaks the queue state 
management by selecting NETPOLL_TRAP. 

This is not exactly out of tree code, because netpoll is the entity that 
calls the driver leading to errors and worse from the drivers.  And KGDB 
is from the community tree.  We're just trying to make it work, and the 
patches will be returned when we figure this out.  We're also trying to 
get this to work with the RT stuff, which creates another whole set of 
problems due to major semantic changes.  However, it looks like the 
latest netpoll code should be okay wrt RT.

And I remain of the opinion that a device driver ought not panic or 
corrupt data, or anything else obnoxious given a hard_start call at the 
wrong time, but that's another battle for another day.
>>> 2. Here is how kgdb uses polling mechanism for communication to gdb.  kgdb 
>>> calls netpoll_set_trap(1) just before entering a loop where it communicates 
>>> to gdb. It calls netpoll_set_trap(0) after it is done and wants to resume a 
>>> kernel. The communication to gdb goes through netpoll_poll (which calls kgdb 
>>> rx_hook) and netpoll_send_udp functions.
>>>
>>> 3. A queue for an interface may have been stopped by its driver by calling 
>>> netif_stop_queue. After this if kgdb attempts to enter communication with 
>>> gdb, it'll call netpoll_set_trap(1), after which the queue can't be started 
>>> again. This is a potential deadlock situation. Is there a way out of this?
>>>   
>>>       
>> We are trying without setting the CONFIG_NETPOLL_TRAP option.  This 
>> option is what turns off the function of the netif_stop/wake_queue 
>> calls, which breaks the usual flow control mechanism used by netpoll 
>> transmit function.  It also prevents the netif_schedule call, which 
>> puts the device on the tx softirq queue.  However, in the case where 
>> interrupts are off and scheduling is not allowed - which would be the 
>> netpoll_set_trap(1) condition, the softirq will not run until netpoll is 
>> done and the user of netpoll returns the system to normal operation.  So 
>> I am unclear that allowing the schedule is a problem.  There may be some 
>> obscure race conditions on smp, so we are trying to analyze that part, 
>> but for the moment are testing with the netif_schedule call allowed in 
>> the event of queuing the device.
>>     
>>> 4. Is it necessary to call netpoll_set_trap(1) at all before entering gdb 
>>> communication loop? Even if a driver stops the queue in middle of the 
>>> communication netpoll_poll and netpoll_send_udp calls can recover from that 
>>> by calling driver's interrupt and poll routines. Is this a valid statement?
>>>   
>>>       
>> netpoll_set_trap() is necessary, as it informs the netpoll code to 
>> respond to arp requests on behalf of the netpoll user, as well as making 
>> sure that skbs are freed without needing the completion queue stuff to 
>> run (I think)
>>     
>>> Thanks a lot.
>>> -Amit
>>>       
>
>
>   



* Re: [Kgdb-bugreport] [PATCH 2.6.20-rc7] 8139too KGDBoE fix
  2007-02-23 19:22                 ` Stephen Hemminger
@ 2007-02-23 19:27                   ` Sergei Shtylyov
  0 siblings, 0 replies; 13+ messages in thread
From: Sergei Shtylyov @ 2007-02-23 19:27 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Mark Huth, Amit S. Kale, netdev, Mithlesh Thukral, Vitaly Wool

Stephen Hemminger wrote:

>>>>>>>This thread came up on kgdb-bugreport mailing list. Could you please suggest 
>>>>>>>us what's the correct way of fixing this problem?

>>>>>>>1. When running a kgdb on RTL8139 ethernet interface: 8139too driver prints 
>>>>>>>too many "Out-of-sync dirty pointer" messages on console and gdb can't 
>>>>>>>connect to kgdb stub. These messages can be suppressed, though it still 
>>>>>>>results in connection failures frequently. 

>>>>>>We think this comes from calling the driver while the queue is stopped.  
>>>>>>Drivers should not do horrible things when hard start is called with the 
>>>>>>queue stopped, but unfortunately, at this time, at least some drivers 
>>>>>>do  explode or complain under that condition.

>>>>>The kernel is built on a set of assumptions about calling context. Your
>>>>>out of tree code is violating one of them. Why not check for stopped queue
>>>>>and do some action to try and clear it, that is what netconsole does.

>>>>   The queue can't be stopped when netpoll traffic trapping is enabled 
>>>>(because this effectively bypasses queue control), so the stopped-queue 
>>>>indication doesn't work either -- *that* is the problem. It's not at all 
>>>>specific to KGDBoE -- only to traffic trapping.

>>>You can't ask a device to send a packet when it has no resources.

>>    When traffic trapping is enabled and the driver stops the queue, the 
>>__LINK_STATE_XOFF flag does *not* get set, so netif_queue_stopped() returns 
>>*zero*.  What can be done in this situation?

> Read netpoll_send_skb()

> 	int status = NETDEV_TX_BUSY;
> ...
> 		if (netif_tx_trylock(dev)) {
> 			/* try until next clock tick */
> 			for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;
> 					tries > 0; --tries) {
> 				if (!netif_queue_stopped(dev))
> 					status = dev->hard_start_xmit(skb, dev);
> 
> 				if (status == NETDEV_TX_OK)
> 					break;
> 
> 				/* tickle device maybe there is some cleanup */
> 				netpoll_poll(np);
> 
> 				udelay(USEC_PER_POLL);
> 			}
> 			netif_tx_unlock(dev);

> netpoll_poll() allows the device to try to clean up transmit resources.

    Read <linux/netdevice.h>:

static inline void netif_stop_queue(struct net_device *dev)
{
#ifdef CONFIG_NETPOLL_TRAP
         if (netpoll_trap())
                 return;
#endif
         set_bit(__LINK_STATE_XOFF, &dev->state);
}

static inline int netif_queue_stopped(const struct net_device *dev)
{
         return test_bit(__LINK_STATE_XOFF, &dev->state);
}

    When the driver calls netif_stop_queue() with its TX queue filled to the 
brim (4 buffers in the case of 8139too) and netpoll_trap() returns 1, what will 
happen?
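
    To make the question concrete, here is a minimal userspace model of the 
two helpers quoted above. It is a sketch under stated assumptions, not 
kernel code: model_dev, model_trap and the model_*() functions are 
hypothetical names, the 4-entry ring matches the 8139too figure mentioned 
above, and the "driver" merely counts transmit calls that arrive when no 
descriptor is free.

```c
#include <stdbool.h>

#define MODEL_XOFF 0x1UL        /* models the __LINK_STATE_XOFF bit */

/* Hypothetical stand-in for the relevant part of struct net_device. */
struct model_dev {
	unsigned long state;
	int tx_free;            /* free TX descriptors */
};

bool model_trap;                /* models netpoll_trap() returning non-zero */

/* Mirrors the quoted netif_stop_queue(): a no-op while trapping. */
void model_stop_queue(struct model_dev *dev)
{
	if (model_trap)
		return;
	dev->state |= MODEL_XOFF;
}

bool model_queue_stopped(const struct model_dev *dev)
{
	return (dev->state & MODEL_XOFF) != 0;
}

/* Driver side: use one descriptor and stop the queue when the ring fills.
 * Returns false when called with no descriptor free. */
bool model_hard_start_xmit(struct model_dev *dev)
{
	if (dev->tx_free == 0)
		return false;
	if (--dev->tx_free == 0)
		model_stop_queue(dev);
	return true;
}

/* A sender that honours the stopped flag; returns the number of transmit
 * calls that reached the driver while its ring was already full. */
int model_send_burst(struct model_dev *dev, int packets)
{
	int bad_calls = 0;

	for (int i = 0; i < packets; i++) {
		if (model_queue_stopped(dev))
			continue;
		if (!model_hard_start_xmit(dev))
			bad_calls++;
	}
	return bad_calls;
}
```

    With model_trap clear, a burst of 6 packets into a 4-entry ring never 
reaches the driver once the ring is full; with model_trap set, the flag is 
never raised and the last 2 calls hit a full ring.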

WBR, Sergei


* Re: [Kgdb-bugreport] [PATCH 2.6.20-rc7] 8139too KGDBoE fix
  2007-02-23 19:16               ` Sergei Shtylyov
@ 2007-02-23 19:22                 ` Stephen Hemminger
  2007-02-23 19:27                   ` Sergei Shtylyov
  0 siblings, 1 reply; 13+ messages in thread
From: Stephen Hemminger @ 2007-02-23 19:22 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: Mark Huth, Amit S. Kale, netdev, Mithlesh Thukral, Vitaly Wool

On Fri, 23 Feb 2007 22:16:59 +0300
Sergei Shtylyov <sshtylyov@ru.mvista.com> wrote:

> Hello.
> 
> Stephen Hemminger wrote:
> 
> >>>>>This thread came up on kgdb-bugreport mailing list. Could you please suggest 
> >>>>>us what's the correct way of fixing this problem?
> 
> >>>>>1. When running a kgdb on RTL8139 ethernet interface: 8139too driver prints 
> >>>>>too many "Out-of-sync dirty pointer" messages on console and gdb can't 
> >>>>>connect to kgdb stub. These messages can be suppressed, though it still 
> >>>>>results in connection failures frequently. 
> 
> >>>>We think this comes from calling the driver while the queue is stopped.  
> >>>>Drivers should not do horrible things when hard start is called with the 
> >>>>queue stopped, but unfortunately, at this time, at least some drivers 
> >>>>do  explode or complain under that condition.
> 
> >>>The kernel is built on a set of assumptions about calling context. Your
> >>>out of tree code is violating one of them. Why not check for stopped queue
> >>>and do some action to try and clear it, that is what netconsole does.
> 
> >>    The queue can't be stopped when netpoll traffic trapping is enabled 
> >>(because this effectively bypasses queue control), so the stopped-queue 
> >>indication doesn't work either -- *that* is the problem. It's not at all 
> >>specific to KGDBoE -- only to traffic trapping.
> 
> > You can't ask a device to send a packet when it has no resources.
> 
>     When traffic trapping is enabled and the driver stops the queue, the 
> __LINK_STATE_XOFF flag does *not* get set, so netif_queue_stopped() returns 
> *zero*.  What can be done in this situation?

Read netpoll_send_skb()

	int status = NETDEV_TX_BUSY;
...
		if (netif_tx_trylock(dev)) {
			/* try until next clock tick */
			for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;
					tries > 0; --tries) {
				if (!netif_queue_stopped(dev))
					status = dev->hard_start_xmit(skb, dev);

				if (status == NETDEV_TX_OK)
					break;

				/* tickle device maybe there is some cleanup */
				netpoll_poll(np);

				udelay(USEC_PER_POLL);
			}
			netif_tx_unlock(dev);

netpoll_poll() allows the device to try to clean up transmit resources.
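
The retry pattern in that loop can be sketched in userspace as below. This
is a simplified model under stated assumptions, not the kernel code:
poll_dev and the poll_*() functions are hypothetical names, poll_reclaim()
stands in for netpoll_poll() and is assumed to reclaim one completed
descriptor per call, and locking and the udelay() are left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a device with a small TX ring. */
struct poll_dev {
	int tx_free;            /* free TX descriptors */
	int tx_inflight;        /* sent, not yet reclaimed */
	bool stopped;           /* models the stopped-queue flag */
};

/* Models netpoll_poll(): reclaim one descriptor and wake the queue. */
void poll_reclaim(struct poll_dev *dev)
{
	if (dev->tx_inflight > 0) {
		dev->tx_inflight--;
		dev->tx_free++;
		dev->stopped = false;
	}
}

/* Models hard_start_xmit(): stop the queue when the ring fills. */
bool poll_xmit(struct poll_dev *dev)
{
	if (dev->tx_free == 0)
		return false;
	dev->tx_free--;
	dev->tx_inflight++;
	if (dev->tx_free == 0)
		dev->stopped = true;
	return true;
}

/* Try to send once per iteration, polling the device between attempts.
 * Returns the number of tries used, or -1 if the budget ran out. */
int poll_send(struct poll_dev *dev, int max_tries)
{
	for (int tries = 1; tries <= max_tries; tries++) {
		if (!dev->stopped && poll_xmit(dev))
			return tries;
		poll_reclaim(dev);
	}
	return -1;
}
```

Note that this model only recovers because the stopped flag reflects the
real ring state; the sender relies on it to decide whether to call the
driver at all.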

-- 
Stephen Hemminger <shemminger@linux-foundation.org>


* Re: [Kgdb-bugreport] [PATCH 2.6.20-rc7] 8139too KGDBoE fix
  2007-02-23 19:13             ` Stephen Hemminger
@ 2007-02-23 19:16               ` Sergei Shtylyov
  2007-02-23 19:22                 ` Stephen Hemminger
  0 siblings, 1 reply; 13+ messages in thread
From: Sergei Shtylyov @ 2007-02-23 19:16 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Mark Huth, Amit S. Kale, netdev, Mithlesh Thukral, Vitaly Wool

Hello.

Stephen Hemminger wrote:

>>>>>This thread came up on kgdb-bugreport mailing list. Could you please suggest 
>>>>>us what's the correct way of fixing this problem?

>>>>>1. When running a kgdb on RTL8139 ethernet interface: 8139too driver prints 
>>>>>too many "Out-of-sync dirty pointer" messages on console and gdb can't 
>>>>>connect to kgdb stub. These messages can be suppressed, though it still 
>>>>>results in connection failures frequently. 

>>>>We think this comes from calling the driver while the queue is stopped.  
>>>>Drivers should not do horrible things when hard start is called with the 
>>>>queue stopped, but unfortunately, at this time, at least some drivers 
>>>>do  explode or complain under that condition.

>>>The kernel is built on a set of assumptions about calling context. Your
>>>out of tree code is violating one of them. Why not check for stopped queue
>>>and do some action to try and clear it, that is what netconsole does.

>>    The queue can't be stopped when netpoll traffic trapping is enabled 
>>(because this effectively bypasses queue control), so the stopped-queue 
>>indication doesn't work either -- *that* is the problem. It's not at all 
>>specific to KGDBoE -- only to traffic trapping.

> You can't ask a device to send a packet when it has no resources.

    When traffic trapping is enabled and the driver stops the queue, the 
__LINK_STATE_XOFF flag does *not* get set, so netif_queue_stopped() returns 
*zero*.  What can be done in this situation?

WBR, Sergei


* Re: [Kgdb-bugreport] [PATCH 2.6.20-rc7] 8139too KGDBoE fix
  2007-02-23 19:09           ` Sergei Shtylyov
@ 2007-02-23 19:13             ` Stephen Hemminger
  2007-02-23 19:16               ` Sergei Shtylyov
  0 siblings, 1 reply; 13+ messages in thread
From: Stephen Hemminger @ 2007-02-23 19:13 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: Mark Huth, Amit S. Kale, netdev, Mithlesh Thukral, Vitaly Wool

On Fri, 23 Feb 2007 22:09:16 +0300
Sergei Shtylyov <sshtylyov@ru.mvista.com> wrote:

> Hello.
> 
> Stephen Hemminger wrote:
> 
> >>>This thread came up on kgdb-bugreport mailing list. Could you please suggest 
> >>>us what's the correct way of fixing this problem?
> 
> >>>1. When running a kgdb on RTL8139 ethernet interface: 8139too driver prints 
> >>>too many "Out-of-sync dirty pointer" messages on console and gdb can't 
> >>>connect to kgdb stub. These messages can be suppressed, though it still 
> >>>results in connection failures frequently. 
> 
> >>We think this comes from calling the driver while the queue is stopped.  
> >>Drivers should not do horrible things when hard start is called with the 
> >>queue stopped, but unfortunately, at this time, at least some drivers 
> >>do  explode or complain under that condition.
> 
> > The kernel is built on a set of assumptions about calling context. Your
> > out of tree code is violating one of them. Why not check for stopped queue
> > and do some action to try and clear it, that is what netconsole does.
> 
>     The queue can't be stopped when netpoll traffic trapping is enabled 
> (because this effectively bypasses queue control), so the stopped-queue 
> indication doesn't work either -- *that* is the problem. It's not at all 
> specific to KGDBoE -- only to traffic trapping.

You can't ask a device to send a packet when it has no resources.

-- 
Stephen Hemminger <shemminger@linux-foundation.org>


* Re: [Kgdb-bugreport] [PATCH 2.6.20-rc7] 8139too KGDBoE fix
  2007-02-23 19:04         ` Stephen Hemminger
@ 2007-02-23 19:09           ` Sergei Shtylyov
  2007-02-23 19:13             ` Stephen Hemminger
  2007-02-23 20:34           ` Mark Huth
  1 sibling, 1 reply; 13+ messages in thread
From: Sergei Shtylyov @ 2007-02-23 19:09 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Mark Huth, Amit S. Kale, netdev, Mithlesh Thukral, Vitaly Wool

Hello.

Stephen Hemminger wrote:

>>>This thread came up on kgdb-bugreport mailing list. Could you please suggest 
>>>us what's the correct way of fixing this problem?

>>>1. When running a kgdb on RTL8139 ethernet interface: 8139too driver prints 
>>>too many "Out-of-sync dirty pointer" messages on console and gdb can't 
>>>connect to kgdb stub. These messages can be suppressed, though it still 
>>>results in connection failures frequently. 

>>We think this comes from calling the driver while the queue is stopped.  
>>Drivers should not do horrible things when hard start is called with the 
>>queue stopped, but unfortunately, at this time, at least some drivers 
>>do  explode or complain under that condition.

> The kernel is built on a set of assumptions about calling context. Your
> out of tree code is violating one of them. Why not check for stopped queue
> and do some action to try and clear it, that is what netconsole does.

    The queue can't be stopped when netpoll traffic trapping is enabled 
(because this effectively bypasses queue control), so the stopped-queue 
indication doesn't work either -- *that* is the problem. It's not at all 
specific to KGDBoE -- only to traffic trapping.

WBR, Sergei


* Re: [Kgdb-bugreport] [PATCH 2.6.20-rc7] 8139too KGDBoE fix
  2007-02-23 18:10       ` Mark Huth
@ 2007-02-23 19:04         ` Stephen Hemminger
  2007-02-23 19:09           ` Sergei Shtylyov
  2007-02-23 20:34           ` Mark Huth
  0 siblings, 2 replies; 13+ messages in thread
From: Stephen Hemminger @ 2007-02-23 19:04 UTC (permalink / raw)
  To: Mark Huth
  Cc: Amit S. Kale, netdev, Sergei Shtylyov, Mithlesh Thukral, Vitaly Wool

On Fri, 23 Feb 2007 11:10:40 -0700
Mark Huth <mhuth@mvista.com> wrote:

> Amit S. Kale wrote:
> > Hi Net Gurus,
> >
> > This thread came up on kgdb-bugreport mailing list. Could you please suggest 
> > us what's the correct way of fixing this problem?
> >
> > 1. When running a kgdb on RTL8139 ethernet interface: 8139too driver prints 
> > too many "Out-of-sync dirty pointer" messages on console and gdb can't 
> > connect to kgdb stub. These messages can be suppressed, though it still 
> > results in connection failures frequently. 
> >   
> We think this comes from calling the driver while the queue is stopped.  
> Drivers should not do horrible things when hard start is called with the 
> queue stopped, but unfortunately, at this time, at least some drivers 
> do  explode or complain under that condition.

The kernel is built on a set of assumptions about calling context. Your
out of tree code is violating one of them. Why not check for stopped queue
and do some action to try and clear it, that is what netconsole does.

> > 2. Here is how kgdb uses polling mechanism for communication to gdb.  kgdb 
> > calls netpoll_set_trap(1) just before entering a loop where it communicates 
> > to gdb. It calls netpoll_set_trap(0) after it is done and wants to resume a 
> > kernel. The communication to gdb goes through netpoll_poll (which calls kgdb 
> > rx_hook) and netpoll_send_udp functions.
> >
> > 3. A queue for an interface may have been stopped by its driver by calling 
> > netif_stop_queue. After this if kgdb attempts to enter communication with 
> > gdb, it'll call netpoll_set_trap(1), after which the queue can't be started 
> > again. This is a potential deadlock situation. Is there a way out of this?
> >   
> We are trying without setting the CONFIG_NETPOLL_TRAP option.  This 
> option is what turns off the function of the netif_stop/wake_queue 
> calls, which breaks the usual flow control mechanism used by netpoll 
> transmit function.  It also prevents the netif_schedule call, which 
> puts the device on the tx softirq queue.  However, in the case where 
> interrupts are off and scheduling is not allowed - which would be the 
> netpoll_set_trap(1) condition, the softirq will not run until netpoll is 
> done and the user of netpoll returns the system to normal operation.  So 
> I am unclear that allowing the schedule is a problem.  There may be some 
> obscure race conditions on smp, so we are trying to analyze that part, 
> but for the moment are testing with the netif_schedule call allowed in 
> the event of queuing the device.
> > 4. Is it necessary to call netpoll_set_trap(1) at all before entering gdb 
> > communication loop? Even if a driver stops the queue in middle of the 
> > communication netpoll_poll and netpoll_send_udp calls can recover from that 
> > by calling driver's interrupt and poll routines. Is this a valid statement?
> >   
> netpoll_set_trap() is necessary, as it informs the netpoll code to 
> respond to arp requests on behalf of the netpoll user, as well as making 
> sure that skbs are freed without needing the completion queue stuff to 
> run (I think)
> > Thanks a lot.
> > -Amit


-- 
Stephen Hemminger <shemminger@linux-foundation.org>


* Re: [Kgdb-bugreport] [PATCH 2.6.20-rc7] 8139too KGDBoE fix
  2007-02-23  7:08     ` Amit S. Kale
@ 2007-02-23 18:10       ` Mark Huth
  2007-02-23 19:04         ` Stephen Hemminger
  2007-03-14 13:42       ` Sergei Shtylyov
  1 sibling, 1 reply; 13+ messages in thread
From: Mark Huth @ 2007-02-23 18:10 UTC (permalink / raw)
  To: Amit S. Kale; +Cc: netdev, Sergei Shtylyov, Mithlesh Thukral, Vitaly Wool

Amit S. Kale wrote:
> Hi Net Gurus,
>
> This thread came up on kgdb-bugreport mailing list. Could you please suggest 
> us what's the correct way of fixing this problem?
>
> 1. When running a kgdb on RTL8139 ethernet interface: 8139too driver prints 
> too many "Out-of-sync dirty pointer" messages on console and gdb can't 
> connect to kgdb stub. These messages can be suppressed, though it still 
> results in connection failures frequently. 
>   
We think this comes from calling the driver while the queue is stopped.  
Drivers should not do horrible things when hard_start_xmit() is called 
with the queue stopped, but unfortunately, at this time, at least some 
drivers do explode or complain under that condition.
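
To make the failure mode concrete, here is a minimal user-space sketch of the kind of defensive check discussed in the quoted thread below: a transmit routine that refuses a packet when its ring is already full instead of overwriting a slot. All names (toy_nic, toy_start_xmit, TOY_TX_BUSY and so on) are hypothetical and are not taken from 8139too or any real driver; the ring size of 4 is the figure quoted for 8139too in this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_TX_RING_SIZE 4   /* the thread says 8139too queues at most 4 */
#define TOY_TX_OK        0   /* hypothetical stand-in for NETDEV_TX_OK */
#define TOY_TX_BUSY      1   /* hypothetical stand-in for NETDEV_TX_BUSY */

/* Hypothetical model of a driver's TX ring bookkeeping. */
struct toy_nic {
    unsigned int cur_tx;     /* next slot to fill */
    unsigned int dirty_tx;   /* oldest slot not yet completed */
    bool queue_stopped;      /* models the stopped-queue state */
};

/* Transmit hook that stays safe even if called with the queue stopped. */
int toy_start_xmit(struct toy_nic *nic)
{
    /* Ring already full: refuse instead of overwriting a live slot. */
    if (nic->cur_tx - nic->dirty_tx >= TOY_TX_RING_SIZE) {
        nic->queue_stopped = true;
        return TOY_TX_BUSY;
    }

    nic->cur_tx++;
    assert(nic->cur_tx - nic->dirty_tx <= TOY_TX_RING_SIZE);

    /* Stop the queue as soon as the last free slot is taken. */
    if (nic->cur_tx - nic->dirty_tx == TOY_TX_RING_SIZE)
        nic->queue_stopped = true;
    return TOY_TX_OK;
}

/* Completion path: reclaim one slot and let the queue run again. */
void toy_tx_complete(struct toy_nic *nic)
{
    if (nic->cur_tx != nic->dirty_tx) {
        nic->dirty_tx++;
        nic->queue_stopped = false;
    }
}
```

With a check like this, a caller that ignores the stopped queue gets a busy return and can retry after polling, rather than corrupting the ring state.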
> 2. Here is how kgdb uses the polling mechanism for communication with gdb.
> kgdb calls netpoll_set_trap(1) just before entering a loop where it
> communicates with gdb. It calls netpoll_set_trap(0) after it is done and
> wants to resume the kernel. The communication with gdb goes through the
> netpoll_poll (which calls the kgdb rx_hook) and netpoll_send_udp functions.
>
> 3. A queue for an interface may have been stopped by its driver by calling 
> netif_stop_queue. After this, if kgdb attempts to enter communication with 
> gdb, it'll call netpoll_set_trap(1), after which the queue can't be started 
> again. This is a potential deadlock situation. Is there a way out of this?
>   
We are trying without setting the CONFIG_NETPOLL_TRAP option.  This 
option is what turns off the function of the netif_stop/wake_queue 
calls, which breaks the usual flow control mechanism used by the netpoll 
transmit function.  It also prevents the netif_schedule call, which 
puts the device on the tx softirq queue.  However, in the case where 
interrupts are off and scheduling is not allowed (which would be the 
netpoll_set_trap(1) condition), the softirq will not run until netpoll is 
done and the user of netpoll returns the system to normal operation.  So 
I am not sure that allowing the schedule is a problem.  There may be some 
obscure race conditions on SMP, so we are trying to analyze that part, 
but for the moment we are testing with the netif_schedule call allowed in 
the event of queuing the device.
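
The interaction described above can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions: toy_dev, toy_trapped and the toy_* functions are hypothetical stand-ins, with xoff playing the role of the __LINK_STATE_XOFF bit and schedules counting simulated netif_schedule() calls. It shows a queue that was stopped before trapping began staying stopped while trapped, and the bit-only wake variant floated in the quoted thread below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-device queue state. */
struct toy_dev {
    bool xoff;       /* stands in for the __LINK_STATE_XOFF bit */
    int  schedules;  /* counts simulated netif_schedule() calls */
};

/* Models the state set by netpoll_set_trap(1) / netpoll_set_trap(0). */
bool toy_trapped;

/* Stop call as described in the thread: a no-op while trapped. */
void toy_stop_queue(struct toy_dev *dev)
{
    assert(dev != NULL);
    if (toy_trapped)
        return;
    dev->xoff = true;
}

/* Wake call as described in the thread: also a no-op while trapped. */
void toy_wake_queue(struct toy_dev *dev)
{
    assert(dev != NULL);
    if (toy_trapped)
        return;
    if (dev->xoff) {
        dev->xoff = false;
        dev->schedules++;   /* would put the device on the tx softirq queue */
    }
}

/* Alternative from the quoted thread: always flip the bit, but skip the
 * schedule call while trapped. */
void toy_wake_queue_bit_only(struct toy_dev *dev)
{
    assert(dev != NULL);
    bool was_stopped = dev->xoff;

    dev->xoff = false;
    if (was_stopped && !toy_trapped)
        dev->schedules++;
}
```

In this model, stopping the queue, setting toy_trapped and then calling toy_wake_queue() leaves xoff set, which is the deadlock raised in item 3; toy_wake_queue_bit_only() clears it without scheduling anything.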
> 4. Is it necessary to call netpoll_set_trap(1) at all before entering the
> gdb communication loop? Even if a driver stops the queue in the middle of
> the communication, netpoll_poll and netpoll_send_udp calls can recover from
> that by calling the driver's interrupt and poll routines. Is this a valid
> statement?
>   
netpoll_set_trap() is necessary, as it informs the netpoll code to 
respond to ARP requests on behalf of the netpoll user, as well as making 
sure that skbs are freed without needing the completion queue stuff to 
run (I think).
> Thanks a lot.
> -Amit
>
>
>
> On Thursday 22 February 2007 22:11, Sergei Shtylyov wrote:
>   
>> Hello, I wrote:
>>     
>>>>>>>>> Even with this patch, the packets probably get stuck somewhere in
>>>>>>>>> the driver, as cross-gdb sees tail of the $g packet reply only in
>>>>>>>>> reply to next packet...
>>>>>>>>>                   
>>>>>  This wasn't happening on x86 probably because the register packet
>>>>> should be much shorter there than on PPC...
>>>>>
>>>>>           
>>>>>>>>  Argh! That's all because of the CONFIG_NETPOLL_TRAP that
>>>>>>>> CONFIG_KGDBOE* options select -- since the initial breakpoint enables
>>>>>>>> trapping via KGDBoE's pre_exception() handler,
>>>>>>>> netif_{stop/wake}_queue() stop working and that causes KGDBoE to
>>>>>>>> literally flood 8139too with packets (although it can't queue up
>>>>>>>> more than 4). Looks like a general design issue to me... :-/
>>>>>>>>                 
>>>>>>> Well, maybe not. But many drivers are surely unprepared for their
>>>>>>> hard_start_xmit() method being called with the queue already stopped, and
>>>>>>> those with a small TX queue (like natsemi with which we're also having
>>>>>>> trouble) would get flooded as well. I'm going to submit a patch to
>>>>>>> netdev adding extra check for TX ring being full -- after/if it gets
>>>>>>> accepted, this patch won't be needed anymore.
>>>>>>>               
>>>>>> Here is what comes to my mind right away. It might need some more
>>>>>> polishing or cleaning up:
>>>>>>
>>>>>> A potential solution would be to check whether hard_start_xmit() returns
>>>>>> NETDEV_TX_BUSY. If the transmit queue is busy (due to a lot of threads
>>>>>> or the queue getting full), we should wait in netpoll_send_skb(), call a
>>>>>> cleanup through poll() and then retry sending the packet.
>>>>>>             
>>>>>   This is already being done by netpoll itself. The thing is that
>>>>> hard_start_xmit() doesn't return NETDEV_TX_BUSY in those drivers. :-/
>>>>>           
>>>> In addition to that we set trapped. I wonder whether it is possible that
>>>> a queue is stopped and we enter kgdb. It would be a deadlock.
>>>> -Amit
>>>>         
>>>     Why? Netpoll does call the driver's interrupt and NAPI handlers in
>>> that case (until the retry count is 0).
>>>       
>>     Ah, got it -- since the traffic trapping (when enabled) effectively
>> bypasses netif_wake_queue(), a queue would never be actually woken up.
>> Maybe it's worth always returning 0 from netif_queue_stopped() in this
>> case? Or maybe the correct thing to do when trapping is to just twiddle the
>> __LINK_STATE_XOFF bit, bypassing the call to netif_schedule()?
>>
>>     
>>>>>> Regards,
>>>>>> Mithlesh Thukral
>>>>>>             
>> WBR, Sergei
>>



* Re: [Kgdb-bugreport] [PATCH 2.6.20-rc7] 8139too KGDBoE fix
       [not found]   ` <45DDC7C0.8050100@ru.mvista.com>
@ 2007-02-23  7:08     ` Amit S. Kale
  2007-02-23 18:10       ` Mark Huth
  2007-03-14 13:42       ` Sergei Shtylyov
  0 siblings, 2 replies; 13+ messages in thread
From: Amit S. Kale @ 2007-02-23  7:08 UTC (permalink / raw)
  To: netdev; +Cc: Sergei Shtylyov, Mithlesh Thukral, Vitaly Wool, Mark Huth

Hi Net Gurus,

This thread came up on the kgdb-bugreport mailing list. Could you please
suggest the correct way of fixing this problem?

1. When running kgdb on an RTL8139 Ethernet interface, the 8139too driver
prints too many "Out-of-sync dirty pointer" messages on the console and gdb
can't connect to the kgdb stub. These messages can be suppressed, though
connection failures are still frequent.

2. Here is how kgdb uses the polling mechanism for communication with gdb.
kgdb calls netpoll_set_trap(1) just before entering a loop where it
communicates with gdb. It calls netpoll_set_trap(0) after it is done and
wants to resume the kernel. The communication with gdb goes through the
netpoll_poll (which calls the kgdb rx_hook) and netpoll_send_udp functions.

3. A queue for an interface may have been stopped by its driver by calling 
netif_stop_queue. After this, if kgdb attempts to enter communication with 
gdb, it'll call netpoll_set_trap(1), after which the queue can't be started 
again. This is a potential deadlock situation. Is there a way out of this?

4. Is it necessary to call netpoll_set_trap(1) at all before entering the
gdb communication loop? Even if a driver stops the queue in the middle of
the communication, netpoll_poll and netpoll_send_udp calls can recover from
that by calling the driver's interrupt and poll routines. Is this a valid
statement?

Thanks a lot.
-Amit
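
The recovery path asked about in item 4 (send, and when the device is busy, poll it and retry until a retry budget runs out) can be sketched in user-space C as follows. This is an illustration only: toy_ring, toy_xmit, toy_poll and toy_send_with_retry are hypothetical names, the retry budget of 5 is an arbitrary assumption, and the model assumes the driver actually reports a full ring, which the quoted thread below says some drivers do not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_RING_SIZE 4   /* small TX ring, as quoted for 8139too */
#define TOY_MAX_TRIES 5   /* arbitrary retry budget for this sketch */

/* Hypothetical model of a device's transmit ring. */
struct toy_ring {
    unsigned int in_flight;  /* packets handed over, not yet completed */
    unsigned int sent;       /* packets accepted in total */
};

/* Hypothetical transmit hook: reports busy (false) when the ring is full. */
bool toy_xmit(struct toy_ring *r)
{
    assert(r != NULL);
    if (r->in_flight >= TOY_RING_SIZE)
        return false;
    r->in_flight++;
    r->sent++;
    return true;
}

/* Hypothetical poll hook: reaps one completed packet, if there is one. */
void toy_poll(struct toy_ring *r)
{
    assert(r != NULL);
    if (r->in_flight > 0)
        r->in_flight--;
}

/* Try to send; on a busy ring, poll the device and retry until the
 * budget is exhausted. Returns false if the packet had to be dropped. */
bool toy_send_with_retry(struct toy_ring *r)
{
    for (int tries = TOY_MAX_TRIES; tries > 0; tries--) {
        if (toy_xmit(r))
            return true;
        toy_poll(r);   /* give the device a chance to free a slot */
    }
    return false;
}
```

The loop only helps if the busy condition is visible to the caller; a transmit hook that silently accepts packets into a full ring never triggers the poll-and-retry branch.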



On Thursday 22 February 2007 22:11, Sergei Shtylyov wrote:
> Hello, I wrote:
> >>>>>>>Even with this patch, the packets probably get stuck somewhere in
> >>>>>>> the driver, as cross-gdb sees tail of the $g packet reply only in
> >>>>>>> reply to next packet...
> >>>
> >>>  This wasn't happening on x86 probably because the register packet
> >>> should be much shorter there than on PPC...
> >>>
> >>>>>>  Argh! That's all because of the CONFIG_NETPOLL_TRAP that
> >>>>>>CONFIG_KGDBOE* options select -- since the initial breakpoint enables
> >>>>>>trapping via KGDBoE's pre_exception() handler,
> >>>>>> netif_{stop/wake}_queue() stop working and that causes KGDBoE to
> >>>>>> literally flood 8139too with packets (although it can't queue up
> >>>>>> more than 4). Looks like a general design issue to me... :-/
> >>>>>
> >>>>> Well, maybe not. But many drivers are surely unprepared for their
> >>>>>hard_start_xmit() method being called with the queue already stopped, and
> >>>>>those with a small TX queue (like natsemi with which we're also having
> >>>>>trouble) would get flooded as well. I'm going to submit a patch to
> >>>>>netdev adding extra check for TX ring being full -- after/if it gets
> >>>>>accepted, this patch won't be needed anymore.
> >>>>
> >>>>Here is what comes to my mind right away. It might need some more
> >>>>polishing or cleaning up:
> >>>>
> >>>>A potential solution would be to check whether hard_start_xmit() returns
> >>>>NETDEV_TX_BUSY. If the transmit queue is busy (due to a lot of threads
> >>>> or the queue getting full), we should wait in netpoll_send_skb(), call a
> >>>> cleanup through poll() and then retry sending the packet.
> >>>
> >>>   This is already being done by netpoll itself. The thing is that
> >>>hard_start_xmit() doesn't return NETDEV_TX_BUSY in those drivers. :-/
> >>
> >>In addition to that we set trapped. I wonder whether it is possible that
> >> a queue is stopped and we enter kgdb. It would be a deadlock.
> >>-Amit
> >
> >     Why? Netpoll does call the driver's interrupt and NAPI handlers in
> > that case (until the retry count is 0).
>
>     Ah, got it -- since the traffic trapping (when enabled) effectively
> bypasses netif_wake_queue(), a queue would never be actually woken up.
> Maybe it's worth always returning 0 from netif_queue_stopped() in this
> case? Or maybe the correct thing to do when trapping is to just twiddle the
> __LINK_STATE_XOFF bit, bypassing the call to netif_schedule()?
>
> >>>>Regards,
> >>>>Mithlesh Thukral
>
> WBR, Sergei
>


end of thread, other threads:[~2007-03-14 21:40 UTC | newest]

Thread overview: 13+ messages
     [not found] <1172746367.2515.31.camel@xenon>
2007-03-01 16:22 ` [Kgdb-bugreport] [PATCH 2.6.20-rc7] 8139too KGDBoE fix Sergei Shtylyov
     [not found] <200701312144.56497.sshtylyov@ru.mvista.com>
     [not found] ` <45DDBD96.10000@ru.mvista.com>
     [not found]   ` <45DDC7C0.8050100@ru.mvista.com>
2007-02-23  7:08     ` Amit S. Kale
2007-02-23 18:10       ` Mark Huth
2007-02-23 19:04         ` Stephen Hemminger
2007-02-23 19:09           ` Sergei Shtylyov
2007-02-23 19:13             ` Stephen Hemminger
2007-02-23 19:16               ` Sergei Shtylyov
2007-02-23 19:22                 ` Stephen Hemminger
2007-02-23 19:27                   ` Sergei Shtylyov
2007-02-23 20:34           ` Mark Huth
2007-03-14 13:42       ` Sergei Shtylyov
2007-03-14 14:04         ` Sergei Shtylyov
2007-03-14 21:40           ` Sergei Shtylyov
