On Sun, 2012-03-25 at 17:36 -0400, David Miller wrote:
> From: David Woodhouse
> Date: Sun, 25 Mar 2012 11:43:50 +0100
>
> > It's a bad idea to have huge hidden queues (a whole wmem_default worth
> > of packets are in a hidden queue between ppp_generic and the ATM device,
> > ffs!) anyway, so perhaps if we just fix *that* within PPP, it should
> > work a bit better with TEQL?
>
> Yes, the ATM devices deep transmit queue is quite undesirable.

Indeed, although I don't think it's the only cause of the problem I saw.
The first thing I tried was a hack in pppoatm_assign_vcc() to set the
socket's sk_sndbuf to 4KiB. It *seemed* to work, but only while all my
debugging printks in sch_teql were being spewed at 115200 baud over the
serial port. As soon as I hit SysRq-0 and the serial port delays went
away, I was back to bursts on one line then the other.

> But I actually don't see how the problem arises yet, I need more
> details.
>
> PPP itself will always stop the queue, and return NETDEV_TX_OK on a
> transmit attempt. It may wake the queue back up before returning if
> the downstream device (such as pppoatm) accepted the packet.

It does indeed stop the queue. I think it then wakes it right back up
again in ppp_xmit_process(), *before* returning NETDEV_TX_OK. So the
offending calls to teql_dequeue() which are putting it back to the front
of the list are going to be from the softirq trying to feed the device.

I'll confirm that, then try fixing the PPP code so it doesn't stop and
immediately restart the queue. If it only stops the queue
*conditionally*, that may well fix the problem.

> But in either case NETDEV_TX_OK is returned and this is what the teql
> master transmit sees, and this takes the code path which advances the
> slave pointer to the next device.
>
> Therefore the next teql master transmit should try the next device in
> the slave list, not the PPP device used in the previous call.

I instrumented everywhere that the 'next device' pointer (m->slaves) is
assigned in sch_teql. One of the printks you see below is in
teql_master_xmit(), and it's doing exactly what you say. And then
immediately afterwards you see the other printk in teql_dequeue(),
setting m->slaves right back to the original device again:

Mar 22 15:36:07 net1-173.woodhou.se kernel: [12612.673308] teql xmit cebca100 next cebca400
Mar 22 15:36:07 net1-173.woodhou.se kernel: [12612.677630] m->slaves becomes cebca100
Mar 22 15:36:07 net1-173.woodhou.se kernel: [12613.069589] teql xmit cebca100 next cebca400
Mar 22 15:36:07 net1-173.woodhou.se kernel: [12613.073884] m->slaves becomes cebca100
Mar 22 15:36:07 net1-173.woodhou.se kernel: [12613.113584] teql xmit cebca100 next cebca400
Mar 22 15:36:07 net1-173.woodhou.se kernel: [12613.117908] m->slaves becomes cebca100
Mar 22 15:36:08 net1-173.woodhou.se kernel: [12614.041113] teql xmit cebca100 next cebca400
Mar 22 15:36:08 net1-173.woodhou.se kernel: [12614.045411] m->slaves becomes cebca100
Mar 22 15:36:09 net1-173.woodhou.se kernel: [12614.258464] teql xmit cebca100 next cebca400
Mar 22 15:36:09 net1-173.woodhou.se kernel: [12614.262762] m->slaves becomes cebca100
Mar 22 15:36:09 net1-173.woodhou.se kernel: [12614.896259] teql xmit cebca100 next cebca400
Mar 22 15:36:09 net1-173.woodhou.se kernel: [12614.900559] m->slaves becomes cebca100
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.129265] teql xmit cebca100 next cebca400
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.133599] teql xmit cebca400 next cebca100
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.137919] m->slaves becomes cebca100
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.141673] m->slaves becomes cebca400
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.148321] teql xmit cebca400 next cebca100
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.152623] m->slaves becomes cebca400
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.157979] teql xmit cebca400 next cebca100
Mar 22 15:36:10 net1-173.woodhou.se kernel: [12616.162276] m->slaves becomes cebca400
Mar 22 15:36:11 net1-173.woodhou.se kernel: [12616.172402] teql xmit cebca400 next cebca100
Mar 22 15:36:11 net1-173.woodhou.se kernel: [12616.176731] m->slaves becomes cebca400
Mar 22 15:36:13 net1-173.woodhou.se kernel: [12618.693948] teql xmit cebca400 next cebca100
Mar 22 15:36:13 net1-173.woodhou.se kernel: [12618.698275] m->slaves becomes cebca400
Mar 22 15:36:14 net1-173.woodhou.se kernel: [12619.263215] teql xmit cebca400 next cebca100
Mar 22 15:36:14 net1-173.woodhou.se kernel: [12619.267539] m->slaves becomes cebca400
Mar 22 15:36:14 net1-173.woodhou.se kernel: [12619.311534] teql xmit cebca400 next cebca100
Mar 22 15:36:14 net1-173.woodhou.se kernel: [12619.315828] m->slaves becomes cebca400
Mar 22 15:36:14 net1-173.woodhou.se kernel: [12619.645580] teql xmit cebca400 next cebca100

For the first few seconds it doesn't manage to send *any* packets out
the cebca400 queue. That queue gets marked as 'next', but never quite
makes it. And then it manages to flip, and for another few seconds it
sends *all* its packets out that queue, leaving the cebca100 queue idle.

-- 
dwmw2
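
For anyone following along, the pointer behaviour in the log can be
reproduced with a toy user-space model. This is only a rough sketch and
not kernel code: the Master class, its cursor field (standing in for
m->slaves) and the immediate_wake flag are all hypothetical names made
up for illustration. It assumes the simplest reading of the trace:
the master's xmit advances the pointer, and if the slave's queue is
woken straight away, its dequeue path points the pointer back at
itself. It does not attempt to model the occasional flip to the other
device.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Toy stand-in for the teql master; not the kernel structure."""
    slaves: list                      # slave names, in ring order
    cursor: int = 0                   # index of the slave to try next (models m->slaves)
    log: list = field(default_factory=list)

    def xmit(self):
        """Send one packet on the current slave, then advance the cursor."""
        used = self.slaves[self.cursor]
        self.cursor = (self.cursor + 1) % len(self.slaves)
        self.log.append(used)
        return used

    def dequeue_reset(self, slave):
        """Assumed effect of the slave's dequeue path: point the cursor back at that slave."""
        self.cursor = self.slaves.index(slave)


def simulate(packets, immediate_wake):
    """Return the slave used for each packet.

    immediate_wake=True models the queue being stopped and woken again
    before xmit returns, so the slave's dequeue runs after every packet.
    """
    m = Master(["cebca100", "cebca400"])
    for _ in range(packets):
        used = m.xmit()
        if immediate_wake:
            m.dequeue_reset(used)
    return m.log


if __name__ == "__main__":
    print(simulate(4, immediate_wake=True))
    print(simulate(4, immediate_wake=False))
```

With immediate_wake=True every packet goes out the first slave, which
matches the first half of the trace; with immediate_wake=False the two
slaves alternate. Whether stopping the queue only conditionally in PPP
gives the second behaviour on real hardware is still to be confirmed.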