All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] tcp: delack_timer expiration changes for every frame
@ 2010-05-20 20:47 Eric Dumazet
  2010-06-02 11:11 ` David Miller
  0 siblings, 1 reply; 2+ messages in thread
From: Eric Dumazet @ 2010-05-20 20:47 UTC (permalink / raw)
  To: netdev, David Miller; +Cc: Ilpo Järvinen

While oprofiling net-next-2.6 during tcp workloads I found
mod_timer(delack_timer) was used too much, even if we receive/send more
than one frame per jiffie.

Something seems wrong, since we should try to change this timer only
when jiffies changes. mod_timer() has a special optimization for this,
but something is broken in our tcp stack ?

I added some logs in mod_timer() :

HZ = 250

results for one socket shown :

[  392.116735] timer->expires=22997, expires=23024(37) diff=-27 timer=e5ecb754
[  392.120627] timer->expires=23024, expires=22998(10) diff=26 timer=e5ecb754
[  392.123245] timer->expires=22998, expires=23025(37) diff=-27 timer=e5ecb754
[  392.133688] timer->expires=23025, expires=23001(10) diff=24 timer=e5ecb754
[  392.136502] timer->expires=23001, expires=23029(37) diff=-28 timer=e5ecb754
[  392.140392] timer->expires=23029, expires=23003(10) diff=26 timer=e5ecb754
[  392.143142] timer->expires=23003, expires=23030(37) diff=-27 timer=e5ecb754
[  392.153812] timer->expires=23030, expires=23006(10) diff=24 timer=e5ecb754
[  392.156658] timer->expires=23006, expires=23034(37) diff=-28 timer=e5ecb754
[  392.160474] timer->expires=23034, expires=23008(10) diff=26 timer=e5ecb754
[  392.163317] timer->expires=23008, expires=23035(37) diff=-27 timer=e5ecb754
[  392.167176] timer->expires=23035, expires=23009(10) diff=26 timer=e5ecb754
[  392.176963] timer->expires=23009, expires=23039(37) diff=-30 timer=e5ecb754
[  392.180863] timer->expires=23039, expires=23013(10) diff=26 timer=e5ecb754
[  392.183577] timer->expires=23013, expires=23040(37) diff=-27 timer=e5ecb754
[  392.187537] timer->expires=23040, expires=23014(10) diff=26 timer=e5ecb754
[  392.197286] timer->expires=23014, expires=23044(37) diff=-30 timer=e5ecb754
[  392.201047] timer->expires=23044, expires=23018(10) diff=26 timer=e5ecb754
[  392.203761] timer->expires=23018, expires=23045(37) diff=-27 timer=e5ecb754
[  392.207721] timer->expires=23045, expires=23019(10) diff=26 timer=e5ecb754
[  392.217454] timer->expires=23019, expires=23049(37) diff=-30 timer=e5ecb754

So we change the delack_timer by a positive delta (~ HZ/10) and a
 negative delta (~HZ/10), on the typical netperf TCP_RR workload.



Here, the incoming frame is handled by netperf, doing a recvmsg().
tcp_send_delayed_ack() sets the delack_timer to jiffies + HZ/25

[  392.207721] timer->expires=23045, new expires=23019(10) diff=26 timer=e5ecb754
[  392.207785] ------------[ cut here ]------------
[  392.207846] WARNING: at kernel/timer.c:753 mod_timer+0x55/0x18e()
[  392.207908] Hardware name: ProLiant BL460c G6
[  392.207965] Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler ipv6 dm_mod button battery ac ehci_hcd uhci_hcd tg3 libphy bnx2x crc32c libcrc32c mdio [last unloaded: x_tables]
[  392.208900] Pid: 5320, comm: netperf Tainted: G        W  2.6.34-06175-g801cae3-dirty #33
[  392.208979] Call Trace:
[  392.209036]  [<c102df55>] ? warn_slowpath_common+0x5d/0x70
[  392.209098]  [<c102df73>] ? warn_slowpath_null+0xb/0xd
[  392.209159]  [<c10388de>] ? mod_timer+0x55/0x18e
[  392.209221]  [<c1279ce7>] ? tcp_send_delayed_ack+0xb5/0xc1
[  392.209282]  [<c1276d26>] ? tcp_rcv_established+0x39f/0x4f7
[  392.209345]  [<c127bae5>] ? tcp_v4_do_rcv+0x22/0x161
[  392.209406]  [<c126d6b4>] ? tcp_prequeue_process+0x47/0x5b
[  392.209468]  [<c12701d2>] ? tcp_recvmsg+0x371/0x691
[  392.209529]  [<c12b8c91>] ? _raw_spin_lock_bh+0x8/0x1e
[  392.209590]  [<c1240bcd>] ? release_sock+0x10/0xc9
[  392.216514]  [<c1285bad>] ? inet_recvmsg+0x5d/0x72
[  392.216575]  [<c123e725>] ? sock_recvmsg+0xb4/0xd1
[  392.216636]  [<c1032ace>] ? irq_exit+0x39/0x5b
[  392.216696]  [<c123fab5>] ? sys_recvfrom+0xb4/0x117
[  392.216757]  [<c10483be>] ? ktime_get+0x61/0xe8
[  392.216817]  [<c1016431>] ? lapic_next_event+0x13/0x16
[  392.216878]  [<c104ba9d>] ? clockevents_program_event+0xac/0xbc
[  392.216940]  [<c104c6cc>] ? tick_dev_program_event+0x34/0x138
[  392.217002]  [<c104c7ed>] ? tick_program_event+0x1d/0x21
[  392.217064]  [<c1044da0>] ? hrtimer_interrupt+0x10b/0x1c1
[  392.217126]  [<c123fb31>] ? sys_recv+0x19/0x1d
[  392.217186]  [<c12401dc>] ? sys_socketcall+0x120/0x1c6
[  392.217303]  [<c100268c>] ? sysenter_do_call+0x12/0x22
[  392.217364] ---[ end trace e9475c06f1d49408 ]---

Here, the incoming frame is handled by the other side (netserver),
but still for the netperf socket, (softirq handling)
tcp_v4_rcv() sets the delack timer to 37 ticks, so mod_timer() optimizations is not
working at all.

[  392.217454] timer->expires=23019, new expires=23049(37) diff=-30 timer=e5ecb754
[  392.217518] ------------[ cut here ]------------
[  392.217578] WARNING: at kernel/timer.c:753 mod_timer+0x55/0x18e()
[  392.217639] Hardware name: ProLiant BL460c G6
[  392.217697] Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler ipv6 dm_mod button battery ac ehci_hcd uhci_hcd tg3 libphy bnx2x crc32c libcrc32c mdio [last unloaded: x_tables]
[  392.218439] Pid: 5321, comm: netserver Tainted: G        W  2.6.34-06175-g801cae3-dirty #33
[  392.218526] Call Trace:
[  392.218582]  [<c102df55>] ? warn_slowpath_common+0x5d/0x70
[  392.218644]  [<c102df73>] ? warn_slowpath_null+0xb/0xd
[  392.218705]  [<c10388de>] ? mod_timer+0x55/0x18e
[  392.218765]  [<c127cf56>] ? tcp_v4_rcv+0x41c/0x6b7
[  392.218826]  [<c1265832>] ? ip_local_deliver_finish+0xe9/0x178
[  392.218888]  [<c126572a>] ? ip_rcv_finish+0x262/0x281
[  392.218949]  [<c1249986>] ? __netif_receive_skb+0x267/0x282
[  392.219011]  [<c1249a0d>] ? process_backlog+0x6c/0x113
[  392.219072]  [<c124a2e6>] ? net_rx_action+0x8a/0x15a
[  392.219133]  [<c106234e>] ? __rcu_process_callbacks+0xb9/0x1d1
[  392.219195]  [<c1032910>] ? __do_softirq+0x0/0x13a
[  392.219255]  [<c10329b5>] ? __do_softirq+0xa5/0x13a
[  392.219316]  [<c1032910>] ? __do_softirq+0x0/0x13a
[  392.219375]  <IRQ>  [<c1032449>] ? local_bh_enable+0x5f/0x6a
[  392.219474]  [<c124c0be>] ? dev_queue_xmit+0x34d/0x37a
[  392.219536]  [<c1268115>] ? ip_finish_output+0x1c7/0x1ff
[  392.219610]  [<c1268255>] ? ip_local_out+0x18/0x1a
[  392.219670]  [<c12684f4>] ? ip_queue_xmit+0x29d/0x2d5
[  392.219731]  [<c1293e89>] ? bictcp_acked+0x4f/0x139
[  392.219791]  [<c1275ee8>] ? tcp_ack+0x155b/0x16e9
[  392.219851]  [<c127851a>] ? tcp_transmit_skb+0x62a/0x65f
[  392.219912]  [<c1038a03>] ? mod_timer+0x17a/0x18e
[  392.219972]  [<c127900a>] ? tcp_write_xmit+0x73a/0x81c
[  392.220033]  [<c109e487>] ? __kmalloc_node+0x30/0x76
[  392.220095]  [<c1279127>] ? __tcp_push_pending_frames+0x15/0x6c
[  392.220159]  [<c126f214>] ? tcp_sendmsg+0x7ee/0x8c5
[  392.220223]  [<c123e823>] ? sock_sendmsg+0xa7/0xc1
[  392.220284]  [<c1044da0>] ? hrtimer_interrupt+0x10b/0x1c1
[  392.220346]  [<c1032ace>] ? irq_exit+0x39/0x5b
[  392.220406]  [<c1016811>] ? smp_apic_timer_interrupt+0x6b/0x75
[  392.220468]  [<c102007b>] ? pud_huge+0x1/0x9
[  392.220536]  [<c123f9b9>] ? sys_sendto+0xfc/0x127
[  392.220599]  [<c1044bd4>] ? hrtimer_start_range_ns+0xf/0x13
[  392.220661]  [<c1023848>] ? update_curr+0x60/0xdf
[  392.220722]  [<c1044871>] ? hrtimer_forward+0x10f/0x123
[  392.220784]  [<c10483be>] ? ktime_get+0x61/0xe8
[  392.220844]  [<c123f9fd>] ? sys_send+0x19/0x1d
[  392.220903]  [<c12401af>] ? sys_socketcall+0xf3/0x1c6
[  392.220964]  [<c1032ace>] ? irq_exit+0x39/0x5b
[  392.221024]  [<c1016811>] ? smp_apic_timer_interrupt+0x6b/0x75
[  392.221086]  [<c100268c>] ? sysenter_do_call+0x12/0x22
[  392.221147] ---[ end trace e9475c06f1d49409 ]---

Its a bit late here to investigate, maybe one of you guys have an idea about this...




^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [RFC] tcp: delack_timer expiration changes for every frame
  2010-05-20 20:47 [RFC] tcp: delack_timer expiration changes for every frame Eric Dumazet
@ 2010-06-02 11:11 ` David Miller
  0 siblings, 0 replies; 2+ messages in thread
From: David Miller @ 2010-06-02 11:11 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, ilpo.jarvinen

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 20 May 2010 22:47:19 +0200

> So we change the delack_timer by a positive delta (~ HZ/10) and a
>  negative delta (~HZ/10), on the typical netperf TCP_RR workload.

Yes, this is silly.

> Here, the incoming frame is handled by netperf, doing a recvmsg().
> tcp_send_delayed_ack() sets the delack_timer to jiffies + HZ/25
> 
> [  392.207721] timer->expires=23045, new expires=23019(10) diff=26 timer=e5ecb754
 ...
> [  392.209221]  [<c1279ce7>] ? tcp_send_delayed_ack+0xb5/0xc1
> [  392.209282]  [<c1276d26>] ? tcp_rcv_established+0x39f/0x4f7

HZ/25 is TCP_DELACK_MIN and TCP_ATO_MIN, but the actual value we use
here involves incorporation of various measurements made on the
connection (RTO, etc.)

 ...
> tcp_v4_rcv() sets the delack timer to 37 ticks, so mod_timer() optimizations is not
> working at all.
> 
> [  392.217454] timer->expires=23019, new expires=23049(37) diff=-30 timer=e5ecb754
 ...
> [  392.218765]  [<c127cf56>] ? tcp_v4_rcv+0x41c/0x6b7
> [  392.218826]  [<c1265832>] ? ip_local_deliver_finish+0xe9/0x178

There must be a tail-call here at tcp_v4_rcv() or something missed in
the backtrace stack scanning logic, because tcp_v4_rcv() and it's main
inline tcp_v4_do_rcv() do not modify established state socket timers,
and in particular do not modify the delack timer, that I can see.

It must be in via tcp_rcv_established() or similar.
...

Nevermind, it's the inlined prequeue stuff.

It uses a seperate calculation of the delack timer offset, independant
of the one made by tcp_send_delayed_ack(), it's timer offset formula is:

	(3 * tcp_rto_min(sk)) / 4

with a MAX of:

	TCP_RTO_MAX

So every time we go in and out of recvmsg() we'll hop between these
two different delayed ACK settings.

The prequeue logic is trying to stretch the delayed ACK to 3/4 of a
window of data.  It's set a bit high, intentionally, in the hopes that
we'll get the process into recvmsg() and have it emit it's response
packet from a subsequent sendmsg() (that the ACK can ride on) before
this timer fires.

But when we drop the socket lock to sleep or return to userspace, one
of the next packets is just going to reset this timer differently.

While the intentions of the prequeue code look legit, the use of two
different delayed ACK timeout schemes has bad implications elsewhere.

For example, if the delack timer does actually fire, there is this
ATO fixup code here:

		if (!icsk->icsk_ack.pingpong) {
			/* Delayed ACK missed: inflate ATO. */
			icsk->icsk_ack.ato = min(icsk->icsk_ack.ato << 1, icsk->icsk_rto);
		} else {
			/* Delayed ACK missed: leave pingpong mode and
			 * deflate ATO.
			 */
			icsk->icsk_ack.pingpong = 0;
			icsk->icsk_ack.ato      = TCP_ATO_MIN;
		}

which is totally wrong if the delack timer offset is the one
calculated by the prequeue code.  Doubling the ATO in that case
is completely the wrong thing to do.

So yes we have all kinds of inconsistencies here and we should
probably unify things so that the timer gets kicked less often.

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2010-06-02 11:11 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-20 20:47 [RFC] tcp: delack_timer expiration changes for every frame Eric Dumazet
2010-06-02 11:11 ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.