From: Eric Dumazet <edumazet@google.com>
To: Cong Wang <xiyou.wangcong@gmail.com>
Cc: netdev <netdev@vger.kernel.org>,
	Yuchung Cheng <ycheng@google.com>,
	Neal Cardwell <ncardwell@google.com>
Subject: Re: [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer
Date: Wed, 23 Oct 2019 11:14:20 -0700	[thread overview]
Message-ID: <CANn89iKNAg9gwe-ZLSoknwG6-XS44aRZrEv4pDeiON50uXv-0A@mail.gmail.com> (raw)
In-Reply-To: <CAM_iQpWah2M2tG=+eRS86VtjknTiBC42DSwdHB8USpXgRsfWjw@mail.gmail.com>

On Wed, Oct 23, 2019 at 10:40 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Tue, Oct 22, 2019 at 7:15 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Oct 22, 2019 at 6:10 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > On Tue, Oct 22, 2019 at 4:24 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Tue, Oct 22, 2019 at 4:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > >
> > > > > Currently RTO, TLP and PROBE0 all share the same timer instance
> > > > > in the kernel and use icsk->icsk_pending to dispatch the work.
> > > > > This causes spinlock contention when the timer is reset too
> > > > > frequently, as clearly shown in the perf report:
> > > > >
> > > > >    61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
> > > > >    ...
> > > > >     - 58.83% tcp_v4_rcv
> > > > >       - 58.80% tcp_v4_do_rcv
> > > > >          - 58.80% tcp_rcv_established
> > > > >             - 52.88% __tcp_push_pending_frames
> > > > >                - 52.88% tcp_write_xmit
> > > > >                   - 28.16% tcp_event_new_data_sent
> > > > >                      - 28.15% sk_reset_timer
> > > > >                         + mod_timer
> > > > >                   - 24.68% tcp_schedule_loss_probe
> > > > >                      - 24.68% sk_reset_timer
> > > > >                         + 24.68% mod_timer
> > > > >
> > > > > This patch decouples the TLP timer from the RTO timer by adding a
> > > > > new timer instance, but still uses icsk->icsk_pending for dispatch,
> > > > > in order to minimize the risk of the change.
> > > > >
> > > > > After this patch, the CPU time spent in tcp_write_xmit() dropped
> > > > > to 10.92%.
> > > >
> > > > What is the exact benchmark you are running?
> > > >
> > > > We never saw any contention like that, so let's make sure you are not
> > > > working around another issue.
> > >
> > > I simply ran 256 parallel netperf instances on a 128-CPU machine to
> > > trigger this spinlock contention; it is 100% reproducible here.
> >
> > How many TX/RX queues on the NIC ?
>
> 60 queues (default), 25Gbps NIC, mlx5.
>
> > What is the qdisc setup ?
>
> fq_codel, which is the default here. Its parameters are the defaults too.
>
> >
> > >
> > > A single netperf TCP_RR could _also_ confirm the improvement:
> > >
> > > Before patch:
> > >
> > > $ netperf -H XXX -t TCP_RR -l 20
> > > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0
> > > AF_INET to XXX () port 0 AF_INET : first burst 0
> > > Local /Remote
> > > Socket Size   Request  Resp.   Elapsed  Trans.
> > > Send   Recv   Size     Size    Time     Rate
> > > bytes  Bytes  bytes    bytes   secs.    per sec
> > >
> > > 655360 873800 1        1       20.00    17665.59
> > > 655360 873800
> > >
> > >
> > > After patch:
> > >
> > > $ netperf -H XXX -t TCP_RR -l 20
> > > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0
> > > AF_INET to XXX () port 0 AF_INET : first burst 0
> > > Local /Remote
> > > Socket Size   Request  Resp.   Elapsed  Trans.
> > > Send   Recv   Size     Size    Time     Rate
> > > bytes  Bytes  bytes    bytes   secs.    per sec
> > >
> > > 655360 873800 1        1       20.00    18829.31
> > > 655360 873800
> > >
> > > (I ran it multiple times and picked a median result here.)
> > >
> > > The difference can also be observed by turning TLP off and on without the patch.
> >
> > OK thanks for using something I can repro easily :)
> >
> > I ran the experiment ten times:
>
> How many CPUs do you have?
>
>
> >
> > lpaa23:/export/hda3/google/edumazet# echo 3
> > >/proc/sys/net/ipv4/tcp_early_retrans
> > lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do
> > ./super_netperf 1 -H lpaa24 -t TCP_RR -l 20; done
> >   26797
> >   26850
> >   25266
> >   27605
> >   26586
> >   26341
> >   27255
> >   27532
> >   26657
> >   27253
> >
> >
> > Then I disabled TLP and saw no obvious difference:
> >
> > lpaa23:/export/hda3/google/edumazet# echo 0
> > >/proc/sys/net/ipv4/tcp_early_retrans
> > lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do
> > ./super_netperf 1 -H lpaa24 -t TCP_RR -l 20; done
> >   25311
> >   24658
> >   27105
> >   27421
> >   27604
> >   24649
> >   26259
> >   27615
> >   27543
> >   26217
> >
> > I tried with 256 concurrent flows, and made the same overall observation:
> > TLP does not change the numbers.
> > (In fact I am not even sure we arm the RTO at all while doing a TCP_RR.)
>
> In case there is a misunderstanding: the CPU profile I posted was
> captured during 256 parallel TCP_STREAM runs.

When I asked you for the workload, you gave me TCP_RR output, not TCP_STREAM :/

"A single netperf TCP_RR could _also_ confirm the improvement:"


Thread overview: 17+ messages
2019-10-22 23:10 [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer Cong Wang
2019-10-22 23:10 ` [Patch net-next 1/3] tcp: get rid of ICSK_TIME_EARLY_RETRANS Cong Wang
2019-10-22 23:10 ` [Patch net-next 2/3] tcp: make tcp_send_loss_probe() boolean Cong Wang
2019-10-22 23:10 ` [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer Cong Wang
2019-10-22 23:24   ` Eric Dumazet
2019-10-23  1:10     ` Cong Wang
2019-10-23  2:15       ` Eric Dumazet
2019-10-23 17:40         ` Cong Wang
2019-10-23 18:14           ` Eric Dumazet [this message]
2019-10-23 18:30             ` Cong Wang
2019-10-23 18:55               ` Eric Dumazet
2019-10-28 18:29 ` [Patch net-next 0/3] " David Miller
2019-10-28 18:34   ` Eric Dumazet
2019-10-28 20:13     ` Cong Wang
2019-10-28 20:31       ` Eric Dumazet
2019-10-29  0:49         ` Cong Wang
2019-10-30 21:56 ` David Miller
