From: Eric Dumazet <eric.dumazet@gmail.com>
To: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Jonathan Davies <Jonathan.Davies@citrix.com>,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	Wei Liu <wei.liu2@citrix.com>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	netdev <netdev@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	Paul Durrant <paul.durrant@citrix.com>,
	Christoffer Dall <christoffer.dall@linaro.org>,
	Felipe Franciosi <felipe.franciosi@citrix.com>,
	linux-arm-kernel@lists.infradead.org,
	David Vrabel <david.vrabel@citrix.com>
Subject: Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
Date: Wed, 15 Apr 2015 09:38:54 -0700	[thread overview]
Message-ID: <1429115934.7346.107.camel@edumazet-glaptop2.roam.corp.google.com> (raw)
In-Reply-To: <CAFLBxZYt7-v29ysm=f+5QMOw64_QhESjzj98udba+1cS-PfObA@mail.gmail.com>

On Wed, 2015-04-15 at 14:43 +0100, George Dunlap wrote:
> On Mon, Apr 13, 2015 at 2:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Mon, 2015-04-13 at 11:56 +0100, George Dunlap wrote:
> >
> >> Is the problem perhaps that netback/netfront delays TX completion?
> >> Would it be better to see if that can be addressed properly, so that
> >> the original purpose of the patch (fighting bufferbloat) can be
> >> achieved while not degrading performance for Xen?  Or at least, so
> >> that people get decent performance out of the box without having to
> >> tweak TCP parameters?
> >
> > Sure, please provide a patch, that does not break back pressure.
> >
> > But just in case, if Xen performance relied on bufferbloat, it might be
> > very difficult to reach a stable equilibrium : Any small change in stack
> > or scheduling might introduce a significant difference in 'raw
> > performance'.
> 
> So help me understand this a little bit here.  tcp_limit_output_bytes
> limits the amount of data allowed to be "in-transit" between a send()
> and the wire, is that right?
> 
> And so the "bufferbloat" problem you're talking about here are TCP
> buffers inside the kernel, and/or buffers in the NIC, is that right?
> 
> So ideally, you want this to be large enough to fill the "pipeline"
> all the way from send() down to actually getting out on the wire;
> otherwise, you'll have gaps in the pipeline, and the machinery won't
> be working at full throttle.
> 
> And the reason it's a problem is that many NICs now come with large
> send buffers; and effectively what happens then is that this makes the
> "pipeline" longer -- as the buffer fills up, the time between send()
> and the wire is increased.  This increased latency causes delays in
> round-trip-times and interferes with the mechanisms TCP uses to try to
> determine what the actual sustainable rate of data transmission is.
> 
> By limiting the number of "in-transit" bytes, you make sure that
> neither the kernel nor the NIC are going to have packets queued up for
> long lengths of time in buffers, and you keep this "pipeline" as close
> to the actual minimal length of the pipeline as possible.  And it
> sounds like for your 40G NIC, 128k is big enough to fill the pipeline
> without unduly making it longer by introducing buffering.
> 
> Is that an accurate picture of what you're trying to achieve?
> 
> But the problem for xennet (and a number of other drivers), as I
> understand it, is that at the moment the "pipeline" itself is just
> longer -- it just takes a longer time from the time you send a packet
> to the time it actually gets out on the wire.
> 
> So it's not actually accurate to say that "Xen performance relies on
> bufferbloat".  There's no buffering involved -- the pipeline is just
> longer, and so to fill up the pipeline you need more data.
> 
> Basically, to maximize throughput while minimizing buffering, for
> *any* connection, tcp_limit_output_bytes should ideally be around
> (min_tx_latency * max_bandwidth).  For physical NICs, the minimum
> latency is really small, but for xennet -- and I'm guessing for a lot
> of virtualized cards -- the min_tx_latency will be a lot higher,
> requiring a much higher ideal tcp_limit_output value.
> 
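
(Rough numbers, for a sense of scale: at 40Gbit/s, i.e. roughly 5GB/s, 128KB
in flight corresponds to only ~26us between send() and the wire, whereas a
path with ~1ms of tx-completion latency at the same rate would need on the
order of 5MB to keep the pipeline full.)
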
> Rather than trying to pick a single value which will be good for all
> NICs, it seems like it would make more sense to have this vary
> depending on the parameters of the NIC.  After all, for NICs that have
> low throughput -- say, old 100Mbit NICs -- even 128k may still
> introduce a significant amount of buffering.
> 
> Obviously one solution would be to allow the drivers themselves to set
> the tcp_limit_output_bytes, but that seems like a maintenance
> nightmare.
> 
> Another simple solution would be to allow drivers to indicate whether
> they have a high transmit latency, and have the kernel use a higher
> value by default when that's the case.
> 
> Probably the most sustainable solution would be to have the networking
> layer keep track of the average and minimum transmit latencies, and
> automatically adjust tcp_limit_output_bytes based on that.  (Keeping
> the minimum as well as the average because the whole problem with
> bufferbloat is that the more data you give it, the longer the apparent
> "pipeline" becomes.)
> 
> Thoughts?

My thoughts are that instead of these long discussions, you guys should read
the code:

                /* TCP Small Queues :
                 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
                 * This allows for :
                 *  - better RTT estimation and ACK scheduling
                 *  - faster recovery
                 *  - high rates
                 * Alas, some drivers / subsystems require a fair amount
                 * of queued bytes to ensure line rate.
                 * One example is wifi aggregation (802.11 AMPDU)
                 */
                limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
                limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
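
For illustration only, here is the same clamping logic as a standalone
userspace sketch (the pacing rates, truesize and cap values below are
made-up examples, not kernel defaults):

	#include <stdio.h>

	/* ~1ms worth of bytes at the pacing rate, floored at two skbs,
	 * capped by the sysctl -- mirrors the two lines quoted above.
	 */
	static unsigned long long tsq_limit(unsigned long long pacing_rate, /* bytes/sec */
					    unsigned long long skb_truesize, /* bytes */
					    unsigned long long sysctl_cap)   /* bytes */
	{
		unsigned long long limit = pacing_rate >> 10;

		if (limit < 2 * skb_truesize)
			limit = 2 * skb_truesize;
		if (limit > sysctl_cap)
			limit = sysctl_cap;
		return limit;
	}

	int main(void)
	{
		/* 1Gbit/s (~125MB/s), small skbs: ~120KB allowed in qdisc/device */
		printf("%llu\n", tsq_limit(125000000ULL, 2048ULL, 131072ULL));

		/* 40Gbit/s (~5GB/s), 64KB GSO skbs: pacing alone would allow ~4.8MB,
		 * but the 128KB cap discussed above wins */
		printf("%llu\n", tsq_limit(5000000000ULL, 65536ULL, 131072ULL));
		return 0;
	}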


Then you'll see that most of your questions are already answered.

Feel free to try to improve the behavior, as long as it does not hurt critical
workloads like TCP_RR, where we send very small messages, millions of times per second.

Thanks



