From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: virtualization@lists.linux-foundation.org,
	Network Development <netdev@vger.kernel.org>,
	Jason Wang <jasowang@redhat.com>
Subject: Re: [PATCH rfc 0/3] virtio-net: add tx-hash, rx-tstamp and tx-tstamp
Date: Wed, 6 Jan 2021 15:32:51 -0500
Message-ID: <CAF=yD-Lcad6Sw6zkQGrCqck+s3rit-m6FLL6th9=G2pZOr=1Gw@mail.gmail.com>
In-Reply-To: <CA+FuTSdEqk8gxptnOSpNnm6YPSJv=62wKHqe4GbVAiKQRUfmXQ@mail.gmail.com>

On Mon, Dec 28, 2020 at 8:15 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Mon, Dec 28, 2020 at 7:47 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Mon, Dec 28, 2020 at 02:51:09PM -0500, Willem de Bruijn wrote:
> > > On Mon, Dec 28, 2020 at 12:29 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Mon, Dec 28, 2020 at 11:22:30AM -0500, Willem de Bruijn wrote:
> > > > > From: Willem de Bruijn <willemb@google.com>
> > > > >
> > > > > RFC for three new features to the virtio network device:
> > > > >
> > > > > 1. pass tx flow hash and state to host, for routing + telemetry
> > > > > 2. pass rx tstamp to guest, for better RTT estimation
> > > > > 3. pass tx tstamp to host, for accurate pacing
> > > > >
> > > > > All three would introduce an extension to the virtio spec.
> > > > > I assume this would require opening three ballots against v1.2 at
> > > > > https://www.oasis-open.org/committees/ballots.php?wg_abbrev=virtio
> > > > >
> > > > > This RFC is to informally discuss the proposals first.
> > > > >
> > > > > The patchset is against v5.10. Evaluation additionally requires
> > > > > changes to qemu and at least one back-end. I implemented preliminary
> > > > > support in Linux vhost-net. Both patches available through github at
> > > > >
> > > > > https://github.com/wdebruij/linux/tree/virtio-net-txhash-1
> > > > > https://github.com/wdebruij/qemu/tree/virtio-net-txhash-1
> > > >
> > > > Any data on what the benefits are?
> > >
> > > For the general method, yes. For this specific implementation, not  yet.
> > >
> > > Swift congestion control is delay based. It won the best paper award
> > > at SIGCOMM this year. That paper has a lot of data:
> > > https://dl.acm.org/doi/pdf/10.1145/3387514.3406591 . Section 3.1 talks
> > > about the different components that contribute to delay and how to
> > > isolate them.
> >
> > And for the hashing part?
>
> A few concrete examples of error conditions that can be resolved are
> mentioned in the commits that add sk_rethink_txhash calls. Such as
> commit 7788174e8726 ("tcp: change IPv6 flow-label upon receiving
> spurious retransmission"):
>
> "
>     Currently a Linux IPv6 TCP sender will change the flow label upon
>     timeouts to potentially steer away from a data path that has gone
>     bad. However this does not help if the problem is on the ACK path
>     and the data path is healthy. In this case the receiver is likely
>     to receive repeated spurious retransmission because the sender
>     couldn't get the ACKs in time and has recurring timeouts.
>
>     This patch adds another feature to mitigate this problem. It
>     leverages the DSACK states in the receiver to change the flow
>     label of the ACKs to speculatively re-route the ACK packets.
>     In order to allow triggering on the second consecutive spurious
>     RTO, the receiver changes the flow label upon sending a second
>     consecutive DSACK for a sequence number below RCV.NXT.
> "
>
> I don't have quantitative data on the efficacy at scale at hand. Let
> me see what I can find. This will probably take some time, at least
> until people are back after the holidays. I didn't want to delay the
> patch, as the merge window was a nice time for RFC. But agreed that it
> deserves stronger justification.

The practical results mirror what the theory suggests: that in the
presence of multiple paths, of which one goes bad, this method
maintains connectivity where otherwise it would disconnect.

When IPv6 FlowLabel was included in path selection (e.g., LAG/ECMP),
flowlabel rotation on TCP timeout avoided the vast majority of TCP
disconnections that would otherwise have occurred due to link
failures in long-haul backbones, when an alternative path was
available.

So it's not a matter of percentages, just the existence of an
alternative healthy path on which the packets will eventually land
quite deterministically, as the sender rotates the txhash on each
timeout.
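
To make that argument concrete, here is a minimal toy sketch of hash
rotation over ECMP. It is not the kernel's sk_rethink_txhash code and
not how any particular switch hashes: ecmp_path(), send_with_rethink(),
the flow_id/txhash mixing and the modulo path choice are all
assumptions made up for illustration.

```python
import random


def ecmp_path(flow_id: int, txhash: int, n_paths: int) -> int:
    """Toy ECMP: mix the flow id with the tx hash and pick a path by modulo.

    Assumption: real devices use their own hash functions and inputs.
    """
    return hash((flow_id, txhash)) % n_paths


def send_with_rethink(flow_id, n_paths, bad_paths, rng, max_timeouts=64):
    """Rotate the tx hash on every simulated timeout.

    Returns (path, timeouts) once a healthy path is selected, or
    (None, max_timeouts) if no healthy path was found within the budget.
    """
    txhash = rng.getrandbits(32)
    for timeouts in range(max_timeouts + 1):
        path = ecmp_path(flow_id, txhash, n_paths)
        if path not in bad_paths:
            return path, timeouts
        # Simulated timeout on a bad path: pick a fresh hash so the
        # next transmission is likely steered onto a different path.
        txhash = rng.getrandbits(32)
    return None, max_timeouts


if __name__ == "__main__":
    rng = random.Random(1)
    path, timeouts = send_with_rethink(flow_id=42, n_paths=4,
                                       bad_paths={2}, rng=rng)
    print(f"landed on path {path} after {timeouts} timeout(s)")
```

With one bad path out of four, each rotation independently has roughly
a 3-in-4 chance of picking a healthy path in this model, so the number
of timeouts before recovery stays small. If every path is bad, no
amount of rotation helps and the function gives up.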

This method can be deployed based on a variety of "bad connection"
signals: besides timeouts, the aforementioned spurious retransmits,
for one. Independent of flowlabel rotation, this TCP connection-level
information can be valuable to the cloud provider to detect and
pinpoint network issues. As mentioned before, ideally we would pass
the type of signal along with the hash. But that also requires passing
that state in the guest from the TCP layer to the virtio-net device,
so that is left for separate later work. For now we just have the
reserved space in the header.

Michael, what is the best way to proceed with this? Send the patches
for review to net-next, or should I start by opening ballots to
https://www.oasis-open.org/committees/ballots.php?wg_abbrev=virtio
first? Thanks.
