From: Tom Herbert <tom@herbertland.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Hannes Frederic Sowa <hannes@redhat.com>,
	Edward Cree <ecree@solarflare.com>,
	David Miller <davem@davemloft.net>,
	Alex Duyck <aduyck@mirantis.com>, Netdev <netdev@vger.kernel.org>,
	intel-wired-lan <intel-wired-lan@lists.osuosl.org>,
	Jesse Gross <jesse@kernel.org>,
	Eugenia Emantayev <eugenia@mellanox.com>,
	Jiri Benc <jbenc@redhat.com>,
	Saeed Mahameed <saeedm@mellanox.com>,
	Ariel Elior <ariel.elior@qlogic.com>,
	Michael Chan <michael.chan@broadcom.com>,
	Dept-GELinuxNICDev@qlogic.com
Subject: Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
Date: Tue, 21 Jun 2016 11:42:52 -0700	[thread overview]
Message-ID: <CALx6S34EXpxA5fs+HDUczMHaAD5q_yQMRVvwhCbSJCBO8_PRuQ@mail.gmail.com> (raw)
In-Reply-To: <CAKgT0UdsYo84J087KUTiiwuHGUaykPvpQTVxCdvE9k21tM5hsQ@mail.gmail.com>

On Tue, Jun 21, 2016 at 11:17 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Tue, Jun 21, 2016 at 10:40 AM, Hannes Frederic Sowa
> <hannes@redhat.com> wrote:
>> On 21.06.2016 10:27, Edward Cree wrote:
>>> On 21/06/16 18:05, Alexander Duyck wrote:
>>>> On Tue, Jun 21, 2016 at 1:22 AM, David Miller <davem@davemloft.net> wrote:
>>>>> But anyways, the vastness of the key is why we want to keep "sockets"
>>>>> out of network cards, because proper support of "sockets" requires
>>>>> access to information the card simply does not and should not have.
>>>> Right.  Really what I would like to see for most of these devices is a
>>>> 2 tuple filter where you specify the UDP port number, and the PF/VF ID
>>>> that the traffic is received on.
>>> But that doesn't make sense - the traffic is received on a physical network
>>> port, and it's the headers (i.e. flow) at that point that determine whether
>>> the traffic is encap or not.  After all, those headers are all that can
>>> determine which PF or VF it's sent to; and if it's multicast and goes to
>>> more than one of them, it seems odd for one to treat it as encap and the
>>> other to treat it as normal UDP - one of them must be misinterpreting it
>>> (unless the UDP is going to a userspace tunnel endpoint, but I'm ignoring
>>> that complication for now).
>>
>> Disabling offloading of packets is never going to cause data corruption
>> or misinterpretation. In some cases we can hint to the network card to do
>> even more (RSS+checksumming). We always have a safe choice, namely not
>> doing hw offloading.
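
Concretely, that safe fallback is simply software checksumming: if the
device did not (or could not) validate a packet, the stack computes the
checksum itself. A simplified sketch of the idea, modeled on the logic
around skb_checksum_complete() rather than copied verbatim from the
kernel:

#include <linux/skbuff.h>

/* If the Rx path did not mark the L4 checksum as verified, fall back
 * to folding one in software.  Simplified sketch, not verbatim kernel
 * code.
 */
static bool rx_l4_checksum_ok(struct sk_buff *skb)
{
	if (skb->ip_summed == CHECKSUM_UNNECESSARY)
		return true;			/* hw already verified it */
	return skb_checksum_complete(skb) == 0;	/* software fallback */
}
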
>
> Agreed.  Also we need to keep in mind that in many cases things like
> RSS and checksumming can be very easily made port specific since what
> we are talking about is just what is reported in the Rx descriptor and
> not any sort of change to the packet data.
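
Put differently, on receive these offloads are pure metadata. A
hypothetical driver sketch of that point (struct rx_desc and the flag
names are invented for illustration): the results ride in the
descriptor, the packet bytes are never touched, so skipping them for an
unrecognized port costs nothing and corrupts nothing.

#include <linux/skbuff.h>

struct rx_desc {			/* invented layout */
	__le32 rss_hash;
	u32 flags;
};
#define RX_DESC_L4_CSUM_OK	(1U << 0)
#define RX_DESC_RSS_VALID	(1U << 1)

static void rx_fill_offload_hints(struct sk_buff *skb,
				  const struct rx_desc *desc)
{
	if (desc->flags & RX_DESC_L4_CSUM_OK)
		skb->ip_summed = CHECKSUM_UNNECESSARY;	/* hint only */
	if (desc->flags & RX_DESC_RSS_VALID)
		skb_set_hash(skb, le32_to_cpu(desc->rss_hash),
			     PKT_HASH_TYPE_L4);
}
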
>
>> Multicast is often scoped; in some cases we have different multicast
>> scopes but the same addresses. In the case of scoped traffic, we must
>> verify the device as well and can't install the same flow on every NIC.
>
> Right.  Hopefully the NIC vendors are thinking ahead and testing to
> validate such cases where multicast or broadcast traffic doesn't do
> anything weird to their NICs in terms of offloads.
>
>>> At a given physical point in the network, a given UDP flow either is or is
>>> not carrying encapsulated traffic, and if it tries to be both then things
>>> are certain to break, just as much as if two different applications try to
>>> use the same UDP flow for two different application protocols.
>>
>> I think the example Tom was hinting at initially goes like this:
>>
>> A net namespace acts as a router and has a vxlan endpoint active. The
>> vxlan endpoint enables vxlan offloading on all net_devices in the same
>> namespace. Because we only identify the tunnel endpoint by UDP port
>> number, traffic which should just be forwarded, and never processed
>> locally, can suddenly be processed by the offloading hw units. Because
>> UDP ports only form a contract between the endpoints and not with the
>> routers in between, it would be illegal for the router to treat those
>> non-locally-destined packets as vxlan.
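
The root of the problem in that scenario is that a port-only hardware
match has no notion of "addressed to me". In sketch form — the
skb_is_locally_destined() predicate below is hypothetical, standing in
for the host's routing and socket lookup, which is the only thing that
can really answer the question:

#include <linux/skbuff.h>
#include <linux/udp.h>

/* Hypothetical predicate: would this packet be delivered to a local
 * socket, as opposed to being forwarded?
 */
bool skb_is_locally_destined(const struct sk_buff *skb);

/* What a port-only match does: fires on forwarded traffic too. */
static bool hw_match_says_vxlan(const struct udphdr *uh)
{
	return uh->dest == htons(4789);		/* IANA VXLAN port */
}

/* What would be correct: tunnel handling only for local delivery. */
static bool vxlan_rx_applies(const struct sk_buff *skb,
			     const struct udphdr *uh)
{
	return uh->dest == htons(4789) && skb_is_locally_destined(skb);
}
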
>
> Yes.  The problem is I am sure there are some vendors out there
> wanting to tout their product as being excellent at routing VXLAN
> traffic so they are probably exploiting this to try and claim
> performance gains.
>
> There is also some argument to be had for theory versus application.
> Arguably it is the customers that are leading to some of the dirty
> hacks as I think vendors are building NICs based on customer use cases
> versus following any specifications.  In most data centers the tunnel
> underlays will be deployed throughout the network and UDP will likely
> be blocked for anything that isn't being used explicitly for
> tunneling.  As such we seem to be seeing a lot of NICs that are only
> supporting one port for things like this instead of designing them to
> handle whatever we can throw at them.
>
Actually, I don't believe that's true. It is not typical to deploy
firewalls within a data center fabric, nor do we restrict applications
from binding to arbitrary UDP ports; they can pretty much transmit to
any port on any host, without cost, using an unconnected UDP socket. I
think it's more likely that NIC (and switch) vendors simply assumed
that port numbers can be treated as global values. That's expedient,
and at small scale we can probably get away with it, but at large scale
this will eventually bite someone.
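
To see why, consider that any unprivileged process can do the
following; nothing about these datagrams is tunnel traffic except the
destination port (192.0.2.1 is a documentation address, 4789 the IANA
VXLAN port):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in dst = { .sin_family = AF_INET };
	const char msg[] = "not a vxlan frame";

	inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);
	dst.sin_port = htons(4789);
	/* one unconnected socket can spray any port on any host */
	sendto(fd, msg, sizeof(msg), 0,
	       (struct sockaddr *)&dst, sizeof(dst));
	close(fd);
	return 0;
}
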

> I really think it may be a few more years before we hit the point
> where the vendors start to catch a clue about the fact that they need
> to have a generic approach that works in all cases versus what we have
> now where they are supporting whatever the buzzword of the day is and
> not looking much further down the road than that.  The fact is in a
> few years' time we might even have to start dealing with
> tunnel-in-tunnel type workloads to address the use of containers
> inside of KVM guests.  I'm pretty sure we don't have support for
> recursive tunnel offloads in hardware and likely never will.  To that
> end all I would really need is support for CHECKSUM_COMPLETE or outer
> Rx checksums enabled, RSS based on the outer source port assuming the
> destination port is recognized as a tunnel, the ability to have DF bit
> set for any of the inner tunnel headers, and GSO partial extended to
> support tunnel-in-tunnel scenarios.
>
> - Alex
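
CHECKSUM_COMPLETE is the item on that list that composes with arbitrary
nesting: the device reports one ones-complement sum over the packet and
does no protocol parsing at all, so the stack can fold that sum against
however many nested checksums the packet contains. A hypothetical
driver-side sketch (hw_csum stands in for whatever the descriptor
actually provides):

#include <linux/skbuff.h>

static void rx_report_csum_complete(struct sk_buff *skb, __wsum hw_csum)
{
	/* sum over every byte after the link-layer header; the stack
	 * verifies outer and inner checksums by folding this value */
	skb->csum = hw_csum;
	skb->ip_summed = CHECKSUM_COMPLETE;
}
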

Thread overview: 82+ messages
2016-06-16 19:20 [net-next PATCH v3 00/17] Future-proof tunnel offload handlers Alexander Duyck
2016-06-16 19:20 ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 19:20 ` [net-next PATCH v3 01/17] vxlan/geneve: Include udp_tunnel.h in vxlan/geneve.h and fixup includes Alexander Duyck
2016-06-16 19:20   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 23:06   ` Hannes Frederic Sowa
2016-06-16 23:06     ` [Intel-wired-lan] " Hannes Frederic Sowa
2016-06-16 19:20 ` [net-next PATCH v3 02/17] net: Combine GENEVE and VXLAN port notifiers into single functions Alexander Duyck
2016-06-16 19:20   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 22:45   ` Hannes Frederic Sowa
2016-06-16 22:45     ` [Intel-wired-lan] " Hannes Frederic Sowa
2016-06-16 19:21 ` [net-next PATCH v3 03/17] net: Merge VXLAN and GENEVE push notifiers into a single notifier Alexander Duyck
2016-06-16 19:21   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 22:47   ` Hannes Frederic Sowa
2016-06-16 22:47     ` [Intel-wired-lan] " Hannes Frederic Sowa
2016-06-16 19:21 ` [net-next PATCH v3 04/17] bnx2x: Move all UDP port notifiers to single function Alexander Duyck
2016-06-16 19:21   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 19:21 ` [net-next PATCH v3 05/17] bnxt: Update drivers to support unified UDP encapsulation offload functions Alexander Duyck
2016-06-16 19:21   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 19:21 ` [net-next PATCH v3 06/17] bnxt: Move GENEVE support from hard-coded port to using port notifier Alexander Duyck
2016-06-16 19:21   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 23:12   ` Michael Chan
2016-06-16 23:12     ` [Intel-wired-lan] " Michael Chan
2016-06-16 19:21 ` [net-next PATCH v3 07/17] benet: Replace ndo_add/del_vxlan_port with ndo_add/del_udp_enc_port Alexander Duyck
2016-06-16 19:21   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 19:21 ` [net-next PATCH v3 08/17] fm10k: " Alexander Duyck
2016-06-16 19:21   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 19:22 ` [net-next PATCH v3 09/17] i40e: Move all UDP port notifiers to single function Alexander Duyck
2016-06-16 19:22   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 19:22 ` [net-next PATCH v3 10/17] ixgbe: Replace ndo_add/del_vxlan_port with ndo_add/del_udp_enc_port Alexander Duyck
2016-06-16 19:22   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 19:22 ` [net-next PATCH v3 11/17] mlx4_en: " Alexander Duyck
2016-06-16 19:22   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 19:22 ` [net-next PATCH v3 12/17] mlx5_en: " Alexander Duyck
2016-06-16 19:22   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 19:22 ` [net-next PATCH v3 13/17] nfp: " Alexander Duyck
2016-06-16 19:22   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 19:22 ` [net-next PATCH v3 14/17] qede: Move all UDP port notifiers to single function Alexander Duyck
2016-06-16 19:22   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 19:23 ` [net-next PATCH v3 15/17] qlcnic: Replace ndo_add/del_vxlan_port with ndo_add/del_udp_enc_port Alexander Duyck
2016-06-16 19:23   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 19:23 ` [net-next PATCH v3 16/17] net: Remove deprecated tunnel specific UDP offload functions Alexander Duyck
2016-06-16 19:23   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 22:59   ` Hannes Frederic Sowa
2016-06-16 22:59     ` [Intel-wired-lan] " Hannes Frederic Sowa
2016-06-16 19:23 ` [net-next PATCH v3 17/17] vxlan: Add new UDP encapsulation offload type for VXLAN-GPE Alexander Duyck
2016-06-16 19:23   ` [Intel-wired-lan] " Alexander Duyck
2016-06-16 23:01   ` Hannes Frederic Sowa
2016-06-16 23:01     ` [Intel-wired-lan] " Hannes Frederic Sowa
2016-06-18  3:26 ` [net-next PATCH v3 00/17] Future-proof tunnel offload handlers David Miller
2016-06-18  3:26   ` [Intel-wired-lan] " David Miller
2016-06-20 17:05   ` Tom Herbert
2016-06-20 17:05     ` [Intel-wired-lan] " Tom Herbert
2016-06-20 18:11     ` Hannes Frederic Sowa
2016-06-20 18:11       ` [Intel-wired-lan] " Hannes Frederic Sowa
2016-06-20 19:27       ` Tom Herbert
2016-06-20 19:27         ` [Intel-wired-lan] " Tom Herbert
2016-06-20 21:36         ` Hannes Frederic Sowa
2016-06-20 21:36           ` [Intel-wired-lan] " Hannes Frederic Sowa
2016-06-20 21:45           ` Tom Herbert
2016-06-20 21:45             ` [Intel-wired-lan] " Tom Herbert
2016-06-21  8:34       ` David Miller
2016-06-21  8:34         ` [Intel-wired-lan] " David Miller
2016-06-21  8:22     ` David Miller
2016-06-21  8:22       ` [Intel-wired-lan] " David Miller
2016-06-21 10:41       ` Edward Cree
2016-06-21 10:41         ` [Intel-wired-lan] " Edward Cree
2016-06-21 15:23       ` Hannes Frederic Sowa
2016-06-21 15:23         ` [Intel-wired-lan] " Hannes Frederic Sowa
2016-06-21 17:05       ` Alexander Duyck
2016-06-21 17:05         ` [Intel-wired-lan] " Alexander Duyck
2016-06-21 17:27         ` Edward Cree
2016-06-21 17:27           ` [Intel-wired-lan] " Edward Cree
2016-06-21 17:40           ` Hannes Frederic Sowa
2016-06-21 17:40             ` [Intel-wired-lan] " Hannes Frederic Sowa
2016-06-21 18:17             ` Alexander Duyck
2016-06-21 18:17               ` [Intel-wired-lan] " Alexander Duyck
2016-06-21 18:42               ` Tom Herbert [this message]
2016-06-21 18:42                 ` Tom Herbert
2016-06-21 21:34                 ` Hannes Frederic Sowa
2016-06-21 21:34                   ` [Intel-wired-lan] " Hannes Frederic Sowa
2016-06-21 18:23             ` Edward Cree
2016-06-21 18:23               ` [Intel-wired-lan] " Edward Cree
