Linux-NVME Archive on
From: Sagi Grimberg
To: "Belanger, Martin", Hannes Reinecke, Martin Belanger
Subject: Re: [PATCH 1/1] Add 'Transport Interface' (triface) option. This can be used to specify the IP interface to use for the connection. The driver uses that to set SO_BINDTODEVICE on the socket before connecting.
Date: Fri, 7 May 2021 11:20:29 -0700

On 5/6/21 8:46 AM, Belanger, Martin wrote:
>> On 5/6/21 8:05 AM, Hannes Reinecke wrote:
>>> On 5/5/21 4:31 PM, Belanger, Martin wrote:
>> [ .. ]
>>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state
>>>> group default qlen 1000
>>>>       link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>       inet scope global lo
>>>>          valid_lft forever preferred_lft forever
>>>> 2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>       link/ether 08:00:27:21:65:ec brd ff:ff:ff:ff:ff:ff
>>>>       inet scope global enp0s3
>>>>          valid_lft forever preferred_lft forever
>>>> 3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>       link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
>>>>       inet scope global enp0s8
>>>>          valid_lft forever preferred_lft forever
>>>> The above is a VM that I configured with the same IP address
>>>> ( on all interfaces. Doing a reverse lookup to identify
>>>> the unique interface associated with would simply not
>>>> work here. And this is why the option host_iface is required. I
>>>> understand that the above config does not represent a standard host
>>>> system, but I'm using this to prove a point: "we can never know how a
>>>> user will configure their system and the above configuration is
>>>> perfectly fine by Linux".
>>> ... and messing up any switch MAC address caching when doing so. I
>>> guess the network admin will come down hard on you if you try that on
>>> a production system.
>>> And I sincerely question whether this is a valid use-case; I'm already
>>> getting grief from our network admins if I dare to put two network
>>> interfaces from the same machine in the same network.
>>>> The current TCP implementation for host_traddr uses
>>>> bind()-before-connect(). This is a common construct to set the source
>>>> IP address on the socket before connecting. This has no effect on how
>>>> Linux will select the interface for the connection. That's because
>>>> Linux uses the Weak End System model as described in RFC1122 [2].
>>>> Setting the source address on a connection is a common requirement
>>>> that linux-nvme needs to support. In fact, specifying the Source IP
>>>> address is a mandatory FedGov requirement (e.g. connection to a
>>>> RADIUS/TACACS+ server). Consider the following configuration.
>>>> $ ip addr list dev enp0s8
>>>> 3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
>>>>       link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
>>>>       inet brd scope global dynamic noprefixroute enp0s8
>>>>          valid_lft 426sec preferred_lft 426sec
>>>>       inet scope global secondary enp0s8
>>>>          valid_lft forever preferred_lft forever
>>>>       inet scope global secondary enp0s8
>>>>          valid_lft forever preferred_lft forever
>>>>       inet scope global secondary enp0s8
>>>>          valid_lft forever preferred_lft forever
>>>> Here we can see that several addresses are associated with interface
>>>> enp0s8. By default, Linux will select the default IP address,
>>>>, as the source address when connecting over interface
>>>> enp0s8. Some users, however, want the ability to specify a different
>>>> address (e.g.,
>>>> to be used as the source address.
>>>> The option host_traddr can be used as-is to perform this function (I
>>>> tested it).
>>> No disagreement here.
>>>> In conclusion, I believe that for TCP we need 2 options. One that can
>>>> be used to specify an interface. And one that can be used to set the
>>>> source address. And users should be allowed to use one or the other,
>>>> or both, or none.
>>>> Of course, the documentation for host_traddr will need some
>>>> clarification. It should state that when used for TCP connection,
>>>> this option only sets the source address. And the documentation for
>>>> host_iface should say that this option only applies to TCP
>>>> connections.
>>> I'm with James Smart here. I do fail to see the need for 'host_iface'
>>> _without_ 'host_traddr'; especially for IPv6 where several addresses
>>> are standard just specifying 'host_iface' simply is not enough, and
>>> one has to specify 'host_traddr' additionally.
>>> So 'host_iface' should be contingent on 'host_traddr', meaning we can
>>> just expand the syntax of 'host_traddr'.
>>> One easy possibility would be to add ',nobind' to the host_traddr
>>> syntax which would indicate that we should _not_ bind to the
>>> underlying interface; I do think that binding to the respective
>>> interface should be the default.
>> A-ha. Just spoke to our network folks, and they clarified the usage of binding
>> to an IP address vs binding to a network interface.
>> Apparently, binding to a source IP address does just that, setting the source
>> IP address of the outgoing packet. That packet will _still_ be subjected to the
>> normal routing table, as the routing table is just influenced by the
>> _destination_ IP address.
>> So if we want to have it routed via a specific interface (and thereby
>> influencing the routing table) we need to bind it to that interface.
>> The only valid scenario our network folks could come up with where we do
>> _not_ want to bind to an interface is for asymmetric flows, ie in cases where
>> the outgoing flow is routed to one interface and the incoming flow is arriving
>> on another interface. But even they admitted that it's not a common
>> scenario, and probably will be killed by anti-spoofing software running on
>> the core switches ...
>> But if we want to support _that_ then clearly binding to a specific interface
>> doesn't work.
>> So I would vote for making binding to the network interface holding the IP
>> address the default, and add an option ',nobind' to host_traddr to skip it.
>> Cheers,
>> Hannes
>> --
>> Dr. Hannes Reinecke		        Kernel Storage Architect
>>			               +49 911 74053 688
>> SUSE Software Solutions Germany GmbH, 90409 Nürnberg
>> GF: F. Imendörffer, HRB 36809 (AG Nürnberg)
> Hi Hannes,
> If the only concern here is the addition of yet another option (--host-iface), then may I suggest a simpler approach. What I'm proposing adheres to RFC4007 [1], which defines a way to specify an interface by using the '%' delimiter between the Destination IP address and the Interface. In fact, "ping" uses this approach [2]. With ping, one can force the connection to go through a specific interface like this:
> ping <dest-ip-addr>%<interface>

Ping only supports this syntax for IPv6, no?

> Extending this approach to nvme-cli, we arrive at something like this:
> nvme discover --traddr --host-traddr ....

We already support this for IPv6. We could do the same for IPv4, but
would users expect this syntax for IPv4?
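
To make the '%' idea concrete, here is a rough userspace sketch of how
an ADDR%IFACE string could be split for either address family. This is
not from the patch or from nvme-cli; the function name
split_scoped_addr is hypothetical, and the addresses in the usage note
are documentation-range examples (192.0.2.0/24, 2001:db8::/32), not the
ones from this thread.

```python
import ipaddress


def split_scoped_addr(text):
    """Split 'ADDR%IFACE' into (ip_address_object, iface_name_or_None).

    Hypothetical helper: treats everything after the first '%' as an
    interface name, for IPv4 and IPv6 alike.
    """
    addr, sep, iface = text.partition("%")
    if sep and not iface:
        raise ValueError("empty interface name after '%'")
    # ip_address() raises ValueError if addr is not a valid IPv4/IPv6 literal.
    ip = ipaddress.ip_address(addr)
    return ip, (iface or None)
```

For example, split_scoped_addr("192.0.2.10%enp0s8") yields the IPv4
address object plus "enp0s8", and a plain "2001:db8::1" yields the
address with None for the interface.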

> This tells nvme to connect to on interface enp0s8. We make no change to the --host-traddr option. It continues to be used to specify the Source IP address only (for the rare cases where users want to specify a Source Address other than the default). With this, the interface is specified by name and not by its associated address. This is not only more intuitive, but, as I stated before, eliminates the problem caused by mapping the same IP address to multiple interfaces (not to mention that doing a reverse lookup on an IP address to find the interface is extra work that we don’t need to do in kernel space).

Maybe we do something like ping -I for host_traddr. From the ping man page:

-I interface
       interface is either an address, an interface name or a VRF name.
       If interface is an address, it sets source address to specified
       interface address. If interface is an interface name, it sets
       source interface to specified interface. If interface is a VRF
       name, each packet is routed using the corresponding routing
       table; in this case, the -I option can be repeated to specify a
       source address. NOTE: For IPv6, when doing ping to a link-local
       scope address, link specification (by the '%'-notation in
       destination, or by this option) can be used but it is no longer
       required.

Without the repetition though, unless we need to support two interfaces
that share the same multiple addresses in the same subnet, which sounds
completely crazy to me...
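
As a userspace illustration only (the real change would live in the
kernel driver), the ping -I style dispatch could look roughly like the
sketch below. The names classify_host_traddr and apply_host_traddr are
hypothetical. The sketch assumes that anything that does not parse as
an IP literal is an interface name, and it ignores the VRF case.

```python
import ipaddress
import socket


def classify_host_traddr(value):
    """Return ('address', normalized_ip) or ('interface', name)."""
    try:
        return ("address", str(ipaddress.ip_address(value)))
    except ValueError:
        # Assumption: whatever is not an IP literal is an interface name.
        return ("interface", value)


def apply_host_traddr(sock, value):
    """Apply value to an unconnected socket before connect().

    An address only sets the source IP via bind(); routing still follows
    the destination. An interface name pins the socket to that device
    with SO_BINDTODEVICE (Linux only, may need elevated privileges).
    The socket's address family must match the address given.
    """
    kind, val = classify_host_traddr(value)
    if kind == "address":
        sock.bind((val, 0))  # port 0: let the kernel pick the source port
    else:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                        val.encode() + b"\0")
    return kind
```

With this, "192.0.2.10" would be treated as a source address and
"enp0s8" as an interface, without a second option.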

