* [PATCH 1/2] net: Allow to create links with given ifindex
@ 2012-07-30  4:34 Pavel Emelyanov
  2012-07-30  4:36 ` [PATCH 2/2] veth: Allow to create peer link " Pavel Emelyanov
  2012-07-30 10:49 ` [PATCH 1/2] net: Allow to create links " Eric W. Biederman
  0 siblings, 2 replies; 19+ messages in thread
From: Pavel Emelyanov @ 2012-07-30  4:34 UTC (permalink / raw)
  To: Linux Netdev List, David Miller

Currently RTM_NEWLINK returns -EOPNOTSUPP if ifinfomsg->ifi_index is not
zero. I propose allowing a specific ifindex to be requested on link creation.
Checkpoint-restore requires this to correctly restore a net namespace
(i.e. a container). What to do with pre-created devices such as lo or the
sit fallback device is an open question, but for manually created devices
this patch solves the problem.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---

diff --git a/net/core/dev.c b/net/core/dev.c
index 0ebaea1..5966e2f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5533,7 +5533,12 @@ int register_netdevice(struct net_device *dev)
 		}
 	}
 
-	dev->ifindex = dev_new_index(net);
+	ret = -EBUSY;
+	if (!dev->ifindex)
+		dev->ifindex = dev_new_index(net);
+	else if (__dev_get_by_index(net, dev->ifindex))
+		goto err_uninit;
+
 	if (dev->iflink == -1)
 		dev->iflink = dev->ifindex;
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 334b930..76e19aa 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1801,8 +1801,6 @@ replay:
 			return -ENODEV;
 		}
 
-		if (ifm->ifi_index)
-			return -EOPNOTSUPP;
 		if (tb[IFLA_MAP] || tb[IFLA_MASTER] || tb[IFLA_PROTINFO])
 			return -EOPNOTSUPP;
 
@@ -1828,10 +1826,14 @@ replay:
 			return PTR_ERR(dest_net);
 
 		dev = rtnl_create_link(net, dest_net, ifname, ops, tb);
-
-		if (IS_ERR(dev))
+		if (IS_ERR(dev)) {
 			err = PTR_ERR(dev);
-		else if (ops->newlink)
+			goto out;
+		}
+
+		dev->ifindex = ifm->ifi_index;
+
+		if (ops->newlink)
 			err = ops->newlink(net, dev, tb, data);
 		else
 			err = register_netdevice(dev);
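
A minimal userspace sketch of the ifindex selection that the register_netdevice() hunk above introduces, for readers who want to poke at the behaviour outside the kernel. Everything named toy_* is hypothetical and only stands in for the kernel helpers (toy_index_in_use for __dev_get_by_index(), toy_new_index for dev_new_index()); the -ENOSPC path exists only because the toy table has a fixed size.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

#define TOY_MAX_DEVS 8

/* Toy stand-in for one network namespace (not the kernel's struct net). */
struct toy_net {
	int used[TOY_MAX_DEVS];	/* ifindices currently registered */
	size_t ndevs;
	int last_ifindex;	/* allocation cursor, as in dev_new_index() */
};

/* Plays the role of __dev_get_by_index(): is this ifindex taken? */
static int toy_index_in_use(const struct toy_net *net, int ifindex)
{
	for (size_t i = 0; i < net->ndevs; i++)
		if (net->used[i] == ifindex)
			return 1;
	return 0;
}

/* Plays the role of dev_new_index(): next free positive index. */
static int toy_new_index(struct toy_net *net)
{
	for (;;) {
		if (net->last_ifindex == INT_MAX)
			net->last_ifindex = 0;	/* wrap without signed overflow */
		net->last_ifindex++;
		if (!toy_index_in_use(net, net->last_ifindex))
			return net->last_ifindex;
	}
}

/*
 * Mirrors the patched hunk: 0 means "pick one for me", anything else is
 * honoured only if it is free. Returns the ifindex or a negative errno.
 */
static int toy_register(struct toy_net *net, int requested)
{
	int ifindex = requested;

	assert(net != NULL);
	if (net->ndevs >= TOY_MAX_DEVS)
		return -ENOSPC;	/* toy table full; not a kernel error path */

	if (!ifindex)
		ifindex = toy_new_index(net);
	else if (toy_index_in_use(net, ifindex))
		return -EBUSY;

	net->used[net->ndevs++] = ifindex;
	return ifindex;
}
```

Calling toy_register(&net, 0) behaves like the old code path, while a nonzero request either succeeds with exactly that index or fails with -EBUSY, which is the property checkpoint-restore relies on.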


* [PATCH 2/2] veth: Allow to create peer link with given ifindex
  2012-07-30  4:34 [PATCH 1/2] net: Allow to create links with given ifindex Pavel Emelyanov
@ 2012-07-30  4:36 ` Pavel Emelyanov
  2012-07-30 10:49 ` [PATCH 1/2] net: Allow to create links " Eric W. Biederman
  1 sibling, 0 replies; 19+ messages in thread
From: Pavel Emelyanov @ 2012-07-30  4:36 UTC (permalink / raw)
  To: Linux Netdev List, David Miller

The ifinfomsg is already there (thanks to kaber@ for foreseeing this a long
time ago), so take the given ifindex and register the netdev with it.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 5852361..496c026 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -348,6 +348,9 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 	if (tbp[IFLA_ADDRESS] == NULL)
 		eth_hw_addr_random(peer);
 
+	if (ifmp)
+		peer->ifindex = ifmp->ifi_index;
+
 	err = register_netdevice(peer);
 	put_net(net);
 	net = NULL;


* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-07-30  4:34 [PATCH 1/2] net: Allow to create links with given ifindex Pavel Emelyanov
  2012-07-30  4:36 ` [PATCH 2/2] veth: Allow to create peer link " Pavel Emelyanov
@ 2012-07-30 10:49 ` Eric W. Biederman
  2012-07-30 10:56   ` Eric W. Biederman
  2012-07-30 11:51   ` Eric Dumazet
  1 sibling, 2 replies; 19+ messages in thread
From: Eric W. Biederman @ 2012-07-30 10:49 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Netdev List, David Miller

Pavel Emelyanov <xemul@parallels.com> writes:

> Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
> is not zero. I propose to allow requesting ifindices on link creation. This
> is required by the checkpoint-restore to correctly restore a net namespace
> (i.e. -- a container). The question what to do with pre-created devices such
> as lo or sit fbdev is open, but for manually created devices this can be 
> solved by this patch.

Have you walked through and found the locations where we still rely on
ifindex being globally unique?

Last time I was working in this area there were several places where
things were indexed by just the interface index.

I suspect it might be easier to generate hotplug events at restart time
saying someone removed and added an identical set of network devices.
Certainly for physical hardware that needs to happen, because things
like mac addresses will change.

Eric


> Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
>
> ---
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0ebaea1..5966e2f 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5533,7 +5533,12 @@ int register_netdevice(struct net_device *dev)
>  		}
>  	}
>  
> -	dev->ifindex = dev_new_index(net);
> +	ret = -EBUSY;
> +	if (!dev->ifindex)
> +		dev->ifindex = dev_new_index(net);
> +	else if (__dev_get_by_index(net, dev->ifindex))
> +		goto err_uninit;
> +
>  	if (dev->iflink == -1)
>  		dev->iflink = dev->ifindex;
>  
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 334b930..76e19aa 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -1801,8 +1801,6 @@ replay:
>  			return -ENODEV;
>  		}
>  
> -		if (ifm->ifi_index)
> -			return -EOPNOTSUPP;
>  		if (tb[IFLA_MAP] || tb[IFLA_MASTER] || tb[IFLA_PROTINFO])
>  			return -EOPNOTSUPP;
>  
> @@ -1828,10 +1826,14 @@ replay:
>  			return PTR_ERR(dest_net);
>  
>  		dev = rtnl_create_link(net, dest_net, ifname, ops, tb);
> -
> -		if (IS_ERR(dev))
> +		if (IS_ERR(dev)) {
>  			err = PTR_ERR(dev);
> -		else if (ops->newlink)
> +			goto out;
> +		}
> +
> +		dev->ifindex = ifm->ifi_index;
> +
> +		if (ops->newlink)
>  			err = ops->newlink(net, dev, tb, data);
>  		else
>  			err = register_netdevice(dev);


* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-07-30 10:49 ` [PATCH 1/2] net: Allow to create links " Eric W. Biederman
@ 2012-07-30 10:56   ` Eric W. Biederman
  2012-07-31  9:03     ` Pavel Emelyanov
  2012-07-30 11:51   ` Eric Dumazet
  1 sibling, 1 reply; 19+ messages in thread
From: Eric W. Biederman @ 2012-07-30 10:56 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Netdev List, David Miller

ebiederm@xmission.com (Eric W. Biederman) writes:

> Pavel Emelyanov <xemul@parallels.com> writes:
>
>> Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
>> is not zero. I propose to allow requesting ifindices on link creation. This
>> is required by the checkpoint-restore to correctly restore a net namespace
>> (i.e. -- a container). The question what to do with pre-created devices such
>> as lo or sit fbdev is open, but for manually created devices this can be 
>> solved by this patch.
>
> Have you walked through and found the locations where we still rely on
> ifindex being globally unique?
>
> Last time I was working in this area there were serveral places where
> things were indexed by just the interface index.

If it is really safe to make ifindex per network namespace at this
point you can make dev_new_index have a per network namespace base
counter, and that will fix your problems with the loopback device.

Unless you have done the work to root out the last of dependencies on
ifindex being globally unique I think you will run into some operational
problems.

Eric


* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-07-30 10:49 ` [PATCH 1/2] net: Allow to create links " Eric W. Biederman
  2012-07-30 10:56   ` Eric W. Biederman
@ 2012-07-30 11:51   ` Eric Dumazet
  2012-07-30 12:33     ` Eric W. Biederman
  1 sibling, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-07-30 11:51 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Pavel Emelyanov, Linux Netdev List, David Miller

On Mon, 2012-07-30 at 03:49 -0700, Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
> 
> > Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
> > is not zero. I propose to allow requesting ifindices on link creation. This
> > is required by the checkpoint-restore to correctly restore a net namespace
> > (i.e. -- a container). The question what to do with pre-created devices such
> > as lo or sit fbdev is open, but for manually created devices this can be 
> > solved by this patch.
> 
> Have you walked through and found the locations where we still rely on
> ifindex being globally unique?
> 
> Last time I was working in this area there were serveral places where
> things were indexed by just the interface index.

Really ? This would be very strange.

AFAIK dev_new_index() is always called, even in the
dev_change_net_namespace() case if there is a conflict.

And dev_new_index() could use a pernet net->ifindex instead of a
shared/static one.
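
A rough userspace model of that suggestion, assuming the counter simply moves from a function-local static into each namespace. The pn_* names and the fixed-size table are hypothetical; only the idea of a per-namespace net->ifindex cursor comes from the thread.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define PN_MAX_DEVS 4

/* Toy namespace: the cursor lives here instead of in a shared static. */
struct pn_net {
	int ifindex;		/* per-namespace cursor (the suggested net->ifindex) */
	int devs[PN_MAX_DEVS];	/* ifindices registered in this namespace */
	size_t ndevs;
};

/* Is this ifindex already used inside this namespace? */
static int pn_in_use(const struct pn_net *net, int ifindex)
{
	for (size_t i = 0; i < net->ndevs; i++)
		if (net->devs[i] == ifindex)
			return 1;
	return 0;
}

/* Next free positive index, searched only within this namespace. */
static int pn_new_index(struct pn_net *net)
{
	int ifindex = net->ifindex;

	for (;;) {
		ifindex = (ifindex == INT_MAX) ? 1 : ifindex + 1;
		if (!pn_in_use(net, ifindex))
			return net->ifindex = ifindex;
	}
}

/* Registers one device; returns its ifindex, or -1 if the toy table is full. */
static int pn_add_dev(struct pn_net *net)
{
	assert(net != NULL);
	if (net->ndevs >= PN_MAX_DEVS)
		return -1;
	net->devs[net->ndevs] = pn_new_index(net);
	return net->devs[net->ndevs++];
}
```

With two zero-initialised namespaces, the first device registered in each one gets ifindex 1, so the same number now names different devices in different namespaces. That is exactly the loss of global uniqueness being debated here.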


* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-07-30 11:51   ` Eric Dumazet
@ 2012-07-30 12:33     ` Eric W. Biederman
  2012-07-31  9:06       ` Pavel Emelyanov
  0 siblings, 1 reply; 19+ messages in thread
From: Eric W. Biederman @ 2012-07-30 12:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Pavel Emelyanov, Linux Netdev List, David Miller

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Mon, 2012-07-30 at 03:49 -0700, Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul@parallels.com> writes:
>> 
>> > Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
>> > is not zero. I propose to allow requesting ifindices on link creation. This
>> > is required by the checkpoint-restore to correctly restore a net namespace
>> > (i.e. -- a container). The question what to do with pre-created devices such
>> > as lo or sit fbdev is open, but for manually created devices this can be 
>> > solved by this patch.
>> 
>> Have you walked through and found the locations where we still rely on
>> ifindex being globally unique?
>> 
>> Last time I was working in this area there were serveral places where
>> things were indexed by just the interface index.
>
> Really ? This would be very strange.

There at least were places that used oif or iif without being pernet
last time I was working on this.

It was never code that I understood particularly well so my memory of
what that code is, is unfortunately fuzzy.

> AFAIK dev_new_index() is always called, even in the
> dev_change_net_namespace() case if there is a conflict.

Except we never have a conflict because it takes an absurd number of
network devices to cause a 32bit counter to wrap.

> And dev_new_index() could use a pernet net->ifindex instead of a
> shared/static one.

Yes.  I made all of the core changes, and held back on making
dev_new_index() use a pernet net->ifindex because of a couple of problem
cases.

It has been a long time and those cases might have been fixed.

I'm not seeing anything obvious in the network stack with a quick skim,
but before we start relying on the property that interface indices are
not globally unique I expect a good hard look at the networking stack
to see if any of those cases where there were problems still exist.

Eric


* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-07-30 10:56   ` Eric W. Biederman
@ 2012-07-31  9:03     ` Pavel Emelyanov
  2012-07-31 11:58       ` Eric W. Biederman
  0 siblings, 1 reply; 19+ messages in thread
From: Pavel Emelyanov @ 2012-07-31  9:03 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Netdev List, David Miller

On 07/30/2012 02:56 PM, Eric W. Biederman wrote:
> ebiederm@xmission.com (Eric W. Biederman) writes:
> 
>> Pavel Emelyanov <xemul@parallels.com> writes:
>>
>>> Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
>>> is not zero. I propose to allow requesting ifindices on link creation. This
>>> is required by the checkpoint-restore to correctly restore a net namespace
>>> (i.e. -- a container). The question what to do with pre-created devices such
>>> as lo or sit fbdev is open, but for manually created devices this can be 
>>> solved by this patch.
>>
>> Have you walked through and found the locations where we still rely on
>> ifindex being globally unique?
>>
>> Last time I was working in this area there were serveral places where
>> things were indexed by just the interface index.
> 
> If it is really safe to make ifindex per network namespace at this
> point you can make dev_new_ifindex have a per network namespace base
> counter, and that will fix your problems with the loopback device.

No, it's not so, unfortunately :(

First, let's imagine that on host A the loopback device got registered as
the first device, but on host B for some reason some other device got
registered first. In that case after migration from A to B the lo on B will
have an index equal to 2. And there's no strict requirement that lo's per-net
operations are registered first. Please correct me if I'm wrong.

Next. In fact, lo is not the only problem. Look at e.g. the sit versus ipgre
fallback devices. Both get created on netns creation and obtain whatever
ifindices are generated for them. Even if we make ifindex per netns, the
chances that sit gets registered _strictly_ before ipgre are zero, since they
are both modules.

> Unless you have done the work to root out the last of dependencies on
> ifindex being globally unique I think you will run into some operational
> problems.

I totally agree with that. Before doing this patch I revisited the ancient
attempt to make ifindices per netns and checked the issues Dave and you
discussed then -- I have looked through how the ifindices are used in the
networking code and found no places where the system-wide uniqueness is still
required. That's why I proposed this patch for inclusion. If you know the 
places I've missed, please let me know, I will work on it.

> Eric
> 
> .
> 

Thanks,
Pavel


* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-07-30 12:33     ` Eric W. Biederman
@ 2012-07-31  9:06       ` Pavel Emelyanov
  0 siblings, 0 replies; 19+ messages in thread
From: Pavel Emelyanov @ 2012-07-31  9:06 UTC (permalink / raw)
  To: Eric W. Biederman, Eric Dumazet; +Cc: Linux Netdev List, David Miller

> I'm not seeing anything obvious in the network stack with a quick skim,
> but before we start relying on the property that interface indicies are
> not globally unique I expect an good hard look at the networking stack
> to see if any of those cases where there were problems still exist.

Just an idea -- is it worth moving the possibility to have ifindices intersect
under CONFIG_<SOMETHING> (EXPERT/CHECKPOINT_RESTORE) to let a wider audience
check the code in real life?

> Eric
> 
> .
> 

Thanks,
Pavel


* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-07-31  9:03     ` Pavel Emelyanov
@ 2012-07-31 11:58       ` Eric W. Biederman
  2012-07-31 13:30         ` Pavel Emelyanov
  2012-08-02 10:28         ` Eric Dumazet
  0 siblings, 2 replies; 19+ messages in thread
From: Eric W. Biederman @ 2012-07-31 11:58 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Netdev List, David Miller

Pavel Emelyanov <xemul@parallels.com> writes:

> On 07/30/2012 02:56 PM, Eric W. Biederman wrote:
>> ebiederm@xmission.com (Eric W. Biederman) writes:
>> 
>>> Pavel Emelyanov <xemul@parallels.com> writes:
>>>
>>>> Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
>>>> is not zero. I propose to allow requesting ifindices on link creation. This
>>>> is required by the checkpoint-restore to correctly restore a net namespace
>>>> (i.e. -- a container). The question what to do with pre-created devices such
>>>> as lo or sit fbdev is open, but for manually created devices this can be 
>>>> solved by this patch.
>>>
>>> Have you walked through and found the locations where we still rely on
>>> ifindex being globally unique?
>>>
>>> Last time I was working in this area there were serveral places where
>>> things were indexed by just the interface index.
>> 
>> If it is really safe to make ifindex per network namespace at this
>> point you can make dev_new_ifindex have a per network namespace base
>> counter, and that will fix your problems with the loopback device.
>
> Not it's not so unfortunately :(
>
> First, let's imagine that on host A the loopback device got registered as
> first device, but on host B for some reason some other device got registered
> first. In that case after migration from A to B the lo on B will have index
> equals 2. And there's no any strict requirement that lo's per net operations
> are registered first. Please, correct me if I'm wrong.

Actually there is a hard requirement that the loopback device be the
last device in a network namespace to be unregistered.  We meet that
requirement by registering the loopback device first
"net/core/dev.c:net_dev_init()".

> Next. In fact, lo is not the only problem. Look at the e.g. sit versus ipgre
> fallback devices. Both gets created on netns creation and obtain whatever
> ifindices are generated for them. Even if we make ifidex per netns chances
> that sit gets registered _strictly_ before ipgre equal zero, since they are
> both modules.

True.  However those fallback devices should no longer be needed,
and even if they are I think you can delete and recreate them.

Making lo the particularly interesting case.

>> Unless you have done the work to root out the last of dependencies on
>> ifindex being globally unique I think you will run into some operational
>> problems.
>
> I totally agree with that. Before doing this patch I revisited the ancient
> attempt to make ifindices per netns and checked the issues Dave and you
> discussed then -- I have looked through how the ifindices are used in the
> networking code and found no places where the system-wide uniqueness is still
> required. That's why I proposed this patch for inclusion. If you know the 
> places I've missed, please let me know, I will work on it.

I took a quick look and I did not see anything.  I saw places under
net/sched/ that looked a bit suspicious, and of course there are places
where we use oif and iif in some of the routing code that make me wonder
a bit.  But if you have looked and I have looked, I think we are ok.

> Just an idea -- is it worth moving the possibility to have ifindidces intersect
> under CONFIG_<SOMETHING> (EXPERT/CHECKPOINT_RESTORE) to let wider audience check
> the code in real-life?

I think the best testing we are going to get, diversity-wise, is to put
a per netns counter into dev_new_index when net-next opens up.

Having an ifindex that we can only set at netdevice creation time seems
reasonable.  

Eric


* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-07-31 11:58       ` Eric W. Biederman
@ 2012-07-31 13:30         ` Pavel Emelyanov
  2012-08-02 10:28         ` Eric Dumazet
  1 sibling, 0 replies; 19+ messages in thread
From: Pavel Emelyanov @ 2012-07-31 13:30 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Netdev List, David Miller

>> First, let's imagine that on host A the loopback device got registered as
>> first device, but on host B for some reason some other device got registered
>> first. In that case after migration from A to B the lo on B will have index
>> equals 2. And there's no any strict requirement that lo's per net operations
>> are registered first. Please, correct me if I'm wrong.
> 
> Actually there is a hard requirement that the loopback device be the
> last device in a network namespace to be unregistered.  We meet that
> requirement by registering the loopback device first
> "net/core/dev.c:net_dev_init()".

Hm... Indeed, and this is good news!

>> Next. In fact, lo is not the only problem. Look at the e.g. sit versus ipgre
>> fallback devices. Both gets created on netns creation and obtain whatever
>> ifindices are generated for them. Even if we make ifidex per netns chances
>> that sit gets registered _strictly_ before ipgre equal zero, since they are
>> both modules.
> 
> True.  However those fallback devices should no longer be needed,
> and even if they are I think you can delete and recreate them.

Good idea! I will look in that direction.

> Making lo the particularly interesting case.

Yup, provided we can manually recreate those auto-created devices this solves
the issue.

>> Just an idea -- is it worth moving the possibility to have ifindidces intersect
>> under CONFIG_<SOMETHING> (EXPERT/CHECKPOINT_RESTORE) to let wider audience check
>> the code in real-life?
> 
> I think the best testing we are going to get diversity wise is to create
> a per netns counter into dev_new_index when net-next opens up.
> 
> Having an ifindex that we can only set at netdevice creation time seems
> reasonable.  

OK, thank you, Eric.

> Eric
> .
> 

Thanks,
Pavel


* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-07-31 11:58       ` Eric W. Biederman
  2012-07-31 13:30         ` Pavel Emelyanov
@ 2012-08-02 10:28         ` Eric Dumazet
  2012-08-02 11:09           ` Eric W. Biederman
  2012-08-02 23:26           ` David Miller
  1 sibling, 2 replies; 19+ messages in thread
From: Eric Dumazet @ 2012-08-02 10:28 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Pavel Emelyanov, Linux Netdev List, David Miller

On Tue, 2012-07-31 at 04:58 -0700, Eric W. Biederman wrote:

> Making lo the particularly interesting case.


BTW, I noticed in my benchmarks, that once I remove the contention on
dst refcnt (using a percpu cache of dsts), I have a strange performance
cost accessing net->loopback_dev->ifindex in ip_route_output_key.

Strange because I see no false sharing on this ifindex location for
loopback device.

So we probably can save some cycles adding a net->loopback_ifindex
to remove one dereference.

If ifindexes are per network namespace, I guess we'll need to change
arp_hashfn() or else we'll use some slots more than others.

diff --git a/include/net/arp.h b/include/net/arp.h
index 7f7df93..37aac58 100644
--- a/include/net/arp.h
+++ b/include/net/arp.h
@@ -10,7 +10,7 @@ extern struct neigh_table arp_tbl;
 
 static inline u32 arp_hashfn(u32 key, const struct net_device *dev, u32 hash_rnd)
 {
-	u32 val = key ^ dev->ifindex;
+	u32 val = key ^ (u32)(unsigned long)dev;
 
 	return val * hash_rnd;
 }


* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-08-02 10:28         ` Eric Dumazet
@ 2012-08-02 11:09           ` Eric W. Biederman
  2012-08-02 23:37             ` David Miller
  2012-08-02 23:26           ` David Miller
  1 sibling, 1 reply; 19+ messages in thread
From: Eric W. Biederman @ 2012-08-02 11:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Pavel Emelyanov, Linux Netdev List, David Miller

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Tue, 2012-07-31 at 04:58 -0700, Eric W. Biederman wrote:
>
>> Making lo the particularly interesting case.
>
>
> BTW, I noticed in my benchmarks, that once I remove the contention on
> dst refcnt (using a percpu cache of dsts), I have a strange performance
> cost accessing net->loopback_dev->ifindex in ip_route_output_key.
>
> Strange because I see no false sharing on this ifindex location for
> loopback device.
>
> So we probably can save some cycles adding a net->loopback_ifindex
> to remove one dereference.

I am going to let Pavel tackle the actual work because only migration
really cares and he is working on migration right now.

But assuming we merge the per network namespace ifindex counter we
can change net->loopback_dev->ifindex to LOOPBACK_IFINDEX and
define "#define LOOPBACK_IFINDEX 1"

Certainly that works in the initial network namespace today and might be
worth testing.

> If ifindex are per network space, I guess we'll need to change
> arp_hashfn() or else we'll use some slots more than others.

Darn.  I hate being right about there being a few places to fix
up.

ndisc_hashfn also has the same limitation.

Eric

> diff --git a/include/net/arp.h b/include/net/arp.h
> index 7f7df93..37aac58 100644
> --- a/include/net/arp.h
> +++ b/include/net/arp.h
> @@ -10,7 +10,7 @@ extern struct neigh_table arp_tbl;
>  
>  static inline u32 arp_hashfn(u32 key, const struct net_device *dev, u32 hash_rnd)
>  {
> -	u32 val = key ^ dev->ifindex;
> +	u32 val = key ^ (u32)(unsigned long)dev;
>  
>  	return val * hash_rnd;
>  }


* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-08-02 10:28         ` Eric Dumazet
  2012-08-02 11:09           ` Eric W. Biederman
@ 2012-08-02 23:26           ` David Miller
  2012-08-03  5:45             ` Eric Dumazet
  1 sibling, 1 reply; 19+ messages in thread
From: David Miller @ 2012-08-02 23:26 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ebiederm, xemul, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 02 Aug 2012 12:28:30 +0200

> Strange because I see no false sharing on this ifindex location for
> loopback device.

Are you sure netdev->rx_dropped isn't being incremented?  That appears
as if it would land on the same cache line as netdev->ifindex.


* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-08-02 11:09           ` Eric W. Biederman
@ 2012-08-02 23:37             ` David Miller
  0 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2012-08-02 23:37 UTC (permalink / raw)
  To: ebiederm; +Cc: eric.dumazet, xemul, netdev

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Thu, 02 Aug 2012 04:09:39 -0700

> Eric Dumazet <eric.dumazet@gmail.com> writes:
> 
>> If ifindex are per network space, I guess we'll need to change
>> arp_hashfn() or else we'll use some slots more than others.
> 
> Darn.  I hate being right about there being a few places to fix
> up.
> 
> ndisc_hashfn also has the same limitation.

And netlabel's interface hashing as well.

LLC works with ifindex hashing and is not namespace aware.  It
should therefore be limited to &init_net and therefore OK.  Likewise
for the CAN code.


* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-08-02 23:26           ` David Miller
@ 2012-08-03  5:45             ` Eric Dumazet
  2012-08-03  5:51               ` Eric Dumazet
  2012-08-03 23:56               ` David Miller
  0 siblings, 2 replies; 19+ messages in thread
From: Eric Dumazet @ 2012-08-03  5:45 UTC (permalink / raw)
  To: David Miller; +Cc: ebiederm, xemul, netdev

On Thu, 2012-08-02 at 16:26 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 02 Aug 2012 12:28:30 +0200
> 
> > Strange because I see no false sharing on this ifindex location for
> > loopback device.
> 
> Are you sure netdev->rx_dropped isn't being incremented?  That appears
> as if it would land on the same cache line as netdev->ifindex.
> 

Yes I am sure (by the way my output device was dummy0, not lo, but it's
the same for the case here)

offsetof(struct net_device, features)=0x80
... (all features are mostly read only)
offsetof(struct net_device, ifindex)=0xa0

struct net_device_stats stats; is untouched on dummy device

offsetof(struct net_device, rx_dropped)=0x160

So that's only the dereference, done millions of times per second, that shows
in the profiles, even if cache lines are clean and in cpu cache.
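
The offsets above can be turned into cache line numbers with a few lines of C. This is only a sketch: it assumes the structure starts on a cache line boundary and that lines are 64 bytes, neither of which is stated in the message, and the helper names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Index of the cache line holding a byte at this offset from an aligned base. */
static size_t cache_line_of(size_t offset, size_t line_bytes)
{
	assert(line_bytes != 0);
	return offset / line_bytes;
}

/* True if both offsets fall on the same cache line. */
static bool same_cache_line(size_t a, size_t b, size_t line_bytes)
{
	return cache_line_of(a, line_bytes) == cache_line_of(b, line_bytes);
}
```

Under those assumptions, ifindex at 0xa0 sits on line 2 together with features at 0x80, while rx_dropped at 0x160 is on line 5, which is consistent with the claim that rx_dropped updates do not dirty the line holding ifindex.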

I see that even more clearly in the IN_DEV_ROUTE_LOCALNET(in_dev) macro in
ip_route_input_slow(), which does so many derefs:

It's actually faster to avoid it, if we already have the net pointer in
hand: IN_DEV_NET_ROUTE_LOCALNET(in_dev, net)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index 67f9dda..d032780 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -104,9 +104,14 @@ static inline void ipv4_devconf_setall(struct in_device *in_dev)
 #define IN_DEV_ANDCONF(in_dev, attr) \
 	(IPV4_DEVCONF_ALL(dev_net(in_dev->dev), attr) && \
 	 IN_DEV_CONF_GET((in_dev), attr))
-#define IN_DEV_ORCONF(in_dev, attr) \
-	(IPV4_DEVCONF_ALL(dev_net(in_dev->dev), attr) || \
+
+#define IN_DEV_NET_ORCONF(in_dev, net, attr) \
+	(IPV4_DEVCONF_ALL(net, attr) || \
 	 IN_DEV_CONF_GET((in_dev), attr))
+
+#define IN_DEV_ORCONF(in_dev, attr) \
+	IN_DEV_NET_ORCONF(in_dev, dev_net(in_dev->dev), attr)
+
 #define IN_DEV_MAXCONF(in_dev, attr) \
 	(max(IPV4_DEVCONF_ALL(dev_net(in_dev->dev), attr), \
 	     IN_DEV_CONF_GET((in_dev), attr)))
@@ -133,6 +138,8 @@ static inline void ipv4_devconf_setall(struct in_device *in_dev)
 					IN_DEV_ORCONF((in_dev), \
 						      PROMOTE_SECONDARIES)
 #define IN_DEV_ROUTE_LOCALNET(in_dev)	IN_DEV_ORCONF(in_dev, ROUTE_LOCALNET)
+#define IN_DEV_NET_ROUTE_LOCALNET(in_dev, net)	\
+	IN_DEV_NET_ORCONF(in_dev, net, ROUTE_LOCALNET)
 
 #define IN_DEV_RX_REDIRECTS(in_dev) \
 	((IN_DEV_FORWARD(in_dev) && \
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index e4ba974..5e88e3b 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1587,13 +1587,11 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	if (ipv4_is_zeronet(daddr))
 		goto martian_destination;
 
-	if (likely(!IN_DEV_ROUTE_LOCALNET(in_dev))) {
-		if (ipv4_is_loopback(daddr))
-			goto martian_destination;
+	if (ipv4_is_loopback(daddr) && !IN_DEV_NET_ROUTE_LOCALNET(in_dev, net))
+		goto martian_destination;
 
-		if (ipv4_is_loopback(saddr))
-			goto martian_source;
-	}
+	if (ipv4_is_loopback(saddr) && !IN_DEV_NET_ROUTE_LOCALNET(in_dev, net))
+		goto martian_source;
 
 	/*
 	 *	Now we are ready to route packet.


* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-08-03  5:45             ` Eric Dumazet
@ 2012-08-03  5:51               ` Eric Dumazet
  2012-08-03 23:56               ` David Miller
  1 sibling, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2012-08-03  5:51 UTC (permalink / raw)
  To: David Miller; +Cc: ebiederm, xemul, netdev

On Fri, 2012-08-03 at 07:45 +0200, Eric Dumazet wrote:

> Yes I am sure (by the way my output device was dummy0, not lo, but its
> the same for the case here)

Of course, the ifindex was related to lo, sorry for the confusion,
-ENOCOFFEE_YET_THIS_MORNING


* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-08-03  5:45             ` Eric Dumazet
  2012-08-03  5:51               ` Eric Dumazet
@ 2012-08-03 23:56               ` David Miller
  2012-08-04  7:10                 ` Eric Dumazet
  1 sibling, 1 reply; 19+ messages in thread
From: David Miller @ 2012-08-03 23:56 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ebiederm, xemul, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 03 Aug 2012 07:45:29 +0200

> @@ -1587,13 +1587,11 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
>  	if (ipv4_is_zeronet(daddr))
>  		goto martian_destination;
>  
> -	if (likely(!IN_DEV_ROUTE_LOCALNET(in_dev))) {
> -		if (ipv4_is_loopback(daddr))
> -			goto martian_destination;
> +	if (ipv4_is_loopback(daddr) && !IN_DEV_NET_ROUTE_LOCALNET(in_dev, net))
> +		goto martian_destination;
>  
> -		if (ipv4_is_loopback(saddr))
> -			goto martian_source;
> -	}
> +	if (ipv4_is_loopback(saddr) && !IN_DEV_NET_ROUTE_LOCALNET(in_dev, net))
> +		goto martian_source;

Maybe clearer as:

	if ((ipv4_is_loopback(daddr) || ipv4_is_loopback(saddr)) &&
	    !IN_DEV_NET_ROUTE_LOCALNET(in_dev, net))
		goto martian_source;

* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-08-03 23:56               ` David Miller
@ 2012-08-04  7:10                 ` Eric Dumazet
  2012-08-04  8:25                   ` David Miller
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-08-04  7:10 UTC (permalink / raw)
  To: David Miller; +Cc: ebiederm, xemul, netdev

On Fri, 2012-08-03 at 16:56 -0700, David Miller wrote:
> Maybe clearer as:
> 
> 	if ((ipv4_is_loopback(daddr) || ipv4_is_loopback(saddr)) &&
> 	    !IN_DEV_NET_ROUTE_LOCALNET(in_dev, net))
> 		goto martian_source;

Clearer, but the handling of a martian destination is different from
that of a martian source ;)
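
To make the difference concrete, here is a minimal user-space sketch, not
the kernel code. The names classify_separate(), classify_merged(),
is_loopback() and the verdict enum are hypothetical stand-ins for the two
goto labels in ip_route_input_slow(); addresses are in host byte order for
simplicity, and the route_localnet flag stands in for
IN_DEV_NET_ROUTE_LOCALNET(in_dev, net).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical verdicts standing in for the kernel's goto labels. */
enum verdict { ROUTE_OK, MARTIAN_SOURCE, MARTIAN_DESTINATION };

/* 127.0.0.0/8 test on a host-byte-order address (assumption: the kernel
 * helper works on __be32, this model does not). */
static bool is_loopback(uint32_t addr)
{
	return (addr & 0xff000000u) == 0x7f000000u;
}

/* Shape of the posted patch: each address is checked on its own. */
static enum verdict classify_separate(uint32_t daddr, uint32_t saddr,
				      bool route_localnet)
{
	if (is_loopback(daddr) && !route_localnet)
		return MARTIAN_DESTINATION;
	if (is_loopback(saddr) && !route_localnet)
		return MARTIAN_SOURCE;
	return ROUTE_OK;
}

/* Shape of the merged condition: both cases take the same exit. */
static enum verdict classify_merged(uint32_t daddr, uint32_t saddr,
				    bool route_localnet)
{
	if ((is_loopback(daddr) || is_loopback(saddr)) && !route_localnet)
		return MARTIAN_SOURCE;
	return ROUTE_OK;
}

/* A loopback destination with an ordinary source is where they diverge. */
static void self_check(void)
{
	uint32_t lo_addr = 0x7f000001u;	/* 127.0.0.1 */
	uint32_t host    = 0x0a000001u;	/* 10.0.0.1 */

	assert(classify_separate(lo_addr, host, false) == MARTIAN_DESTINATION);
	assert(classify_merged(lo_addr, host, false) == MARTIAN_SOURCE);
}
```

With a loopback daddr and a normal saddr, the merged form reports a martian
source where the separate checks report a martian destination, so the
packet would go down the wrong handling path.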

* Re: [PATCH 1/2] net: Allow to create links with given ifindex
  2012-08-04  7:10                 ` Eric Dumazet
@ 2012-08-04  8:25                   ` David Miller
  0 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2012-08-04  8:25 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ebiederm, xemul, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 04 Aug 2012 09:10:40 +0200

> On Fri, 2012-08-03 at 16:56 -0700, David Miller wrote:
>> Maybe clearer as:
>> 
>> 	if ((ipv4_is_loopback(daddr) || ipv4_is_loopback(saddr)) &&
>> 	    !IN_DEV_NET_ROUTE_LOCALNET(in_dev, net))
>> 		goto martian_source;
> 
> Clearer, but the handling of a martian destination is different from
> that of a martian source ;)

Duh, I missed that, too many martians :-)

end of thread, other threads:[~2012-08-04  8:25 UTC | newest]

Thread overview: 19+ messages
2012-07-30  4:34 [PATCH 1/2] net: Allow to create links with given ifindex Pavel Emelyanov
2012-07-30  4:36 ` [PATCH 2/2] veth: Allow to create peer link " Pavel Emelyanov
2012-07-30 10:49 ` [PATCH 1/2] net: Allow to create links " Eric W. Biederman
2012-07-30 10:56   ` Eric W. Biederman
2012-07-31  9:03     ` Pavel Emelyanov
2012-07-31 11:58       ` Eric W. Biederman
2012-07-31 13:30         ` Pavel Emelyanov
2012-08-02 10:28         ` Eric Dumazet
2012-08-02 11:09           ` Eric W. Biederman
2012-08-02 23:37             ` David Miller
2012-08-02 23:26           ` David Miller
2012-08-03  5:45             ` Eric Dumazet
2012-08-03  5:51               ` Eric Dumazet
2012-08-03 23:56               ` David Miller
2012-08-04  7:10                 ` Eric Dumazet
2012-08-04  8:25                   ` David Miller
2012-07-30 11:51   ` Eric Dumazet
2012-07-30 12:33     ` Eric W. Biederman
2012-07-31  9:06       ` Pavel Emelyanov
