All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
@ 2015-05-17 23:42 Roopa Prabhu
  2015-05-18  5:11 ` Scott Feldman
  2015-05-18 20:19 ` David Miller
  0 siblings, 2 replies; 34+ messages in thread
From: Roopa Prabhu @ 2015-05-17 23:42 UTC (permalink / raw)
  To: davem, sfeldma, john.fastabend, jiri; +Cc: netdev

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This patch removes the calls to netdev_switch_fib_ipv4_abort when
there is an error in programming fib entry in hardware.

On most systems where you can offload routes to hardware,
doing routing in software is not an option (the cpu limitations
make routing impossible in software).

I understand that this was added to keep the first fib offload support
simple.

As discussed in the RFC patch, available options:
a) Fail fib entry add on hardware offload failure on switch devices, and
return appropriate error to the user by default (this patch)
b) make the behaviour in a) conditional on a global flag/sysctl (a per fib
entry flag will not work because by default user/routing-daemons dont care
if they are hardware offloaded)
c) for users/routing-daemons interested in controlling hardware
offload behaviour there is always the per fib entry flag RTNH_F_OFFLOAD

Considering the characteristics of the systems that support fib offloads,
this patch implements a). Also making a) the default will enable easier/faster
migration of existing routing apps to switch devices.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
RFC to v1:
  - rebase to net
  - remove fib_offload_disabled flag and all associated code as suggested by
    scott feldman

(This patch is currently against net because fib offload was introduced in 4.1.
If there is any reason to move it to net-next, I can respin)

v1 to v2:
  - include missing switchdev.h changes

 include/net/netns/ipv4.h  |    1 -
 include/net/switchdev.h   |    6 ------
 net/ipv4/fib_trie.c       |    5 +----
 net/switchdev/switchdev.c |   23 -----------------------
 4 files changed, 1 insertion(+), 34 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 614a49b..31919b1 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -47,7 +47,6 @@ struct netns_ipv4 {
 	int			fib_num_tclassid_users;
 #endif
 	struct hlist_head	*fib_table_hash;
-	bool			fib_offload_disabled;
 	struct sock		*fibnl;
 
 	struct sock  * __percpu	*icmp_sk;
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index d2e69ee..1dbcdd9 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -85,8 +85,6 @@ int netdev_switch_fib_ipv4_add(u32 dst, int dst_len, struct fib_info *fi,
 			       u8 tos, u8 type, u32 nlflags, u32 tb_id);
 int netdev_switch_fib_ipv4_del(u32 dst, int dst_len, struct fib_info *fi,
 			       u8 tos, u8 type, u32 tb_id);
-void netdev_switch_fib_ipv4_abort(struct fib_info *fi);
-
 #else
 
 static inline int netdev_switch_parent_id_get(struct net_device *dev,
@@ -160,10 +158,6 @@ static inline int netdev_switch_fib_ipv4_del(u32 dst, int dst_len,
 	return 0;
 }
 
-static inline void netdev_switch_fib_ipv4_abort(struct fib_info *fi)
-{
-}
-
 #endif
 
 #endif /* _LINUX_SWITCHDEV_H_ */
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 64c2076..ef523d3 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1171,7 +1171,6 @@ int fib_table_insert(struct fib_table *tb, struct fib_config *cfg)
 							 cfg->fc_nlflags,
 							 tb->tb_id);
 			if (err) {
-				netdev_switch_fib_ipv4_abort(fi);
 				kmem_cache_free(fn_alias_kmem, new_fa);
 				goto out;
 			}
@@ -1219,10 +1218,8 @@ int fib_table_insert(struct fib_table *tb, struct fib_config *cfg)
 					 cfg->fc_type,
 					 cfg->fc_nlflags,
 					 tb->tb_id);
-	if (err) {
-		netdev_switch_fib_ipv4_abort(fi);
+	if (err)
 		goto out_free_new_fa;
-	}
 
 	/* Insert new entry to the list. */
 	err = fib_insert_alias(t, tp, l, new_fa, fa, key);
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 055453d..d1269c4 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -325,9 +325,6 @@ int netdev_switch_fib_ipv4_add(u32 dst, int dst_len, struct fib_info *fi,
 		return 0;
 #endif
 
-	if (fi->fib_net->ipv4.fib_offload_disabled)
-		return 0;
-
 	dev = netdev_switch_get_dev_by_nhs(fi);
 	if (!dev)
 		return 0;
@@ -382,23 +379,3 @@ int netdev_switch_fib_ipv4_del(u32 dst, int dst_len, struct fib_info *fi,
 	return err;
 }
 EXPORT_SYMBOL_GPL(netdev_switch_fib_ipv4_del);
-
-/**
- *	netdev_switch_fib_ipv4_abort - Abort an IPv4 FIB operation
- *
- *	@fi: route FIB info structure
- */
-void netdev_switch_fib_ipv4_abort(struct fib_info *fi)
-{
-	/* There was a problem installing this route to the offload
-	 * device.  For now, until we come up with more refined
-	 * policy handling, abruptly end IPv4 fib offloading for
-	 * for entire net by flushing offload device(s) of all
-	 * IPv4 routes, and mark IPv4 fib offloading broken from
-	 * this point forward.
-	 */
-
-	fib_flush_external(fi->fib_net);
-	fi->fib_net->ipv4.fib_offload_disabled = true;
-}
-EXPORT_SYMBOL_GPL(netdev_switch_fib_ipv4_abort);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-17 23:42 [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware Roopa Prabhu
@ 2015-05-18  5:11 ` Scott Feldman
  2015-05-18 20:19 ` David Miller
  1 sibling, 0 replies; 34+ messages in thread
From: Scott Feldman @ 2015-05-18  5:11 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: David S. Miller, john fastabend, Jiří Pírko, Netdev

On Sun, May 17, 2015 at 4:42 PM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>
> This patch removes the calls to netdev_switch_fib_ipv4_abort when
> there is an error in programming fib entry in hardware.
>
> On most systems where you can offload routes to hardware,
> doing routing in software is not an option (the cpu limitations
> make routing impossible in software).
>
> I understand that this was added to keep the first fib offload support
> simple.
>
> As discussed in the RFC patch, available options:
> a) Fail fib entry add on hardware offload failure on switch devices, and
> return appropriate error to the user by default (this patch)
> b) make the behaviour in a) conditional on a global flag/sysctl (a per fib
> entry flag will not work because by default user/routing-daemons dont care
> if they are hardware offloaded)
> c) for users/routing-daemons interested in controlling hardware
> offload behaviour there is always the per fib entry flag RTNH_F_OFFLOAD
>
> Considering the characteristics of the systems that support fib offloads,
> this patch implements a). Also making a) the default will enable easier/faster
> migration of existing routing apps to switch devices.
>
> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>

Acked-by: Scott Feldman <sfeldma@gmail.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-17 23:42 [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware Roopa Prabhu
  2015-05-18  5:11 ` Scott Feldman
@ 2015-05-18 20:19 ` David Miller
  2015-05-19  0:21   ` John Fastabend
                     ` (2 more replies)
  1 sibling, 3 replies; 34+ messages in thread
From: David Miller @ 2015-05-18 20:19 UTC (permalink / raw)
  To: roopa; +Cc: sfeldma, john.fastabend, jiri, netdev

From: Roopa Prabhu <roopa@cumulusnetworks.com>
Date: Sun, 17 May 2015 16:42:05 -0700

> On most systems where you can offload routes to hardware,
> doing routing in software is not an option (the cpu limitations
> make routing impossible in software).

You absolutely do not get to determine this policy, none of us
do.

What matters is that by default the damn switch device being there
is %100 transparent to the user.

And the way to achieve that default is to do software routes as
a fallback.

I am not going to entertain changes of this nature which fail
route loading by default just because we've exceeded a device's
HW capacity to offload.

I thought I was _really_ clear about this at netdev 0.1

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-18 20:19 ` David Miller
@ 2015-05-19  0:21   ` John Fastabend
  2015-05-19  3:48     ` David Miller
  2015-05-19  5:57   ` roopa
  2015-05-28  9:42   ` Jiri Pirko
  2 siblings, 1 reply; 34+ messages in thread
From: John Fastabend @ 2015-05-19  0:21 UTC (permalink / raw)
  To: David Miller, roopa; +Cc: sfeldma, john.fastabend, jiri, netdev

On 05/18/2015 01:19 PM, David Miller wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> Date: Sun, 17 May 2015 16:42:05 -0700
> 
>> On most systems where you can offload routes to hardware,
>> doing routing in software is not an option (the cpu limitations
>> make routing impossible in software).
> 
> You absolutely do not get to determine this policy, none of us
> do.
> 
> What matters is that by default the damn switch device being there
> is %100 transparent to the user.
> 
> And the way to achieve that default is to do software routes as
> a fallback.
> 

Although this is not really usable for a class of switches this way,
as noted above.

> I am not going to entertain changes of this nature which fail
> route loading by default just because we've exceeded a device's
> HW capacity to offload.
> 
> I thought I was _really_ clear about this at netdev 0.1

So how about having an error strategy sysctl field that we can set
at provisioning time. I think this would align to Roopa's option (b).
This way we can default to "transparent" mode and the users where
this wont work can set the error mode. This way user land software
stacks that work today should continue to work in both modes.

.John

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-19  0:21   ` John Fastabend
@ 2015-05-19  3:48     ` David Miller
  2015-05-19  5:58       ` roopa
  0 siblings, 1 reply; 34+ messages in thread
From: David Miller @ 2015-05-19  3:48 UTC (permalink / raw)
  To: john.r.fastabend; +Cc: roopa, sfeldma, john.fastabend, jiri, netdev

From: John Fastabend <john.r.fastabend@intel.com>
Date: Mon, 18 May 2015 17:21:29 -0700

> So how about having an error strategy sysctl field that we can set
> at provisioning time. I think this would align to Roopa's option (b).
> This way we can default to "transparent" mode and the users where
> this wont work can set the error mode. This way user land software
> stacks that work today should continue to work in both modes.

Alert: This is not a switch provisioning issue.

You can frame it like that all day, and continue to talk about
low power cpus or other things which are completely and utterly
irrelevant.

Stop looking at how some specific piece of hardware is configured,
and think about what actually is asking the kernel to do stuff.

That's because the real issue is _semantics_ and what a Linux machine
is expected to do when you insert a route and valid reasons why a
route insertion can fail.

That is the _only_ issue.

And that has to do with what semantics _applcations_ making these
routing change requests expect.

There is nothing else that matters.

And since it is an issue of what semantics those application want and
are able to handle, that is where the request of changed behavior
belongs.

If we added your suggested sysctl, we'd have to name it
"sysctl_break_all_my_apps_please" because that is exactly what it
would be doing. :-)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-18 20:19 ` David Miller
  2015-05-19  0:21   ` John Fastabend
@ 2015-05-19  5:57   ` roopa
  2015-05-28  9:42   ` Jiri Pirko
  2 siblings, 0 replies; 34+ messages in thread
From: roopa @ 2015-05-19  5:57 UTC (permalink / raw)
  To: David Miller; +Cc: sfeldma, john.fastabend, jiri, netdev

On 5/18/15, 1:19 PM, David Miller wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> Date: Sun, 17 May 2015 16:42:05 -0700
>
>> On most systems where you can offload routes to hardware,
>> doing routing in software is not an option (the cpu limitations
>> make routing impossible in software).
> You absolutely do not get to determine this policy, none of us
> do.
>
> What matters is that by default the damn switch device being there
> is %100 transparent to the user.
>
> And the way to achieve that default is to do software routes as
> a fallback.
>
> I am not going to entertain changes of this nature which fail
> route loading by default just because we've exceeded a device's
> HW capacity to offload.
>
> I thought I was _really_ clear about this at netdev 0.1
sorry if i missed you emphasize this.
Current fib hw offload code aborts hw offload on the first hw offload 
failure and moves all routes to software and before 4.1 goes out I was 
trying to revisit this because the existing defaults will not work for 
most devices. To that effect, i started with an RFC patch to get some 
current thoughts and hence this patch.

I will follow up on the other options mentioned in the comment.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-19  3:48     ` David Miller
@ 2015-05-19  5:58       ` roopa
  2015-05-19 16:34         ` David Miller
  0 siblings, 1 reply; 34+ messages in thread
From: roopa @ 2015-05-19  5:58 UTC (permalink / raw)
  To: David Miller; +Cc: john.r.fastabend, sfeldma, john.fastabend, jiri, netdev

On 5/18/15, 8:48 PM, David Miller wrote:
> From: John Fastabend <john.r.fastabend@intel.com>
> Date: Mon, 18 May 2015 17:21:29 -0700
>
>> So how about having an error strategy sysctl field that we can set
>> at provisioning time. I think this would align to Roopa's option (b).
>> This way we can default to "transparent" mode and the users where
>> this wont work can set the error mode. This way user land software
>> stacks that work today should continue to work in both modes.
> Alert: This is not a switch provisioning issue.
>
> You can frame it like that all day, and continue to talk about
> low power cpus or other things which are completely and utterly
> irrelevant.
>
> Stop looking at how some specific piece of hardware is configured,
> and think about what actually is asking the kernel to do stuff.
>
> That's because the real issue is _semantics_ and what a Linux machine
> is expected to do when you insert a route and valid reasons why a
> route insertion can fail.
>
> That is the _only_ issue.
>
> And that has to do with what semantics _applcations_ making these
> routing change requests expect.
>
> There is nothing else that matters.
>
> And since it is an issue of what semantics those application want and
> are able to handle, that is where the request of changed behavior
> belongs.
>
> If we added your suggested sysctl, we'd have to name it
> "sysctl_break_all_my_apps_please" because that is exactly what it
> would be doing. :-)
understood. This seems to lean towards option c) where app explicitly 
requests offload with RTNH_F_OFFLOAD for every route.

from where I see, with the limitations on these boxes,
this requires every app, every `ip route` cmd running on the box to 
explicitly specify offload when running on this hardware. In which case 
having a way to specify a global system policy seemed appropriate. Hence 
the sysctl suggestion.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-19  5:58       ` roopa
@ 2015-05-19 16:34         ` David Miller
  2015-05-19 17:01           ` Jiri Pirko
  2015-05-19 19:47           ` Andy Gospodarek
  0 siblings, 2 replies; 34+ messages in thread
From: David Miller @ 2015-05-19 16:34 UTC (permalink / raw)
  To: roopa; +Cc: john.r.fastabend, sfeldma, john.fastabend, jiri, netdev

From: roopa <roopa@cumulusnetworks.com>
Date: Mon, 18 May 2015 22:58:54 -0700

> from where I see, with the limitations on these boxes, this requires
> every app, every `ip route` cmd running on the box to explicitly
> specify offload when running on this hardware.

Users know what they are doing, and will ask for a policy change.

The default should be what happens right now.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-19 16:34         ` David Miller
@ 2015-05-19 17:01           ` Jiri Pirko
  2015-05-19 19:47           ` Andy Gospodarek
  1 sibling, 0 replies; 34+ messages in thread
From: Jiri Pirko @ 2015-05-19 17:01 UTC (permalink / raw)
  To: David Miller; +Cc: roopa, john.r.fastabend, sfeldma, john.fastabend, netdev

Tue, May 19, 2015 at 06:34:18PM CEST, davem@davemloft.net wrote:
>From: roopa <roopa@cumulusnetworks.com>
>Date: Mon, 18 May 2015 22:58:54 -0700
>
>> from where I see, with the limitations on these boxes, this requires
>> every app, every `ip route` cmd running on the box to explicitly
>> specify offload when running on this hardware.
>
>Users know what they are doing, and will ask for a policy change.
>
>The default should be what happens right now.

I agree.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-19 16:34         ` David Miller
  2015-05-19 17:01           ` Jiri Pirko
@ 2015-05-19 19:47           ` Andy Gospodarek
  2015-05-19 20:28             ` David Miller
  1 sibling, 1 reply; 34+ messages in thread
From: Andy Gospodarek @ 2015-05-19 19:47 UTC (permalink / raw)
  To: David Miller
  Cc: roopa, john.r.fastabend, sfeldma, john.fastabend, jiri, netdev

On Tue, May 19, 2015 at 12:34:18PM -0400, David Miller wrote:
> From: roopa <roopa@cumulusnetworks.com>
> Date: Mon, 18 May 2015 22:58:54 -0700
> 
> > from where I see, with the limitations on these boxes, this requires
> > every app, every `ip route` cmd running on the box to explicitly
> > specify offload when running on this hardware.
> 
> Users know what they are doing, and will ask for a policy change.
> 

Dave,

I'm only writing this as one of the things you said at netdev01 is that
we as a community need to be clear when collectively making decisions
about kernel functionality so there are reasonable 'breadcrumbs' when
someone asks why a decision was made.  So...

Are you actually saying that if users complain loudly enough about the
current behavior (not the change Roopa has proposed) that you would be
open to considering a change the current behavior?

-or-

Are you saying that users who are aware of offload hardware will
begin to make specifically crafted requests (as Roopa suggested with
option 'c') and those requests will force the kernel to behave in
differently that the current default behavior?

-or-

[Insert different conclusion here.]

Thanks,

-andy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-19 19:47           ` Andy Gospodarek
@ 2015-05-19 20:28             ` David Miller
  2015-05-20 14:37               ` Andy Gospodarek
  2015-05-21  5:46               ` Scott Feldman
  0 siblings, 2 replies; 34+ messages in thread
From: David Miller @ 2015-05-19 20:28 UTC (permalink / raw)
  To: gospo; +Cc: roopa, john.r.fastabend, sfeldma, john.fastabend, jiri, netdev

From: Andy Gospodarek <gospo@cumulusnetworks.com>
Date: Tue, 19 May 2015 15:47:32 -0400

> Are you actually saying that if users complain loudly enough about
> the current behavior (not the change Roopa has proposed) that you
> would be open to considering a change the current behavior?

I am saying that we have a contract with users not to break existing
behavior.  Full stop.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-19 20:28             ` David Miller
@ 2015-05-20 14:37               ` Andy Gospodarek
  2015-05-21  5:46               ` Scott Feldman
  1 sibling, 0 replies; 34+ messages in thread
From: Andy Gospodarek @ 2015-05-20 14:37 UTC (permalink / raw)
  To: David Miller
  Cc: roopa, john.r.fastabend, sfeldma, john.fastabend, jiri, netdev

On Tue, May 19, 2015 at 04:28:45PM -0400, David Miller wrote:
> From: Andy Gospodarek <gospo@cumulusnetworks.com>
> Date: Tue, 19 May 2015 15:47:32 -0400
> 
> > Are you actually saying that if users complain loudly enough about
> > the current behavior (not the change Roopa has proposed) that you
> > would be open to considering a change the current behavior?
> 
> I am saying that we have a contract with users not to break existing
> behavior.  Full stop.

I stepped away from the ledge and now definitely agree that the default
behavior should stay the same in the offload or non-offload case.

There is probably room for some level of notification (even if just a
printk) to let one know that offloading is disabled and it would be
great if there was a way to set fib_offload_disabled to false again, but
that's not going to happen until patches are submitted, so I'll get to
that instead of complaining more.  :-)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-19 20:28             ` David Miller
  2015-05-20 14:37               ` Andy Gospodarek
@ 2015-05-21  5:46               ` Scott Feldman
  2015-05-21 15:37                 ` roopa
  2015-05-29  7:50                 ` Jiri Pirko
  1 sibling, 2 replies; 34+ messages in thread
From: Scott Feldman @ 2015-05-21  5:46 UTC (permalink / raw)
  To: David Miller
  Cc: Andy Gospodarek, Roopa Prabhu, Fastabend, John R, john fastabend,
	Jiří Pírko, Netdev

On Tue, May 19, 2015 at 1:28 PM, David Miller <davem@davemloft.net> wrote:
> From: Andy Gospodarek <gospo@cumulusnetworks.com>
> Date: Tue, 19 May 2015 15:47:32 -0400
>
>> Are you actually saying that if users complain loudly enough about
>> the current behavior (not the change Roopa has proposed) that you
>> would be open to considering a change the current behavior?
>
> I am saying that we have a contract with users not to break existing
> behavior.  Full stop.

After rehearing David's argument, we should probably explore option d)
which is a refinement on the fib_offload_disable mechanism we have
today.  fib_offload_disable is global for all routes.  Once we hit a
HW install problem, the global flag is set and all routes fallback to
SW.  We did this because we can't allow the failed route to exist in
SW and not in HW because it could mess up LPM searches (HW could hit
on a lesser prefix even when SW has the true LPM, because HW gets
first shot at match).  The refinement on fib_offload_disable is this:
make it per-related-prefix rather than global, and on a HW install
problem, set the flag for the related-prefix and uninstall only those
routes from HW.  Related-prefix (is there a correct term for this?)
are routes to the same dst addr but with different prefix lengths.  I
haven't parsed the fib_trie structure to see how routes are organized,
but I suspect since it's optimized for lookup the related-prefix
tracking is already there and we can build on that.

Option d) requires no application changes.  It requires no additional
understanding or input from the user.  Kernel fallback happens
transparently.  In the case where the HW install failure was due to
out-of-resource in HW, there may be some oscillation as
related-prefixes are removed from HW, freeing up a little space, only
to be filled as new routes come in, and so on.  Actually, now that I
think of it, the device/driver could decide which related-prefix to
evict from HW, if driver/device wanted to have a sense of which routes
are more important to offload than others, when resources are limited.

I think the parts we need are:

1) A new fib_offload_disable bit for related-prefixes.
2) On switchdev fib offload, look up if route is marked as
fib_offload_disabled based on it's related-prefix membership
3) A notification mechanism from driver to indicate a related-prefix
is fib_offload_disabled.

Feasible?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-21  5:46               ` Scott Feldman
@ 2015-05-21 15:37                 ` roopa
  2015-05-29  7:50                 ` Jiri Pirko
  1 sibling, 0 replies; 34+ messages in thread
From: roopa @ 2015-05-21 15:37 UTC (permalink / raw)
  To: Scott Feldman
  Cc: David Miller, Andy Gospodarek, Fastabend, John R, john fastabend,
	Jiří Pírko, Netdev, Shrijeet Mukherjee

On 5/20/15, 10:46 PM, Scott Feldman wrote:
> On Tue, May 19, 2015 at 1:28 PM, David Miller <davem@davemloft.net> wrote:
>> From: Andy Gospodarek <gospo@cumulusnetworks.com>
>> Date: Tue, 19 May 2015 15:47:32 -0400
>>
>>> Are you actually saying that if users complain loudly enough about
>>> the current behavior (not the change Roopa has proposed) that you
>>> would be open to considering a change the current behavior?
>> I am saying that we have a contract with users not to break existing
>> behavior.  Full stop.
> After rehearing David's argument, we should probably explore option d)
> which is a refinement on the fib_offload_disable mechanism we have
> today.  fib_offload_disable is global for all routes.  Once we hit a
> HW install problem, the global flag is set and all routes fallback to
> SW.  We did this because we can't allow the failed route to exist in
> SW and not in HW because it could mess up LPM searches (HW could hit
> on a lesser prefix even when SW has the true LPM, because HW gets
> first shot at match).  The refinement on fib_offload_disable is this:
> make it per-related-prefix rather than global, and on a HW install
> problem, set the flag for the related-prefix and uninstall only those
> routes from HW.  Related-prefix (is there a correct term for this?)
> are routes to the same dst addr but with different prefix lengths.  I
> haven't parsed the fib_trie structure to see how routes are organized,
> but I suspect since it's optimized for lookup the related-prefix
> tracking is already there and we can build on that.
>
> Option d) requires no application changes.  It requires no additional
> understanding or input from the user.  Kernel fallback happens
> transparently.  In the case where the HW install failure was due to
> out-of-resource in HW, there may be some oscillation as
> related-prefixes are removed from HW, freeing up a little space, only
> to be filled as new routes come in, and so on.  Actually, now that I
> think of it, the device/driver could decide which related-prefix to
> evict from HW, if driver/device wanted to have a sense of which routes
> are more important to offload than others, when resources are limited.
>
> I think the parts we need are:
>
> 1) A new fib_offload_disable bit for related-prefixes.
> 2) On switchdev fib offload, look up if route is marked as
> fib_offload_disabled based on it's related-prefix membership
> 3) A notification mechanism from driver to indicate a related-prefix
> is fib_offload_disabled.
>
> Feasible?
neat idea scott. But in practice I would be a bit nervous about getting 
the distribution of related prefixes between
hardware and software right (need to think about the nuances a bit more).

I am still leaning towards the possibility of making this a global 
system policy decision and leave the app and the switch driver less 
surprised. But this needs a better explicit offload policy/resource 
manager infra. I don't have any concrete thoughts on it yet.
(Understand that it will also need to make sure it does not break the 
existing contract with Linux apps (that dave was pointing too)).


thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-18 20:19 ` David Miller
  2015-05-19  0:21   ` John Fastabend
  2015-05-19  5:57   ` roopa
@ 2015-05-28  9:42   ` Jiri Pirko
  2015-05-28 15:35     ` John Fastabend
                       ` (3 more replies)
  2 siblings, 4 replies; 34+ messages in thread
From: Jiri Pirko @ 2015-05-28  9:42 UTC (permalink / raw)
  To: David Miller; +Cc: roopa, sfeldma, john.fastabend, netdev, andy

Mon, May 18, 2015 at 10:19:16PM CEST, davem@davemloft.net wrote:
>From: Roopa Prabhu <roopa@cumulusnetworks.com>
>Date: Sun, 17 May 2015 16:42:05 -0700
>
>> On most systems where you can offload routes to hardware,
>> doing routing in software is not an option (the cpu limitations
>> make routing impossible in software).
>
>You absolutely do not get to determine this policy, none of us
>do.
>
>What matters is that by default the damn switch device being there
>is %100 transparent to the user.
>
>And the way to achieve that default is to do software routes as
>a fallback.
>
>I am not going to entertain changes of this nature which fail
>route loading by default just because we've exceeded a device's
>HW capacity to offload.
>
>I thought I was _really_ clear about this at netdev 0.1

I certainly agree that by default, transparency 1:1 sw:hw mapping is
what we need for fib. The current code is a good start!

I see couple of issues regarding switchdev_fib_ipv4_abort:
1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
   executed -> and, error returned. I would expect that route entry should
   be added in this case. The next attempt of adding the same entry will
   be successful.
   The current behaviour breaks the transparency you are reffering to.
2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
   disabled for good (until reboot). That is certainly not nice, alhough
   I understand that is the easiest solution for now.

I believe that we all agree that the 1:1 transparency, although it is a
default, may not be optimal for real-life usage. HW resources are
limited and user does not know them. The danger of hitting _abort and
screwing-up the whole system is huge, unacceptable.

So here, there are couple of more or less simple things that I suggest to
do in order to move a little bit forward:
1) Introduce system-wide option to switch _abort to just plain fail.
   When HW does not have capacity, do not flush and fallback to sw, but
   rather just fail to add the entry. This would not break anything.
   Userspace has to be prepared that entry add could fail.
2) Introduce a way to propagate resources to userspace. Driver knows about
   resources used/available/potentially_available. Switchdev infra could
   be extended in order to propagate the info to the user.
3) Introduce couple of flags for entry add that would alter the default
   behaviour. Something like:
   	NLM_F_SKIP_KERNEL
   	NLM_F_SKIP_OFFLOAD
   Again, this does not break the current users. On the other hand, this
   gives new users a leverage to instruct kernel where the entry should
   be added to (or not added to).

Any thoughts? Objections?

Thanks!

Jiri

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-28  9:42   ` Jiri Pirko
@ 2015-05-28 15:35     ` John Fastabend
  2015-05-29  7:42       ` Jiri Pirko
  2015-05-28 15:40     ` Scott Feldman
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: John Fastabend @ 2015-05-28 15:35 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: David Miller, roopa, sfeldma, netdev, andy

On 05/28/2015 02:42 AM, Jiri Pirko wrote:
> Mon, May 18, 2015 at 10:19:16PM CEST, davem@davemloft.net wrote:
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>> Date: Sun, 17 May 2015 16:42:05 -0700
>>
>>> On most systems where you can offload routes to hardware,
>>> doing routing in software is not an option (the cpu limitations
>>> make routing impossible in software).
>>
>> You absolutely do not get to determine this policy, none of us
>> do.
>>
>> What matters is that by default the damn switch device being there
>> is %100 transparent to the user.
>>
>> And the way to achieve that default is to do software routes as
>> a fallback.
>>
>> I am not going to entertain changes of this nature which fail
>> route loading by default just because we've exceeded a device's
>> HW capacity to offload.
>>
>> I thought I was _really_ clear about this at netdev 0.1
>
> I certainly agree that by default, transparency 1:1 sw:hw mapping is
> what we need for fib. The current code is a good start!
>
> I see couple of issues regarding switchdev_fib_ipv4_abort:
> 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
>     executed -> and, error returned. I would expect that route entry should
>     be added in this case. The next attempt of adding the same entry will
>     be successful.
>     The current behaviour breaks the transparency you are reffering to.
> 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
>     disabled for good (until reboot). That is certainly not nice, alhough
>     I understand that is the easiest solution for now.
>
> I believe that we all agree that the 1:1 transparency, although it is a
> default, may not be optimal for real-life usage. HW resources are
> limited and user does not know them. The danger of hitting _abort and
> screwing-up the whole system is huge, unacceptable.
>
> So here, there are couple of more or less simple things that I suggest to
> do in order to move a little bit forward:
> 1) Introduce system-wide option to switch _abort to just plain fail.
>     When HW does not have capacity, do not flush and fallback to sw, but
>     rather just fail to add the entry. This would not break anything.
>     Userspace has to be prepared that entry add could fail.
> 2) Introduce a way to propagate resources to userspace. Driver knows about
>     resources used/available/potentially_available. Switchdev infra could
>     be extended in order to propagate the info to the user.

I currently use the FlowAPI work I presented at netdev conference for
this. Perhaps I was a bit reaching by trying to also push it as a
replacement for the ethtool flow classification mechanism all in one
shot. For what it is worth replacing 'ethtool' flow classifier when
I have a pipeline of tables in a NIC is really my first use case for
the 'set' operations but that is off-topic probably.

The benefits I see of using this interface (or if you want rename
it and push it into a different netlink type) is it gives you the entire
view of the switch resources and pipeline from a single interface. Also
because you are talking about system-wide behaviour above it nicely
rolls up into user space software where we can act on it with the
flags we have for l2 already and if we pursue your option (3) also l3.
I like the single interface vs. scattering the information across many
different interfaces this way we can do it once and be done with it.
If you scatter it across all the interfaces just l2,l3 for now but
we will get more then each interface will have its own mechanism and
I have no idea where you put global information such as table ordering.

IMO we are going to need at least the base operations I outlined when
we want to work on many different pipelines possibly with different
ordering of tables, different amounts of resource sharing (l2 vs l3 vs
acls vs...), different levels of support (mac/vlan or just mac). And I
don't think it fits into an existing netlink structure because its not
specific to any one thing but the model of the hardware.

Also I believe that match/action tables are a really nice way to work
with hardware so this aligns with that. That said I think the interface
would need some tweaks to fit into the current code base. The biggest
one I would want is to make l2/l3 tables 'well-defined' e.g. give them
a #define value so we can always track them down easily, drop the set
operation (at least for now because the tables we have already have
defined interfaces l2/l3  I'll reopen this in the context of extending
flow classification on the NIC), and clean up the action bits so they
are well defined. I've pushed an update to my code on github to restrict
the hardware from exporting arbitrary actions which should be a
reasonable first step.

What do you think? I would like to try to make the above updates and
resubmit if we can get an agreement that "knowing" the hardware
resources and capabilities is useful. It is at least useful for my
software stacks/use cases.

> 3) Introduce couple of flags for entry add that would alter the default
>     behaviour. Something like:
>     	NLM_F_SKIP_KERNEL
>     	NLM_F_SKIP_OFFLOAD
>     Again, this does not break the current users. On the other hand, this
>     gives new users a leverage to instruct kernel where the entry should
>     be added to (or not added to).
>

This would be my choice. Although at least in my use case space I do
not mind making the software stack aware it is offloading rules.

> Any thoughts? Objections?
>
> Thanks!
>
> Jiri
>


-- 
John Fastabend         Intel Corporation

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-28  9:42   ` Jiri Pirko
  2015-05-28 15:35     ` John Fastabend
@ 2015-05-28 15:40     ` Scott Feldman
  2015-05-28 16:10       ` John Fastabend
                         ` (2 more replies)
  2015-05-29  5:31     ` roopa
  2015-05-29 15:12     ` Scott Feldman
  3 siblings, 3 replies; 34+ messages in thread
From: Scott Feldman @ 2015-05-28 15:40 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Miller, Roopa Prabhu, john fastabend, Netdev, Andy Gospodarek

On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Mon, May 18, 2015 at 10:19:16PM CEST, davem@davemloft.net wrote:
>>From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>Date: Sun, 17 May 2015 16:42:05 -0700
>>
>>> On most systems where you can offload routes to hardware,
>>> doing routing in software is not an option (the cpu limitations
>>> make routing impossible in software).
>>
>>You absolutely do not get to determine this policy, none of us
>>do.
>>
>>What matters is that by default the damn switch device being there
>>is %100 transparent to the user.
>>
>>And the way to achieve that default is to do software routes as
>>a fallback.
>>
>>I am not going to entertain changes of this nature which fail
>>route loading by default just because we've exceeded a device's
>>HW capacity to offload.
>>
>>I thought I was _really_ clear about this at netdev 0.1
>
> I certainly agree that by default, transparency 1:1 sw:hw mapping is
> what we need for fib. The current code is a good start!
>
> I see couple of issues regarding switchdev_fib_ipv4_abort:
> 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
>    executed -> and, error returned. I would expect that route entry should
>    be added in this case. The next attempt of adding the same entry will
>    be successful.
>    The current behaviour breaks the transparency you are reffering to.
> 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
>    disabled for good (until reboot). That is certainly not nice, alhough
>    I understand that is the easiest solution for now.
>
> I believe that we all agree that the 1:1 transparency, although it is a
> default, may not be optimal for real-life usage. HW resources are
> limited and user does not know them. The danger of hitting _abort and
> screwing-up the whole system is huge, unacceptable.
>
> So here, there are couple of more or less simple things that I suggest to
> do in order to move a little bit forward:
> 1) Introduce system-wide option to switch _abort to just plain fail.
>    When HW does not have capacity, do not flush and fallback to sw, but
>    rather just fail to add the entry. This would not break anything.
>    Userspace has to be prepared that entry add could fail.
> 2) Introduce a way to propagate resources to userspace. Driver knows about
>    resources used/available/potentially_available. Switchdev infra could
>    be extended in order to propagate the info to the user.
> 3) Introduce couple of flags for entry add that would alter the default
>    behaviour. Something like:
>         NLM_F_SKIP_KERNEL
>         NLM_F_SKIP_OFFLOAD
>    Again, this does not break the current users. On the other hand, this
>    gives new users a leverage to instruct kernel where the entry should
>    be added to (or not added to).
>
> Any thoughts? Objections?

I don't like these.  Breaks transparency and forces the user in a
position of having to know hardware failures modes (unique to each
hardware device).  I presented an option d) which avoids this issues;
was it not understood?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-28 15:40     ` Scott Feldman
@ 2015-05-28 16:10       ` John Fastabend
  2015-05-29  5:37         ` roopa
  2015-05-28 22:35       ` Andy Gospodarek
  2015-05-29  7:50       ` Jiri Pirko
  2 siblings, 1 reply; 34+ messages in thread
From: John Fastabend @ 2015-05-28 16:10 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Jiri Pirko, David Miller, Roopa Prabhu, Netdev, Andy Gospodarek

On 05/28/2015 08:40 AM, Scott Feldman wrote:
> On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Mon, May 18, 2015 at 10:19:16PM CEST, davem@davemloft.net wrote:
>>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>> Date: Sun, 17 May 2015 16:42:05 -0700
>>>
>>>> On most systems where you can offload routes to hardware,
>>>> doing routing in software is not an option (the cpu limitations
>>>> make routing impossible in software).
>>>
>>> You absolutely do not get to determine this policy, none of us
>>> do.
>>>
>>> What matters is that by default the damn switch device being there
>>> is %100 transparent to the user.
>>>
>>> And the way to achieve that default is to do software routes as
>>> a fallback.
>>>
>>> I am not going to entertain changes of this nature which fail
>>> route loading by default just because we've exceeded a device's
>>> HW capacity to offload.
>>>
>>> I thought I was _really_ clear about this at netdev 0.1
>>
>> I certainly agree that by default, transparency 1:1 sw:hw mapping is
>> what we need for fib. The current code is a good start!
>>
>> I see couple of issues regarding switchdev_fib_ipv4_abort:
>> 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
>>     executed -> and, error returned. I would expect that route entry should
>>     be added in this case. The next attempt of adding the same entry will
>>     be successful.
>>     The current behaviour breaks the transparency you are reffering to.
>> 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
>>     disabled for good (until reboot). That is certainly not nice, alhough
>>     I understand that is the easiest solution for now.
>>
>> I believe that we all agree that the 1:1 transparency, although it is a
>> default, may not be optimal for real-life usage. HW resources are
>> limited and user does not know them. The danger of hitting _abort and
>> screwing-up the whole system is huge, unacceptable.
>>
>> So here, there are couple of more or less simple things that I suggest to
>> do in order to move a little bit forward:
>> 1) Introduce system-wide option to switch _abort to just plain fail.
>>     When HW does not have capacity, do not flush and fallback to sw, but
>>     rather just fail to add the entry. This would not break anything.
>>     Userspace has to be prepared that entry add could fail.
>> 2) Introduce a way to propagate resources to userspace. Driver knows about
>>     resources used/available/potentially_available. Switchdev infra could
>>     be extended in order to propagate the info to the user.
>> 3) Introduce couple of flags for entry add that would alter the default
>>     behaviour. Something like:
>>          NLM_F_SKIP_KERNEL
>>          NLM_F_SKIP_OFFLOAD
>>     Again, this does not break the current users. On the other hand, this
>>     gives new users a leverage to instruct kernel where the entry should
>>     be added to (or not added to).
>>
>> Any thoughts? Objections?
>
> I don't like these.  Breaks transparency and forces the user in a
> position of having to know hardware failures modes (unique to each
> hardware device).  I presented an option d) which avoids this issues;
> was it not understood?
>

Hi Scott,

I understood your proposal. One caveat I had is in response to this,

"Actually, now that I think of it, the device/driver could decide which
related-prefix to evict from HW, if driver/device wanted to have a
sense of which routes are more important to offload than other"

hardware/driver/device shouldn't have a sense of which routes are more
important than others. I think this is where the NLM_F_* flags come in.
If userspace _wants_ to push policy into the kernel about what is
important it can. If it doesn't we get a sensible heuristic that does
a reasonable job offloading rules transparently. This is how we did
L2 and I think that seems to work fairly well. At least for me but,
always interested to hear other use cases though.

Also I guess I'm not seeing the multitude of hardware failure modes. I
see two either the hardware doesn't support the operation or it is out
of resources. Both can be learned if the hardware exports a model of its
capabilities and resources.

Thanks,
John

-- 
John Fastabend         Intel Corporation

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-28 15:40     ` Scott Feldman
  2015-05-28 16:10       ` John Fastabend
@ 2015-05-28 22:35       ` Andy Gospodarek
  2015-05-29  5:51         ` roopa
  2015-05-29  7:50       ` Jiri Pirko
  2 siblings, 1 reply; 34+ messages in thread
From: Andy Gospodarek @ 2015-05-28 22:35 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Jiri Pirko, David Miller, Roopa Prabhu, john fastabend, Netdev,
	Andy Gospodarek

On Thu, May 28, 2015 at 08:40:11AM -0700, Scott Feldman wrote:
> On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> > Mon, May 18, 2015 at 10:19:16PM CEST, davem@davemloft.net wrote:
> >>From: Roopa Prabhu <roopa@cumulusnetworks.com>
> >>Date: Sun, 17 May 2015 16:42:05 -0700
> >>
> >>> On most systems where you can offload routes to hardware,
> >>> doing routing in software is not an option (the cpu limitations
> >>> make routing impossible in software).
> >>
> >>You absolutely do not get to determine this policy, none of us
> >>do.
> >>
> >>What matters is that by default the damn switch device being there
> >>is %100 transparent to the user.
> >>
> >>And the way to achieve that default is to do software routes as
> >>a fallback.
> >>
> >>I am not going to entertain changes of this nature which fail
> >>route loading by default just because we've exceeded a device's
> >>HW capacity to offload.
> >>
> >>I thought I was _really_ clear about this at netdev 0.1
> >
> > I certainly agree that by default, transparency 1:1 sw:hw mapping is
> > what we need for fib. The current code is a good start!
> >
> > I see couple of issues regarding switchdev_fib_ipv4_abort:
> > 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
> >    executed -> and, error returned. I would expect that route entry should
> >    be added in this case. The next attempt of adding the same entry will
> >    be successful.
> >    The current behaviour breaks the transparency you are reffering to.
> > 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
> >    disabled for good (until reboot). That is certainly not nice, alhough
> >    I understand that is the easiest solution for now.
> >
> > I believe that we all agree that the 1:1 transparency, although it is a
> > default, may not be optimal for real-life usage. HW resources are
> > limited and user does not know them. The danger of hitting _abort and
> > screwing-up the whole system is huge, unacceptable.
> >
> > So here, there are couple of more or less simple things that I suggest to
> > do in order to move a little bit forward:
> > 1) Introduce system-wide option to switch _abort to just plain fail.
> >    When HW does not have capacity, do not flush and fallback to sw, but
> >    rather just fail to add the entry. This would not break anything.
> >    Userspace has to be prepared that entry add could fail.
> > 2) Introduce a way to propagate resources to userspace. Driver knows about
> >    resources used/available/potentially_available. Switchdev infra could
> >    be extended in order to propagate the info to the user.
> > 3) Introduce couple of flags for entry add that would alter the default
> >    behaviour. Something like:
> >         NLM_F_SKIP_KERNEL
> >         NLM_F_SKIP_OFFLOAD
> >    Again, this does not break the current users. On the other hand, this
> >    gives new users a leverage to instruct kernel where the entry should
> >    be added to (or not added to).
> >
> > Any thoughts? Objections?
> 
> I don't like these.  Breaks transparency and forces the user in a
> position of having to know hardware failures modes (unique to each
> hardware device).  I presented an option d) which avoids this issues;
> was it not understood?

I actually really like the way Jiri succinctly covered the different
cases to move us forward from what we have today (Thanks, Jiri!).  I
completely agree with you on both of your problem statements and the
idea that what have is fine for the short-term.  I see definite room to
improve the the user experience available via upstream kernels.  

Option 1 has appeal since userspace applications that control FDB, FIB,
etc entries could work without modification (the when in this mode the
kernel could choose to ignore any NLM_F_* flags Jiri proposed), but I
agree that a system-wide (or maybe offload-device-wide?) configuration
option needs to exist as this should not be the default behavior. 

Option 2 could also work as userspace applications could query for
space availability before attempting to add a route.  This could be
nice during bootup as then apps could periodically double check that
their view of the world is accurate.

Option 3 also has appeal since there exists the ability to allow
fine-grained control from userspace applications since less used routes
(or routes that could be summarized) could be combined in userspace if
needed.

The great part about all suggestions is that when combined they can
provide a great user experience, but doing all 3 at once is probably too
aggressive.  My vote would be to see if we can work together on a
combination of Option 1 and 3 together as they seem to provide a great
first start to this...

If an application tried to add a route (called A) to the route table
in the kernel and code to support Option 1 existed (similar to what
Roopa posted to start this series) then the kernel could fail to add
route A.  

If the user noted that some other route (called B) was lower priority
for _any_ reason, the user could delete route B from the kernel and
hardware and add route A to hardware and kernel.  Then the user could
make a call to add route B with the flag 'NLM_F_SKIP_OFFLOAD' which
would tell the kernel not to perform a FIB offload of that route.

Now we have some routes in the table that will be offloaded to hardware
and software and some that will be handled only in software -- as long
as the user has explicitly enabled some sort of offload option.

The lingering question in my mind is, what interface do we use to
configure it....

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-28  9:42   ` Jiri Pirko
  2015-05-28 15:35     ` John Fastabend
  2015-05-28 15:40     ` Scott Feldman
@ 2015-05-29  5:31     ` roopa
  2015-05-29 15:12     ` Scott Feldman
  3 siblings, 0 replies; 34+ messages in thread
From: roopa @ 2015-05-29  5:31 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: David Miller, sfeldma, john.fastabend, netdev, andy

On 5/28/15, 2:42 AM, Jiri Pirko wrote:
> Mon, May 18, 2015 at 10:19:16PM CEST, davem@davemloft.net wrote:
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>> Date: Sun, 17 May 2015 16:42:05 -0700
>>
>>> On most systems where you can offload routes to hardware,
>>> doing routing in software is not an option (the cpu limitations
>>> make routing impossible in software).
>> You absolutely do not get to determine this policy, none of us
>> do.
>>
>> What matters is that by default the damn switch device being there
>> is %100 transparent to the user.
>>
>> And the way to achieve that default is to do software routes as
>> a fallback.
>>
>> I am not going to entertain changes of this nature which fail
>> route loading by default just because we've exceeded a device's
>> HW capacity to offload.
>>
>> I thought I was _really_ clear about this at netdev 0.1
> I certainly agree that by default, transparency 1:1 sw:hw mapping is
> what we need for fib. The current code is a good start!
>
> I see couple of issues regarding switchdev_fib_ipv4_abort:
> 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
>     executed -> and, error returned. I would expect that route entry should
>     be added in this case. The next attempt of adding the same entry will
>     be successful.
>     The current behaviour breaks the transparency you are reffering to.
> 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
>     disabled for good (until reboot). That is certainly not nice, alhough
>     I understand that is the easiest solution for now.

+1
>
> I believe that we all agree that the 1:1 transparency, although it is a
> default, may not be optimal for real-life usage. HW resources are
> limited and user does not know them. The danger of hitting _abort and
> screwing-up the whole system is huge, unacceptable.
>
> So here, there are couple of more or less simple things that I suggest to
> do in order to move a little bit forward:
> 1) Introduce system-wide option to switch _abort to just plain fail.
>     When HW does not have capacity, do not flush and fallback to sw, but
>     rather just fail to add the entry. This would not break anything.
>     Userspace has to be prepared that entry add could fail.
This was my option b)
> 2) Introduce a way to propagate resources to userspace. Driver knows about
>     resources used/available/potentially_available. Switchdev infra could
>     be extended in order to propagate the info to the user.
This could be an option as well. On our switches we do provide a utility 
to query
similar hardware resources/stats. We were planning to propose a netlink
based query/get api for the switchdev case.
> 3) Introduce couple of flags for entry add that would alter the default
>     behaviour. Something like:
>     	NLM_F_SKIP_KERNEL
>     	NLM_F_SKIP_OFFLOAD
+1, we have discussed similar flags in many other switchdev discussions 
as well.
and this is also along the lines of option c) that i was proposing as 
possible alternatives with this patch.
>     Again, this does not break the current users. On the other hand, this
>     gives new users a leverage to instruct kernel where the entry should
>     be added to (or not added to).
>
> Any thoughts? Objections?
>
>
+1 to all what you said.

thanks

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-28 16:10       ` John Fastabend
@ 2015-05-29  5:37         ` roopa
  0 siblings, 0 replies; 34+ messages in thread
From: roopa @ 2015-05-29  5:37 UTC (permalink / raw)
  To: John Fastabend
  Cc: Scott Feldman, Jiri Pirko, David Miller, Netdev, Andy Gospodarek

On 5/28/15, 9:10 AM, John Fastabend wrote:
> On 05/28/2015 08:40 AM, Scott Feldman wrote:
>> On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Mon, May 18, 2015 at 10:19:16PM CEST, davem@davemloft.net wrote:
>>>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>> Date: Sun, 17 May 2015 16:42:05 -0700
>>>>
>>>>> On most systems where you can offload routes to hardware,
>>>>> doing routing in software is not an option (the cpu limitations
>>>>> make routing impossible in software).
>>>>
>>>> You absolutely do not get to determine this policy, none of us
>>>> do.
>>>>
>>>> What matters is that by default the damn switch device being there
>>>> is %100 transparent to the user.
>>>>
>>>> And the way to achieve that default is to do software routes as
>>>> a fallback.
>>>>
>>>> I am not going to entertain changes of this nature which fail
>>>> route loading by default just because we've exceeded a device's
>>>> HW capacity to offload.
>>>>
>>>> I thought I was _really_ clear about this at netdev 0.1
>>>
>>> I certainly agree that by default, transparency 1:1 sw:hw mapping is
>>> what we need for fib. The current code is a good start!
>>>
>>> I see couple of issues regarding switchdev_fib_ipv4_abort:
>>> 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
>>>     executed -> and, error returned. I would expect that route entry 
>>> should
>>>     be added in this case. The next attempt of adding the same entry 
>>> will
>>>     be successful.
>>>     The current behaviour breaks the transparency you are reffering to.
>>> 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
>>>     disabled for good (until reboot). That is certainly not nice, 
>>> alhough
>>>     I understand that is the easiest solution for now.
>>>
>>> I believe that we all agree that the 1:1 transparency, although it is a
>>> default, may not be optimal for real-life usage. HW resources are
>>> limited and user does not know them. The danger of hitting _abort and
>>> screwing-up the whole system is huge, unacceptable.
>>>
>>> So here, there are couple of more or less simple things that I 
>>> suggest to
>>> do in order to move a little bit forward:
>>> 1) Introduce system-wide option to switch _abort to just plain fail.
>>>     When HW does not have capacity, do not flush and fallback to sw, 
>>> but
>>>     rather just fail to add the entry. This would not break anything.
>>>     Userspace has to be prepared that entry add could fail.
>>> 2) Introduce a way to propagate resources to userspace. Driver knows 
>>> about
>>>     resources used/available/potentially_available. Switchdev infra 
>>> could
>>>     be extended in order to propagate the info to the user.
>>> 3) Introduce couple of flags for entry add that would alter the default
>>>     behaviour. Something like:
>>>          NLM_F_SKIP_KERNEL
>>>          NLM_F_SKIP_OFFLOAD
>>>     Again, this does not break the current users. On the other hand, 
>>> this
>>>     gives new users a leverage to instruct kernel where the entry 
>>> should
>>>     be added to (or not added to).
>>>
>>> Any thoughts? Objections?
>>
>> I don't like these.  Breaks transparency and forces the user in a
>> position of having to know hardware failures modes (unique to each
>> hardware device).  I presented an option d) which avoids this issues;
>> was it not understood?
>>
>
> Hi Scott,
>
> I understood your proposal. One caveat I had is in response to this,
>
> "Actually, now that I think of it, the device/driver could decide which
> related-prefix to evict from HW, if driver/device wanted to have a
> sense of which routes are more important to offload than other"
>
> hardware/driver/device shouldn't have a sense of which routes are more
> important than others. 

correct. The routing daemons know this best.
> I think this is where the NLM_F_* flags come in.
> If userspace _wants_ to push policy into the kernel about what is
> important it can. If it doesn't we get a sensible heuristic that does
> a reasonable job offloading rules transparently. This is how we did
> L2 and I think that seems to work fairly well. At least for me but,
> always interested to hear other use cases though.
agree.
>
> Also I guess I'm not seeing the multitude of hardware failure modes. I
> see two either the hardware doesn't support the operation or it is out
> of resources. Both can be learned if the hardware exports a model of its
> capabilities and resources.
agree, A switchdev api to query hardware resource and capability is due.
We can start with rocker. It gives an app the choice to control the policy.

But, for our usecase/deployments today,  i am more interested in a 
system wide policy because
it is easier on my apps today.

thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-28 22:35       ` Andy Gospodarek
@ 2015-05-29  5:51         ` roopa
  0 siblings, 0 replies; 34+ messages in thread
From: roopa @ 2015-05-29  5:51 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Scott Feldman, Jiri Pirko, David Miller, john fastabend, Netdev,
	Andy Gospodarek

On 5/28/15, 3:35 PM, Andy Gospodarek wrote:
> On Thu, May 28, 2015 at 08:40:11AM -0700, Scott Feldman wrote:
>> On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Mon, May 18, 2015 at 10:19:16PM CEST, davem@davemloft.net wrote:
>>>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>> Date: Sun, 17 May 2015 16:42:05 -0700
>>>>
>>>>> On most systems where you can offload routes to hardware,
>>>>> doing routing in software is not an option (the cpu limitations
>>>>> make routing impossible in software).
>>>> You absolutely do not get to determine this policy, none of us
>>>> do.
>>>>
>>>> What matters is that by default the damn switch device being there
>>>> is %100 transparent to the user.
>>>>
>>>> And the way to achieve that default is to do software routes as
>>>> a fallback.
>>>>
>>>> I am not going to entertain changes of this nature which fail
>>>> route loading by default just because we've exceeded a device's
>>>> HW capacity to offload.
>>>>
>>>> I thought I was _really_ clear about this at netdev 0.1
>>> I certainly agree that by default, transparency 1:1 sw:hw mapping is
>>> what we need for fib. The current code is a good start!
>>>
>>> I see couple of issues regarding switchdev_fib_ipv4_abort:
>>> 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
>>>     executed -> and, error returned. I would expect that route entry should
>>>     be added in this case. The next attempt of adding the same entry will
>>>     be successful.
>>>     The current behaviour breaks the transparency you are reffering to.
>>> 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
>>>     disabled for good (until reboot). That is certainly not nice, alhough
>>>     I understand that is the easiest solution for now.
>>>
>>> I believe that we all agree that the 1:1 transparency, although it is a
>>> default, may not be optimal for real-life usage. HW resources are
>>> limited and user does not know them. The danger of hitting _abort and
>>> screwing-up the whole system is huge, unacceptable.
>>>
>>> So here, there are couple of more or less simple things that I suggest to
>>> do in order to move a little bit forward:
>>> 1) Introduce system-wide option to switch _abort to just plain fail.
>>>     When HW does not have capacity, do not flush and fallback to sw, but
>>>     rather just fail to add the entry. This would not break anything.
>>>     Userspace has to be prepared that entry add could fail.
>>> 2) Introduce a way to propagate resources to userspace. Driver knows about
>>>     resources used/available/potentially_available. Switchdev infra could
>>>     be extended in order to propagate the info to the user.
>>> 3) Introduce couple of flags for entry add that would alter the default
>>>     behaviour. Something like:
>>>          NLM_F_SKIP_KERNEL
>>>          NLM_F_SKIP_OFFLOAD
>>>     Again, this does not break the current users. On the other hand, this
>>>     gives new users a leverage to instruct kernel where the entry should
>>>     be added to (or not added to).
>>>
>>> Any thoughts? Objections?
>> I don't like these.  Breaks transparency and forces the user in a
>> position of having to know hardware failures modes (unique to each
>> hardware device).  I presented an option d) which avoids this issues;
>> was it not understood?
> I actually really like the way Jiri succinctly covered the different
> cases to move us forward from what we have today (Thanks, Jiri!).  I
> completely agree with you on both of your problem statements and the
> idea that what have is fine for the short-term.  I see definite room to
> improve the the user experience available via upstream kernels.
>
> Option 1 has appeal since userspace applications that control FDB, FIB,
> etc entries could work without modification (the when in this mode the
> kernel could choose to ignore any NLM_F_* flags Jiri proposed), but I
> agree that a system-wide (or maybe offload-device-wide?) configuration
> option needs to exist as this should not be the default behavior.

+1...i started with a sysctl for this as an example in my patch. But, 
instead of adding 10
different sysctls in different places, a switchdev policy infra/api and 
flags is due.

I have been planning to post a v3 of my patch with some of these policy 
flags.
>
> Option 2 could also work as userspace applications could query for
> space availability before attempting to add a route.  This could be
> nice during bootup as then apps could periodically double check that
> their view of the world is accurate.
>
> Option 3 also has appeal since there exists the ability to allow
> fine-grained control from userspace applications since less used routes
> (or routes that could be summarized) could be combined in userspace if
> needed.
>
> The great part about all suggestions is that when combined they can
> provide a great user experience, but doing all 3 at once is probably too
> aggressive.  My vote would be to see if we can work together on a
> combination of Option 1 and 3 together as they seem to provide a great
> first start to this...
>
> If an application tried to add a route (called A) to the route table
> in the kernel and code to support Option 1 existed (similar to what
> Roopa posted to start this series) then the kernel could fail to add
> route A.
>
> If the user noted that some other route (called B) was lower priority
> for _any_ reason, the user could delete route B from the kernel and
> hardware and add route A to hardware and kernel.  Then the user could
> make a call to add route B with the flag 'NLM_F_SKIP_OFFLOAD' which
> would tell the kernel not to perform a FIB offload of that route.
+1, that was the exact intent of the flow of my options a), b), c).
thanks for putting it in an example!
> Now we have some routes in the table that will be offloaded to hardware
> and software and some that will be handled only in software -- as long
> as the user has explicitly enabled some sort of offload option.
>
> The lingering question in my mind is, what interface do we use to
> configure it....
>

I have been thinking of switchdev infra specific netlink attributes and 
APIs for v3.
This will also contain resource/capability query attributes (Need to 
check how this aligns or can be replaced by
JohnF's resource query/capability api's. And also how this fits with 
rocker). I am hoping to come up with a v3 (RFC) soon.

thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-28 15:35     ` John Fastabend
@ 2015-05-29  7:42       ` Jiri Pirko
  0 siblings, 0 replies; 34+ messages in thread
From: Jiri Pirko @ 2015-05-29  7:42 UTC (permalink / raw)
  To: John Fastabend; +Cc: David Miller, roopa, sfeldma, netdev, andy

Thu, May 28, 2015 at 05:35:05PM CEST, john.fastabend@gmail.com wrote:
>On 05/28/2015 02:42 AM, Jiri Pirko wrote:
>>Mon, May 18, 2015 at 10:19:16PM CEST, davem@davemloft.net wrote:
>>>From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>Date: Sun, 17 May 2015 16:42:05 -0700
>>>
>>>>On most systems where you can offload routes to hardware,
>>>>doing routing in software is not an option (the cpu limitations
>>>>make routing impossible in software).
>>>
>>>You absolutely do not get to determine this policy, none of us
>>>do.
>>>
>>>What matters is that by default the damn switch device being there
>>>is %100 transparent to the user.
>>>
>>>And the way to achieve that default is to do software routes as
>>>a fallback.
>>>
>>>I am not going to entertain changes of this nature which fail
>>>route loading by default just because we've exceeded a device's
>>>HW capacity to offload.
>>>
>>>I thought I was _really_ clear about this at netdev 0.1
>>
>>I certainly agree that by default, transparency 1:1 sw:hw mapping is
>>what we need for fib. The current code is a good start!
>>
>>I see couple of issues regarding switchdev_fib_ipv4_abort:
>>1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
>>    executed -> and, error returned. I would expect that route entry should
>>    be added in this case. The next attempt of adding the same entry will
>>    be successful.
>>    The current behaviour breaks the transparency you are reffering to.
>>2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
>>    disabled for good (until reboot). That is certainly not nice, alhough
>>    I understand that is the easiest solution for now.
>>
>>I believe that we all agree that the 1:1 transparency, although it is a
>>default, may not be optimal for real-life usage. HW resources are
>>limited and user does not know them. The danger of hitting _abort and
>>screwing-up the whole system is huge, unacceptable.
>>
>>So here, there are couple of more or less simple things that I suggest to
>>do in order to move a little bit forward:
>>1) Introduce system-wide option to switch _abort to just plain fail.
>>    When HW does not have capacity, do not flush and fallback to sw, but
>>    rather just fail to add the entry. This would not break anything.
>>    Userspace has to be prepared that entry add could fail.
>>2) Introduce a way to propagate resources to userspace. Driver knows about
>>    resources used/available/potentially_available. Switchdev infra could
>>    be extended in order to propagate the info to the user.
>
>I currently use the FlowAPI work I presented at netdev conference for
>this. Perhaps I was a bit reaching by trying to also push it as a
>replacement for the ethtool flow classification mechanism all in one
>shot. For what it is worth replacing 'ethtool' flow classifier when
>I have a pipeline of tables in a NIC is really my first use case for
>the 'set' operations but that is off-topic probably.
>
>The benefits I see of using this interface (or if you want rename
>it and push it into a different netlink type) is it gives you the entire
>view of the switch resources and pipeline from a single interface. Also
>because you are talking about system-wide behaviour above it nicely
>rolls up into user space software where we can act on it with the
>flags we have for l2 already and if we pursue your option (3) also l3.
>I like the single interface vs. scattering the information across many
>different interfaces this way we can do it once and be done with it.
>If you scatter it across all the interfaces just l2,l3 for now but
>we will get more then each interface will have its own mechanism and
>I have no idea where you put global information such as table ordering.

I think that for fib capacities/capabilities, user should be able to use
extended existing Netlink interface. Not some parallel one.

I'm still not convinced that user should care about the actual hw
pipeline. We already have a pipeline in kernel. Switch drivers should just
do mapping, easy as that.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-21  5:46               ` Scott Feldman
  2015-05-21 15:37                 ` roopa
@ 2015-05-29  7:50                 ` Jiri Pirko
  2015-05-29 15:39                   ` Scott Feldman
  1 sibling, 1 reply; 34+ messages in thread
From: Jiri Pirko @ 2015-05-29  7:50 UTC (permalink / raw)
  To: Scott Feldman
  Cc: David Miller, Andy Gospodarek, Roopa Prabhu, Fastabend, John R,
	john fastabend, Netdev

Thu, May 21, 2015 at 07:46:54AM CEST, sfeldma@gmail.com wrote:
>On Tue, May 19, 2015 at 1:28 PM, David Miller <davem@davemloft.net> wrote:
>> From: Andy Gospodarek <gospo@cumulusnetworks.com>
>> Date: Tue, 19 May 2015 15:47:32 -0400
>>
>>> Are you actually saying that if users complain loudly enough about
>>> the current behavior (not the change Roopa has proposed) that you
>>> would be open to considering a change the current behavior?
>>
>> I am saying that we have a contract with users not to break existing
>> behavior.  Full stop.
>
>After rehearing David's argument, we should probably explore option d)
>which is a refinement on the fib_offload_disable mechanism we have
>today.  fib_offload_disable is global for all routes.  Once we hit a
>HW install problem, the global flag is set and all routes fallback to
>SW.  We did this because we can't allow the failed route to exist in
>SW and not in HW because it could mess up LPM searches (HW could hit
>on a lesser prefix even when SW has the true LPM, because HW gets
>first shot at match).  The refinement on fib_offload_disable is this:
>make it per-related-prefix rather than global, and on a HW install
>problem, set the flag for the related-prefix and uninstall only those
>routes from HW.  Related-prefix (is there a correct term for this?)
>are routes to the same dst addr but with different prefix lengths.  I
>haven't parsed the fib_trie structure to see how routes are organized,
>but I suspect since it's optimized for lookup the related-prefix
>tracking is already there and we can build on that.

This looks interesting. However, I'm not sure that it is acceptable for
user to experience this hw evict of "random entries". User knows what
entries are essential to have in hw. With your solution, I can see no way
user can actually say what should be offloaded or not. Kernel just
automagically decides.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-28 15:40     ` Scott Feldman
  2015-05-28 16:10       ` John Fastabend
  2015-05-28 22:35       ` Andy Gospodarek
@ 2015-05-29  7:50       ` Jiri Pirko
  2 siblings, 0 replies; 34+ messages in thread
From: Jiri Pirko @ 2015-05-29  7:50 UTC (permalink / raw)
  To: Scott Feldman
  Cc: David Miller, Roopa Prabhu, john fastabend, Netdev, Andy Gospodarek

Thu, May 28, 2015 at 05:40:11PM CEST, sfeldma@gmail.com wrote:
>On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Mon, May 18, 2015 at 10:19:16PM CEST, davem@davemloft.net wrote:
>>>From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>Date: Sun, 17 May 2015 16:42:05 -0700
>>>
>>>> On most systems where you can offload routes to hardware,
>>>> doing routing in software is not an option (the cpu limitations
>>>> make routing impossible in software).
>>>
>>>You absolutely do not get to determine this policy, none of us
>>>do.
>>>
>>>What matters is that by default the damn switch device being there
>>>is %100 transparent to the user.
>>>
>>>And the way to achieve that default is to do software routes as
>>>a fallback.
>>>
>>>I am not going to entertain changes of this nature which fail
>>>route loading by default just because we've exceeded a device's
>>>HW capacity to offload.
>>>
>>>I thought I was _really_ clear about this at netdev 0.1
>>
>> I certainly agree that by default, transparency 1:1 sw:hw mapping is
>> what we need for fib. The current code is a good start!
>>
>> I see couple of issues regarding switchdev_fib_ipv4_abort:
>> 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
>>    executed -> and, error returned. I would expect that route entry should
>>    be added in this case. The next attempt of adding the same entry will
>>    be successful.
>>    The current behaviour breaks the transparency you are reffering to.
>> 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
>>    disabled for good (until reboot). That is certainly not nice, alhough
>>    I understand that is the easiest solution for now.
>>
>> I believe that we all agree that the 1:1 transparency, although it is a
>> default, may not be optimal for real-life usage. HW resources are
>> limited and user does not know them. The danger of hitting _abort and
>> screwing-up the whole system is huge, unacceptable.
>>
>> So here, there are couple of more or less simple things that I suggest to
>> do in order to move a little bit forward:
>> 1) Introduce system-wide option to switch _abort to just plain fail.
>>    When HW does not have capacity, do not flush and fallback to sw, but
>>    rather just fail to add the entry. This would not break anything.
>>    Userspace has to be prepared that entry add could fail.
>> 2) Introduce a way to propagate resources to userspace. Driver knows about
>>    resources used/available/potentially_available. Switchdev infra could
>>    be extended in order to propagate the info to the user.
>> 3) Introduce couple of flags for entry add that would alter the default
>>    behaviour. Something like:
>>         NLM_F_SKIP_KERNEL
>>         NLM_F_SKIP_OFFLOAD
>>    Again, this does not break the current users. On the other hand, this
>>    gives new users a leverage to instruct kernel where the entry should
>>    be added to (or not added to).
>>
>> Any thoughts? Objections?
>
>I don't like these.  Breaks transparency and forces the user in a
>position of having to know hardware failures modes (unique to each


Can you please elaborate on this a bit more? I fail to see transparency
breaking in my proposal :/ Maybe it is by different understanding of the
term?

Also I do not understand the remark about user having to know hardware
failure modes. Could you please explain?


>hardware device).  I presented an option d) which avoids this issues;
>was it not understood?

I just commented on option d) it other email.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-28  9:42   ` Jiri Pirko
                       ` (2 preceding siblings ...)
  2015-05-29  5:31     ` roopa
@ 2015-05-29 15:12     ` Scott Feldman
  2015-05-29 15:37       ` Jiri Pirko
  3 siblings, 1 reply; 34+ messages in thread
From: Scott Feldman @ 2015-05-29 15:12 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Miller, Roopa Prabhu, john fastabend, Netdev, Andy Gospodarek

On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Mon, May 18, 2015 at 10:19:16PM CEST, davem@davemloft.net wrote:
>>From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>Date: Sun, 17 May 2015 16:42:05 -0700
>>
>>> On most systems where you can offload routes to hardware,
>>> doing routing in software is not an option (the cpu limitations
>>> make routing impossible in software).
>>
>>You absolutely do not get to determine this policy, none of us
>>do.
>>
>>What matters is that by default the damn switch device being there
>>is %100 transparent to the user.
>>
>>And the way to achieve that default is to do software routes as
>>a fallback.
>>
>>I am not going to entertain changes of this nature which fail
>>route loading by default just because we've exceeded a device's
>>HW capacity to offload.
>>
>>I thought I was _really_ clear about this at netdev 0.1
>
> I certainly agree that by default, transparency 1:1 sw:hw mapping is
> what we need for fib. The current code is a good start!
>
> I see couple of issues regarding switchdev_fib_ipv4_abort:
> 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
>    executed -> and, error returned. I would expect that route entry should
>    be added in this case. The next attempt of adding the same entry will
>    be successful.
>    The current behaviour breaks the transparency you are reffering to.
> 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
>    disabled for good (until reboot). That is certainly not nice, alhough
>    I understand that is the easiest solution for now.
>
> I believe that we all agree that the 1:1 transparency, although it is a
> default, may not be optimal for real-life usage. HW resources are
> limited and user does not know them. The danger of hitting _abort and
> screwing-up the whole system is huge, unacceptable.
>
> So here, there are couple of more or less simple things that I suggest to
> do in order to move a little bit forward:
> 1) Introduce system-wide option to switch _abort to just plain fail.
>    When HW does not have capacity, do not flush and fallback to sw, but
>    rather just fail to add the entry. This would not break anything.
>    Userspace has to be prepared that entry add could fail.

This breaks 1:1 transparency.  A route now fails to install and the
user is scratching his/her head as to why it failed.  It used to work
when there was no switch offload.  It works with switch offload on
this other device.  So it must be a failure due to switch offload on
this device.  But why this route?  I just installed 20 IPv4 routes and
10 IPv6 routes.  Why did this 11th IPv6 route fail to install?  See,
now user needs to learn about details of that particular device's
limits to understand failure.  When they move their application to
another device, they need to re-learn failure modes.

> 2) Introduce a way to propagate resources to userspace. Driver knows about
>    resources used/available/potentially_available. Switchdev infra could
>    be extended in order to propagate the info to the user.
> 3) Introduce couple of flags for entry add that would alter the default
>    behaviour. Something like:
>         NLM_F_SKIP_KERNEL
>         NLM_F_SKIP_OFFLOAD
>    Again, this does not break the current users. On the other hand, this
>    gives new users a leverage to instruct kernel where the entry should
>    be added to (or not added to).

I don't think we want an NLM_F_SKIP_KERNEL option and only have the
route installed on the device.  We want offload to be an acceleration
of the kernel's FIB, not a bypass.

SKIP_OFFLOAD can mess up LPM if the user is not really really careful.

> Any thoughts? Objections?
>
> Thanks!
>
> Jiri

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-29 15:12     ` Scott Feldman
@ 2015-05-29 15:37       ` Jiri Pirko
  0 siblings, 0 replies; 34+ messages in thread
From: Jiri Pirko @ 2015-05-29 15:37 UTC (permalink / raw)
  To: Scott Feldman
  Cc: David Miller, Roopa Prabhu, john fastabend, Netdev, Andy Gospodarek

Fri, May 29, 2015 at 05:12:35PM CEST, sfeldma@gmail.com wrote:
>On Thu, May 28, 2015 at 2:42 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Mon, May 18, 2015 at 10:19:16PM CEST, davem@davemloft.net wrote:
>>>From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>Date: Sun, 17 May 2015 16:42:05 -0700
>>>
>>>> On most systems where you can offload routes to hardware,
>>>> doing routing in software is not an option (the cpu limitations
>>>> make routing impossible in software).
>>>
>>>You absolutely do not get to determine this policy, none of us
>>>do.
>>>
>>>What matters is that by default the damn switch device being there
>>>is %100 transparent to the user.
>>>
>>>And the way to achieve that default is to do software routes as
>>>a fallback.
>>>
>>>I am not going to entertain changes of this nature which fail
>>>route loading by default just because we've exceeded a device's
>>>HW capacity to offload.
>>>
>>>I thought I was _really_ clear about this at netdev 0.1
>>
>> I certainly agree that by default, transparency 1:1 sw:hw mapping is
>> what we need for fib. The current code is a good start!
>>
>> I see couple of issues regarding switchdev_fib_ipv4_abort:
>> 1) If user adds and entry, switchdev_fib_ipv4_add fails, abort is
>>    executed -> and, error returned. I would expect that route entry should
>>    be added in this case. The next attempt of adding the same entry will
>>    be successful.
>>    The current behaviour breaks the transparency you are reffering to.
>> 2) When switchdev_fib_ipv4_abort happens to be executed, the offload is
>>    disabled for good (until reboot). That is certainly not nice, alhough
>>    I understand that is the easiest solution for now.
>>
>> I believe that we all agree that the 1:1 transparency, although it is a
>> default, may not be optimal for real-life usage. HW resources are
>> limited and user does not know them. The danger of hitting _abort and
>> screwing-up the whole system is huge, unacceptable.
>>
>> So here, there are couple of more or less simple things that I suggest to
>> do in order to move a little bit forward:
>> 1) Introduce system-wide option to switch _abort to just plain fail.
>>    When HW does not have capacity, do not flush and fallback to sw, but
>>    rather just fail to add the entry. This would not break anything.
>>    Userspace has to be prepared that entry add could fail.
>
>This breaks 1:1 transparency.  A route now fails to install and the
>user is scratching his/her head as to why it failed.  It used to work
>when there was no switch offload.  It works with switch offload on
>this other device.  So it must be a failure due to switch offload on
>this device.  But why this route?  I just installed 20 IPv4 routes and
>10 IPv6 routes.  Why did this 11th IPv6 route fail to install?  See,
>now user needs to learn about details of that particular device's
>limits to understand failure.  When they move their application to
>another device, they need to re-learn failure modes.

I don't want this behaviour as the default. Default should be what is at
this moment. This would be tunable by user. That, I believe is correct.


>
>> 2) Introduce a way to propagate resources to userspace. Driver knows about
>>    resources used/available/potentially_available. Switchdev infra could
>>    be extended in order to propagate the info to the user.
>> 3) Introduce couple of flags for entry add that would alter the default
>>    behaviour. Something like:
>>         NLM_F_SKIP_KERNEL
>>         NLM_F_SKIP_OFFLOAD
>>    Again, this does not break the current users. On the other hand, this
>>    gives new users a leverage to instruct kernel where the entry should
>>    be added to (or not added to).
>
>I don't think we want an NLM_F_SKIP_KERNEL option and only have the
>route installed on the device.  We want offload to be an acceleration
>of the kernel's FIB, not a bypass.

Okay, fair enough. Let's have NLM_F_SKIP_OFFLOAD only.

>
>SKIP_OFFLOAD can mess up LPM if the user is not really really careful.
>
>> Any thoughts? Objections?
>>
>> Thanks!
>>
>> Jiri

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-29  7:50                 ` Jiri Pirko
@ 2015-05-29 15:39                   ` Scott Feldman
  2015-05-30  9:00                     ` Jiri Pirko
  0 siblings, 1 reply; 34+ messages in thread
From: Scott Feldman @ 2015-05-29 15:39 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Miller, Andy Gospodarek, Roopa Prabhu, Fastabend, John R,
	john fastabend, Netdev

On Fri, May 29, 2015 at 12:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Thu, May 21, 2015 at 07:46:54AM CEST, sfeldma@gmail.com wrote:
>>On Tue, May 19, 2015 at 1:28 PM, David Miller <davem@davemloft.net> wrote:
>>> From: Andy Gospodarek <gospo@cumulusnetworks.com>
>>> Date: Tue, 19 May 2015 15:47:32 -0400
>>>
>>>> Are you actually saying that if users complain loudly enough about
>>>> the current behavior (not the change Roopa has proposed) that you
>>>> would be open to considering a change the current behavior?
>>>
>>> I am saying that we have a contract with users not to break existing
>>> behavior.  Full stop.
>>
>>After rehearing David's argument, we should probably explore option d)
>>which is a refinement on the fib_offload_disable mechanism we have
>>today.  fib_offload_disable is global for all routes.  Once we hit a
>>HW install problem, the global flag is set and all routes fallback to
>>SW.  We did this because we can't allow the failed route to exist in
>>SW and not in HW because it could mess up LPM searches (HW could hit
>>on a lesser prefix even when SW has the true LPM, because HW gets
>>first shot at match).  The refinement on fib_offload_disable is this:
>>make it per-related-prefix rather than global, and on a HW install
>>problem, set the flag for the related-prefix and uninstall only those
>>routes from HW.  Related-prefix (is there a correct term for this?)
>>are routes to the same dst addr but with different prefix lengths.  I
>>haven't parsed the fib_trie structure to see how routes are organized,
>>but I suspect since it's optimized for lookup the related-prefix
>>tracking is already there and we can build on that.
>
> This looks interesting. However, I'm not sure that it is acceptable for
> user to experience this hw evict of "random entries". User knows what
> entries are essential to have in hw. With your solution, I can see no way
> user can actually say what should be offloaded or not. Kernel just
> automagically decides.

The default eviction policy could be based on RTA_PRIORITY: evict
lower priority routes first.  It would be up to the device driver to
decide between two routes of same priority.

To help device driver make the decision, we could have eviction policy options:

    Priority-base (default)
    Prefer IPv6 over IPv4
    Prefer IPv4 over IPv6
    Prefer single path over multipath
    Prefer longer prefix lengths over shorter
    Optimize for resource utilization

These are portable across different switches.   They're in terms a
user understands.  It's up to the device driver which truly
understands the device constraints to translates the user's eviction
policy choices into something that makes sense to that device.

-scott

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-29 15:39                   ` Scott Feldman
@ 2015-05-30  9:00                     ` Jiri Pirko
  2015-05-31  4:19                       ` John Fastabend
  2015-05-31  7:34                       ` Scott Feldman
  0 siblings, 2 replies; 34+ messages in thread
From: Jiri Pirko @ 2015-05-30  9:00 UTC (permalink / raw)
  To: Scott Feldman
  Cc: David Miller, Andy Gospodarek, Roopa Prabhu, Fastabend, John R,
	john fastabend, Netdev

Fri, May 29, 2015 at 05:39:46PM CEST, sfeldma@gmail.com wrote:
>On Fri, May 29, 2015 at 12:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Thu, May 21, 2015 at 07:46:54AM CEST, sfeldma@gmail.com wrote:
>>>On Tue, May 19, 2015 at 1:28 PM, David Miller <davem@davemloft.net> wrote:
>>>> From: Andy Gospodarek <gospo@cumulusnetworks.com>
>>>> Date: Tue, 19 May 2015 15:47:32 -0400
>>>>
>>>>> Are you actually saying that if users complain loudly enough about
>>>>> the current behavior (not the change Roopa has proposed) that you
>>>>> would be open to considering a change the current behavior?
>>>>
>>>> I am saying that we have a contract with users not to break existing
>>>> behavior.  Full stop.
>>>
>>>After rehearing David's argument, we should probably explore option d)
>>>which is a refinement on the fib_offload_disable mechanism we have
>>>today.  fib_offload_disable is global for all routes.  Once we hit a
>>>HW install problem, the global flag is set and all routes fallback to
>>>SW.  We did this because we can't allow the failed route to exist in
>>>SW and not in HW because it could mess up LPM searches (HW could hit
>>>on a lesser prefix even when SW has the true LPM, because HW gets
>>>first shot at match).  The refinement on fib_offload_disable is this:
>>>make it per-related-prefix rather than global, and on a HW install
>>>problem, set the flag for the related-prefix and uninstall only those
>>>routes from HW.  Related-prefix (is there a correct term for this?)
>>>are routes to the same dst addr but with different prefix lengths.  I
>>>haven't parsed the fib_trie structure to see how routes are organized,
>>>but I suspect since it's optimized for lookup the related-prefix
>>>tracking is already there and we can build on that.
>>
>> This looks interesting. However, I'm not sure that it is acceptable for
>> user to experience this hw evict of "random entries". User knows what
>> entries are essential to have in hw. With your solution, I can see no way
>> user can actually say what should be offloaded or not. Kernel just
>> automagically decides.
>
>The default eviction policy could be based on RTA_PRIORITY: evict
>lower priority routes first.  It would be up to the device driver to
>decide between two routes of same priority.
>
>To help device driver make the decision, we could have eviction policy options:
>
>    Priority-base (default)
>    Prefer IPv6 over IPv4
>    Prefer IPv4 over IPv6
>    Prefer single path over multipath
>    Prefer longer prefix lengths over shorter
>    Optimize for resource utilization
>
>These are portable across different switches.   They're in terms a
>user understands.  It's up to the device driver which truly
>understands the device constraints to translates the user's eviction
>policy choices into something that makes sense to that device.

This sounds tempting... You plan to throw in some patches, or should I
take care of that?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-30  9:00                     ` Jiri Pirko
@ 2015-05-31  4:19                       ` John Fastabend
  2015-05-31  6:34                         ` Scott Feldman
  2015-05-31  7:34                       ` Scott Feldman
  1 sibling, 1 reply; 34+ messages in thread
From: John Fastabend @ 2015-05-31  4:19 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Scott Feldman, David Miller, Andy Gospodarek, Roopa Prabhu,
	Fastabend, John R, Netdev

On 05/30/2015 02:00 AM, Jiri Pirko wrote:
> Fri, May 29, 2015 at 05:39:46PM CEST, sfeldma@gmail.com wrote:
>> On Fri, May 29, 2015 at 12:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Thu, May 21, 2015 at 07:46:54AM CEST, sfeldma@gmail.com wrote:
>>>> On Tue, May 19, 2015 at 1:28 PM, David Miller <davem@davemloft.net> wrote:
>>>>> From: Andy Gospodarek <gospo@cumulusnetworks.com>
>>>>> Date: Tue, 19 May 2015 15:47:32 -0400
>>>>>
>>>>>> Are you actually saying that if users complain loudly enough about
>>>>>> the current behavior (not the change Roopa has proposed) that you
>>>>>> would be open to considering a change the current behavior?
>>>>>
>>>>> I am saying that we have a contract with users not to break existing
>>>>> behavior.  Full stop.
>>>>
>>>> After rehearing David's argument, we should probably explore option d)
>>>> which is a refinement on the fib_offload_disable mechanism we have
>>>> today.  fib_offload_disable is global for all routes.  Once we hit a
>>>> HW install problem, the global flag is set and all routes fallback to
>>>> SW.  We did this because we can't allow the failed route to exist in
>>>> SW and not in HW because it could mess up LPM searches (HW could hit
>>>> on a lesser prefix even when SW has the true LPM, because HW gets
>>>> first shot at match).  The refinement on fib_offload_disable is this:
>>>> make it per-related-prefix rather than global, and on a HW install
>>>> problem, set the flag for the related-prefix and uninstall only those
>>>> routes from HW.  Related-prefix (is there a correct term for this?)
>>>> are routes to the same dst addr but with different prefix lengths.  I
>>>> haven't parsed the fib_trie structure to see how routes are organized,
>>>> but I suspect since it's optimized for lookup the related-prefix
>>>> tracking is already there and we can build on that.
>>>
>>> This looks interesting. However, I'm not sure that it is acceptable for
>>> user to experience this hw evict of "random entries". User knows what
>>> entries are essential to have in hw. With your solution, I can see no way
>>> user can actually say what should be offloaded or not. Kernel just
>>> automagically decides.
>>
>> The default eviction policy could be based on RTA_PRIORITY: evict
>> lower priority routes first.  It would be up to the device driver to
>> decide between two routes of same priority.
>>
>> To help device driver make the decision, we could have eviction policy options:
>>
>>     Priority-base (default)
>>     Prefer IPv6 over IPv4
>>     Prefer IPv4 over IPv6
>>     Prefer single path over multipath
>>     Prefer longer prefix lengths over shorter
>>     Optimize for resource utilization
>>
>> These are portable across different switches.   They're in terms a
>> user understands.  It's up to the device driver which truly
>> understands the device constraints to translates the user's eviction
>> policy choices into something that makes sense to that device.
>
> This sounds tempting... You plan to throw in some patches, or should I
> take care of that?
>

This is encoding specific policies into the kernel. I was hoping to
avoid this and let user space develop whatever policy it wants. If you
use Jiri's proposed NLM_F_SKIP_{KERNEL|OFFLOAD} flags you get this.

Also I don't understand the "truly  understands the device constraints"
comment. We can export a model of the device and know how many rules
of each type will fit exactly into the table. This doesn't seem like
much of a problem to me. In fact the driver developer should know this
anyway.

Part of my motivation here is I really don't want to get stuck with a
case where each driver writer gets to translate the eviction policy
onto their device in some device specific and slightly different way.
It means every developer has to write a new mapping and get it correct.
At very least we should put a layer in switchdev that reads the table
out of the driver and does the mapping so we have it one spot. At least
then the kernel is enforcing policy the same on all devices. Better
still IMO would be to develop the policy in user space and have a
library/tool that does this so we don't end up with a bunch of policy
blobs in the kernel. The 6 above is a good start but over time we more
policy blobs will surely pop up. I would for example put 'optimize for
throughput' on the list.

.John

-- 
John Fastabend         Intel Corporation

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-31  4:19                       ` John Fastabend
@ 2015-05-31  6:34                         ` Scott Feldman
  0 siblings, 0 replies; 34+ messages in thread
From: Scott Feldman @ 2015-05-31  6:34 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jiri Pirko, David Miller, Andy Gospodarek, Roopa Prabhu,
	Fastabend, John R, Netdev

On Sat, May 30, 2015 at 9:19 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 05/30/2015 02:00 AM, Jiri Pirko wrote:
>>
>> Fri, May 29, 2015 at 05:39:46PM CEST, sfeldma@gmail.com wrote:
>>>
>>> On Fri, May 29, 2015 at 12:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>
>>>> Thu, May 21, 2015 at 07:46:54AM CEST, sfeldma@gmail.com wrote:
>>>>>
>>>>> On Tue, May 19, 2015 at 1:28 PM, David Miller <davem@davemloft.net>
>>>>> wrote:
>>>>>>
>>>>>> From: Andy Gospodarek <gospo@cumulusnetworks.com>
>>>>>> Date: Tue, 19 May 2015 15:47:32 -0400
>>>>>>
>>>>>>> Are you actually saying that if users complain loudly enough about
>>>>>>> the current behavior (not the change Roopa has proposed) that you
>>>>>>> would be open to considering a change the current behavior?
>>>>>>
>>>>>>
>>>>>> I am saying that we have a contract with users not to break existing
>>>>>> behavior.  Full stop.
>>>>>
>>>>>
>>>>> After rehearing David's argument, we should probably explore option d)
>>>>> which is a refinement on the fib_offload_disable mechanism we have
>>>>> today.  fib_offload_disable is global for all routes.  Once we hit a
>>>>> HW install problem, the global flag is set and all routes fallback to
>>>>> SW.  We did this because we can't allow the failed route to exist in
>>>>> SW and not in HW because it could mess up LPM searches (HW could hit
>>>>> on a lesser prefix even when SW has the true LPM, because HW gets
>>>>> first shot at match).  The refinement on fib_offload_disable is this:
>>>>> make it per-related-prefix rather than global, and on a HW install
>>>>> problem, set the flag for the related-prefix and uninstall only those
>>>>> routes from HW.  Related-prefix (is there a correct term for this?)
>>>>> are routes to the same dst addr but with different prefix lengths.  I
>>>>> haven't parsed the fib_trie structure to see how routes are organized,
>>>>> but I suspect since it's optimized for lookup the related-prefix
>>>>> tracking is already there and we can build on that.
>>>>
>>>>
>>>> This looks interesting. However, I'm not sure that it is acceptable for
>>>> user to experience this hw evict of "random entries". User knows what
>>>> entries are essential to have in hw. With your solution, I can see no
>>>> way
>>>> user can actually say what should be offloaded or not. Kernel just
>>>> automagically decides.
>>>
>>>
>>> The default eviction policy could be based on RTA_PRIORITY: evict
>>> lower priority routes first.  It would be up to the device driver to
>>> decide between two routes of same priority.
>>>
>>> To help device driver make the decision, we could have eviction policy
>>> options:
>>>
>>>     Priority-base (default)
>>>     Prefer IPv6 over IPv4
>>>     Prefer IPv4 over IPv6
>>>     Prefer single path over multipath
>>>     Prefer longer prefix lengths over shorter
>>>     Optimize for resource utilization
>>>
>>> These are portable across different switches.   They're in terms a
>>> user understands.  It's up to the device driver which truly
>>> understands the device constraints to translates the user's eviction
>>> policy choices into something that makes sense to that device.
>>
>>
>> This sounds tempting... You plan to throw in some patches, or should I
>> take care of that?
>>
>
> This is encoding specific policies into the kernel. I was hoping to
> avoid this and let user space develop whatever policy it wants. If you
> use Jiri's proposed NLM_F_SKIP_{KERNEL|OFFLOAD} flags you get this.
>
> Also I don't understand the "truly  understands the device constraints"
> comment. We can export a model of the device and know how many rules
> of each type will fit exactly into the table. This doesn't seem like
> much of a problem to me. In fact the driver developer should know this
> anyway.
>
> Part of my motivation here is I really don't want to get stuck with a
> case where each driver writer gets to translate the eviction policy
> onto their device in some device specific and slightly different way.

But this is _exactly_ what I want.  Here's why: my claim is it will be
impossible for us (device vendors) to define a universal set of
resource constraints that works for all devices from all vendors.  I
was kind of hoping some vendor would throw out a set to get us
started.  Ok, I'll start with rocker: rocker will enforce in the
device these constraints listed below.  There will be a device command
to query the raw constraints.   So here goes:

VLAN table max entries: 16K     // a VLAN on a port takes one entry
Term MAC table max entries: no limit
Bridging table:
     Unicast max entries: 12K
     Multicast max entries: 4K
Unicast Routing table (shared for v4 and v6 entries):
     Prefix max slots: 16K
         IPv4 route takes one slot
         IPv6 prefix len <= 64 route takes two slots
         IPv6 prefix len > 64 takes four slots
    Nexthop max slots: 4K
         Max ECMP width: 32
         Each nexthop MAC takes one slot, but there is a stride of 4 slots
Multicast Routing table (shared for v4 and v6 entries):
    (same as unicast routing, except max slots are 1/2 as big)
ACL table:
    IPv4 max entries: 8K
    IPv6 max entries: 8K
    Combined IPv4 and IPv6 entries: 12K

The table names are OF-DPA-specific.  The limits are contrived,
obviously, but are representative of real-world devices.  The v4/v6
splits are a PITA, but a reality.

Ok, your turn.  Let's see a list of constraints for a device you care
about and see what the union between rocker and yours is.  Maybe
others can contribute their lists?  My expectation is the union of all
these lists is something not implementable.  And we want to push this
to the user so they can tune their application, and run their
application on multiple switches?

> It means every developer has to write a new mapping and get it correct.
> At very least we should put a layer in switchdev that reads the table
> out of the driver and does the mapping so we have it one spot. At least
> then the kernel is enforcing policy the same on all devices. Better
> still IMO would be to develop the policy in user space and have a
> library/tool that does this so we don't end up with a bunch of policy
> blobs in the kernel. The 6 above is a good start but over time we more
> policy blobs will surely pop up. I would for example put 'optimize for
> throughput' on the list.

I don't even know what 'optimize for throughput' means in the context
offloading fib entries.  I'm curious, what is an example?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-30  9:00                     ` Jiri Pirko
  2015-05-31  4:19                       ` John Fastabend
@ 2015-05-31  7:34                       ` Scott Feldman
  1 sibling, 0 replies; 34+ messages in thread
From: Scott Feldman @ 2015-05-31  7:34 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Miller, Andy Gospodarek, Roopa Prabhu, Fastabend, John R,
	john fastabend, Netdev

On Sat, May 30, 2015 at 2:00 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Fri, May 29, 2015 at 05:39:46PM CEST, sfeldma@gmail.com wrote:
>>On Fri, May 29, 2015 at 12:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Thu, May 21, 2015 at 07:46:54AM CEST, sfeldma@gmail.com wrote:
>>>>On Tue, May 19, 2015 at 1:28 PM, David Miller <davem@davemloft.net> wrote:
>>>>> From: Andy Gospodarek <gospo@cumulusnetworks.com>
>>>>> Date: Tue, 19 May 2015 15:47:32 -0400
>>>>>
>>>>>> Are you actually saying that if users complain loudly enough about
>>>>>> the current behavior (not the change Roopa has proposed) that you
>>>>>> would be open to considering a change the current behavior?
>>>>>
>>>>> I am saying that we have a contract with users not to break existing
>>>>> behavior.  Full stop.
>>>>
>>>>After rehearing David's argument, we should probably explore option d)
>>>>which is a refinement on the fib_offload_disable mechanism we have
>>>>today.  fib_offload_disable is global for all routes.  Once we hit a
>>>>HW install problem, the global flag is set and all routes fallback to
>>>>SW.  We did this because we can't allow the failed route to exist in
>>>>SW and not in HW because it could mess up LPM searches (HW could hit
>>>>on a lesser prefix even when SW has the true LPM, because HW gets
>>>>first shot at match).  The refinement on fib_offload_disable is this:
>>>>make it per-related-prefix rather than global, and on a HW install
>>>>problem, set the flag for the related-prefix and uninstall only those
>>>>routes from HW.  Related-prefix (is there a correct term for this?)
>>>>are routes to the same dst addr but with different prefix lengths.  I
>>>>haven't parsed the fib_trie structure to see how routes are organized,
>>>>but I suspect since it's optimized for lookup the related-prefix
>>>>tracking is already there and we can build on that.
>>>
>>> This looks interesting. However, I'm not sure that it is acceptable for
>>> user to experience this hw evict of "random entries". User knows what
>>> entries are essential to have in hw. With your solution, I can see no way
>>> user can actually say what should be offloaded or not. Kernel just
>>> automagically decides.
>>
>>The default eviction policy could be based on RTA_PRIORITY: evict
>>lower priority routes first.  It would be up to the device driver to
>>decide between two routes of same priority.
>>
>>To help device driver make the decision, we could have eviction policy options:
>>
>>    Priority-base (default)
>>    Prefer IPv6 over IPv4
>>    Prefer IPv4 over IPv6
>>    Prefer single path over multipath
>>    Prefer longer prefix lengths over shorter
>>    Optimize for resource utilization
>>
>>These are portable across different switches.   They're in terms a
>>user understands.  It's up to the device driver which truly
>>understands the device constraints to translates the user's eviction
>>policy choices into something that makes sense to that device.
>
> This sounds tempting... You plan to throw in some patches, or should I
> take care of that?

My L2 backlog keeps from L3 at the moment.  What would be nice is if
someone would look into the feasibility of option d) at the
swithdev/fib level and see if it's doable.  The policy/user interface
stuff can come later.  What do we need?  My thoughts:

- Remove the check for fi->fib_net->ipv4.fib_offload_disabled in
switchdev_fib_ipv4_add() and replace it with a check to see if route
is offload_disabled.  A route is offload_disabled if the set of
related routes is marked disabled.  If route is offload_disabled, just
return 0 in switchdev_fib_ipv4_add() so that route is only installed
in the kernel fib.  (It's fine now to have a route in SW but not in
HW).

- Find/invent in fib trie the structure that relates routes with the
same prefix but different prefix lengths.  So A/8 and A/16 and A/30
are related.  B/16 is not related to the other three.  Add a bool
offload_disabled to that structure to mark them offload_disabled.  We
don't want LPM broken with something like this:

   HW: A/8, A/16, B/8
   SW: A/8, A/16, A/30, B/8

Here when HW gets an A pkt, it'll match A/16 when really SW's A/30
should have won.  Now LPM is broken.

- In the driver's fib obj add handler, if it can't install the route
because the device flat out can't support the route, then return
-EOPNOTSUPP and switchdev_fib_ipv4_add() will mark the route as
offload_disabled.  Actually, switchdev_fib_ipv4_add() needs to
additionally remove related routes from device.

- In the driver's fib obj add handler, if the route can be installed,
but at the expense of evicting some other route(s) due to evict policy
(driver's choice on which route(s) get evicted), the handler returns 0
and switchdev_fib_ipv4_add() is happy.  The driver should send a
switchdev notification to the fib layer to mark the evicted route(s)
as offload disable.

- Routes marked with offload_disabled should have RTNH_F_OFFLOAD cleared.

I think I'm being sloppy with making sure once a route is removed from
the device, all the related routes are also removed from the device.
It's late; brain not working 100%.  But I think you have the gist of
it.  I don't think there is a lot of code changes here to pull this
off.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
  2015-05-17  3:46 Roopa Prabhu
@ 2015-05-17 23:41 ` roopa
  0 siblings, 0 replies; 34+ messages in thread
From: roopa @ 2015-05-17 23:41 UTC (permalink / raw)
  To: davem, sfeldma, john.fastabend, jiri; +Cc: netdev

On 5/16/15, 8:46 PM, Roopa Prabhu wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>
> This patch removes the calls to netdev_switch_fib_ipv4_abort when
> there is an error in programming fib entry in hardware.
>
> On most systems where you can offload routes to hardware,
> doing routing in software is not an option (the cpu limitations
> make routing impossible in software).
>
> I understand that this was added to keep the first fib offload support
> simple.
>
> As discussed in the RFC patch, available options:
> a) Fail fib entry add on hardware offload failure on switch devices, and
> return appropriate error to the user by default (this patch)
> b) make the behaviour in a) conditional on a global flag/sysctl (a per fib
> entry flag will not work because by default user/routing-daemons dont care
> if they are hardware offloaded)
> c) for users/routing-daemons interested in controlling hardware
> offload behaviour there is always the per fib entry flag RTNH_F_OFFLOAD
>
> Considering the characteristics of the systems that support fib offloads,
> this patch implements a). Also making a) the default will enable easier/faster
> migration of existing routing apps to switch devices.
>
> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> ---
> RFC to v1:
>    - rebase to net
>    - remove fib_offload_disabled flag and all associated code as suggested by
>      scott feldman
>
> (This patch is currently against net. If there is any reason to move it to
> net-next, I can respin)
>
>
pls ignore, i will be resending v2 with missing switchdev.h cleanups.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware
@ 2015-05-17  3:46 Roopa Prabhu
  2015-05-17 23:41 ` roopa
  0 siblings, 1 reply; 34+ messages in thread
From: Roopa Prabhu @ 2015-05-17  3:46 UTC (permalink / raw)
  To: davem, sfeldma, john.fastabend, jiri; +Cc: netdev

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This patch removes the calls to netdev_switch_fib_ipv4_abort when
there is an error in programming fib entry in hardware.

On most systems where you can offload routes to hardware,
doing routing in software is not an option (the cpu limitations
make routing impossible in software).

I understand that this was added to keep the first fib offload support
simple.

As discussed in the RFC patch, available options:
a) Fail fib entry add on hardware offload failure on switch devices, and
return appropriate error to the user by default (this patch)
b) make the behaviour in a) conditional on a global flag/sysctl (a per fib
entry flag will not work because by default user/routing-daemons dont care
if they are hardware offloaded)
c) for users/routing-daemons interested in controlling hardware
offload behaviour there is always the per fib entry flag RTNH_F_OFFLOAD

Considering the characteristics of the systems that support fib offloads,
this patch implements a). Also making a) the default will enable easier/faster
migration of existing routing apps to switch devices.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
RFC to v1:
  - rebase to net
  - remove fib_offload_disabled flag and all associated code as suggested by
    scott feldman

(This patch is currently against net. If there is any reason to move it to
net-next, I can respin)

 include/net/netns/ipv4.h  |    1 -
 net/ipv4/fib_trie.c       |    5 +----
 net/switchdev/switchdev.c |   23 -----------------------
 3 files changed, 1 insertion(+), 28 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 614a49b..31919b1 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -47,7 +47,6 @@ struct netns_ipv4 {
 	int			fib_num_tclassid_users;
 #endif
 	struct hlist_head	*fib_table_hash;
-	bool			fib_offload_disabled;
 	struct sock		*fibnl;
 
 	struct sock  * __percpu	*icmp_sk;
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 64c2076..ef523d3 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1171,7 +1171,6 @@ int fib_table_insert(struct fib_table *tb, struct fib_config *cfg)
 							 cfg->fc_nlflags,
 							 tb->tb_id);
 			if (err) {
-				netdev_switch_fib_ipv4_abort(fi);
 				kmem_cache_free(fn_alias_kmem, new_fa);
 				goto out;
 			}
@@ -1219,10 +1218,8 @@ int fib_table_insert(struct fib_table *tb, struct fib_config *cfg)
 					 cfg->fc_type,
 					 cfg->fc_nlflags,
 					 tb->tb_id);
-	if (err) {
-		netdev_switch_fib_ipv4_abort(fi);
+	if (err)
 		goto out_free_new_fa;
-	}
 
 	/* Insert new entry to the list. */
 	err = fib_insert_alias(t, tp, l, new_fa, fa, key);
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 055453d..d1269c4 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -325,9 +325,6 @@ int netdev_switch_fib_ipv4_add(u32 dst, int dst_len, struct fib_info *fi,
 		return 0;
 #endif
 
-	if (fi->fib_net->ipv4.fib_offload_disabled)
-		return 0;
-
 	dev = netdev_switch_get_dev_by_nhs(fi);
 	if (!dev)
 		return 0;
@@ -382,23 +379,3 @@ int netdev_switch_fib_ipv4_del(u32 dst, int dst_len, struct fib_info *fi,
 	return err;
 }
 EXPORT_SYMBOL_GPL(netdev_switch_fib_ipv4_del);
-
-/**
- *	netdev_switch_fib_ipv4_abort - Abort an IPv4 FIB operation
- *
- *	@fi: route FIB info structure
- */
-void netdev_switch_fib_ipv4_abort(struct fib_info *fi)
-{
-	/* There was a problem installing this route to the offload
-	 * device.  For now, until we come up with more refined
-	 * policy handling, abruptly end IPv4 fib offloading for
-	 * for entire net by flushing offload device(s) of all
-	 * IPv4 routes, and mark IPv4 fib offloading broken from
-	 * this point forward.
-	 */
-
-	fib_flush_external(fi->fib_net);
-	fi->fib_net->ipv4.fib_offload_disabled = true;
-}
-EXPORT_SYMBOL_GPL(netdev_switch_fib_ipv4_abort);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2015-05-31  7:34 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-17 23:42 [PATCH net v2] switchdev: don't abort hardware ipv4 fib offload on failure to program fib entry in hardware Roopa Prabhu
2015-05-18  5:11 ` Scott Feldman
2015-05-18 20:19 ` David Miller
2015-05-19  0:21   ` John Fastabend
2015-05-19  3:48     ` David Miller
2015-05-19  5:58       ` roopa
2015-05-19 16:34         ` David Miller
2015-05-19 17:01           ` Jiri Pirko
2015-05-19 19:47           ` Andy Gospodarek
2015-05-19 20:28             ` David Miller
2015-05-20 14:37               ` Andy Gospodarek
2015-05-21  5:46               ` Scott Feldman
2015-05-21 15:37                 ` roopa
2015-05-29  7:50                 ` Jiri Pirko
2015-05-29 15:39                   ` Scott Feldman
2015-05-30  9:00                     ` Jiri Pirko
2015-05-31  4:19                       ` John Fastabend
2015-05-31  6:34                         ` Scott Feldman
2015-05-31  7:34                       ` Scott Feldman
2015-05-19  5:57   ` roopa
2015-05-28  9:42   ` Jiri Pirko
2015-05-28 15:35     ` John Fastabend
2015-05-29  7:42       ` Jiri Pirko
2015-05-28 15:40     ` Scott Feldman
2015-05-28 16:10       ` John Fastabend
2015-05-29  5:37         ` roopa
2015-05-28 22:35       ` Andy Gospodarek
2015-05-29  5:51         ` roopa
2015-05-29  7:50       ` Jiri Pirko
2015-05-29  5:31     ` roopa
2015-05-29 15:12     ` Scott Feldman
2015-05-29 15:37       ` Jiri Pirko
  -- strict thread matches above, loose matches on Subject: below --
2015-05-17  3:46 Roopa Prabhu
2015-05-17 23:41 ` roopa

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.