* [patch net-next v2 00/11] ipv4: fib: Allow modules to dump FIB tables
@ 2016-11-23 14:34 Jiri Pirko
  2016-11-23 14:34 ` [patch net-next v2 01/11] ipv4: fib: Export free_fib_info() Jiri Pirko
                   ` (10 more replies)
  0 siblings, 11 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-11-23 14:34 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber

From: Jiri Pirko <jiri@mellanox.com>

Ido says:

In kernel 4.9 the switchdev-specific FIB offload mechanism was replaced
by a new FIB notification chain to which modules could register in order
to be notified about the addition and deletion of FIB entries. The
motivation for this change was that switchdev drivers need to be able to
reflect the entire FIB table and not only FIBs configured on top of the
port netdevs themselves. This is useful in case of in-band management.

The fundamental problem with this approach is that listeners never see
the notifications sent in the chain before they registered and thus
have an incomplete view of the FIB tables, which can result in packet
loss. This patchset fixes that by introducing a new API to dump the FIB
tables.

The entire dump process is done under RCU and thus the FIB notification
chain is converted to be atomic. The listeners are modified accordingly.
This is done in the first seven patches.

The eighth patch adds a change sequence counter to ensure the integrity
of the FIB dump, which is finally introduced in the following patch. The
last two patches modify current listeners of the FIB notification chain
to invoke the dump during their init.

---
v1->v2:
- Add a sequence counter to ensure the integrity of the FIB dump
  (David S. Miller, Hannes Frederic Sowa).
- Protect notifications from re-ordering in listeners by using an
  ordered workqueue (Hannes Frederic Sowa).
- Introduce fib_info_hold() (Jiri Pirko).
- Relieve rocker from the need to invoke the FIB dump by registering
  to the FIB notification chain prior to port creation.
 
Ido Schimmel (11):
  ipv4: fib: Export free_fib_info()
  ipv4: fib: Add fib_info_hold() helper
  mlxsw: core: Create an ordered workqueue for FIB offload
  mlxsw: spectrum_router: Implement FIB offload in deferred work
  rocker: Create an ordered workqueue for FIB offload
  rocker: Implement FIB offload in deferred work
  ipv4: fib: Convert FIB notification chain to be atomic
  ipv4: fib: Allow for consistent FIB dumping
  ipv4: fib: Add an API to request a FIB dump
  mlxsw: spectrum_router: Request a dump of FIB tables during init
  rocker: Register FIB notifier before creating ports

 drivers/net/ethernet/mellanox/mlxsw/core.c         |  22 ++++
 drivers/net/ethernet/mellanox/mlxsw/core.h         |   2 +
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |  88 ++++++++++++--
 drivers/net/ethernet/rocker/rocker.h               |   1 +
 drivers/net/ethernet/rocker/rocker_main.c          |  78 +++++++++++--
 drivers/net/ethernet/rocker/rocker_ofdpa.c         |   1 +
 include/net/ip_fib.h                               |   6 +
 include/net/netns/ipv4.h                           |   2 +
 net/ipv4/fib_frontend.c                            |   2 +
 net/ipv4/fib_semantics.c                           |   1 +
 net/ipv4/fib_trie.c                                | 126 ++++++++++++++++++++-
 11 files changed, 303 insertions(+), 26 deletions(-)

-- 
2.7.4


* [patch net-next v2 01/11] ipv4: fib: Export free_fib_info()
  2016-11-23 14:34 [patch net-next v2 00/11] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
@ 2016-11-23 14:34 ` Jiri Pirko
  2016-11-23 14:34 ` [patch net-next v2 02/11] ipv4: fib: Add fib_info_hold() helper Jiri Pirko
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-11-23 14:34 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber

From: Ido Schimmel <idosch@mellanox.com>

The FIB notification chain is going to be converted to an atomic chain,
which means switchdev drivers will have to offload FIB entries in
deferred work, as hardware operations entail sleeping.

However, while the work is queued fib info might be freed, so a
reference must be taken. To release the reference (and potentially free
the fib info) fib_info_put() will be called, which in turn calls
free_fib_info().

Export free_fib_info() so that modules will be able to invoke
fib_info_put().

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 net/ipv4/fib_semantics.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 388d3e2..c1bc1e9 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -234,6 +234,7 @@ void free_fib_info(struct fib_info *fi)
 #endif
 	call_rcu(&fi->rcu, free_fib_info_rcu);
 }
+EXPORT_SYMBOL_GPL(free_fib_info);
 
 void fib_release_info(struct fib_info *fi)
 {
-- 
2.7.4


* [patch net-next v2 02/11] ipv4: fib: Add fib_info_hold() helper
  2016-11-23 14:34 [patch net-next v2 00/11] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
  2016-11-23 14:34 ` [patch net-next v2 01/11] ipv4: fib: Export free_fib_info() Jiri Pirko
@ 2016-11-23 14:34 ` Jiri Pirko
  2016-11-23 14:34 ` [patch net-next v2 03/11] mlxsw: core: Create an ordered workqueue for FIB offload Jiri Pirko
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-11-23 14:34 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber

From: Ido Schimmel <idosch@mellanox.com>

As explained in the previous commit, modules are going to need to take a
reference on fib info and then drop it using fib_info_put().

Add the fib_info_hold() helper to make the code more readable and also
symmetric with fib_info_put().
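
The hold/put pairing described above can be sketched in userspace. This
is only an illustration under stated assumptions: struct demo_info and
the demo_* functions are hypothetical stand-ins, C11 atomics replace the
kernel's atomic_t, and a flag is set where the kernel would free the
object via RCU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct fib_info; only the refcount is modelled. */
struct demo_info {
	atomic_int clntref;	/* plays the role of fib_clntref */
	bool freed;		/* set instead of actually freeing memory */
};

/* Stand-in for the free path; the real code defers the free via RCU. */
static void demo_info_free(struct demo_info *di)
{
	di->freed = true;
}

/* Take one more reference, e.g. before queueing deferred work. */
static void demo_info_hold(struct demo_info *di)
{
	atomic_fetch_add(&di->clntref, 1);
}

/* Drop a reference; whoever drops the last one triggers the free. */
static void demo_info_put(struct demo_info *di)
{
	/* atomic_fetch_sub() returns the value held before the decrement. */
	if (atomic_fetch_sub(&di->clntref, 1) == 1)
		demo_info_free(di);
}
```

A caller that starts with one reference, takes a second one with
demo_info_hold() and then calls demo_info_put() twice sees the object
released only on the second put.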

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Suggested-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/ip_fib.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index f390c3b..6c67b93 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -397,6 +397,11 @@ static inline void fib_combine_itag(u32 *itag, const struct fib_result *res)
 
 void free_fib_info(struct fib_info *fi);
 
+static inline void fib_info_hold(struct fib_info *fi)
+{
+	atomic_inc(&fi->fib_clntref);
+}
+
 static inline void fib_info_put(struct fib_info *fi)
 {
 	if (atomic_dec_and_test(&fi->fib_clntref))
-- 
2.7.4


* [patch net-next v2 03/11] mlxsw: core: Create an ordered workqueue for FIB offload
  2016-11-23 14:34 [patch net-next v2 00/11] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
  2016-11-23 14:34 ` [patch net-next v2 01/11] ipv4: fib: Export free_fib_info() Jiri Pirko
  2016-11-23 14:34 ` [patch net-next v2 02/11] ipv4: fib: Add fib_info_hold() helper Jiri Pirko
@ 2016-11-23 14:34 ` Jiri Pirko
  2016-11-23 14:34 ` [patch net-next v2 04/11] mlxsw: spectrum_router: Implement FIB offload in deferred work Jiri Pirko
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-11-23 14:34 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber

From: Ido Schimmel <idosch@mellanox.com>

We're going to start processing FIB entry addition / deletion events
in deferred work. These work items must be processed in the order they
were submitted or otherwise we can have differences between the kernel's
FIB table and the device's.

Solve this by creating an ordered workqueue to which these work items
will be submitted. Note that we can't simply convert the current
workqueue to be ordered, as EMAD re-transmissions are also processed in
deferred work.

Later on, we can migrate other work items to this workqueue, such as FDB
notification processing and nexthop resolution, since they all take the
same lock anyway.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 22 ++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlxsw/core.h |  2 ++
 2 files changed, 24 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index bcd7251..a8d9a9c 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -77,6 +77,7 @@ static const char mlxsw_core_driver_name[] = "mlxsw_core";
 static struct dentry *mlxsw_core_dbg_root;
 
 static struct workqueue_struct *mlxsw_wq;
+static struct workqueue_struct *mlxsw_owq;
 
 struct mlxsw_core_pcpu_stats {
 	u64			trap_rx_packets[MLXSW_TRAP_ID_MAX];
@@ -1857,6 +1858,18 @@ int mlxsw_core_schedule_dw(struct delayed_work *dwork, unsigned long delay)
 }
 EXPORT_SYMBOL(mlxsw_core_schedule_dw);
 
+int mlxsw_core_schedule_odw(struct delayed_work *dwork, unsigned long delay)
+{
+	return queue_delayed_work(mlxsw_owq, dwork, delay);
+}
+EXPORT_SYMBOL(mlxsw_core_schedule_odw);
+
+void mlxsw_core_flush_owq(void)
+{
+	flush_workqueue(mlxsw_owq);
+}
+EXPORT_SYMBOL(mlxsw_core_flush_owq);
+
 static int __init mlxsw_core_module_init(void)
 {
 	int err;
@@ -1864,6 +1877,12 @@ static int __init mlxsw_core_module_init(void)
 	mlxsw_wq = alloc_workqueue(mlxsw_core_driver_name, WQ_MEM_RECLAIM, 0);
 	if (!mlxsw_wq)
 		return -ENOMEM;
+	mlxsw_owq = alloc_ordered_workqueue("%s_ordered", WQ_MEM_RECLAIM,
+					    mlxsw_core_driver_name);
+	if (!mlxsw_owq) {
+		err = -ENOMEM;
+		goto err_alloc_ordered_workqueue;
+	}
 	mlxsw_core_dbg_root = debugfs_create_dir(mlxsw_core_driver_name, NULL);
 	if (!mlxsw_core_dbg_root) {
 		err = -ENOMEM;
@@ -1872,6 +1891,8 @@ static int __init mlxsw_core_module_init(void)
 	return 0;
 
 err_debugfs_create_dir:
+	destroy_workqueue(mlxsw_owq);
+err_alloc_ordered_workqueue:
 	destroy_workqueue(mlxsw_wq);
 	return err;
 }
@@ -1879,6 +1900,7 @@ static int __init mlxsw_core_module_init(void)
 static void __exit mlxsw_core_module_exit(void)
 {
 	debugfs_remove_recursive(mlxsw_core_dbg_root);
+	destroy_workqueue(mlxsw_owq);
 	destroy_workqueue(mlxsw_wq);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h b/drivers/net/ethernet/mellanox/mlxsw/core.h
index 3de8955..f676ee9 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -156,6 +156,8 @@ enum devlink_port_type mlxsw_core_port_type_get(struct mlxsw_core *mlxsw_core,
 						u8 local_port);
 
 int mlxsw_core_schedule_dw(struct delayed_work *dwork, unsigned long delay);
+int mlxsw_core_schedule_odw(struct delayed_work *dwork, unsigned long delay);
+void mlxsw_core_flush_owq(void);
 
 #define MLXSW_CONFIG_PROFILE_SWID_COUNT 8
 
-- 
2.7.4


* [patch net-next v2 04/11] mlxsw: spectrum_router: Implement FIB offload in deferred work
  2016-11-23 14:34 [patch net-next v2 00/11] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
                   ` (2 preceding siblings ...)
  2016-11-23 14:34 ` [patch net-next v2 03/11] mlxsw: core: Create an ordered workqueue for FIB offload Jiri Pirko
@ 2016-11-23 14:34 ` Jiri Pirko
  2016-11-23 14:34 ` [patch net-next v2 05/11] rocker: Create an ordered workqueue for FIB offload Jiri Pirko
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-11-23 14:34 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber

From: Ido Schimmel <idosch@mellanox.com>

FIB offload is currently done in process context with RTNL held, but
we're about to dump the FIB tables in an RCU critical section, so we can
no longer sleep.

Instead, defer the operation to process context using deferred work. Make
sure fib info isn't freed while the work is queued by taking a reference
on it and releasing it after the operation is done.

Deferring the operation is valid because the upper layers always assume
the operation was successful. If it's not, then the driver-specific
abort mechanism is called and all routed traffic is directed to slow
path.

The work items are submitted to an ordered workqueue to prevent a
mismatch between the kernel's FIB table and the device's.
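
The split between the notifier and the deferred handler can be modelled
in userspace as a plain FIFO. This is a single-threaded sketch under
stated assumptions, not the driver's code: every demo_* name is
hypothetical, a linked list stands in for the ordered workqueue, and the
"device table" is reduced to one flag that says whether a route is
present.

```c
#include <assert.h>
#include <stdlib.h>

enum demo_event { DEMO_ENTRY_ADD, DEMO_ENTRY_DEL };

/* One queued event; the notifier copies what the handler needs later. */
struct demo_work {
	struct demo_work *next;
	enum demo_event event;
};

/* Singly linked FIFO standing in for the ordered workqueue. */
struct demo_queue {
	struct demo_work *head;
	struct demo_work **tail;
};

static void demo_queue_init(struct demo_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

/* Notifier half: must not sleep, so it only copies and queues.
 * Returns 0 on success, -1 if the allocation failed.
 */
static int demo_event_enqueue(struct demo_queue *q, enum demo_event ev)
{
	struct demo_work *w = calloc(1, sizeof(*w));

	if (!w)
		return -1;
	w->event = ev;
	*q->tail = w;		/* append at the end to keep FIFO order */
	q->tail = &w->next;
	return 0;
}

/* Worker half: applies the events strictly in submission order and
 * returns how many were handled.
 */
static int demo_queue_drain(struct demo_queue *q, int *present)
{
	int handled = 0;

	while (q->head) {
		struct demo_work *w = q->head;

		q->head = w->next;
		*present = (w->event == DEMO_ENTRY_ADD);
		free(w);
		handled++;
	}
	q->tail = &q->head;	/* queue is empty again */
	return handled;
}
```

Queueing an add followed by a delete leaves the flag clear after the
drain, while the opposite order leaves it set, which is why reordering
the work items would make the two tables diverge.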

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 72 +++++++++++++++++++---
 1 file changed, 62 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 683f045..14bed1d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -593,6 +593,14 @@ static void mlxsw_sp_router_fib_flush(struct mlxsw_sp *mlxsw_sp);
 
 static void mlxsw_sp_vrs_fini(struct mlxsw_sp *mlxsw_sp)
 {
+	/* At this stage we're guaranteed not to have new incoming
+	 * FIB notifications and the work queue is free from FIBs
+	 * sitting on top of mlxsw netdevs. However, we can still
+	 * have other FIBs queued. Flush the queue before flushing
+	 * the device's tables. No need for locks, as we're the only
+	 * writer.
+	 */
+	mlxsw_core_flush_owq();
 	mlxsw_sp_router_fib_flush(mlxsw_sp);
 	kfree(mlxsw_sp->router.vrs);
 }
@@ -1948,30 +1956,74 @@ static void __mlxsw_sp_router_fini(struct mlxsw_sp *mlxsw_sp)
 	kfree(mlxsw_sp->rifs);
 }
 
-static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
-				     unsigned long event, void *ptr)
+struct mlxsw_sp_fib_event_work {
+	struct delayed_work dw;
+	struct fib_entry_notifier_info fen_info;
+	struct mlxsw_sp *mlxsw_sp;
+	unsigned long event;
+};
+
+static void mlxsw_sp_router_fib_event_work(struct work_struct *work)
 {
-	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
-	struct fib_entry_notifier_info *fen_info = ptr;
+	struct mlxsw_sp_fib_event_work *fib_work =
+		container_of(work, struct mlxsw_sp_fib_event_work, dw.work);
+	struct mlxsw_sp *mlxsw_sp = fib_work->mlxsw_sp;
 	int err;
 
-	if (!net_eq(fen_info->info.net, &init_net))
-		return NOTIFY_DONE;
-
-	switch (event) {
+	/* Protect internal structures from changes */
+	rtnl_lock();
+	switch (fib_work->event) {
 	case FIB_EVENT_ENTRY_ADD:
-		err = mlxsw_sp_router_fib4_add(mlxsw_sp, fen_info);
+		err = mlxsw_sp_router_fib4_add(mlxsw_sp, &fib_work->fen_info);
 		if (err)
 			mlxsw_sp_router_fib4_abort(mlxsw_sp);
+		fib_info_put(fib_work->fen_info.fi);
 		break;
 	case FIB_EVENT_ENTRY_DEL:
-		mlxsw_sp_router_fib4_del(mlxsw_sp, fen_info);
+		mlxsw_sp_router_fib4_del(mlxsw_sp, &fib_work->fen_info);
+		fib_info_put(fib_work->fen_info.fi);
 		break;
 	case FIB_EVENT_RULE_ADD: /* fall through */
 	case FIB_EVENT_RULE_DEL:
 		mlxsw_sp_router_fib4_abort(mlxsw_sp);
 		break;
 	}
+	rtnl_unlock();
+	kfree(fib_work);
+}
+
+/* Called with rcu_read_lock() */
+static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
+				     unsigned long event, void *ptr)
+{
+	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
+	struct mlxsw_sp_fib_event_work *fib_work;
+	struct fib_notifier_info *info = ptr;
+
+	if (!net_eq(info->net, &init_net))
+		return NOTIFY_DONE;
+
+	fib_work = kzalloc(sizeof(*fib_work), GFP_ATOMIC);
+	if (WARN_ON(!fib_work))
+		return NOTIFY_BAD;
+
+	INIT_DELAYED_WORK(&fib_work->dw, mlxsw_sp_router_fib_event_work);
+	fib_work->mlxsw_sp = mlxsw_sp;
+	fib_work->event = event;
+
+	switch (event) {
+	case FIB_EVENT_ENTRY_ADD: /* fall through */
+	case FIB_EVENT_ENTRY_DEL:
+		memcpy(&fib_work->fen_info, ptr, sizeof(fib_work->fen_info));
+		/* Take reference on fib_info to prevent it from being
+		 * freed while work is queued. Release it afterwards.
+		 */
+		fib_info_hold(fib_work->fen_info.fi);
+		break;
+	}
+
+	mlxsw_core_schedule_odw(&fib_work->dw, 0);
+
 	return NOTIFY_DONE;
 }
 
-- 
2.7.4


* [patch net-next v2 05/11] rocker: Create an ordered workqueue for FIB offload
  2016-11-23 14:34 [patch net-next v2 00/11] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
                   ` (3 preceding siblings ...)
  2016-11-23 14:34 ` [patch net-next v2 04/11] mlxsw: spectrum_router: Implement FIB offload in deferred work Jiri Pirko
@ 2016-11-23 14:34 ` Jiri Pirko
  2016-11-23 14:34 ` [patch net-next v2 06/11] rocker: Implement FIB offload in deferred work Jiri Pirko
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-11-23 14:34 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber

From: Ido Schimmel <idosch@mellanox.com>

As explained in the previous patches, we need to process FIB entry
addition / deletion events in FIFO order or otherwise we can have a
mismatch between the kernel's FIB table and the device's.

Create an ordered workqueue for rocker to which these work items will be
submitted.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/rocker/rocker.h      |  1 +
 drivers/net/ethernet/rocker/rocker_main.c | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/net/ethernet/rocker/rocker.h b/drivers/net/ethernet/rocker/rocker.h
index 2eb9b49..ee9675d 100644
--- a/drivers/net/ethernet/rocker/rocker.h
+++ b/drivers/net/ethernet/rocker/rocker.h
@@ -72,6 +72,7 @@ struct rocker {
 	struct rocker_dma_ring_info event_ring;
 	struct notifier_block fib_nb;
 	struct rocker_world_ops *wops;
+	struct workqueue_struct *rocker_owq;
 	void *wpriv;
 };
 
diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index 67df4cf..424be96 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -28,6 +28,7 @@
 #include <linux/if_bridge.h>
 #include <linux/bitops.h>
 #include <linux/ctype.h>
+#include <linux/workqueue.h>
 #include <net/switchdev.h>
 #include <net/rtnetlink.h>
 #include <net/netevent.h>
@@ -2754,6 +2755,13 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto err_request_event_irq;
 	}
 
+	rocker->rocker_owq = alloc_ordered_workqueue(rocker_driver_name,
+						     WQ_MEM_RECLAIM);
+	if (!rocker->rocker_owq) {
+		err = -ENOMEM;
+		goto err_alloc_ordered_workqueue;
+	}
+
 	rocker->hw.id = rocker_read64(rocker, SWITCH_ID);
 
 	err = rocker_probe_ports(rocker);
@@ -2771,6 +2779,8 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	return 0;
 
 err_probe_ports:
+	destroy_workqueue(rocker->rocker_owq);
+err_alloc_ordered_workqueue:
 	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT), rocker);
 err_request_event_irq:
 	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_CMD), rocker);
@@ -2799,6 +2809,7 @@ static void rocker_remove(struct pci_dev *pdev)
 	unregister_fib_notifier(&rocker->fib_nb);
 	rocker_write32(rocker, CONTROL, ROCKER_CONTROL_RESET);
 	rocker_remove_ports(rocker);
+	destroy_workqueue(rocker->rocker_owq);
 	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT), rocker);
 	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_CMD), rocker);
 	rocker_dma_rings_fini(rocker);
-- 
2.7.4


* [patch net-next v2 06/11] rocker: Implement FIB offload in deferred work
  2016-11-23 14:34 [patch net-next v2 00/11] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
                   ` (4 preceding siblings ...)
  2016-11-23 14:34 ` [patch net-next v2 05/11] rocker: Create an ordered workqueue for FIB offload Jiri Pirko
@ 2016-11-23 14:34 ` Jiri Pirko
  2016-11-23 14:34 ` [patch net-next v2 07/11] ipv4: fib: Convert FIB notification chain to be atomic Jiri Pirko
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-11-23 14:34 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber

From: Ido Schimmel <idosch@mellanox.com>

Convert rocker to offload FIBs in deferred work in a similar fashion to
mlxsw, which was converted in the previous patches.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/rocker/rocker_main.c  | 58 +++++++++++++++++++++++++-----
 drivers/net/ethernet/rocker/rocker_ofdpa.c |  1 +
 2 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index 424be96..914e9e1 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -2166,28 +2166,70 @@ static const struct switchdev_ops rocker_port_switchdev_ops = {
 	.switchdev_port_obj_dump	= rocker_port_obj_dump,
 };
 
-static int rocker_router_fib_event(struct notifier_block *nb,
-				   unsigned long event, void *ptr)
+struct rocker_fib_event_work {
+	struct work_struct work;
+	struct fib_entry_notifier_info fen_info;
+	struct rocker *rocker;
+	unsigned long event;
+};
+
+static void rocker_router_fib_event_work(struct work_struct *work)
 {
-	struct rocker *rocker = container_of(nb, struct rocker, fib_nb);
-	struct fib_entry_notifier_info *fen_info = ptr;
+	struct rocker_fib_event_work *fib_work =
+		container_of(work, struct rocker_fib_event_work, work);
+	struct rocker *rocker = fib_work->rocker;
 	int err;
 
-	switch (event) {
+	/* Protect internal structures from changes */
+	rtnl_lock();
+	switch (fib_work->event) {
 	case FIB_EVENT_ENTRY_ADD:
-		err = rocker_world_fib4_add(rocker, fen_info);
+		err = rocker_world_fib4_add(rocker, &fib_work->fen_info);
 		if (err)
 			rocker_world_fib4_abort(rocker);
-		else
+		fib_info_put(fib_work->fen_info.fi);
 		break;
 	case FIB_EVENT_ENTRY_DEL:
-		rocker_world_fib4_del(rocker, fen_info);
+		rocker_world_fib4_del(rocker, &fib_work->fen_info);
+		fib_info_put(fib_work->fen_info.fi);
 		break;
 	case FIB_EVENT_RULE_ADD: /* fall through */
 	case FIB_EVENT_RULE_DEL:
 		rocker_world_fib4_abort(rocker);
 		break;
 	}
+	rtnl_unlock();
+	kfree(fib_work);
+}
+
+/* Called with rcu_read_lock() */
+static int rocker_router_fib_event(struct notifier_block *nb,
+				   unsigned long event, void *ptr)
+{
+	struct rocker *rocker = container_of(nb, struct rocker, fib_nb);
+	struct rocker_fib_event_work *fib_work;
+
+	fib_work = kzalloc(sizeof(*fib_work), GFP_ATOMIC);
+	if (WARN_ON(!fib_work))
+		return NOTIFY_BAD;
+
+	INIT_WORK(&fib_work->work, rocker_router_fib_event_work);
+	fib_work->rocker = rocker;
+	fib_work->event = event;
+
+	switch (event) {
+	case FIB_EVENT_ENTRY_ADD: /* fall through */
+	case FIB_EVENT_ENTRY_DEL:
+		memcpy(&fib_work->fen_info, ptr, sizeof(fib_work->fen_info));
+		/* Take reference on fib_info to prevent it from being
+		 * freed while work is queued. Release it afterwards.
+		 */
+		fib_info_hold(fib_work->fen_info.fi);
+		break;
+	}
+
+	queue_work(rocker->rocker_owq, &fib_work->work);
+
 	return NOTIFY_DONE;
 }
 
diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c b/drivers/net/ethernet/rocker/rocker_ofdpa.c
index 4ca4613..7cd76b6 100644
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@ -2516,6 +2516,7 @@ static void ofdpa_fini(struct rocker *rocker)
 	int bkt;
 
 	del_timer_sync(&ofdpa->fdb_cleanup_timer);
+	flush_workqueue(rocker->rocker_owq);
 
 	spin_lock_irqsave(&ofdpa->flow_tbl_lock, flags);
 	hash_for_each_safe(ofdpa->flow_tbl, bkt, tmp, flow_entry, entry)
-- 
2.7.4


* [patch net-next v2 07/11] ipv4: fib: Convert FIB notification chain to be atomic
  2016-11-23 14:34 [patch net-next v2 00/11] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
                   ` (5 preceding siblings ...)
  2016-11-23 14:34 ` [patch net-next v2 06/11] rocker: Implement FIB offload in deferred work Jiri Pirko
@ 2016-11-23 14:34 ` Jiri Pirko
  2016-11-23 14:34 ` [patch net-next v2 08/11] ipv4: fib: Allow for consistent FIB dumping Jiri Pirko
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-11-23 14:34 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber

From: Ido Schimmel <idosch@mellanox.com>

In order not to hold RTNL for long periods of time we're going to dump
the FIB tables using RCU.

Convert the FIB notification chain to be atomic, as we can't block in
RCU critical sections.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 net/ipv4/fib_trie.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 026f309..9bfce0d 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -84,17 +84,17 @@
 #include <trace/events/fib.h>
 #include "fib_lookup.h"
 
-static BLOCKING_NOTIFIER_HEAD(fib_chain);
+static ATOMIC_NOTIFIER_HEAD(fib_chain);
 
 int register_fib_notifier(struct notifier_block *nb)
 {
-	return blocking_notifier_chain_register(&fib_chain, nb);
+	return atomic_notifier_chain_register(&fib_chain, nb);
 }
 EXPORT_SYMBOL(register_fib_notifier);
 
 int unregister_fib_notifier(struct notifier_block *nb)
 {
-	return blocking_notifier_chain_unregister(&fib_chain, nb);
+	return atomic_notifier_chain_unregister(&fib_chain, nb);
 }
 EXPORT_SYMBOL(unregister_fib_notifier);
 
@@ -102,7 +102,7 @@ int call_fib_notifiers(struct net *net, enum fib_event_type event_type,
 		       struct fib_notifier_info *info)
 {
 	info->net = net;
-	return blocking_notifier_call_chain(&fib_chain, event_type, info);
+	return atomic_notifier_call_chain(&fib_chain, event_type, info);
 }
 
 static int call_fib_entry_notifiers(struct net *net,
-- 
2.7.4


* [patch net-next v2 08/11] ipv4: fib: Allow for consistent FIB dumping
  2016-11-23 14:34 [patch net-next v2 00/11] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
                   ` (6 preceding siblings ...)
  2016-11-23 14:34 ` [patch net-next v2 07/11] ipv4: fib: Convert FIB notification chain to be atomic Jiri Pirko
@ 2016-11-23 14:34 ` Jiri Pirko
  2016-11-23 14:34 ` [patch net-next v2 09/11] ipv4: fib: Add an API to request a FIB dump Jiri Pirko
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-11-23 14:34 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber

From: Ido Schimmel <idosch@mellanox.com>

The next patch will enable listeners of the FIB notification chain to
request a dump of the FIB tables. However, since RTNL isn't taken during
the dump, it's possible for the FIB tables to change mid-dump, which
will result in inconsistency between the listener's table and the
kernel's.

Allow listeners to know about changes that occurred mid-dump, by adding
a change sequence counter to each net namespace. The counter is
incremented just before a notification is sent in the FIB chain.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/netns/ipv4.h | 2 ++
 net/ipv4/fib_frontend.c  | 2 ++
 net/ipv4/fib_trie.c      | 1 +
 3 files changed, 5 insertions(+)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 7adf438..d236c08 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -136,5 +136,7 @@ struct netns_ipv4 {
 	int sysctl_fib_multipath_use_neigh;
 #endif
 	atomic_t	rt_genid;
+
+	atomic_t	fib_seq;
 };
 #endif
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 121384b..cf8c867 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1219,6 +1219,8 @@ static int __net_init ip_fib_net_init(struct net *net)
 	int err;
 	size_t size = sizeof(struct hlist_head) * FIB_TABLE_HASHSZ;
 
+	atomic_set(&net->ipv4.fib_seq, 0);
+
 	/* Avoid false sharing : Use at least a full cache line */
 	size = max_t(size_t, size, L1_CACHE_BYTES);
 
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 9bfce0d..b1d2d09 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -101,6 +101,7 @@ EXPORT_SYMBOL(unregister_fib_notifier);
 int call_fib_notifiers(struct net *net, enum fib_event_type event_type,
 		       struct fib_notifier_info *info)
 {
+	atomic_inc(&net->ipv4.fib_seq);
 	info->net = net;
 	return atomic_notifier_call_chain(&fib_chain, event_type, info);
 }
-- 
2.7.4


* [patch net-next v2 09/11] ipv4: fib: Add an API to request a FIB dump
  2016-11-23 14:34 [patch net-next v2 00/11] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
                   ` (7 preceding siblings ...)
  2016-11-23 14:34 ` [patch net-next v2 08/11] ipv4: fib: Allow for consistent FIB dumping Jiri Pirko
@ 2016-11-23 14:34 ` Jiri Pirko
  2016-11-23 17:47   ` Hannes Frederic Sowa
  2016-11-23 14:48 ` [patch net-next v2 10/11] mlxsw: spectrum_router: Request a dump of FIB tables during init Jiri Pirko
  2016-11-23 14:48 ` [patch net-next v2 11/11] rocker: Register FIB notifier before creating ports Jiri Pirko
  10 siblings, 1 reply; 24+ messages in thread
From: Jiri Pirko @ 2016-11-23 14:34 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber

From: Ido Schimmel <idosch@mellanox.com>

Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
introduced a new notification chain to notify listeners (e.g., switchdev
drivers) about addition and deletion of routes.

However, upon registration to the chain the FIB tables can already be
populated, which means potential listeners will have an incomplete view
of the tables.

Solve that by adding an API to request a FIB dump. The dump itself is
done using RCU in order not to starve consumers that need RTNL to make
progress.

For each net namespace the integrity of the dump is ensured by reading
the atomic change sequence counter before and after the dump. This
allows us to avoid the problematic situation in which the dumping
process sends an ENTRY_ADD notification following ENTRY_DEL generated by
another process holding RTNL.
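
The before/after comparison of the change sequence counter can be
sketched in userspace as follows. This is an illustration under stated
assumptions rather than the patch itself: struct demo_ns and the demo_*
functions are hypothetical, C11 atomics stand in for atomic_t, and the
table walk is reduced to a callback. What a caller does when the check
fails is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a net namespace; only the counter is kept. */
struct demo_ns {
	atomic_uint seq;	/* plays the role of net->ipv4.fib_seq */
};

/* Writer side: bump the counter just before announcing a change. */
static void demo_notify_change(struct demo_ns *ns)
{
	atomic_fetch_add(&ns->seq, 1);
}

typedef void (*demo_walk_fn)(struct demo_ns *ns);

/* A walk during which nothing changes. */
static void demo_walk_quiet(struct demo_ns *ns)
{
	(void)ns;
}

/* A walk that is raced by a writer; the race is simulated inline. */
static void demo_walk_racing(struct demo_ns *ns)
{
	demo_notify_change(ns);
}

/* Reader side: sample the counter, walk, then sample again. The dump
 * is consistent only if no change was announced in between.
 */
static bool demo_dump_once(struct demo_ns *ns, demo_walk_fn walk)
{
	unsigned int before = atomic_load(&ns->seq);

	walk(ns);
	return atomic_load(&ns->seq) == before;
}
```

With demo_walk_quiet() the check passes. With demo_walk_racing() it
fails, so the listener knows its copy may be stale.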

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/ip_fib.h |   1 +
 net/ipv4/fib_trie.c  | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 118 insertions(+)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 6c67b93..c76303e 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -221,6 +221,7 @@ enum fib_event_type {
 	FIB_EVENT_RULE_DEL,
 };
 
+bool fib_notifier_dump(struct notifier_block *nb);
 int register_fib_notifier(struct notifier_block *nb);
 int unregister_fib_notifier(struct notifier_block *nb);
 int call_fib_notifiers(struct net *net, enum fib_event_type event_type,
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index b1d2d09..9770edfe 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -86,6 +86,67 @@
 
 static ATOMIC_NOTIFIER_HEAD(fib_chain);
 
+static int call_fib_notifier(struct notifier_block *nb, struct net *net,
+			     enum fib_event_type event_type,
+			     struct fib_notifier_info *info)
+{
+	info->net = net;
+	return nb->notifier_call(nb, event_type, info);
+}
+
+static void fib_rules_notify(struct net *net, struct notifier_block *nb,
+			     enum fib_event_type event_type)
+{
+#ifdef CONFIG_IP_MULTIPLE_TABLES
+	struct fib_notifier_info info;
+
+	if (net->ipv4.fib_has_custom_rules)
+		call_fib_notifier(nb, net, event_type, &info);
+#endif
+}
+
+static void fib_notify(struct net *net, struct notifier_block *nb,
+		       enum fib_event_type event_type);
+
+static int call_fib_entry_notifier(struct notifier_block *nb, struct net *net,
+				   enum fib_event_type event_type, u32 dst,
+				   int dst_len, struct fib_info *fi,
+				   u8 tos, u8 type, u32 tb_id, u32 nlflags)
+{
+	struct fib_entry_notifier_info info = {
+		.dst = dst,
+		.dst_len = dst_len,
+		.fi = fi,
+		.tos = tos,
+		.type = type,
+		.tb_id = tb_id,
+		.nlflags = nlflags,
+	};
+	return call_fib_notifier(nb, net, event_type, &info.info);
+}
+
+bool fib_notifier_dump(struct notifier_block *nb)
+{
+	struct net *net;
+	bool ret = true;
+
+	rcu_read_lock();
+	for_each_net_rcu(net) {
+		int fib_seq = atomic_read(&net->ipv4.fib_seq);
+
+		fib_rules_notify(net, nb, FIB_EVENT_RULE_ADD);
+		fib_notify(net, nb, FIB_EVENT_ENTRY_ADD);
+		if (atomic_read(&net->ipv4.fib_seq) != fib_seq) {
+			ret = false;
+			goto out_unlock;
+		}
+	}
+out_unlock:
+	rcu_read_unlock();
+	return ret;
+}
+EXPORT_SYMBOL(fib_notifier_dump);
+
 int register_fib_notifier(struct notifier_block *nb)
 {
 	return atomic_notifier_chain_register(&fib_chain, nb);
@@ -1902,6 +1963,62 @@ int fib_table_flush(struct net *net, struct fib_table *tb)
 	return found;
 }
 
+static void fib_leaf_notify(struct net *net, struct key_vector *l,
+			    struct fib_table *tb, struct notifier_block *nb,
+			    enum fib_event_type event_type)
+{
+	struct fib_alias *fa;
+
+	hlist_for_each_entry_rcu(fa, &l->leaf, fa_list) {
+		struct fib_info *fi = fa->fa_info;
+
+		if (!fi)
+			continue;
+
+		/* local and main table can share the same trie,
+		 * so don't notify twice for the same entry.
+		 */
+		if (tb->tb_id != fa->tb_id)
+			continue;
+
+		call_fib_entry_notifier(nb, net, event_type, l->key,
+					KEYLENGTH - fa->fa_slen, fi, fa->fa_tos,
+					fa->fa_type, fa->tb_id, 0);
+	}
+}
+
+static void fib_table_notify(struct net *net, struct fib_table *tb,
+			     struct notifier_block *nb,
+			     enum fib_event_type event_type)
+{
+	struct trie *t = (struct trie *)tb->tb_data;
+	struct key_vector *l, *tp = t->kv;
+	t_key key = 0;
+
+	while ((l = leaf_walk_rcu(&tp, key)) != NULL) {
+		fib_leaf_notify(net, l, tb, nb, event_type);
+
+		key = l->key + 1;
+		/* stop in case of wrap around */
+		if (key < l->key)
+			break;
+	}
+}
+
+static void fib_notify(struct net *net, struct notifier_block *nb,
+		       enum fib_event_type event_type)
+{
+	unsigned int h;
+
+	for (h = 0; h < FIB_TABLE_HASHSZ; h++) {
+		struct hlist_head *head = &net->ipv4.fib_table_hash[h];
+		struct fib_table *tb;
+
+		hlist_for_each_entry_rcu(tb, head, tb_hlist)
+			fib_table_notify(net, tb, nb, event_type);
+	}
+}
+
 static void __trie_free_rcu(struct rcu_head *head)
 {
 	struct fib_table *tb = container_of(head, struct fib_table, rcu);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 24+ messages in thread
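
The sequence-counter check described in the commit message above can be
modelled outside the kernel. The following is a minimal userspace sketch,
not the patch's code: struct model_fib, model_fib_add(), model_fib_dump()
and the "interfere" test hook are all hypothetical names, and it assumes
that writers bump the counter on every table change.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_ROUTES 8

/* Hypothetical stand-in for one namespace's FIB state. */
struct model_fib {
	atomic_int seq;			/* bumped on every table change */
	int routes[MODEL_MAX_ROUTES];
	size_t nroutes;
};

/* Writer: change the table and bump the change sequence counter. */
static void model_fib_add(struct model_fib *fib, int route)
{
	assert(fib->nroutes < MODEL_MAX_ROUTES);
	fib->routes[fib->nroutes++] = route;
	atomic_fetch_add(&fib->seq, 1);
}

/* Example listener callback: counts the routes replayed to it. */
static void model_count(int route, void *priv)
{
	(void)route;
	(*(int *)priv)++;
}

/* Simulates a writer that slips in while a dump is in progress. */
static void model_racing_add(struct model_fib *fib)
{
	model_fib_add(fib, 99);
}

/*
 * Replay all routes to a single listener. Returns false if the counter
 * moved between the first and the last read, i.e. the listener may have
 * been handed a stale view and should discard it and retry.
 */
static bool model_fib_dump(struct model_fib *fib,
			   void (*cb)(int route, void *priv), void *priv,
			   void (*interfere)(struct model_fib *fib))
{
	int seq = atomic_load(&fib->seq);
	size_t i, n = fib->nroutes;

	for (i = 0; i < n; i++)
		cb(fib->routes[i], priv);
	if (interfere)
		interfere(fib);	/* test hook only: concurrent change */

	return atomic_load(&fib->seq) == seq;
}
```

A caller would pass NULL for the hook in normal use; passing
model_racing_add makes the dump report an inconsistent result, which is
the case the retry loop in the mlxsw patch is meant to handle.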

* [patch net-next v2 10/11] mlxsw: spectrum_router: Request a dump of FIB tables during init
  2016-11-23 14:34 [patch net-next v2 00/11] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
                   ` (8 preceding siblings ...)
  2016-11-23 14:34 ` [patch net-next v2 09/11] ipv4: fib: Add an API to request a FIB dump Jiri Pirko
@ 2016-11-23 14:48 ` Jiri Pirko
  2016-11-23 16:00   ` Hannes Frederic Sowa
  2016-11-23 14:48 ` [patch net-next v2 11/11] rocker: Register FIB notifier before creating ports Jiri Pirko
  10 siblings, 1 reply; 24+ messages in thread
From: Jiri Pirko @ 2016-11-23 14:48 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber

From: Ido Schimmel <idosch@mellanox.com>

Make sure the device has a complete view of the FIB tables by invoking
their dump during module init.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 14bed1d..36a71d2 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -2027,6 +2027,21 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
 	return NOTIFY_DONE;
 }
 
+static void mlxsw_sp_router_fib_dump(struct mlxsw_sp *mlxsw_sp)
+{
+	while (!fib_notifier_dump(&mlxsw_sp->fib_nb)) {
+		/* Flush pending FIB notifications and then flush the
+		 * device's table before requesting another dump. Do
+		 * that with RTNL held, as FIB notification block is
+		 * already registered.
+		 */
+		mlxsw_core_flush_owq();
+		rtnl_lock();
+		mlxsw_sp_router_fib_flush(mlxsw_sp);
+		rtnl_unlock();
+	}
+}
+
 int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
 {
 	int err;
@@ -2048,6 +2063,7 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
 
 	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
 	register_fib_notifier(&mlxsw_sp->fib_nb);
+	mlxsw_sp_router_fib_dump(mlxsw_sp);
 	return 0;
 
 err_neigh_init:
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [patch net-next v2 11/11] rocker: Register FIB notifier before creating ports
  2016-11-23 14:34 [patch net-next v2 00/11] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
                   ` (9 preceding siblings ...)
  2016-11-23 14:48 ` [patch net-next v2 10/11] mlxsw: spectrum_router: Request a dump of FIB tables during init Jiri Pirko
@ 2016-11-23 14:48 ` Jiri Pirko
  10 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-11-23 14:48 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber

From: Ido Schimmel <idosch@mellanox.com>

Unlike mlxsw, rocker only supports the reflection of routes pointing to
its own netdevs. Therefore, instead of requesting a FIB dump during
init, simply register the FIB notifier before creating the ports.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/rocker/rocker_main.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index 914e9e1..8c9c90a 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -2804,6 +2804,9 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto err_alloc_ordered_workqueue;
 	}
 
+	rocker->fib_nb.notifier_call = rocker_router_fib_event;
+	register_fib_notifier(&rocker->fib_nb);
+
 	rocker->hw.id = rocker_read64(rocker, SWITCH_ID);
 
 	err = rocker_probe_ports(rocker);
@@ -2812,15 +2815,13 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto err_probe_ports;
 	}
 
-	rocker->fib_nb.notifier_call = rocker_router_fib_event;
-	register_fib_notifier(&rocker->fib_nb);
-
 	dev_info(&pdev->dev, "Rocker switch with id %*phN\n",
 		 (int)sizeof(rocker->hw.id), &rocker->hw.id);
 
 	return 0;
 
 err_probe_ports:
+	unregister_fib_notifier(&rocker->fib_nb);
 	destroy_workqueue(rocker->rocker_owq);
 err_alloc_ordered_workqueue:
 	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT), rocker);
@@ -2848,9 +2849,9 @@ static void rocker_remove(struct pci_dev *pdev)
 {
 	struct rocker *rocker = pci_get_drvdata(pdev);
 
+	rocker_remove_ports(rocker);
 	unregister_fib_notifier(&rocker->fib_nb);
 	rocker_write32(rocker, CONTROL, ROCKER_CONTROL_RESET);
-	rocker_remove_ports(rocker);
 	destroy_workqueue(rocker->rocker_owq);
 	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT), rocker);
 	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_CMD), rocker);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [patch net-next v2 10/11] mlxsw: spectrum_router: Request a dump of FIB tables during init
  2016-11-23 14:48 ` [patch net-next v2 10/11] mlxsw: spectrum_router: Request a dump of FIB tables during init Jiri Pirko
@ 2016-11-23 16:00   ` Hannes Frederic Sowa
  2016-11-23 16:04     ` Jiri Pirko
  0 siblings, 1 reply; 24+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-23 16:00 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, kaber

On Wed, Nov 23, 2016, at 15:48, Jiri Pirko wrote:
> From: Ido Schimmel <idosch@mellanox.com>
> 
> Make sure the device has a complete view of the FIB tables by invoking
> their dump during module init.
> 
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 16
>  ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> index 14bed1d..36a71d2 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> @@ -2027,6 +2027,21 @@ static int mlxsw_sp_router_fib_event(struct
> notifier_block *nb,
>  	return NOTIFY_DONE;
>  }
>  
> +static void mlxsw_sp_router_fib_dump(struct mlxsw_sp *mlxsw_sp)
> +{
> +       while (!fib_notifier_dump(&mlxsw_sp->fib_nb)) {
> +               /* Flush pending FIB notifications and then flush the
> +                * device's table before requesting another dump. Do
> +                * that with RTNL held, as FIB notification block is
> +                * already registered.
> +                */
> +               mlxsw_core_flush_owq();
> +               rtnl_lock();
> +               mlxsw_sp_router_fib_flush(mlxsw_sp);
> +               rtnl_unlock();
> +       }
> +}

I think it is fine to use this kind of synchronization.

But I think that this part of the logic still belongs in the core
kernel. I still think we could end up looping here indefinitely
because of a lot of routing updates, so the loop would need to be
aborted after a number of tries.

I would like the kernel to have one function that makes this decision
instead of later patching all users of this API. Do you think it is
worth it?

Bye,
Hannes

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next v2 10/11] mlxsw: spectrum_router: Request a dump of FIB tables during init
  2016-11-23 16:00   ` Hannes Frederic Sowa
@ 2016-11-23 16:04     ` Jiri Pirko
  2016-11-23 16:59       ` Hannes Frederic Sowa
  0 siblings, 1 reply; 24+ messages in thread
From: Jiri Pirko @ 2016-11-23 16:04 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: netdev, davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz,
	roopa, dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, kaber

Wed, Nov 23, 2016 at 05:00:00PM CET, hannes@stressinduktion.org wrote:
>On Wed, Nov 23, 2016, at 15:48, Jiri Pirko wrote:
>> From: Ido Schimmel <idosch@mellanox.com>
>> 
>> Make sure the device has a complete view of the FIB tables by invoking
>> their dump during module init.
>> 
>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 16
>>  ++++++++++++++++
>>  1 file changed, 16 insertions(+)
>> 
>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>> b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>> index 14bed1d..36a71d2 100644
>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>> @@ -2027,6 +2027,21 @@ static int mlxsw_sp_router_fib_event(struct
>> notifier_block *nb,
>>  	return NOTIFY_DONE;
>>  }
>>  
>> +static void mlxsw_sp_router_fib_dump(struct mlxsw_sp *mlxsw_sp)
>> +{
>> +       while (!fib_notifier_dump(&mlxsw_sp->fib_nb)) {
>> +               /* Flush pending FIB notifications and then flush the
>> +                * device's table before requesting another dump. Do
>> +                * that with RTNL held, as FIB notification block is
>> +                * already registered.
>> +                */
>> +               mlxsw_core_flush_owq();
>> +               rtnl_lock();
>> +               mlxsw_sp_router_fib_flush(mlxsw_sp);
>> +               rtnl_unlock();
>> +       }
>> +}
>
>I think it is fine to use this kind of synchronization.
>
>But I think that this part of the logic still belongs into the core

Core does not know how the driver handles the offloaded FIBs. So only the
driver knows how/if it needs to flush in case of a retry.


>kernel. I still think it could happen that we will loop here
>indefinitely because of a lot of routing updates and as such would need
>to abort this loop after a number of tries.

In theory, it is possible, however quite unlikely.

>
>I would like that the kernel has one function to do this decision
>instead of later patching all users of this API. Do you think it is
>worth it?

For the reason I stated above, I'm not sure that could be done...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next v2 10/11] mlxsw: spectrum_router: Request a dump of FIB tables during init
  2016-11-23 16:04     ` Jiri Pirko
@ 2016-11-23 16:59       ` Hannes Frederic Sowa
  2016-11-23 17:04         ` Jiri Pirko
  0 siblings, 1 reply; 24+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-23 16:59 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz,
	roopa, dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, kaber

On Wed, Nov 23, 2016, at 17:04, Jiri Pirko wrote:
> Wed, Nov 23, 2016 at 05:00:00PM CET, hannes@stressinduktion.org wrote:
> >On Wed, Nov 23, 2016, at 15:48, Jiri Pirko wrote:
> >> From: Ido Schimmel <idosch@mellanox.com>
> >> 
> >> Make sure the device has a complete view of the FIB tables by invoking
> >> their dump during module init.
> >> 
> >> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> >> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> >> ---
> >>  drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 16
> >>  ++++++++++++++++
> >>  1 file changed, 16 insertions(+)
> >> 
> >> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> >> b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> >> index 14bed1d..36a71d2 100644
> >> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> >> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> >> @@ -2027,6 +2027,21 @@ static int mlxsw_sp_router_fib_event(struct
> >> notifier_block *nb,
> >>  	return NOTIFY_DONE;
> >>  }
> >>  
> >> +static void mlxsw_sp_router_fib_dump(struct mlxsw_sp *mlxsw_sp)
> >> +{
> >> +       while (!fib_notifier_dump(&mlxsw_sp->fib_nb)) {
> >> +               /* Flush pending FIB notifications and then flush the
> >> +                * device's table before requesting another dump. Do
> >> +                * that with RTNL held, as FIB notification block is
> >> +                * already registered.
> >> +                */
> >> +               mlxsw_core_flush_owq();
> >> +               rtnl_lock();
> >> +               mlxsw_sp_router_fib_flush(mlxsw_sp);
> >> +               rtnl_unlock();
> >> +       }
> >> +}
> >
> >I think it is fine to use this kind of synchronization.
> >
> >But I think that this part of the logic still belongs into the core
> 
> Core does not know how driver handles the offloaded fibs. So only driver
> knows how/if he needs to do flush in case of retry.

Sure, but an abort function can be provided to the kernel anyway and the
driver can take care of that.

> >kernel. I still think it could happen that we will loop here
> >indefinitely because of a lot of routing updates and as such would need
> >to abort this loop after a number of tries.
> 
> In theory, it is possible, however quite unlikely.

I think the "quite unlikely" already got us down the path to not using
rtnl_lock in the first place.

As I said, I am not sure about this, as I haven't tried any hardware
offloading before and don't know how long it takes to transfer the
routes to hardware, but having a fail case for that seems like a nice
improvement. At the same time I know of Linux boxes running in internet
exchanges having several peers. The high update rates actually led to
BGP implementations specifying flap damping, which is nowadays actually
considered harmful.

Seriously, while most of the time convergence in routing protocols is
good and most updates only hit the BGP user space table anyway and the
change is suppressed because of recursive routing lookup idempotence, quite
unlikely events happen to the internet now and then:
http://research.dyn.com/2009/02/longer-is-not-better/, which caused *a
lot* of flapping and ongoing events on BGP routers throughout the world.

I agree it is unlikely that you have to refresh your hw dump during this
time, but who knows what customers do and what admins do in case
something like this happens. I just don't favor looping endlessly
trying to sync up and get into a stable state; I would rather tell the
admin to detach the control plane from the forwarding plane and sync up
then.

That said, I think a sysctl for a maximum number of loops, respected by
drivers that need to do so, should be enough for the time being.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next v2 10/11] mlxsw: spectrum_router: Request a dump of FIB tables during init
  2016-11-23 16:59       ` Hannes Frederic Sowa
@ 2016-11-23 17:04         ` Jiri Pirko
  2016-11-23 17:08           ` Hannes Frederic Sowa
  0 siblings, 1 reply; 24+ messages in thread
From: Jiri Pirko @ 2016-11-23 17:04 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: netdev, davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz,
	roopa, dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, kaber

Wed, Nov 23, 2016 at 05:59:05PM CET, hannes@stressinduktion.org wrote:
>On Wed, Nov 23, 2016, at 17:04, Jiri Pirko wrote:
>> Wed, Nov 23, 2016 at 05:00:00PM CET, hannes@stressinduktion.org wrote:
>> >On Wed, Nov 23, 2016, at 15:48, Jiri Pirko wrote:
>> >> From: Ido Schimmel <idosch@mellanox.com>
>> >> 
>> >> Make sure the device has a complete view of the FIB tables by invoking
>> >> their dump during module init.
>> >> 
>> >> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>> >> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>> >> ---
>> >>  drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 16
>> >>  ++++++++++++++++
>> >>  1 file changed, 16 insertions(+)
>> >> 
>> >> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>> >> b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>> >> index 14bed1d..36a71d2 100644
>> >> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>> >> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>> >> @@ -2027,6 +2027,21 @@ static int mlxsw_sp_router_fib_event(struct
>> >> notifier_block *nb,
>> >>  	return NOTIFY_DONE;
>> >>  }
>> >>  
>> >> +static void mlxsw_sp_router_fib_dump(struct mlxsw_sp *mlxsw_sp)
>> >> +{
>> >> +       while (!fib_notifier_dump(&mlxsw_sp->fib_nb)) {
>> >> +               /* Flush pending FIB notifications and then flush the
>> >> +                * device's table before requesting another dump. Do
>> >> +                * that with RTNL held, as FIB notification block is
>> >> +                * already registered.
>> >> +                */
>> >> +               mlxsw_core_flush_owq();
>> >> +               rtnl_lock();
>> >> +               mlxsw_sp_router_fib_flush(mlxsw_sp);
>> >> +               rtnl_unlock();
>> >> +       }
>> >> +}
>> >
>> >I think it is fine to use this kind of synchronization.
>> >
>> >But I think that this part of the logic still belongs into the core
>> 
>> Core does not know how driver handles the offloaded fibs. So only driver
>> knows how/if he needs to do flush in case of retry.
>
>Sure, but an abort function can be provided to the kernel anyway and the
>driver can care about that.

Ok, how?


>
>> >kernel. I still think it could happen that we will loop here
>> >indefinitely because of a lot of routing updates and as such would need
>> >to abort this loop after a number of tries.
>> 
>> In theory, it is possible, however quite unlikely.
>
>I think the "quite unlikely" already got us down the path to not using
>rtnl_lock in the first place.
>
>As I said, I am not sure about this as I didn't try any hardware
>offloading before and delays how long it needs to be transferred to
>hardware, but having a fail case for that seems like a nice improvement.
>At the same time I know of Linux boxes running in internet exchanges
>having several peers. The high update rates actually led to bgp
>implementation specifying flap damping which is actually nowadays
>considered harmful.
>
>Seriously, while most of the time convergence in routing protocols is
>good and most updates only hit the BGP user space table anyway and the
>change is suppressed because recursive routing lookup idempotence, quite
>unlikely events happen to the internet now and then:
>http://research.dyn.com/2009/02/longer-is-not-better/, which caused *a
>lot* of flapping and ongoing events on BGP routers throughout the world.
>
>I agree it is unlikely that you have to refresh your hw dump during this
>time, but who knows what customers do and what admins do in case
>something like this happens. I just don't favor to looping endlessly
>trying to sync up and getting into a stable state but tell the admin to
>detach the control plane from the forwarding plane and sync up then.
>
>That said, I think a sysctl for a maximum number of loops respected by
>drivers that needs to do so, should be enough for the time being.

Okay. Point taken.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next v2 10/11] mlxsw: spectrum_router: Request a dump of FIB tables during init
  2016-11-23 17:04         ` Jiri Pirko
@ 2016-11-23 17:08           ` Hannes Frederic Sowa
  2016-11-23 19:22             ` Ido Schimmel
  0 siblings, 1 reply; 24+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-23 17:08 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz,
	roopa, dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, kaber

On Wed, Nov 23, 2016, at 18:04, Jiri Pirko wrote:
> >Sure, but an abort function can be provided to the kernel anyway and the
> >driver can care about that.
> 
> Ok, how?

I think just a sysctl on top of this series is enough, plus a pr_warn.
Rocker and mlxsw are responsible for looping for a maximum amount of time.

Otherwise, if we want something fancier, can we provide a
fib_inconsistency_notification function pointer in netdev_ops?

Bye and thanks,
Hannes

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next v2 09/11] ipv4: fib: Add an API to request a FIB dump
  2016-11-23 14:34 ` [patch net-next v2 09/11] ipv4: fib: Add an API to request a FIB dump Jiri Pirko
@ 2016-11-23 17:47   ` Hannes Frederic Sowa
  2016-11-23 19:53     ` Ido Schimmel
  0 siblings, 1 reply; 24+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-23 17:47 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, kaber

On 23.11.2016 15:34, Jiri Pirko wrote:
> From: Ido Schimmel <idosch@mellanox.com>
> 
> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
> introduced a new notification chain to notify listeners (f.e., switchdev
> drivers) about addition and deletion of routes.
> 
> However, upon registration to the chain the FIB tables can already be
> populated, which means potential listeners will have an incomplete view
> of the tables.
> 
> Solve that by adding an API to request a FIB dump. The dump itself is
> done using RCU in order not to starve consumers that need RTNL to make
> progress.
> 
> For each net namespace the integrity of the dump is ensured by reading
> the atomic change sequence counter before and after the dump. This
> allows us to avoid the problematic situation in which the dumping
> process sends an ENTRY_ADD notification following ENTRY_DEL generated by
> another process holding RTNL.
> 
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> ---
>  include/net/ip_fib.h |   1 +
>  net/ipv4/fib_trie.c  | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 118 insertions(+)
> 
> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> index 6c67b93..c76303e 100644
> --- a/include/net/ip_fib.h
> +++ b/include/net/ip_fib.h
> @@ -221,6 +221,7 @@ enum fib_event_type {
>  	FIB_EVENT_RULE_DEL,
>  };
>  
> +bool fib_notifier_dump(struct notifier_block *nb);
>  int register_fib_notifier(struct notifier_block *nb);
>  int unregister_fib_notifier(struct notifier_block *nb);
>  int call_fib_notifiers(struct net *net, enum fib_event_type event_type,
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index b1d2d09..9770edfe 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -86,6 +86,67 @@
>  
>  static ATOMIC_NOTIFIER_HEAD(fib_chain);
>  
> +static int call_fib_notifier(struct notifier_block *nb, struct net *net,
> +			     enum fib_event_type event_type,
> +			     struct fib_notifier_info *info)
> +{
> +	info->net = net;
> +	return nb->notifier_call(nb, event_type, info);
> +}
> +
> +static void fib_rules_notify(struct net *net, struct notifier_block *nb,
> +			     enum fib_event_type event_type)
> +{
> +#ifdef CONFIG_IP_MULTIPLE_TABLES
> +	struct fib_notifier_info info;
> +
> +	if (net->ipv4.fib_has_custom_rules)
> +		call_fib_notifier(nb, net, event_type, &info);
> +#endif
> +}
> +
> +static void fib_notify(struct net *net, struct notifier_block *nb,
> +		       enum fib_event_type event_type);
> +
> +static int call_fib_entry_notifier(struct notifier_block *nb, struct net *net,
> +				   enum fib_event_type event_type, u32 dst,
> +				   int dst_len, struct fib_info *fi,
> +				   u8 tos, u8 type, u32 tb_id, u32 nlflags)
> +{
> +	struct fib_entry_notifier_info info = {
> +		.dst = dst,
> +		.dst_len = dst_len,
> +		.fi = fi,
> +		.tos = tos,
> +		.type = type,
> +		.tb_id = tb_id,
> +		.nlflags = nlflags,
> +	};
> +	return call_fib_notifier(nb, net, event_type, &info.info);
> +}
> +
> +bool fib_notifier_dump(struct notifier_block *nb)
> +{
> +	struct net *net;
> +	bool ret = true;



> +	rcu_read_lock();
> +	for_each_net_rcu(net) {
> +		int fib_seq = atomic_read(&net->ipv4.fib_seq);
> +
> +		fib_rules_notify(net, nb, FIB_EVENT_RULE_ADD);
> +		fib_notify(net, nb, FIB_EVENT_ENTRY_ADD);
> +		if (atomic_read(&net->ipv4.fib_seq) != fib_seq) {
> +			ret = false;
> +			goto out_unlock;
> +		}

Hmm, I think you need to read the sequence counter under rtnl_lock to
have an ordering with the rest of the updates to the RCU trie. Otherwise
you don't know if the fib trie has the correct view with regard to the
incoming notifications as a whole. This is also necessary during restarts.

You can also try to register the notifier after the dump and check for
the sequence number after registering the notifier, maybe that is easier
(and restart unregisters and does the same).

Bye,
Hannes

^ permalink raw reply	[flat|nested] 24+ messages in thread
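
The alternative ordering suggested above (dump first, register afterwards,
then re-check the sequence number) could look roughly like the following
single-threaded userspace model. All names here (struct model_core,
model_dump_then_register(), the "interfere" hook) are hypothetical. The
model deliberately sidesteps the question raised in the reply, namely how
the final comparison is ordered against concurrent writers; in the kernel
that would need RTNL or an equivalent.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical core state seen by a single listener. */
struct model_core {
	unsigned int seq;	/* change sequence counter */
	bool registered;	/* is the listener on the live chain? */
	unsigned int dumps;	/* number of dump passes performed */
};

/* Simulated writer changing the table between dump and registration. */
static void model_late_change(struct model_core *core)
{
	core->seq++;
}

/*
 * Dump while unregistered, then register, then compare the counter.
 * On a mismatch the listener is unregistered again so that the caller
 * can discard what it learned and restart from scratch.
 */
static bool model_dump_then_register(struct model_core *core,
				     void (*interfere)(struct model_core *core))
{
	unsigned int seq;

	assert(!core->registered);
	seq = core->seq;
	core->dumps++;			/* stands in for replaying the tables */
	if (interfere)
		interfere(core);	/* test hook only: change before register */
	core->registered = true;	/* live notifications start here */
	if (core->seq == seq)
		return true;

	core->registered = false;	/* missed an update: undo and report */
	return false;
}
```

Because the listener is not on the chain during the dump, it cannot see a
live ENTRY_DEL followed by a replayed ENTRY_ADD for the same route; the
cost is that any change in the window forces a full restart.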

* Re: [patch net-next v2 10/11] mlxsw: spectrum_router: Request a dump of FIB tables during init
  2016-11-23 17:08           ` Hannes Frederic Sowa
@ 2016-11-23 19:22             ` Ido Schimmel
  2016-11-23 19:45               ` Jiri Pirko
  0 siblings, 1 reply; 24+ messages in thread
From: Ido Schimmel @ 2016-11-23 19:22 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf,
	arkadis, ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot,
	andrew, f.fainelli, alexander.h.duyck, kaber

On Wed, Nov 23, 2016 at 06:08:23PM +0100, Hannes Frederic Sowa wrote:
> On Wed, Nov 23, 2016, at 18:04, Jiri Pirko wrote:
> > >Sure, but an abort function can be provided to the kernel anyway and the
> > >driver can care about that.
> > 
> > Ok, how?
> 
> I think just a sysctl on top of this series is enough plus a pr_warn.
> Rocker and mlxsw are responsible to loop for a maximum amount of time.

Maybe, when the module requests a dump it can also provide a callback
that is invoked following each failed dump?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch net-next v2 10/11] mlxsw: spectrum_router: Request a dump of FIB tables during init
  2016-11-23 19:22             ` Ido Schimmel
@ 2016-11-23 19:45               ` Jiri Pirko
  0 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2016-11-23 19:45 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Hannes Frederic Sowa, netdev, davem, idosch, eladr, yotamg,
	nogahf, arkadis, ogerlitz, roopa, dsa, nikolay, andy,
	vivien.didelot, andrew, f.fainelli, alexander.h.duyck, kaber

Wed, Nov 23, 2016 at 08:22:30PM CET, idosch@idosch.org wrote:
>On Wed, Nov 23, 2016 at 06:08:23PM +0100, Hannes Frederic Sowa wrote:
>> On Wed, Nov 23, 2016, at 18:04, Jiri Pirko wrote:
>> > >Sure, but an abort function can be provided to the kernel anyway and the
>> > >driver can care about that.
>> > 
>> > Ok, how?
>> 
>> I think just a sysctl on top of this series is enough plus a pr_warn.
>> Rocker and mlxsw are responsible to loop for a maximum amount of time.
>
>Maybe, when the module requests a dump it can also provide a callback
>that is invoked following each failed dump?

That would make sense. Thanks.

^ permalink raw reply	[flat|nested] 24+ messages in thread
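
To make the two ideas from this sub-thread concrete (a bounded number of
dump attempts, and a module-supplied callback invoked after each failed
dump), here is a small userspace sketch. It is only an illustration under
assumptions: MODEL_DUMP_MAX_RETRIES, struct model_listener,
model_request_dump() and the helpers are hypothetical names, and the limit
of 5 is arbitrary rather than something proposed in the thread.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_DUMP_MAX_RETRIES 5	/* hypothetical, arbitrary limit */

/* Hypothetical listener with counters so the behaviour can be observed. */
struct model_listener {
	unsigned int fail_first;	/* dumps that fail before one succeeds */
	unsigned int dumps;		/* dump attempts made so far */
	unsigned int cleanups;		/* times the failure callback ran */
};

/* Stand-in for the core dump: fails the first fail_first attempts. */
static bool model_dump(struct model_listener *l)
{
	return l->dumps++ >= l->fail_first;
}

/* Driver-side callback, e.g. flush partially programmed device state. */
static void model_cleanup(struct model_listener *l)
{
	l->cleanups++;
}

/*
 * Core-side entry point: owns the retry policy so that drivers do not
 * each open-code their own loop. Returns 0 on a consistent dump and -1
 * once the retry budget is exhausted.
 */
static int model_request_dump(struct model_listener *l,
			      void (*failed_cb)(struct model_listener *l))
{
	unsigned int i;

	assert(l);
	for (i = 0; i < MODEL_DUMP_MAX_RETRIES; i++) {
		if (model_dump(l))
			return 0;
		if (failed_cb)
			failed_cb(l);	/* let the driver undo partial state */
	}
	return -1;	/* caller may warn and fall back */
}
```

With this split the core decides when to give up, while only the driver
knows how to flush what it has already offloaded, which matches the
division of responsibility argued for on both sides of the discussion.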

* Re: [patch net-next v2 09/11] ipv4: fib: Add an API to request a FIB dump
  2016-11-23 17:47   ` Hannes Frederic Sowa
@ 2016-11-23 19:53     ` Ido Schimmel
  2016-11-23 23:04       ` Hannes Frederic Sowa
  0 siblings, 1 reply; 24+ messages in thread
From: Ido Schimmel @ 2016-11-23 19:53 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf,
	arkadis, ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot,
	andrew, f.fainelli, alexander.h.duyck, kaber

On Wed, Nov 23, 2016 at 06:47:03PM +0100, Hannes Frederic Sowa wrote:
> Hmm, I think you need to read the sequence counter under rtnl_lock to
> have an ordering with the rest of the updates to the RCU trie. Otherwise
> you don't know if the fib trie has the correct view regarding to the
> incoming notifications as a whole. This is also necessary during restarts.

I spent quite a lot of time thinking about this specific issue, but I
couldn't convince myself that the read should be done under RTNL and I'm
not sure I understand your reasoning. Can you please elaborate?

If, before each notification sent, we call atomic_inc() and then call
atomic_read() at the end, then how can we be tricked?

Thanks for looking into this!

* Re: [patch net-next v2 09/11] ipv4: fib: Add an API to request a FIB dump
  2016-11-23 19:53     ` Ido Schimmel
@ 2016-11-23 23:04       ` Hannes Frederic Sowa
  2016-11-24  8:47         ` Ido Schimmel
  0 siblings, 1 reply; 24+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-23 23:04 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf,
	arkadis, ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot,
	andrew, f.fainelli, alexander.h.duyck, kaber

On 23.11.2016 20:53, Ido Schimmel wrote:
> On Wed, Nov 23, 2016 at 06:47:03PM +0100, Hannes Frederic Sowa wrote:
>> Hmm, I think you need to read the sequence counter under rtnl_lock to
>> have an ordering with the rest of the updates to the RCU trie. Otherwise
>> you don't know if the fib trie has the correct view with regard to the
>> incoming notifications as a whole. This is also necessary during restarts.
>
> I spent quite a lot of time thinking about this specific issue, but I
> couldn't convince myself that the read should be done under RTNL and I'm
> not sure I understand your reasoning. Can you please elaborate?
>
> If, before each notification sent, we call atomic_inc() and then call
> atomic_read() at the end, then how can we be tricked?

The race I am suspecting to happen is:

<CPU0> fib_register()

<CPU1> delete route by notifier
<CPU1> enqueue delete cmd into ordered queue

<CPU0> starts dump
<CPU0> sees deleted route by CPU1 because route not yet removed from RCU
<CPU0> enqueues route for addition

sometime later in the ordered queue:

delete route -> route not in hw, nop
add route from dump -> route added to hardware

The result should actually have been that route isn't in hw.

Bye,
Hannes

* Re: [patch net-next v2 09/11] ipv4: fib: Add an API to request a FIB dump
  2016-11-23 23:04       ` Hannes Frederic Sowa
@ 2016-11-24  8:47         ` Ido Schimmel
  2016-11-24 12:34           ` Hannes Frederic Sowa
  0 siblings, 1 reply; 24+ messages in thread
From: Ido Schimmel @ 2016-11-24  8:47 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf,
	arkadis, ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot,
	andrew, f.fainelli, alexander.h.duyck, kaber

On Thu, Nov 24, 2016 at 12:04:57AM +0100, Hannes Frederic Sowa wrote:
> On 23.11.2016 20:53, Ido Schimmel wrote:
> > On Wed, Nov 23, 2016 at 06:47:03PM +0100, Hannes Frederic Sowa wrote:
> >> Hmm, I think you need to read the sequence counter under rtnl_lock to
> >> have an ordering with the rest of the updates to the RCU trie. Otherwise
> >> you don't know if the fib trie has the correct view with regard to the
> >> incoming notifications as a whole. This is also necessary during restarts.
> >
> > I spent quite a lot of time thinking about this specific issue, but I
> > couldn't convince myself that the read should be done under RTNL and I'm
> > not sure I understand your reasoning. Can you please elaborate?
> >
> > If, before each notification sent, we call atomic_inc() and then call
> > atomic_read() at the end, then how can we be tricked?
> 
> The race I am suspecting to happen is:
> 
> <CPU0> fib_register()
> 
> <CPU1> delete route by notifier
> <CPU1> enqueue delete cmd into ordered queue
> 
> <CPU0> starts dump
> <CPU0> sees deleted route by CPU1 because route not yet removed from RCU
> <CPU0> enqueues route for addition

Yea, I missed this trivial case... My mind was fixed on problems that
could happen after the dump already started. :(

Regarding your suggestion, I think the API will be more useful if we
don't bundle fib_register() and fib_dump() together. We can do the
following instead:

1) Sum 'fib_seq' (doesn't need to be atomic_t anymore) from all net
namespaces under RTNL
2) Dump FIB tables under RCU
3) Do 1) again
4) Compare results from 1) and 3) and retry (according to sysctl limit)
if results differ. Before each retry the module's callback (if passed)
will be invoked.

Sounds OK?

* Re: [patch net-next v2 09/11] ipv4: fib: Add an API to request a FIB dump
  2016-11-24  8:47         ` Ido Schimmel
@ 2016-11-24 12:34           ` Hannes Frederic Sowa
  0 siblings, 0 replies; 24+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-24 12:34 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf,
	arkadis, ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot,
	andrew, f.fainelli, alexander.h.duyck, kaber

On 24.11.2016 09:47, Ido Schimmel wrote:
> On Thu, Nov 24, 2016 at 12:04:57AM +0100, Hannes Frederic Sowa wrote:
>> On 23.11.2016 20:53, Ido Schimmel wrote:
>>> On Wed, Nov 23, 2016 at 06:47:03PM +0100, Hannes Frederic Sowa wrote:
>>>> Hmm, I think you need to read the sequence counter under rtnl_lock to
>>>> have an ordering with the rest of the updates to the RCU trie. Otherwise
>>>> you don't know if the fib trie has the correct view with regard to the
>>>> incoming notifications as a whole. This is also necessary during restarts.
>>>
>>> I spent quite a lot of time thinking about this specific issue, but I
>>> couldn't convince myself that the read should be done under RTNL and I'm
>>> not sure I understand your reasoning. Can you please elaborate?
>>>
>>> If, before each notification sent, we call atomic_inc() and then call
>>> atomic_read() at the end, then how can we be tricked?
>>
>> The race I am suspecting to happen is:
>>
>> <CPU0> fib_register()
>>
>> <CPU1> delete route by notifier
>> <CPU1> enqueue delete cmd into ordered queue
>>
>> <CPU0> starts dump
>> <CPU0> sees deleted route by CPU1 because route not yet removed from RCU
>> <CPU0> enqueues route for addition
> 
> Yea, I missed this trivial case... My mind was fixed on problems that
> could happen after the dump already started. :(
> 
> Regarding your suggestion, I think the API will be more useful if we
> don't bundle fib_register() and fib_dump() together. We can do the
> following instead:
> 
> 1) Sum 'fib_seq' (doesn't need to be atomic_t anymore) from all net
> namespaces under RTNL

You only support init_net anyway, no?

I didn't fully understand what you mean by sum. Using one counter for the
whole system?

We already have net->ipv4.rt_genid as a per-namespace routing change
counter, have you looked at that?

> 2) Dump FIB tables under RCU
> 3) Do 1) again
> 4) Compare results from 1) and 3) and retry (according to sysctl limit)
> if results differ. Before each retry the module's callback (if passed)
> will be invoked.
> 
> Sounds OK?

Ah, you want to sum up all the fib_seq from all namespaces. Now I got it.

Not sure if that is such a good idea, actually. It might cause problems
later on if offloading one day becomes a per-netns knob for the
respective admins.

But semantically it should work.

If it turns out to be much easier than doing it per-netns, I think this
approach should work.

Bye,
Hannes

Thread overview: 24+ messages
2016-11-23 14:34 [patch net-next v2 00/11] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
2016-11-23 14:34 ` [patch net-next v2 01/11] ipv4: fib: Export free_fib_info() Jiri Pirko
2016-11-23 14:34 ` [patch net-next v2 02/11] ipv4: fib: Add fib_info_hold() helper Jiri Pirko
2016-11-23 14:34 ` [patch net-next v2 03/11] mlxsw: core: Create an ordered workqueue for FIB offload Jiri Pirko
2016-11-23 14:34 ` [patch net-next v2 04/11] mlxsw: spectrum_router: Implement FIB offload in deferred work Jiri Pirko
2016-11-23 14:34 ` [patch net-next v2 05/11] rocker: Create an ordered workqueue for FIB offload Jiri Pirko
2016-11-23 14:34 ` [patch net-next v2 06/11] rocker: Implement FIB offload in deferred work Jiri Pirko
2016-11-23 14:34 ` [patch net-next v2 07/11] ipv4: fib: Convert FIB notification chain to be atomic Jiri Pirko
2016-11-23 14:34 ` [patch net-next v2 08/11] ipv4: fib: Allow for consistent FIB dumping Jiri Pirko
2016-11-23 14:34 ` [patch net-next v2 09/11] ipv4: fib: Add an API to request a FIB dump Jiri Pirko
2016-11-23 17:47   ` Hannes Frederic Sowa
2016-11-23 19:53     ` Ido Schimmel
2016-11-23 23:04       ` Hannes Frederic Sowa
2016-11-24  8:47         ` Ido Schimmel
2016-11-24 12:34           ` Hannes Frederic Sowa
2016-11-23 14:48 ` [patch net-next v2 10/11] mlxsw: spectrum_router: Request a dump of FIB tables during init Jiri Pirko
2016-11-23 16:00   ` Hannes Frederic Sowa
2016-11-23 16:04     ` Jiri Pirko
2016-11-23 16:59       ` Hannes Frederic Sowa
2016-11-23 17:04         ` Jiri Pirko
2016-11-23 17:08           ` Hannes Frederic Sowa
2016-11-23 19:22             ` Ido Schimmel
2016-11-23 19:45               ` Jiri Pirko
2016-11-23 14:48 ` [patch net-next v2 11/11] rocker: Register FIB notifier before creating ports Jiri Pirko