* [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
@ 2015-06-08 14:12 Matan Barak
       [not found] ` <1433772735-22416-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Matan Barak @ 2015-06-08 14:12 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Moni Shoua, Jason Gunthorpe, Sean Hefty,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Previously, every vendor implemented its net device notifiers in its own
driver. This introduced a lot of code duplication, since figuring out
whether an event is related to the vendor's net device in the
various cases (bonding, vlan or any other upper device) is
similar for all vendors. In the future, when multiple GID types are
supported, this code duplication would get even worse.

Therefore, we decided to move this into common core code.
roce_gid_table and roce_gid_mgmt were created in order to store and
manage the new GID table, by filling it when getting the related events.
Vendors now only have to implement modify_gid and get_netdev IB
device calls, which are truly unique for each vendor.
roce_gid_table is implemented as an IB client that manages the GID
table of the IB device. Each GID is associated with a GID type and a
network device (which is mandatory for management of the GID table).
The GID table is populated by using roce_gid_mgmt. roce_gid_mgmt
registers to net device/inet/inet6 events and calls roce_gid_table
in order to populate the GID table accordingly.

Patch 0001 creates a new infrastructure for storing GIDs and their attributes
in IB/core. This infrastructure supports lock-less reads of GIDs using a
seqcounter. The data structure is initialized only for RoCE ports.
Every GID has meta information that describes its related net device.
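
The lock-less read described above can be illustrated outside the kernel.
What follows is only a user-space sketch built on C11 atomics, not the code
from patch 0001: struct gid_entry, entry_write() and entry_read() are
hypothetical names, an interface index stands in for the net device pointer,
and writers are assumed to be serialized by an outer lock. The kernel's
read_seqcount_begin() waits for an in-progress write to finish, while this
sketch simply reports failure and leaves the retry to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical user-space stand-in for one GID table entry. */
struct gid_entry {
	atomic_uint seq;	/* even: stable, odd: write in progress */
	unsigned char gid[16];
	int ifindex;		/* stand-in for the net device pointer */
};

/* Writer side. Callers are assumed to hold an outer lock. */
static void entry_write(struct gid_entry *e, const unsigned char gid[16],
			int ifindex)
{
	assert(e != NULL);
	atomic_fetch_add(&e->seq, 1);	/* sequence becomes odd */
	memcpy(e->gid, gid, 16);
	e->ifindex = ifindex;
	atomic_fetch_add(&e->seq, 1);	/* sequence becomes even again */
}

/*
 * Reader side, takes no lock. Returns false when a write was in
 * progress or completed while the entry was being copied; the
 * caller may then retry or skip the entry.
 */
static bool entry_read(struct gid_entry *e, unsigned char gid[16],
		       int *ifindex)
{
	unsigned int start = atomic_load(&e->seq);

	if (start & 1)
		return false;
	memcpy(gid, e->gid, 16);
	*ifindex = e->ifindex;
	return atomic_load(&e->seq) == start;
}
```

A reader that gets false back has not seen a consistent snapshot and must
not use the copied data. The plain memcpy on data that a writer may touch
concurrently is a simplification that the kernel primitives handle with
proper barriers.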

Patch 0002 replaces the locking scheme for IB devices. Previously, device_mutex
was used in order to lock the devices/clients list against every modification.
However, downstream patches add new functions which iterate over the device
list. Those functions could be executed from a workqueue context on behalf
of IB clients. Thus, when a client is removed, we need to wait for all works
to finish. Since client removal was done under the device_mutex lock, we would
in fact be waiting for a work which itself needs to lock the device_mutex
(=DEADLOCK). In order to mitigate this problem, we use an rw semaphore to allow
multiple readers. We use a mutex in order to solve races between adding
(or removing) a client and a device simultaneously, which could have resulted
in calling client->add (or client->remove) twice for the same device and client.
This patch was sent as part of the "Add network namespace support in the RDMA-CM"
series.

Patches 0003, 0005 and 0007 add population of this table for various cases
based on net device events. We always enable default gids for an active
device (an active device is defined here as a device that doesn't have
a bonding master or is the current active slave). This is done in order
to allow loopback traffic. Patch 0007 adds proper bonding support -
only the active slaves retain their master's IP based gids and default gids.
Patch 0006 adds the required information for the bonding case, while patch
0004 adds the required address for default GID.
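
The definition of an active device given above can be written as a small
predicate. This is only a sketch on a simplified model: struct net_dev and
its two fields are hypothetical and do not match the kernel's
struct net_device or the bonding driver's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a net device. */
struct net_dev {
	struct net_dev *bond_master;	/* NULL when not enslaved */
	struct net_dev *active_slave;	/* meaningful for a bond master only */
};

/*
 * A device is active when it has no bonding master, or when it is
 * the current active slave of its master.
 */
static bool is_active_device(const struct net_dev *dev)
{
	assert(dev != NULL);
	if (!dev->bond_master)
		return true;
	return dev->bond_master->active_slave == dev;
}
```

Under this model an inactive slave fails the test, which matches the rule
that only active slaves keep the default GIDs.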

The rest of the patches add support for ocrdma and mlx4 devices.

This series is rebased over Doug's k.o/for-4.2 branch.

Thanks,
Devesh, Somnath, Moni and Matan

Changes from V4:
(1) Remove any API changes.
(2) Fixed a bug regarding bonding upper devices.
(3) Rebased on top of Doug's k.o/for-4.2.

Changes from V3:
(1) Remove RoCE V2 functionality (it will be sent in a later patchset).
(2) Instead of removing qp_attr_mask flags, reserve them.
(3) Remove the kref from IB devices in favor of rwsem.
(4) Change the name of roce_gid_cache to roce_gid_table.
(5) Fix a race when roce_gid_table is free'd while getting events.
(6) Remove the roce_gid_cache active/inactive flag/API.

Changes from V2:
(1) When creating multiple vlans over an interface,
    only the last created vlan's GID was populated in the table
    (regression from V2).
(2) Inactive slave of bonding sometimes lost GIDs related to IPs
    that were directly applied to it.
(3) Memory leak in mlx4
(4) roce_gid_cache now calls modify_gid with zgid in order to cause
    the provider to delete all the information it allocated for those
    GIDs.
(5) A mlx4 patch didn't compile and a downstream patch fixed it.
(6) cma_configfs should depend on both address translation and configfs.
(7) ocrdma driver redefined zgid.
(8) Added event information for NETDEV_CHANGEUPPER event.

Changes from V1:
(1) Addressed Shachar and Haggai's comments
(2) Fixed multicast support
(3) Generalized bonding support
(4) Added default GID after the IB device's net device was removed from bonding
(5) Fixed bugs in mlx4 implementation regarding multicast
(6) Fixed bugs in mlx4 when using XRC QPs after this patchset was applied
(7) Fixed bug when the RoCE gid cache didn't exist
(8) Moved the bonding's DRV macros to a private header
(9) Support non-configfs configurations

Haggai Eran (1):
  IB/core: Add rwsem to allow reading device list or client list

Matan Barak (7):
  IB/core: Add RoCE GID table
  IB/core: Add RoCE GID population
  net/ipv6: Export addrconf_ifid_eui48
  IB/core: Add default GID for RoCE GID table
  net: Add info for NETDEV_CHANGEUPPER event
  IB/core: Add RoCE table bonding support
  IB/core: ib_cache routines should use roce_gid_table when needed

Moni Shoua (3):
  net/mlx4: Postpone the registration of net_device
  IB/mlx4: Implement ib_device callbacks
  IB/mlx4: Replace mechanism for RoCE GID management

Somnath Kotur (1):
  RDMA/ocrdma: Changes in driver to incorporate the moving of GID Table
    mgmt to IB/Core.

 drivers/infiniband/core/Makefile             |   3 +-
 drivers/infiniband/core/cache.c              | 210 +++++--
 drivers/infiniband/core/core_priv.h          |  61 +++
 drivers/infiniband/core/device.c             | 133 ++++-
 drivers/infiniband/core/roce_gid_mgmt.c      | 781 +++++++++++++++++++++++++++
 drivers/infiniband/core/roce_gid_table.c     | 656 ++++++++++++++++++++++
 drivers/infiniband/hw/mlx4/ah.c              |   2 +-
 drivers/infiniband/hw/mlx4/main.c            | 713 ++++++++----------------
 drivers/infiniband/hw/mlx4/mlx4_ib.h         |  21 +-
 drivers/infiniband/hw/mlx4/qp.c              |  10 +-
 drivers/infiniband/hw/ocrdma/ocrdma.h        |  10 +
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c     |   3 +
 drivers/infiniband/hw/ocrdma/ocrdma_main.c   | 233 +-------
 drivers/infiniband/hw/ocrdma/ocrdma_sli.h    |  13 +
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |  31 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h  |   4 +
 drivers/net/bonding/bond_options.c           |  13 -
 drivers/net/ethernet/mellanox/mlx4/en_main.c |  36 +-
 drivers/net/ethernet/mellanox/mlx4/intf.c    |   3 +
 include/linux/mlx4/device.h                  |   3 +-
 include/linux/mlx4/driver.h                  |   1 +
 include/linux/netdevice.h                    |  14 +
 include/net/addrconf.h                       |  31 ++
 include/net/bonding.h                        |   7 +
 include/rdma/ib_addr.h                       |   2 +-
 include/rdma/ib_verbs.h                      |  76 ++-
 net/core/dev.c                               |  12 +-
 net/ipv6/addrconf.c                          |  31 --
 28 files changed, 2253 insertions(+), 860 deletions(-)
 create mode 100644 drivers/infiniband/core/roce_gid_mgmt.c
 create mode 100644 drivers/infiniband/core/roce_gid_table.c

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH for-next V5 01/12] IB/core: Add RoCE GID table
       [not found] ` <1433772735-22416-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-08 14:12   ` Matan Barak
  2015-06-08 14:12   ` [PATCH for-next V5 02/12] IB/core: Add rwsem to allow reading device list or client list Matan Barak
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Matan Barak @ 2015-06-08 14:12 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Moni Shoua, Jason Gunthorpe, Sean Hefty,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Refactoring the GID management code requires us to keep GIDs
alongside their meta information (the associated net_device).
This information is necessary in order to manage the GID
table successfully. For example, when a net_device is removed,
its associated GIDs need to be removed as well.

This patch adds a GID table that supports lockless find, add and
delete of GIDs. The lockless nature comes from using a
sequence counter per table entry and detecting whether this
sequence changed while the entry was being read or written.

When using this RoCE GID table, providers must implement a
modify_gid callback. The table is managed exclusively by
roce_gid_table and the provider just needs to write
the data to the hardware.
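
For illustration only, a provider-side modify_gid could follow the contract
documented in ib_verbs.h roughly as below. This is a user-space sketch, not
code from any driver: union gid, struct gid_attr, struct hw_gid_ctx and
demo_modify_gid() are hypothetical stand-ins, the device argument is left
out, and the actual hardware programming is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space stand-ins for the kernel types. */
union gid { unsigned char raw[16]; };
struct gid_attr { int ifindex; };
struct hw_gid_ctx { unsigned int hw_slot; };	/* per-entry provider state */

static const union gid zero_gid;

/*
 * Writes @gid at @index, or deletes the entry when @gid is the zero
 * GID. Provider state is allocated on first write and released on
 * deletion; the caller is expected to clear *context afterwards.
 */
static int demo_modify_gid(unsigned char port, unsigned int index,
			   const union gid *gid, const struct gid_attr *attr,
			   void **context)
{
	(void)port;
	(void)attr;
	assert(gid != NULL && context != NULL);

	if (!memcmp(gid, &zero_gid, sizeof(*gid))) {
		free(*context);		/* deletion: release earlier allocation */
		return 0;
	}

	if (!*context) {
		struct hw_gid_ctx *ctx = malloc(sizeof(*ctx));

		if (!ctx)
			return -ENOMEM;
		ctx->hw_slot = index;
		*context = ctx;
	}
	/* a real provider would program the hardware GID entry here */
	return 0;
}
```

Deletion is signalled by the zero GID, so the provider needs no separate
delete callback.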

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/Makefile         |   3 +-
 drivers/infiniband/core/core_priv.h      |  23 ++
 drivers/infiniband/core/roce_gid_table.c | 470 +++++++++++++++++++++++++++++++
 drivers/infiniband/hw/mlx4/main.c        |   2 -
 include/rdma/ib_verbs.h                  |  46 ++-
 5 files changed, 540 insertions(+), 4 deletions(-)
 create mode 100644 drivers/infiniband/core/roce_gid_table.c

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index acf7367..fbeb72a 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -9,7 +9,8 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=	ib_uverbs.o ib_ucm.o \
 					$(user_access-y)
 
 ib_core-y :=			packer.o ud_header.o verbs.o sysfs.o \
-				device.o fmr_pool.o cache.o netlink.o
+				device.o fmr_pool.o cache.o netlink.o \
+				roce_gid_table.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
 
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index 87d1936..a9e58418 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -35,6 +35,7 @@
 
 #include <linux/list.h>
 #include <linux/spinlock.h>
+#include <net/net_namespace.h>
 
 #include <rdma/ib_verbs.h>
 
@@ -51,4 +52,26 @@ void ib_cache_cleanup(void);
 
 int ib_resolve_eth_l2_attrs(struct ib_qp *qp,
 			    struct ib_qp_attr *qp_attr, int *qp_attr_mask);
+
+int roce_gid_table_get_gid(struct ib_device *ib_dev, u8 port, int index,
+			   union ib_gid *gid, struct ib_gid_attr *attr);
+
+int roce_gid_table_find_gid(struct ib_device *ib_dev, const union ib_gid *gid,
+			    struct net_device *ndev, u8 *port,
+			    u16 *index);
+
+int roce_gid_table_find_gid_by_port(struct ib_device *ib_dev,
+				    const union ib_gid *gid,
+				    u8 port, struct net_device *ndev,
+				    u16 *index);
+
+int roce_add_gid(struct ib_device *ib_dev, u8 port,
+		 union ib_gid *gid, struct ib_gid_attr *attr);
+
+int roce_del_gid(struct ib_device *ib_dev, u8 port,
+		 union ib_gid *gid, struct ib_gid_attr *attr);
+
+int roce_del_all_netdev_gids(struct ib_device *ib_dev, u8 port,
+			     struct net_device *ndev);
+
 #endif /* _CORE_PRIV_H */
diff --git a/drivers/infiniband/core/roce_gid_table.c b/drivers/infiniband/core/roce_gid_table.c
new file mode 100644
index 0000000..f492cf1
--- /dev/null
+++ b/drivers/infiniband/core/roce_gid_table.c
@@ -0,0 +1,470 @@
+/*
+ * Copyright (c) 2015, Mellanox Technologies inc.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/slab.h>
+#include <linux/netdevice.h>
+#include <linux/rtnetlink.h>
+#include <rdma/ib_cache.h>
+
+#include "core_priv.h"
+
+union ib_gid zgid;
+EXPORT_SYMBOL_GPL(zgid);
+
+static const struct ib_gid_attr zattr;
+
+enum gid_attr_find_mask {
+	GID_ATTR_FIND_MASK_GID          = 1UL << 0,
+	GID_ATTR_FIND_MASK_NETDEV	= 1UL << 1,
+};
+
+struct dev_put_rcu {
+	struct rcu_head		rcu;
+	struct net_device	*ndev;
+};
+
+static void put_ndev(struct rcu_head *rcu)
+{
+	struct dev_put_rcu *put_rcu =
+		container_of(rcu, struct dev_put_rcu, rcu);
+
+	dev_put(put_rcu->ndev);
+	kfree(put_rcu);
+}
+
+static int write_gid(struct ib_device *ib_dev, u8 port,
+		     struct ib_roce_gid_table *table, int ix,
+		     const union ib_gid *gid,
+		     const struct ib_gid_attr *attr)
+{
+	int ret;
+	struct dev_put_rcu	*put_rcu;
+	struct net_device *old_net_dev;
+
+	write_seqcount_begin(&table->data_vec[ix].seq);
+
+	ret = ib_dev->modify_gid(ib_dev, port, ix, gid, attr,
+				 &table->data_vec[ix].context);
+
+	old_net_dev = table->data_vec[ix].attr.ndev;
+	if (old_net_dev && old_net_dev != attr->ndev) {
+		put_rcu = kmalloc(sizeof(*put_rcu), GFP_KERNEL);
+		if (put_rcu) {
+			put_rcu->ndev = old_net_dev;
+			call_rcu(&put_rcu->rcu, put_ndev);
+		} else {
+			pr_warn("roce_gid_table: can't allocate rcu context, using synchronize\n");
+			synchronize_rcu();
+			dev_put(old_net_dev);
+		}
+	}
+	/* if modify_gid failed, just delete the old gid */
+	if (ret || !memcmp(gid, &zgid, sizeof(*gid))) {
+		gid = &zgid;
+		attr = &zattr;
+		table->data_vec[ix].context = NULL;
+	}
+	memcpy(&table->data_vec[ix].gid, gid, sizeof(*gid));
+	memcpy(&table->data_vec[ix].attr, attr, sizeof(*attr));
+	if (table->data_vec[ix].attr.ndev &&
+	    table->data_vec[ix].attr.ndev != old_net_dev)
+		dev_hold(table->data_vec[ix].attr.ndev);
+
+	write_seqcount_end(&table->data_vec[ix].seq);
+
+	if (!ret) {
+		struct ib_event event;
+
+		event.device		= ib_dev;
+		event.element.port_num	= port;
+		event.event		= IB_EVENT_GID_CHANGE;
+
+		ib_dispatch_event(&event);
+	}
+	return ret;
+}
+
+static int find_gid(struct ib_roce_gid_table *table, const union ib_gid *gid,
+		    const struct ib_gid_attr *val, unsigned long mask)
+{
+	int i;
+
+	for (i = 0; i < table->sz; i++) {
+		struct ib_gid_attr *attr = &table->data_vec[i].attr;
+		unsigned int orig_seq = read_seqcount_begin(&table->data_vec[i].seq);
+
+		if (memcmp(gid, &table->data_vec[i].gid, sizeof(*gid)))
+			continue;
+
+		if (mask & GID_ATTR_FIND_MASK_NETDEV &&
+		    attr->ndev != val->ndev)
+			continue;
+
+		if (!read_seqcount_retry(&table->data_vec[i].seq, orig_seq))
+			return i;
+		/* The sequence number changed under our feet,
+		 * the GID entry is invalid. Continue to the
+		 * next entry.
+		 */
+	}
+
+	return -1;
+}
+
+int roce_add_gid(struct ib_device *ib_dev, u8 port,
+		 union ib_gid *gid, struct ib_gid_attr *attr)
+{
+	struct ib_roce_gid_table **ports_table =
+		READ_ONCE(ib_dev->cache.roce_gid_table);
+	struct ib_roce_gid_table *table;
+	int ix;
+	int ret = 0;
+
+	/* make sure we read the ports_table */
+	smp_rmb();
+
+	if (!ports_table)
+		return -EOPNOTSUPP;
+
+	table = ports_table[port - rdma_start_port(ib_dev)];
+
+	if (!table)
+		return -EPROTONOSUPPORT;
+
+	if (!memcmp(gid, &zgid, sizeof(*gid)))
+		return -EINVAL;
+
+	mutex_lock(&table->lock);
+
+	ix = find_gid(table, gid, attr, GID_ATTR_FIND_MASK_NETDEV);
+	if (ix >= 0)
+		goto out_unlock;
+
+	ix = find_gid(table, &zgid, NULL, 0);
+	if (ix < 0) {
+		ret = -ENOSPC;
+		goto out_unlock;
+	}
+
+	write_gid(ib_dev, port, table, ix, gid, attr);
+
+out_unlock:
+	mutex_unlock(&table->lock);
+	return ret;
+}
+
+int roce_del_gid(struct ib_device *ib_dev, u8 port,
+		 union ib_gid *gid, struct ib_gid_attr *attr)
+{
+	struct ib_roce_gid_table **ports_table =
+		READ_ONCE(ib_dev->cache.roce_gid_table);
+	struct ib_roce_gid_table *table;
+	int ix;
+
+	/* make sure we read the ports_table */
+	smp_rmb();
+
+	if (!ports_table)
+		return 0;
+
+	table  = ports_table[port - rdma_start_port(ib_dev)];
+
+	if (!table)
+		return -EPROTONOSUPPORT;
+
+	mutex_lock(&table->lock);
+
+	ix = find_gid(table, gid, attr,
+		      GID_ATTR_FIND_MASK_NETDEV);
+	if (ix < 0)
+		goto out_unlock;
+
+	write_gid(ib_dev, port, table, ix, &zgid, &zattr);
+
+out_unlock:
+	mutex_unlock(&table->lock);
+	return 0;
+}
+
+int roce_del_all_netdev_gids(struct ib_device *ib_dev, u8 port,
+			     struct net_device *ndev)
+{
+	struct ib_roce_gid_table **ports_table =
+		READ_ONCE(ib_dev->cache.roce_gid_table);
+	struct ib_roce_gid_table *table;
+	int ix;
+
+	/* make sure we read the ports_table */
+	smp_rmb();
+
+	if (!ports_table)
+		return 0;
+
+	table  = ports_table[port - rdma_start_port(ib_dev)];
+
+	if (!table)
+		return -EPROTONOSUPPORT;
+
+	mutex_lock(&table->lock);
+
+	for (ix = 0; ix < table->sz; ix++)
+		if (table->data_vec[ix].attr.ndev == ndev)
+			write_gid(ib_dev, port, table, ix, &zgid, &zattr);
+
+	mutex_unlock(&table->lock);
+	return 0;
+}
+
+int roce_gid_table_get_gid(struct ib_device *ib_dev, u8 port, int index,
+			   union ib_gid *gid, struct ib_gid_attr *attr)
+{
+	struct ib_roce_gid_table **ports_table =
+		READ_ONCE(ib_dev->cache.roce_gid_table);
+	struct ib_roce_gid_table *table;
+	union ib_gid local_gid;
+	struct ib_gid_attr local_attr;
+	unsigned int orig_seq;
+
+	/* make sure we read the ports_table */
+	smp_rmb();
+
+	if (!ports_table)
+		return -EOPNOTSUPP;
+
+	table = ports_table[port - rdma_start_port(ib_dev)];
+
+	if (!table)
+		return -EPROTONOSUPPORT;
+
+	if (index < 0 || index >= table->sz)
+		return -EINVAL;
+
+	orig_seq = read_seqcount_begin(&table->data_vec[index].seq);
+
+	memcpy(&local_gid, &table->data_vec[index].gid, sizeof(local_gid));
+	memcpy(&local_attr, &table->data_vec[index].attr, sizeof(local_attr));
+
+	if (read_seqcount_retry(&table->data_vec[index].seq, orig_seq))
+		return -EAGAIN;
+
+	memcpy(gid, &local_gid, sizeof(*gid));
+	if (attr)
+		memcpy(attr, &local_attr, sizeof(*attr));
+	return 0;
+}
+
+static int _roce_gid_table_find_gid(struct ib_device *ib_dev,
+				    const union ib_gid *gid,
+				    const struct ib_gid_attr *val,
+				    unsigned long mask,
+				    u8 *port, u16 *index)
+{
+	struct ib_roce_gid_table **ports_table =
+		READ_ONCE(ib_dev->cache.roce_gid_table);
+	struct ib_roce_gid_table *table;
+	u8 p;
+	int local_index;
+
+	/* make sure we read the ports_table */
+	smp_rmb();
+
+	if (!ports_table)
+		return -ENOENT;
+
+	for (p = 0; p < ib_dev->phys_port_cnt; p++) {
+		if (!rdma_protocol_roce(ib_dev, p + rdma_start_port(ib_dev)))
+			continue;
+		table = ports_table[p];
+		if (!table)
+			continue;
+		local_index = find_gid(table, gid, val, mask);
+		if (local_index >= 0) {
+			if (index)
+				*index = local_index;
+			if (port)
+				*port = p + rdma_start_port(ib_dev);
+			return 0;
+		}
+	}
+
+	return -ENOENT;
+}
+
+int roce_gid_table_find_gid(struct ib_device *ib_dev, const union ib_gid *gid,
+			    struct net_device *ndev, u8 *port, u16 *index)
+{
+	unsigned long mask = GID_ATTR_FIND_MASK_GID;
+	struct ib_gid_attr gid_attr_val = {.ndev = ndev};
+
+	if (ndev)
+		mask |= GID_ATTR_FIND_MASK_NETDEV;
+
+	return _roce_gid_table_find_gid(ib_dev, gid, &gid_attr_val,
+					mask, port, index);
+}
+
+int roce_gid_table_find_gid_by_port(struct ib_device *ib_dev,
+				    const union ib_gid *gid,
+				    u8 port, struct net_device *ndev,
+				    u16 *index)
+{
+	int local_index;
+	struct ib_roce_gid_table **ports_table =
+		READ_ONCE(ib_dev->cache.roce_gid_table);
+	struct ib_roce_gid_table *table;
+	unsigned long mask = 0;
+	struct ib_gid_attr val = {.ndev = ndev};
+
+	/* make sure we read the ports_table */
+	smp_rmb();
+
+	if (!ports_table || port < rdma_start_port(ib_dev) ||
+	    port > rdma_end_port(ib_dev))
+		return -ENOENT;
+
+	table = ports_table[port - rdma_start_port(ib_dev)];
+	if (!table)
+		return -ENOENT;
+
+	if (ndev)
+		mask |= GID_ATTR_FIND_MASK_NETDEV;
+
+	local_index = find_gid(table, gid, &val, mask);
+	if (local_index >= 0) {
+		if (index)
+			*index = local_index;
+		return 0;
+	}
+
+	return -ENOENT;
+}
+
+static struct ib_roce_gid_table *alloc_roce_gid_table(int sz)
+{
+	unsigned int i;
+	struct ib_roce_gid_table *table =
+		kzalloc(sizeof(struct ib_roce_gid_table), GFP_KERNEL);
+	if (!table)
+		return NULL;
+
+	table->data_vec = kcalloc(sz, sizeof(*table->data_vec), GFP_KERNEL);
+	if (!table->data_vec)
+		goto err_free_table;
+
+	mutex_init(&table->lock);
+
+	table->sz = sz;
+
+	for (i = 0; i < sz; i++)
+		seqcount_init(&table->data_vec[i].seq);
+
+	return table;
+
+err_free_table:
+	kfree(table);
+	return NULL;
+}
+
+static void free_roce_gid_table(struct ib_device *ib_dev, u8 port,
+				struct ib_roce_gid_table *table)
+{
+	int i;
+
+	if (!table)
+		return;
+
+	for (i = 0; i < table->sz; ++i) {
+		if (memcmp(&table->data_vec[i].gid, &zgid,
+			   sizeof(table->data_vec[i].gid)))
+			write_gid(ib_dev, port, table, i, &zgid, &zattr);
+	}
+	kfree(table->data_vec);
+	kfree(table);
+}
+
+static int roce_gid_table_setup_one(struct ib_device *ib_dev)
+{
+	u8 port;
+	struct ib_roce_gid_table **table;
+	int err = 0;
+
+	if (!ib_dev->modify_gid)
+		return -EOPNOTSUPP;
+
+	table = kcalloc(ib_dev->phys_port_cnt, sizeof(*table), GFP_KERNEL);
+
+	if (!table) {
+		pr_warn("failed to allocate roce addr table for %s\n",
+			ib_dev->name);
+		return -ENOMEM;
+	}
+
+	for (port = 0; port < ib_dev->phys_port_cnt; port++) {
+		uint8_t rdma_port = port + rdma_start_port(ib_dev);
+
+		if (!rdma_protocol_roce(ib_dev, rdma_port))
+			continue;
+		table[port] =
+			alloc_roce_gid_table(
+				ib_dev->port_immutable[rdma_port].gid_tbl_len);
+		if (!table[port]) {
+			err = -ENOMEM;
+			goto rollback_table_setup;
+		}
+	}
+
+	ib_dev->cache.roce_gid_table = table;
+	return 0;
+
+rollback_table_setup:
+	for (port = 0; port < ib_dev->phys_port_cnt; port++)
+		free_roce_gid_table(ib_dev, port + rdma_start_port(ib_dev), table[port]);
+
+	kfree(table);
+	return err;
+}
+
+static void roce_gid_table_cleanup_one(struct ib_device *ib_dev,
+				       struct ib_roce_gid_table **table)
+{
+	u8 port;
+
+	if (!table)
+		return;
+
+	for (port = 0; port < ib_dev->phys_port_cnt; port++)
+		free_roce_gid_table(ib_dev, port + rdma_start_port(ib_dev),
+				    table[port]);
+
+	kfree(table);
+}
+
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 86c0c27..69ae464 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -93,8 +93,6 @@ static void init_query_mad(struct ib_smp *mad)
 	mad->method	   = IB_MGMT_METHOD_GET;
 }
 
-static union ib_gid zgid;
-
 static int check_flow_steering_support(struct mlx4_dev *dev)
 {
 	int eth_num_ports = 0;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 7d78794..72b62cd 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -64,6 +64,27 @@ union ib_gid {
 	} global;
 };
 
+extern union ib_gid zgid;
+
+struct ib_gid_attr {
+	struct net_device	*ndev;
+};
+
+struct ib_roce_gid_table_entry {
+	seqcount_t	    seq;
+	union ib_gid        gid;
+	struct ib_gid_attr  attr;
+	void		   *context;
+};
+
+struct ib_roce_gid_table {
+	int		     active;
+	int                  sz;
+	/* locking against multiple writes in data_vec */
+	struct mutex         lock;
+	struct ib_roce_gid_table_entry *data_vec;
+};
+
 enum rdma_node_type {
 	/* IB values map to NodeInfo:NodeType. */
 	RDMA_NODE_IB_CA 	= 1,
@@ -272,7 +293,8 @@ enum ib_port_cap_flags {
 	IB_PORT_BOOT_MGMT_SUP			= 1 << 23,
 	IB_PORT_LINK_LATENCY_SUP		= 1 << 24,
 	IB_PORT_CLIENT_REG_SUP			= 1 << 25,
-	IB_PORT_IP_BASED_GIDS			= 1 << 26
+	IB_PORT_IP_BASED_GIDS			= 1 << 26,
+	IB_PORT_ROCE				= 1 << 27,
 };
 
 enum ib_port_width {
@@ -1476,6 +1498,7 @@ struct ib_cache {
 	struct ib_pkey_cache  **pkey_cache;
 	struct ib_gid_cache   **gid_cache;
 	u8                     *lmc_cache;
+	struct ib_roce_gid_table **roce_gid_table;
 };
 
 struct ib_dma_mapping_ops {
@@ -1559,6 +1582,27 @@ struct ib_device {
 	int		           (*query_gid)(struct ib_device *device,
 						u8 port_num, int index,
 						union ib_gid *gid);
+	/* When calling modify_gid, the HW vendor's driver should
+	 * modify the gid of device @device at gid index @index of
+	 * port @port to be @gid. Meta-info of that gid (for example,
+	 * the network device related to this gid) is available
+	 * at @attr. @context allows the HW vendor driver to store extra
+	 * information together with a GID entry. The HW vendor may allocate
+	 * memory to contain this information and store it in @context when a
+	 * new GID entry is written to. Upon the deletion of a GID entry,
+	 * the HW vendor must free any allocated memory. The caller will clear
+	 * @context afterwards. GID deletion is done by passing the zero gid.
+	 * Params are consistent until the next call of modify_gid.
+	 * The function should return 0 on success or error otherwise.
+	 * The function could be called concurrently for different ports.
+	 * This function is only called when roce_gid_table is used.
+	 */
+	int		           (*modify_gid)(struct ib_device *device,
+						 u8 port_num,
+						 unsigned int index,
+						 const union ib_gid *gid,
+						 const struct ib_gid_attr *attr,
+						 void **context);
 	int		           (*query_pkey)(struct ib_device *device,
 						 u8 port_num, u16 index, u16 *pkey);
 	int		           (*modify_device)(struct ib_device *device,
-- 
2.1.0


* [PATCH for-next V5 02/12] IB/core: Add rwsem to allow reading device list or client list
       [not found] ` <1433772735-22416-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-08 14:12   ` [PATCH for-next V5 01/12] IB/core: Add RoCE GID table Matan Barak
@ 2015-06-08 14:12   ` Matan Barak
  2015-06-08 14:12   ` [PATCH for-next V5 03/12] IB/core: Add RoCE GID population Matan Barak
                     ` (10 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Matan Barak @ 2015-06-08 14:12 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Moni Shoua, Jason Gunthorpe, Sean Hefty,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Haggai Eran

From: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Currently the RDMA subsystem's device list and client list are protected by
a single mutex. This prevents adding user-facing APIs that iterate these
lists, since using them may cause a deadlock. The patch attempts to solve
this problem by adding a read-write semaphore to protect the lists. Readers
now don't need the mutex, and are safe just by read-locking the semaphore.

The ib_register_device, ib_register_client, ib_unregister_device, and
ib_unregister_client functions are modified to lock the semaphore for write
during their respective list modification. Also, in order to make sure
client callbacks are called only between add() and remove() calls, the code
is changed to only add items to the lists after the add() calls and remove
from the lists before the remove() calls.

This patch attempts to solve a similar need [1] that was seen in the RoCE
v2 patch series.

This patch is also a part of [2] "Add network namespace support in
RDMA-CM" patch series.

[1] http://www.spinics.net/lists/linux-rdma/msg24733.html
[2] http://permalink.gmane.org/gmane.linux.drivers.rdma/25588

Cc: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/device.c | 39 ++++++++++++++++++++++++++++-----------
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 8d07c12..7e83f0d 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -55,17 +55,24 @@ struct ib_client_data {
 struct workqueue_struct *ib_wq;
 EXPORT_SYMBOL_GPL(ib_wq);
 
+/* The device_list and client_list contain devices and clients after their
+ * registration has completed, and the devices and clients are removed
+ * during unregistration. */
 static LIST_HEAD(device_list);
 static LIST_HEAD(client_list);
 
 /*
- * device_mutex protects access to both device_list and client_list.
- * There's no real point to using multiple locks or something fancier
- * like an rwsem: we always access both lists, and we're always
- * modifying one list or the other list.  In any case this is not a
- * hot path so there's no point in trying to optimize.
+ * device_mutex and lists_rwsem protect access to both device_list and
+ * client_list.  device_mutex protects writer access by device and client
+ * registration / de-registration.  lists_rwsem protects reader access to
+ * these lists.  Iterators of these lists must lock it for read, while updates
+ * to the lists must be done with a write lock. A special case is when the
+ * device_mutex is locked. In this case locking the lists for read access is
+ * not necessary as the device_mutex implies it.
  */
 static DEFINE_MUTEX(device_mutex);
+static DECLARE_RWSEM(lists_rwsem);
+
 
 static int ib_device_check_mandatory(struct ib_device *device)
 {
@@ -294,8 +301,6 @@ int ib_register_device(struct ib_device *device,
 		goto out;
 	}
 
-	list_add_tail(&device->core_list, &device_list);
-
 	device->reg_state = IB_DEV_REGISTERED;
 
 	{
@@ -306,6 +311,10 @@ int ib_register_device(struct ib_device *device,
 				client->add(device);
 	}
 
+	down_write(&lists_rwsem);
+	list_add_tail(&device->core_list, &device_list);
+	up_write(&lists_rwsem);
+
  out:
 	mutex_unlock(&device_mutex);
 	return ret;
@@ -326,12 +335,14 @@ void ib_unregister_device(struct ib_device *device)
 
 	mutex_lock(&device_mutex);
 
+	down_write(&lists_rwsem);
+	list_del(&device->core_list);
+	up_write(&lists_rwsem);
+
 	list_for_each_entry_reverse(client, &client_list, list)
 		if (client->remove)
 			client->remove(device);
 
-	list_del(&device->core_list);
-
 	mutex_unlock(&device_mutex);
 
 	ib_device_unregister_sysfs(device);
@@ -364,11 +375,14 @@ int ib_register_client(struct ib_client *client)
 
 	mutex_lock(&device_mutex);
 
-	list_add_tail(&client->list, &client_list);
 	list_for_each_entry(device, &device_list, core_list)
 		if (client->add && !add_client_context(device, client))
 			client->add(device);
 
+	down_write(&lists_rwsem);
+	list_add_tail(&client->list, &client_list);
+	up_write(&lists_rwsem);
+
 	mutex_unlock(&device_mutex);
 
 	return 0;
@@ -391,6 +405,10 @@ void ib_unregister_client(struct ib_client *client)
 
 	mutex_lock(&device_mutex);
 
+	down_write(&lists_rwsem);
+	list_del(&client->list);
+	up_write(&lists_rwsem);
+
 	list_for_each_entry(device, &device_list, core_list) {
 		if (client->remove)
 			client->remove(device);
@@ -403,7 +421,6 @@ void ib_unregister_client(struct ib_client *client)
 			}
 		spin_unlock_irqrestore(&device->client_data_lock, flags);
 	}
-	list_del(&client->list);
 
 	mutex_unlock(&device_mutex);
 }
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH for-next V5 03/12] IB/core: Add RoCE GID population
       [not found] ` <1433772735-22416-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-08 14:12   ` [PATCH for-next V5 01/12] IB/core: Add RoCE GID table Matan Barak
  2015-06-08 14:12   ` [PATCH for-next V5 02/12] IB/core: Add rwsem to allow reading device list or client list Matan Barak
@ 2015-06-08 14:12   ` Matan Barak
       [not found]     ` <1433772735-22416-4-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-08 14:12   ` [PATCH for-next V5 04/12] net/ipv6: Export addrconf_ifid_eui48 Matan Barak
                     ` (9 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Matan Barak @ 2015-06-08 14:12 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Moni Shoua, Jason Gunthorpe, Sean Hefty,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

In order to populate the GID table, we need to listen for the
following events:
(a) IB device added or removed - used in order to
    allocate/deallocate the GID table and to populate
    it internally.
(b) inet events - add/delete GIDs in the table according to
    the IP addresses.
(c) netdev up/down/change_addr - if a netdev is built on top of
    our RoCE device, we need to add/delete its IPs.

When an event is received, multiple entries (each with a
different GID type) are added.
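
As an illustration of (b), here is a minimal userspace sketch of how an IP
address can be turned into a 16-byte GID. It is not the kernel code:
ip_to_gid and GID_LEN are hypothetical names, and the IPv4 case assumes the
address is stored in IPv4-mapped IPv6 form (::ffff:a.b.c.d), which is the
layout rdma_ip2gid() in include/rdma/ib_addr.h is understood to use.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define GID_LEN 16	/* hypothetical name; a GID is 128 bits */

/* Hypothetical userspace analogue of rdma_ip2gid(); returns 0 on success. */
static int ip_to_gid(const struct sockaddr *addr, uint8_t gid[GID_LEN])
{
	assert(addr && gid);

	switch (addr->sa_family) {
	case AF_INET: {
		const struct sockaddr_in *in4 =
			(const struct sockaddr_in *)addr;

		/* Assumed layout: 80 zero bits, 16 one bits, IPv4 address */
		memset(gid, 0, 10);
		gid[10] = 0xff;
		gid[11] = 0xff;
		memcpy(gid + 12, &in4->sin_addr.s_addr, 4);
		return 0;
	}
	case AF_INET6: {
		const struct sockaddr_in6 *in6 =
			(const struct sockaddr_in6 *)addr;

		/* An IPv6 address is already 128 bits: copy it verbatim */
		memcpy(gid, in6->sin6_addr.s6_addr, GID_LEN);
		return 0;
	}
	default:
		/* Unsupported address family */
		return -1;
	}
}
```

With this sketch, 192.168.0.1 maps to the GID ::ffff:192.168.0.1, and an
IPv6 address maps to itself.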

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/Makefile         |   2 +-
 drivers/infiniband/core/core_priv.h      |  26 ++
 drivers/infiniband/core/device.c         |  77 +++++
 drivers/infiniband/core/roce_gid_mgmt.c  | 471 +++++++++++++++++++++++++++++++
 drivers/infiniband/core/roce_gid_table.c |  52 ++++
 include/rdma/ib_addr.h                   |   2 +-
 include/rdma/ib_verbs.h                  |   8 +
 7 files changed, 636 insertions(+), 2 deletions(-)
 create mode 100644 drivers/infiniband/core/roce_gid_mgmt.c

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index fbeb72a..3ceb3f8 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -10,7 +10,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=	ib_uverbs.o ib_ucm.o \
 
 ib_core-y :=			packer.o ud_header.o verbs.o sysfs.o \
 				device.o fmr_pool.o cache.o netlink.o \
-				roce_gid_table.o
+				roce_gid_table.o roce_gid_mgmt.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
 
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index a9e58418..eab4e6c 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -39,6 +39,8 @@
 
 #include <rdma/ib_verbs.h>
 
+extern struct workqueue_struct *roce_gid_mgmt_wq;
+
 int  ib_device_register_sysfs(struct ib_device *device,
 			      int (*port_callback)(struct ib_device *,
 						   u8, struct kobject *));
@@ -53,6 +55,22 @@ void ib_cache_cleanup(void);
 int ib_resolve_eth_l2_attrs(struct ib_qp *qp,
 			    struct ib_qp_attr *qp_attr, int *qp_attr_mask);
 
+typedef void (*roce_netdev_callback)(struct ib_device *device, u8 port,
+	      struct net_device *idev, void *cookie);
+
+typedef int (*roce_netdev_filter)(struct ib_device *device, u8 port,
+	     struct net_device *idev, void *cookie);
+
+void ib_dev_roce_ports_of_netdev(struct ib_device *ib_dev,
+				 roce_netdev_filter filter,
+				 void *filter_cookie,
+				 roce_netdev_callback cb,
+				 void *cookie);
+void ib_enum_roce_ports_of_netdev(roce_netdev_filter filter,
+				  void *filter_cookie,
+				  roce_netdev_callback cb,
+				  void *cookie);
+
 int roce_gid_table_get_gid(struct ib_device *ib_dev, u8 port, int index,
 			   union ib_gid *gid, struct ib_gid_attr *attr);
 
@@ -65,6 +83,9 @@ int roce_gid_table_find_gid_by_port(struct ib_device *ib_dev,
 				    u8 port, struct net_device *ndev,
 				    u16 *index);
 
+int roce_gid_table_setup(void);
+void roce_gid_table_cleanup(void);
+
 int roce_add_gid(struct ib_device *ib_dev, u8 port,
 		 union ib_gid *gid, struct ib_gid_attr *attr);
 
@@ -74,4 +95,9 @@ int roce_del_gid(struct ib_device *ib_dev, u8 port,
 int roce_del_all_netdev_gids(struct ib_device *ib_dev, u8 port,
 			     struct net_device *ndev);
 
+int roce_gid_mgmt_init(void);
+void roce_gid_mgmt_cleanup(void);
+
+int roce_rescan_device(struct ib_device *ib_dev);
+
 #endif /* _CORE_PRIV_H */
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 7e83f0d..84edb9a 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -39,6 +39,7 @@
 #include <linux/init.h>
 #include <linux/mutex.h>
 #include <rdma/rdma_netlink.h>
+#include <rdma/ib_addr.h>
 
 #include "core_priv.h"
 
@@ -597,6 +598,79 @@ int ib_query_gid(struct ib_device *device,
 EXPORT_SYMBOL(ib_query_gid);
 
 /**
+ * ib_dev_roce_ports_of_netdev - enumerate RoCE ports of ibdev with
+ *				 respect to netdev
+ * @ib_dev: IB device we want to query
+ * @filter: Should we call the callback?
+ * @filter_cookie: Cookie passed to filter
+ * @cb: Callback to call for each found RoCE port
+ * @cookie: Cookie passed back to the callback
+ *
+ * Enumerates all of the physical RoCE ports of ib_dev
+ * which are relaying Ethernet packets to a specific
+ * (possibly virtual) netdevice according to filter.
+ */
+void ib_dev_roce_ports_of_netdev(struct ib_device *ib_dev,
+				 roce_netdev_filter filter,
+				 void *filter_cookie,
+				 roce_netdev_callback cb,
+				 void *cookie)
+{
+	u8 port;
+
+	if (ib_dev->modify_gid)
+		for (port = rdma_start_port(ib_dev); port <= rdma_end_port(ib_dev);
+		     port++)
+			if (rdma_protocol_roce(ib_dev, port)) {
+				struct net_device *idev = NULL;
+
+				rcu_read_lock();
+				if (ib_dev->get_netdev)
+					idev = ib_dev->get_netdev(ib_dev, port);
+
+				if (idev &&
+				    idev->reg_state >= NETREG_UNREGISTERED)
+					idev = NULL;
+
+				if (idev)
+					dev_hold(idev);
+
+				rcu_read_unlock();
+
+				if (filter(ib_dev, port, idev, filter_cookie))
+					cb(ib_dev, port, idev, cookie);
+
+				if (idev)
+					dev_put(idev);
+			}
+}
+
+/**
+ * ib_enum_roce_ports_of_netdev - enumerate RoCE ports of a netdev
+ * @filter: Should we call the callback?
+ * @filter_cookie: Cookie passed to filter
+ * @cb: Callback to call for each found RoCE ports
+ * @cookie: Cookie passed back to the callback
+ *
+ * Enumerates all of the physical RoCE ports which are relaying
+ * Ethernet packets to a specific (possibly virtual) netdevice
+ * according to filter.
+ */
+void ib_enum_roce_ports_of_netdev(roce_netdev_filter filter,
+				  void *filter_cookie,
+				  roce_netdev_callback cb,
+				  void *cookie)
+{
+	struct ib_device *dev;
+
+	down_read(&lists_rwsem);
+	list_for_each_entry_rcu(dev, &device_list, core_list)
+		ib_dev_roce_ports_of_netdev(dev, filter, filter_cookie, cb,
+					    cookie);
+	up_read(&lists_rwsem);
+}
+
+/**
  * ib_query_pkey - Get P_Key table entry
  * @device:Device to query
  * @port_num:Port number to query
@@ -751,6 +825,8 @@ static int __init ib_core_init(void)
 		goto err_sysfs;
 	}
 
+	roce_gid_table_setup();
+
 	ret = ib_cache_setup();
 	if (ret) {
 		printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n");
@@ -772,6 +848,7 @@ err:
 
 static void __exit ib_core_cleanup(void)
 {
+	roce_gid_table_cleanup();
 	ib_cache_cleanup();
 	ibnl_cleanup();
 	ib_sysfs_cleanup();
diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c
new file mode 100644
index 0000000..70616fc
--- /dev/null
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -0,0 +1,471 @@
+/*
+ * Copyright (c) 2015, Mellanox Technologies inc.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "core_priv.h"
+
+#include <linux/in.h>
+#include <linux/in6.h>
+
+/* For in6_dev_get/in6_dev_put */
+#include <net/addrconf.h>
+
+#include <rdma/ib_cache.h>
+#include <rdma/ib_addr.h>
+
+struct workqueue_struct *roce_gid_mgmt_wq;
+
+enum gid_op_type {
+	GID_DEL = 0,
+	GID_ADD
+};
+
+struct  update_gid_event_work {
+	struct work_struct work;
+	union ib_gid       gid;
+	struct ib_gid_attr gid_attr;
+	enum gid_op_type gid_op;
+};
+
+#define ROCE_NETDEV_CALLBACK_SZ		2
+struct netdev_event_work_cmd {
+	roce_netdev_callback	cb;
+	roce_netdev_filter	filter;
+};
+
+struct netdev_event_work {
+	struct work_struct		work;
+	struct netdev_event_work_cmd	cmds[ROCE_NETDEV_CALLBACK_SZ];
+	struct net_device		*ndev;
+};
+
+static void update_gid(enum gid_op_type gid_op, struct ib_device *ib_dev,
+		       u8 port, union ib_gid *gid,
+		       struct ib_gid_attr *gid_attr)
+{
+	if (rdma_protocol_roce(ib_dev, port)) {
+		switch (gid_op) {
+		case GID_ADD:
+			roce_add_gid(ib_dev, port,
+				     gid, gid_attr);
+			break;
+		case GID_DEL:
+			roce_del_gid(ib_dev, port,
+				     gid, gid_attr);
+			break;
+		}
+	}
+}
+
+static int is_eth_port_of_netdev(struct ib_device *ib_dev, u8 port,
+				 struct net_device *idev, void *cookie)
+{
+	struct net_device *rdev;
+	struct net_device *mdev;
+	struct net_device *ndev = (struct net_device *)cookie;
+
+	if (!idev)
+		return 0;
+
+	rcu_read_lock();
+	mdev = netdev_master_upper_dev_get_rcu(idev);
+	rdev = rdma_vlan_dev_real_dev(ndev);
+	rcu_read_unlock();
+
+	return (rdev ? rdev : ndev) == (mdev ? mdev : idev);
+}
+
+static int pass_all_filter(struct ib_device *ib_dev, u8 port,
+			   struct net_device *idev, void *cookie)
+{
+	return 1;
+}
+
+static void update_gid_ip(enum gid_op_type gid_op,
+			  struct ib_device *ib_dev,
+			  u8 port, struct net_device *ndev,
+			  const struct sockaddr *addr)
+{
+	union ib_gid gid;
+	struct ib_gid_attr gid_attr;
+
+	rdma_ip2gid(addr, &gid);
+	memset(&gid_attr, 0, sizeof(gid_attr));
+	gid_attr.ndev = ndev;
+
+	update_gid(gid_op, ib_dev, port, &gid, &gid_attr);
+}
+
+static void enum_netdev_ipv4_ips(struct ib_device *ib_dev,
+				 u8 port, struct net_device *ndev)
+{
+	struct in_device *in_dev;
+
+	if (ndev->reg_state >= NETREG_UNREGISTERING)
+		return;
+
+	in_dev = in_dev_get(ndev);
+	if (!in_dev)
+		return;
+
+	for_ifa(in_dev) {
+		struct sockaddr_in ip;
+
+		ip.sin_family = AF_INET;
+		ip.sin_addr.s_addr = ifa->ifa_address;
+		update_gid_ip(GID_ADD, ib_dev, port, ndev,
+			      (struct sockaddr *)&ip);
+	}
+	endfor_ifa(in_dev);
+
+	in_dev_put(in_dev);
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+static void enum_netdev_ipv6_ips(struct ib_device *ib_dev,
+				 u8 port, struct net_device *ndev)
+{
+	struct inet6_ifaddr *ifp;
+	struct inet6_dev *in6_dev;
+	struct sin6_list {
+		struct list_head	list;
+		struct sockaddr_in6	sin6;
+	};
+	struct sin6_list *sin6_iter;
+	struct sin6_list *sin6_temp;
+	struct ib_gid_attr gid_attr = {.ndev = ndev};
+	LIST_HEAD(sin6_list);
+
+	if (ndev->reg_state >= NETREG_UNREGISTERING)
+		return;
+
+	in6_dev = in6_dev_get(ndev);
+	if (!in6_dev)
+		return;
+
+	read_lock_bh(&in6_dev->lock);
+	list_for_each_entry(ifp, &in6_dev->addr_list, if_list) {
+		struct sin6_list *entry = kzalloc(sizeof(*entry), GFP_ATOMIC);
+
+		if (!entry) {
+			pr_warn("roce_gid_mgmt: couldn't allocate entry for IPv6 update\n");
+			continue;
+		}
+
+		entry->sin6.sin6_family = AF_INET6;
+		entry->sin6.sin6_addr = ifp->addr;
+		list_add_tail(&entry->list, &sin6_list);
+	}
+	read_unlock_bh(&in6_dev->lock);
+
+	in6_dev_put(in6_dev);
+
+	list_for_each_entry_safe(sin6_iter, sin6_temp, &sin6_list, list) {
+		union ib_gid	gid;
+
+		rdma_ip2gid((const struct sockaddr *)&sin6_iter->sin6, &gid);
+		update_gid(GID_ADD, ib_dev, port, &gid, &gid_attr);
+		list_del(&sin6_iter->list);
+		kfree(sin6_iter);
+	}
+}
+#endif
+
+static void add_netdev_ips(struct ib_device *ib_dev, u8 port,
+			   struct net_device *idev, void *cookie)
+{
+	struct net_device *ndev = (struct net_device *)cookie;
+
+	enum_netdev_ipv4_ips(ib_dev, port, ndev);
+#if IS_ENABLED(CONFIG_IPV6)
+	enum_netdev_ipv6_ips(ib_dev, port, ndev);
+#endif
+}
+
+static void del_netdev_ips(struct ib_device *ib_dev, u8 port,
+			   struct net_device *idev, void *cookie)
+{
+	struct net_device *ndev = (struct net_device *)cookie;
+
+	roce_del_all_netdev_gids(ib_dev, port, ndev);
+}
+
+static void enum_all_gids_of_dev_cb(struct ib_device *ib_dev,
+				    u8 port,
+				    struct net_device *idev,
+				    void *cookie)
+{
+	struct net *net;
+	struct net_device *ndev;
+
+	/* Lock the rtnl to make sure the netdevs do not move under
+	 * our feet
+	 */
+	rtnl_lock();
+	for_each_net(net)
+		for_each_netdev(net, ndev)
+			if (is_eth_port_of_netdev(ib_dev, port, idev, ndev))
+				add_netdev_ips(ib_dev, port, idev, ndev);
+	rtnl_unlock();
+}
+
+/* This function will rescan all of the network devices in the system
+ * and add their gids, as needed, to the relevant RoCE devices. Will
+ * take rtnl and the IB device list mutexes. Must not be called from
+ * ib_wq or deadlock will happen. */
+int roce_rescan_device(struct ib_device *ib_dev)
+{
+	ib_dev_roce_ports_of_netdev(ib_dev, pass_all_filter, NULL,
+				    enum_all_gids_of_dev_cb, NULL);
+
+	return 0;
+}
+
+static void callback_for_addr_gid_device_scan(struct ib_device *device,
+					      u8 port,
+					      struct net_device *idev,
+					      void *cookie)
+{
+	struct update_gid_event_work *parsed = cookie;
+
+	update_gid(parsed->gid_op, device,
+		   port, &parsed->gid,
+		   &parsed->gid_attr);
+}
+
+/* The following functions operate on all IB devices. netdevice_event and
+ * addr_event execute ib_enum_roce_ports_of_netdev through a work.
+ * ib_enum_roce_ports_of_netdev iterates through all IB devices, thus proper
+ * usage of SRCU is required
+ */
+
+static void netdevice_event_work_handler(struct work_struct *_work)
+{
+	struct netdev_event_work *work =
+		container_of(_work, struct netdev_event_work, work);
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(work->cmds) && work->cmds[i].cb; i++)
+		ib_enum_roce_ports_of_netdev(work->cmds[i].filter, work->ndev,
+					     work->cmds[i].cb, work->ndev);
+
+	dev_put(work->ndev);
+	kfree(work);
+}
+
+static int netdevice_event(struct notifier_block *this, unsigned long event,
+			   void *ptr)
+{
+	static const struct netdev_event_work_cmd add_cmd = {
+		.cb = add_netdev_ips, .filter = is_eth_port_of_netdev};
+	static const struct netdev_event_work_cmd del_cmd = {
+		.cb = del_netdev_ips, .filter = pass_all_filter};
+	struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
+	struct netdev_event_work *ndev_work;
+	struct netdev_event_work_cmd cmds[ROCE_NETDEV_CALLBACK_SZ] = { {NULL} };
+
+	if (ndev->type != ARPHRD_ETHER)
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+	case NETDEV_UP:
+		cmds[0] = add_cmd;
+		break;
+
+	case NETDEV_UNREGISTER:
+		if (ndev->reg_state < NETREG_UNREGISTERED)
+			cmds[0] = del_cmd;
+		else
+			return NOTIFY_DONE;
+		break;
+
+	case NETDEV_CHANGEADDR:
+		cmds[0] = del_cmd;
+		cmds[1] = add_cmd;
+		break;
+	default:
+		return NOTIFY_DONE;
+	}
+
+	ndev_work = kmalloc(sizeof(*ndev_work), GFP_KERNEL);
+	if (!ndev_work) {
+		pr_warn("roce_gid_mgmt: can't allocate work for netdevice_event\n");
+		return NOTIFY_DONE;
+	}
+
+	memcpy(ndev_work->cmds, cmds, sizeof(ndev_work->cmds));
+	ndev_work->ndev = ndev;
+	dev_hold(ndev);
+	INIT_WORK(&ndev_work->work, netdevice_event_work_handler);
+
+	queue_work(roce_gid_mgmt_wq, &ndev_work->work);
+
+	return NOTIFY_DONE;
+}
+
+static void update_gid_event_work_handler(struct work_struct *_work)
+{
+	struct update_gid_event_work *work =
+		container_of(_work, struct update_gid_event_work, work);
+
+	ib_enum_roce_ports_of_netdev(is_eth_port_of_netdev, work->gid_attr.ndev,
+				     callback_for_addr_gid_device_scan, work);
+
+	dev_put(work->gid_attr.ndev);
+	kfree(work);
+}
+
+static int addr_event(struct notifier_block *this, unsigned long event,
+		      struct sockaddr *sa, struct net_device *ndev)
+{
+	struct update_gid_event_work *work;
+	enum gid_op_type gid_op;
+
+	if (ndev->type != ARPHRD_ETHER)
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_UP:
+		gid_op = GID_ADD;
+		break;
+
+	case NETDEV_DOWN:
+		gid_op = GID_DEL;
+		break;
+
+	default:
+		return NOTIFY_DONE;
+	}
+
+	work = kmalloc(sizeof(*work), GFP_ATOMIC);
+	if (!work) {
+		pr_warn("roce_gid_mgmt: Couldn't allocate work for addr_event\n");
+		return NOTIFY_DONE;
+	}
+
+	INIT_WORK(&work->work, update_gid_event_work_handler);
+
+	rdma_ip2gid(sa, &work->gid);
+	work->gid_op = gid_op;
+
+	memset(&work->gid_attr, 0, sizeof(work->gid_attr));
+	dev_hold(ndev);
+	work->gid_attr.ndev   = ndev;
+
+	queue_work(roce_gid_mgmt_wq, &work->work);
+
+	return NOTIFY_DONE;
+}
+
+static int inetaddr_event(struct notifier_block *this, unsigned long event,
+			  void *ptr)
+{
+	struct sockaddr_in	in;
+	struct net_device	*ndev;
+	struct in_ifaddr	*ifa = ptr;
+
+	in.sin_family = AF_INET;
+	in.sin_addr.s_addr = ifa->ifa_address;
+	ndev = ifa->ifa_dev->dev;
+
+	return addr_event(this, event, (struct sockaddr *)&in, ndev);
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+static int inet6addr_event(struct notifier_block *this, unsigned long event,
+			   void *ptr)
+{
+	struct sockaddr_in6	in6;
+	struct net_device	*ndev;
+	struct inet6_ifaddr	*ifa6 = ptr;
+
+	in6.sin6_family = AF_INET6;
+	in6.sin6_addr = ifa6->addr;
+	ndev = ifa6->idev->dev;
+
+	return addr_event(this, event, (struct sockaddr *)&in6, ndev);
+}
+#endif
+
+static struct notifier_block nb_netdevice = {
+	.notifier_call = netdevice_event
+};
+
+static struct notifier_block nb_inetaddr = {
+	.notifier_call = inetaddr_event
+};
+
+#if IS_ENABLED(CONFIG_IPV6)
+static struct notifier_block nb_inet6addr = {
+	.notifier_call = inet6addr_event
+};
+#endif
+
+int __init roce_gid_mgmt_init(void)
+{
+	roce_gid_mgmt_wq = alloc_ordered_workqueue("roce_gid_mgmt_wq", 0);
+
+	if (!roce_gid_mgmt_wq) {
+		pr_warn("roce_gid_mgmt: can't allocate work queue\n");
+		return -ENOMEM;
+	}
+
+	register_inetaddr_notifier(&nb_inetaddr);
+#if IS_ENABLED(CONFIG_IPV6)
+	register_inet6addr_notifier(&nb_inet6addr);
+#endif
+	/* We rely on the netdevice notifier to enumerate all
+	 * existing devices in the system. Register to this notifier
+	 * last to make sure we will not miss any IP add/del
+	 * callbacks.
+	 */
+	register_netdevice_notifier(&nb_netdevice);
+
+	return 0;
+}
+
+void __exit roce_gid_mgmt_cleanup(void)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	unregister_inet6addr_notifier(&nb_inet6addr);
+#endif
+	unregister_inetaddr_notifier(&nb_inetaddr);
+	unregister_netdevice_notifier(&nb_netdevice);
+	/* Ensure all gid deletion tasks complete before we go down,
+	 * to avoid any reference to freed memory. By the time
+	 * ib-core is removed, all physical devices have been removed,
+	 * so no issue with remaining hardware contexts.
+	 */
+	synchronize_rcu();
+	drain_workqueue(roce_gid_mgmt_wq);
+	destroy_workqueue(roce_gid_mgmt_wq);
+}
diff --git a/drivers/infiniband/core/roce_gid_table.c b/drivers/infiniband/core/roce_gid_table.c
index f492cf1..5e9e4dc 100644
--- a/drivers/infiniband/core/roce_gid_table.c
+++ b/drivers/infiniband/core/roce_gid_table.c
@@ -468,3 +468,55 @@ static void roce_gid_table_cleanup_one(struct ib_device *ib_dev,
 	kfree(table);
 }
 
+static void roce_gid_table_client_cleanup_one(struct ib_device *ib_dev)
+{
+	struct ib_roce_gid_table **table = ib_dev->cache.roce_gid_table;
+
+	if (!table)
+		return;
+
+	ib_dev->cache.roce_gid_table = NULL;
+	/* smp_wmb is mandatory in order to make sure all executing works
+	 * realize we're freeing this roce_gid_table. Every function which
+	 * could be executed in a work, fetches ib_dev->cache.roce_gid_table
+	 * once (READ_ONCE + smp_rmb) into a local variable.
+	 * If it fetched a value != NULL, we wait for this work to finish by
+	 * calling flush_workqueue. If it fetches NULL, it'll return immediately.
+	 */
+	smp_wmb();
+	/* Make sure no gid update task is still referencing this device */
+	flush_workqueue(roce_gid_mgmt_wq);
+
+	roce_gid_table_cleanup_one(ib_dev, table);
+}
+
+static void roce_gid_table_client_setup_one(struct ib_device *ib_dev)
+{
+	if (!roce_gid_table_setup_one(ib_dev))
+		if (roce_rescan_device(ib_dev))
+			roce_gid_table_client_cleanup_one(ib_dev);
+}
+
+static struct ib_client table_client = {
+	.name   = "roce_gid_table",
+	.add    = roce_gid_table_client_setup_one,
+	.remove = roce_gid_table_client_cleanup_one
+};
+
+int __init roce_gid_table_setup(void)
+{
+	roce_gid_mgmt_init();
+
+	return ib_register_client(&table_client);
+}
+
+void __exit roce_gid_table_cleanup(void)
+{
+	ib_unregister_client(&table_client);
+
+	roce_gid_mgmt_cleanup();
+
+	flush_workqueue(system_wq);
+
+	rcu_barrier();
+}
diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h
index fde33ac..850eec6 100644
--- a/include/rdma/ib_addr.h
+++ b/include/rdma/ib_addr.h
@@ -142,7 +142,7 @@ static inline u16 rdma_vlan_dev_vlan_id(const struct net_device *dev)
 		vlan_dev_vlan_id(dev) : 0xffff;
 }
 
-static inline int rdma_ip2gid(struct sockaddr *addr, union ib_gid *gid)
+static inline int rdma_ip2gid(const struct sockaddr *addr, union ib_gid *gid)
 {
 	switch (addr->sa_family) {
 	case AF_INET:
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 72b62cd..05dcfad 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1579,6 +1579,14 @@ struct ib_device {
 						 struct ib_port_attr *port_attr);
 	enum rdma_link_layer	   (*get_link_layer)(struct ib_device *device,
 						     u8 port_num);
+	/* When calling get_netdev, the HW vendor's driver should return the
+	 * net device of device @device at port @port_num. The function
+	 * is called in rtnl_lock. The HW vendor's device driver must guarantee
+	 * to return NULL before the net device has reached
+	 * NETDEV_UNREGISTER_FINAL state.
+	 */
+	struct net_device	  *(*get_netdev)(struct ib_device *device,
+						 u8 port_num);
 	int		           (*query_gid)(struct ib_device *device,
 						u8 port_num, int index,
 						union ib_gid *gid);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH for-next V5 04/12] net/ipv6: Export addrconf_ifid_eui48
       [not found] ` <1433772735-22416-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-06-08 14:12   ` [PATCH for-next V5 03/12] IB/core: Add RoCE GID population Matan Barak
@ 2015-06-08 14:12   ` Matan Barak
  2015-06-08 14:12   ` [PATCH for-next V5 05/12] IB/core: Add default GID for RoCE GID table Matan Barak
                     ` (8 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Matan Barak @ 2015-06-08 14:12 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Moni Shoua, Jason Gunthorpe, Sean Hefty,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

RoCE devices would like to have a default GID even
when the interface is down. In order to do so,
we use the IPv6 link-local address as a default
GID. addrconf_ifid_eui48 is used to generate
this address, so move it from net/ipv6/addrconf.c
to include/net/addrconf.h as a static inline.
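
For illustration, here is a minimal userspace sketch of how such a default
GID can be derived from a MAC address. mac_to_ifid mirrors the dev_id == 0
branch of addrconf_ifid_eui48 in this patch. make_default_gid, MAC_LEN and
GID_LEN are hypothetical names, and prepending the fe80::/64 link-local
prefix is an assumption about how the default GID is assembled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6	/* hypothetical name; same value as ETH_ALEN */
#define GID_LEN 16	/* hypothetical name; a GID is 128 bits */

/*
 * Build the 64-bit interface identifier from a 48-bit MAC address:
 * split the MAC in two halves, insert 0xFFFE in the middle and flip
 * the universal/local bit (dev_id == 0 case only).
 */
static void mac_to_ifid(const uint8_t mac[MAC_LEN], uint8_t eui[8])
{
	memcpy(eui, mac, 3);
	memcpy(eui + 5, mac + 3, 3);
	eui[3] = 0xFF;
	eui[4] = 0xFE;
	eui[0] ^= 2;
}

/* Hypothetical helper: fe80::/64 prefix followed by the interface ID. */
static void make_default_gid(const uint8_t mac[MAC_LEN],
			     uint8_t gid[GID_LEN])
{
	assert(mac && gid);

	memset(gid, 0, GID_LEN);
	gid[0] = 0xfe;
	gid[1] = 0x80;
	mac_to_ifid(mac, gid + 8);
}
```

For the MAC address 00:11:22:33:44:55 this sketch yields the GID
fe80::211:22ff:fe33:4455. Since only the MAC address is needed, the GID
can be computed while the interface is down.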

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 include/net/addrconf.h | 31 +++++++++++++++++++++++++++++++
 net/ipv6/addrconf.c    | 31 -------------------------------
 2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 80456f7..89890e7 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -91,6 +91,37 @@ int ipv6_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2);
 void addrconf_join_solict(struct net_device *dev, const struct in6_addr *addr);
 void addrconf_leave_solict(struct inet6_dev *idev, const struct in6_addr *addr);
 
+static inline int addrconf_ifid_eui48(u8 *eui, struct net_device *dev)
+{
+	if (dev->addr_len != ETH_ALEN)
+		return -1;
+	memcpy(eui, dev->dev_addr, 3);
+	memcpy(eui + 5, dev->dev_addr + 3, 3);
+
+	/*
+	 * The zSeries OSA network cards can be shared among various
+	 * OS instances, but the OSA cards have only one MAC address.
+	 * This leads to duplicate address conflicts in conjunction
+	 * with IPv6 if more than one instance uses the same card.
+	 *
+	 * The driver for these cards can deliver a unique 16-bit
+	 * identifier for each instance sharing the same card.  It is
+	 * placed instead of 0xFFFE in the interface identifier.  The
+	 * "u" bit of the interface identifier is not inverted in this
+	 * case.  Hence the resulting interface identifier has local
+	 * scope according to RFC2373.
+	 */
+	if (dev->dev_id) {
+		eui[3] = (dev->dev_id >> 8) & 0xFF;
+		eui[4] = dev->dev_id & 0xFF;
+	} else {
+		eui[3] = 0xFF;
+		eui[4] = 0xFE;
+		eui[0] ^= 2;
+	}
+	return 0;
+}
+
 static inline unsigned long addrconf_timeout_fixup(u32 timeout,
 						   unsigned int unit)
 {
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 37b70e8..7170c7b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1845,37 +1845,6 @@ static void addrconf_leave_anycast(struct inet6_ifaddr *ifp)
 	__ipv6_dev_ac_dec(ifp->idev, &addr);
 }
 
-static int addrconf_ifid_eui48(u8 *eui, struct net_device *dev)
-{
-	if (dev->addr_len != ETH_ALEN)
-		return -1;
-	memcpy(eui, dev->dev_addr, 3);
-	memcpy(eui + 5, dev->dev_addr + 3, 3);
-
-	/*
-	 * The zSeries OSA network cards can be shared among various
-	 * OS instances, but the OSA cards have only one MAC address.
-	 * This leads to duplicate address conflicts in conjunction
-	 * with IPv6 if more than one instance uses the same card.
-	 *
-	 * The driver for these cards can deliver a unique 16-bit
-	 * identifier for each instance sharing the same card.  It is
-	 * placed instead of 0xFFFE in the interface identifier.  The
-	 * "u" bit of the interface identifier is not inverted in this
-	 * case.  Hence the resulting interface identifier has local
-	 * scope according to RFC2373.
-	 */
-	if (dev->dev_id) {
-		eui[3] = (dev->dev_id >> 8) & 0xFF;
-		eui[4] = dev->dev_id & 0xFF;
-	} else {
-		eui[3] = 0xFF;
-		eui[4] = 0xFE;
-		eui[0] ^= 2;
-	}
-	return 0;
-}
-
 static int addrconf_ifid_eui64(u8 *eui, struct net_device *dev)
 {
 	if (dev->addr_len != IEEE802154_ADDR_LEN)
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH for-next V5 05/12] IB/core: Add default GID for RoCE GID table
       [not found] ` <1433772735-22416-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2015-06-08 14:12   ` [PATCH for-next V5 04/12] net/ipv6: Export addrconf_ifid_eui48 Matan Barak
@ 2015-06-08 14:12   ` Matan Barak
       [not found]     ` <1433772735-22416-6-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-08 14:12   ` [PATCH for-next V5 06/12] net: Add info for NETDEV_CHANGEUPPER event Matan Barak
                     ` (7 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Matan Barak @ 2015-06-08 14:12 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Moni Shoua, Jason Gunthorpe, Sean Hefty,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

When RoCE is used, a default GID address should be generated
for every supported RoCE type. These default GID addresses are
generated based on the IPv6 link-local address. Unlike the
regular GIDs, which are generated per IP address (including
the IPv6 link-local one), the default GIDs are also available
when the net device is down (in order to support loopback).
Moreover, these default GID addresses can't be deleted.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/core_priv.h      |  12 +++
 drivers/infiniband/core/roce_gid_mgmt.c  |  25 ++++-
 drivers/infiniband/core/roce_gid_table.c | 162 ++++++++++++++++++++++++++++---
 include/rdma/ib_verbs.h                  |   1 +
 4 files changed, 185 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index eab4e6c..8da7a86 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -83,6 +83,16 @@ int roce_gid_table_find_gid_by_port(struct ib_device *ib_dev,
 				    u8 port, struct net_device *ndev,
 				    u16 *index);
 
+enum roce_gid_table_default_mode {
+	ROCE_GID_TABLE_DEFAULT_MODE_SET,
+	ROCE_GID_TABLE_DEFAULT_MODE_DELETE
+};
+
+void roce_gid_table_set_default_gid(struct ib_device *ib_dev, u8 port,
+				    struct net_device *ndev,
+				    unsigned long gid_type_mask,
+				    enum roce_gid_table_default_mode mode);
+
 int roce_gid_table_setup(void);
 void roce_gid_table_cleanup(void);
 
@@ -99,5 +109,7 @@ int roce_gid_mgmt_init(void);
 void roce_gid_mgmt_cleanup(void);
 
 int roce_rescan_device(struct ib_device *ib_dev);
+unsigned long roce_gid_type_mask_support(struct ib_device *ib_dev, u8 port);
+
 
 #endif /* _CORE_PRIV_H */
diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c
index 70616fc..6dcd1c7 100644
--- a/drivers/infiniband/core/roce_gid_mgmt.c
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -67,11 +67,18 @@ struct netdev_event_work {
 	struct net_device		*ndev;
 };
 
+unsigned long roce_gid_type_mask_support(struct ib_device *ib_dev, u8 port)
+{
+	return !!rdma_protocol_roce(ib_dev, port);
+}
+
 static void update_gid(enum gid_op_type gid_op, struct ib_device *ib_dev,
 		       u8 port, union ib_gid *gid,
 		       struct ib_gid_attr *gid_attr)
 {
-	if (rdma_protocol_roce(ib_dev, port)) {
+	unsigned long gid_type_mask = roce_gid_type_mask_support(ib_dev, port);
+
+	if (gid_type_mask) {
 		switch (gid_op) {
 		case GID_ADD:
 			roce_add_gid(ib_dev, port,
@@ -124,6 +131,21 @@ static void update_gid_ip(enum gid_op_type gid_op,
 	update_gid(gid_op, ib_dev, port, &gid, &gid_attr);
 }
 
+static void enum_netdev_default_gids(struct ib_device *ib_dev,
+				     u8 port, struct net_device *ndev,
+				     struct net_device *idev)
+{
+	unsigned long gid_type_mask;
+
+	if (idev != ndev)
+		return;
+
+	gid_type_mask = roce_gid_type_mask_support(ib_dev, port);
+
+	roce_gid_table_set_default_gid(ib_dev, port, idev, gid_type_mask,
+				       ROCE_GID_TABLE_DEFAULT_MODE_SET);
+}
+
 static void enum_netdev_ipv4_ips(struct ib_device *ib_dev,
 				 u8 port, struct net_device *ndev)
 {
@@ -204,6 +226,7 @@ static void add_netdev_ips(struct ib_device *ib_dev, u8 port,
 {
 	struct net_device *ndev = (struct net_device *)cookie;
 
+	enum_netdev_default_gids(ib_dev, port, ndev, idev);
 	enum_netdev_ipv4_ips(ib_dev, port, ndev);
 #if IS_ENABLED(CONFIG_IPV6)
 	enum_netdev_ipv6_ips(ib_dev, port, ndev);
diff --git a/drivers/infiniband/core/roce_gid_table.c b/drivers/infiniband/core/roce_gid_table.c
index 5e9e4dc..f0e68dc 100644
--- a/drivers/infiniband/core/roce_gid_table.c
+++ b/drivers/infiniband/core/roce_gid_table.c
@@ -34,6 +34,7 @@
 #include <linux/netdevice.h>
 #include <linux/rtnetlink.h>
 #include <rdma/ib_cache.h>
+#include <net/addrconf.h>
 
 #include "core_priv.h"
 
@@ -45,6 +46,7 @@ static const struct ib_gid_attr zattr;
 enum gid_attr_find_mask {
 	GID_ATTR_FIND_MASK_GID          = 1UL << 0,
 	GID_ATTR_FIND_MASK_NETDEV	= 1UL << 1,
+	GID_ATTR_FIND_MASK_DEFAULT	= 1UL << 2,
 };
 
 struct dev_put_rcu {
@@ -64,7 +66,8 @@ static void put_ndev(struct rcu_head *rcu)
 static int write_gid(struct ib_device *ib_dev, u8 port,
 		     struct ib_roce_gid_table *table, int ix,
 		     const union ib_gid *gid,
-		     const struct ib_gid_attr *attr)
+		     const struct ib_gid_attr *attr,
+		     bool  default_gid)
 {
 	int ret;
 	struct dev_put_rcu	*put_rcu;
@@ -72,6 +75,7 @@ static int write_gid(struct ib_device *ib_dev, u8 port,
 
 	write_seqcount_begin(&table->data_vec[ix].seq);
 
+	table->data_vec[ix].default_gid = default_gid;
 	ret = ib_dev->modify_gid(ib_dev, port, ix, gid, attr,
 				 &table->data_vec[ix].context);
 
@@ -114,7 +118,8 @@ static int write_gid(struct ib_device *ib_dev, u8 port,
 }
 
 static int find_gid(struct ib_roce_gid_table *table, const union ib_gid *gid,
-		    const struct ib_gid_attr *val, unsigned long mask)
+		    const struct ib_gid_attr *val, bool default_gid,
+		    unsigned long mask)
 {
 	int i;
 
@@ -122,13 +127,18 @@ static int find_gid(struct ib_roce_gid_table *table, const union ib_gid *gid,
 		struct ib_gid_attr *attr = &table->data_vec[i].attr;
 		unsigned int orig_seq = read_seqcount_begin(&table->data_vec[i].seq);
 
-		if (memcmp(gid, &table->data_vec[i].gid, sizeof(*gid)))
+		if (mask & GID_ATTR_FIND_MASK_GID &&
+		    memcmp(gid, &table->data_vec[i].gid, sizeof(*gid)))
 			continue;
 
 		if (mask & GID_ATTR_FIND_MASK_NETDEV &&
 		    attr->ndev != val->ndev)
 			continue;
 
+		if (mask & GID_ATTR_FIND_MASK_DEFAULT &&
+		    table->data_vec[i].default_gid != default_gid)
+			continue;
+
 		if (!read_seqcount_retry(&table->data_vec[i].seq, orig_seq))
 			return i;
 		/* The sequence number changed under our feet,
@@ -140,6 +150,12 @@ static int find_gid(struct ib_roce_gid_table *table, const union ib_gid *gid,
 	return -1;
 }
 
+static void make_default_gid(struct  net_device *dev, union ib_gid *gid)
+{
+	gid->global.subnet_prefix = cpu_to_be64(0xfe80000000000000LL);
+	addrconf_ifid_eui48(&gid->raw[8], dev);
+}
+
 int roce_add_gid(struct ib_device *ib_dev, u8 port,
 		 union ib_gid *gid, struct ib_gid_attr *attr)
 {
@@ -148,6 +164,7 @@ int roce_add_gid(struct ib_device *ib_dev, u8 port,
 	struct ib_roce_gid_table *table;
 	int ix;
 	int ret = 0;
+	struct net_device *idev;
 
 	/* make sure we read the ports_table */
 	smp_rmb();
@@ -163,19 +180,37 @@ int roce_add_gid(struct ib_device *ib_dev, u8 port,
 	if (!memcmp(gid, &zgid, sizeof(*gid)))
 		return -EINVAL;
 
+	if (ib_dev->get_netdev) {
+		rcu_read_lock();
+		idev = ib_dev->get_netdev(ib_dev, port);
+		if (idev && attr->ndev != idev) {
+			union ib_gid default_gid;
+
+			/* Adding default GIDs is not permitted */
+			make_default_gid(idev, &default_gid);
+			if (!memcmp(gid, &default_gid, sizeof(*gid))) {
+				rcu_read_unlock();
+				return -EPERM;
+			}
+		}
+		rcu_read_unlock();
+	}
+
 	mutex_lock(&table->lock);
 
-	ix = find_gid(table, gid, attr, GID_ATTR_FIND_MASK_NETDEV);
+	ix = find_gid(table, gid, attr, false, GID_ATTR_FIND_MASK_GID |
+		      GID_ATTR_FIND_MASK_NETDEV);
 	if (ix >= 0)
 		goto out_unlock;
 
-	ix = find_gid(table, &zgid, NULL, 0);
+	ix = find_gid(table, &zgid, NULL, false, GID_ATTR_FIND_MASK_GID |
+		      GID_ATTR_FIND_MASK_DEFAULT);
 	if (ix < 0) {
 		ret = -ENOSPC;
 		goto out_unlock;
 	}
 
-	write_gid(ib_dev, port, table, ix, gid, attr);
+	write_gid(ib_dev, port, table, ix, gid, attr, false);
 
 out_unlock:
 	mutex_unlock(&table->lock);
@@ -188,6 +223,7 @@ int roce_del_gid(struct ib_device *ib_dev, u8 port,
 	struct ib_roce_gid_table **ports_table =
 		READ_ONCE(ib_dev->cache.roce_gid_table);
 	struct ib_roce_gid_table *table;
+	union ib_gid default_gid;
 	int ix;
 
 	/* make sure we read the ports_table */
@@ -201,14 +237,23 @@ int roce_del_gid(struct ib_device *ib_dev, u8 port,
 	if (!table)
 		return -EPROTONOSUPPORT;
 
+	if (attr->ndev) {
+		/* Deleting default GIDs is not permitted */
+		make_default_gid(attr->ndev, &default_gid);
+		if (!memcmp(gid, &default_gid, sizeof(*gid)))
+			return -EPERM;
+	}
+
 	mutex_lock(&table->lock);
 
-	ix = find_gid(table, gid, attr,
-		      GID_ATTR_FIND_MASK_NETDEV);
+	ix = find_gid(table, gid, attr, false,
+		      GID_ATTR_FIND_MASK_GID	  |
+		      GID_ATTR_FIND_MASK_NETDEV	  |
+		      GID_ATTR_FIND_MASK_DEFAULT);
 	if (ix < 0)
 		goto out_unlock;
 
-	write_gid(ib_dev, port, table, ix, &zgid, &zattr);
+	write_gid(ib_dev, port, table, ix, &zgid, &zattr, false);
 
 out_unlock:
 	mutex_unlock(&table->lock);
@@ -238,7 +283,7 @@ int roce_del_all_netdev_gids(struct ib_device *ib_dev, u8 port,
 
 	for (ix = 0; ix < table->sz; ix++)
 		if (table->data_vec[ix].attr.ndev == ndev)
-			write_gid(ib_dev, port, table, ix, &zgid, &zattr);
+			write_gid(ib_dev, port, table, ix, &zgid, &zattr, false);
 
 	mutex_unlock(&table->lock);
 	return 0;
@@ -306,7 +351,7 @@ static int _roce_gid_table_find_gid(struct ib_device *ib_dev,
 		table = ports_table[p];
 		if (!table)
 			continue;
-		local_index = find_gid(table, gid, val, mask);
+		local_index = find_gid(table, gid, val, false, mask);
 		if (local_index >= 0) {
 			if (index)
 				*index = local_index;
@@ -341,7 +386,7 @@ int roce_gid_table_find_gid_by_port(struct ib_device *ib_dev,
 	struct ib_roce_gid_table **ports_table =
 		READ_ONCE(ib_dev->cache.roce_gid_table);
 	struct ib_roce_gid_table *table;
-	unsigned long mask = 0;
+	unsigned long mask = GID_ATTR_FIND_MASK_GID;
 	struct ib_gid_attr val = {.ndev = ndev};
 
 	/* make sure we read the ports_table */
@@ -358,7 +403,7 @@ int roce_gid_table_find_gid_by_port(struct ib_device *ib_dev,
 	if (ndev)
 		mask |= GID_ATTR_FIND_MASK_NETDEV;
 
-	local_index = find_gid(table, gid, &val, mask);
+	local_index = find_gid(table, gid, &val, false, mask);
 	if (local_index >= 0) {
 		if (index)
 			*index = local_index;
@@ -405,12 +450,95 @@ static void free_roce_gid_table(struct ib_device *ib_dev, u8 port,
 	for (i = 0; i < table->sz; ++i) {
 		if (memcmp(&table->data_vec[i].gid, &zgid,
 			   sizeof(table->data_vec[i].gid)))
-			write_gid(ib_dev, port, table, i, &zgid, &zattr);
+			write_gid(ib_dev, port, table, i, &zgid, &zattr,
+				  table->data_vec[i].default_gid);
 	}
 	kfree(table->data_vec);
 	kfree(table);
 }
 
+void roce_gid_table_set_default_gid(struct ib_device *ib_dev, u8 port,
+				    struct net_device *ndev,
+				    unsigned long gid_type_mask,
+				    enum roce_gid_table_default_mode mode)
+{
+	struct ib_roce_gid_table **ports_table =
+		READ_ONCE(ib_dev->cache.roce_gid_table);
+	union ib_gid gid;
+	struct ib_gid_attr gid_attr;
+	struct ib_roce_gid_table *table;
+
+	/* make sure we read the ports_table */
+	smp_rmb();
+
+	if (!ports_table)
+		return;
+
+	table  = ports_table[port - rdma_start_port(ib_dev)];
+
+	if (!table)
+		return;
+
+	make_default_gid(ndev, &gid);
+	memset(&gid_attr, 0, sizeof(gid_attr));
+	gid_attr.ndev = ndev;
+	mutex_lock(&table->lock);
+	if (gid_type_mask) {
+		int ix;
+		union ib_gid current_gid;
+		struct ib_gid_attr current_gid_attr;
+
+		ix = find_gid(table, &gid, &gid_attr, true,
+			      GID_ATTR_FIND_MASK_DEFAULT);
+
+		if (ix < 0) {
+			pr_warn("roce_gid_table: couldn't find index for default gid\n");
+			goto unlock_mutex;
+		}
+
+		if (!roce_gid_table_get_gid(ib_dev, port, ix,
+					    &current_gid, &current_gid_attr) &&
+		    mode == ROCE_GID_TABLE_DEFAULT_MODE_SET &&
+		    !memcmp(&gid, &current_gid, sizeof(gid)) &&
+		    !memcmp(&gid_attr, &current_gid_attr, sizeof(gid_attr)))
+			goto unlock_mutex;
+
+		if ((memcmp(&current_gid, &zgid, sizeof(current_gid)) ||
+		     memcmp(&current_gid_attr, &zattr,
+			    sizeof(current_gid_attr))) &&
+		    write_gid(ib_dev, port, table, ix, &zgid, &zattr, true)) {
+			pr_warn("roce_gid_table: can't delete index %d for default gid %pI6\n",
+				ix, gid.raw);
+			goto unlock_mutex;
+		}
+
+		if (mode == ROCE_GID_TABLE_DEFAULT_MODE_SET)
+			if (write_gid(ib_dev, port, table, ix, &gid, &gid_attr,
+				      true))
+				pr_warn("roce_gid_table: unable to add default gid %pI6\n",
+					gid.raw);
+	}
+
+unlock_mutex:
+	mutex_unlock(&table->lock);
+}
+
+static int roce_gid_table_reserve_default(struct ib_device *ib_dev, u8 port,
+					  struct ib_roce_gid_table *table)
+{
+	unsigned long roce_gid_type_mask;
+
+	roce_gid_type_mask = roce_gid_type_mask_support(ib_dev, port);
+	if (roce_gid_type_mask) {
+		struct ib_roce_gid_table_entry *entry =
+			&table->data_vec[0];
+
+		entry->default_gid = true;
+	}
+
+	return 0;
+}
+
 static int roce_gid_table_setup_one(struct ib_device *ib_dev)
 {
 	u8 port;
@@ -440,6 +568,12 @@ static int roce_gid_table_setup_one(struct ib_device *ib_dev)
 			err = -ENOMEM;
 			goto rollback_table_setup;
 		}
+
+		err = roce_gid_table_reserve_default(ib_dev,
+						     port + rdma_start_port(ib_dev),
+						     table[port]);
+		if (err)
+			goto rollback_table_setup;
 	}
 
 	ib_dev->cache.roce_gid_table = table;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 05dcfad..1f918b0 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -75,6 +75,7 @@ struct ib_roce_gid_table_entry {
 	union ib_gid        gid;
 	struct ib_gid_attr  attr;
 	void		   *context;
+	bool		    default_gid;
 };
 
 struct ib_roce_gid_table {
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* [PATCH for-next V5 06/12] net: Add info for NETDEV_CHANGEUPPER event
       [not found] ` <1433772735-22416-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (4 preceding siblings ...)
  2015-06-08 14:12   ` [PATCH for-next V5 05/12] IB/core: Add default GID for RoCE GID table Matan Barak
@ 2015-06-08 14:12   ` Matan Barak
  2015-06-08 14:12   ` [PATCH for-next V5 07/12] IB/core: Add RoCE table bonding support Matan Barak
                     ` (6 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Matan Barak @ 2015-06-08 14:12 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Moni Shoua, Jason Gunthorpe, Sean Hefty,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Consumers of the NETDEV_CHANGEUPPER event sometimes want
to know which upper device was linked or unlinked and which
operation was carried out. Add this extra information to the
notifier info block.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
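A notifier callback only receives the generic info pointer, so a consumer has
to step back from the embedded member to the enclosing block, as the
roce_gid_mgmt code later in this series does with container_of(). The
following is a userspace sketch of that pattern under stated assumptions: the
sketch_* types are simplified stand-ins for the kernel structures, and
sketch_linked_upper() is a hypothetical consumer, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of(). */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct sketch_net_device {
	const char *name;
};

/* Stand-in for struct netdev_notifier_info. */
struct sketch_notifier_info {
	struct sketch_net_device *dev;
};

enum sketch_changeupper_event {
	SKETCH_CHANGEUPPER_LINK,
	SKETCH_CHANGEUPPER_UNLINK,
};

/* Mirrors the layout of struct netdev_changeupper_info. */
struct sketch_changeupper_info {
	struct sketch_notifier_info info;	/* must be first */
	enum sketch_changeupper_event event;
	struct sketch_net_device *upper;
};

/*
 * Hypothetical consumer: given the generic info pointer handed to a
 * notifier, return the upper device if it was just linked, else NULL.
 */
static struct sketch_net_device *
sketch_linked_upper(struct sketch_notifier_info *ptr)
{
	struct sketch_changeupper_info *cu;

	assert(ptr);
	cu = sketch_container_of(ptr, struct sketch_changeupper_info, info);
	return cu->event == SKETCH_CHANGEUPPER_LINK ? cu->upper : NULL;
}
```

The cast is only valid when the notifier event is NETDEV_CHANGEUPPER, since
other events pass a different enclosing structure.
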
 include/linux/netdevice.h | 14 ++++++++++++++
 net/core/dev.c            | 12 ++++++++++--
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 05b9a69..6cd142a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3553,6 +3553,20 @@ struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
 struct sk_buff *skb_mac_gso_segment(struct sk_buff *skb,
 				    netdev_features_t features);
 
+enum netdev_changeupper_event {
+	NETDEV_CHANGEUPPER_LINK,
+	NETDEV_CHANGEUPPER_UNLINK,
+};
+
+struct netdev_changeupper_info {
+	struct netdev_notifier_info	info; /* must be first */
+	enum netdev_changeupper_event	event;
+	struct net_device		*upper;
+};
+
+void netdev_changeupper_info_change(struct net_device *dev,
+				    struct netdev_changeupper_info *info);
+
 struct netdev_bonding_info {
 	ifslave	slave;
 	ifbond	master;
diff --git a/net/core/dev.c b/net/core/dev.c
index 2c1c67f..ba73be4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5198,6 +5198,7 @@ static int __netdev_upper_dev_link(struct net_device *dev,
 				   void *private)
 {
 	struct netdev_adjacent *i, *j, *to_i, *to_j;
+	struct netdev_changeupper_info changeupper_info;
 	int ret = 0;
 
 	ASSERT_RTNL();
@@ -5253,7 +5254,10 @@ static int __netdev_upper_dev_link(struct net_device *dev,
 			goto rollback_lower_mesh;
 	}
 
-	call_netdevice_notifiers(NETDEV_CHANGEUPPER, dev);
+	changeupper_info.event = NETDEV_CHANGEUPPER_LINK;
+	changeupper_info.upper = upper_dev;
+	call_netdevice_notifiers_info(NETDEV_CHANGEUPPER, dev,
+				      &changeupper_info.info);
 	return 0;
 
 rollback_lower_mesh:
@@ -5349,6 +5353,7 @@ void netdev_upper_dev_unlink(struct net_device *dev,
 			     struct net_device *upper_dev)
 {
 	struct netdev_adjacent *i, *j;
+	struct netdev_changeupper_info changeupper_info;
 	ASSERT_RTNL();
 
 	__netdev_adjacent_dev_unlink_neighbour(dev, upper_dev);
@@ -5370,7 +5375,10 @@ void netdev_upper_dev_unlink(struct net_device *dev,
 	list_for_each_entry(i, &upper_dev->all_adj_list.upper, list)
 		__netdev_adjacent_dev_unlink(dev, i->dev);
 
-	call_netdevice_notifiers(NETDEV_CHANGEUPPER, dev);
+	changeupper_info.event = NETDEV_CHANGEUPPER_UNLINK;
+	changeupper_info.upper = upper_dev;
+	call_netdevice_notifiers_info(NETDEV_CHANGEUPPER, dev,
+				      &changeupper_info.info);
 }
 EXPORT_SYMBOL(netdev_upper_dev_unlink);
 
-- 
2.1.0


* [PATCH for-next V5 07/12] IB/core: Add RoCE table bonding support
       [not found] ` <1433772735-22416-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (5 preceding siblings ...)
  2015-06-08 14:12   ` [PATCH for-next V5 06/12] net: Add info for NETDEV_CHANGEUPPER event Matan Barak
@ 2015-06-08 14:12   ` Matan Barak
       [not found]     ` <1433772735-22416-8-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-08 14:12   ` [PATCH for-next V5 08/12] IB/core: ib_cache routines should use roce_gid_table when needed Matan Barak
                     ` (5 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Matan Barak @ 2015-06-08 14:12 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Moni Shoua, Jason Gunthorpe, Sean Hefty,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Bonding needs special handling: when working in
active-backup mode, only the currently selected slave
should occupy the default GIDs and the master's GID.
Listen to bonding events and add the required GIDs
only for the active slave in the RoCE GID table.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
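The filters in this patch classify a port's netdev as the active slave, an
inactive slave, or not applicable, and each caller passes a bitmask of the
states it accepts. Below is a simplified userspace sketch of that
classify-then-mask idea. struct sketch_bond and the sketch_* helpers are
assumptions made for illustration; the real code obtains the active slave
through bond_option_active_slave_get_rcu() under RCU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same bit layout as enum bonding_slave_state in the patch. */
enum sketch_slave_state {
	SKETCH_SLAVE_ACTIVE	= 1UL << 0,
	SKETCH_SLAVE_INACTIVE	= 1UL << 1,
	SKETCH_SLAVE_NA		= 1UL << 2,
};

/* Simplified model of a bond: mode flag plus the current active slave. */
struct sketch_bond {
	bool uses_primary;	/* true for active-backup style modes */
	const void *active;	/* currently active slave, or NULL */
};

/* Classify a slave; anything without a known active slave is "NA". */
static enum sketch_slave_state
sketch_slave_state(const void *slave, const struct sketch_bond *bond)
{
	if (bond && bond->uses_primary && bond->active)
		return slave == bond->active ? SKETCH_SLAVE_ACTIVE :
					       SKETCH_SLAVE_INACTIVE;
	return SKETCH_SLAVE_NA;
}

/* A port may hold GIDs if its state is one of the allowed bits. */
static bool sketch_may_hold_gids(const void *slave,
				 const struct sketch_bond *bond,
				 unsigned long allowed)
{
	assert(allowed != 0);	/* an empty mask would reject every port */
	return (sketch_slave_state(slave, bond) & allowed) != 0;
}
```

Passing SKETCH_SLAVE_ACTIVE | SKETCH_SLAVE_NA corresponds to
is_eth_port_of_netdev(): ports that are not part of a primary-based bond keep
their GIDs, while inactive slaves are filtered out.
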
 drivers/infiniband/core/roce_gid_mgmt.c | 327 ++++++++++++++++++++++++++++++--
 drivers/net/bonding/bond_options.c      |  13 --
 include/net/bonding.h                   |   7 +
 3 files changed, 314 insertions(+), 33 deletions(-)

diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c
index 6dcd1c7..b019d4e 100644
--- a/drivers/infiniband/core/roce_gid_mgmt.c
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -37,6 +37,7 @@
 
 /* For in6_dev_get/in6_dev_put */
 #include <net/addrconf.h>
+#include <net/bonding.h>
 
 #include <rdma/ib_cache.h>
 #include <rdma/ib_addr.h>
@@ -55,16 +56,17 @@ struct  update_gid_event_work {
 	enum gid_op_type gid_op;
 };
 
-#define ROCE_NETDEV_CALLBACK_SZ		2
+#define ROCE_NETDEV_CALLBACK_SZ		3
 struct netdev_event_work_cmd {
 	roce_netdev_callback	cb;
 	roce_netdev_filter	filter;
+	struct net_device	*ndev;
+	struct net_device	*f_ndev;
 };
 
 struct netdev_event_work {
 	struct work_struct		work;
 	struct netdev_event_work_cmd	cmds[ROCE_NETDEV_CALLBACK_SZ];
-	struct net_device		*ndev;
 };
 
 unsigned long roce_gid_type_mask_support(struct ib_device *ib_dev, u8 port)
@@ -92,22 +94,96 @@ static void update_gid(enum gid_op_type gid_op, struct ib_device *ib_dev,
 	}
 }
 
+#define IS_NETDEV_BONDING_MASTER(ndev)	\
+	(((ndev)->priv_flags & IFF_BONDING) && \
+	 ((ndev)->flags & IFF_MASTER))
+
+enum bonding_slave_state {
+	BONDING_SLAVE_STATE_ACTIVE	= 1UL << 0,
+	BONDING_SLAVE_STATE_INACTIVE	= 1UL << 1,
+	BONDING_SLAVE_STATE_NA		= 1UL << 2,
+};
+
+static enum bonding_slave_state is_eth_active_slave_of_bonding(struct net_device *idev,
+							       struct net_device *upper)
+{
+	if (upper && IS_NETDEV_BONDING_MASTER(upper)) {
+		struct net_device *pdev;
+
+		rcu_read_lock();
+		pdev = bond_option_active_slave_get_rcu(netdev_priv(upper));
+		rcu_read_unlock();
+		if (pdev)
+			return idev == pdev ? BONDING_SLAVE_STATE_ACTIVE :
+				BONDING_SLAVE_STATE_INACTIVE;
+	}
+
+	return BONDING_SLAVE_STATE_NA;
+}
+
+static bool is_upper_dev_rcu(struct net_device *dev, struct net_device *upper)
+{
+	struct net_device *_upper = NULL;
+	struct list_head *iter;
+
+	rcu_read_lock();
+	netdev_for_each_all_upper_dev_rcu(dev, _upper, iter) {
+		if (_upper == upper)
+			break;
+	}
+
+	rcu_read_unlock();
+	return _upper == upper;
+}
+
+static int _is_eth_port_of_netdev(struct ib_device *ib_dev, u8 port,
+				  struct net_device *idev, void *cookie,
+				  unsigned long bond_state)
+{
+	struct net_device *ndev = (struct net_device *)cookie;
+	struct net_device *rdev;
+	int res;
+
+	if (!idev)
+		return 0;
+
+	rcu_read_lock();
+	rdev = rdma_vlan_dev_real_dev(ndev);
+	if (!rdev)
+		rdev = ndev;
+
+	res = ((is_upper_dev_rcu(idev, ndev) &&
+	       (is_eth_active_slave_of_bonding(idev, rdev) &
+		bond_state)) ||
+	       rdev == idev);
+
+	rcu_read_unlock();
+	return res;
+}
+
 static int is_eth_port_of_netdev(struct ib_device *ib_dev, u8 port,
 				 struct net_device *idev, void *cookie)
 {
-	struct net_device *rdev;
-	struct net_device *mdev;
-	struct net_device *ndev = (struct net_device *)cookie;
+	return _is_eth_port_of_netdev(ib_dev, port, idev, cookie,
+				      BONDING_SLAVE_STATE_ACTIVE |
+				      BONDING_SLAVE_STATE_NA);
+}
 
+static int is_eth_port_inactive_slave(struct ib_device *ib_dev, u8 port,
+				      struct net_device *idev, void *cookie)
+{
+	struct net_device *mdev;
+	int res;
 	if (!idev)
 		return 0;
 
 	rcu_read_lock();
 	mdev = netdev_master_upper_dev_get_rcu(idev);
-	rdev = rdma_vlan_dev_real_dev(ndev);
+	res = is_eth_active_slave_of_bonding(idev, mdev) ==
+		BONDING_SLAVE_STATE_INACTIVE;
 	rcu_read_unlock();
 
-	return (rdev ? rdev : ndev) == (mdev ? mdev : idev);
+	return res;
 }
 
 static int pass_all_filter(struct ib_device *ib_dev, u8 port,
@@ -116,6 +192,34 @@ static int pass_all_filter(struct ib_device *ib_dev, u8 port,
 	return 1;
 }
 
+static int upper_device_filter(struct ib_device *ib_dev, u8 port,
+			       struct net_device *idev, void *cookie)
+{
+	struct net_device *ndev = (struct net_device *)cookie;
+
+	return idev == ndev || is_upper_dev_rcu(idev, ndev);
+}
+
+static int bonding_slaves_filter(struct ib_device *ib_dev, u8 port,
+				 struct net_device *idev, void *cookie)
+{
+	struct net_device *rdev;
+	struct net_device *ndev = (struct net_device *)cookie;
+	int res;
+
+	rdev = rdma_vlan_dev_real_dev(ndev);
+
+	ndev = rdev ? rdev : ndev;
+	if (!idev || !IS_NETDEV_BONDING_MASTER(ndev))
+		return 0;
+
+	rcu_read_lock();
+	res = is_upper_dev_rcu(idev, ndev);
+	rcu_read_unlock();
+
+	return res;
+}
+
 static void update_gid_ip(enum gid_op_type gid_op,
 			  struct ib_device *ib_dev,
 			  u8 port, struct net_device *ndev,
@@ -137,8 +241,16 @@ static void enum_netdev_default_gids(struct ib_device *ib_dev,
 {
 	unsigned long gid_type_mask;
 
-	if (idev != ndev)
+	rcu_read_lock();
+	if (!idev ||
+	    ((idev != ndev && !is_upper_dev_rcu(idev, ndev)) ||
+	     is_eth_active_slave_of_bonding(idev,
+					    netdev_master_upper_dev_get_rcu(idev)) ==
+	     BONDING_SLAVE_STATE_INACTIVE)) {
+		rcu_read_unlock();
 		return;
+	}
+	rcu_read_unlock();
 
 	gid_type_mask = roce_gid_type_mask_support(ib_dev, port);
 
@@ -146,6 +258,37 @@ static void enum_netdev_default_gids(struct ib_device *ib_dev,
 				       ROCE_GID_TABLE_DEFAULT_MODE_SET);
 }
 
+static void bond_delete_netdev_default_gids(struct ib_device *ib_dev,
+					    u8 port, struct net_device *ndev,
+					    struct net_device *idev)
+{
+	struct net_device *rdev = rdma_vlan_dev_real_dev(ndev);
+
+	if (!idev)
+		return;
+
+	if (!rdev)
+		rdev = ndev;
+
+	rcu_read_lock();
+
+	if (is_upper_dev_rcu(idev, ndev) &&
+	    is_eth_active_slave_of_bonding(idev, rdev) ==
+	    BONDING_SLAVE_STATE_INACTIVE) {
+		unsigned long gid_type_mask;
+
+		rcu_read_unlock();
+
+		gid_type_mask = roce_gid_type_mask_support(ib_dev, port);
+
+		roce_gid_table_set_default_gid(ib_dev, port, idev,
+					       gid_type_mask,
+					       ROCE_GID_TABLE_DEFAULT_MODE_DELETE);
+	} else {
+		rcu_read_unlock();
+	}
+}
+
 static void enum_netdev_ipv4_ips(struct ib_device *ib_dev,
 				 u8 port, struct net_device *ndev)
 {
@@ -221,16 +364,22 @@ static void enum_netdev_ipv6_ips(struct ib_device *ib_dev,
 }
 #endif
 
+static void _add_netdev_ips(struct ib_device *ib_dev, u8 port,
+			    struct net_device *ndev)
+{
+	enum_netdev_ipv4_ips(ib_dev, port, ndev);
+#if IS_ENABLED(CONFIG_IPV6)
+	enum_netdev_ipv6_ips(ib_dev, port, ndev);
+#endif
+}
+
 static void add_netdev_ips(struct ib_device *ib_dev, u8 port,
 			   struct net_device *idev, void *cookie)
 {
 	struct net_device *ndev = (struct net_device *)cookie;
 
 	enum_netdev_default_gids(ib_dev, port, ndev, idev);
-	enum_netdev_ipv4_ips(ib_dev, port, ndev);
-#if IS_ENABLED(CONFIG_IPV6)
-	enum_netdev_ipv6_ips(ib_dev, port, ndev);
-#endif
+	_add_netdev_ips(ib_dev, port, ndev);
 }
 
 static void del_netdev_ips(struct ib_device *ib_dev, u8 port,
@@ -284,6 +433,92 @@ static void callback_for_addr_gid_device_scan(struct ib_device *device,
 			  &parsed->gid_attr);
 }
 
+static void handle_netdev_upper(struct ib_device *ib_dev, u8 port,
+				void *cookie,
+				void (*handle_netdev)(struct ib_device *ib_dev,
+						      u8 port,
+						      struct net_device *ndev))
+{
+	struct net_device *ndev = (struct net_device *)cookie;
+	struct upper_list {
+		struct list_head list;
+		struct net_device *upper;
+	};
+	struct net_device *upper;
+	struct list_head *iter;
+	struct upper_list *upper_iter;
+	struct upper_list *upper_temp;
+	LIST_HEAD(upper_list);
+
+	rcu_read_lock();
+	netdev_for_each_all_upper_dev_rcu(ndev, upper, iter) {
+		struct upper_list *entry = kmalloc(sizeof(*entry),
+						   GFP_ATOMIC);
+
+		if (!entry) {
+			pr_info("roce_gid_mgmt: couldn't allocate entry to delete ndev\n");
+			continue;
+		}
+
+		list_add_tail(&entry->list, &upper_list);
+		dev_hold(upper);
+		entry->upper = upper;
+	}
+	rcu_read_unlock();
+
+	handle_netdev(ib_dev, port, ndev);
+	list_for_each_entry_safe(upper_iter, upper_temp, &upper_list,
+				 list) {
+		handle_netdev(ib_dev, port, upper_iter->upper);
+		dev_put(upper_iter->upper);
+		list_del(&upper_iter->list);
+		kfree(upper_iter);
+	}
+}
+
+static void _roce_del_all_netdev_gids(struct ib_device *ib_dev, u8 port,
+				      struct net_device *ndev)
+{
+	roce_del_all_netdev_gids(ib_dev, port, ndev);
+}
+
+static void del_netdev_upper_ips(struct ib_device *ib_dev, u8 port,
+				 struct net_device *idev, void *cookie)
+{
+	handle_netdev_upper(ib_dev, port, cookie, _roce_del_all_netdev_gids);
+}
+
+static void add_netdev_upper_ips(struct ib_device *ib_dev, u8 port,
+				 struct net_device *idev, void *cookie)
+{
+	handle_netdev_upper(ib_dev, port, cookie, _add_netdev_ips);
+}
+
+static void del_netdev_default_ips_join(struct ib_device *ib_dev, u8 port,
+					struct net_device *idev, void *cookie)
+{
+	struct net_device *mdev;
+
+	rcu_read_lock();
+	mdev = netdev_master_upper_dev_get_rcu(idev);
+	if (mdev)
+		dev_hold(mdev);
+	rcu_read_unlock();
+
+	if (mdev) {
+		bond_delete_netdev_default_gids(ib_dev, port, mdev, idev);
+		dev_put(mdev);
+	}
+}
+
+static void del_netdev_default_ips(struct ib_device *ib_dev, u8 port,
+				   struct net_device *idev, void *cookie)
+{
+	struct net_device *ndev = (struct net_device *)cookie;
+
+	bond_delete_netdev_default_gids(ib_dev, port, ndev, idev);
+}
+
 /* The following functions operate on all IB devices. netdevice_event and
  * addr_event execute ib_enum_roce_ports_of_netdev through a work.
  * ib_enum_roce_ports_of_netdev iterates through all IB devices, thus proper
@@ -296,11 +531,15 @@ static void netdevice_event_work_handler(struct work_struct *_work)
 		container_of(_work, struct netdev_event_work, work);
 	unsigned int i;
 
-	for (i = 0; i < ARRAY_SIZE(work->cmds) && work->cmds[i].cb; i++)
-		ib_enum_roce_ports_of_netdev(work->cmds[i].filter, work->ndev,
-					     work->cmds[i].cb, work->ndev);
+	for (i = 0; i < ARRAY_SIZE(work->cmds) && work->cmds[i].cb; i++) {
+		ib_enum_roce_ports_of_netdev(work->cmds[i].filter,
+					     work->cmds[i].f_ndev,
+					     work->cmds[i].cb,
+					     work->cmds[i].ndev);
+		dev_put(work->cmds[i].ndev);
+		dev_put(work->cmds[i].f_ndev);
+	}
 
-	dev_put(work->ndev);
 	kfree(work);
 }
 
@@ -309,11 +548,24 @@ static int netdevice_event(struct notifier_block *this, unsigned long event,
 {
 	static const struct netdev_event_work_cmd add_cmd = {
 		.cb = add_netdev_ips, .filter = is_eth_port_of_netdev};
+	static const struct netdev_event_work_cmd add_cmd_upper_ips = {
+		.cb = add_netdev_upper_ips, .filter = is_eth_port_of_netdev};
 	static const struct netdev_event_work_cmd del_cmd = {
 		.cb = del_netdev_ips, .filter = pass_all_filter};
+	static const struct netdev_event_work_cmd bonding_default_del_cmd_join = {
+		.cb = del_netdev_default_ips_join, .filter = is_eth_port_inactive_slave};
+	static const struct netdev_event_work_cmd bonding_default_del_cmd = {
+		.cb = del_netdev_default_ips, .filter = is_eth_port_inactive_slave};
+	static const struct netdev_event_work_cmd default_del_cmd = {
+		.cb = del_netdev_default_ips, .filter = pass_all_filter};
+	static const struct netdev_event_work_cmd bonding_event_ips_del_cmd = {
+		.cb = del_netdev_upper_ips, .filter = bonding_slaves_filter};
+	static const struct netdev_event_work_cmd upper_ips_del_cmd = {
+		.cb = del_netdev_upper_ips, .filter = upper_device_filter};
 	struct net_device *ndev = netdev_notifier_info_to_dev(ptr);
 	struct netdev_event_work *ndev_work;
 	struct netdev_event_work_cmd cmds[ROCE_NETDEV_CALLBACK_SZ] = { {NULL} };
+	unsigned int i;
 
 	if (ndev->type != ARPHRD_ETHER)
 		return NOTIFY_DONE;
@@ -321,7 +573,8 @@ static int netdevice_event(struct notifier_block *this, unsigned long event,
 	switch (event) {
 	case NETDEV_REGISTER:
 	case NETDEV_UP:
-		cmds[0] = add_cmd;
+		cmds[0] = bonding_default_del_cmd_join;
+		cmds[1] = add_cmd;
 		break;
 
 	case NETDEV_UNREGISTER:
@@ -332,9 +585,37 @@ static int netdevice_event(struct notifier_block *this, unsigned long event,
 		break;
 
 	case NETDEV_CHANGEADDR:
-		cmds[0] = del_cmd;
+		cmds[0] = default_del_cmd;
 		cmds[1] = add_cmd;
 		break;
+
+	case NETDEV_CHANGEUPPER:
+		{
+			struct netdev_changeupper_info *changeupper_info =
+				container_of(ptr, struct netdev_changeupper_info, info);
+
+			if (changeupper_info->event ==
+			    NETDEV_CHANGEUPPER_UNLINK) {
+				cmds[0] = upper_ips_del_cmd;
+				cmds[0].ndev = changeupper_info->upper;
+				cmds[1] = add_cmd;
+			} else if (changeupper_info->event ==
+				   NETDEV_CHANGEUPPER_LINK) {
+				cmds[0] = bonding_default_del_cmd;
+				cmds[0].ndev = changeupper_info->upper;
+				cmds[1] = add_cmd_upper_ips;
+				cmds[1].ndev = changeupper_info->upper;
+				cmds[1].f_ndev = changeupper_info->upper;
+			}
+		}
+		break;
+
+	case NETDEV_BONDING_FAILOVER:
+		cmds[0] = bonding_event_ips_del_cmd;
+		cmds[1] = bonding_default_del_cmd_join;
+		cmds[2] = add_cmd_upper_ips;
+		break;
+
 	default:
 		return NOTIFY_DONE;
 	}
@@ -346,8 +627,14 @@ static int netdevice_event(struct notifier_block *this, unsigned long event,
 	}
 
 	memcpy(ndev_work->cmds, cmds, sizeof(ndev_work->cmds));
-	ndev_work->ndev = ndev;
-	dev_hold(ndev);
+	for (i = 0; i < ARRAY_SIZE(ndev_work->cmds) && ndev_work->cmds[i].cb; i++) {
+		if (!ndev_work->cmds[i].ndev)
+			ndev_work->cmds[i].ndev = ndev;
+		if (!ndev_work->cmds[i].f_ndev)
+			ndev_work->cmds[i].f_ndev = ndev;
+		dev_hold(ndev_work->cmds[i].ndev);
+		dev_hold(ndev_work->cmds[i].f_ndev);
+	}
 	INIT_WORK(&ndev_work->work, netdevice_event_work_handler);
 
 	queue_work(roce_gid_mgmt_wq, &ndev_work->work);
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 4df2894..c4fe29a8 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -689,19 +689,6 @@ static int bond_option_mode_set(struct bonding *bond,
 	return 0;
 }
 
-static struct net_device *__bond_option_active_slave_get(struct bonding *bond,
-							 struct slave *slave)
-{
-	return bond_uses_primary(bond) && slave ? slave->dev : NULL;
-}
-
-struct net_device *bond_option_active_slave_get_rcu(struct bonding *bond)
-{
-	struct slave *slave = rcu_dereference(bond->curr_active_slave);
-
-	return __bond_option_active_slave_get(bond, slave);
-}
-
 static int bond_option_active_slave_set(struct bonding *bond,
 					const struct bond_opt_value *newval)
 {
diff --git a/include/net/bonding.h b/include/net/bonding.h
index 78ed135..81a94ed 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -307,6 +307,13 @@ static inline bool bond_uses_primary(struct bonding *bond)
 	return bond_mode_uses_primary(BOND_MODE(bond));
 }
 
+static inline struct net_device *bond_option_active_slave_get_rcu(struct bonding *bond)
+{
+	struct slave *slave = rcu_dereference(bond->curr_active_slave);
+
+	return bond_uses_primary(bond) && slave ? slave->dev : NULL;
+}
+
 static inline bool bond_slave_is_up(struct slave *slave)
 {
 	return netif_running(slave->dev) && netif_carrier_ok(slave->dev);
-- 
2.1.0


* [PATCH for-next V5 08/12] IB/core: ib_cache routines should use roce_gid_table when needed
       [not found] ` <1433772735-22416-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (6 preceding siblings ...)
  2015-06-08 14:12   ` [PATCH for-next V5 07/12] IB/core: Add RoCE table bonding support Matan Barak
@ 2015-06-08 14:12   ` Matan Barak
  2015-06-08 14:12   ` [PATCH for-next V5 09/12] net/mlx4: Postpone the registration of net_device Matan Barak
                     ` (4 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Matan Barak @ 2015-06-08 14:12 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Moni Shoua, Jason Gunthorpe, Sean Hefty,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

When a port uses roce_gid_table, the following functions
(a) ib_find_cached_gid
(b) ib_get_cached_gid

should query the gid table accordingly.

In order to query it, roce_gid_table is initialized
when needed.
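
The dispatch order described above can be sketched as a small userspace
model. This is only an illustration under stated assumptions: struct
model_port, model_lookup() and model_get_cached_gid() are hypothetical
names, GIDs are reduced to plain ints, and the -ENOENT returned for a bad
index is an assumption about the table lookup, not a documented value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one port; not the kernel's struct ib_device. */
struct model_port {
	bool is_roce;           /* models rdma_protocol_roce() */
	bool has_roce_table;    /* models device->cache.roce_gid_table != NULL */
	const int *roce_gids;   /* toy RoCE GID table (ints, not union ib_gid) */
	const int *cached_gids; /* toy legacy per-port GID cache */
	int table_len;
};

/* Bounds-checked read from either table. */
static int model_lookup(const int *table, int len, int index, int *gid)
{
	if (!table || index < 0 || index >= len)
		return -ENOENT;
	*gid = table[index];
	return 0;
}

/*
 * Same ordering as the new ib_get_cached_gid():
 *   1. RoCE port with a GID table -> read the RoCE GID table
 *   2. RoCE port without a table  -> -EAGAIN
 *   3. anything else              -> read the legacy cache
 */
int model_get_cached_gid(const struct model_port *port, int index, int *gid)
{
	assert(port != NULL && gid != NULL);

	if (port->is_roce && port->has_roce_table)
		return model_lookup(port->roce_gids, port->table_len, index, gid);
	if (port->is_roce)
		return -EAGAIN;
	return model_lookup(port->cached_gids, port->table_len, index, gid);
}
```

The point of the ordering is that a RoCE port never falls through to the
legacy cache, so callers can treat -EAGAIN as "table not ready yet".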

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/cache.c  | 210 +++++++++++++++++++++++++++++----------
 drivers/infiniband/core/device.c |  17 +++-
 include/rdma/ib_verbs.h          |  21 +++-
 3 files changed, 193 insertions(+), 55 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 871da83..217e639 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -58,64 +58,133 @@ struct ib_update_work {
 	u8                 port_num;
 };
 
-int ib_get_cached_gid(struct ib_device *device,
-		      u8                port_num,
-		      int               index,
-		      union ib_gid     *gid)
+static int __ib_get_cached_gid(struct ib_device *device,
+			       u8                port_num,
+			       int               index,
+			       union ib_gid     *gid)
 {
 	struct ib_gid_cache *cache;
 	unsigned long flags;
-	int ret = 0;
+	int ret = -ENOENT;
 
 	if (port_num < rdma_start_port(device) || port_num > rdma_end_port(device))
 		return -EINVAL;
+	if (!device->cache.gid_cache)
+		return -ENOENT;
 
 	read_lock_irqsave(&device->cache.lock, flags);
 
 	cache = device->cache.gid_cache[port_num - rdma_start_port(device)];
-
-	if (index < 0 || index >= cache->table_len)
-		ret = -EINVAL;
-	else
+	if (cache && index >= 0 && index < cache->table_len) {
 		*gid = cache->table[index];
+		ret = 0;
+	}
 
 	read_unlock_irqrestore(&device->cache.lock, flags);
-
 	return ret;
 }
+
+int ib_get_cached_gid(struct ib_device *device,
+		      u8                port_num,
+		      int               index,
+		      union ib_gid     *gid)
+{
+	if (port_num < rdma_start_port(device) || port_num > rdma_end_port(device))
+		return -EINVAL;
+
+	if (rdma_cap_roce_gid_table(device, port_num))
+		return roce_gid_table_get_gid(device, port_num, index, gid,
+					      NULL);
+
+	if (rdma_protocol_roce(device, port_num))
+		return -EAGAIN;
+
+	return __ib_get_cached_gid(device, port_num, index, gid);
+}
 EXPORT_SYMBOL(ib_get_cached_gid);
 
-int ib_find_cached_gid(struct ib_device   *device,
-		       const union ib_gid *gid,
-		       u8                 *port_num,
-		       u16                *index)
+static int ___ib_find_cached_gid_by_port(struct ib_device *device,
+					 u8               port_num,
+					 const union ib_gid *gid,
+					 u16              *index)
 {
 	struct ib_gid_cache *cache;
+	u8 p = port_num - rdma_start_port(device);
+	int i;
+
+	if (port_num < rdma_start_port(device) || port_num > rdma_end_port(device))
+		return -EINVAL;
+	if (rdma_cap_roce_gid_table(device, port_num))
+		return -EPROTONOSUPPORT;
+	if (!device->cache.gid_cache)
+		return -ENOENT;
+
+	cache = device->cache.gid_cache[p];
+	if (!cache)
+		return -ENOENT;
+
+	for (i = 0; i < cache->table_len; ++i) {
+		if (!memcmp(gid, &cache->table[i], sizeof(*gid))) {
+			if (index)
+				*index = i;
+			return 0;
+		}
+	}
+
+	return -ENOENT;
+}
+
+static int __ib_find_cached_gid(struct ib_device *device,
+				const union ib_gid *gid,
+				u8               *port_num,
+				u16              *index)
+{
 	unsigned long flags;
-	int p, i;
+	u16 found_index;
+	int p;
 	int ret = -ENOENT;
 
-	*port_num = -1;
+	if (port_num)
+		*port_num = -1;
 	if (index)
 		*index = -1;
 
 	read_lock_irqsave(&device->cache.lock, flags);
 
-	for (p = 0; p <= rdma_end_port(device) - rdma_start_port(device); ++p) {
-		cache = device->cache.gid_cache[p];
-		for (i = 0; i < cache->table_len; ++i) {
-			if (!memcmp(gid, &cache->table[i], sizeof *gid)) {
-				*port_num = p + rdma_start_port(device);
-				if (index)
-					*index = i;
-				ret = 0;
-				goto found;
-			}
+	for (p = rdma_start_port(device); p <= rdma_end_port(device); ++p) {
+		if (!___ib_find_cached_gid_by_port(device, p, gid,
+						   &found_index)) {
+			if (port_num)
+				*port_num = p;
+			ret = 0;
+			break;
 		}
 	}
-found:
+
 	read_unlock_irqrestore(&device->cache.lock, flags);
 
+	if (!ret && index)
+		*index = found_index;
+
+	return ret;
+}
+
+int ib_find_cached_gid(struct ib_device *device,
+		       const union ib_gid *gid,
+		       u8               *port_num,
+		       u16              *index)
+{
+	int ret = -ENOENT;
+
+	/* Look for a RoCE device with the specified GID. */
+	if (device->cache.roce_gid_table)
+		ret = roce_gid_table_find_gid(device, gid, NULL,
+					      port_num, index);
+
+	/* If no RoCE devices with the specified GID, look for IB device. */
+	if (ret)
+		ret = __ib_find_cached_gid(device, gid, port_num, index);
+
 	return ret;
 }
 EXPORT_SYMBOL(ib_find_cached_gid);
@@ -127,22 +196,23 @@ int ib_get_cached_pkey(struct ib_device *device,
 {
 	struct ib_pkey_cache *cache;
 	unsigned long flags;
-	int ret = 0;
+	int ret = -ENOENT;
 
 	if (port_num < rdma_start_port(device) || port_num > rdma_end_port(device))
 		return -EINVAL;
 
+	if (!device->cache.pkey_cache)
+		return -ENOENT;
+
 	read_lock_irqsave(&device->cache.lock, flags);
 
 	cache = device->cache.pkey_cache[port_num - rdma_start_port(device)];
-
-	if (index < 0 || index >= cache->table_len)
-		ret = -EINVAL;
-	else
+	if (cache && index >= 0 && index < cache->table_len) {
 		*pkey = cache->table[index];
+		ret = 0;
+	}
 
 	read_unlock_irqrestore(&device->cache.lock, flags);
-
 	return ret;
 }
 EXPORT_SYMBOL(ib_get_cached_pkey);
@@ -161,9 +231,14 @@ int ib_find_cached_pkey(struct ib_device *device,
 	if (port_num < rdma_start_port(device) || port_num > rdma_end_port(device))
 		return -EINVAL;
 
+	if (!device->cache.pkey_cache)
+		return -ENOENT;
+
 	read_lock_irqsave(&device->cache.lock, flags);
 
 	cache = device->cache.pkey_cache[port_num - rdma_start_port(device)];
+	if (!cache)
+		goto out;
 
 	*index = -1;
 
@@ -182,8 +257,8 @@ int ib_find_cached_pkey(struct ib_device *device,
 		ret = 0;
 	}
 
+out:
 	read_unlock_irqrestore(&device->cache.lock, flags);
-
 	return ret;
 }
 EXPORT_SYMBOL(ib_find_cached_pkey);
@@ -201,9 +276,14 @@ int ib_find_exact_cached_pkey(struct ib_device *device,
 	if (port_num < rdma_start_port(device) || port_num > rdma_end_port(device))
 		return -EINVAL;
 
+	if (!device->cache.pkey_cache)
+		return -ENOENT;
+
 	read_lock_irqsave(&device->cache.lock, flags);
 
 	cache = device->cache.pkey_cache[port_num - rdma_start_port(device)];
+	if (!cache)
+		goto out;
 
 	*index = -1;
 
@@ -213,9 +293,8 @@ int ib_find_exact_cached_pkey(struct ib_device *device,
 			ret = 0;
 			break;
 		}
-
+out:
 	read_unlock_irqrestore(&device->cache.lock, flags);
-
 	return ret;
 }
 EXPORT_SYMBOL(ib_find_exact_cached_pkey);
@@ -225,13 +304,16 @@ int ib_get_cached_lmc(struct ib_device *device,
 		      u8                *lmc)
 {
 	unsigned long flags;
-	int ret = 0;
+	int ret = -ENOENT;
 
 	if (port_num < rdma_start_port(device) || port_num > rdma_end_port(device))
 		return -EINVAL;
 
 	read_lock_irqsave(&device->cache.lock, flags);
-	*lmc = device->cache.lmc_cache[port_num - rdma_start_port(device)];
+	if (device->cache.lmc_cache) {
+		*lmc = device->cache.lmc_cache[port_num - rdma_start_port(device)];
+		ret = 0;
+	}
 	read_unlock_irqrestore(&device->cache.lock, flags);
 
 	return ret;
@@ -243,9 +325,18 @@ static void ib_cache_update(struct ib_device *device,
 {
 	struct ib_port_attr       *tprops = NULL;
 	struct ib_pkey_cache      *pkey_cache = NULL, *old_pkey_cache;
-	struct ib_gid_cache       *gid_cache = NULL, *old_gid_cache;
+	struct ib_gid_cache       *gid_cache = NULL, *old_gid_cache = NULL;
 	int                        i;
 	int                        ret;
+	bool			   use_roce_gid_table =
+					rdma_cap_roce_gid_table(device, port);
+
+	if (port < rdma_start_port(device) || port > rdma_end_port(device))
+		return;
+
+	if (!(device->cache.pkey_cache && device->cache.gid_cache &&
+	      device->cache.lmc_cache))
+		return;
 
 	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
 	if (!tprops)
@@ -265,12 +356,14 @@ static void ib_cache_update(struct ib_device *device,
 
 	pkey_cache->table_len = tprops->pkey_tbl_len;
 
-	gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len *
-			    sizeof *gid_cache->table, GFP_KERNEL);
-	if (!gid_cache)
-		goto err;
+	if (!use_roce_gid_table) {
+		gid_cache = kmalloc(sizeof(*gid_cache) + tprops->gid_tbl_len *
+			    sizeof(*gid_cache->table), GFP_KERNEL);
+		if (!gid_cache)
+			goto err;
 
-	gid_cache->table_len = tprops->gid_tbl_len;
+		gid_cache->table_len = tprops->gid_tbl_len;
+	}
 
 	for (i = 0; i < pkey_cache->table_len; ++i) {
 		ret = ib_query_pkey(device, port, i, pkey_cache->table + i);
@@ -281,22 +374,28 @@ static void ib_cache_update(struct ib_device *device,
 		}
 	}
 
-	for (i = 0; i < gid_cache->table_len; ++i) {
-		ret = ib_query_gid(device, port, i, gid_cache->table + i);
-		if (ret) {
-			printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n",
-			       ret, device->name, i);
-			goto err;
+	if (!use_roce_gid_table) {
+		for (i = 0;  i < gid_cache->table_len; ++i) {
+			ret = ib_query_gid(device, port, i,
+					   gid_cache->table + i);
+			if (ret) {
+				printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n",
+				       ret, device->name, i);
+				goto err;
+			}
 		}
 	}
 
 	write_lock_irq(&device->cache.lock);
 
 	old_pkey_cache = device->cache.pkey_cache[port - rdma_start_port(device)];
-	old_gid_cache  = device->cache.gid_cache [port - rdma_start_port(device)];
+	if (!use_roce_gid_table)
+		old_gid_cache  =
+			device->cache.gid_cache[port - rdma_start_port(device)];
 
 	device->cache.pkey_cache[port - rdma_start_port(device)] = pkey_cache;
-	device->cache.gid_cache [port - rdma_start_port(device)] = gid_cache;
+	if (!use_roce_gid_table)
+		device->cache.gid_cache[port - rdma_start_port(device)] = gid_cache;
 
 	device->cache.lmc_cache[port - rdma_start_port(device)] = tprops->lmc;
 
@@ -392,12 +491,19 @@ err:
 	kfree(device->cache.pkey_cache);
 	kfree(device->cache.gid_cache);
 	kfree(device->cache.lmc_cache);
+	device->cache.pkey_cache = NULL;
+	device->cache.gid_cache = NULL;
+	device->cache.lmc_cache = NULL;
 }
 
 static void ib_cache_cleanup_one(struct ib_device *device)
 {
 	int p;
 
+	if (!(device->cache.pkey_cache && device->cache.gid_cache &&
+	      device->cache.lmc_cache))
+		return;
+
 	ib_unregister_event_handler(&device->cache.event_handler);
 	flush_workqueue(ib_wq);
 
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 84edb9a..f9c6935 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -40,6 +40,7 @@
 #include <linux/mutex.h>
 #include <rdma/rdma_netlink.h>
 #include <rdma/ib_addr.h>
+#include <rdma/ib_cache.h>
 
 #include "core_priv.h"
 
@@ -593,6 +594,10 @@ EXPORT_SYMBOL(ib_query_port);
 int ib_query_gid(struct ib_device *device,
 		 u8 port_num, int index, union ib_gid *gid)
 {
+	if (rdma_cap_roce_gid_table(device, port_num))
+		return roce_gid_table_get_gid(device, port_num, index, gid,
+					      NULL);
+
 	return device->query_gid(device, port_num, index, gid);
 }
 EXPORT_SYMBOL(ib_query_gid);
@@ -738,18 +743,26 @@ EXPORT_SYMBOL(ib_modify_port);
  *   a specified GID value occurs.
  * @device: The device to query.
  * @gid: The GID value to search for.
+ * @ndev: In RoCE, the net device of the device. Null means ignore.
  * @port_num: The port number of the device where the GID value was found.
  * @index: The index into the GID table where the GID was found.  This
  *   parameter may be NULL.
  */
 int ib_find_gid(struct ib_device *device, union ib_gid *gid,
-		u8 *port_num, u16 *index)
+		struct net_device *ndev, u8 *port_num, u16 *index)
 {
 	union ib_gid tmp_gid;
 	int ret, port, i;
 
+	if (device->cache.roce_gid_table &&
+	    !roce_gid_table_find_gid(device, gid, ndev, port_num, index))
+		return 0;
+
 	for (port = rdma_start_port(device); port <= rdma_end_port(device); ++port) {
-		for (i = 0; i < device->port_immutable[port].gid_tbl_len; ++i) {
+		if (rdma_cap_roce_gid_table(device, port))
+			continue;
+
+		for (i = 0; i < device->port_immutable[port].gid_tbl_len; ++i) {
 			ret = ib_query_gid(device, port, i, &tmp_gid);
 			if (ret)
 				return ret;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 1f918b0..4806d8b 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2093,6 +2093,25 @@ static inline bool rdma_cap_read_multi_sge(struct ib_device *device,
 	return !(device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_IWARP);
 }
 
+/**
+ * rdma_cap_roce_gid_table - Check if the port of device uses roce_gid_table
+ * @device: Device to check
+ * @port_num: Port number to check
+ *
+ * RoCE GID table mechanism manages the various GIDs for a device.
+ *
+ * NOTE: if allocating the port's GID table has failed, this call will still
+ * return true, but any RoCE GID table API will fail.
+ *
+ * Return: true if the port uses RoCE GID table mechanism in order to manage
+ * its GIDs.
+ */
+static inline bool rdma_cap_roce_gid_table(const struct ib_device *device,
+					   u8 port_num)
+{
+	return rdma_protocol_roce(device, port_num) && device->cache.roce_gid_table;
+}
+
 int ib_query_gid(struct ib_device *device,
 		 u8 port_num, int index, union ib_gid *gid);
 
@@ -2108,7 +2127,7 @@ int ib_modify_port(struct ib_device *device,
 		   struct ib_port_modify *port_modify);
 
 int ib_find_gid(struct ib_device *device, union ib_gid *gid,
-		u8 *port_num, u16 *index);
+		struct net_device *ndev, u8 *port_num, u16 *index);
 
 int ib_find_pkey(struct ib_device *device,
 		 u8 port_num, u16 pkey, u16 *index);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH for-next V5 09/12] net/mlx4: Postpone the registration of net_device
       [not found] ` <1433772735-22416-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (7 preceding siblings ...)
  2015-06-08 14:12   ` [PATCH for-next V5 08/12] IB/core: ib_cache routines should use roce_gid_table when needed Matan Barak
@ 2015-06-08 14:12   ` Matan Barak
  2015-06-08 14:12   ` [PATCH for-next V5 10/12] IB/mlx4: Implement ib_device callbacks Matan Barak
                     ` (3 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: Matan Barak @ 2015-06-08 14:12 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Moni Shoua, Jason Gunthorpe, Sean Hefty,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

From: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

The mlx4 network driver was registered in the context of the 'add'
function of the core driver (called when HW should be registered).
This causes the NETDEV_REGISTER netdev event to be sent in a context
where the get_protocol_dev() callback still returns NULL, which may
confuse listeners of netdev events.
This patch is preparation for the patch that implements the
get_netdev() callback in the IB/mlx4 driver.
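
The ordering this patch aims for can be sketched in a userspace model: the
context returned by add() is made discoverable first, and only then is the
optional activate() hook run. All names here (model_iface, model_core,
model_add_device and the demo_* helpers) are hypothetical, and a single
pointer stands in for the driver's context list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interface with an optional second-phase hook. */
struct model_iface {
	void *(*add)(void);
	void (*activate)(void *ctx);
};

/* One pointer models the list of registered contexts. */
struct model_core {
	void *listed_ctx;
};

void model_add_device(struct model_core *core, const struct model_iface *intf)
{
	void *ctx;

	assert(core != NULL && intf != NULL && intf->add != NULL);

	ctx = intf->add();
	if (!ctx)
		return;

	core->listed_ctx = ctx;     /* context becomes discoverable first */
	if (intf->activate)
		intf->activate(ctx); /* only now may net devices be created */
}

/* Demo callbacks: record whether the context was visible on activation. */
struct model_core demo_core;
int demo_ctx_visible_at_activate;
int demo_state;

void *demo_add(void)
{
	return &demo_state;
}

void demo_activate(void *ctx)
{
	demo_ctx_visible_at_activate = (demo_core.listed_ctx == ctx);
}

const struct model_iface demo_iface = { demo_add, demo_activate };
```

With this split, anything triggered from activate() (such as a register
event) can already look the context up through the core.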

Signed-off-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/net/ethernet/mellanox/mlx4/en_main.c | 36 ++++++++++++++++------------
 drivers/net/ethernet/mellanox/mlx4/intf.c    |  3 +++
 include/linux/mlx4/driver.h                  |  1 +
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_main.c b/drivers/net/ethernet/mellanox/mlx4/en_main.c
index 913b716..a946e4b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_main.c
@@ -224,6 +224,26 @@ static void mlx4_en_remove(struct mlx4_dev *dev, void *endev_ptr)
 	kfree(mdev);
 }
 
+static void mlx4_en_activate(struct mlx4_dev *dev, void *ctx)
+{
+	int i;
+	struct mlx4_en_dev *mdev = ctx;
+
+	/* Create a netdev for each port */
+	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_ETH) {
+		mlx4_info(mdev, "Activating port:%d\n", i);
+		if (mlx4_en_init_netdev(mdev, i, &mdev->profile.prof[i]))
+			mdev->pndev[i] = NULL;
+	}
+
+	/* register notifier */
+	mdev->nb.notifier_call = mlx4_en_netdev_event;
+	if (register_netdevice_notifier(&mdev->nb)) {
+		mdev->nb.notifier_call = NULL;
+		mlx4_err(mdev, "Failed to create notifier\n");
+	}
+}
+
 static void *mlx4_en_add(struct mlx4_dev *dev)
 {
 	struct mlx4_en_dev *mdev;
@@ -297,21 +317,6 @@ static void *mlx4_en_add(struct mlx4_dev *dev)
 	mutex_init(&mdev->state_lock);
 	mdev->device_up = true;
 
-	/* Setup ports */
-
-	/* Create a netdev for each port */
-	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_ETH) {
-		mlx4_info(mdev, "Activating port:%d\n", i);
-		if (mlx4_en_init_netdev(mdev, i, &mdev->profile.prof[i]))
-			mdev->pndev[i] = NULL;
-	}
-	/* register notifier */
-	mdev->nb.notifier_call = mlx4_en_netdev_event;
-	if (register_netdevice_notifier(&mdev->nb)) {
-		mdev->nb.notifier_call = NULL;
-		mlx4_err(mdev, "Failed to create notifier\n");
-	}
-
 	return mdev;
 
 err_mr:
@@ -335,6 +340,7 @@ static struct mlx4_interface mlx4_en_interface = {
 	.event		= mlx4_en_event,
 	.get_dev	= mlx4_en_get_netdev,
 	.protocol	= MLX4_PROT_ETH,
+	.activate	= mlx4_en_activate,
 };
 
 static void mlx4_en_verify_params(void)
diff --git a/drivers/net/ethernet/mellanox/mlx4/intf.c b/drivers/net/ethernet/mellanox/mlx4/intf.c
index 6fce587..09e94c6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/intf.c
+++ b/drivers/net/ethernet/mellanox/mlx4/intf.c
@@ -63,8 +63,11 @@ static void mlx4_add_device(struct mlx4_interface *intf, struct mlx4_priv *priv)
 		spin_lock_irq(&priv->ctx_lock);
 		list_add_tail(&dev_ctx->list, &priv->ctx_list);
 		spin_unlock_irq(&priv->ctx_lock);
+		if (intf->activate)
+			intf->activate(&priv->dev, dev_ctx->context);
 	} else
 		kfree(dev_ctx);
+
 }
 
 static void mlx4_remove_device(struct mlx4_interface *intf, struct mlx4_priv *priv)
diff --git a/include/linux/mlx4/driver.h b/include/linux/mlx4/driver.h
index 9553a73..5a06d96 100644
--- a/include/linux/mlx4/driver.h
+++ b/include/linux/mlx4/driver.h
@@ -59,6 +59,7 @@ struct mlx4_interface {
 	void			(*event) (struct mlx4_dev *dev, void *context,
 					  enum mlx4_dev_event event, unsigned long param);
 	void *			(*get_dev)(struct mlx4_dev *dev, void *context, u8 port);
+	void			(*activate)(struct mlx4_dev *dev, void *context);
 	struct list_head	list;
 	enum mlx4_protocol	protocol;
 	int			flags;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH for-next V5 10/12] IB/mlx4: Implement ib_device callbacks
       [not found] ` <1433772735-22416-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (8 preceding siblings ...)
  2015-06-08 14:12   ` [PATCH for-next V5 09/12] net/mlx4: Postpone the registration of net_device Matan Barak
@ 2015-06-08 14:12   ` Matan Barak
       [not found]     ` <1433772735-22416-11-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-08 14:12   ` [PATCH for-next V5 11/12] IB/mlx4: Replace mechanism for RoCE GID management Matan Barak
                     ` (2 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Matan Barak @ 2015-06-08 14:12 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Moni Shoua, Jason Gunthorpe, Sean Hefty,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

From: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

get_netdev: get the net_device on the physical port of the IB transport port. In
port aggregation mode it is required to return the netdev of the active port.

modify_gid: notification of a change in the RoCE GID cache. Handle this by
writing to the hardware GID table. Indexes in the cache and hardware tables
may not match, so a translation is required when modifying a QP or creating
an address handle.
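
Why cache and hardware indexes can diverge is easier to see in a small
userspace model. This is a sketch under assumptions, not the driver code:
the model_* names are hypothetical, the table holds 4 slots instead of
MLX4_MAX_PORT_GIDS, and a zero refcount marks a free slot, whereas the patch
keeps a per-slot context and treats the all-zero GID as free.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MODEL_HW_GIDS 4 /* hypothetical size for the example */

struct model_gid {
	unsigned char raw[16];
};

struct model_hw_entry {
	struct model_gid gid;
	int refcount; /* 0 means the slot is free */
};

struct model_hw_table {
	struct model_hw_entry slot[MODEL_HW_GIDS];
};

/*
 * Add a GID. An already present GID is shared (refcount++), otherwise the
 * first free slot is taken. Returns the real (hardware) index or -ENOSPC.
 */
int model_hw_add(struct model_hw_table *t, const struct model_gid *gid)
{
	int i, free_slot = -1;

	assert(t != NULL && gid != NULL);

	for (i = 0; i < MODEL_HW_GIDS; i++) {
		if (t->slot[i].refcount &&
		    !memcmp(&t->slot[i].gid, gid, sizeof(*gid))) {
			t->slot[i].refcount++;
			return i;
		}
		if (free_slot < 0 && !t->slot[i].refcount)
			free_slot = i;
	}
	if (free_slot < 0)
		return -ENOSPC;

	t->slot[free_slot].gid = *gid;
	t->slot[free_slot].refcount = 1;
	return free_slot;
}

/* Drop one reference; the slot is cleared when the last user goes away. */
void model_hw_del(struct model_hw_table *t, int real_index)
{
	assert(t != NULL && real_index >= 0 && real_index < MODEL_HW_GIDS);
	assert(t->slot[real_index].refcount > 0);

	if (--t->slot[real_index].refcount == 0)
		memset(&t->slot[real_index].gid, 0,
		       sizeof(t->slot[real_index].gid));
}

/* Translate by value: real index of a GID, or -EINVAL if not present. */
int model_hw_real_index(const struct model_hw_table *t,
			const struct model_gid *gid)
{
	int i;

	assert(t != NULL && gid != NULL);

	for (i = 0; i < MODEL_HW_GIDS; i++)
		if (t->slot[i].refcount &&
		    !memcmp(&t->slot[i].gid, gid, sizeof(*gid)))
			return i;
	return -EINVAL;
}
```

Because two cache entries with the same GID share one hardware slot, a cache
index cannot be used as a hardware index directly, which is why a lookup by
GID value is needed at QP modify and address handle creation time.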

Signed-off-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/main.c    | 213 ++++++++++++++++++++++++++++++++++-
 drivers/infiniband/hw/mlx4/mlx4_ib.h |  17 +++
 include/linux/mlx4/device.h          |   3 +-
 3 files changed, 229 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 69ae464..bf38e32 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -45,6 +45,9 @@
 #include <rdma/ib_smi.h>
 #include <rdma/ib_user_verbs.h>
 #include <rdma/ib_addr.h>
+#include <rdma/ib_cache.h>
+
+#include <net/bonding.h>
 
 #include <linux/mlx4/driver.h>
 #include <linux/mlx4/cmd.h>
@@ -129,6 +132,199 @@ static int num_ib_ports(struct mlx4_dev *dev)
 	return ib_ports;
 }
 
+static struct net_device *mlx4_ib_get_netdev(struct ib_device *device, u8 port_num)
+{
+	struct mlx4_ib_dev *ibdev = to_mdev(device);
+
+	if (mlx4_is_bonded(ibdev->dev)) {
+		struct net_device *dev;
+		struct net_device *upper = NULL;
+
+		rcu_read_lock();
+
+		dev = mlx4_get_protocol_dev(ibdev->dev, MLX4_PROT_ETH, port_num);
+		if (dev)
+			upper = netdev_master_upper_dev_get_rcu(dev);
+		else
+			goto unlock;
+		if (upper)
+			dev = bond_option_active_slave_get_rcu(netdev_priv(upper));
+unlock:
+		rcu_read_unlock();
+
+		return dev;
+	}
+
+	return mlx4_get_protocol_dev(ibdev->dev, MLX4_PROT_ETH, port_num);
+}
+
+static int mlx4_ib_update_gids(struct gid_entry *gids,
+			       struct mlx4_ib_dev *ibdev,
+			       u8 port_num)
+{
+	struct mlx4_cmd_mailbox *mailbox;
+	int err;
+	struct mlx4_dev *dev = ibdev->dev;
+	int i;
+	union ib_gid *gid_tbl;
+
+	mailbox = mlx4_alloc_cmd_mailbox(dev);
+	if (IS_ERR(mailbox))
+		return -ENOMEM;
+
+	gid_tbl = mailbox->buf;
+
+	for (i = 0; i < MLX4_MAX_PORT_GIDS; ++i)
+		memcpy(&gid_tbl[i], &gids[i].gid, sizeof(union ib_gid));
+
+	err = mlx4_cmd(dev, mailbox->dma,
+		       MLX4_SET_PORT_GID_TABLE << 8 | port_num,
+		       1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B,
+		       MLX4_CMD_WRAPPED);
+	if (mlx4_is_bonded(dev))
+		err += mlx4_cmd(dev, mailbox->dma,
+				MLX4_SET_PORT_GID_TABLE << 8 | 2,
+				1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B,
+				MLX4_CMD_WRAPPED);
+
+	mlx4_free_cmd_mailbox(dev, mailbox);
+	return err;
+}
+
+static int mlx4_ib_modify_gid(struct ib_device *device,
+			      u8 port_num, unsigned int index,
+			      const union ib_gid *gid,
+			      const struct ib_gid_attr *attr,
+			      void **context)
+{
+	struct mlx4_ib_dev *ibdev = to_mdev(device);
+	struct mlx4_ib_iboe *iboe = &ibdev->iboe;
+	struct mlx4_port_gid_table   *port_gid_table;
+	int free = -1, found = -1;
+	int ret = 0;
+	int clear = !memcmp(&zgid, gid, sizeof(*gid));
+	int hw_update = 0;
+	int i;
+	struct gid_entry *gids = NULL;
+
+	if (!rdma_cap_roce_gid_table(device, port_num))
+		return -EINVAL;
+
+	if (port_num > MLX4_MAX_PORTS)
+		return -EINVAL;
+
+	if (!context)
+		return -EINVAL;
+
+	spin_lock_bh(&iboe->lock);
+	port_gid_table = &iboe->gids[port_num - 1];
+
+	if (clear) {
+		struct gid_cache_context *ctx = *context;
+
+		if (ctx) {
+			ctx->refcount--;
+			if (!ctx->refcount) {
+				unsigned int real_index = ctx->real_index;
+
+				memcpy(&port_gid_table->gids[real_index].gid, &zgid, sizeof(*gid));
+				kfree(port_gid_table->gids[real_index].ctx);
+				port_gid_table->gids[real_index].ctx = NULL;
+				hw_update = 1;
+			}
+		}
+	} else {
+		for (i = 0; i < MLX4_MAX_PORT_GIDS; ++i) {
+			if (!memcmp(&port_gid_table->gids[i].gid, gid, sizeof(*gid))) {
+				found = i;
+				break;
+			}
+			if (free < 0 && !memcmp(&port_gid_table->gids[i].gid, &zgid, sizeof(*gid)))
+				free = i; /* HW has space */
+		}
+
+		if (found < 0) {
+			if (free < 0) {
+				ret = -ENOSPC;
+			} else {
+				port_gid_table->gids[free].ctx = kmalloc(sizeof(*port_gid_table->gids[free].ctx), GFP_ATOMIC);
+				if (!port_gid_table->gids[free].ctx) {
+					ret = -ENOMEM;
+				} else {
+					*context = port_gid_table->gids[free].ctx;
+					memcpy(&port_gid_table->gids[free].gid, gid, sizeof(*gid));
+					port_gid_table->gids[free].ctx->real_index = free;
+					port_gid_table->gids[free].ctx->refcount = 1;
+					hw_update = 1;
+				}
+			}
+		} else {
+			struct gid_cache_context *ctx = port_gid_table->gids[found].ctx;
+			*context = ctx;
+			ctx->refcount++;
+		}
+	}
+	if (!ret && hw_update) {
+		gids = kmalloc(sizeof(*gids) * MLX4_MAX_PORT_GIDS, GFP_ATOMIC);
+		if (!gids) {
+			ret = -ENOMEM;
+		} else {
+			for (i = 0; i < MLX4_MAX_PORT_GIDS; i++)
+				memcpy(&gids[i].gid, &port_gid_table->gids[i].gid, sizeof(union ib_gid));
+		}
+	}
+	spin_unlock_bh(&iboe->lock);
+
+	if (!ret && hw_update) {
+		ret = mlx4_ib_update_gids(gids, ibdev, port_num);
+		kfree(gids);
+	}
+
+	return ret;
+}
+
+int mlx4_ib_gid_index_to_real_index(struct mlx4_ib_dev *ibdev,
+				    u8 port_num, int index)
+{
+	struct mlx4_ib_iboe *iboe = &ibdev->iboe;
+	struct gid_cache_context *ctx = NULL;
+	union ib_gid gid;
+	struct mlx4_port_gid_table   *port_gid_table;
+	int real_index = -EINVAL;
+	int i;
+	int ret;
+	unsigned long flags;
+
+	if (port_num > MLX4_MAX_PORTS)
+		return -EINVAL;
+
+	if (mlx4_is_bonded(ibdev->dev))
+		port_num = 1;
+
+	if (!rdma_cap_roce_gid_table(&ibdev->ib_dev, port_num))
+		return index;
+
+	ret = ib_get_cached_gid(&ibdev->ib_dev, port_num, index, &gid);
+	if (ret)
+		return ret;
+
+	if (!memcmp(&gid, &zgid, sizeof(gid)))
+		return -EINVAL;
+
+	spin_lock_irqsave(&iboe->lock, flags);
+	port_gid_table = &iboe->gids[port_num - 1];
+
+	for (i = 0; i < MLX4_MAX_PORT_GIDS; ++i)
+		if (!memcmp(&port_gid_table->gids[i].gid, &gid, sizeof(gid))) {
+			ctx = port_gid_table->gids[i].ctx;
+			break;
+		}
+	if (ctx)
+		real_index = ctx->real_index;
+	spin_unlock_irqrestore(&iboe->lock, flags);
+	return real_index;
+}
+
 static int mlx4_ib_query_device(struct ib_device *ibdev,
 				struct ib_device_attr *props)
 {
@@ -477,11 +673,22 @@ out:
 static int iboe_query_gid(struct ib_device *ibdev, u8 port, int index,
 			  union ib_gid *gid)
 {
-	struct mlx4_ib_dev *dev = to_mdev(ibdev);
+	int ret;
 
-	*gid = dev->iboe.gid_table[port - 1][index];
+	if (!rdma_cap_roce_gid_table(ibdev, port)) {
+		struct mlx4_ib_dev *dev = to_mdev(ibdev);
 
-	return 0;
+		*gid = dev->iboe.gid_table[port - 1][index];
+		return 0;
+	}
+
+	ret = ib_get_cached_gid(ibdev, port, index, gid);
+	if (ret == -EAGAIN) {
+		memcpy(gid, &zgid, sizeof(*gid));
+		return 0;
+	}
+
+	return ret;
 }
 
 static int mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 645d55e..c870ddb 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -456,6 +456,20 @@ struct mlx4_ib_sriov {
 	struct idr pv_id_table;
 };
 
+struct gid_cache_context {
+	int real_index;
+	int refcount;
+};
+
+struct gid_entry {
+	union ib_gid	gid;
+	struct gid_cache_context *ctx;
+};
+
+struct mlx4_port_gid_table {
+	struct gid_entry gids[MLX4_MAX_PORT_GIDS];
+};
+
 struct mlx4_ib_iboe {
 	spinlock_t		lock;
 	struct net_device      *netdevs[MLX4_MAX_PORTS];
@@ -465,6 +479,7 @@ struct mlx4_ib_iboe {
 	struct notifier_block	nb_inet;
 	struct notifier_block	nb_inet6;
 	union ib_gid		gid_table[MLX4_MAX_PORTS][128];
+	struct mlx4_port_gid_table gids[MLX4_MAX_PORTS];
 };
 
 struct pkey_mgt {
@@ -815,5 +830,7 @@ int mlx4_ib_rereg_user_mr(struct ib_mr *mr, int flags,
 			  u64 start, u64 length, u64 virt_addr,
 			  int mr_access_flags, struct ib_pd *pd,
 			  struct ib_udata *udata);
+int mlx4_ib_gid_index_to_real_index(struct mlx4_ib_dev *ibdev,
+				    u8 port_num, int index);
 
 #endif /* MLX4_IB_H */
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 83e80ab..d439949 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -78,7 +78,8 @@ enum {
 
 enum {
 	MLX4_MAX_PORTS		= 2,
-	MLX4_MAX_PORT_PKEYS	= 128
+	MLX4_MAX_PORT_PKEYS	= 128,
+	MLX4_MAX_PORT_GIDS	= 128
 };
 
 /* base qkey for use in sriov tunnel-qp/proxy-qp communication.
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH for-next V5 11/12] IB/mlx4: Replace mechanism for RoCE GID management
       [not found] ` <1433772735-22416-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (9 preceding siblings ...)
  2015-06-08 14:12   ` [PATCH for-next V5 10/12] IB/mlx4: Implement ib_device callbacks Matan Barak
@ 2015-06-08 14:12   ` Matan Barak
  2015-06-08 14:12   ` [PATCH for-next V5 12/12] RDMA/ocrdma: Changes in driver to incorporate the moving of GID Table mgmt to IB/Core Matan Barak
  2015-06-08 21:37   ` [PATCH for-next V5 00/12] Move RoCE GID management " Hefty, Sean
  12 siblings, 0 replies; 45+ messages in thread
From: Matan Barak @ 2015-06-08 14:12 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Moni Shoua, Jason Gunthorpe, Sean Hefty,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

From: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Manage RoCE gid table with logic in IB/core, which is common to all
vendors, and remove the mechanism from the mlx4 IB driver.
Since management of the GID cache may lead to index mismatch with the
hardware GID table, a translation between indexes is required when
modifying a QP or creating an address handle.

Signed-off-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/ah.c      |   2 +-
 drivers/infiniband/hw/mlx4/main.c    | 510 ++---------------------------------
 drivers/infiniband/hw/mlx4/mlx4_ib.h |   4 -
 drivers/infiniband/hw/mlx4/qp.c      |  10 +-
 4 files changed, 28 insertions(+), 498 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/ah.c b/drivers/infiniband/hw/mlx4/ah.c
index f50a546..7ad6f96 100644
--- a/drivers/infiniband/hw/mlx4/ah.c
+++ b/drivers/infiniband/hw/mlx4/ah.c
@@ -89,7 +89,7 @@ static struct ib_ah *create_iboe_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr
 	if (vlan_tag < 0x1000)
 		vlan_tag |= (ah_attr->sl & 7) << 13;
 	ah->av.eth.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24));
-	ah->av.eth.gid_index = ah_attr->grh.sgid_index;
+	ah->av.eth.gid_index = mlx4_ib_gid_index_to_real_index(ibdev, ah_attr->port_num, ah_attr->grh.sgid_index);
 	ah->av.eth.vlan = cpu_to_be16(vlan_tag);
 	if (ah_attr->static_rate) {
 		ah->av.eth.stat_rate = ah_attr->static_rate + MLX4_STAT_RATE_OFFSET;
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index bf38e32..18708a7 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -77,13 +77,6 @@ static const char mlx4_ib_version[] =
 	DRV_NAME ": Mellanox ConnectX InfiniBand driver v"
 	DRV_VERSION " (" DRV_RELDATE ")\n";
 
-struct update_gid_work {
-	struct work_struct	work;
-	union ib_gid		gids[128];
-	struct mlx4_ib_dev     *dev;
-	int			port;
-};
-
 static void do_slave_init(struct mlx4_ib_dev *ibdev, int slave, int do_init);
 
 static struct workqueue_struct *wq;
@@ -560,7 +553,8 @@ static int eth_link_query_port(struct ib_device *ibdev, u8 port,
 	props->active_width	=  (((u8 *)mailbox->buf)[5] == 0x40) ?
 						IB_WIDTH_4X : IB_WIDTH_1X;
 	props->active_speed	= IB_SPEED_QDR;
-	props->port_cap_flags	= IB_PORT_CM_SUP | IB_PORT_IP_BASED_GIDS;
+	props->port_cap_flags	= IB_PORT_CM_SUP | IB_PORT_IP_BASED_GIDS |
+				  IB_PORT_ROCE;
 	props->gid_tbl_len	= mdev->dev->caps.gid_table_len[port];
 	props->max_msg_sz	= mdev->dev->caps.max_msg_sz;
 	props->pkey_tbl_len	= 1;
@@ -569,12 +563,13 @@ static int eth_link_query_port(struct ib_device *ibdev, u8 port,
 	props->state		= IB_PORT_DOWN;
 	props->phys_state	= state_to_phys_state(props->state);
 	props->active_mtu	= IB_MTU_256;
-	if (is_bonded)
-		rtnl_lock(); /* required to get upper dev */
 	spin_lock_bh(&iboe->lock);
 	ndev = iboe->netdevs[port - 1];
-	if (ndev && is_bonded)
-		ndev = netdev_master_upper_dev_get(ndev);
+	if (ndev && is_bonded) {
+		rcu_read_lock(); /* required to get upper dev */
+		ndev = netdev_master_upper_dev_get_rcu(ndev);
+		rcu_read_unlock();
+	}
 	if (!ndev)
 		goto out_unlock;
 
@@ -586,8 +581,6 @@ static int eth_link_query_port(struct ib_device *ibdev, u8 port,
 	props->phys_state	= state_to_phys_state(props->state);
 out_unlock:
 	spin_unlock_bh(&iboe->lock);
-	if (is_bonded)
-		rtnl_unlock();
 out:
 	mlx4_free_cmd_mailbox(mdev->dev, mailbox);
 	return err;
@@ -670,17 +663,19 @@ out:
 	return err;
 }
 
-static int iboe_query_gid(struct ib_device *ibdev, u8 port, int index,
-			  union ib_gid *gid)
+static int mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
+			     union ib_gid *gid)
 {
 	int ret;
 
-	if (!rdma_cap_roce_gid_table(ibdev, port)) {
-		struct mlx4_ib_dev *dev = to_mdev(ibdev);
+	if (rdma_protocol_ib(ibdev, port))
+		return __mlx4_ib_query_gid(ibdev, port, index, gid, 0);
 
-		*gid = dev->iboe.gid_table[port - 1][index];
-		return 0;
-	}
+	if (!rdma_protocol_roce(ibdev, port))
+		return -ENODEV;
+
+	if (!rdma_cap_roce_gid_table(ibdev, port))
+		return -ENODEV;
 
 	ret = ib_get_cached_gid(ibdev, port, index, gid);
 	if (ret == -EAGAIN) {
@@ -691,15 +686,6 @@ static int iboe_query_gid(struct ib_device *ibdev, u8 port, int index,
 	return ret;
 }
 
-static int mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
-			     union ib_gid *gid)
-{
-	if (rdma_port_get_link_layer(ibdev, port) == IB_LINK_LAYER_INFINIBAND)
-		return __mlx4_ib_query_gid(ibdev, port, index, gid, 0);
-	else
-		return iboe_query_gid(ibdev, port, index, gid);
-}
-
 int __mlx4_ib_query_pkey(struct ib_device *ibdev, u8 port, u16 index,
 			 u16 *pkey, int netw_view)
 {
@@ -1695,272 +1681,6 @@ static struct device_attribute *mlx4_class_attributes[] = {
 	&dev_attr_board_id
 };
 
-static void mlx4_addrconf_ifid_eui48(u8 *eui, u16 vlan_id,
-				     struct net_device *dev)
-{
-	memcpy(eui, dev->dev_addr, 3);
-	memcpy(eui + 5, dev->dev_addr + 3, 3);
-	if (vlan_id < 0x1000) {
-		eui[3] = vlan_id >> 8;
-		eui[4] = vlan_id & 0xff;
-	} else {
-		eui[3] = 0xff;
-		eui[4] = 0xfe;
-	}
-	eui[0] ^= 2;
-}
-
-static void update_gids_task(struct work_struct *work)
-{
-	struct update_gid_work *gw = container_of(work, struct update_gid_work, work);
-	struct mlx4_cmd_mailbox *mailbox;
-	union ib_gid *gids;
-	int err;
-	struct mlx4_dev	*dev = gw->dev->dev;
-	int is_bonded = mlx4_is_bonded(dev);
-
-	if (!gw->dev->ib_active)
-		return;
-
-	mailbox = mlx4_alloc_cmd_mailbox(dev);
-	if (IS_ERR(mailbox)) {
-		pr_warn("update gid table failed %ld\n", PTR_ERR(mailbox));
-		return;
-	}
-
-	gids = mailbox->buf;
-	memcpy(gids, gw->gids, sizeof gw->gids);
-
-	err = mlx4_cmd(dev, mailbox->dma, MLX4_SET_PORT_GID_TABLE << 8 | gw->port,
-		       MLX4_SET_PORT_ETH_OPCODE, MLX4_CMD_SET_PORT,
-		       MLX4_CMD_TIME_CLASS_B, MLX4_CMD_WRAPPED);
-	if (err)
-		pr_warn("set port command failed\n");
-	else
-		if ((gw->port == 1) || !is_bonded)
-			mlx4_ib_dispatch_event(gw->dev,
-					       is_bonded ? 1 : gw->port,
-					       IB_EVENT_GID_CHANGE);
-
-	mlx4_free_cmd_mailbox(dev, mailbox);
-	kfree(gw);
-}
-
-static void reset_gids_task(struct work_struct *work)
-{
-	struct update_gid_work *gw =
-			container_of(work, struct update_gid_work, work);
-	struct mlx4_cmd_mailbox *mailbox;
-	union ib_gid *gids;
-	int err;
-	struct mlx4_dev	*dev = gw->dev->dev;
-
-	if (!gw->dev->ib_active)
-		return;
-
-	mailbox = mlx4_alloc_cmd_mailbox(dev);
-	if (IS_ERR(mailbox)) {
-		pr_warn("reset gid table failed\n");
-		goto free;
-	}
-
-	gids = mailbox->buf;
-	memcpy(gids, gw->gids, sizeof(gw->gids));
-
-	if (mlx4_ib_port_link_layer(&gw->dev->ib_dev, gw->port) ==
-				    IB_LINK_LAYER_ETHERNET) {
-		err = mlx4_cmd(dev, mailbox->dma,
-			       MLX4_SET_PORT_GID_TABLE << 8 | gw->port,
-			       MLX4_SET_PORT_ETH_OPCODE, MLX4_CMD_SET_PORT,
-			       MLX4_CMD_TIME_CLASS_B,
-			       MLX4_CMD_WRAPPED);
-		if (err)
-			pr_warn("set port %d command failed\n", gw->port);
-	}
-
-	mlx4_free_cmd_mailbox(dev, mailbox);
-free:
-	kfree(gw);
-}
-
-static int update_gid_table(struct mlx4_ib_dev *dev, int port,
-			    union ib_gid *gid, int clear,
-			    int default_gid)
-{
-	struct update_gid_work *work;
-	int i;
-	int need_update = 0;
-	int free = -1;
-	int found = -1;
-	int max_gids;
-
-	if (default_gid) {
-		free = 0;
-	} else {
-		max_gids = dev->dev->caps.gid_table_len[port];
-		for (i = 1; i < max_gids; ++i) {
-			if (!memcmp(&dev->iboe.gid_table[port - 1][i], gid,
-				    sizeof(*gid)))
-				found = i;
-
-			if (clear) {
-				if (found >= 0) {
-					need_update = 1;
-					dev->iboe.gid_table[port - 1][found] =
-						zgid;
-					break;
-				}
-			} else {
-				if (found >= 0)
-					break;
-
-				if (free < 0 &&
-				    !memcmp(&dev->iboe.gid_table[port - 1][i],
-					    &zgid, sizeof(*gid)))
-					free = i;
-			}
-		}
-	}
-
-	if (found == -1 && !clear && free >= 0) {
-		dev->iboe.gid_table[port - 1][free] = *gid;
-		need_update = 1;
-	}
-
-	if (!need_update)
-		return 0;
-
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
-	if (!work)
-		return -ENOMEM;
-
-	memcpy(work->gids, dev->iboe.gid_table[port - 1], sizeof(work->gids));
-	INIT_WORK(&work->work, update_gids_task);
-	work->port = port;
-	work->dev = dev;
-	queue_work(wq, &work->work);
-
-	return 0;
-}
-
-static void mlx4_make_default_gid(struct  net_device *dev, union ib_gid *gid)
-{
-	gid->global.subnet_prefix = cpu_to_be64(0xfe80000000000000LL);
-	mlx4_addrconf_ifid_eui48(&gid->raw[8], 0xffff, dev);
-}
-
-
-static int reset_gid_table(struct mlx4_ib_dev *dev, u8 port)
-{
-	struct update_gid_work *work;
-
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
-	if (!work)
-		return -ENOMEM;
-
-	memset(dev->iboe.gid_table[port - 1], 0, sizeof(work->gids));
-	memset(work->gids, 0, sizeof(work->gids));
-	INIT_WORK(&work->work, reset_gids_task);
-	work->dev = dev;
-	work->port = port;
-	queue_work(wq, &work->work);
-	return 0;
-}
-
-static int mlx4_ib_addr_event(int event, struct net_device *event_netdev,
-			      struct mlx4_ib_dev *ibdev, union ib_gid *gid)
-{
-	struct mlx4_ib_iboe *iboe;
-	int port = 0;
-	struct net_device *real_dev = rdma_vlan_dev_real_dev(event_netdev) ?
-				rdma_vlan_dev_real_dev(event_netdev) :
-				event_netdev;
-	union ib_gid default_gid;
-
-	mlx4_make_default_gid(real_dev, &default_gid);
-
-	if (!memcmp(gid, &default_gid, sizeof(*gid)))
-		return 0;
-
-	if (event != NETDEV_DOWN && event != NETDEV_UP)
-		return 0;
-
-	if ((real_dev != event_netdev) &&
-	    (event == NETDEV_DOWN) &&
-	    rdma_link_local_addr((struct in6_addr *)gid))
-		return 0;
-
-	iboe = &ibdev->iboe;
-	spin_lock_bh(&iboe->lock);
-
-	for (port = 1; port <= ibdev->dev->caps.num_ports; ++port)
-		if ((netif_is_bond_master(real_dev) &&
-		     (real_dev == iboe->masters[port - 1])) ||
-		     (!netif_is_bond_master(real_dev) &&
-		     (real_dev == iboe->netdevs[port - 1])))
-			update_gid_table(ibdev, port, gid,
-					 event == NETDEV_DOWN, 0);
-
-	spin_unlock_bh(&iboe->lock);
-	return 0;
-
-}
-
-static u8 mlx4_ib_get_dev_port(struct net_device *dev,
-			       struct mlx4_ib_dev *ibdev)
-{
-	u8 port = 0;
-	struct mlx4_ib_iboe *iboe;
-	struct net_device *real_dev = rdma_vlan_dev_real_dev(dev) ?
-				rdma_vlan_dev_real_dev(dev) : dev;
-
-	iboe = &ibdev->iboe;
-
-	for (port = 1; port <= ibdev->dev->caps.num_ports; ++port)
-		if ((netif_is_bond_master(real_dev) &&
-		     (real_dev == iboe->masters[port - 1])) ||
-		     (!netif_is_bond_master(real_dev) &&
-		     (real_dev == iboe->netdevs[port - 1])))
-			break;
-
-	if ((port == 0) || (port > ibdev->dev->caps.num_ports))
-		return 0;
-	else
-		return port;
-}
-
-static int mlx4_ib_inet_event(struct notifier_block *this, unsigned long event,
-				void *ptr)
-{
-	struct mlx4_ib_dev *ibdev;
-	struct in_ifaddr *ifa = ptr;
-	union ib_gid gid;
-	struct net_device *event_netdev = ifa->ifa_dev->dev;
-
-	ipv6_addr_set_v4mapped(ifa->ifa_address, (struct in6_addr *)&gid);
-
-	ibdev = container_of(this, struct mlx4_ib_dev, iboe.nb_inet);
-
-	mlx4_ib_addr_event(event, event_netdev, ibdev, &gid);
-	return NOTIFY_DONE;
-}
-
-#if IS_ENABLED(CONFIG_IPV6)
-static int mlx4_ib_inet6_event(struct notifier_block *this, unsigned long event,
-				void *ptr)
-{
-	struct mlx4_ib_dev *ibdev;
-	struct inet6_ifaddr *ifa = ptr;
-	union  ib_gid *gid = (union ib_gid *)&ifa->addr;
-	struct net_device *event_netdev = ifa->idev->dev;
-
-	ibdev = container_of(this, struct mlx4_ib_dev, iboe.nb_inet6);
-
-	mlx4_ib_addr_event(event, event_netdev, ibdev, gid);
-	return NOTIFY_DONE;
-}
-#endif
-
 #define MLX4_IB_INVALID_MAC	((u64)-1)
 static void mlx4_ib_update_qps(struct mlx4_ib_dev *ibdev,
 			       struct net_device *dev,
@@ -2019,94 +1739,6 @@ unlock:
 	mutex_unlock(&ibdev->qp1_proxy_lock[port - 1]);
 }
 
-static void mlx4_ib_get_dev_addr(struct net_device *dev,
-				 struct mlx4_ib_dev *ibdev, u8 port)
-{
-	struct in_device *in_dev;
-#if IS_ENABLED(CONFIG_IPV6)
-	struct inet6_dev *in6_dev;
-	union ib_gid  *pgid;
-	struct inet6_ifaddr *ifp;
-	union ib_gid default_gid;
-#endif
-	union ib_gid gid;
-
-
-	if ((port == 0) || (port > ibdev->dev->caps.num_ports))
-		return;
-
-	/* IPv4 gids */
-	in_dev = in_dev_get(dev);
-	if (in_dev) {
-		for_ifa(in_dev) {
-			/*ifa->ifa_address;*/
-			ipv6_addr_set_v4mapped(ifa->ifa_address,
-					       (struct in6_addr *)&gid);
-			update_gid_table(ibdev, port, &gid, 0, 0);
-		}
-		endfor_ifa(in_dev);
-		in_dev_put(in_dev);
-	}
-#if IS_ENABLED(CONFIG_IPV6)
-	mlx4_make_default_gid(dev, &default_gid);
-	/* IPv6 gids */
-	in6_dev = in6_dev_get(dev);
-	if (in6_dev) {
-		read_lock_bh(&in6_dev->lock);
-		list_for_each_entry(ifp, &in6_dev->addr_list, if_list) {
-			pgid = (union ib_gid *)&ifp->addr;
-			if (!memcmp(pgid, &default_gid, sizeof(*pgid)))
-				continue;
-			update_gid_table(ibdev, port, pgid, 0, 0);
-		}
-		read_unlock_bh(&in6_dev->lock);
-		in6_dev_put(in6_dev);
-	}
-#endif
-}
-
-static void mlx4_ib_set_default_gid(struct mlx4_ib_dev *ibdev,
-				 struct  net_device *dev, u8 port)
-{
-	union ib_gid gid;
-	mlx4_make_default_gid(dev, &gid);
-	update_gid_table(ibdev, port, &gid, 0, 1);
-}
-
-static int mlx4_ib_init_gid_table(struct mlx4_ib_dev *ibdev)
-{
-	struct	net_device *dev;
-	struct mlx4_ib_iboe *iboe = &ibdev->iboe;
-	int i;
-	int err = 0;
-
-	for (i = 1; i <= ibdev->num_ports; ++i) {
-		if (rdma_port_get_link_layer(&ibdev->ib_dev, i) ==
-		    IB_LINK_LAYER_ETHERNET) {
-			err = reset_gid_table(ibdev, i);
-			if (err)
-				goto out;
-		}
-	}
-
-	read_lock(&dev_base_lock);
-	spin_lock_bh(&iboe->lock);
-
-	for_each_netdev(&init_net, dev) {
-		u8 port = mlx4_ib_get_dev_port(dev, ibdev);
-		/* port will be non-zero only for ETH ports */
-		if (port) {
-			mlx4_ib_set_default_gid(ibdev, dev, port);
-			mlx4_ib_get_dev_addr(dev, ibdev, port);
-		}
-	}
-
-	spin_unlock_bh(&iboe->lock);
-	read_unlock(&dev_base_lock);
-out:
-	return err;
-}
-
 static void mlx4_ib_scan_netdevs(struct mlx4_ib_dev *ibdev,
 				 struct net_device *dev,
 				 unsigned long event)
@@ -2116,81 +1748,22 @@ static void mlx4_ib_scan_netdevs(struct mlx4_ib_dev *ibdev,
 	int update_qps_port = -1;
 	int port;
 
+	ASSERT_RTNL();
+
 	iboe = &ibdev->iboe;
 
 	spin_lock_bh(&iboe->lock);
 	mlx4_foreach_ib_transport_port(port, ibdev->dev) {
-		enum ib_port_state	port_state = IB_PORT_NOP;
-		struct net_device *old_master = iboe->masters[port - 1];
-		struct net_device *curr_netdev;
-		struct net_device *curr_master;
 
 		iboe->netdevs[port - 1] =
 			mlx4_get_protocol_dev(ibdev->dev, MLX4_PROT_ETH, port);
-		if (iboe->netdevs[port - 1])
-			mlx4_ib_set_default_gid(ibdev,
-						iboe->netdevs[port - 1], port);
-		curr_netdev = iboe->netdevs[port - 1];
-
-		if (iboe->netdevs[port - 1] &&
-		    netif_is_bond_slave(iboe->netdevs[port - 1])) {
-			iboe->masters[port - 1] = netdev_master_upper_dev_get(
-				iboe->netdevs[port - 1]);
-		} else {
-			iboe->masters[port - 1] = NULL;
-		}
-		curr_master = iboe->masters[port - 1];
 
 		if (dev == iboe->netdevs[port - 1] &&
 		    (event == NETDEV_CHANGEADDR || event == NETDEV_REGISTER ||
 		     event == NETDEV_UP || event == NETDEV_CHANGE))
 			update_qps_port = port;
 
-		if (curr_netdev) {
-			port_state = (netif_running(curr_netdev) && netif_carrier_ok(curr_netdev)) ?
-						IB_PORT_ACTIVE : IB_PORT_DOWN;
-			mlx4_ib_set_default_gid(ibdev, curr_netdev, port);
-			if (curr_master) {
-				/* if using bonding/team and a slave port is down, we
-				 * don't want the bond IP based gids in the table since
-				 * flows that select port by gid may get the down port.
-				*/
-				if (port_state == IB_PORT_DOWN &&
-				    !mlx4_is_bonded(ibdev->dev)) {
-					reset_gid_table(ibdev, port);
-					mlx4_ib_set_default_gid(ibdev,
-								curr_netdev,
-								port);
-				} else {
-					/* gids from the upper dev (bond/team)
-					 * should appear in port's gid table
-					*/
-					mlx4_ib_get_dev_addr(curr_master,
-							     ibdev, port);
-				}
-			}
-			/* if bonding is used it is possible that we add it to
-			 * masters only after IP address is assigned to the
-			 * net bonding interface.
-			*/
-			if (curr_master && (old_master != curr_master)) {
-				reset_gid_table(ibdev, port);
-				mlx4_ib_set_default_gid(ibdev,
-							curr_netdev, port);
-				mlx4_ib_get_dev_addr(curr_master, ibdev, port);
-			}
-
-			if (!curr_master && (old_master != curr_master)) {
-				reset_gid_table(ibdev, port);
-				mlx4_ib_set_default_gid(ibdev,
-							curr_netdev, port);
-				mlx4_ib_get_dev_addr(curr_netdev, ibdev, port);
-			}
-		} else {
-			reset_gid_table(ibdev, port);
-		}
 	}
-
 	spin_unlock_bh(&iboe->lock);
 
 	if (update_qps_port > 0)
@@ -2394,6 +1967,8 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 						1 : ibdev->num_ports;
 	ibdev->ib_dev.num_comp_vectors	= dev->caps.num_comp_vectors;
 	ibdev->ib_dev.dma_device	= &dev->persist->pdev->dev;
+	ibdev->ib_dev.get_netdev	= mlx4_ib_get_netdev;
+	ibdev->ib_dev.modify_gid	= mlx4_ib_modify_gid;
 
 	if (dev->caps.userspace_caps)
 		ibdev->ib_dev.uverbs_abi_ver = MLX4_IB_UVERBS_ABI_VERSION;
@@ -2588,26 +2163,6 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 				goto err_notif;
 			}
 		}
-		if (!iboe->nb_inet.notifier_call) {
-			iboe->nb_inet.notifier_call = mlx4_ib_inet_event;
-			err = register_inetaddr_notifier(&iboe->nb_inet);
-			if (err) {
-				iboe->nb_inet.notifier_call = NULL;
-				goto err_notif;
-			}
-		}
-#if IS_ENABLED(CONFIG_IPV6)
-		if (!iboe->nb_inet6.notifier_call) {
-			iboe->nb_inet6.notifier_call = mlx4_ib_inet6_event;
-			err = register_inet6addr_notifier(&iboe->nb_inet6);
-			if (err) {
-				iboe->nb_inet6.notifier_call = NULL;
-				goto err_notif;
-			}
-		}
-#endif
-		if (mlx4_ib_init_gid_table(ibdev))
-			goto err_notif;
 	}
 
 	for (j = 0; j < ARRAY_SIZE(mlx4_class_attributes); ++j) {
@@ -2638,18 +2193,6 @@ err_notif:
 			pr_warn("failure unregistering notifier\n");
 		ibdev->iboe.nb.notifier_call = NULL;
 	}
-	if (ibdev->iboe.nb_inet.notifier_call) {
-		if (unregister_inetaddr_notifier(&ibdev->iboe.nb_inet))
-			pr_warn("failure unregistering notifier\n");
-		ibdev->iboe.nb_inet.notifier_call = NULL;
-	}
-#if IS_ENABLED(CONFIG_IPV6)
-	if (ibdev->iboe.nb_inet6.notifier_call) {
-		if (unregister_inet6addr_notifier(&ibdev->iboe.nb_inet6))
-			pr_warn("failure unregistering notifier\n");
-		ibdev->iboe.nb_inet6.notifier_call = NULL;
-	}
-#endif
 	flush_workqueue(wq);
 
 	mlx4_ib_close_sriov(ibdev);
@@ -2773,19 +2316,6 @@ static void mlx4_ib_remove(struct mlx4_dev *dev, void *ibdev_ptr)
 		kfree(ibdev->ib_uc_qpns_bitmap);
 	}
 
-	if (ibdev->iboe.nb_inet.notifier_call) {
-		if (unregister_inetaddr_notifier(&ibdev->iboe.nb_inet))
-			pr_warn("failure unregistering notifier\n");
-		ibdev->iboe.nb_inet.notifier_call = NULL;
-	}
-#if IS_ENABLED(CONFIG_IPV6)
-	if (ibdev->iboe.nb_inet6.notifier_call) {
-		if (unregister_inet6addr_notifier(&ibdev->iboe.nb_inet6))
-			pr_warn("failure unregistering notifier\n");
-		ibdev->iboe.nb_inet6.notifier_call = NULL;
-	}
-#endif
-
 	iounmap(ibdev->uar_map);
 	for (p = 0; p < ibdev->num_ports; ++p)
 		if (ibdev->counters[p] != -1)
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index c870ddb..19ffdab 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -473,12 +473,8 @@ struct mlx4_port_gid_table {
 struct mlx4_ib_iboe {
 	spinlock_t		lock;
 	struct net_device      *netdevs[MLX4_MAX_PORTS];
-	struct net_device      *masters[MLX4_MAX_PORTS];
 	atomic64_t		mac[MLX4_MAX_PORTS];
 	struct notifier_block 	nb;
-	struct notifier_block	nb_inet;
-	struct notifier_block	nb_inet6;
-	union ib_gid		gid_table[MLX4_MAX_PORTS][128];
 	struct mlx4_port_gid_table gids[MLX4_MAX_PORTS];
 };
 
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 02fc91c6..d4393a1 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1292,14 +1292,18 @@ static int _mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah,
 		path->static_rate = 0;
 
 	if (ah->ah_flags & IB_AH_GRH) {
-		if (ah->grh.sgid_index >= dev->dev->caps.gid_table_len[port]) {
+		int real_sgid_index = mlx4_ib_gid_index_to_real_index(dev,
+								      port,
+								      ah->grh.sgid_index);
+
+		if (real_sgid_index >= dev->dev->caps.gid_table_len[port]) {
 			pr_err("sgid_index (%u) too large. max is %d\n",
-			       ah->grh.sgid_index, dev->dev->caps.gid_table_len[port] - 1);
+			       real_sgid_index, dev->dev->caps.gid_table_len[port] - 1);
 			return -1;
 		}
 
 		path->grh_mylmc |= 1 << 7;
-		path->mgid_index = ah->grh.sgid_index;
+		path->mgid_index = real_sgid_index;
 		path->hop_limit  = ah->grh.hop_limit;
 		path->tclass_flowlabel =
 			cpu_to_be32((ah->grh.traffic_class << 20) |
-- 
2.1.0


* [PATCH for-next V5 12/12] RDMA/ocrdma: Changes in driver to incorporate the moving of GID Table mgmt to IB/Core.
       [not found] ` <1433772735-22416-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (10 preceding siblings ...)
  2015-06-08 14:12   ` [PATCH for-next V5 11/12] IB/mlx4: Replace mechanism for RoCE GID management Matan Barak
@ 2015-06-08 14:12   ` Matan Barak
       [not found]     ` <1433772735-22416-13-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-08 21:37   ` [PATCH for-next V5 00/12] Move RoCE GID management " Hefty, Sean
  12 siblings, 1 reply; 45+ messages in thread
From: Matan Barak @ 2015-06-08 14:12 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Moni Shoua, Jason Gunthorpe, Sean Hefty,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Somnath Kotur,
	Devesh Sharma

From: Somnath Kotur <somnath.kotur-laKkSmNT4hbQT0dZR+AlfA@public.gmane.org>

1. Check and set port capability flags to indicate RoCEv2 support.
2. Change the query_gid hook to return the value from the IB/Core GID
   management APIs.
3. Remove the inetaddr/inet6addr notifier chain subscription code and
   the in-memory SGID table.
4. Implement the get_netdev hook and a placeholder modify_gid hook in
   the driver.

Signed-off-by: Somnath Kotur <somnath.kotur-laKkSmNT4hbQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Devesh Sharma <devesh.sharma-laKkSmNT4hbQT0dZR+AlfA@public.gmane.org>
---
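Note (not part of the commit): items 1 and 2 above can be sketched in
user space roughly as follows. This is only an illustrative model, not
the driver code. The OCRDMA_* values are copied from the ocrdma_sli.h
hunk in this patch. Everything prefixed model_/MODEL_ is hypothetical,
including the "busy" flag, which stands in for whatever condition makes
ib_get_cached_gid() return -EAGAIN in this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Values copied from the ocrdma_sli.h hunk of this patch. */
enum {
	OCRDMA_L3_TYPE_IB_GRH = 0x00,
	OCRDMA_L3_TYPE_IPV4   = 0x01,
	OCRDMA_L3_TYPE_IPV6   = 0x02
};
#define OCRDMA_ROUDP_FLAGS_SHIFT 0x03

/* Hypothetical user-space stand-in for union ib_gid (16 raw bytes). */
union model_gid {
	uint8_t raw[16];
};

/* Same bit test as ocrdma_is_rocev2_supported(), taking the flags directly. */
static bool model_rocev2_supported(uint8_t roce_flags)
{
	return (roce_flags & (OCRDMA_L3_TYPE_IPV4 << OCRDMA_ROUDP_FLAGS_SHIFT)) ||
	       (roce_flags & (OCRDMA_L3_TYPE_IPV6 << OCRDMA_ROUDP_FLAGS_SHIFT));
}

/* Hypothetical cache entry: either readable or currently being updated. */
struct model_cache_entry {
	union model_gid gid;
	bool busy;
};

#define MODEL_MAX_SGID 4

/*
 * Stand-in for ib_get_cached_gid(): -EINVAL for a bad index,
 * -EAGAIN while the entry is being updated, 0 on success.
 */
static int model_get_cached_gid(const struct model_cache_entry *tbl,
				int index, union model_gid *gid)
{
	if (index < 0 || index >= MODEL_MAX_SGID)
		return -EINVAL;
	if (tbl[index].busy)
		return -EAGAIN;
	*gid = tbl[index].gid;
	return 0;
}

/*
 * Mirrors the shape of the new ocrdma_query_gid(): -EAGAIN from the
 * cache is reported to the caller as success with an all-zero GID;
 * any other result is passed through unchanged.
 */
static int model_query_gid(const struct model_cache_entry *tbl,
			   int index, union model_gid *sgid)
{
	int ret;

	assert(tbl != NULL && sgid != NULL);

	ret = model_get_cached_gid(tbl, index, sgid);
	if (ret == -EAGAIN) {
		memset(sgid, 0, sizeof(*sgid));
		return 0;
	}
	return ret;
}
```

With these definitions, a roce_flags value of 0x08 or 0x10 reports
RoCEv2 support, while 0x01 does not, because the L3 type bits are only
tested after the shift by OCRDMA_ROUDP_FLAGS_SHIFT. A query on a busy
entry returns 0 and an all-zero GID rather than an error.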
 drivers/infiniband/hw/ocrdma/ocrdma.h       |  10 ++
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c    |   3 +
 drivers/infiniband/hw/ocrdma/ocrdma_main.c  | 233 +---------------------------
 drivers/infiniband/hw/ocrdma/ocrdma_sli.h   |  13 ++
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |  31 +++-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h |   4 +
 6 files changed, 62 insertions(+), 232 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma.h b/drivers/infiniband/hw/ocrdma/ocrdma.h
index c9780d9..ea6484c 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma.h
@@ -99,6 +99,7 @@ struct ocrdma_dev_attr {
 	u8 local_ca_ack_delay;
 	u8 ird;
 	u8 num_ird_pages;
+	u8 roce_flags;
 };
 
 struct ocrdma_dma_mem {
@@ -574,4 +575,13 @@ static inline u8 ocrdma_is_enabled_and_synced(u32 state)
 		(state & OCRDMA_STATE_FLAG_SYNC);
 }
 
+static inline bool ocrdma_is_rocev2_supported(struct ocrdma_dev *dev)
+{
+	return (dev->attr.roce_flags & (OCRDMA_L3_TYPE_IPV4 <<
+					OCRDMA_ROUDP_FLAGS_SHIFT) ||
+		dev->attr.roce_flags & (OCRDMA_L3_TYPE_IPV6 <<
+					OCRDMA_ROUDP_FLAGS_SHIFT)) ?
+								true : false;
+}
+
 #endif
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
index 0c9e959..42116a5 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
@@ -1112,6 +1112,9 @@ static void ocrdma_get_attr(struct ocrdma_dev *dev,
 	attr->local_ca_ack_delay = (rsp->max_pd_ca_ack_delay &
 				    OCRDMA_MBX_QUERY_CFG_CA_ACK_DELAY_MASK) >>
 	    OCRDMA_MBX_QUERY_CFG_CA_ACK_DELAY_SHIFT;
+	attr->roce_flags = (rsp->max_pd_ca_ack_delay &
+				OCRDMA_MBX_QUERY_CFG_L3_TYPE_MASK) >>
+				OCRDMA_MBX_QUERY_CFG_L3_TYPE_SHIFT;
 	attr->max_mw = rsp->max_mw;
 	attr->max_mr = rsp->max_mr;
 	attr->max_mr_size = ((u64)rsp->max_mr_size_hi << 32) |
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index f552898..0d3e915 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -51,8 +51,6 @@ static LIST_HEAD(ocrdma_dev_list);
 static DEFINE_SPINLOCK(ocrdma_devlist_lock);
 static DEFINE_IDR(ocrdma_dev_id);
 
-static union ib_gid ocrdma_zero_sgid;
-
 void ocrdma_get_guid(struct ocrdma_dev *dev, u8 *guid)
 {
 	u8 mac_addr[6];
@@ -67,135 +65,6 @@ void ocrdma_get_guid(struct ocrdma_dev *dev, u8 *guid)
 	guid[6] = mac_addr[4];
 	guid[7] = mac_addr[5];
 }
-
-static bool ocrdma_add_sgid(struct ocrdma_dev *dev, union ib_gid *new_sgid)
-{
-	int i;
-	unsigned long flags;
-
-	memset(&ocrdma_zero_sgid, 0, sizeof(union ib_gid));
-
-
-	spin_lock_irqsave(&dev->sgid_lock, flags);
-	for (i = 0; i < OCRDMA_MAX_SGID; i++) {
-		if (!memcmp(&dev->sgid_tbl[i], &ocrdma_zero_sgid,
-			    sizeof(union ib_gid))) {
-			/* found free entry */
-			memcpy(&dev->sgid_tbl[i], new_sgid,
-			       sizeof(union ib_gid));
-			spin_unlock_irqrestore(&dev->sgid_lock, flags);
-			return true;
-		} else if (!memcmp(&dev->sgid_tbl[i], new_sgid,
-				   sizeof(union ib_gid))) {
-			/* entry already present, no addition is required. */
-			spin_unlock_irqrestore(&dev->sgid_lock, flags);
-			return false;
-		}
-	}
-	spin_unlock_irqrestore(&dev->sgid_lock, flags);
-	return false;
-}
-
-static bool ocrdma_del_sgid(struct ocrdma_dev *dev, union ib_gid *sgid)
-{
-	int found = false;
-	int i;
-	unsigned long flags;
-
-
-	spin_lock_irqsave(&dev->sgid_lock, flags);
-	/* first is default sgid, which cannot be deleted. */
-	for (i = 1; i < OCRDMA_MAX_SGID; i++) {
-		if (!memcmp(&dev->sgid_tbl[i], sgid, sizeof(union ib_gid))) {
-			/* found matching entry */
-			memset(&dev->sgid_tbl[i], 0, sizeof(union ib_gid));
-			found = true;
-			break;
-		}
-	}
-	spin_unlock_irqrestore(&dev->sgid_lock, flags);
-	return found;
-}
-
-static int ocrdma_addr_event(unsigned long event, struct net_device *netdev,
-			     union ib_gid *gid)
-{
-	struct ib_event gid_event;
-	struct ocrdma_dev *dev;
-	bool found = false;
-	bool updated = false;
-	bool is_vlan = false;
-
-	is_vlan = netdev->priv_flags & IFF_802_1Q_VLAN;
-	if (is_vlan)
-		netdev = rdma_vlan_dev_real_dev(netdev);
-
-	rcu_read_lock();
-	list_for_each_entry_rcu(dev, &ocrdma_dev_list, entry) {
-		if (dev->nic_info.netdev == netdev) {
-			found = true;
-			break;
-		}
-	}
-	rcu_read_unlock();
-
-	if (!found)
-		return NOTIFY_DONE;
-
-	mutex_lock(&dev->dev_lock);
-	switch (event) {
-	case NETDEV_UP:
-		updated = ocrdma_add_sgid(dev, gid);
-		break;
-	case NETDEV_DOWN:
-		updated = ocrdma_del_sgid(dev, gid);
-		break;
-	default:
-		break;
-	}
-	if (updated) {
-		/* GID table updated, notify the consumers about it */
-		gid_event.device = &dev->ibdev;
-		gid_event.element.port_num = 1;
-		gid_event.event = IB_EVENT_GID_CHANGE;
-		ib_dispatch_event(&gid_event);
-	}
-	mutex_unlock(&dev->dev_lock);
-	return NOTIFY_OK;
-}
-
-static int ocrdma_inetaddr_event(struct notifier_block *notifier,
-				  unsigned long event, void *ptr)
-{
-	struct in_ifaddr *ifa = ptr;
-	union ib_gid gid;
-	struct net_device *netdev = ifa->ifa_dev->dev;
-
-	ipv6_addr_set_v4mapped(ifa->ifa_address, (struct in6_addr *)&gid);
-	return ocrdma_addr_event(event, netdev, &gid);
-}
-
-static struct notifier_block ocrdma_inetaddr_notifier = {
-	.notifier_call = ocrdma_inetaddr_event
-};
-
-#if IS_ENABLED(CONFIG_IPV6)
-
-static int ocrdma_inet6addr_event(struct notifier_block *notifier,
-				  unsigned long event, void *ptr)
-{
-	struct inet6_ifaddr *ifa = (struct inet6_ifaddr *)ptr;
-	union  ib_gid *gid = (union ib_gid *)&ifa->addr;
-	struct net_device *netdev = ifa->idev->dev;
-	return ocrdma_addr_event(event, netdev, gid);
-}
-
-static struct notifier_block ocrdma_inet6addr_notifier = {
-	.notifier_call = ocrdma_inet6addr_event
-};
-
-#endif /* IPV6 and VLAN */
-
 static enum rdma_link_layer ocrdma_link_layer(struct ib_device *device,
 					      u8 port_num)
 {
@@ -263,6 +132,8 @@ static int ocrdma_register_device(struct ocrdma_dev *dev)
 	dev->ibdev.query_port = ocrdma_query_port;
 	dev->ibdev.modify_port = ocrdma_modify_port;
 	dev->ibdev.query_gid = ocrdma_query_gid;
+	dev->ibdev.get_netdev = ocrdma_get_netdev;
+	dev->ibdev.modify_gid = ocrdma_modify_gid;
 	dev->ibdev.get_link_layer = ocrdma_link_layer;
 	dev->ibdev.alloc_pd = ocrdma_alloc_pd;
 	dev->ibdev.dealloc_pd = ocrdma_dealloc_pd;
@@ -325,12 +196,6 @@ static int ocrdma_register_device(struct ocrdma_dev *dev)
 static int ocrdma_alloc_resources(struct ocrdma_dev *dev)
 {
 	mutex_init(&dev->dev_lock);
-	dev->sgid_tbl = kzalloc(sizeof(union ib_gid) *
-				OCRDMA_MAX_SGID, GFP_KERNEL);
-	if (!dev->sgid_tbl)
-		goto alloc_err;
-	spin_lock_init(&dev->sgid_lock);
-
 	dev->cq_tbl = kzalloc(sizeof(struct ocrdma_cq *) *
 			      OCRDMA_MAX_CQ, GFP_KERNEL);
 	if (!dev->cq_tbl)
@@ -362,7 +227,6 @@ static void ocrdma_free_resources(struct ocrdma_dev *dev)
 	kfree(dev->stag_arr);
 	kfree(dev->qp_tbl);
 	kfree(dev->cq_tbl);
-	kfree(dev->sgid_tbl);
 }
 
 /* OCRDMA sysfs interface */
@@ -408,68 +272,6 @@ static void ocrdma_remove_sysfiles(struct ocrdma_dev *dev)
 		device_remove_file(&dev->ibdev.dev, ocrdma_attributes[i]);
 }
 
-static void ocrdma_add_default_sgid(struct ocrdma_dev *dev)
-{
-	/* GID Index 0 - Invariant manufacturer-assigned EUI-64 */
-	union ib_gid *sgid = &dev->sgid_tbl[0];
-
-	sgid->global.subnet_prefix = cpu_to_be64(0xfe80000000000000LL);
-	ocrdma_get_guid(dev, &sgid->raw[8]);
-}
-
-static void ocrdma_init_ipv4_gids(struct ocrdma_dev *dev,
-				  struct net_device *net)
-{
-	struct in_device *in_dev;
-	union ib_gid gid;
-	in_dev = in_dev_get(net);
-	if (in_dev) {
-		for_ifa(in_dev) {
-			ipv6_addr_set_v4mapped(ifa->ifa_address,
-					       (struct in6_addr *)&gid);
-			ocrdma_add_sgid(dev, &gid);
-		}
-		endfor_ifa(in_dev);
-		in_dev_put(in_dev);
-	}
-}
-
-static void ocrdma_init_ipv6_gids(struct ocrdma_dev *dev,
-				  struct net_device *net)
-{
-#if IS_ENABLED(CONFIG_IPV6)
-	struct inet6_dev *in6_dev;
-	union ib_gid  *pgid;
-	struct inet6_ifaddr *ifp;
-	in6_dev = in6_dev_get(net);
-	if (in6_dev) {
-		read_lock_bh(&in6_dev->lock);
-		list_for_each_entry(ifp, &in6_dev->addr_list, if_list) {
-			pgid = (union ib_gid *)&ifp->addr;
-			ocrdma_add_sgid(dev, pgid);
-		}
-		read_unlock_bh(&in6_dev->lock);
-		in6_dev_put(in6_dev);
-	}
-#endif
-}
-
-static void ocrdma_init_gid_table(struct ocrdma_dev *dev)
-{
-	struct  net_device *net_dev;
-
-	for_each_netdev(&init_net, net_dev) {
-		struct net_device *real_dev = rdma_vlan_dev_real_dev(net_dev) ?
-				rdma_vlan_dev_real_dev(net_dev) : net_dev;
-
-		if (real_dev == dev->nic_info.netdev) {
-			ocrdma_add_default_sgid(dev);
-			ocrdma_init_ipv4_gids(dev, net_dev);
-			ocrdma_init_ipv6_gids(dev, net_dev);
-		}
-	}
-}
-
 static struct ocrdma_dev *ocrdma_add(struct be_dev_info *dev_info)
 {
 	int status = 0, i;
@@ -498,7 +300,6 @@ static struct ocrdma_dev *ocrdma_add(struct be_dev_info *dev_info)
 		goto alloc_err;
 
 	ocrdma_init_service_level(dev);
-	ocrdma_init_gid_table(dev);
 	status = ocrdma_register_device(dev);
 	if (status)
 		goto alloc_err;
@@ -645,34 +446,12 @@ static struct ocrdma_driver ocrdma_drv = {
 	.be_abi_version		= OCRDMA_BE_ROCE_ABI_VERSION,
 };
 
-static void ocrdma_unregister_inet6addr_notifier(void)
-{
-#if IS_ENABLED(CONFIG_IPV6)
-	unregister_inet6addr_notifier(&ocrdma_inet6addr_notifier);
-#endif
-}
-
-static void ocrdma_unregister_inetaddr_notifier(void)
-{
-	unregister_inetaddr_notifier(&ocrdma_inetaddr_notifier);
-}
-
 static int __init ocrdma_init_module(void)
 {
 	int status;
 
 	ocrdma_init_debugfs();
 
-	status = register_inetaddr_notifier(&ocrdma_inetaddr_notifier);
-	if (status)
-		return status;
-
-#if IS_ENABLED(CONFIG_IPV6)
-	status = register_inet6addr_notifier(&ocrdma_inet6addr_notifier);
-	if (status)
-		goto err_notifier6;
-#endif
-
 	status = be_roce_register_driver(&ocrdma_drv);
 	if (status)
 		goto err_be_reg;
@@ -680,19 +459,13 @@ static int __init ocrdma_init_module(void)
 	return 0;
 
 err_be_reg:
-#if IS_ENABLED(CONFIG_IPV6)
-	ocrdma_unregister_inet6addr_notifier();
-err_notifier6:
-#endif
-	ocrdma_unregister_inetaddr_notifier();
+
 	return status;
 }
 
 static void __exit ocrdma_exit_module(void)
 {
 	be_roce_unregister_driver(&ocrdma_drv);
-	ocrdma_unregister_inet6addr_notifier();
-	ocrdma_unregister_inetaddr_notifier();
 	ocrdma_rem_debugfs();
 }
 
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_sli.h b/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
index 243c87c..6b74eb9 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
@@ -125,6 +125,14 @@ enum {
 	OCRDMA_DB_RQ_SHIFT		= 24
 };
 
+enum {
+	OCRDMA_L3_TYPE_IB_GRH   = 0x00,
+	OCRDMA_L3_TYPE_IPV4     = 0x01,
+	OCRDMA_L3_TYPE_IPV6     = 0x02
+};
+
+#define OCRDMA_ROUDP_FLAGS_SHIFT	0x03
+
 #define OCRDMA_DB_CQ_RING_ID_MASK       0x3FF	/* bits 0 - 9 */
 #define OCRDMA_DB_CQ_RING_ID_EXT_MASK  0x0C00	/* bits 10-11 of qid at 12-11 */
 /* qid #2 msbits at 12-11 */
@@ -488,6 +496,9 @@ enum {
 	OCRDMA_MBX_QUERY_CFG_CA_ACK_DELAY_SHIFT		= 8,
 	OCRDMA_MBX_QUERY_CFG_CA_ACK_DELAY_MASK		= 0xFF <<
 				OCRDMA_MBX_QUERY_CFG_CA_ACK_DELAY_SHIFT,
+	OCRDMA_MBX_QUERY_CFG_L3_TYPE_SHIFT		 = 0,
+	OCRDMA_MBX_QUERY_CFG_L3_TYPE_MASK		= 0xFF <<
+				OCRDMA_MBX_QUERY_CFG_L3_TYPE_SHIFT,
 
 	OCRDMA_MBX_QUERY_CFG_MAX_SEND_SGE_SHIFT		= 0,
 	OCRDMA_MBX_QUERY_CFG_MAX_SEND_SGE_MASK		= 0xFFFF,
@@ -1049,6 +1060,8 @@ enum {
 	OCRDMA_QP_PARAMS_STATE_MASK		= BIT(5) | BIT(6) | BIT(7),
 	OCRDMA_QP_PARAMS_FLAGS_SQD_ASYNC	= BIT(8),
 	OCRDMA_QP_PARAMS_FLAGS_INB_ATEN		= BIT(9),
+	OCRDMA_QP_PARAMS_FLAGS_L3_TYPE_SHIFT	= 11,
+	OCRDMA_QP_PARAMS_FLAGS_L3_TYPE_MASK	= BIT(11) | BIT(12) | BIT(13),
 	OCRDMA_QP_PARAMS_MAX_SGE_RECV_SHIFT	= 16,
 	OCRDMA_QP_PARAMS_MAX_SGE_RECV_MASK	= 0xFFFF <<
 					OCRDMA_QP_PARAMS_MAX_SGE_RECV_SHIFT,
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index cf1f515..f1c4290 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -31,6 +31,7 @@
 #include <rdma/iw_cm.h>
 #include <rdma/ib_umem.h>
 #include <rdma/ib_addr.h>
+#include <rdma/ib_cache.h>
 
 #include "ocrdma.h"
 #include "ocrdma_hw.h"
@@ -49,6 +50,7 @@ int ocrdma_query_pkey(struct ib_device *ibdev, u8 port, u16 index, u16 *pkey)
 int ocrdma_query_gid(struct ib_device *ibdev, u8 port,
 		     int index, union ib_gid *sgid)
 {
+	int ret;
 	struct ocrdma_dev *dev;
 
 	dev = get_ocrdma_dev(ibdev);
@@ -56,7 +58,22 @@ int ocrdma_query_gid(struct ib_device *ibdev, u8 port,
 	if (index >= OCRDMA_MAX_SGID)
 		return -EINVAL;
 
-	memcpy(sgid, &dev->sgid_tbl[index], sizeof(*sgid));
+	ret = ib_get_cached_gid(ibdev, port, index, sgid);
+	if (ret == -EAGAIN) {
+		memcpy(sgid, &zgid, sizeof(*sgid));
+		return 0;
+	}
+
+	return ret;
+}
+
+int ocrdma_modify_gid(struct ib_device *ibdev, u8 port_num, unsigned int index,
+		      const union ib_gid *gid, const struct ib_gid_attr *attr,
+		      void **context)
+{
+	struct ocrdma_dev *dev;
+
+	dev = get_ocrdma_dev(ibdev);
 
 	return 0;
 }
@@ -106,6 +123,15 @@ int ocrdma_query_device(struct ib_device *ibdev, struct ib_device_attr *attr)
 	return 0;
 }
 
+struct net_device *ocrdma_get_netdev(struct ib_device *ibdev, u8 port_num)
+{
+	struct ocrdma_dev *dev = get_ocrdma_dev(ibdev);
+
+	if (dev)
+		return dev->nic_info.netdev;
+
+	return NULL;
+}
 static inline void get_link_speed_and_width(struct ocrdma_dev *dev,
 					    u8 *ib_speed, u8 *ib_width)
 {
@@ -175,7 +201,8 @@ int ocrdma_query_port(struct ib_device *ibdev,
 	props->port_cap_flags =
 	    IB_PORT_CM_SUP |
 	    IB_PORT_REINIT_SUP |
-	    IB_PORT_DEVICE_MGMT_SUP | IB_PORT_VENDOR_CLASS_SUP | IB_PORT_IP_BASED_GIDS;
+	    IB_PORT_DEVICE_MGMT_SUP | IB_PORT_VENDOR_CLASS_SUP |
+	    IB_PORT_IP_BASED_GIDS | IB_PORT_ROCE;
 	props->gid_tbl_len = OCRDMA_MAX_SGID;
 	props->pkey_tbl_len = 1;
 	props->bad_pkey_cntr = 0;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
index 3cdc81e..b24795c 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
@@ -47,6 +47,10 @@ ocrdma_query_protocol(struct ib_device *device, u8 port_num);
 void ocrdma_get_guid(struct ocrdma_dev *, u8 *guid);
 int ocrdma_query_gid(struct ib_device *, u8 port,
 		     int index, union ib_gid *gid);
+struct net_device *ocrdma_get_netdev(struct ib_device *device, u8 port_num);
+int ocrdma_modify_gid(struct ib_device *ibdev, u8 port_num, unsigned int index,
+		      const union ib_gid *gid, const struct ib_gid_attr *attr,
+		      void **context);
 int ocrdma_query_pkey(struct ib_device *, u8 port, u16 index, u16 *pkey);
 
 struct ib_ucontext *ocrdma_alloc_ucontext(struct ib_device *,
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* RE: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found] ` <1433772735-22416-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (11 preceding siblings ...)
  2015-06-08 14:12   ` [PATCH for-next V5 12/12] RDMA/ocrdma: Changes in driver to incorporate the moving of GID Table mgmt to IB/Core Matan Barak
@ 2015-06-08 21:37   ` Hefty, Sean
       [not found]     ` <1828884A29C6694DAF28B7E6B8A82373A8FE5D17-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  12 siblings, 1 reply; 45+ messages in thread
From: Hefty, Sean @ 2015-06-08 21:37 UTC (permalink / raw)
  To: Matan Barak, Doug Ledford
  Cc: Or Gerlitz, Moni Shoua, Jason Gunthorpe, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

> Previously, every vendor implemented its net device notifiers in its own
> driver. This introduces a huge code duplication as figuring


>  28 files changed, 2253 insertions(+), 860 deletions(-)

How does adding 1400 lines of code help reduce code duplication?

Can you please explain and justify why this change is actually needed?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]     ` <1828884A29C6694DAF28B7E6B8A82373A8FE5D17-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-09  7:27       ` Matan Barak
       [not found]         ` <55769561.8000300-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Matan Barak @ 2015-06-09  7:27 UTC (permalink / raw)
  To: Hefty, Sean, Doug Ledford
  Cc: Or Gerlitz, Moni Shoua, Jason Gunthorpe, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA



On 6/9/2015 12:37 AM, Hefty, Sean wrote:
>> Previously, every vendor implemented its net device notifiers in its own
>> driver. This introduces a huge code duplication as figuring
>
>
>>   28 files changed, 2253 insertions(+), 860 deletions(-)
>
> How does adding 1400 lines of code help reduce code duplication?
>
> Can you please explain and justify why this change is actually needed?
>

Let's look at this change from several perspectives:

(1) Each vendor lost ~250 lines of GID management code just by this
change. More vendor drivers will very probably implement RoCE in the
future. This removes the burden and code duplication they would need
to implement full RoCE support, and it is a lot more scalable than
the current approach.

(2) All vendors are now aligned. For example, the mlx4 driver had bonding
support but ocrdma didn't. The user expects the same behavior
regardless of the vendor's driver.

(3) Making something more general usually requires more lines of
code, as it introduces an API and doesn't cut corners by assuming
anything about the vendor's driver.

(4) This is a prerequisite for the RoCE V2 series. I'm sure you remember
we first submitted this patch-set as a part of the RoCE V2 series.
Adding more features to the RoCE GID management will make the code
duplication a lot worse than just ~250 lines. I don't think it's fair
playing "let's divide the RoCE V2 patch-set into several patch-sets" and
then saying "why do we need this <first part> at all". In any case, the
other three reasons are more than enough IMHO.

Matan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]         ` <55769561.8000300-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-10  8:53           ` Or Gerlitz
       [not found]             ` <5577FAFB.8020205-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Or Gerlitz @ 2015-06-10  8:53 UTC (permalink / raw)
  To: Hefty, Sean, Doug Ledford, Jason Gunthorpe
  Cc: Matan Barak, Moni Shoua, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 6/9/2015 10:27 AM, Matan Barak wrote:
>
>
> On 6/9/2015 12:37 AM, Hefty, Sean wrote:
>>> Previously, every vendor implemented its net device notifiers in its 
>>> own
>>> driver. This introduces a huge code duplication as figuring
>>
>>
>>>   28 files changed, 2253 insertions(+), 860 deletions(-)
>>
>> How does adding 1400 lines of code help reduce code duplication?
>>
>> Can you please explain and justify why this change is actually needed?
>>
>
> Let's look at this change from several perspectives:
>
> (1) Each vendor lost ~250 lines of GID management code just by this
> change. More vendor drivers will very probably implement RoCE in the
> future. This removes the burden and code duplication they would need
> to implement full RoCE support, and it is a lot more scalable than
> the current approach.
>
> (2) All vendors are now aligned. For example, the mlx4 driver had bonding
> support but ocrdma didn't. The user expects the same behavior
> regardless of the vendor's driver.
>
> (3) Making something more general usually requires more lines of
> code, as it introduces an API and doesn't cut corners by assuming
> anything about the vendor's driver.
>
> (4) This is a prerequisite for the RoCE V2 series. I'm sure you
> remember we first submitted this patch-set as a part of the RoCE V2
> series. Adding more features to the RoCE GID management will make the
> code duplication a lot worse than just ~250 lines. I don't think it's
> fair playing "let's divide the RoCE V2 patch-set into several patch-sets"
> and then saying "why do we need this <first part> at all". In any case,
> the other three reasons are more than enough IMHO.
>

Sean, this change is needed b/c two existing drivers (mlx4 and ocrdma) and
two more to come soon (mlx5 and soft-RoCE) would all have the very same
logic of constructing the port GID table according to netdev events and
such; there is no point in repeating this logic/code over and over.

Matan explained why we don't have 2 x Y deletions and 1 x Y insertions.

Jason, can you ack that this post addressed your comments?

Or.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]             ` <5577FAFB.8020205-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-10 15:00               ` Jason Gunthorpe
       [not found]                 ` <20150610150010.GA11243-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-06-10 15:09               ` Hefty, Sean
  1 sibling, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2015-06-10 15:00 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Hefty, Sean, Doug Ledford, Matan Barak, Moni Shoua,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Jun 10, 2015 at 11:53:15AM +0300, Or Gerlitz wrote:

> Jason, can you ack that this post addressed your comments?

Well, I asked for a cleanup series, multiple times, and this is the
closest things have got.

It isn't really a cleanup because the whole gid table is new code and
has latent elements for rocev2 - this is why it is so much bigger than
it should be.

The other core parts have been mostly trimmed, so that is the specific
things discussed last round.

Is it Ok to go ahead with the gid table as is? I don't know, I haven't
studied the patch in any detail. Technically, that is not best
practice for kernel development process.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                 ` <20150610150010.GA11243-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-10 15:08                   ` Matan Barak
       [not found]                     ` <557852EE.5030107-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-11  0:15                   ` Doug Ledford
  1 sibling, 1 reply; 45+ messages in thread
From: Matan Barak @ 2015-06-10 15:08 UTC (permalink / raw)
  To: Jason Gunthorpe, Or Gerlitz
  Cc: Hefty, Sean, Doug Ledford, Moni Shoua, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA



On 6/10/2015 6:00 PM, Jason Gunthorpe wrote:
> On Wed, Jun 10, 2015 at 11:53:15AM +0300, Or Gerlitz wrote:
>
>> Jason, can you ack that this post addressed your comments?
>
> Well, I asked for a cleanup series, multiple times, and this is the
> closest things have got.
>
> It isn't really a cleanup because the whole gid table is new code and
> has latent elements for rocev2 - this is why it is so much bigger than
> it should be.
>

I disagree. Could you please point to anything that is RoCE V2 specific?
The essence of RoCE V2 in the previous series was the gid_type part.
This is now completely removed.

> The other core parts have been mostly trimmed, so that is the specific
> things discussed last round.
>
> Is it Ok to go ahead with the gid table as is? I don't know, I haven't
> studied the patch in any detail. Technically, that is not best
> practice for kernel development process.
>
> Jason
>

Matan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]             ` <5577FAFB.8020205-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-06-10 15:00               ` Jason Gunthorpe
@ 2015-06-10 15:09               ` Hefty, Sean
       [not found]                 ` <1828884A29C6694DAF28B7E6B8A82373A8FE6616-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  1 sibling, 1 reply; 45+ messages in thread
From: Hefty, Sean @ 2015-06-10 15:09 UTC (permalink / raw)
  To: Or Gerlitz, Doug Ledford, Jason Gunthorpe
  Cc: Matan Barak, Moni Shoua, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

> Sean, this change is needed b/c two existing drivers (mlx4 and ocrdma) and
> two more to come soon (mlx5 and soft-RoCE) would all have the very same
> logic of constructing the port GID table according to netdev events and
> such; there is no point in repeating this logic/code over and over.
> 
> Matan explained why we don't have 2 x Y deletions and 1 x Y insertions.

It more than doubles the amount of code.  That's not a cleanup.  It introduces a bunch of new functionality.  Jason has asked repeatedly to remove the RoCEv2 code, and that has been ignored repeatedly.  As far as I'm concerned, this patch is not worth my time, and I will no longer even bother following this series.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                 ` <1828884A29C6694DAF28B7E6B8A82373A8FE6616-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-06-10 15:19                   ` Matan Barak
  0 siblings, 0 replies; 45+ messages in thread
From: Matan Barak @ 2015-06-10 15:19 UTC (permalink / raw)
  To: Hefty, Sean, Or Gerlitz, Doug Ledford, Jason Gunthorpe
  Cc: Moni Shoua, Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA



On 6/10/2015 6:09 PM, Hefty, Sean wrote:
>> Sean, this change is needed b/c two existing drivers (mlx4 and ocrdma) and
>> two more to come soon (mlx5 and soft-RoCE) would all have the very same
>> logic of constructing the port GID table according to netdev events and
>> such; there is no point in repeating this logic/code over and over.
>>
>> Matan explained why we don't have 2 x Y deletions and 1 x Y insertions.
>
> It more than doubles the amount of code.  That's not a cleanup.  It introduces a bunch of new functionality.  Jason has asked repeatedly to remove the RoCEv2 code, and that has been ignored repeatedly.  As far as I'm concerned, this patch is not worth my time, and I will no longer even bother following this series.

Well, saying that without giving a single piece of evidence of RoCE V2
code in this series is simply nonsense. Regarding Jason's comments, all
of them were either fixed or answered.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                     ` <557852EE.5030107-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-10 18:49                       ` Jason Gunthorpe
       [not found]                         ` <20150610184954.GA26404-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2015-06-10 18:49 UTC (permalink / raw)
  To: Matan Barak
  Cc: Or Gerlitz, Hefty, Sean, Doug Ledford, Moni Shoua, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Jun 10, 2015 at 06:08:30PM +0300, Matan Barak wrote:
> >It isn't really a cleanup because the whole gid table is new code and
> >has latent elements for rocev2 - this is why it is so much bigger than
> >it should be.
> 
> > I disagree. Could you please point to anything that is RoCE V2 specific?
> The essence of RoCE V2 in the previous series was the gid_type part.
> This is now completely removed.

Sure gid_type is gone, but I didn't say rocev2 specific, I said
latent elements. i.e. I'm assuming reasons for the scary locking are
because the ripped out rocev2 code needed it?  And some of the
complexity that looks pointless now was supporting ripped out rocev2
elements? That is not necessarily bad, but the code had better be good
quality and working.

But then I look at the patches, and the very first locking I test out
looks wrong. I see call_rcu/synchronize_rcu being used without a
single call to rcu_read_lock. So this fails #2 of the RCU review
checklist (Seriously? Why am I catching this?)

I stopped reading at that point.

I think you've got the right basic idea for a cleanup series here. It
is time to buckle down and execute it well. Do an internal mellanox
kernel team review of this series. Audit and fix all the locking,
evaluate the code growth and design. Audit to confirm there is no
functional change that is not documented in a commit message. Tell me
v6 is the best effort *team Mellanox* can put forward.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                         ` <20150610184954.GA26404-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-10 20:19                           ` Matan Barak
       [not found]                             ` <CAAKD3BB90iZ98B2ADG+=ZYuEVtLq26a99BEjQCR8U1vzvcG+Gw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-06-11  1:06                           ` Doug Ledford
  1 sibling, 1 reply; 45+ messages in thread
From: Matan Barak @ 2015-06-10 20:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matan Barak, Or Gerlitz, Hefty, Sean, Doug Ledford, Moni Shoua,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Jun 10, 2015 at 9:49 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Wed, Jun 10, 2015 at 06:08:30PM +0300, Matan Barak wrote:
>> >It isn't really a cleanup because the whole gid table is new code and
>> >has latent elements for rocev2 - this is why it is so much bigger than
>> >it should be.
>>
>> I disagree. Could you please point to anything that is RoCE V2 specific?
>> The essence of RoCE V2 in the previous series was the gid_type part.
>> This is now completely removed.
>
> Sure gid_type is gone, but I didn't say rocev2 specific, I said
> latent elements. i.e. I'm assuming reasons for the scary locking are
> because the ripped out rocev2 code needed it?  And some of the
> complexity that looks pointless now was supporting ripped out rocev2
> elements? That is not necessarily bad, but the code had better be good
> quality and working.
>

Why do you think the locks have anything to do with roce v2?

> But then I look at the patches, and the very first locking I test out
> looks wrong. I see call_rcu/synchronize_rcu being used without a
> single call to rcu_read_lock. So this fails #2 of the RCU review
> checklist (Seriously? Why am I catching this?)
>
> I stopped reading at that point.
>

Well, that's easy to explain - write_gid could be called concurrently
with one of roce_gid_table's find APIs.
The find API also returns a ndev. The RCU solves the following scenario:


                 find is called and returns a ndev
write_gid is called and calls dev_put(ndev)
ndev is freed

                 find uses the ndev


By calling the find API in RCU, your ndev is protected.

> I think you've got the right basic idea for a cleanup series here. It
> is time to buckle down and execute it well. Do an internal mellanox
> kernel team review of this series. Audit and fix all the locking,
> evaluate the code growth and design. Audit to confirm there is no
> functional change that is not documented in a commit message. Tell me
> v6 is the best effort *team Mellanox* can put forward.
>

Jason, I really appreciate your review. If you have any comments, I
would like to either fix or write you back. This series wasn't sent
without being looked at by the internal team here.

> Jason

Matan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                             ` <CAAKD3BB90iZ98B2ADG+=ZYuEVtLq26a99BEjQCR8U1vzvcG+Gw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-06-10 22:01                               ` Jason Gunthorpe
       [not found]                                 ` <20150610220154.GA4391-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2015-06-10 22:01 UTC (permalink / raw)
  To: Matan Barak
  Cc: Matan Barak, Or Gerlitz, Hefty, Sean, Doug Ledford, Moni Shoua,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Jun 10, 2015 at 11:19:03PM +0300, Matan Barak wrote:

> > Sure gid_type is gone, but I didn't say rocev2 specific, I said
> > latent elements. i.e. I'm assuming reasons for the scary locking are
> > because the ripped out rocev2 code needed it?  And some of the
> > complexity that looks pointless now was supporting ripped out rocev2
> > elements? That is not necessarily bad, but the code had better be good
> > quality and working.
> 
> Why do you think the locks have anything to do with roce v2?

What else could they be for? The current mlx4 driver doesn't use
aggressive performance locking.

After writing this email, I am of the opinion that the locking should
be simplified to rwsem and mutex, and every use of rcu, READ_ONCE and
seqlock should be ditched.

> > But then I look at the patches, and the very first locking I test out
> > looks wrong. I see call_rcu/synchronize_rcu being used without a
> > single call to rcu_read_lock. So this fails #2 of the RCU review
> > checklist (Seriously? Why am I catching this?)
> >
> > I stopped reading at that point.
> >
> 
> Well, that's easy to explain - write_gid could be called concurrently
> with one of roce_gid_table's find APIs.

That doesn't explain anything.

You can't use call_rcu without also using rcu_dereference and
rcu_read_lock. It doesn't make any sense otherwise.

Your explanation seems confused too, did you research this? Did you
read the RCU checklist? Is this a knee-jerk reply? Please be thoughtful.

>  find is called and returns a ndev
>  write_gid is called and calls dev_put(ndev)
>  ndev is freed
>  find uses the ndev

Are you trying to say that this rcu is protecting this:

+static int find_gid(struct ib_roce_gid_table *table, const union ib_gid *gid,
+		    const struct ib_gid_attr *val, unsigned long mask)
+{
[..]
+		if (mask & GID_ATTR_FIND_MASK_NETDEV &&
+		    attr->ndev != val->ndev)
+			continue;

That is an unlocked access to a RCU protected value, without
rcu_dereference. Fails two points on the RCU checklist.

Where does it return ndev?

Honestly, since RCU is done wrong, and I'm very suspicious seqlock is
done wrong too, I would *strongly* encourage v6 to have simple
read/write sem and mutex locking and nothing fancy for performance. I
don't want to go round and round on subtle performance locking for a
*cleanup patch*.

There is also this RCU confusion:

+				rcu_read_lock();
+				if (ib_dev->get_netdev)
+					idev = ib_dev->get_netdev(ib_dev, port);

When holding the rcu_read_lock it should be obvious what the RCU
protected data is. There is no way holding it around a driver call
back makes any sense.

The driver should return a held netdev or null.
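
A minimal userspace sketch of the "held or NULL" convention, assuming a
pthread mutex and a C11 atomic counter stand in for the kernel's locking
and for dev_hold()/dev_put(). Every fake_* name is hypothetical and is
not part of the series:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device: only a reference count. */
struct fake_netdev {
	atomic_int refcnt;
};

/* Hypothetical stand-in for a device that publishes one netdev pointer. */
struct fake_ibdev {
	pthread_mutex_t lock;		/* protects the ndev pointer */
	struct fake_netdev *ndev;	/* may be NULL */
};

static void fake_dev_hold(struct fake_netdev *nd)
{
	atomic_fetch_add(&nd->refcnt, 1);
}

/* Drop one reference and return how many are left. */
static int fake_dev_put(struct fake_netdev *nd)
{
	int left = atomic_fetch_sub(&nd->refcnt, 1) - 1;

	assert(left >= 0);
	return left;
}

/* Return the netdev with a reference held, or NULL if none is attached. */
static struct fake_netdev *fake_get_netdev(struct fake_ibdev *dev)
{
	struct fake_netdev *nd;

	pthread_mutex_lock(&dev->lock);
	nd = dev->ndev;
	if (nd)
		fake_dev_hold(nd);	/* taken before the lock is dropped */
	pthread_mutex_unlock(&dev->lock);
	return nd;
}

/* Detach the netdev and drop the reference owned by the device. */
static void fake_clear_netdev(struct fake_ibdev *dev)
{
	struct fake_netdev *old;

	pthread_mutex_lock(&dev->lock);
	old = dev->ndev;
	dev->ndev = NULL;
	pthread_mutex_unlock(&dev->lock);
	if (old)
		fake_dev_put(old);
}
```

Because the reference is taken while the pointer is still protected, the
caller can keep using the returned object after a concurrent
fake_clear_netdev(), and it drops its own reference when done.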

.. and maybe more, I stopped looking

> By calling the find API in RCU, your ndev is protected.

When implementing locking, identify the data being locked, and
confirm that every possible access to that data follows the required
locking rules. In this case the data being locked is the
table->data_vec[ix].attr.ndev pointer.

It was the very first thing I checked, in the very first patch.

> > I think you've got the right basic idea for a cleanup series here. It
> > is time to buckle down and execute it well. Do an internal mellanox
> > kernel team review of this series. Audit and fix all the locking,
> > evaluate the code growth and design. Audit to confirm there is no
> > functional change that is not documented in a commit message. Tell me
> > v6 is the best effort *team Mellanox* can put forward.
> 
> Jason, I really appreciate your review. If you have any comments, I
> would like to either fix or write you back. This series wasn't sent
> without being looked at by the internal team here.

Well, I am looking at this thinking I don't want to invest time in
searching for things I think your team can find on its own.

Take a breather, produce v6 very carefully.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                 ` <20150610150010.GA11243-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-06-10 15:08                   ` Matan Barak
@ 2015-06-11  0:15                   ` Doug Ledford
       [not found]                     ` <1433981756.71666.60.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 45+ messages in thread
From: Doug Ledford @ 2015-06-11  0:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Or Gerlitz, Hefty, Sean, Matan Barak, Moni Shoua, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, 2015-06-10 at 09:00 -0600, Jason Gunthorpe wrote:
> On Wed, Jun 10, 2015 at 11:53:15AM +0300, Or Gerlitz wrote:
> 
> > Jason, can you ack that this post addressed your comments?
> 
> Well, I asked for a cleanup series, multiple times, and this is the
> closest things have got.
> 
> It isn't really a cleanup because the whole gid table is new code and
> has latent elements for rocev2 - this is why it is so much bigger than
> it should be.

I'm not sure the complexity here is "latent RoCEv2" stuff versus simple
over-design.  I didn't see anything in the RoCEv2 that warranted this
level of complexity either.

Just to be clear, I'm currently reviewing the RCU usage here.  Jason has
brought up specific issues; if I can't convince myself that his
objections to the RCU usage are wrong, then I'm going to second his
request that we go back to a simpler rwlock.

> The other core parts have been mostly trimmed, so that is the specific
> things discussed last round.
> 
> Is it Ok to go ahead with the gid table as is? I don't know, I haven't
> studied the patch in any detail. Technically, that is not best
> practice for kernel development process.
> 
> Jason


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                         ` <20150610184954.GA26404-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-06-10 20:19                           ` Matan Barak
@ 2015-06-11  1:06                           ` Doug Ledford
       [not found]                             ` <1433984788.71666.78.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 45+ messages in thread
From: Doug Ledford @ 2015-06-11  1:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matan Barak, Or Gerlitz, Hefty, Sean, Moni Shoua, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, 2015-06-10 at 12:49 -0600, Jason Gunthorpe wrote:
> On Wed, Jun 10, 2015 at 06:08:30PM +0300, Matan Barak wrote:
> > >It isn't really a cleanup because the whole gid table is new code and
> > >has latent elements for rocev2 - this is why it is so much bigger than
> > >it should be.
> > 
> > I disagree. Could you please point to anything that is RoCE V2 specific?
> > The essence of RoCE V2 in the previous series was the gid_type part.
> > This is now completely removed.
> 
> Sure gid_type is gone, but I didn't say rocev2 specific, I said
> latent elements. i.e. I'm assuming reasons for the scary locking are
> because the ripped out rocev2 code needed it?  And some of the
> complexity that looks pointless now was supporting ripped out rocev2
> elements? That is not necessarily bad, but the code had better be good
> quality and working.
> 
> But then I look at the patches, and the very first locking I test out
> looks wrong. I see call_rcu/synchronize_rcu being used without a
> single call to rcu_read_lock. So this fails #2 of the RCU review
> checklist (Seriously? Why am I catching this?)
> 
> I stopped reading at that point.

The way they chose to split up patches is part of the problem here.

People tend to push the "patches should be small, self contained,
incremental" ideal.  In some cases, that gets carried to an extreme.  In
this case, patch 1 introduces one side of the locking and patch 3 and 5
introduce the other halves.

In all, this needs to be re-ordered first off:

Patch 4 should be 1 and netdev@ should be Cc:ed
Patch 6 should be 2 and netdev@ should be Cc:ed
Patch 2 should be 3 (or just all by itself in a separate submission)
Patch 1, 3, and 5 should be squashed down to a single patch so that the
locking can actually be analyzed for correctness.

> I think you've got the right basic idea for a cleanup series here. It
> is time to buckle down and execute it well.

Except that this isn't really a cleanup, and calling it that clouds the
issue.  Both the mlx4 and ocrdma drivers implement incomplete RoCE gid
management support.  If this were a true cleanup, they would just merge
the support from mlx4 and ocrdma to core and switch the drivers over to
it.  But that's not the case.  The new core code implements everything
that the two drivers do, and then some more.  And in the process it
standardizes some things that weren't standardized before.  So, a much
more accurate description of this would be to say that the patchset
implements a core RoCE GID management that is a much more complete
management engine than either driver's engine, and that the last few
patches remove the partial engines from the drivers and switch the
drivers over to the core engine.

My only complaints so far are these:

1)  I would have preferred that this be treated just like the other ib
cache items.  The sources of changes are different, but in essence, the
RoCE gid table *is* a cache.  It's not real until the hardware writes
it.  I would have preferred to see the handling of the roce_gid_table
all contained in the cache file with the other cache operations.  If you
wanted to keep the management portion in its own file, that I would have
been fine with, but anything that manipulated the table should have been
with the other cache manipulation functions.

2)  I'm not convinced at all that RCU was needed and that a rwlock
wouldn't have been sufficient.  What drove you to use RCU and do you
have numbers to back up that it matters?

>  Do an internal mellanox
> kernel team review of this series. Audit and fix all the locking,
> evaluate the code growth and design. Audit to confirm there is no
> functional change that is not documented in a commit message. Tell me
> v6 is the best effort *team Mellanox* can put forward.
> 
> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                             ` <1433984788.71666.78.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-06-11  3:57                               ` Jason Gunthorpe
       [not found]                                 ` <20150611035727.GA16599-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-06-11 10:09                               ` Matan Barak
  1 sibling, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2015-06-11  3:57 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Hefty, Sean, Moni Shoua, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Jun 10, 2015 at 09:06:28PM -0400, Doug Ledford wrote:

> People tend to push the "patches should be small, self contained,
> incremental" ideal.  In some cases, that gets carried to an extreme.  In
> this case, patch 1 introduces one side of the locking and patch 3 and 5
> introduce the other halves.

I already did spot check patches 3 and 5 for exactly that. They add
other uses of RCU, but they appear to be totally different -
objectionable for style reasons, but probably not incorrect.

For instance the rcu lock grabs in patch 3 and 5 are protecting the
call to netdev_master_upper_dev_get_rcu in patch 10. 'get_netdev' is
more correctly called 'get_netdev_rcu' in this design. (as I said, this
placement of rcu_read_lock is ugly).

.. and just searching through the patches for 'rcu' to write this, I
noticed this:

+void ib_enum_roce_ports_of_netdev(roce_netdev_filter filter,
[..]
+	down_read(&lists_rwsem);
+	list_for_each_entry_rcu(dev, &device_list, core_list)
+	     ib_dev_roce_ports_of_netdev(dev, filter, filter_cookie, cb,
+					    cookie);
+	up_read(&lists_rwsem);

Shouldn't call list_for_each_entry_rcu under a rwsem, this is just left over
from the old locking regime...

> > I think you've got the right basic idea for a cleanup series here. It
> > is time to buckle down and execute it well.
> 
> Except that this isn't really a cleanup, and calling it that clouds the
> issue.

Well, I've been asking for a cleanup .. The entire goal is to make
things more reviewable and a no-functional-change cleanup would sure
help that..

> it. But that's not the case.  The new core code implements everything
> that the two drivers do, and then some more.

I'd be interested to see a list of the 'some more' included in the
patch comments, I didn't look with a fine toothed comb, but not much
functional stood out to me...

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                     ` <1433981756.71666.60.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-06-11  4:07                       ` Jason Gunthorpe
  2015-06-11  9:51                       ` Matan Barak
  1 sibling, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2015-06-11  4:07 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Or Gerlitz, Hefty, Sean, Matan Barak, Moni Shoua, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Jun 10, 2015 at 08:15:56PM -0400, Doug Ledford wrote:
> I'm not sure the complexity here is "latent RoCEv2" stuff versus simple
> over-design.

Well, for instance, the wrong RCU locking around
table->data_vec[ix].attr.ndev appears to exist to support find_gid
when called with GID_ATTR_FIND_MASK_NETDEV outside callers that hold
table->lock.

However, the unlocked call pattern is never used with
GID_ATTR_FIND_MASK_NETDEV - that possibility was punted in v5, but the
code to support it and the broken RCU to help it are still
present, and will eventually be needed by rocev2 and roce namespaces..

That is just one small example, I fully expect there are others, and I
think your remark about 'over-design' is likely equally contributing
to the size expansion as well.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 12/12] RDMA/ocrdma: Changes in driver to incorporate the moving of GID Table mgmt to IB/Core.
       [not found]     ` <1433772735-22416-13-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-11  4:11       ` Jason Gunthorpe
       [not found]         ` <20150611041124.GC16599-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2015-06-11  4:11 UTC (permalink / raw)
  To: Matan Barak
  Cc: Doug Ledford, Or Gerlitz, Moni Shoua, Sean Hefty, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Somnath Kotur, Devesh Sharma

On Mon, Jun 08, 2015 at 05:12:15PM +0300, Matan Barak wrote:
> From: Somnath Kotur <somnath.kotur-laKkSmNT4hbQT0dZR+AlfA@public.gmane.org>
> 
> 1.Check and set port capability flags to indicate RoCEV2 support.

??? This series has nothing to do with rocev2 now, what is this about?

>  	mutex_init(&dev->dev_lock);
> -	dev->sgid_tbl = kzalloc(sizeof(union ib_gid) *
> -				OCRDMA_MAX_SGID, GFP_KERNEL);

Should sgid_tbl be dropped from the structure?

> +int ocrdma_modify_gid(struct ib_device *ibdev, u8 port_num, unsigned int index,
> +		      const union ib_gid *gid, const struct ib_gid_attr *attr,
> +		      void **context)
> +{
> +	struct ocrdma_dev *dev;
> +
> +	dev = get_ocrdma_dev(ibdev);
>  
>  	return 0;
>  }

Empty modify gid? Shouldn't it be completely empty?

This is correct? This HW sends the full SGID in the WQE?

> +enum {
> +     OCRDMA_L3_TYPE_IB_GRH   = 0x00,
> +     OCRDMA_L3_TYPE_IPV4     = 0x01,
> +     OCRDMA_L3_TYPE_IPV6     = 0x02
> +};

These added constants are not used? Probably others as well?

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 03/12] IB/core: Add RoCE GID population
       [not found]     ` <1433772735-22416-4-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-11  4:18       ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2015-06-11  4:18 UTC (permalink / raw)
  To: Matan Barak
  Cc: Doug Ledford, Or Gerlitz, Moni Shoua, Sean Hefty, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, Jun 08, 2015 at 05:12:06PM +0300, Matan Barak wrote:
>  drivers/infiniband/core/core_priv.h      |  26 ++
>  drivers/infiniband/core/device.c         |  77 +++++

I wouldn't mind seeing the core portion which consists of adding
the get_netdev be its own little mini-series of three, adding the
core and the two driver enablement changes.

> +				rcu_read_lock();
> +				if (ib_dev->get_netdev)
> +					idev = ib_dev->get_netdev(ib_dev, port);

If it wasn't clear from before, don't hold a rcu_read_lock around a
driver callback, return a net_device that is already dev_hold'd or
null.
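
A rough userspace sketch of that contract, not the actual driver code:
struct fake_netdev, fake_dev_hold(), fake_dev_put() and
fake_get_netdev() are hypothetical stand-ins for struct net_device,
dev_hold(), dev_put() and the get_netdev callback. The callback hands
back a pointer that already carries a reference, so the core caller
needs no lock around the call and simply drops the reference when done.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device and its refcount. */
struct fake_netdev {
	atomic_int refcnt;
};

static void fake_dev_hold(struct fake_netdev *nd)
{
	atomic_fetch_add(&nd->refcnt, 1);
}

static void fake_dev_put(struct fake_netdev *nd)
{
	int old = atomic_fetch_sub(&nd->refcnt, 1);

	assert(old > 0);	/* never drop a reference we do not hold */
	(void)old;
}

struct fake_ib_port {
	struct fake_netdev *ndev;	/* NULL when no netdev is bound */
};

/*
 * Driver callback: returns the port's netdev with a reference already
 * taken, or NULL. Any locking needed for the lookup itself would live
 * inside the driver; that part is elided in this sketch.
 */
static struct fake_netdev *fake_get_netdev(struct fake_ib_port *port)
{
	struct fake_netdev *nd = port->ndev;

	if (nd)
		fake_dev_hold(nd);
	return nd;
}

/* Core-side caller: no lock around the callback, just put when done. */
static int core_uses_netdev(struct fake_ib_port *port)
{
	struct fake_netdev *nd = fake_get_netdev(port);
	int seen;

	if (!nd)
		return 0;
	seen = atomic_load(&nd->refcnt);	/* nd is pinned while we use it */
	fake_dev_put(nd);
	return seen;
}
```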

> +	down_read(&lists_rwsem);
> +	list_for_each_entry_rcu(dev, &device_list, core_list)

No _rcu

> +void ib_dev_roce_ports_of_netdev(struct ib_device *ib_dev, roce_netdev_filter filter,
> +					 void *filter_cookie, roce_netdev_callback cb,
> +					 void *cookie)
> +{
> +	u8 port;
> +
> +	if (ib_dev->modify_gid)
[..]
> +		if (ib_dev->get_netdev)
> +			idev = ib_dev->get_netdev(ib_dev, port);

Why check modify_gid and then go on to test and call get_netdev? That
seems strange for a general purpose core API.

> +	/* When calling get_netdev, the HW vendor's driver should return the
> +	 * net device of device @device at port @port_num. The function
> +	 * is called in rtnl_lock. The HW vendor's device driver must guarantee
> +	 * to return NULL before the net device has reached
> +	 * NETDEV_UNREGISTER_FINAL state.

rtnl_lock ? That doesn't seem to match what is going on..

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                                 ` <20150611035727.GA16599-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-11  4:49                                   ` Doug Ledford
       [not found]                                     ` <1433998199.71666.144.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-06-11 10:15                                   ` Matan Barak
  1 sibling, 1 reply; 45+ messages in thread
From: Doug Ledford @ 2015-06-11  4:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matan Barak, Or Gerlitz, Hefty, Sean, Moni Shoua, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 4019 bytes --]

On Wed, 2015-06-10 at 21:57 -0600, Jason Gunthorpe wrote:
> On Wed, Jun 10, 2015 at 09:06:28PM -0400, Doug Ledford wrote:
> 
> > People tend to push the "patches should be small, self contained,
> > incremental" ideal.  In some cases, that gets carried to an extreme.  In
> > this case, patch 1 introduces one side of the locking and patch 3 and 5
> > introduce the other halves.
> 
> I already did spot check patches 3 and 5 for exactly that. They add
> other uses of RCU, but they appear to be totally different -
> objectionable for style reasons, but probably not incorrect.
> 
> For instance the rcu lock grabs in patch 3 and 5 are protecting the
> call to netdev_master_upper_dev_get_rcu in patch 10. 'get_netdev' is
> more correctly called 'get_netdev_rcu' in this design. (as I said, this
> placement of rcu_read_lock is ugly).

Without going back to review it, I was thinking that patches 3 and 5
were related, but the rcu locking in the remaining patches were related
to core netdev rcu locking that is irrespective of what we do in the
roce gid table.  But, that just underscores my other point: we need the
patches that implement all of the relevant rcu locking for the gid table
ndev in the same patch.

> .. and just searching through the patches for 'rcu' to write this, I
> noticed this:
> 
> +void ib_enum_roce_ports_of_netdev(roce_netdev_filter filter,
> [..]
> +	down_read(&lists_rwsem);
> +	list_for_each_entry_rcu(dev, &device_list, core_list)
> +	     ib_dev_roce_ports_of_netdev(dev, filter, filter_cookie, cb,
> +					    cookie);
> +	up_read(&lists_rwsem);
> 
> Shouldn't call list_for_each_entry_rcu under a rwsem, this is just left over
> from the old locking regime...
> 
> > > I think you've got the right basic idea for a cleanup series here. It
> > > is time to buckle down and execute it well.
> > 
> > Except that this isn't really a cleanup, and calling it that clouds the
> > issue.
> 
> Well, I've been asking for a cleanup .. The entire goal is to make
> things more reviewable and a no-functional-change cleanup would sure
> help that..

In this case I'm not sure that's entirely realistic though.  Due to the
fact that the mlx4 driver and the ocrdma driver had their own gid
management code, there were some distinct differences between the two.
The gid at index 0 never matched up in my testing for example.  One
supported bonding, the other didn't.  Even if you tried to limit things
to a cleanup, you would still end up altering behavior of both drivers
just because the cleanup would have to merge the two implementations and
the result would be different than either one of them.  So I don't think
there is any such thing as a no-functional-change cleanup possible here.
Now, they could have gone minimal, but instead they went "create new
implementation that is supposedly done right and standardized, then
switch existing drivers over to it" with some dubious rcu locking.

> > it. But that's not the case.  The new core code implements everything
> > that the two drivers do, and then some more.
> 
> I'd be interested to see a list of the 'some more' included in the
> patch comments, I didn't look with a fine toothed comb, but not much
> functional stood out to me...

I get the impression that a lot of the extra is scalability changes for
perceived need (probably due to upcoming namespace/container additions).
But, just to be fair, the entire mlx4 gid code was -500 lines and
included bonding support.  The base gid code addition was +500 for the
gid table, +600 for the table population functions, +170 for default
gids, +290 for bonding support, and +140 lines to hook it into the
existing cache routines.  While part of this would seem to be
the over-designed locking, part of it is probably exactly as Matan said,
code growth due to generalization and standardization.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                                     ` <1433998199.71666.144.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-06-11  5:38                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2015-06-11  5:38 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matan Barak, Or Gerlitz, Hefty, Sean, Moni Shoua, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Jun 11, 2015 at 12:49:59AM -0400, Doug Ledford wrote:

> fact that the mlx4 driver and the ocrdma driver had their own gid
> management code, there were some distinct differences between the two.
> The gid at index 0 never matched up in my testing for example.  One
> supported bonding, the other didn't.  Even if you tried to limit things
> to a cleanup, you would still end up altering behavior of both drivers
> just because the cleanup would have to merge the two implementations and
> the result would be different than either one of them.

This is very interesting detail, I would be very happy to see the user
visible functional changes to ocrdma listed in its commit message.

I always understood that ocrdma would have some change, at least from
bonding, my 'no functional change' litmus test was that mlx4 operation
doesn't have a change. My expectation is the current v5 achieves that?

> I get the impression that a lot of the extra is scalability changes for
> perceived need (probably due to upcoming namespace/container additions).
> But, just to be fair, the entire mlx4 gid code was -500 lines and
> included bonding support.  The base gid code addition was +500 for the
> gid table, +600 for the table population functions, +170 for default
> gids, +290 for bonding support, and +140 lines to hook it into the
> existing cache routines.  While part of this would seem to be
> the over-designed locking, part of it is probably exactly as Matan said,
> code growth due to generalization and standardization.

Line count is such a funny metric.. It is very, very hard to write
succinct code, and I have no magic wand to shrink this particular
patch set down further. I suspect it could be made smaller, but that
would require more study than I care to employ :)

I feel line count is a hopeless avenue to argue, but also a red flag.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH for-next V5 12/12] RDMA/ocrdma: Changes in driver to incorporate the moving of GID Table mgmt to IB/Core.
       [not found]         ` <20150611041124.GC16599-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-11  6:04           ` Somnath Kotur
  0 siblings, 0 replies; 45+ messages in thread
From: Somnath Kotur @ 2015-06-11  6:04 UTC (permalink / raw)
  To: Jason Gunthorpe, Matan Barak
  Cc: Doug Ledford, Or Gerlitz, Moni Shoua, Sean Hefty, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Somnath Kotur, Devesh Sharma

Hi,
  Yes, Matan and I need to work together and revisit this patch in light
of the split patch series and remove any references to RoCE v2...

Thanks for the feedback Jason and apologies for the oversight, we should
have worked this out internally before sending out V5

Regards
Som

> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> Sent: Thursday, June 11, 2015 9:41 AM
> To: Matan Barak
> Cc: Doug Ledford; Or Gerlitz; Moni Shoua; Sean Hefty; Somnath Kotur;
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Somnath Kotur; Devesh Sharma
> Subject: Re: [PATCH for-next V5 12/12] RDMA/ocrdma: Changes in driver to
> incorporate the moving of GID Table mgmt to IB/Core.
>
> On Mon, Jun 08, 2015 at 05:12:15PM +0300, Matan Barak wrote:
> > From: Somnath Kotur <somnath.kotur-laKkSmNT4hbQT0dZR+AlfA@public.gmane.org>
> >
> > 1.Check and set port capability flags to indicate RoCEV2 support.
>
> ??? This series has nothing to do with rocev2 now, what is this about?
>
> >  	mutex_init(&dev->dev_lock);
> > -	dev->sgid_tbl = kzalloc(sizeof(union ib_gid) *
> > -				OCRDMA_MAX_SGID, GFP_KERNEL);
>
> Should sgid_tbl be dropped from the structure?
>
> > +int ocrdma_modify_gid(struct ib_device *ibdev, u8 port_num, unsigned int index,
> > +		      const union ib_gid *gid, const struct ib_gid_attr *attr,
> > +		      void **context)
> > +{
> > +	struct ocrdma_dev *dev;
> > +
> > +	dev = get_ocrdma_dev(ibdev);
> >
> >  	return 0;
> >  }
>
> Empty modify gid? Shouldn't it be completely empty?
>
> This is correct? This HW sends the full SGID in the WQE?
>
> > +enum {
> > +     OCRDMA_L3_TYPE_IB_GRH   = 0x00,
> > +     OCRDMA_L3_TYPE_IPV4     = 0x01,
> > +     OCRDMA_L3_TYPE_IPV6     = 0x02
> > +};
>
> These added constants are not used? Probably others as well?
>
> Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 07/12] IB/core: Add RoCE table bonding support
       [not found]     ` <1433772735-22416-8-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-11  6:18       ` Jason Gunthorpe
       [not found]         ` <20150611061818.GB22369-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2015-06-11  6:18 UTC (permalink / raw)
  To: Matan Barak
  Cc: Doug Ledford, Or Gerlitz, Moni Shoua, Sean Hefty, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, Jun 08, 2015 at 05:12:10PM +0300, Matan Barak wrote:

> +static enum bonding_slave_state is_eth_active_slave_of_bonding(struct net_device *idev,
> +							       struct net_device *upper)
> +{
> +	if (upper && IS_NETDEV_BONDING_MASTER(upper)) {
> +		struct net_device *pdev;
> +
> +		rcu_read_lock();
> +		pdev = bond_option_active_slave_get_rcu(netdev_priv(upper));
> +		rcu_read_unlock();
> +		if (pdev)
> +			return idev == pdev ? BONDING_SLAVE_STATE_ACTIVE :
> +				BONDING_SLAVE_STATE_INACTIVE;

This isn't buggy as written, but I think it doesn't reinforce the
rules for how rcu critical sections should work.

The only reason this is not buggy is because it is a pointer compare
and it works out that is OK in this particular case. But, it is subtle
and it might trip up someone down the road.

Keeping with the idiomatic RCU pattern is better:

 enum bonding_slave_state res = BONDING_SLAVE_STATE_INACTIVE;

 rcu_read_lock();
 pdev = bond_option_active_slave_get_rcu(netdev_priv(upper));
 if (pdev && pdev == idev)
   res = BONDING_SLAVE_STATE_ACTIVE;
 rcu_read_unlock();

 return res;

ie don't leak pdev out of the critical section unless a ref is taken
on it.

Same comment applies to other similar places in this series.

> +static bool is_upper_dev_rcu(struct net_device *dev, struct net_device *upper)
> +{
> +	struct net_device *_upper = NULL;
> +	struct list_head *iter;
> +
> +	rcu_read_lock();

A _rcu function should *always* be called with rcu_read_lock held.

It makes no sense to take it again in the body.

Change the name, or fix the one call site that doesn't hold the lock.

I see callers that hold the lock and callers that don't hold the lock,
shouldn't be both kinds.

>  static int is_eth_port_of_netdev(struct ib_device *ib_dev, u8 port,
>  				 struct net_device *idev, void *cookie)
>  {
> -	struct net_device *rdev;
> -	struct net_device *mdev;
> -	struct net_device *ndev = (struct net_device *)cookie;
> +	return _is_eth_port_of_netdev(ib_dev, port, idev, cookie,
> +				      BONDING_SLAVE_STATE_ACTIVE |
> +				      BONDING_SLAVE_STATE_NA);
> +}

Why wrap _is_eth_port_of_netdev with is_eth_port_of_netdev? There
is only one caller to _is_eth_port_of_netdev? Please look at all the
other small functions, there are quite a few..

> +static void _add_netdev_ips(struct ib_device *ib_dev, u8 port,
> +			    struct net_device *ndev)
> +{
> +	enum_netdev_ipv4_ips(ib_dev, port, ndev);
> +#if IS_ENABLED(CONFIG_IPV6)
> +	enum_netdev_ipv6_ips(ib_dev, port, ndev);
> +#endif
> +}

Did you try the 'if (IS_ENABLED(CONFIG_IPV6))' version I suggested a
few versions ago?
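
For reference, the shape I have in mind, as a self-contained userspace
sketch rather than the real code. DEMO_IS_ENABLED() and CONFIG_IPV6_DEMO
are made-up stand-ins: the kernel's IS_ENABLED() is implemented
differently and also copes with undefined options, which this
simplified macro does not. The point is only that a constant condition
keeps the call visible to the compiler for type checking while the dead
branch is still dropped.

```c
#include <assert.h>

/*
 * Assumption: a simplified stand-in for the kernel's IS_ENABLED().
 * It only works here because the option is always defined as 0 or 1.
 */
#define CONFIG_IPV6_DEMO 0
#define DEMO_IS_ENABLED(opt) (opt)

static int ipv4_calls, ipv6_calls;

static void enum_ipv4_demo(void)
{
	ipv4_calls++;
}

static void enum_ipv6_demo(void)
{
	/* Must never run when the option is compiled out. */
	assert(DEMO_IS_ENABLED(CONFIG_IPV6_DEMO));
	ipv6_calls++;
}

static void add_netdev_ips_demo(void)
{
	enum_ipv4_demo();
	/*
	 * Constant condition: the call is still parsed and type-checked,
	 * then eliminated as dead code when the option is 0.
	 */
	if (DEMO_IS_ENABLED(CONFIG_IPV6_DEMO))
		enum_ipv6_demo();
}
```

One consequence of this form is that the IPv6 helper needs a visible
declaration even when the option is off, which is not required with
the #if version.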

> +static void handle_netdev_upper(struct ib_device *ib_dev, u8 port,
> +				void *cookie,
> +				void (*handle_netdev)(struct ib_device *ib_dev,
> +						      u8 port,
> +						      struct net_device *ndev))
[..]

> +	LIST_HEAD(upper_list);
> +
> +	rcu_read_lock();
> +	netdev_for_each_all_upper_dev_rcu(ndev, upper, iter) {
> +		struct upper_list *entry = kmalloc(sizeof(*entry),
> +						   GFP_ATOMIC);
> +
> +		if (!entry) {
> +			pr_info("roce_gid_mgmt: couldn't allocate entry to delete ndev\n");
> +			continue;
> +		}
> +
> +		list_add_tail(&entry->list, &upper_list);
> +		dev_hold(upper);
> +		entry->upper = upper;

Every time I see copying refs onto a stack list I really start to
wonder..

handle_netdev absolutely cannot be called with rcu_read_lock held? Is
that a wise design?

> @@ -309,11 +548,24 @@ static int netdevice_event(struct notifier_block *this, unsigned long event,
>  {
>  	static const struct netdev_event_work_cmd add_cmd = {
>  		.cb = add_netdev_ips, .filter = is_eth_port_of_netdev};
> +	static const struct netdev_event_work_cmd add_cmd_upper_ips = {
> +		.cb = add_netdev_upper_ips, .filter = is_eth_port_of_netdev};
>  	static const struct netdev_event_work_cmd del_cmd = {
>  		.cb = del_netdev_ips, .filter = pass_all_filter};
> +	static const struct netdev_event_work_cmd bonding_default_del_cmd_join = {
> +		.cb = del_netdev_default_ips_join, .filter = is_eth_port_inactive_slave};
> +	static const struct netdev_event_work_cmd bonding_default_del_cmd = {
> +		.cb = del_netdev_default_ips, .filter = is_eth_port_inactive_slave};
> +	static const struct netdev_event_work_cmd default_del_cmd = {
> +		.cb = del_netdev_default_ips, .filter = pass_all_filter};
> +	static const struct netdev_event_work_cmd bonding_event_ips_del_cmd = {
> +		.cb = del_netdev_upper_ips, .filter = bonding_slaves_filter};
> +	static const struct netdev_event_work_cmd upper_ips_del_cmd = {
> +		.cb = del_netdev_upper_ips, .filter = upper_device_filter};

I also wonder about all this. Can you talk about why the work queue is
needed at this level, and is this a wise design? Is it the same reason
we can't call handle_netdev with rcu read lock held?

I'm just guessing, but is it because the driver modify_gid callback is
allowed to sleep? Would it make more sense to drive only modify_gid
from a work queue and leave the rest of this to run inline with the
notifier? That would save a lot of code..
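
Roughly what I am picturing, as a single-threaded userspace sketch and
not a proposal for the actual data structures: notifier_event(),
flush_gid_work() and modify_gid_demo() are hypothetical names, and the
fixed array stands in for a real work queue. The notifier decides what
changed inline and only queues the hardware update, which is the one
step that may sleep.

```c
#include <assert.h>
#include <stddef.h>

#define HW_GIDS     8
#define MAX_PENDING 16

/* Hypothetical work item: one deferred hardware update. */
struct gid_work {
	unsigned int index;
	int add;	/* 1 = write the GID, 0 = clear it */
};

static struct gid_work pending[MAX_PENDING];
static size_t n_pending;
static int hw_table[HW_GIDS];	/* stand-in for the HW GID table */

/*
 * Runs inline in the notifier: work out what changed, then queue only
 * the part that may sleep. Returns -1 on a bad index or a full queue.
 */
static int notifier_event(unsigned int index, int add)
{
	if (index >= HW_GIDS || n_pending == MAX_PENDING)
		return -1;
	pending[n_pending].index = index;
	pending[n_pending].add = add;
	n_pending++;
	return 0;
}

/* Stand-in for the driver's modify_gid callback, which may sleep. */
static void modify_gid_demo(const struct gid_work *w)
{
	assert(w->index < HW_GIDS);
	hw_table[w->index] = w->add;
}

/* Runs later in process context (a work queue handler in the kernel). */
static size_t flush_gid_work(void)
{
	size_t done = n_pending;

	for (size_t i = 0; i < n_pending; i++)
		modify_gid_demo(&pending[i]);
	n_pending = 0;
	return done;
}
```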

> diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
> index 4df2894..c4fe29a8 100644
> +++ b/drivers/net/bonding/bond_options.c

Woah, Woah, this should be a dedicated patch, needs to be approved by
netdev.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 05/12] IB/core: Add default GID for RoCE GID table
       [not found]     ` <1433772735-22416-6-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-11  6:20       ` Jason Gunthorpe
       [not found]         ` <20150611062017.GC22369-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2015-06-11  6:20 UTC (permalink / raw)
  To: Matan Barak
  Cc: Doug Ledford, Or Gerlitz, Moni Shoua, Sean Hefty, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, Jun 08, 2015 at 05:12:08PM +0300, Matan Barak wrote:

> +unsigned long roce_gid_type_mask_support(struct ib_device *ib_dev, u8 port);
> +

What is all this gid_type_mask stuff about? rocev2?

> +unsigned long roce_gid_type_mask_support(struct ib_device *ib_dev, u8 port)
> +{
> +	return !!rdma_protocol_roce(ib_dev, port);
> +}

Just return bool and drop the !!
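
A tiny userspace illustration of why the !! is redundant once the
return type is bool. port_is_roce_demo() is a made-up stand-in that
returns a non-zero flag value; nothing here is assumed about the real
rdma_protocol_roce() return type.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in: returns a non-zero flag value for a RoCE port. */
static int port_is_roce_demo(unsigned int port)
{
	assert(port >= 1);	/* port numbers are 1-based in this sketch */
	return port == 1 ? 0x4 : 0;
}

/*
 * Conversion to bool already maps any non-zero value to true,
 * so no "!!" is needed on the return statement.
 */
static bool gid_table_supported_demo(unsigned int port)
{
	return port_is_roce_demo(port);
}
```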

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 10/12] IB/mlx4: Implement ib_device callbacks
       [not found]     ` <1433772735-22416-11-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-06-11  6:31       ` Jason Gunthorpe
       [not found]         ` <20150611063108.GE22369-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2015-06-11  6:31 UTC (permalink / raw)
  To: Matan Barak
  Cc: Doug Ledford, Or Gerlitz, Moni Shoua, Sean Hefty, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, Jun 08, 2015 at 05:12:13PM +0300, Matan Barak wrote:

> +static struct net_device *mlx4_ib_get_netdev(struct ib_device *device, u8 port_num)
> +{

This function is never referenced in this patch, so we get compile
warnings?

Warnings are not a huge deal, but you did compile test every patch in
the series??

> +	if (mlx4_is_bonded(ibdev->dev)) {
> +		struct net_device *dev;
> +		struct net_device *upper = NULL;

So, I see this code in mlx4 touching bonding, but I don't see similar
code in ocrdma..

If bonding is a general feature why is this bit in the driver? Should
the core code be doing this?

Does bonding work in ocrdma at the end of this series? I was expecting
it to..

> +	gid_tbl = mailbox->buf;
> +
> +	for (i = 0; i < MLX4_MAX_PORT_GIDS; ++i)
> +		memcpy(&gid_tbl[i], &gids[i].gid, sizeof(union ib_gid));
> +
> +	err = mlx4_cmd(dev, mailbox->dma,
> +		       MLX4_SET_PORT_GID_TABLE << 8 | port_num,
> +		       1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B,
> +		       MLX4_CMD_WRAPPED);
> +	if (mlx4_is_bonded(dev))
> +		err += mlx4_cmd(dev, mailbox->dma,
> +				MLX4_SET_PORT_GID_TABLE << 8 | 2,
> +				1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B,
> +				MLX4_CMD_WRAPPED);

Again, wonder why the driver is sensitive to bonding, and ocrdma is not..

> @@ -477,11 +673,22 @@ out:
>  static int iboe_query_gid(struct ib_device *ibdev, u8 port, int index,
>  			  union ib_gid *gid)
>  {
> -	struct mlx4_ib_dev *dev = to_mdev(ibdev);
> +	int ret;
>  
> -	*gid = dev->iboe.gid_table[port - 1][index];
> +	if (!rdma_cap_roce_gid_table(ibdev, port)) {
> +		struct mlx4_ib_dev *dev = to_mdev(ibdev);
>  
> -	return 0;
> +		*gid = dev->iboe.gid_table[port - 1][index];
> +		return 0;
> +	}
> +
> +	ret = ib_get_cached_gid(ibdev, port, index, gid);
> +	if (ret == -EAGAIN) {
> +		memcpy(gid, &zgid, sizeof(*gid));
> +		return 0;
> +	}
> +
> +	return ret;
>  }

Hum, is it Ok to change iboe_query_gid like this at this point in the
series? Should this be in patch 11?

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 10/12] IB/mlx4: Implement ib_device callbacks
       [not found]         ` <20150611063108.GE22369-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-11  6:53           ` Moni Shoua
  0 siblings, 0 replies; 45+ messages in thread
From: Moni Shoua @ 2015-06-11  6:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matan Barak, Doug Ledford, Or Gerlitz, Sean Hefty, Somnath Kotur,
	linux-rdma

On Thu, Jun 11, 2015 at 9:31 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Mon, Jun 08, 2015 at 05:12:13PM +0300, Matan Barak wrote:
>
>> +static struct net_device *mlx4_ib_get_netdev(struct ib_device *device, u8 port_num)
>> +{
>
> This function is never referenced in this patch, so we get compile
> warnings?
>
The motivation here was to add the implementation of the new RoCE GID
table management in one patch (this one) and to switch over to it,
replacing the old implementation, in the next patch. This way we have a
working RoCE after each patch while avoiding one big patch that both
adds the new implementation and removes the old one.
The price is compilation warnings.

> Warnings are not a huge deal, but you did compile test every patch in
> the series??
Yes we did and we are aware of the warnings.
>
>> +     if (mlx4_is_bonded(ibdev->dev)) {
>> +             struct net_device *dev;
>> +             struct net_device *upper = NULL;
>
> So, I see this code in mlx4 touching bonding, but I don't see similar
> code in ocrdma..
>
> If bonding is a general feature why is this bit in the driver? Should
> the core code be doing this?
>
> Does bonding work in ocrdma at the end of this series? I was expecting
> it to..
>
This relates to a feature in mlx4 called RoCE port aggregation. I guess
that this is not relevant to OCRDMA.
Other bonding aspects (like populating addresses of an upper dev or
populating only on the port of the active slave) are done in the
core the same way for all vendors.

>> +     gid_tbl = mailbox->buf;
>> +
>> +     for (i = 0; i < MLX4_MAX_PORT_GIDS; ++i)
>> +             memcpy(&gid_tbl[i], &gids[i].gid, sizeof(union ib_gid));
>> +
>> +     err = mlx4_cmd(dev, mailbox->dma,
>> +                    MLX4_SET_PORT_GID_TABLE << 8 | port_num,
>> +                    1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B,
>> +                    MLX4_CMD_WRAPPED);
>> +     if (mlx4_is_bonded(dev))
>> +             err += mlx4_cmd(dev, mailbox->dma,
>> +                             MLX4_SET_PORT_GID_TABLE << 8 | 2,
>> +                             1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B,
>> +                             MLX4_CMD_WRAPPED);
>
> Again, wonder why the driver is sensitive to bonding, and ocrdma is not..
>
Same answer
>> @@ -477,11 +673,22 @@ out:
>>  static int iboe_query_gid(struct ib_device *ibdev, u8 port, int index,
>>                         union ib_gid *gid)
>>  {
>> -     struct mlx4_ib_dev *dev = to_mdev(ibdev);
>> +     int ret;
>>
>> -     *gid = dev->iboe.gid_table[port - 1][index];
>> +     if (!rdma_cap_roce_gid_table(ibdev, port)) {
>> +             struct mlx4_ib_dev *dev = to_mdev(ibdev);
>>
>> -     return 0;
>> +             *gid = dev->iboe.gid_table[port - 1][index];
>> +             return 0;
>> +     }
>> +
>> +     ret = ib_get_cached_gid(ibdev, port, index, gid);
>> +     if (ret == -EAGAIN) {
>> +             memcpy(gid, &zgid, sizeof(*gid));
>> +             return 0;
>> +     }
>> +
>> +     return ret;
>>  }
>
> Hum, is it Ok to change iboe_query_gid like this at this point in the
> series? Should this be in patch 11?
You are right. I'll fix.
>
> Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                                 ` <20150610220154.GA4391-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-11  9:49                                   ` Matan Barak
       [not found]                                     ` <CAAKD3BChd10Gd4P2Mwm+46aW+PJBT3j7K-BLex0Fkm5UdtUG3w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-06-12 12:29                                   ` Or Gerlitz
  1 sibling, 1 reply; 45+ messages in thread
From: Matan Barak @ 2015-06-11  9:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matan Barak, Or Gerlitz, Hefty, Sean, Doug Ledford, Moni Shoua,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Jun 11, 2015 at 1:01 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Wed, Jun 10, 2015 at 11:19:03PM +0300, Matan Barak wrote:
>
>> > Sure gid_type is gone, but I didn't say rocev2 specific, I said
>> > latent elements. ie I'm assuming reasons for the scary locking are
>> > because the ripped out rocev2 code needed it?  And some of the
>> > complexity that looks pointless now was supporting ripped out rocev2
>> > elements? That is not necessarily bad, but the code had better be good
>> > quality and working..
>>
>> Why do you think the locks have anything to do with roce v2?
>
> What else could they be for? The current mlx4 driver doesn't use
> aggressive performance locking.
>
> After writing this email, I am of the opinion that the locking should
> be simplified to rwsem and mutex, and every use of rcu, READ_ONCE and
> seqlock should be ditched.
>

No, that's not true.
For example, cm_init_av_by_path calls ib_find_cached_gid *while holding a
spinlock*, which in turn goes to roce_gid_table_find_gid.
So it's obvious you can't use a semaphore/mutex.
write_gid calls the vendor's modify_gid, which might sleep, so you
can't use spin locks here.
There are other ways around this, but I think seqcount does this pretty easily.

READ_ONCE protects something entirely different.
roce_gid_table_client_cleanup_one sets the roce_gid_table to NULL in
order to indicate to future calls that it's being destroyed, and then
flushes the workqueue. In order to avoid crashing in roce_gid_table, we
first read the pointer into a local with READ_ONCE, so that we always
work with a valid pointer.
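
To make the intent concrete, here is a rough userspace sketch of that
"read the pointer once, then only use the local copy" idea. It is not
the patch code: sketch_table, sketch_device and sketch_table_size are
made-up names, and a C11 atomic load stands in for READ_ONCE.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the per-device GID table and its owner. */
struct sketch_table {
	int nentries;
};

struct sketch_device {
	_Atomic(struct sketch_table *) table;	/* NULL once teardown starts */
};

/* Returns the entry count, or -1 if the table is already being torn down. */
static int sketch_table_size(struct sketch_device *dev)
{
	/* Load the pointer exactly once; later code uses only the local copy. */
	struct sketch_table *t = atomic_load(&dev->table);

	if (!t)
		return -1;
	return t->nentries;
}
```

Note that the single load only prevents the pointer from being re-read
after the NULL check. It does not by itself keep the table memory alive;
in the scheme described above that part is left to the workqueue flush.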

>> > But then I look at the patches, and the very first locking I test out
>> > looks wrong. I see call_rcu/synchronize_rcu being used without a
>> > single call to rcu_read_lock. So this fails #2 of the RCU review
>> > checklist (Seriously? Why am I catching this?)
>> >
>> > I stopped reading at that point.
>> >
>>
>> Well, that's easy to explain - write_gid could be called with one of
>> roce_gid_table's find API.
>
> That doesn't explain anything.
>
> You can't use call_rcu without also using rcu_dereference and
> rcu_read_lock. It doesn't make any sense otherwise.
>
> Your explanation seems confused too, did you research this? Did you
> read the RCU checklist? Is this a knee-jerk reply? Please be thoughtful.
>
>>  find is called and returns a ndev
>>  write_gid is called and calls dev_put(ndev)
>>  ndev is freed
>>  find uses the ndev
>
> Are you trying to say that this rcu is protecting this:
>
> +static int find_gid(struct ib_roce_gid_table *table, const union ib_gid *gid,
> +                   const struct ib_gid_attr *val, unsigned long mask)
> +{
> [..]
> +               if (mask & GID_ATTR_FIND_MASK_NETDEV &&
> +                   attr->ndev != val->ndev)
> +                       continue;
>
> That is an unlocked access to a RCU protected value, without
> rcu_dereference. Fails two points on the RCU checklist.
>

No, that's not what I'm saying.
When roce_gid_table_get_gid is called:
rcu_read_lock();
roce_gid_table_get_gid(..., attr);
/* attr->ndev is valid here */
rcu_read_unlock();

However, since netdev_wait_allrefs does an rcu_barrier, we could probably
ditch rcu_barrier from our code.
I'll look at that more carefully before doing so.

> Where does it return ndev?
>
> Honestly, since RCU is done wrong, and I'm very suspicious seqlock is
> done wrong too, I would *strongly* encourage v6 to have simple
> read/write sem and mutex locking and nothing fancy for performance. I
> don't want to go round and round on subtle performance locking for a
> *cleanup patch*.
>

How does seqcount relate to RCU exactly? Anyway, I explained why
seqcount was chosen here and, unless there is a good reason to switch
to another locking method, I prefer not to. Having a feeling that A is
wrong (and to be exact, the RCU usage you pointed out above isn't
wrong, it just might be unnecessary because of netdev_wait_allrefs)
doesn't mean B is wrong.

> There is also this RCU confusion:
>
> +                               rcu_read_lock();
> +                               if (ib_dev->get_netdev)
> +                                       idev = ib_dev->get_netdev(ib_dev, port);
>
> When holding the rcu_read_lock it should be obvious what the RCU
> protected data is. There is no way holding it around a driver call
> back makes any sense.
>
> The driver should return a held netdev or null.
>
> .. and maybe more, I stopped looking
>

I agree, rcu_dereference is missing here, but as you suggested we'll
have the vendor driver call dev_hold.

>> By calling the find API in RCU, your ndev is protected.
>
> When implementing locking, identify the data being locked, and
> confirm that every possible access to that data follows the required
> locking rules. In this case the data being locked is the
> table->data_vec[ix].attr.ndev pointer.
>
> It was the very first thing I checked, in the very first patch.
>

We'll go over the RCU usages here. A lot of them just compare
pointers, so they are safe as they are.

>> > I think you've got the right basic idea for a cleanup series here. It
>> > is time to buckle down and execute it well. Do an internal mellanox
>> > kernel team review of this series. Audit and fix all the locking,
>> > evaluate the code growth and design. Audit to confirm there is no
>> > functional change that is not documented in a commit message. Tell me
>> > v6 is the best effort *team Mellanox* can put forward.
>>
>> Jason, I really appreciate your review. If you have any comments, I
>> would like to either fix or write you back. This series wasn't sent
>> without being looked at by the internal team here.
>
> Well, I am looking at this thinking I don't want to invest time in
> searching for things I think your team can find on it's own.
>
> Take a breather, produce v6 very carefully.
>
> Jason

Matan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                     ` <1433981756.71666.60.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-06-11  4:07                       ` Jason Gunthorpe
@ 2015-06-11  9:51                       ` Matan Barak
  1 sibling, 0 replies; 45+ messages in thread
From: Matan Barak @ 2015-06-11  9:51 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Jason Gunthorpe, Or Gerlitz, Hefty, Sean, Matan Barak,
	Moni Shoua, Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Jun 11, 2015 at 3:15 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Wed, 2015-06-10 at 09:00 -0600, Jason Gunthorpe wrote:
>> On Wed, Jun 10, 2015 at 11:53:15AM +0300, Or Gerlitz wrote:
>>
>> > Jason, can you ack that this post addressed your comments?
>>
>> Well, I asked for a cleanup series, multiple times, and this is the
>> closest things have got.
>>
>> It isn't really a cleanup because the whole gid table is new code and
>> has latent elements for rocev2 - this is why it is so much bigger than
>> it should be.
>
> I'm not sure the complexity here is "latent RoCEv2" stuff versus simple
> over-design.  I didn't see anything in the RoCEv2 that warranted this
> level of complexity either.
>
> Just to be clear, I'm currently reviewing the RCU usage here.  Jason has
> brought up specific issue, if I can't convince myself that his
> objections to the RCU usage are wrong, then I'm going to second his
> request that we go back to a more simplistic rwlock.
>

The RCU protects the ndev from being freed and not the table itself.

>> The other core parts have been mostly trimmed, so that is the specific
>> things discussed last round.
>>
>> Is it Ok to go ahead with the gid table as is? I don't know, I haven't
>> studied the patch in any detail. Technically, that is not best
>> practice for kernel development process.
>>
>> Jason
>
>
> --
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>               GPG KeyID: 0E572FDD
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                             ` <1433984788.71666.78.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-06-11  3:57                               ` Jason Gunthorpe
@ 2015-06-11 10:09                               ` Matan Barak
  1 sibling, 0 replies; 45+ messages in thread
From: Matan Barak @ 2015-06-11 10:09 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Jason Gunthorpe, Matan Barak, Or Gerlitz, Hefty, Sean,
	Moni Shoua, Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Jun 11, 2015 at 4:06 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Wed, 2015-06-10 at 12:49 -0600, Jason Gunthorpe wrote:
>> On Wed, Jun 10, 2015 at 06:08:30PM +0300, Matan Barak wrote:
>> > >It isn't really a cleanup because the whole gid table is new code and
>> > >has latent elements for rocev2 - this is why it is so much bigger than
>> > >it should be.
>> >
>> > I disagree. Could you please point on anything that is RoCE V2 specific?
>> > The essence of RoCE V2 in the previous series was the gid_type part.
>> > This is now completely removed.
>>
>> Sure gid_type is gone, but I didn't say rocev2 specific, I said
>> latent elements. ie I'm assuming reasons for the scary locking are
>> because the ripped out rocev2 code needed it?  And some of the
>> complexity that looks pointless now was supporting ripped out rocev2
>> elements? That is not necessarily bad, but the code had better be good
>> quality and working..
>>
>> But then I look at the patches, and the very first locking I test out
>> looks wrong. I see call_rcu/synchronize_rcu being used without a
>> single call to rcu_read_lock. So this fails #2 of the RCU review
>> checklist (Seriously? Why am I catching this?)
>>
>> I stopped reading at that point.
>
> The way they chose to split up patches is part of the problem here.
>
> People tend to push the "patches should be small, self contained,
> incremental" ideal.  In some cases, that gets carried to an extreme.  In
> this case, patch 1 introduces one side of the locking and patch 3 and 5
> introduce the other halves.
>
> In all, this needs to be re-ordered first off:
>
> Patch 4 should be 1 and netdev@ should be Cc:ed
> Patch 6 should be 2 and netdev@ should be Cc:ed
> Patch 2 should be 3 (or just all by itself in a separate submission)
> Patch 1, 3, and 5 should be squashed down to a single patch so that the
> locking can actually be analyzed for correctness.
>
>> I think you've got the right basic idea for a cleanup series here. It
>> is time to buckle down and execute it well.
>

Ok, we'll reorder and squash the required parts.

> Except that this isn't really a cleanup, and calling it that clouds the
> issue.  Both the mlx4 and ocrdma drivers implement incomplete RoCE gid
> management support.  If this were a true cleanup, they would just merge
> the support from mlx4 and ocrdma to core and switch the drivers over to
> it.  But that's not the case.  The new core code implements everything
> that the two drivers do, and then some more.  And in the process is
> standardizes some things that weren't standardized before.  So, a much
> more accurate description of this would be to say that the patchset
> implements a core RoCE GID management that is a much more complete
> management engine than either driver's engine, and that the last few
> patches remove the partial engines from the drivers and switch the
> drivers over to the core engine.
>
> My only complaints so far are these:
>
> 1)  I would have preferred that this be treated just like the other ib
> cache items.  The source of changes are different, but in essence, the
> RoCE gid table *is* a cache.  It's not real until the hardware writes
> it.  I would have preferred to see the handling of the roce_gid_table
> all contained in the cache file with the other cache operations.  If you
> wanted to keep the management portion in its own file, that I would have
> been fine with, but anything that manipulated the table should have been
> with the other cache manipulation functions.
>

Actually, I thought putting it in a different file would make it more
maintainable.
We could move it to cache.c. Since the roce_gid_table is structured a
bit differently than the current gid_cache, are you willing to live
with the same structure of code and data-structures, just moved to
cache.c and prefixed with ib_cache_roce_xxxx ?

> 2)  I'm not convinced at all that RCU was needed and that a rwlock
> wouldn't have been sufficient.  What drove you to use RCU and do you
> have numbers to back up that it matters?
>

RCU is only used in order to protect the ndev. seqcount protects
readers against a concurrent write, and a mutex serializes multiple
writers. seqcount was chosen because writes happen in a sleepable
context while reads can take place in atomic context.
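
As a rough userspace illustration of that split (a mutex between
writers, a sequence counter between a writer and readers), something
along these lines. This is only a sketch, not the code from the series:
all sketch_* names are invented, C11 atomics and a pthread mutex stand
in for the kernel's seqcount and mutex, and the plain memcpy racing with
a writer is formally a data race under the C11 model, which the kernel
handles with its own primitives.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_GID_LEN 16

struct sketch_entry {
	atomic_uint seq;			/* odd while a write is in flight */
	unsigned char gid[SKETCH_GID_LEN];
};

/* Serializes writers; writers are allowed to sleep. */
static pthread_mutex_t sketch_write_lock = PTHREAD_MUTEX_INITIALIZER;

static void sketch_write(struct sketch_entry *e, const unsigned char *gid)
{
	pthread_mutex_lock(&sketch_write_lock);
	atomic_fetch_add(&e->seq, 1);		/* seq becomes odd */
	memcpy(e->gid, gid, SKETCH_GID_LEN);
	atomic_fetch_add(&e->seq, 1);		/* seq is even again */
	pthread_mutex_unlock(&sketch_write_lock);
}

/* Never blocks: returns false if a write was or is in progress. */
static bool sketch_try_read(struct sketch_entry *e, unsigned char *out)
{
	unsigned int start = atomic_load(&e->seq);

	if (start & 1)
		return false;
	memcpy(out, e->gid, SKETCH_GID_LEN);
	return atomic_load(&e->seq) == start;
}
```

A reader that gets false can either retry or report "try again" to its
caller; it never waits on the writers' mutex, which is what makes it
usable from atomic context.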

>>  Do an internal mellanox
>> kernel team review of this series. Audit and fix all the locking,
>> evaluate the code growth and design. Audit to confirm there is no
>> functional change that is not documented in a commit message. Tell me
>> v6 is the best effort *team Mellanox* can put forward.
>>
>> Jason
>
>
> --
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>               GPG KeyID: 0E572FDD
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                                 ` <20150611035727.GA16599-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-06-11  4:49                                   ` Doug Ledford
@ 2015-06-11 10:15                                   ` Matan Barak
  1 sibling, 0 replies; 45+ messages in thread
From: Matan Barak @ 2015-06-11 10:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, Matan Barak, Or Gerlitz, Hefty, Sean, Moni Shoua,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Jun 11, 2015 at 6:57 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Wed, Jun 10, 2015 at 09:06:28PM -0400, Doug Ledford wrote:
>
>> People tend to push the "patches should be small, self contained,
>> incremental" ideal.  In some cases, that gets carried to an extreme.  In
>> this case, patch 1 introduces one side of the locking and patch 3 and 5
>> introduce the other halves.
>
> I already did spot check patches 3 and 5 for exactly that. They add
> other uses of RCU, but they appear to be totally different -
> objectional for style reasons, but probably not incorrect.
>
> For instance the rcu lock grabs in patch 3 and 5 are protecting the
> call to netdev_master_upper_dev_get_rcu in patch 10. 'get_netdev' is
> more correctly called 'get_netdev_rcu' in this design. (as I said, this
> placement of rcu_read_lock is ugly).
>

Correct, get_netdev is assumed to be called under rcu.
We'll switch to a version where the vendor calls dev_hold before it returns.

> .. and just searching through the patches for 'rcu' to write this, I
> noticed this:
>
> +void ib_enum_roce_ports_of_netdev(roce_netdev_filter filter,
> [..]
> +       down_read(&lists_rwsem);
> +       list_for_each_entry_rcu(dev, &device_list, core_list)
> +            ib_dev_roce_ports_of_netdev(dev, filter, filter_cookie, cb,
> +                                           cookie);
> +       up_read(&lists_rwsem);
>
> Shouldn't call list_for_each_entry_rcu under a rwsem, this is just left over
> from the old locking regime...
>

Correct, thanks.

>> > I think you've got the right basic idea for a cleanup series here. It
>> > is time to buckle down and execute it well.
>>
>> Except that this isn't really a cleanup, and calling it that clouds the
>> issue.
>
> Well, I've been asking for a cleanup .. The entire goal is to make
> things more reviewable and a no-functional-change cleanup would sure
> help that..
>
>> it. But that's not the case.  The new core code implements everything
>> that the two drivers do, and then some more.
>
> I'd be interested to see a list of the 'some more' included in the
> patch comments, I didn't look with a fine toothed comb, but not much
> functional stood out to me...
>

Functionality-wise, it's on par with the mlx4 implementation: it
supports regular Ethernet devices, bonding, VLANs and default GIDs.

> Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 05/12] IB/core: Add default GID for RoCE GID table
       [not found]         ` <20150611062017.GC22369-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-11 15:30           ` Matan Barak
  0 siblings, 0 replies; 45+ messages in thread
From: Matan Barak @ 2015-06-11 15:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matan Barak, Doug Ledford, Or Gerlitz, Moni Shoua, Sean Hefty,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Jun 11, 2015 at 9:20 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Mon, Jun 08, 2015 at 05:12:08PM +0300, Matan Barak wrote:
>
>> +unsigned long roce_gid_type_mask_support(struct ib_device *ib_dev, u8 port);
>> +
>
> What is all this gid_type_mask stuff about? rocev2?
>
>> +unsigned long roce_gid_type_mask_support(struct ib_device *ib_dev, u8 port)
>> +{
>> +     return !!rdma_protocol_roce(ib_dev, port);
>> +}
>
> Just return bool and drop the !!

I'll change that to bool.

>
> Jason

Matan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 07/12] IB/core: Add RoCE table bonding support
       [not found]         ` <20150611061818.GB22369-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-06-11 16:00           ` Matan Barak
  0 siblings, 0 replies; 45+ messages in thread
From: Matan Barak @ 2015-06-11 16:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matan Barak, Doug Ledford, Or Gerlitz, Moni Shoua, Sean Hefty,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Jun 11, 2015 at 9:18 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Mon, Jun 08, 2015 at 05:12:10PM +0300, Matan Barak wrote:
>
>> +static enum bonding_slave_state is_eth_active_slave_of_bonding(struct net_device *idev,
>> +                                                            struct net_device *upper)
>> +{
>> +     if (upper && IS_NETDEV_BONDING_MASTER(upper)) {
>> +             struct net_device *pdev;
>> +
>> +             rcu_read_lock();
>> +             pdev = bond_option_active_slave_get_rcu(netdev_priv(upper));
>> +             rcu_read_unlock();
>> +             if (pdev)
>> +                     return idev == pdev ? BONDING_SLAVE_STATE_ACTIVE :
>> +                             BONDING_SLAVE_STATE_INACTIVE;
>
> This isn't buggy as written, but I think it doesn't reinforce the
> rules for how rcu critical sections should work.
>
> The only reason this is not buggy is because it is a pointer compare
> and it works out that is OK in this particular case. But, it is subtle
> and it might trip up someone down the road.
>
> Keeping with the idiomatic RCU pattern is better:
>
>  enum bonding_slave_state res = BONDING_SLAVE_STATE_INACTIVE;
>
>  rcu_read_lock();
>  pdev = bond_option_active_slave_get_rcu(netdev_priv(upper));
>  if (pdev && pdev == idev)
>    res = BONDING_SLAVE_STATE_ACTIVE;
>  rcu_read_unlock();
>
>  return res;
>
> ie don't leak pdev out of the critical section unless a ref is taken
> on it.
>
> Same comment applies to other similar places in this series.
>

As you stated, pointer comparison is valid. However, I can move it
inside the rcu section if you feel it's clearer.

>> +static bool is_upper_dev_rcu(struct net_device *dev, struct net_device *upper)
>> +{
>> +     struct net_device *_upper = NULL;
>> +     struct list_head *iter;
>> +
>> +     rcu_read_lock();
>
> A _rcu function should *always* be called with rcu_read_lock held.
>
> It makes no sense to take it again in the body.
>
> Change the name, or fix the one call site that doesn't hold the lock.
>
> I see callers that hold the lock and callers that don't hold the lock,
> shouldn't be both kinds.
>

I'll change this name to be with _rcu suffix.

>>  static int is_eth_port_of_netdev(struct ib_device *ib_dev, u8 port,
>>                                struct net_device *idev, void *cookie)
>>  {
>> -     struct net_device *rdev;
>> -     struct net_device *mdev;
>> -     struct net_device *ndev = (struct net_device *)cookie;
>> +     return _is_eth_port_of_netdev(ib_dev, port, idev, cookie,
>> +                                   BONDING_SLAVE_STATE_ACTIVE |
>> +                                   BONDING_SLAVE_STATE_NA);
>> +}
>
> Why wrapper _is_eth_port_of_netdev with is_eth_port_of_netdev? There
> is only one caller to _is_eth_port_of_netdev? Please look at all the
> other small functions, there are quite a few..
>

I'll put its implementation in is_eth_port_of_netdev.

>> +static void _add_netdev_ips(struct ib_device *ib_dev, u8 port,
>> +                         struct net_device *ndev)
>> +{
>> +     enum_netdev_ipv4_ips(ib_dev, port, ndev);
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +     enum_netdev_ipv6_ips(ib_dev, port, ndev);
>> +#endif
>> +}
>
> Did you try the 'if (IS_ENABLED(CONFIG_IPV6))' version I suggested a
> few versions ago?
>
>> +static void handle_netdev_upper(struct ib_device *ib_dev, u8 port,
>> +                             void *cookie,
>> +                             void (*handle_netdev)(struct ib_device *ib_dev,
>> +                                                   u8 port,
>> +                                                   struct net_device *ndev))
> [..]
>

You mentioned it on an IPoIB thread, but I'll adopt that.
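
For reference, a minimal userspace sketch of the 'if (IS_ENABLED(...))'
form. SKETCH_IPV6 is a made-up stand-in for IS_ENABLED(CONFIG_IPV6) and
the sketch_enum_* helpers are dummies, so this only shows the shape of
the construct:

```c
/* Hypothetical stand-in for IS_ENABLED(CONFIG_IPV6): always 0 or 1. */
#ifndef SKETCH_IPV6
#define SKETCH_IPV6 1
#endif

static int sketch_ipv4_calls;
static int sketch_ipv6_calls;

static void sketch_enum_ipv4(void)
{
	sketch_ipv4_calls++;
}

/* Must exist (at least as a declaration) even when the option is off. */
static void sketch_enum_ipv6(void)
{
	sketch_ipv6_calls++;
}

static void sketch_add_netdev_ips(void)
{
	sketch_enum_ipv4();
	/*
	 * Unlike #if, this branch is always parsed and type-checked; the
	 * compiler drops it as dead code when the constant is 0.
	 */
	if (SKETCH_IPV6)
		sketch_enum_ipv6();
}
```

The trade-off is that the IPv6 helper has to be declared in both
configurations, in exchange for the disabled path still being compile
tested.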

>> +     LIST_HEAD(upper_list);
>> +
>> +     rcu_read_lock();
>> +     netdev_for_each_all_upper_dev_rcu(ndev, upper, iter) {
>> +             struct upper_list *entry = kmalloc(sizeof(*entry),
>> +                                                GFP_ATOMIC);
>> +
>> +             if (!entry) {
>> +                     pr_info("roce_gid_mgmt: couldn't allocate entry to delete ndev\n");
>> +                     continue;
>> +             }
>> +
>> +             list_add_tail(&entry->list, &upper_list);
>> +             dev_hold(upper);
>> +             entry->upper = upper;
>
> Everytime I see copying refs onto a stack list I really start to
> wonder..
>
> handle_netdev absolutely cannot be called with rcu_read_lock held? Is
> that a wise design?
>

handle_netdev is a callback that causes a write to the gid_table, so
it can't be called from an atomic context.
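
The resulting "take references under the lock, do the work after
dropping it" shape looks roughly like this in a userspace sketch. The
names are invented, a pthread mutex merely marks the section that stands
in for rcu_read_lock, and the refcount inc/dec stand in for
dev_hold/dev_put:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_MAX_UPPER 8

struct sketch_netdev {
	atomic_int refcnt;
	int handled;
};

/* Marks the section in which sleeping would not be allowed. */
static pthread_mutex_t sketch_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for a handler that may sleep, e.g. one writing the table. */
static void sketch_handle(struct sketch_netdev *ndev)
{
	ndev->handled++;
}

static void sketch_for_each_upper(struct sketch_netdev **uppers, size_t n)
{
	struct sketch_netdev *held[SKETCH_MAX_UPPER];
	size_t i, nheld = 0;

	/* Phase 1: inside the section, only take references. */
	pthread_mutex_lock(&sketch_list_lock);
	for (i = 0; i < n && nheld < SKETCH_MAX_UPPER; i++) {
		atomic_fetch_add(&uppers[i]->refcnt, 1);
		held[nheld++] = uppers[i];
	}
	pthread_mutex_unlock(&sketch_list_lock);

	/* Phase 2: section left, the handler is free to sleep. */
	for (i = 0; i < nheld; i++) {
		sketch_handle(held[i]);
		atomic_fetch_sub(&held[i]->refcnt, 1);
	}
}
```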

>> @@ -309,11 +548,24 @@ static int netdevice_event(struct notifier_block *this, unsigned long event,
>>  {
>>       static const struct netdev_event_work_cmd add_cmd = {
>>               .cb = add_netdev_ips, .filter = is_eth_port_of_netdev};
>> +     static const struct netdev_event_work_cmd add_cmd_upper_ips = {
>> +             .cb = add_netdev_upper_ips, .filter = is_eth_port_of_netdev};
>>       static const struct netdev_event_work_cmd del_cmd = {
>>               .cb = del_netdev_ips, .filter = pass_all_filter};
>> +     static const struct netdev_event_work_cmd bonding_default_del_cmd_join = {
>> +             .cb = del_netdev_default_ips_join, .filter = is_eth_port_inactive_slave};
>> +     static const struct netdev_event_work_cmd bonding_default_del_cmd = {
>> +             .cb = del_netdev_default_ips, .filter = is_eth_port_inactive_slave};
>> +     static const struct netdev_event_work_cmd default_del_cmd = {
>> +             .cb = del_netdev_default_ips, .filter = pass_all_filter};
>> +     static const struct netdev_event_work_cmd bonding_event_ips_del_cmd = {
>> +             .cb = del_netdev_upper_ips, .filter = bonding_slaves_filter};
>> +     static const struct netdev_event_work_cmd upper_ips_del_cmd = {
>> +             .cb = del_netdev_upper_ips, .filter = upper_device_filter};
>
> I also wonder about all this. Can you talk about why the work queue is
> needed at this level, and is this a wise design? Is it the same reason
> we can't call handle_netdev with rcu read lock held?
>

There are several sources of events: inet, inet6 and net-device. We
would like to treat all of them as first come, first served.
This is why we queue all of them to the same workqueue and handle
them one after another.
Since we would like a generic mechanism that handles all events, we
make no assumptions about the calling context.

> I'm just guessing, but is it because the driver modify_gid callback is
> allowed to sleep? Would it make more sense to drive only modify_gid
> from a work q and leave the rest of this to run inline with the
> notifier? That would save alot of code..
>

That's one reason for doing so. There are others, for example -
serializing event handling, using rwsem in
ib_enum_roce_ports_of_netdev and minimizing our event handling time.

>> diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
>> index 4df2894..c4fe29a8 100644
>> +++ b/drivers/net/bonding/bond_options.c
>
> Woah, Woah, this should be a dedicated patch, needs to be approved by
> netdev.
>

I'll split it off from this patch.

> Jason

Matan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                                     ` <CAAKD3BChd10Gd4P2Mwm+46aW+PJBT3j7K-BLex0Fkm5UdtUG3w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-06-11 16:27                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2015-06-11 16:27 UTC (permalink / raw)
  To: Matan Barak
  Cc: Matan Barak, Or Gerlitz, Hefty, Sean, Doug Ledford, Moni Shoua,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Jun 11, 2015 at 12:49:14PM +0300, Matan Barak wrote:

> No, that's not true.
> For example cm_init_av_by_path calls ib_find_cached_gid *in a
> spinlock* which in turn goes to roce_gid_table_find_gid.
> So it's obvious you can't use a semaphore/mutex.

Of course, spin lock is the appropriate trivial locking primitive for
that case.

> write_gid calls the vendor's modify_gid which might be a sleepable
> context - so you can't use spin locks here.

> There are other ways around this, but I think seqcount does this pretty easily.

A spinlock replacing the seqcount would be equally easy and have a
higher chance of being used properly. If you use a spinlock then the
RCU around ndev goes away because the ndev cannot become deref'd while
the spinlock is held. It is no longer necessary to combine RCU and
seqlock to protect ndev.

> READ_ONCE protects something entirely different.
> roce_gid_table_client_cleanup_one sets the roce_gid_table to NULL in
> order to indicate to future calls that it's being
> destroyed and then flushes the workqueue. In order to avoid crashing
> in roce_gid_table, we first store the pointer with
> READ_ONCE such that we'll always have a valid pointer.

I haven't even looked carefully at the tear down process, but it was
obvious that READ_ONCE was being used as another performance locking
scheme. And the smp_ barriers are strange to include along with work
queues. And it is strange to continue to inject new work when shutting
down.

> No, that's not what I'm saying.

Oh, well you should be saying that, because it is right.

> When roce_gid_table_get_gid is called:
> rcu_read_lock();
> roce_gid_table_get_gid(..., attr);
> /* attr->ndev is valid here */
> rcu_read_unlock();

This pattern never appears in the patch series, what are you talking
about? Also, the memcpy in roce_gid_table_get_gid is missing a RCU
read side for ndev.

Even so, that is ugly; callers that need the net_dev should have a
held ref returned for them by the API and should not be holding rcu.
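
A user-space sketch of that API shape, using C11 atomics in place of the
kernel primitives: the lookup takes its reference while a small spinlock
is held, so the caller receives an ndev it already owns and needs no RCU
section of its own. Every name here (fake_netdev, gid_attr,
attr_get_ndev, fake_dev_hold, fake_dev_put) is a hypothetical stand-in,
not a function from the patch series, and the atomic_flag loop only
imitates a kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device: just a refcount and an id. */
struct fake_netdev {
	atomic_int refcnt;
	int ifindex;
};

/* Hypothetical GID attribute slot; 'lock' imitates a spinlock. */
struct gid_attr {
	atomic_flag lock;
	struct fake_netdev *ndev;
};

static void fake_dev_hold(struct fake_netdev *nd)
{
	atomic_fetch_add(&nd->refcnt, 1);
}

static void fake_dev_put(struct fake_netdev *nd)
{
	atomic_fetch_sub(&nd->refcnt, 1);
}

static void attr_lock(struct gid_attr *a)
{
	while (atomic_flag_test_and_set_explicit(&a->lock, memory_order_acquire))
		; /* spin until the holder clears the flag */
}

static void attr_unlock(struct gid_attr *a)
{
	atomic_flag_clear_explicit(&a->lock, memory_order_release);
}

static void attr_init(struct gid_attr *a)
{
	atomic_flag_clear(&a->lock);
	a->ndev = NULL;
}

/* Writer: the slot owns one reference on whatever it points at. */
static void attr_set_ndev(struct gid_attr *a, struct fake_netdev *nd)
{
	struct fake_netdev *old;

	if (nd)
		fake_dev_hold(nd);
	attr_lock(a);
	old = a->ndev;
	a->ndev = nd;
	attr_unlock(a);
	if (old)
		fake_dev_put(old);
}

/* Reader: returns a held reference (or NULL); caller must fake_dev_put(). */
static struct fake_netdev *attr_get_ndev(struct gid_attr *a)
{
	struct fake_netdev *nd;

	attr_lock(a);
	nd = a->ndev;
	if (nd)
		fake_dev_hold(nd); /* taken under the lock, so nd cannot vanish */
	attr_unlock(a);
	return nd;
}
```

The caller's only obligation is one fake_dev_put() per successful
attr_get_ndev(); the pointer stays usable even if the slot is cleared in
the meantime.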

> > Honestly, since RCU is done wrong, and I'm very suspicious seqlock is
> > done wrong too, I would *strongly* encourage v6 to have simple
> > read/write sem and mutex locking and nothing fancy for performance. I
> > don't want to go round and round on subtle performance locking for a
> > *cleanup patch*.

> How is seqcount related to RCU exactly?!? Anyway, I explained why
> seqcount was chosen here and unless there is a good reason to switch
> to another locking method, I prefer not to.

Using seqcount requires RCU protection for ndev, they go hand in
hand.

> "Having a feeling" that if
> A is wrong (and to be exact, the rcu usage you pointed out above isn't
> wrong but might just be unnecessary because of netdev_wait_allrefs)
> doesn't mean B is wrong.

No, I looked at the seqcount and it sure looks slightly wrong, seeing
RCU *and* seqcount be wrong makes me throw up my hands and say: Stop
using performance locking.

> > There is also this RCU confusion:
> >
> > +                               rcu_read_lock();
> > +                               if (ib_dev->get_netdev)
> > +                                       idev = ib_dev->get_netdev(ib_dev, port);
> >
> > When holding the rcu_read_lock it should be obvious what the RCU
> > protected data is. There is no way holding it around a driver call
> > back makes any sense.
> >
> > The driver should return a held netdev or null.
> >
> > .. and maybe more, I stopped looking
> >
> 
> I agree, rcu_dereference is missing here, but as you suggested we'll
> use dev_hold by the vendor driver.

What? You think rcu_dereference is somehow needed above?? Why??

> We'll go over the RCU usages here. A lot of them just compare
> pointers, so they are safe as it is.

Comparing pointers is safe under certain situations, but you still
have to use rcu_dereference and rcu_read_lock on the reader side of any
writer side RCU protected data.

Jason

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                                 ` <20150610220154.GA4391-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-06-11  9:49                                   ` Matan Barak
@ 2015-06-12 12:29                                   ` Or Gerlitz
       [not found]                                     ` <CAJ3xEMiXWN9wC5u6iapKMVb4=bfzdnuy3CaZryV0nOFL_Cgmhw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 45+ messages in thread
From: Or Gerlitz @ 2015-06-12 12:29 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: Matan Barak, Matan Barak, Hefty, Sean, Moni Shoua, Somnath Kotur,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Jun 11, 2015 at 1:01 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Wed, Jun 10, 2015 at 11:19:03PM +0300, Matan Barak wrote:
>
>> > Sure gid_type is gone, but I didn't say rocev2 specific, I said
>> > latent elements. ie I'm assuming reasons for the scary locking are
>> > because the ripped out rocev2 code needed it?  And some of the
>> > complexity that looks pointless now was supporting ripped out rocev2
>> > elements? That is not necessarily bad, but the code had better be good
>> > quality and working..
>>
>> Why do you think the locks have anything to do with roce v2?
>
> What else could they be for? The current mlx4 driver doesn't use
> aggressive performance locking.
>
> After writing this email, I am of the opinion that the locking should
> be simplified to rwsem and mutex, and every use of rcu, READ_ONCE and
> seqlock should be ditched.

Hi Jason,

I understand the email thread went down into further details from this
point and on, but still, I'd like to stop here and ask for short
clarification -- my understanding of things re this reader-writer
scheme is the following:

1. some reader/s can't be made to sleep as they make calls in atomic context
and hence it wouldn't be correct to use RW semaphore

2. some writers go to sleep (invoke driver call to the firmware) and
hence they can't be holding RW spinlock

Agree? If not, can you briefly say what's wrong with these assumptions,
or should I just read the thread in depth?

B/c if these assumptions are correct, seqlock/seqcount seem to me very
much like the right approach. It addresses the constraints 1 and 2
above, and is simple and robust to implement and use.

so?

Or.

* Re: [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core
       [not found]                                     ` <CAJ3xEMiXWN9wC5u6iapKMVb4=bfzdnuy3CaZryV0nOFL_Cgmhw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-06-12 16:11                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2015-06-12 16:11 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Doug Ledford, Matan Barak, Matan Barak, Hefty, Sean, Moni Shoua,
	Somnath Kotur, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Fri, Jun 12, 2015 at 03:29:25PM +0300, Or Gerlitz wrote:

> I understand the email thread went down into further details from this
> point and on, but still, I'd like to stop here and ask for short
> clarification -- my understanding of things re this reader-writer
> scheme is the following:
> 
> 1. some reader/s can't be made to sleep as they make calls in atomic context
> and hence it wouldn't be correct to use RW semaphore
> 
> 2. some writers go to sleep (invoke driver call to the firmware) and
> hence they can't be holding RW spinlock
> 
> Agree? If not, can you briefly say what's wrong with these assumptions,
> or should I just read the thread in depth?

Sure, broadly.

> B/c if these assumptions are correct, seqlock/seqcount seem to me very
> much like the right approach. It addresses the constraints 1 and 2
> above, and is simple and robust to implement and use.

Well, it is anything but simple. There are two subtle complex things
going on with this use of seqlock:
 - If you hold a seqlock while sleeping and ignore the read-retry
   mechanism then upon update readers see three states: NEW,OLD,INVALID
 - It is impossible to dev_hold attr->ndev without using RCU
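
To make the three-state point concrete, here is a minimal user-space
sketch built on C11 atomics. It is not the kernel seqcount API and not
the code in the patch; seq_gid and its helpers are hypothetical names,
and writers are assumed to be serialized by some outer lock that is not
shown. The reader does not spin: it reports failure when the counter is
odd or has moved, which is the INVALID state a caller sees whenever the
writer is parked between write_begin and write_end.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical table entry: a 128-bit GID guarded by a sequence counter. */
struct seq_gid {
	atomic_uint seq;          /* even: stable, odd: update in progress */
	_Atomic uint64_t hi, lo;  /* payload, accessed with relaxed atomics */
};

static void seq_gid_init(struct seq_gid *e, uint64_t hi, uint64_t lo)
{
	atomic_init(&e->seq, 0);
	atomic_init(&e->hi, hi);
	atomic_init(&e->lo, lo);
}

/* Writers are assumed to be serialized externally (not shown). */
static void seq_gid_write_begin(struct seq_gid *e)
{
	unsigned s = atomic_load_explicit(&e->seq, memory_order_relaxed);

	atomic_store_explicit(&e->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

static void seq_gid_store(struct seq_gid *e, uint64_t hi, uint64_t lo)
{
	atomic_store_explicit(&e->hi, hi, memory_order_relaxed);
	atomic_store_explicit(&e->lo, lo, memory_order_relaxed);
}

static void seq_gid_write_end(struct seq_gid *e)
{
	unsigned s = atomic_load_explicit(&e->seq, memory_order_relaxed);

	atomic_store_explicit(&e->seq, s + 1, memory_order_release);
}

/* Returns false instead of retrying if an update is in flight or raced. */
static bool seq_gid_try_read(struct seq_gid *e, uint64_t *hi, uint64_t *lo)
{
	unsigned s1 = atomic_load_explicit(&e->seq, memory_order_acquire);
	uint64_t h, l;

	if (s1 & 1)
		return false; /* writer is mid-update: INVALID */
	h = atomic_load_explicit(&e->hi, memory_order_relaxed);
	l = atomic_load_explicit(&e->lo, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);
	if (atomic_load_explicit(&e->seq, memory_order_relaxed) != s1)
		return false; /* raced with a writer: INVALID */
	*hi = h;
	*lo = l;
	return true;
}
```

If the writer sleeps in a driver call between write_begin and write_end,
every seq_gid_try_read() in that window fails, so callers must be
prepared for "no answer" in addition to the old and new values.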

You read the patch, did you understand those subtle details were
happening? It took me a few re-reads to get it..

Far better is a simple spin lock:

lock(table->lock)
  spin_lock()
  attr = ...;
  invalid = 1;
  spin_unlock();

  driver->update_gid   /* may sleep; only table->lock is held */

  spin_lock()
  invalid = 0;
  spin_unlock();
unlock(table->lock)

Which solves *exactly* the same problem and doesn't require RCU to
extract the ndev. [In this case, I think the invalid state is actually
undesirable, so I'd work to eliminate it]
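
A compilable user-space rendering of that pseudocode, as a sketch only:
a pthread mutex plays the sleeping table->lock, an atomic_flag loop
imitates the spinlock, and gid_slot, slot_update, slot_read and
fake_hw_update are hypothetical names rather than anything in the patch
series. Leaving the entry invalid when the driver call fails is an
assumption added here; the pseudocode does not say what happens on
error.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

struct gid_slot {
	pthread_mutex_t update_lock; /* sleeping lock: serializes writers */
	atomic_flag lock;            /* imitation spinlock for the fields below */
	unsigned char gid[16];
	bool invalid;
};

typedef int (*hw_update_fn)(const unsigned char gid[16]);

static void slot_spin_lock(struct gid_slot *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		; /* spin */
}

static void slot_spin_unlock(struct gid_slot *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

static void slot_init(struct gid_slot *s)
{
	pthread_mutex_init(&s->update_lock, NULL);
	atomic_flag_clear(&s->lock);
	memset(s->gid, 0, sizeof(s->gid));
	s->invalid = false;
}

/* Reader: safe from atomic context, never sleeps. */
static bool slot_read(struct gid_slot *s, unsigned char out[16])
{
	bool ok;

	slot_spin_lock(s);
	ok = !s->invalid;
	if (ok)
		memcpy(out, s->gid, 16);
	slot_spin_unlock(s);
	return ok;
}

/* Writer: the driver call runs with only the sleeping lock held. */
static int slot_update(struct gid_slot *s, const unsigned char gid[16],
		       hw_update_fn hw_update)
{
	int ret;

	pthread_mutex_lock(&s->update_lock);
	slot_spin_lock(s);
	memcpy(s->gid, gid, 16);
	s->invalid = true;
	slot_spin_unlock(s);

	ret = hw_update(gid); /* may sleep */

	slot_spin_lock(s);
	s->invalid = (ret != 0); /* assumption: stay invalid on failure */
	slot_spin_unlock(s);
	pthread_mutex_unlock(&s->update_lock);
	return ret;
}

/* Hypothetical driver stand-in that probes the slot mid-update. */
static struct gid_slot *probe_slot;
static bool probe_saw_valid;

static int fake_hw_update(const unsigned char gid[16])
{
	unsigned char tmp[16];

	(void)gid;
	probe_saw_valid = slot_read(probe_slot, tmp);
	return 0;
}
```

The probe in fake_hw_update shows the window the bracketed remark is
about: a reader arriving during the driver call gets "invalid" rather
than the old or new GID.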

I keep being surprised by this insistence on performance locking.  99%
of the time the performance locks and simple locks are
interchangeable. If you reach a situation where you think you *NEED* a
performance lock and can't see how to do it with a simple lock, almost
certainly, you are doing something wrong.

So as a reviewer, asking for simple locks should get the author to
consider their locking carefully and either produce something correct
with simple locks or solve the issues with their performance locks,
independently.

Especially with RCU, there is no excuse, there is a RCU review
checklist, and both Matan and Yishai's v1 patches *blatantly* fail points
in that list. 'RCU is used wrong, read the checklist' is all a reviewer
should have to say.

Being drawn into a stupid argument is pointless, and in future I'm
going to stop taking the bait. :)

Jason

end of thread, other threads:[~2015-06-12 16:11 UTC | newest]

Thread overview: 45+ messages
2015-06-08 14:12 [PATCH for-next V5 00/12] Move RoCE GID management to IB/Core Matan Barak
     [not found] ` <1433772735-22416-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-08 14:12   ` [PATCH for-next V5 01/12] IB/core: Add RoCE GID table Matan Barak
2015-06-08 14:12   ` [PATCH for-next V5 02/12] IB/core: Add rwsem to allow reading device list or client list Matan Barak
2015-06-08 14:12   ` [PATCH for-next V5 03/12] IB/core: Add RoCE GID population Matan Barak
     [not found]     ` <1433772735-22416-4-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-11  4:18       ` Jason Gunthorpe
2015-06-08 14:12   ` [PATCH for-next V5 04/12] net/ipv6: Export addrconf_ifid_eui48 Matan Barak
2015-06-08 14:12   ` [PATCH for-next V5 05/12] IB/core: Add default GID for RoCE GID table Matan Barak
     [not found]     ` <1433772735-22416-6-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-11  6:20       ` Jason Gunthorpe
     [not found]         ` <20150611062017.GC22369-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-11 15:30           ` Matan Barak
2015-06-08 14:12   ` [PATCH for-next V5 06/12] net: Add info for NETDEV_CHANGEUPPER event Matan Barak
2015-06-08 14:12   ` [PATCH for-next V5 07/12] IB/core: Add RoCE table bonding support Matan Barak
     [not found]     ` <1433772735-22416-8-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-11  6:18       ` Jason Gunthorpe
     [not found]         ` <20150611061818.GB22369-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-11 16:00           ` Matan Barak
2015-06-08 14:12   ` [PATCH for-next V5 08/12] IB/core: ib_cache routines should use roce_gid_table when needed Matan Barak
2015-06-08 14:12   ` [PATCH for-next V5 09/12] net/mlx4: Postpone the registration of net_device Matan Barak
2015-06-08 14:12   ` [PATCH for-next V5 10/12] IB/mlx4: Implement ib_device callbacks Matan Barak
     [not found]     ` <1433772735-22416-11-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-11  6:31       ` Jason Gunthorpe
     [not found]         ` <20150611063108.GE22369-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-11  6:53           ` Moni Shoua
2015-06-08 14:12   ` [PATCH for-next V5 11/12] IB/mlx4: Replace mechanism for RoCE GID management Matan Barak
2015-06-08 14:12   ` [PATCH for-next V5 12/12] RDMA/ocrdma: Changes in driver to incorporate the moving of GID Table mgmt to IB/Core Matan Barak
     [not found]     ` <1433772735-22416-13-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-11  4:11       ` Jason Gunthorpe
     [not found]         ` <20150611041124.GC16599-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-11  6:04           ` Somnath Kotur
2015-06-08 21:37   ` [PATCH for-next V5 00/12] Move RoCE GID management " Hefty, Sean
     [not found]     ` <1828884A29C6694DAF28B7E6B8A82373A8FE5D17-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-09  7:27       ` Matan Barak
     [not found]         ` <55769561.8000300-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-10  8:53           ` Or Gerlitz
     [not found]             ` <5577FAFB.8020205-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-10 15:00               ` Jason Gunthorpe
     [not found]                 ` <20150610150010.GA11243-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-10 15:08                   ` Matan Barak
     [not found]                     ` <557852EE.5030107-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-06-10 18:49                       ` Jason Gunthorpe
     [not found]                         ` <20150610184954.GA26404-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-10 20:19                           ` Matan Barak
     [not found]                             ` <CAAKD3BB90iZ98B2ADG+=ZYuEVtLq26a99BEjQCR8U1vzvcG+Gw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-06-10 22:01                               ` Jason Gunthorpe
     [not found]                                 ` <20150610220154.GA4391-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-11  9:49                                   ` Matan Barak
     [not found]                                     ` <CAAKD3BChd10Gd4P2Mwm+46aW+PJBT3j7K-BLex0Fkm5UdtUG3w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-06-11 16:27                                       ` Jason Gunthorpe
2015-06-12 12:29                                   ` Or Gerlitz
     [not found]                                     ` <CAJ3xEMiXWN9wC5u6iapKMVb4=bfzdnuy3CaZryV0nOFL_Cgmhw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-06-12 16:11                                       ` Jason Gunthorpe
2015-06-11  1:06                           ` Doug Ledford
     [not found]                             ` <1433984788.71666.78.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-06-11  3:57                               ` Jason Gunthorpe
     [not found]                                 ` <20150611035727.GA16599-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-06-11  4:49                                   ` Doug Ledford
     [not found]                                     ` <1433998199.71666.144.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-06-11  5:38                                       ` Jason Gunthorpe
2015-06-11 10:15                                   ` Matan Barak
2015-06-11 10:09                               ` Matan Barak
2015-06-11  0:15                   ` Doug Ledford
     [not found]                     ` <1433981756.71666.60.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-06-11  4:07                       ` Jason Gunthorpe
2015-06-11  9:51                       ` Matan Barak
2015-06-10 15:09               ` Hefty, Sean
     [not found]                 ` <1828884A29C6694DAF28B7E6B8A82373A8FE6616-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-06-10 15:19                   ` Matan Barak
