linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications
@ 2015-09-17 10:38 Or Gerlitz
       [not found] ` <1442486283-9699-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Or Gerlitz @ 2015-09-17 10:38 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Lameter, Or Gerlitz

Patches from Erez, to be used for cleaning up send-only objects and multicast 
group SM registrations.

Or.

Erez Shitrit (2):
  IB/ipoib: Add mechanism for ipoib neigh state change notifications
  IB/ipoib: Add cleanup to sendonly multicast objects

 drivers/infiniband/ulp/ipoib/ipoib.h           | 17 +++++
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  4 ++
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 87 ++++++++++++++++++++++++++
 3 files changed, 108 insertions(+)

-- 
2.3.7

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH rdma-rc 1/2] IB/ipoib: Add mechanism for ipoib neigh state change notifications
       [not found] ` <1442486283-9699-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-09-17 10:38   ` Or Gerlitz
  2015-09-17 10:38   ` [PATCH rdma-rc 2/2] IB/ipoib: Add cleanup to sendonly multicast objects Or Gerlitz
  2015-09-17 14:42   ` [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications Christoph Lameter
  2 siblings, 0 replies; 26+ messages in thread
From: Or Gerlitz @ 2015-09-17 10:38 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Lameter,
	Erez Shitrit, Or Gerlitz

From: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Add callback function to the ipoib_neigh struct in order to add the
ability to inform the object that holds the neigh on its current state.

Each neigh object is kept by one and only one object (from type
path_record or multicast object), now this object can act accordingly
the change in the neigh state.

The callback should pay attention to the context it runs on, and should
act/run according to that context limitation, for example on the neigh
reap flow, the neigh calls the callback under spinlock etc.

Signed-off-by: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h      | 11 +++++++++++
 drivers/infiniband/ulp/ipoib/ipoib_main.c |  4 ++++
 2 files changed, 15 insertions(+)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index ca28736..5b719e2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -417,6 +417,14 @@ struct ipoib_path {
 	int  		      valid;
 };
 
+enum ipoib_neigh_state {
+	IPOIB_NEIGH_CREATED,
+	IPOIB_NEIGH_REMOVED,
+};
+
+typedef int (*state_callback_fn)(struct ipoib_dev_priv *priv,
+				 enum ipoib_neigh_state state, void *context);
+
 struct ipoib_neigh {
 	struct ipoib_ah    *ah;
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
@@ -432,6 +440,9 @@ struct ipoib_neigh {
 	struct rcu_head     rcu;
 	atomic_t	    refcnt;
 	unsigned long       alive;
+	/* add the ability to notify the objects that hold that neigh */
+	state_callback_fn state_callback;
+	void *context;
 };
 
 #define IPOIB_UD_MTU(ib_mtu)		(ib_mtu - IPOIB_ENCAP_LEN)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 36536ce..6176441 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1179,6 +1179,10 @@ static void __ipoib_reap_neigh(struct ipoib_dev_priv *priv)
 				rcu_assign_pointer(*np,
 						   rcu_dereference_protected(neigh->hnext,
 									     lockdep_is_held(&priv->lock)));
+				/* inform state if requested */
+				if (neigh->state_callback != NULL)
+					neigh->state_callback(priv, IPOIB_NEIGH_REMOVED, neigh->context);
+
 				/* remove from path/mc list */
 				list_del(&neigh->list);
 				call_rcu(&neigh->rcu, ipoib_neigh_reclaim);
-- 
2.3.7

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH rdma-rc 2/2] IB/ipoib: Add cleanup to sendonly multicast objects
       [not found] ` <1442486283-9699-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-09-17 10:38   ` [PATCH rdma-rc 1/2] IB/ipoib: " Or Gerlitz
@ 2015-09-17 10:38   ` Or Gerlitz
       [not found]     ` <1442486283-9699-3-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-09-17 14:42   ` [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications Christoph Lameter
  2 siblings, 1 reply; 26+ messages in thread
From: Or Gerlitz @ 2015-09-17 10:38 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Lameter,
	Erez Shitrit, Or Gerlitz

From: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Sendonly multicast group entries are potentially created by the driver during
the xmit flow. Their objects remain in the driver memory, plus the related group
existing in the SM and the fabric till the driver goes down, even if no one
uses that multicast entry anymore.

Since this is sendonly, they are also not part of the kernel decvice multicast
list and hence invocation of the set_rx_mode ndo will not cleam them up either.

Each multicast entry has at least one neigh object, hence we can clean the
sendonly mcast object / leave the group by using the existing neigh notification
mechanism initiated from __ipoib_reap_neigh().

Signed-off-by: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  6 ++
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 87 ++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 5b719e2..7cbd7d1 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -417,6 +417,12 @@ struct ipoib_path {
 	int  		      valid;
 };
 
+struct ipoib_free_sendonly_task {
+	struct work_struct work;
+	struct ipoib_mcast *mcast;
+	struct ipoib_dev_priv *priv;
+};
+
 enum ipoib_neigh_state {
 	IPOIB_NEIGH_CREATED,
 	IPOIB_NEIGH_REMOVED,
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 09a1748..e3d035e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -702,6 +702,91 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 	return 0;
 }
 
+/* leave / free sendonly mcast */
+static void ipoib_sendonly_free_work(struct work_struct *work)
+{
+	unsigned long flags;
+	struct ipoib_mcast *tmcast;
+	bool found = false;
+	struct ipoib_free_sendonly_task *so_work =
+		container_of(work, struct ipoib_free_sendonly_task, work);
+	struct ipoib_mcast *mcast = so_work->mcast;
+	struct ipoib_dev_priv *priv = so_work->priv;
+
+	spin_lock_irqsave(&priv->lock, flags);
+	/*
+	 * check the mcast is still in the list.
+	 * make sure we are not racing against ipoib_mcast_dev_flush
+	 */
+	list_for_each_entry(tmcast, &priv->multicast_list, list)
+		if (!memcmp(tmcast->mcmember.mgid.raw,
+			    mcast->mcmember.mgid.raw,
+			    sizeof(union ib_gid)))
+			found = true;
+
+	if (!found) {
+		pr_info("%s mcast: %pI6 already removed\n", __func__,
+			mcast->mcmember.mgid.raw);
+		spin_unlock(&priv->lock);
+		local_irq_restore(flags);
+		goto out;
+	}
+
+	/* delete from multicast_list and rb_tree */
+	rb_erase(&mcast->rb_node, &priv->multicast_tree);
+	list_del(&mcast->list);
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	/*
+	 * make sure the in-flight joins have finished before we attempt
+	 * to leave
+	 */
+	if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
+		wait_for_completion(&mcast->done);
+
+	ipoib_mcast_leave(mcast->dev, mcast);
+	ipoib_mcast_free(mcast);
+
+out:
+	kfree(so_work);
+}
+
+/* get notification from the neigh that connected to mcast on its state */
+static int handle_neigh_state_change(struct ipoib_dev_priv *priv,
+				     enum ipoib_neigh_state state, void *context)
+{
+	struct ipoib_mcast *mcast = context;
+
+	switch (state) {
+	case IPOIB_NEIGH_REMOVED:
+		/* In sendonly the kernel doesn't clean mcast groups, so we use
+		 * the gc mechanism of the neigh that connected to that mcast in
+		 * order to clean them
+		 */
+		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
+			struct ipoib_free_sendonly_task *sendonly_mcast_work;
+
+			sendonly_mcast_work = kzalloc(sizeof(*sendonly_mcast_work), GFP_KERNEL);
+			if (!sendonly_mcast_work)
+				return -ENOMEM;
+
+			INIT_WORK(&sendonly_mcast_work->work,
+				  ipoib_sendonly_free_work);
+			sendonly_mcast_work->mcast = mcast;
+			sendonly_mcast_work->priv = priv;
+			queue_work(priv->wq, &sendonly_mcast_work->work);
+		}
+		break;
+	default:
+		pr_info("%s doesn't handle state %d for mcast: %pI6\n",
+			__func__, state, mcast->mcmember.mgid.raw);
+		break;
+	}
+
+	return 0;
+}
+
 void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -762,6 +847,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 				kref_get(&mcast->ah->ref);
 				neigh->ah	= mcast->ah;
 				list_add_tail(&neigh->list, &mcast->neigh_list);
+				neigh->state_callback = handle_neigh_state_change;
+				neigh->context = mcast;
 			}
 		}
 		spin_unlock_irqrestore(&priv->lock, flags);
-- 
2.3.7

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH rdma-rc 2/2] IB/ipoib: Add cleanup to sendonly multicast objects
       [not found]     ` <1442486283-9699-3-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-09-17 11:19       ` Or Gerlitz
  0 siblings, 0 replies; 26+ messages in thread
From: Or Gerlitz @ 2015-09-17 11:19 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Christoph Lameter, Erez Shitrit

On Thu, Sep 17, 2015 at 1:38 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> From: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>
> Sendonly multicast group entries are potentially created by the driver during
> the xmit flow. Their objects remain in the driver memory, plus the related group
> existing in the SM and the fabric till the driver goes down, even if no one
> uses that multicast entry anymore.
>
> Since this is sendonly, they are also not part of the kernel decvice multicast
> list and hence invocation of the set_rx_mode ndo will not cleam them up either.

oops, Doug, I see few typos here... need to s/decvice/device/ and s/cleam/clean/
-- just in case you are to pick V0
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications
       [not found] ` <1442486283-9699-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-09-17 10:38   ` [PATCH rdma-rc 1/2] IB/ipoib: " Or Gerlitz
  2015-09-17 10:38   ` [PATCH rdma-rc 2/2] IB/ipoib: Add cleanup to sendonly multicast objects Or Gerlitz
@ 2015-09-17 14:42   ` Christoph Lameter
       [not found]     ` <alpine.DEB.2.11.1509170940270.2052-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  2 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2015-09-17 14:42 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Could we simplify it a bit. This compiles but avoids all the
generalizations and workqueues. Had to export two new functions from
ipoib_multicast.c though.



Subject: ipoib: Expire sendonly multicast joins on neighbor expiration

Add mcast_leave functionality to __ipoib_reap_neighbor.

Based on Erez work.

Signed-off-by: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>

Index: linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2015-09-09 13:14:03.412350354 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_main.c	2015-09-17 09:34:03.169844055 -0500
@@ -1149,6 +1149,8 @@ static void __ipoib_reap_neigh(struct ip
 	unsigned long dt;
 	unsigned long flags;
 	int i;
+	LIST_HEAD(remove_list);
+	struct ipoib_mcast *mcast, *tmcast;

 	if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
 		return;
@@ -1176,6 +1178,18 @@ static void __ipoib_reap_neigh(struct ip
 							  lockdep_is_held(&priv->lock))) != NULL) {
 			/* was the neigh idle for two GC periods */
 			if (time_after(neigh_obsolete, neigh->alive)) {
+
+				/* Is this multicast ? */
+				if (neigh->daddr[4] == 0xff) {
+					mcast = __ipoib_mcast_find(priv->dev, neigh->daddr + 4);
+
+					if (mcast && test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
+						list_del(&mcast->list);
+						rb_erase(&mcast->rb_node, &priv->multicast_tree);
+						list_add_tail(&mcast->list, &remove_list);
+					}
+				}
+
 				rcu_assign_pointer(*np,
 						   rcu_dereference_protected(neigh->hnext,
 									     lockdep_is_held(&priv->lock)));
@@ -1191,6 +1205,8 @@ static void __ipoib_reap_neigh(struct ip

 out_unlock:
 	spin_unlock_irqrestore(&priv->lock, flags);
+	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
+		ipoib_mcast_leave(priv->dev, mcast);
 }

 static void ipoib_reap_neigh(struct work_struct *work)
Index: linux/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2015-09-09 13:14:03.412350354 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib.h	2015-09-17 09:36:17.342455845 -0500
@@ -548,6 +548,8 @@ void ipoib_path_iter_read(struct ipoib_p

 int ipoib_mcast_attach(struct net_device *dev, u16 mlid,
 		       union ib_gid *mgid, int set_qkey);
+int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast);
+struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid);

 int ipoib_init_qp(struct net_device *dev);
 int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca);
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-09 13:14:03.412350354 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-17 09:36:55.305497262 -0500
@@ -153,7 +153,7 @@ static struct ipoib_mcast *ipoib_mcast_a
 	return mcast;
 }

-static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid)
+struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct rb_node *n = priv->multicast_tree.rb_node;
@@ -675,7 +675,7 @@ int ipoib_mcast_stop_thread(struct net_d
 	return 0;
 }

-static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
+int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret = 0;
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications
       [not found]     ` <alpine.DEB.2.11.1509170940270.2052-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2015-09-17 18:55       ` Or Gerlitz
  2015-09-17 18:57       ` Or Gerlitz
  1 sibling, 0 replies; 26+ messages in thread
From: Or Gerlitz @ 2015-09-17 18:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Sep 17, 2015 at 5:42 PM, Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> wrote:
> Could we simplify it a bit. This compiles but avoids all the
> generalizations and workqueues.

Do you find some over complexity in Erez's implementation? what? as I
said, he's pretty busy, but I hope he can get to review your proposal


Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications
       [not found]     ` <alpine.DEB.2.11.1509170940270.2052-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  2015-09-17 18:55       ` Or Gerlitz
@ 2015-09-17 18:57       ` Or Gerlitz
       [not found]         ` <CAJ3xEMg-h+J=cr2ddWvYUGU=MOBk4Aw-KtkeJutATH0Y17dnjg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 26+ messages in thread
From: Or Gerlitz @ 2015-09-17 18:57 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Sep 17, 2015 at 5:42 PM, Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> wrote:
> Could we simplify it a bit. [...] but avoids all the
> generalizations and workqueues. Had to export two new functions from
> ipoib_multicast.c though.

Do you find some over complexity in Erez's implementation? what? as I
said, he's pretty busy, but I hope he can get to review your proposal
early next week.

> This compiles

and... shouldn't be too hard to test it out, e.g with ping -b or iperf
mcast sender etc
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]         ` <CAJ3xEMg-h+J=cr2ddWvYUGU=MOBk4Aw-KtkeJutATH0Y17dnjg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-09-24 17:00           ` Christoph Lameter
       [not found]             ` <alpine.DEB.2.20.1509241158420.21198-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2015-09-24 17:00 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Ok here is the fixed up and tested V2 of the patch. Can this go in with
Doug's  patch?



Subject: ipoib: Expire sendonly multicast joins on neighbor expiration V2

Add mcast_leave functionality to __ipoib_reap_neighbor.

Signed-off-by: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>

Index: linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2015-09-23 09:51:19.259274231 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_main.c	2015-09-23 09:59:59.803289023 -0500
@@ -1149,6 +1149,9 @@
 	unsigned long dt;
 	unsigned long flags;
 	int i;
+	LIST_HEAD(remove_list);
+	struct ipoib_mcast *mcast, *tmcast;
+	struct net_device *dev = priv->dev;

 	if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
 		return;
@@ -1176,6 +1179,19 @@
 							  lockdep_is_held(&priv->lock))) != NULL) {
 			/* was the neigh idle for two GC periods */
 			if (time_after(neigh_obsolete, neigh->alive)) {
+				u8 *mgid = neigh->daddr + 4;
+
+				/* Is this multicast ? */
+				if (*mgid == 0xff) {
+					mcast = __ipoib_mcast_find(dev, mgid);
+
+					if (mcast && test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
+						list_del(&mcast->list);
+						rb_erase(&mcast->rb_node, &priv->multicast_tree);
+						list_add_tail(&mcast->list, &remove_list);
+					}
+				}
+
 				rcu_assign_pointer(*np,
 						   rcu_dereference_protected(neigh->hnext,
 									     lockdep_is_held(&priv->lock)));
@@ -1191,6 +1207,8 @@

 out_unlock:
 	spin_unlock_irqrestore(&priv->lock, flags);
+	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
+		ipoib_mcast_leave(dev, mcast);
 }

 static void ipoib_reap_neigh(struct work_struct *work)
Index: linux/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2015-09-23 09:51:19.259274231 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib.h	2015-09-23 09:51:19.255274231 -0500
@@ -548,6 +548,8 @@

 int ipoib_mcast_attach(struct net_device *dev, u16 mlid,
 		       union ib_gid *mgid, int set_qkey);
+int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast);
+struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid);

 int ipoib_init_qp(struct net_device *dev);
 int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca);
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-23 09:51:19.259274231 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-23 09:51:19.255274231 -0500
@@ -158,7 +158,7 @@
 	return mcast;
 }

-static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid)
+struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct rb_node *n = priv->multicast_tree.rb_node;
@@ -705,7 +705,7 @@
 	return 0;
 }

-static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
+int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret = 0;
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]             ` <alpine.DEB.2.20.1509241158420.21198-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2015-09-25 15:32               ` Or Gerlitz
       [not found]                 ` <CAJ3xEMg2qQ3CXrfTrg=gUX2tQRuPnpxA-8PC98GzoF5iz=j5PQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Or Gerlitz @ 2015-09-25 15:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Sep 24, 2015 at 8:00 PM, Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> wrote:
> Ok here is the fixed up and tested V2 of the patch. Can this go in with
> Doug's  patch?


Repeating myself... do you find some over complexity in Erez's
implementation? what's the rational for not using his patch and yes
using yours? Erez and Co were very busy with some internal deadlines
and he's now OOO (it's a high Holiday season now) - will be able to
review your patch once he's back (Oct 6, I believe). It seems that the
patch does the job, but there are locking/contexts and such to
consider here, so I can't just ack it, have you passed it through
testing?

Or.


> Subject: ipoib: Expire sendonly multicast joins on neighbor expiration V2
>
> Add mcast_leave functionality to __ipoib_reap_neighbor.
>
> Signed-off-by: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
>
> Index: linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
> ===================================================================
> --- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c        2015-09-23 09:51:19.259274231 -0500
> +++ linux/drivers/infiniband/ulp/ipoib/ipoib_main.c     2015-09-23 09:59:59.803289023 -0500
> @@ -1149,6 +1149,9 @@
>         unsigned long dt;
>         unsigned long flags;
>         int i;
> +       LIST_HEAD(remove_list);
> +       struct ipoib_mcast *mcast, *tmcast;
> +       struct net_device *dev = priv->dev;
>
>         if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
>                 return;
> @@ -1176,6 +1179,19 @@
>                                                           lockdep_is_held(&priv->lock))) != NULL) {
>                         /* was the neigh idle for two GC periods */
>                         if (time_after(neigh_obsolete, neigh->alive)) {
> +                               u8 *mgid = neigh->daddr + 4;
> +
> +                               /* Is this multicast ? */
> +                               if (*mgid == 0xff) {
> +                                       mcast = __ipoib_mcast_find(dev, mgid);
> +
> +                                       if (mcast && test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
> +                                               list_del(&mcast->list);
> +                                               rb_erase(&mcast->rb_node, &priv->multicast_tree);
> +                                               list_add_tail(&mcast->list, &remove_list);
> +                                       }
> +                               }
> +
>                                 rcu_assign_pointer(*np,
>                                                    rcu_dereference_protected(neigh->hnext,
>                                                                              lockdep_is_held(&priv->lock)));
> @@ -1191,6 +1207,8 @@
>
>  out_unlock:
>         spin_unlock_irqrestore(&priv->lock, flags);
> +       list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
> +               ipoib_mcast_leave(dev, mcast);
>  }
>
>  static void ipoib_reap_neigh(struct work_struct *work)
> Index: linux/drivers/infiniband/ulp/ipoib/ipoib.h
> ===================================================================
> --- linux.orig/drivers/infiniband/ulp/ipoib/ipoib.h     2015-09-23 09:51:19.259274231 -0500
> +++ linux/drivers/infiniband/ulp/ipoib/ipoib.h  2015-09-23 09:51:19.255274231 -0500
> @@ -548,6 +548,8 @@
>
>  int ipoib_mcast_attach(struct net_device *dev, u16 mlid,
>                        union ib_gid *mgid, int set_qkey);
> +int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast);
> +struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid);
>
>  int ipoib_init_qp(struct net_device *dev);
>  int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca);
> Index: linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> ===================================================================
> --- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c   2015-09-23 09:51:19.259274231 -0500
> +++ linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c        2015-09-23 09:51:19.255274231 -0500
> @@ -158,7 +158,7 @@
>         return mcast;
>  }
>
> -static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid)
> +struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid)
>  {
>         struct ipoib_dev_priv *priv = netdev_priv(dev);
>         struct rb_node *n = priv->multicast_tree.rb_node;
> @@ -705,7 +705,7 @@
>         return 0;
>  }
>
> -static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
> +int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
>  {
>         struct ipoib_dev_priv *priv = netdev_priv(dev);
>         int ret = 0;
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                 ` <CAJ3xEMg2qQ3CXrfTrg=gUX2tQRuPnpxA-8PC98GzoF5iz=j5PQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-09-25 16:41                   ` Christoph Lameter
       [not found]                     ` <alpine.DEB.2.20.1509251140200.31792-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2015-09-25 16:41 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Fri, 25 Sep 2015, Or Gerlitz wrote:

> On Thu, Sep 24, 2015 at 8:00 PM, Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> wrote:
> > Ok here is the fixed up and tested V2 of the patch. Can this go in with
> > Doug's  patch?
>
>
> Repeating myself... do you find some over complexity in Erez's
> implementation? what's the rational for not using his patch and yes
> using yours? Erez and Co were very busy with some internal deadlines
> and he's now OOO (it's a high Holiday season now) - will be able to
> review your patch once he's back (Oct 6, I believe). It seems that the
> patch does the job, but there are locking/contexts and such to
> consider here, so I can't just ack it, have you passed it through
> testing?

Yes the patch introduces a new callback and creates workqueues that
recheck conditions etc etc.

Makes it difficult to review and potentially creates new race conditions.
I'd rather have a straightforward solution.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                     ` <alpine.DEB.2.20.1509251140200.31792-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2015-09-25 16:55                       ` Christoph Lameter
  2015-09-26 17:35                       ` Or Gerlitz
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2015-09-25 16:55 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

And yes this went through testing here and we want to run this as part of
our prod kernels.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                     ` <alpine.DEB.2.20.1509251140200.31792-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  2015-09-25 16:55                       ` Christoph Lameter
@ 2015-09-26 17:35                       ` Or Gerlitz
       [not found]                         ` <CAJ3xEMj9kGaM6s+AqBQ_-ZTB_+cgDk3kefhq_O0V3PM120O9yQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 26+ messages in thread
From: Or Gerlitz @ 2015-09-26 17:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Fri, Sep 25, 2015 at 7:41 PM, Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> wrote:

> Yes the patch introduces a new callback and creates workqueues that

It's possible that this was done for a reason, so

> recheck conditions etc etc.
> Makes it difficult to review and potentially creates new race conditions.

maybe, and maybe avoid other race conditions

> I'd rather have a straightforward solution.

> And yes this went through testing here and we want to run this as
> part of our prod kernels.

sounds good, so taking into account that Erez is away till Oct 6th, we
can probably pick your patch and later, if Erez proves us that there's
deep problem there, revert it and take his.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                         ` <CAJ3xEMj9kGaM6s+AqBQ_-ZTB_+cgDk3kefhq_O0V3PM120O9yQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-09-27 16:39                           ` Christoph Lameter
       [not found]                             ` <alpine.DEB.2.20.1509271136570.14486-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2015-09-27 16:39 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA


On Sat, 26 Sep 2015, Or Gerlitz wrote:

> It's possible that this was done for a reason, so

> sounds good, so taking into account that Erez is away till Oct 6th, we
> can probably pick your patch and later, if Erez proves us that there's
> deep problem there, revert it and take his.

Ok but if Erez does not have the time to participate in code development
and follow up on the patch as issues arise then I would rather rework the
code so that it is easily understandable and I will continue to follow up
on the issues with the code as they develop. This seems to be much more
important to my company than Mellanox.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                             ` <alpine.DEB.2.20.1509271136570.14486-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2015-09-27 17:32                               ` Doug Ledford
       [not found]                                 ` <5608282F.1020507-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-09-28  6:37                               ` Or Gerlitz
  1 sibling, 1 reply; 26+ messages in thread
From: Doug Ledford @ 2015-09-27 17:32 UTC (permalink / raw)
  To: Christoph Lameter, Or Gerlitz
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 2472 bytes --]

On 09/27/2015 12:39 PM, Christoph Lameter wrote:
> 
> On Sat, 26 Sep 2015, Or Gerlitz wrote:
> 
>> It's possible that this was done for a reason, so
> 
>> sounds good, so taking into account that Erez is away till Oct 6th, we
>> can probably pick your patch and later, if Erez proves us that there's
>> deep problem there, revert it and take his.
> 
> Ok but if Erez does not have the time to participate in code development
> and follow up on the patch as issues arise then I would rather rework the
> code so that it is easily understandable and I will continue to follow up
> on the issues with the code as they develop. This seems to be much more
> important to my company than Mellanox.
> 

Currently I'm testing your patch with a couple other patches.  I dropped
the patch of mine that added a module option, and added two different
patches.  However, I'm still waffling on this patch somewhat.  In the
discussions that Jason and I had, I pretty much decided that I would
like to see all send-only multicast sends be sent immediately with no
backlog queue.  That means that if we had to start a send-only join, or
if we started one and it hasn't completed yet, we would send the packet
immediately via the broadcast group versus queueing.  Doing so might
trip this new code up.

Right now, we start a join, we queue the packet on the mcast struct, and
in join_finish we create an ah, but that's it.  We then restart the send
by calling dev_queue_xmit on the skb we put in the backlog queue, which
takes us back around to mcast_send, where we not have both a mcast and a
mcast->ah, so *then* we alloc a new neigh entry, attach this mcast to
it, and send using it.

If I change mcast_send so that we start a join, but immediately send the
packet in the broadcast group, then I would have to change the
join_finish routine to alloc a neigh that has the right daddr so it can
be found in the future, without the benefit of the daddr passed into the
function mcast_send so missing the ipoib header and instead only having
the raw mgid in the mcmember struct.  But, we would have to have that
neigh struct so that the timeout would work in the case were we had a
packet or two that triggered a join but were all sent prior to the join
completing and so we never got a neigh alloc via mcast_send for this
mcast group.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                                 ` <5608282F.1020507-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-09-28  2:28                                   ` Christoph Lameter
       [not found]                                     ` <alpine.DEB.2.20.1509272120090.18678-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2015-09-28  2:28 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Or Gerlitz, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Sun, 27 Sep 2015, Doug Ledford wrote:

> Currently I'm testing your patch with a couple other patches.  I dropped
> the patch of mine that added a module option, and added two different
> patches.  However, I'm still waffling on this patch somewhat.  In the
> discussions that Jason and I had, I pretty much decided that I would
> like to see all send-only multicast sends be sent immediately with no
> backlog queue.  That means that if we had to start a send-only join, or
> if we started one and it hasn't completed yet, we would send the packet
> immediately via the broadcast group versus queueing.  Doing so might
> trip this new code up.

If we send immediately then we would need to check on each packet if the
multicast creation has been completed?

Also broadcast could cause a unecessary reception event on the NICs of
machines that have no interest in this traffic. We would like to keep
irrelevant traffic off the fabric as much as possible. An a reception
event that requires traffic to be thrown out will cause jitter in the
processing of inbound traffic that we also would like to avoid.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                             ` <alpine.DEB.2.20.1509271136570.14486-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  2015-09-27 17:32                               ` Doug Ledford
@ 2015-09-28  6:37                               ` Or Gerlitz
       [not found]                                 ` <CAJ3xEMg5aC8oeBVA9qygVJyLd2GM9C4sVLZqAL=kFGBcgBq_oQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 26+ messages in thread
From: Or Gerlitz @ 2015-09-28  6:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Sun, Sep 27, 2015 at 7:39 PM, Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> wrote:
> Ok but if Erez does not have the time to participate in code development
> and follow up on the patch as issues arise then I would rather rework the
> code so that it is easily understandable and I will continue to follow up
> on the issues with the code as they develop.

As I mentioned to you earlier on this thread, currently it's not a
matter of having time to participate but rather happily going through
the Jewish new year holidays, this time Sukkot with many people being
off for the whole of it till Oct 6.

Personally, up to few weeks ago, I was under the misimpression that
not only IPoIB joins as full member also on the sendonly flow, but
also that such group can be actually opened under that flow, and it
turns out they don't. Later you said that your production environment
was running a very old non upstream stack that had a knob to somehow
make it work and as of that didn't realize that something goes wrong
for years w.r.t a gateway functionality with upstream/inbox code, so
we all screwed up here over a time period with is few orders of
magnitude longer than a holiday duration.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                                 ` <CAJ3xEMg5aC8oeBVA9qygVJyLd2GM9C4sVLZqAL=kFGBcgBq_oQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-09-28 11:17                                   ` Christoph Lameter
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2015-09-28 11:17 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Or Gerlitz, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, 28 Sep 2015, Or Gerlitz wrote:

> Personally, up to few weeks ago, I was under the misimpression that
> not only IPoIB joins as full member also on the sendonly flow, but
> also that such group can be actually opened under that flow, and it
> turns out they don't. Later you said that your production environment
> was running a very old non upstream stack that had a knob to somehow

I said we run OFED 1.5.X on older systems. That is not custom.

> make it work and as of that didn't realize that something goes wrong
> for years w.r.t a gateway functionality with upstream/inbox code, so
> we all screwed up here over a time period with is few orders of
> magnitude longer than a holiday duration.

We have been are migrating to a RH7 native stack over the last months and
in a mixed environment the systems running OFED will create the MC groups
so the issue was hidden. We have talked about this migration a couple of
times even face to face. ???



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                                     ` <alpine.DEB.2.20.1509272120090.18678-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2015-09-28 15:36                                       ` Doug Ledford
       [not found]                                         ` <56095E6B.60509-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Doug Ledford @ 2015-09-28 15:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Or Gerlitz, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 2769 bytes --]

On 09/27/2015 10:28 PM, Christoph Lameter wrote:
> On Sun, 27 Sep 2015, Doug Ledford wrote:
> 
>> Currently I'm testing your patch with a couple other patches.  I dropped
>> the patch of mine that added a module option, and added two different
>> patches.  However, I'm still waffling on this patch somewhat.  In the
>> discussions that Jason and I had, I pretty much decided that I would
>> like to see all send-only multicast sends be sent immediately with no
>> backlog queue.  That means that if we had to start a send-only join, or
>> if we started one and it hasn't completed yet, we would send the packet
>> immediately via the broadcast group versus queueing.  Doing so might
>> trip this new code up.
> 
> If we send immediately then we would need to check on each packet if the
> multicast creation has been completed?

We do that already anyway.  Calling find_mcast and then checking
if(!mcast || !mcast-ah) is exactly that check.

> Also broadcast could cause a unecessary reception event on the NICs of
> machines that have no interest in this traffic.

This is true.  However, I'm trying to balance between several competing
issues.  You also stated the revamped multicast code was adding latency
and dropped packets into the problem space.  Sending over the broadcast
would help with latency.  However, I have an alternative idea for that...

> We would like to keep
> irrelevant traffic off the fabric as much as possible. An a reception
> event that requires traffic to be thrown out will cause jitter in the
> processing of inbound traffic that we also would like to avoid.

That may not be optimal for your app, but we also need to try and
maintain proper emulation of typical IP/Ethernet behavior since this is
IPoIB after all.  That's why the app isn't required to join the group
before sending, and also why it should be able to expect that we will
fall back to sending via broadcast if needed.

However, the following algorithm might be suitable here:

On first packet:
  create mcast group
  queue packet to group
  schedule join

On subsequent packets:
  find mcast group
  check mcast state
    if already joined, send immediately
    if joining, queue packet to mcast queue
    if join is deferred, send via bcast

On join completion:
  successful join
    set mcast->ah
    send all queued packets via mcast
    if no queued packets, alloc neigh for default ipv4 ethertype
  on failed join
    mcast->ah remains NULL
    send all queued packets via bcast
    mcast->delay_until is set to future time (used to know join is deferred)
    schedule deferred join attemp


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                                         ` <56095E6B.60509-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-09-28 15:51                                           ` Christoph Lameter
       [not found]                                             ` <alpine.DEB.2.20.1509281046440.30710-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  2015-09-28 17:10                                           ` Jason Gunthorpe
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2015-09-28 15:51 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Or Gerlitz, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, 28 Sep 2015, Doug Ledford wrote:

> > We would like to keep
> > irrelevant traffic off the fabric as much as possible. An a reception
> > event that requires traffic to be thrown out will cause jitter in the
> > processing of inbound traffic that we also would like to avoid.
>
> That may not be optimal for your app, but we also need to try and
> maintain proper emulation of typical IP/Ethernet behavior since this is
> IPoIB after all.  That's why the app isn't required to join the group
> before sending, and also why it should be able to expect that we will
> fall back to sending via broadcast if needed.

Ok this needs to work with the existing ethernet gateways and verified to
work with them.

> However, the following algorithm might be suitable here:
>
> On first packet:
>   create mcast group
>   queue packet to group
>   schedule join
>
> On subsequent packets:
>   find mcast group
>   check mcast state
>     if already joined, send immediately
>     if joining, queue packet to mcast queue
>     if join is deferred, send via bcast

Hmmm... If the multicast group does not exist in the SM then we could only
bcast to all routers instead? No host in the fabric could then be
listening the only listeners possible are outside the fabric.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                                             ` <alpine.DEB.2.20.1509281046440.30710-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2015-09-28 16:59                                               ` Doug Ledford
       [not found]                                                 ` <560971DC.500-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Doug Ledford @ 2015-09-28 16:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Or Gerlitz, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 1698 bytes --]

On 09/28/2015 11:51 AM, Christoph Lameter wrote:
> On Mon, 28 Sep 2015, Doug Ledford wrote:
> 
>>> We would like to keep
>>> irrelevant traffic off the fabric as much as possible. An a reception
>>> event that requires traffic to be thrown out will cause jitter in the
>>> processing of inbound traffic that we also would like to avoid.
>>
>> That may not be optimal for your app, but we also need to try and
>> maintain proper emulation of typical IP/Ethernet behavior since this is
>> IPoIB after all.  That's why the app isn't required to join the group
>> before sending, and also why it should be able to expect that we will
>> fall back to sending via broadcast if needed.
> 
> Ok this needs to work with the existing ethernet gateways and verified to
> work with them.
> 
>> However, the following algorithm might be suitable here:
>>
>> On first packet:
>>   create mcast group
>>   queue packet to group
>>   schedule join
>>
>> On subsequent packets:
>>   find mcast group
>>   check mcast state
>>     if already joined, send immediately
>>     if joining, queue packet to mcast queue
>>     if join is deferred, send via bcast
> 
> Hmmm... If the multicast group does not exist in the SM then we could only
> bcast to all routers instead? 

No, I was referring to using this on top of your patch and my other two
patches, which change the ipoib driver to create sendonly groups and
then expire them when the neighbor expires.

> No host in the fabric could then be
> listening the only listeners possible are outside the fabric.
> 


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                                                 ` <560971DC.500-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-09-28 17:05                                                   ` Christoph Lameter
  2015-09-28 17:11                                                   ` Christoph Lameter
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2015-09-28 17:05 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Or Gerlitz, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, 28 Sep 2015, Doug Ledford wrote:

> No, I was referring to using this on top of your patch and my other two
> patches, which change the ipoib driver to create sendonly groups and
> then expire them when the neighbor expires.

Ok under which conditions could the joining be deferred and packets be
sent to broadcast?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                                         ` <56095E6B.60509-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-09-28 15:51                                           ` Christoph Lameter
@ 2015-09-28 17:10                                           ` Jason Gunthorpe
       [not found]                                             ` <20150928171007.GE12415-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  1 sibling, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2015-09-28 17:10 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Christoph Lameter, Or Gerlitz, Or Gerlitz,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, Sep 28, 2015 at 11:36:11AM -0400, Doug Ledford wrote:

> > Also broadcast could cause a unecessary reception event on the NICs of
> > machines that have no interest in this traffic.
> 
> This is true.  However, I'm trying to balance between several competing
> issues.  You also stated the revamped multicast code was adding latency
> and dropped packets into the problem space.  Sending over the broadcast
> would help with latency.  However, I have an alternative idea for that...

I think your original idea of broadcast immediately and deferred
optimal mlid lookup is the best *functionally* for every case - only
when you enter the very edge world of caring about timing does it make
any difference.

Christoph's needs would probably be better served by giving some API
to control the mlid cache (ie the neightbour table is already 99% of
the way there). This would let some userspace component pre-load and
fix all relevant data and undesired cache activity simply can't add
jitter.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                                                 ` <560971DC.500-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-09-28 17:05                                                   ` Christoph Lameter
@ 2015-09-28 17:11                                                   ` Christoph Lameter
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2015-09-28 17:11 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Or Gerlitz, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 3355 bytes --]

Ok I refactored the whole thing to make it less invasive and keep more
functionality in ipoib_multicast.c. Since you are working on it it would
be best for you to have the newest version. I split this into two patches:
One preparatory and one that implements the actual logic.
Both attached. The patch that implements the join is inline here:


Subject: ipoib multicast: Expire MC groups when the address expires

Upon address expiration do the proper thing to also expire the
sendonly multicast group.

Signed-off-by: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>

Index: linux/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2015-09-28 11:56:59.779764388 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib.h	2015-09-28 11:57:16.291764857 -0500
@@ -548,6 +548,8 @@

 int ipoib_mcast_attach(struct net_device *dev, u16 mlid,
 		       union ib_gid *mgid, int set_qkey);
+void ipoib_mcast_remove_mc_list(struct net_device *dev, struct list_head *list);
+void ipoib_mcast_detach_sendonly(struct ipoib_dev_priv *priv, u8 *mgid, struct list_head *list);

 int ipoib_init_qp(struct net_device *dev);
 int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca);
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2015-09-28 11:56:59.779764388 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_main.c	2015-09-28 11:56:59.775764388 -0500
@@ -1149,6 +1149,8 @@
 	unsigned long dt;
 	unsigned long flags;
 	int i;
+	LIST_HEAD(remove_list);
+	struct net_device *dev = priv->dev;

 	if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
 		return;
@@ -1176,6 +1178,9 @@
 							  lockdep_is_held(&priv->lock))) != NULL) {
 			/* was the neigh idle for two GC periods */
 			if (time_after(neigh_obsolete, neigh->alive)) {
+
+				ipoib_mcast_detach_sendonly(priv, neigh->daddr + 4, &remove_list);
+
 				rcu_assign_pointer(*np,
 						   rcu_dereference_protected(neigh->hnext,
 									     lockdep_is_held(&priv->lock)));
@@ -1191,6 +1196,7 @@

 out_unlock:
 	spin_unlock_irqrestore(&priv->lock, flags);
+	ipoib_mcast_remove_mc_list(dev, &remove_list);
 }

 static void ipoib_reap_neigh(struct work_struct *work)
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-28 11:56:59.779764388 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-28 11:56:59.775764388 -0500
@@ -800,6 +800,23 @@
 	}
 }

+/*
+ * Check if this is a sendonly multicast group. If so remove it from the list and put it
+ * onto the given list for final removal.
+ */
+void ipoib_mcast_detach_sendonly(struct ipoib_dev_priv *priv, u8 *mgid, struct list_head *remove_list)
+{
+	struct ipoib_mcast *mcast;
+
+	/* Is this multicast ? */
+	if (mcast_auto_create && *mgid == 0xff) {
+		mcast = __ipoib_mcast_find(priv->dev, mgid);
+
+		if (mcast && test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+			ipoib_detach_mc_group(priv, mcast, remove_list);
+	}
+}
+
 void ipoib_mcast_dev_flush(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);

[-- Attachment #2: Preparatory Patch --]
[-- Type: text/plain, Size: 2417 bytes --]

From: Christoph Lameter <cl@linux.com>
Subject: ipoib multicast: Extract two function from ipoib_mcast_flush

We need these two functions later to do the implicit leave when the
address handle expires so refactor the code.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-28 11:20:05.387701463 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-28 11:20:53.819702839 -0500
@@ -775,6 +775,31 @@
 	spin_unlock_irqrestore(&priv->lock, flags);
 }
 
+/*
+ * Detach a multicast group from the devices multicast tree and move it
+ * to a list for future removal
+ */
+static void ipoib_detach_mc_group(struct ipoib_dev_priv *priv,
+		struct ipoib_mcast *mcast, struct list_head *remove_list)
+{
+	list_del(&mcast->list);
+	rb_erase(&mcast->rb_node, &priv->multicast_tree);
+	list_add_tail(&mcast->list, remove_list);
+}
+
+/*
+ * Remove a list of multicast groups that has been detached and free them
+ */
+void ipoib_mcast_remove_mc_list(struct net_device *dev, struct list_head *remove_list)
+{
+	struct ipoib_mcast *mcast, *tmcast;
+
+	list_for_each_entry_safe(mcast, tmcast, remove_list, list) {
+		ipoib_mcast_leave(dev, mcast);
+		ipoib_mcast_free(mcast);
+	}
+}
+
 void ipoib_mcast_dev_flush(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -786,11 +811,8 @@
 
 	spin_lock_irqsave(&priv->lock, flags);
 
-	list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) {
-		list_del(&mcast->list);
-		rb_erase(&mcast->rb_node, &priv->multicast_tree);
-		list_add_tail(&mcast->list, &remove_list);
-	}
+	list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list)
+		ipoib_detach_mc_group(priv, mcast, &remove_list);
 
 	if (priv->broadcast) {
 		rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree);
@@ -808,10 +830,7 @@
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);
 
-	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
-		ipoib_mcast_leave(dev, mcast);
-		ipoib_mcast_free(mcast);
-	}
+	ipoib_mcast_remove_mc_list(dev, &remove_list);
 }
 
 static int ipoib_mcast_addr_is_valid(const u8 *addr, const u8 *broadcast)

[-- Attachment #3: Implement the join --]
[-- Type: text/plain, Size: 3065 bytes --]

Subject: ipoib multicast: Expire MC groups when the address expires

Upon address expiration do the proper thing to also expire the
sendonly multicast group.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2015-09-28 11:56:59.779764388 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib.h	2015-09-28 11:57:16.291764857 -0500
@@ -548,6 +548,8 @@
 
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid,
 		       union ib_gid *mgid, int set_qkey);
+void ipoib_mcast_remove_mc_list(struct net_device *dev, struct list_head *list);
+void ipoib_mcast_detach_sendonly(struct ipoib_dev_priv *priv, u8 *mgid, struct list_head *list);
 
 int ipoib_init_qp(struct net_device *dev);
 int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca);
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2015-09-28 11:56:59.779764388 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_main.c	2015-09-28 11:56:59.775764388 -0500
@@ -1149,6 +1149,8 @@
 	unsigned long dt;
 	unsigned long flags;
 	int i;
+	LIST_HEAD(remove_list);
+	struct net_device *dev = priv->dev;
 
 	if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
 		return;
@@ -1176,6 +1178,9 @@
 							  lockdep_is_held(&priv->lock))) != NULL) {
 			/* was the neigh idle for two GC periods */
 			if (time_after(neigh_obsolete, neigh->alive)) {
+
+				ipoib_mcast_detach_sendonly(priv, neigh->daddr + 4, &remove_list);
+
 				rcu_assign_pointer(*np,
 						   rcu_dereference_protected(neigh->hnext,
 									     lockdep_is_held(&priv->lock)));
@@ -1191,6 +1196,7 @@
 
 out_unlock:
 	spin_unlock_irqrestore(&priv->lock, flags);
+	ipoib_mcast_remove_mc_list(dev, &remove_list);
 }
 
 static void ipoib_reap_neigh(struct work_struct *work)
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-28 11:56:59.779764388 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2015-09-28 11:56:59.775764388 -0500
@@ -800,6 +800,23 @@
 	}
 }
 
+/*
+ * Check if this is a sendonly multicast group. If so remove it from the list and put it
+ * onto the given list for final removal.
+ */
+void ipoib_mcast_detach_sendonly(struct ipoib_dev_priv *priv, u8 *mgid, struct list_head *remove_list)
+{
+	struct ipoib_mcast *mcast;
+
+	/* Is this multicast ? */
+	if (mcast_auto_create && *mgid == 0xff) {
+		mcast = __ipoib_mcast_find(priv->dev, mgid);
+
+		if (mcast && test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+			ipoib_detach_mc_group(priv, mcast, remove_list);
+	}
+}
+
 void ipoib_mcast_dev_flush(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                                             ` <20150928171007.GE12415-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-09-28 17:19                                               ` Christoph Lameter
       [not found]                                                 ` <alpine.DEB.2.20.1509281217470.31260-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  2015-09-29 17:47                                               ` Doug Ledford
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2015-09-28 17:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, Or Gerlitz, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, 28 Sep 2015, Jason Gunthorpe wrote:

> I think your original idea of broadcast immediately and deferred
> optimal mlid lookup is the best *functionally* for every case - only
> when you enter the very edge world of caring about timing does it make
> any difference.

Infiniband is about the edge.

> Christoph's needs would probably be better served by giving some API
> to control the mlid cache (ie the neightbour table is already 99% of
> the way there). This would let some userspace component pre-load and
> fix all relevant data and undesired cache activity simply can't add
> jitter.

Ok so on boot up we preload 3000 multicast groups into the neighbor table?
What impact on IB performance would having such a number of mlid's in the
cache have?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                                                 ` <alpine.DEB.2.20.1509281217470.31260-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2015-09-28 17:36                                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 26+ messages in thread
From: Jason Gunthorpe @ 2015-09-28 17:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Doug Ledford, Or Gerlitz, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, Sep 28, 2015 at 12:19:04PM -0500, Christoph Lameter wrote:
 
> > Christoph's needs would probably be better served by giving some API
> > to control the mlid cache (ie the neightbour table is already 99% of
> > the way there). This would let some userspace component pre-load and
> > fix all relevant data and undesired cache activity simply can't add
> > jitter.
> 
> Ok so on boot up we preload 3000 multicast groups into the neighbor table?
> What impact on IB performance would having such a number of mlid's in the
> cache have?

What is the cost of a neighbour lookup? Isn't it hashed? So not very
much if the hash function/table size work well with the distribution
of IPs.

The win is any send to any of the 3000 groups is consistent in all
cases, no outlier that is slower due to the SA turn around.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
       [not found]                                             ` <20150928171007.GE12415-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-09-28 17:19                                               ` Christoph Lameter
@ 2015-09-29 17:47                                               ` Doug Ledford
  1 sibling, 0 replies; 26+ messages in thread
From: Doug Ledford @ 2015-09-29 17:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Lameter, Or Gerlitz, Or Gerlitz,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 1478 bytes --]

On 09/28/2015 01:10 PM, Jason Gunthorpe wrote:
> On Mon, Sep 28, 2015 at 11:36:11AM -0400, Doug Ledford wrote:
> 
>>> Also broadcast could cause a unecessary reception event on the NICs of
>>> machines that have no interest in this traffic.
>>
>> This is true.  However, I'm trying to balance between several competing
>> issues.  You also stated the revamped multicast code was adding latency
>> and dropped packets into the problem space.  Sending over the broadcast
>> would help with latency.  However, I have an alternative idea for that...
> 
> I think your original idea of broadcast immediately and deferred
> optimal mlid lookup is the best *functionally* for every case - only
> when you enter the very edge world of caring about timing does it make
> any difference.
> 
> Christoph's needs would probably be better served by giving some API
> to control the mlid cache (ie the neightbour table is already 99% of
> the way there). This would let some userspace component pre-load and
> fix all relevant data and undesired cache activity simply can't add
> jitter.

So, I've taken Christoph's patch, added two of my own (just changed the
comment and the #if statement so that we create groups on send-only
joins, and upped the max send-only backlog queue).  We'll leave it at
that for 4.3 and try to address it more fully in 4.4.


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2015-09-29 17:47 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-17 10:38 [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications Or Gerlitz
     [not found] ` <1442486283-9699-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-09-17 10:38   ` [PATCH rdma-rc 1/2] IB/ipoib: " Or Gerlitz
2015-09-17 10:38   ` [PATCH rdma-rc 2/2] IB/ipoib: Add cleanup to sendonly multicast objects Or Gerlitz
     [not found]     ` <1442486283-9699-3-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-09-17 11:19       ` Or Gerlitz
2015-09-17 14:42   ` [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications Christoph Lameter
     [not found]     ` <alpine.DEB.2.11.1509170940270.2052-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
2015-09-17 18:55       ` Or Gerlitz
2015-09-17 18:57       ` Or Gerlitz
     [not found]         ` <CAJ3xEMg-h+J=cr2ddWvYUGU=MOBk4Aw-KtkeJutATH0Y17dnjg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-24 17:00           ` [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications) Christoph Lameter
     [not found]             ` <alpine.DEB.2.20.1509241158420.21198-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
2015-09-25 15:32               ` Or Gerlitz
     [not found]                 ` <CAJ3xEMg2qQ3CXrfTrg=gUX2tQRuPnpxA-8PC98GzoF5iz=j5PQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-25 16:41                   ` Christoph Lameter
     [not found]                     ` <alpine.DEB.2.20.1509251140200.31792-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
2015-09-25 16:55                       ` Christoph Lameter
2015-09-26 17:35                       ` Or Gerlitz
     [not found]                         ` <CAJ3xEMj9kGaM6s+AqBQ_-ZTB_+cgDk3kefhq_O0V3PM120O9yQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-27 16:39                           ` Christoph Lameter
     [not found]                             ` <alpine.DEB.2.20.1509271136570.14486-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
2015-09-27 17:32                               ` Doug Ledford
     [not found]                                 ` <5608282F.1020507-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-09-28  2:28                                   ` Christoph Lameter
     [not found]                                     ` <alpine.DEB.2.20.1509272120090.18678-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
2015-09-28 15:36                                       ` Doug Ledford
     [not found]                                         ` <56095E6B.60509-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-09-28 15:51                                           ` Christoph Lameter
     [not found]                                             ` <alpine.DEB.2.20.1509281046440.30710-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
2015-09-28 16:59                                               ` Doug Ledford
     [not found]                                                 ` <560971DC.500-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-09-28 17:05                                                   ` Christoph Lameter
2015-09-28 17:11                                                   ` Christoph Lameter
2015-09-28 17:10                                           ` Jason Gunthorpe
     [not found]                                             ` <20150928171007.GE12415-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-09-28 17:19                                               ` Christoph Lameter
     [not found]                                                 ` <alpine.DEB.2.20.1509281217470.31260-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
2015-09-28 17:36                                                   ` Jason Gunthorpe
2015-09-29 17:47                                               ` Doug Ledford
2015-09-28  6:37                               ` Or Gerlitz
     [not found]                                 ` <CAJ3xEMg5aC8oeBVA9qygVJyLd2GM9C4sVLZqAL=kFGBcgBq_oQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-28 11:17                                   ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).