* [PATCH 0/8] IPoIB: Fix multiple race conditions
From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Doug Ledford

Locking of multicast joins/leaves in the IPoIB layer has been
problematic for a while.  There have been recent changes to try to make
things better, including these commits:

  bea1e22 IPoIB: Fix use-after-free of multicast object
  a9c8ba5 IPoIB: Fix usage of uninitialized multicast objects

Unfortunately, the following test still fails (miserably) on a plain
upstream kernel:

  pass=0
  ifdown ib0
  while true; do
      ifconfig ib0 up
      ifconfig ib0 down
      echo "Pass $pass"
      let pass++
  done

This usually fails within 10 to 20 passes, although I did have one lucky
run make it to 300 or so.  If you happen to have a P_Key child
interface, it fails even more quickly.

In tracking down the rtnl deadlock in the multicast code, I ended up
re-designing the flag usage and cleaning up the race conditions in the
various tasks.  Even that wasn't enough to resolve the issue entirely:
if you ran the test above on multiple interfaces simultaneously, it
could still deadlock.  So in the end I re-did the workqueue usage too.
We now use one workqueue per device (including dedicated workqueues for
P_Key devices) as well as one global workqueue that does nothing but
flush tasks.  This turns out to be a much more elegant way of handling
the workqueues, and in fact enabled me to remove all of the clunky
passing around of flush parameters used to tell various functions not
to flush the workqueue when doing so would deadlock the workqueue we
are running from.
Here is my test setup:

  2 InfiniBand physical fabrics: ib0 and ib1
  2 P_Keys on each fabric: the default, plus 0x8002 on ib0 and 0x8003 on ib1
  4 total IPoIB devices, which I have named mlx4_ib0, mlx4_ib0.8002,
    mlx4_ib1, and mlx4_ib1.8003

In order to test my final patch set, I logged into my test machine on
four different virtual terminals, used ifdown on all of the above
interfaces to get things into a consistent state, and then ran the above
loop simultaneously, one per terminal per interface.  It's worth noting
here that when you ifconfig a base interface up, it automatically brings
up the child interface too, so the ifconfig mlx4_ib0 up is in fact
racing with both ups and downs of mlx4_ib0.8002.  The same is true for
the mlx4_ib1 interface and its child.  With my patch set in place, these
loops are currently running without a problem and have passed 15,000
up/down cycles per interface.

Doug Ledford (8):
  IPoIB: Consolidate rtnl_lock tasks in workqueue
  IPoIB: Make the carrier_on_task race aware
  IPoIB: fix MCAST_FLAG_BUSY usage
  IPoIB: fix mcast_dev_flush/mcast_restart_task race
  IPoIB: change init sequence ordering
  IPoIB: Use dedicated workqueues per interface
  IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
  IPoIB: No longer use flush as a parameter

 drivers/infiniband/ulp/ipoib/ipoib.h           |  19 +-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  18 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  27 ++-
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  49 +++--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 239 ++++++++++++++++---------
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  22 ++-
 6 files changed, 240 insertions(+), 134 deletions(-)

-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread
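[Editor's note: the four per-terminal loops described above can also be driven
from one script.  The sketch below is hypothetical, not the author's actual
harness; the IFTOOL and PASSES knobs are invented here (e.g. IFTOOL=true gives
a dry run), and the interface names come from the setup described above.]

```shell
#!/bin/sh
# Torture-test up/down cycles on several IPoIB interfaces in parallel.
# IFTOOL and PASSES are invented knobs: override IFTOOL (e.g. IFTOOL=true)
# for a dry run, PASSES for a shorter test.
IFTOOL="${IFTOOL:-ifconfig}"
PASSES="${PASSES:-15000}"

cycle() {
    ifc="$1"
    pass=0
    while [ "$pass" -lt "$PASSES" ]; do
        $IFTOOL "$ifc" up
        $IFTOOL "$ifc" down
        pass=$((pass + 1))
    done
    echo "$ifc: completed $pass passes"
}

# Only drive real interfaces when explicitly asked to; mirrors one
# loop per virtual terminal by backgrounding one loop per interface.
if [ "${1:-}" = "run" ]; then
    for ifc in mlx4_ib0 mlx4_ib0.8002 mlx4_ib1 mlx4_ib1.8003; do
        cycle "$ifc" &
    done
    wait
fi
```

Running several loops concurrently is what exposes the cross-interface
deadlock the cover letter describes, since the base and child interfaces race
against each other.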
* [PATCH 1/8] IPoIB: Consolidate rtnl_lock tasks in workqueue
From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Doug Ledford

Setting the mtu can safely be moved to the carrier_on_task, which keeps
us from needing to take the rtnl lock in the join_finish section.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index d4e005720d0..a0a42859f12 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -190,12 +190,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 		spin_unlock_irq(&priv->lock);
 		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
 		set_qkey = 1;
-
-		if (!ipoib_cm_admin_enabled(dev)) {
-			rtnl_lock();
-			dev_set_mtu(dev, min(priv->mcast_mtu, priv->admin_mtu));
-			rtnl_unlock();
-		}
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -371,6 +365,8 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
 	}
 
 	rtnl_lock();
+	if (!ipoib_cm_admin_enabled(priv->dev))
+		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
 	netif_carrier_on(priv->dev);
 	rtnl_unlock();
 }
-- 
1.9.3
* Re: [PATCH 1/8] IPoIB: Consolidate rtnl_lock tasks in workqueue
From: Wendy Cheng @ 2014-08-15 22:11 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Setting the mtu can safely be moved to the carrier_on_task, which keeps
> us from needing to take the rtnl lock in the join_finish section.
>

Looks good!

Acked-by: Wendy Cheng <wendy.cheng-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index d4e005720d0..a0a42859f12 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -190,12 +190,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
>  		spin_unlock_irq(&priv->lock);
>  		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
>  		set_qkey = 1;
> -
> -		if (!ipoib_cm_admin_enabled(dev)) {
> -			rtnl_lock();
> -			dev_set_mtu(dev, min(priv->mcast_mtu, priv->admin_mtu));
> -			rtnl_unlock();
> -		}
>  	}
>
>  	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
> @@ -371,6 +365,8 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
>  	}
>
>  	rtnl_lock();
> +	if (!ipoib_cm_admin_enabled(priv->dev))
> +		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
>  	netif_carrier_on(priv->dev);
>  	rtnl_unlock();
>  }
> --
> 1.9.3
* Re: [PATCH 1/8] IPoIB: Consolidate rtnl_lock tasks in workqueue
From: Erez Shitrit @ 2014-09-04 14:28 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

> Setting the mtu can safely be moved to the carrier_on_task, which keeps
> us from needing to take the rtnl lock in the join_finish section.
>
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

> ---
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index d4e005720d0..a0a42859f12 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -190,12 +190,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
>  		spin_unlock_irq(&priv->lock);
>  		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
>  		set_qkey = 1;
> -
> -		if (!ipoib_cm_admin_enabled(dev)) {
> -			rtnl_lock();
> -			dev_set_mtu(dev, min(priv->mcast_mtu, priv->admin_mtu));
> -			rtnl_unlock();
> -		}
>  	}
>
>  	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
> @@ -371,6 +365,8 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
>  	}
>
>  	rtnl_lock();
> +	if (!ipoib_cm_admin_enabled(priv->dev))
> +		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
>  	netif_carrier_on(priv->dev);
>  	rtnl_unlock();
>  }
* [PATCH 2/8] IPoIB: Make the carrier_on_task race aware
From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Doug Ledford

We blindly assume that we can just take the rtnl lock and that will
prevent races with downing this interface.  Unfortunately, that's not
the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
in an attempt to clear out all remaining instances of ipoib_join_task.
But, since this task is put on the same workqueue as the join task, the
flush_workqueue waits on this thread too.  But this thread is deadlocked
on the rtnl lock.  The better thing here is to use trylock and loop on
that until we either get the lock or we see that FLAG_ADMIN_UP has
been cleared, in which case we don't need to do anything anyway and we
just return.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index a0a42859f12..7e9cd39b5ef 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
 						   carrier_on_task);
 	struct ib_port_attr attr;
 
-	/*
-	 * Take rtnl_lock to avoid racing with ipoib_stop() and
-	 * turning the carrier back on while a device is being
-	 * removed.
-	 */
 	if (ib_query_port(priv->ca, priv->port, &attr) ||
 	    attr.state != IB_PORT_ACTIVE) {
 		ipoib_dbg(priv, "Keeping carrier off until IB port is active\n");
 		return;
 	}
 
-	rtnl_lock();
+	/*
+	 * Take rtnl_lock to avoid racing with ipoib_stop() and
+	 * turning the carrier back on while a device is being
+	 * removed. However, ipoib_stop() will attempt to flush
+	 * the workqueue while holding the rtnl lock, so loop
+	 * on trylock until either we get the lock or we see
+	 * FLAG_ADMIN_UP go away as that signals that we are bailing
+	 * and can safely ignore the carrier on work
+	 */
+	while (!rtnl_trylock()) {
+		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+			return;
+		else
+			msleep(20);
+	}
 	if (!ipoib_cm_admin_enabled(priv->dev))
 		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
 	netif_carrier_on(priv->dev);
-- 
1.9.3
* Re: [PATCH 2/8] IPoIB: Make the carrier_on_task race aware
From: Wendy Cheng @ 2014-08-18 23:26 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> We blindly assume that we can just take the rtnl lock and that will
> prevent races with downing this interface.  Unfortunately, that's not
> the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
> in an attempt to clear out all remaining instances of ipoib_join_task.
> But, since this task is put on the same workqueue as the join task, the
> flush_workqueue waits on this thread too.  But this thread is deadlocked
> on the rtnl lock.  The better thing here is to use trylock and loop on
> that until we either get the lock or we see that FLAG_ADMIN_UP has
> been cleared, in which case we don't need to do anything anyway and we
> just return.
>
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index a0a42859f12..7e9cd39b5ef 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
>  						   carrier_on_task);
>  	struct ib_port_attr attr;
>
> -	/*
> -	 * Take rtnl_lock to avoid racing with ipoib_stop() and
> -	 * turning the carrier back on while a device is being
> -	 * removed.
> -	 */
>  	if (ib_query_port(priv->ca, priv->port, &attr) ||
>  	    attr.state != IB_PORT_ACTIVE) {
>  		ipoib_dbg(priv, "Keeping carrier off until IB port is active\n");
>  		return;
>  	}
>
> -	rtnl_lock();
> +	/*
> +	 * Take rtnl_lock to avoid racing with ipoib_stop() and
> +	 * turning the carrier back on while a device is being
> +	 * removed. However, ipoib_stop() will attempt to flush
> +	 * the workqueue while holding the rtnl lock, so loop
> +	 * on trylock until either we get the lock or we see
> +	 * FLAG_ADMIN_UP go away as that signals that we are bailing
> +	 * and can safely ignore the carrier on work
> +	 */
> +	while (!rtnl_trylock()) {
> +		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> +			return;
> +		else
> +			msleep(20);
> +	}

I always think the rtnl lock is too big for this purpose ... and that
20 ms is not ideal either.  Could we have a new IPoIB-private mutex
used by ipoib_stop() and this section of code?  So something like:

ipoib_stop()
{
	.....
	mutex_lock(&something_new);
	clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
	...
	mutex_unlock(&something_new);
	return 0;
}

Then the loop would become:

/* this while-loop will be very short - since we either get the mutex
 * quickly or "return" quickly */
while (!mutex_trylock(&something_new)) {
	if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
		return;
}

>  	if (!ipoib_cm_admin_enabled(priv->dev))
>  		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
>  	netif_carrier_on(priv->dev);
* Re: [PATCH 2/8] IPoIB: Make the carrier_on_task race aware
From: Doug Ledford @ 2014-08-19 20:32 UTC (permalink / raw)
  To: Wendy Cheng
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

Whether or not the core network code is OK with us dropping the rtnl
lock while manipulating the interface is the issue here.  However, I did
consider changing the mcast_mutex to a per-interface lock instead.
There are various optimizations that can be made now that the locking is
correct and race free and we don't flush from the workqueue we are
flushing.  I didn't go into those changes in this set.  As for the
20 ms, I did that because checkpatch complains about shorter waits and I
don't like to busy loop, especially since it's not like this is a
performance-sensitive section of code.

Sent from my iPad

> On Aug 18, 2014, at 4:26 PM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
>> On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> We blindly assume that we can just take the rtnl lock and that will
>> prevent races with downing this interface.  Unfortunately, that's not
>> the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
>> in an attempt to clear out all remaining instances of ipoib_join_task.
>> But, since this task is put on the same workqueue as the join task, the
>> flush_workqueue waits on this thread too.  But this thread is deadlocked
>> on the rtnl lock.  The better thing here is to use trylock and loop on
>> that until we either get the lock or we see that FLAG_ADMIN_UP has
>> been cleared, in which case we don't need to do anything anyway and we
>> just return.
>>
>> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> ---
>>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++++++++------
>>  1 file changed, 15 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> index a0a42859f12..7e9cd39b5ef 100644
>> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> @@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
>>  						   carrier_on_task);
>>  	struct ib_port_attr attr;
>>
>> -	/*
>> -	 * Take rtnl_lock to avoid racing with ipoib_stop() and
>> -	 * turning the carrier back on while a device is being
>> -	 * removed.
>> -	 */
>>  	if (ib_query_port(priv->ca, priv->port, &attr) ||
>>  	    attr.state != IB_PORT_ACTIVE) {
>>  		ipoib_dbg(priv, "Keeping carrier off until IB port is active\n");
>>  		return;
>>  	}
>>
>> -	rtnl_lock();
>> +	/*
>> +	 * Take rtnl_lock to avoid racing with ipoib_stop() and
>> +	 * turning the carrier back on while a device is being
>> +	 * removed. However, ipoib_stop() will attempt to flush
>> +	 * the workqueue while holding the rtnl lock, so loop
>> +	 * on trylock until either we get the lock or we see
>> +	 * FLAG_ADMIN_UP go away as that signals that we are bailing
>> +	 * and can safely ignore the carrier on work
>> +	 */
>> +	while (!rtnl_trylock()) {
>> +		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
>> +			return;
>> +		else
>> +			msleep(20);
>> +	}
>
> I always think the rtnl lock is too big for this purpose ... and that
> 20 ms is not ideal either.  Could we have a new IPoIB-private mutex
> used by ipoib_stop() and this section of code?  So something like:
>
> ipoib_stop()
> {
>	.....
>	mutex_lock(&something_new);
>	clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
>	...
>	mutex_unlock(&something_new);
>	return 0;
> }
>
> Then the loop would become:
>
> /* this while-loop will be very short - since we either get the mutex
>  * quickly or "return" quickly */
> while (!mutex_trylock(&something_new)) {
>	if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
>		return;
> }
>
>> 	if (!ipoib_cm_admin_enabled(priv->dev))
>> 		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
>> 	netif_carrier_on(priv->dev);
* Re: [PATCH 2/8] IPoIB: Make the carrier_on_task race aware
From: Wendy Cheng @ 2014-08-19 22:05 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

On Tue, Aug 19, 2014 at 1:32 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Whether or not the core network code is OK with us dropping the rtnl
> lock while manipulating the interface is the issue here.  However, I
> did consider changing the mcast_mutex to a per-interface lock instead.
> There are various optimizations that can be made now that the locking
> is correct and race free and we don't flush from the workqueue we are
> flushing.  I didn't go into those changes in this set.  As for the
> 20 ms, I did that because checkpatch complains about shorter waits and
> I don't like to busy loop, especially since it's not like this is a
> performance-sensitive section of code.
>

I'm acking this patch for now.  However, I want to remind everyone that
rtnl is a "BIG" global lock that has been used by other net devices such
as Ethernet.  It is particularly troublesome for us, since IPoIB needs
to co-exist with Ethernet interfaces that carry a non-trivial amount of
traffic as well.  Imagine everyone trying to bring their interfaces
up/down all together in a large cluster environment that needs to worry
about power consumption.  It would be "cool" if we could see less of
this lock!

Acked-by: Wendy Cheng <wendy.cheng-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
* Re: [PATCH 2/8] IPoIB: Make the carrier_on_task race aware
From: Erez Shitrit @ 2014-09-04 12:13 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

> We blindly assume that we can just take the rtnl lock and that will
> prevent races with downing this interface.  Unfortunately, that's not
> the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
> in an attempt to clear out all remaining instances of ipoib_join_task.
> But, since this task is put on the same workqueue as the join task, the
> flush_workqueue waits on this thread too.  But this thread is deadlocked
> on the rtnl lock.  The better thing here is to use trylock and loop on
> that until we either get the lock or we see that FLAG_ADMIN_UP has
> been cleared, in which case we don't need to do anything anyway and we
> just return.
>
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index a0a42859f12..7e9cd39b5ef 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
>  						   carrier_on_task);
>  	struct ib_port_attr attr;
>
> -	/*
> -	 * Take rtnl_lock to avoid racing with ipoib_stop() and
> -	 * turning the carrier back on while a device is being
> -	 * removed.
> -	 */
>  	if (ib_query_port(priv->ca, priv->port, &attr) ||
>  	    attr.state != IB_PORT_ACTIVE) {
>  		ipoib_dbg(priv, "Keeping carrier off until IB port is active\n");
>  		return;
>  	}
>
> -	rtnl_lock();
> +	/*
> +	 * Take rtnl_lock to avoid racing with ipoib_stop() and
> +	 * turning the carrier back on while a device is being
> +	 * removed. However, ipoib_stop() will attempt to flush
> +	 * the workqueue while holding the rtnl lock, so loop
> +	 * on trylock until either we get the lock or we see
> +	 * FLAG_ADMIN_UP go away as that signals that we are bailing
> +	 * and can safely ignore the carrier on work
> +	 */
> +	while (!rtnl_trylock()) {
> +		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> +			return;

I think that this check shouldn't be connected to the rtnl_lock: if
IPOIB_FLAG_ADMIN_UP is clear there is no need to turn the carrier on;
just wait for the next loop, when ipoib_open will be called.

> +		else
> +			msleep(20);
> +	}
>  	if (!ipoib_cm_admin_enabled(priv->dev))
>  		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
>  	netif_carrier_on(priv->dev);
* Re: [PATCH 2/8] IPoIB: Make the carrier_on_task race aware
From: Doug Ledford @ 2014-09-09 7:17 UTC (permalink / raw)
  To: Erez Shitrit, linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

On 09/04/2014 08:13 AM, Erez Shitrit wrote:
>> We blindly assume that we can just take the rtnl lock and that will
>> prevent races with downing this interface.  Unfortunately, that's not
>> the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
>> in an attempt to clear out all remaining instances of ipoib_join_task.
>> But, since this task is put on the same workqueue as the join task, the
>> flush_workqueue waits on this thread too.  But this thread is deadlocked
>> on the rtnl lock.  The better thing here is to use trylock and loop on
>> that until we either get the lock or we see that FLAG_ADMIN_UP has
>> been cleared, in which case we don't need to do anything anyway and we
>> just return.
>>
>> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> ---
>>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++++++++------
>>  1 file changed, 15 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> index a0a42859f12..7e9cd39b5ef 100644
>> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> @@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
>>  						   carrier_on_task);
>>  	struct ib_port_attr attr;
>>
>> -	/*
>> -	 * Take rtnl_lock to avoid racing with ipoib_stop() and
>> -	 * turning the carrier back on while a device is being
>> -	 * removed.
>> -	 */
>>  	if (ib_query_port(priv->ca, priv->port, &attr) ||
>>  	    attr.state != IB_PORT_ACTIVE) {
>>  		ipoib_dbg(priv, "Keeping carrier off until IB port is active\n");
>>  		return;
>>  	}
>>
>> -	rtnl_lock();
>> +	/*
>> +	 * Take rtnl_lock to avoid racing with ipoib_stop() and
>> +	 * turning the carrier back on while a device is being
>> +	 * removed. However, ipoib_stop() will attempt to flush
>> +	 * the workqueue while holding the rtnl lock, so loop
>> +	 * on trylock until either we get the lock or we see
>> +	 * FLAG_ADMIN_UP go away as that signals that we are bailing
>> +	 * and can safely ignore the carrier on work
>> +	 */
>> +	while (!rtnl_trylock()) {
>> +		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
>> +			return;
>
> I think that this check shouldn't be connected to the rtnl_lock: if
> IPOIB_FLAG_ADMIN_UP is clear there is no need to turn the carrier on;
> just wait for the next loop, when ipoib_open will be called.

Actually, I was optimizing for the common case.  The common case is that
someone is *not* running a "while true; do ifup/ifdown; done" torture
test.  In that situation, you will get the rtnl lock, which you have to
have before doing the two config items below, and so you can totally
skip the check for FLAG_ADMIN_UP because A) the code that drops this
flag gets the rtnl lock first and B) that same code flushes this
workqueue while holding the lock, so if we ever get the lock, we already
know the state of the ADMIN_UP flag.

This is all moot, though, in the sense that once the devices are
switched to per-device workqueues, the need to do this loop goes away
and we can simplify the code.  This was only needed because our flush
task and carrier on task shared a workqueue and we could deadlock
without this little loop.

>> +		else
>> +			msleep(20);
>> +	}
>>  	if (!ipoib_cm_admin_enabled(priv->dev))
>>  		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
>>  	netif_carrier_on(priv->dev);
* [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2014-08-12 23:38 ` [PATCH 1/8] IPoIB: Consolidate rtnl_lock tasks in workqueue Doug Ledford 2014-08-12 23:38 ` [PATCH 2/8] IPoIB: Make the carrier_on_task race aware Doug Ledford @ 2014-08-12 23:38 ` Doug Ledford [not found] ` <a410e80dc5ca7cfa64229bbbf50c1337317e3bd8.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2014-08-12 23:38 ` [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race Doug Ledford ` (6 subsequent siblings) 9 siblings, 1 reply; 37+ messages in thread From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw) To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A Cc: Doug Ledford Commit a9c8ba5884 (IPoIB: Fix usage of uninitialized multicast objects) added a new flag MCAST_JOIN_STARTED, but was not very strict in how it was used. We didn't always initialize the completion struct before we set the flag, and we didn't always call complete on the completion struct from all paths that complete it. This made it less than totally effective, and certainly made its use confusing. And in the flush function we would use the presence of this flag to signal that we should wait on the completion struct, but we never cleared this flag, ever. This is further muddied by the fact that we overload the MCAST_FLAG_BUSY flag to mean two different things: we have a join in flight, and we have succeeded in getting an ib_sa_join_multicast. In order to make things clearer and aid in resolving the rtnl deadlock bug I've been chasing, I cleaned this up a bit. 
1) Remove the MCAST_JOIN_STARTED flag entirely
2) Un-overload MCAST_FLAG_BUSY so it now only means a join is in-flight
3) Test on mcast->mc directly to see if we have completed ib_sa_join_multicast (using IS_ERR_OR_NULL)
4) Make sure that before setting MCAST_FLAG_BUSY we always initialize the mcast->done completion struct
5) Make sure that before calling complete(&mcast->done), we always clear the MCAST_FLAG_BUSY bit
6) Take the mcast_mutex before we call ib_sa_multicast_join and also take the mutex in our join callback. This forces ib_sa_multicast_join to return and set mcast->mc before we process the callback. This way, our callback can safely clear mcast->mc if there is an error on the join and we will do the right thing as a result in mcast_dev_flush.
7) Because we need the mutex to synchronize mcast->mc, we can no longer call mcast_sendonly_join directly from mcast_send and instead must add sendonly join processing to the mcast_join_task

A number of different races are resolved with these changes. These races existed with the old MCAST_FLAG_BUSY usage; the MCAST_JOIN_STARTED flag was an attempt to address them, and while it helped, a determined effort could still trip things up.
One race looks something like this:

Thread 1                                Thread 2
ib_sa_join_multicast (as part of
running restart mcast task)
  alloc member
  call callback
                                        ifconfig ib0 down
                                        wait_for_completion
    callback call completes
                                        wait_for_completion in
                                        mcast_dev_flush completes
                                          mcast->mc is PTR_ERR_OR_NULL
                                          so we skip ib_sa_leave_multicast
    return from callback
    return from ib_sa_join_multicast
set mcast->mc = return from ib_sa_multicast

We now have a permanently unbalanced join/leave issue that trips up the refcounting in core/multicast.c

Another like this:

Thread 1                   Thread 2                Thread 3
ib_sa_multicast_join
                                                   ifconfig ib0 down
                                                   priv->broadcast = NULL
                           join_complete
                                                   wait_for_completion
                           mcast->mc is not yet
                           set, so don't clear
return from ib_sa_join_multicast and set mcast->mc
complete
return -EAGAIN (making mcast->mc invalid)
                                                   call ib_sa_multicast_leave
                                                   on invalid mcast->mc,
                                                   hang forever

By holding the mutex around ib_sa_multicast_join and taking the mutex early in the callback, we force mcast->mc to be valid at the time we run the callback. This allows us to clear mcast->mc if there is an error and the join is going to fail. We do this before we complete the mcast. In this way, mcast_dev_flush always sees consistent state in regards to mcast->mc membership at the time that the wait_for_completion() returns.
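The ordering this commit message relies on, holding mcast_mutex across ib_sa_join_multicast and taking the same mutex at the top of the completion callback, can be sketched in plain userspace C with pthreads. Everything below (the struct, field names, and functions) is an illustrative stand-in, not the kernel code: the thread plays the role of the workqueue that may run the join-completion callback before the join call itself has returned.

```c
#include <pthread.h>
#include <stddef.h>

struct mcast {
    pthread_mutex_t lock;  /* stand-in for the global mcast_mutex */
    void *mc;              /* stand-in for mcast->mc */
    int busy;              /* stand-in for IPOIB_MCAST_FLAG_BUSY */
};

static void *join_complete(void *arg)
{
    struct mcast *m = arg;

    /* Taking the mutex the joiner still holds forces this callback to
     * wait until join() below has stored m->mc, so an error path here
     * could safely inspect or clear it. */
    pthread_mutex_lock(&m->lock);
    m->busy = 0;           /* clear BUSY before "completing" */
    pthread_mutex_unlock(&m->lock);
    return NULL;
}

static void join(struct mcast *m, pthread_t *cb)
{
    pthread_mutex_lock(&m->lock);
    m->busy = 1;
    /* The "SA core" may schedule the callback immediately on another
     * CPU, just like a workqueue task... */
    pthread_create(cb, NULL, join_complete, m);
    /* ...but because we still hold the mutex, we always get to record
     * the join handle before the callback can observe it. */
    m->mc = (void *)m;
    pthread_mutex_unlock(&m->lock);
}

int run_join_demo(void)
{
    struct mcast m = { .lock = PTHREAD_MUTEX_INITIALIZER, .mc = NULL, .busy = 0 };
    pthread_t cb;

    join(&m, &cb);
    pthread_join(cb, NULL);
    /* By the time the callback has run, mc is set and BUSY is clear. */
    return m.mc != NULL && m.busy == 0;
}
```

Without the mutex, the callback could run to completion before `join()` stores `m->mc`, which is exactly the window the two race diagrams describe.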
Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- drivers/infiniband/ulp/ipoib/ipoib.h | 10 +- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 148 ++++++++++++++++--------- 2 files changed, 101 insertions(+), 57 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 3edce617c31..43840971ad8 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -98,9 +98,15 @@ enum { IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ IPOIB_MCAST_FLAG_SENDONLY = 1, - IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ + /* + * For IPOIB_MCAST_FLAG_BUSY + * When set, in flight join and mcast->mc is unreliable + * When clear and mcast->mc IS_ERR_OR_NULL, need to restart or + * haven't started yet + * When clear and mcast->mc is valid pointer, join was successful + */ + IPOIB_MCAST_FLAG_BUSY = 2, IPOIB_MCAST_FLAG_ATTACHED = 3, - IPOIB_MCAST_JOIN_STARTED = 4, MAX_SEND_CQE = 16, IPOIB_CM_COPYBREAK = 256, diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 7e9cd39b5ef..f5e8da530d9 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -271,16 +271,27 @@ ipoib_mcast_sendonly_join_complete(int status, struct ipoib_mcast *mcast = multicast->context; struct net_device *dev = mcast->dev; + /* + * We have to take the mutex to force mcast_sendonly_join to + * return from ib_sa_multicast_join and set mcast->mc to a + * valid value. Otherwise we were racing with ourselves in + * that we might fail here, but get a valid return from + * ib_sa_multicast_join after we had cleared mcast->mc here, + * resulting in mis-matched joins and leaves and a deadlock + */ + mutex_lock(&mcast_mutex); + /* We trap for port events ourselves. 
*/ if (status == -ENETRESET) - return 0; + goto out; if (!status) status = ipoib_mcast_join_finish(mcast, &multicast->rec); if (status) { if (mcast->logcount++ < 20) - ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for %pI6, status %d\n", + ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast " + "join failed for %pI6, status %d\n", mcast->mcmember.mgid.raw, status); /* Flush out any queued packets */ @@ -290,11 +301,15 @@ ipoib_mcast_sendonly_join_complete(int status, dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); } netif_tx_unlock_bh(dev); - - /* Clear the busy flag so we try again */ - status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, - &mcast->flags); } +out: + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + if (status) + mcast->mc = NULL; + complete(&mcast->done); + if (status == -ENETRESET) + status = 0; + mutex_unlock(&mcast_mutex); return status; } @@ -312,12 +327,14 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) int ret = 0; if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) { - ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n"); + ipoib_dbg_mcast(priv, "device shutting down, no sendonly " + "multicast joins\n"); return -ENODEV; } - if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) { - ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n"); + if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "multicast entry busy, skipping " + "sendonly join\n"); return -EBUSY; } @@ -325,6 +342,9 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) rec.port_gid = priv->local_gid; rec.pkey = cpu_to_be16(priv->pkey); + mutex_lock(&mcast_mutex); + init_completion(&mcast->done); + set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port, &rec, IB_SA_MCMEMBER_REC_MGID | @@ -337,12 +357,14 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) if (IS_ERR(mcast->mc)) { ret = PTR_ERR(mcast->mc); 
clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); - ipoib_warn(priv, "ib_sa_join_multicast failed (ret = %d)\n", - ret); + complete(&mcast->done); + ipoib_warn(priv, "ib_sa_join_multicast for sendonly join " + "failed (ret = %d)\n", ret); } else { - ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting join\n", - mcast->mcmember.mgid.raw); + ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting " + "sendonly join\n", mcast->mcmember.mgid.raw); } + mutex_unlock(&mcast_mutex); return ret; } @@ -390,22 +412,28 @@ static int ipoib_mcast_join_complete(int status, ipoib_dbg_mcast(priv, "join completion for %pI6 (status %d)\n", mcast->mcmember.mgid.raw, status); + /* + * We have to take the mutex to force mcast_join to + * return from ib_sa_multicast_join and set mcast->mc to a + * valid value. Otherwise we were racing with ourselves in + * that we might fail here, but get a valid return from + * ib_sa_multicast_join after we had cleared mcast->mc here, + * resulting in mis-matched joins and leaves and a deadlock + */ + mutex_lock(&mcast_mutex); + /* We trap for port events ourselves. 
*/ - if (status == -ENETRESET) { - status = 0; + if (status == -ENETRESET) goto out; - } if (!status) status = ipoib_mcast_join_finish(mcast, &multicast->rec); if (!status) { mcast->backoff = 1; - mutex_lock(&mcast_mutex); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0); - mutex_unlock(&mcast_mutex); /* * Defer carrier on work to ipoib_workqueue to avoid a @@ -413,37 +441,35 @@ static int ipoib_mcast_join_complete(int status, */ if (mcast == priv->broadcast) queue_work(ipoib_workqueue, &priv->carrier_on_task); - - status = 0; - goto out; - } - - if (mcast->logcount++ < 20) { - if (status == -ETIMEDOUT || status == -EAGAIN) { - ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n", - mcast->mcmember.mgid.raw, status); - } else { - ipoib_warn(priv, "multicast join failed for %pI6, status %d\n", - mcast->mcmember.mgid.raw, status); + } else { + if (mcast->logcount++ < 20) { + if (status == -ETIMEDOUT || status == -EAGAIN) { + ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n", + mcast->mcmember.mgid.raw, status); + } else { + ipoib_warn(priv, "multicast join failed for %pI6, status %d\n", + mcast->mcmember.mgid.raw, status); + } } - } - - mcast->backoff *= 2; - if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) - mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; - - /* Clear the busy flag so we try again */ - status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); - mutex_lock(&mcast_mutex); + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + } +out: spin_lock_irq(&priv->lock); - if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + if (status) + mcast->mc = NULL; + complete(&mcast->done); + if (status == -ENETRESET) + status = 0; + if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags)) queue_delayed_work(ipoib_workqueue, &priv->mcast_task, mcast->backoff * HZ); 
spin_unlock_irq(&priv->lock); mutex_unlock(&mcast_mutex); -out: - complete(&mcast->done); + return status; } @@ -492,10 +518,9 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, rec.hop_limit = priv->broadcast->mcmember.hop_limit; } - set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + mutex_lock(&mcast_mutex); init_completion(&mcast->done); - set_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags); - + set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port, &rec, comp_mask, GFP_KERNEL, ipoib_mcast_join_complete, mcast); @@ -509,13 +534,12 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; - mutex_lock(&mcast_mutex); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) queue_delayed_work(ipoib_workqueue, &priv->mcast_task, mcast->backoff * HZ); - mutex_unlock(&mcast_mutex); } + mutex_unlock(&mcast_mutex); } void ipoib_mcast_join_task(struct work_struct *work) @@ -576,7 +600,8 @@ void ipoib_mcast_join_task(struct work_struct *work) } if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { - if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) + if (IS_ERR_OR_NULL(priv->broadcast->mc) && + !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) ipoib_mcast_join(dev, priv->broadcast, 0); return; } @@ -584,23 +609,33 @@ void ipoib_mcast_join_task(struct work_struct *work) while (1) { struct ipoib_mcast *mcast = NULL; + /* + * Need the mutex so our flags are consistent, need the + * priv->lock so we don't race with list removals in either + * mcast_dev_flush or mcast_restart_task + */ + mutex_lock(&mcast_mutex); spin_lock_irq(&priv->lock); list_for_each_entry(mcast, &priv->multicast_list, list) { - if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) - && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) - && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, 
&mcast->flags)) { + if (IS_ERR_OR_NULL(mcast->mc) && + !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) && + !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { /* Found the next unjoined group */ break; } } spin_unlock_irq(&priv->lock); + mutex_unlock(&mcast_mutex); if (&mcast->list == &priv->multicast_list) { /* All done */ break; } - ipoib_mcast_join(dev, mcast, 1); + if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) + ipoib_mcast_sendonly_join(mcast); + else + ipoib_mcast_join(dev, mcast, 1); return; } @@ -646,6 +681,9 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) int ret = 0; if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) + ipoib_warn(priv, "ipoib_mcast_leave on an in-flight join\n"); + + if (!IS_ERR_OR_NULL(mcast->mc)) ib_sa_free_multicast(mcast->mc); if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { @@ -698,6 +736,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb) memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid)); __ipoib_mcast_add(dev, mcast); list_add_tail(&mcast->list, &priv->multicast_list); + if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0); } if (!mcast->ah) { @@ -711,8 +751,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb) if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) ipoib_dbg_mcast(priv, "no address vector, " "but multicast join already started\n"); - else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) - ipoib_mcast_sendonly_join(mcast); /* * If lookup completes between here and out:, don't @@ -774,7 +812,7 @@ void ipoib_mcast_dev_flush(struct net_device *dev) /* seperate between the wait to the leave*/ list_for_each_entry_safe(mcast, tmcast, &remove_list, list) - if (test_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags)) + if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) wait_for_completion(&mcast->done); 
list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { -- 1.9.3
* Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage [not found] ` <a410e80dc5ca7cfa64229bbbf50c1337317e3bd8.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2014-08-19 18:08 ` Wendy Cheng [not found] ` <CABgxfbH-Dt3CpxJKwCAZeHTUyupaA9y_WXVXuxgiPMet26PTQw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 37+ messages in thread From: Wendy Cheng @ 2014-08-19 18:08 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > Commit a9c8ba5884 (IPoIB: Fix usage of uninitialized multicast objects) > added a new flag MCAST_JOIN_STARTED, but was not very strict in how it > was used. We didn't always initialize the completion struct before we > set the flag, and we didn't always call complete on the completion > struct from all paths that complete it. This made it less than totally > effective, and certainly made its use confusing. And in the flush > function we would use the presence of this flag to signal that we should > wait on the completion struct, but we never cleared this flag, ever. > This is further muddied by the fact that we overload the MCAST_FLAG_BUSY > flag to mean two different things: we have a join in flight, and we have > succeeded in getting an ib_sa_join_multicast. > [snip] > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > index 7e9cd39b5ef..f5e8da530d9 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > @@ -271,16 +271,27 @@ ipoib_mcast_sendonly_join_complete(int status, > struct ipoib_mcast *mcast = multicast->context; > struct net_device *dev = mcast->dev; > > + /* > + * We have to take the mutex to force mcast_sendonly_join to > + * return from ib_sa_multicast_join and set mcast->mc to a > + * valid value. 
Otherwise we were racing with ourselves in > + * that we might fail here, but get a valid return from > + * ib_sa_multicast_join after we had cleared mcast->mc here, > + * resulting in mis-matched joins and leaves and a deadlock > + */ > + mutex_lock(&mcast_mutex); > + > /* We trap for port events ourselves. */ > if (status == -ENETRESET) > - return 0; > + goto out; > [snip] > +out: > + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); > + if (status) > + mcast->mc = NULL; > + complete(&mcast->done); > + if (status == -ENETRESET) > + status = 0; > + mutex_unlock(&mcast_mutex); > return status; > } > I have not gone thru the whole patch yet. However, here is something quick ... Would this "return 0 if (status == -ENETRESET)" be wrong in the original code ? Say ..... when it returns to process_group_error(), without proper error return code, it'll skip ib_sa_free_multicast(). Should we worry about it ? -- Wendy -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage [not found] ` <CABgxfbH-Dt3CpxJKwCAZeHTUyupaA9y_WXVXuxgiPMet26PTQw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2014-08-19 20:28 ` Doug Ledford [not found] ` <902D5BF2-159A-4B31-A87F-7491F3C8057F-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 37+ messages in thread From: Doug Ledford @ 2014-08-19 20:28 UTC (permalink / raw) To: Wendy Cheng Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A Excuse my top posting, I'm at a meeting and not using my normal mail client. To answer your question, this is a confusing issue that I had to work through myself. When we call ib_sa_multicast_join, it will create a record for us and return that. That record is not valid for us to call ib_sa_multicast_leave until after the join has completed and our callback has acknowledged that join completion with a 0 return status from our callback. At the time the record is created, ib_sa_multicast_join will also queue up the actual join. When the join completes, they call our callback (and without my locking fixes, we can get the callback before ib_sa_multicast_join returns our record entry to us). If our callback is called with the status already set, then the join has failed and will be handled that way when our callback returns. In this case, the callback is informative for us, that's all. When we trap -ENETRESET and put our return to 0, we are acknowledging that the join failed prior to our callback and that the join will not be completed and that mcast->mc as we have recorded is not valid. If the status is 0 to start, and we set it to a non-0 value, then that also indicates that the join has failed, but in this case the core code must unroll the join that was completed up until the callback was done (and it does). 
Any time this is the case, the member record returned from ib_sa_multicast_join is invalid and we can not call ib_sa_multicast_leave on the record or we will hang forever waiting for a completion that won't be coming. Only if we get our callback entered with a 0 status and return with a 0 status does the member record that ib_sa_multicast_join returned to us become fully valid. At this point, we must call ib_sa_multicast_leave on the record if we want it to go away. So that's why in this patch we 1) take a mutex to force ib_sa_join_multicast to return and us to set mcast->mc to the proper return value before we process the join completion callback 2) always clear mcast->mc if there is any error since we can't call ib_sa_multicast_leave 3) always complete the mcast in case we are waiting on it 4) only if our status is ENETRESET set our return to 0 so the ib core code knows we acknowledged the event Sent from my iPad > On Aug 19, 2014, at 11:08 AM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > >> On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> Commit a9c8ba5884 (IPoIB: Fix usage of uninitialized multicast objects) >> added a new flag MCAST_JOIN_STARTED, but was not very strict in how it >> was used. We didn't always initialize the completion struct before we >> set the flag, and we didn't always call complete on the completion >> struct from all paths that complete it. This made it less than totally >> effective, and certainly made its use confusing. And in the flush >> function we would use the presence of this flag to signal that we should >> wait on the completion struct, but we never cleared this flag, ever. >> This is further muddied by the fact that we overload the MCAST_FLAG_BUSY >> flag to mean two different things: we have a join in flight, and we have >> succeeded in getting an ib_sa_join_multicast. 
>> > > [snip] > >> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c >> index 7e9cd39b5ef..f5e8da530d9 100644 >> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c >> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c >> @@ -271,16 +271,27 @@ ipoib_mcast_sendonly_join_complete(int status, >> struct ipoib_mcast *mcast = multicast->context; >> struct net_device *dev = mcast->dev; >> >> + /* >> + * We have to take the mutex to force mcast_sendonly_join to >> + * return from ib_sa_multicast_join and set mcast->mc to a >> + * valid value. Otherwise we were racing with ourselves in >> + * that we might fail here, but get a valid return from >> + * ib_sa_multicast_join after we had cleared mcast->mc here, >> + * resulting in mis-matched joins and leaves and a deadlock >> + */ >> + mutex_lock(&mcast_mutex); >> + >> /* We trap for port events ourselves. */ >> if (status == -ENETRESET) >> - return 0; >> + goto out; >> > > [snip] > > >> +out: >> + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); >> + if (status) >> + mcast->mc = NULL; >> + complete(&mcast->done); >> + if (status == -ENETRESET) >> + status = 0; >> + mutex_unlock(&mcast_mutex); >> return status; >> } >> > > I have not gone thru the whole patch yet. However, here is something quick ... > > Would this "return 0 if (status == -ENETRESET)" be wrong in the > original code ? Say ..... when it returns to process_group_error(), > without proper error return code, it'll skip ib_sa_free_multicast(). > Should we worry about it ? > > -- Wendy -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
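The lifecycle rules Doug spells out in the reply above (a record is only safe to leave after a fully successful join, and a callback acknowledges -ENETRESET by returning 0) can be written down as two tiny pure functions. This is a toy encoding of the stated rules with hypothetical names, not anything from the IPoIB source:

```c
#include <errno.h>

/* A join record is only valid for a later leave/free when the callback
 * was entered with status 0 AND the callback itself returned 0.  Any
 * failure on either side means the SA core unwound the join and the
 * record must not be used. */
int record_valid(int callback_status, int callback_ret)
{
    return callback_status == 0 && callback_ret == 0;
}

/* Point (4) of the reply: a callback that sees -ENETRESET acknowledges
 * the event by returning 0; every other status is passed through so the
 * core can unroll the join. */
int callback_return(int status)
{
    return status == -ENETRESET ? 0 : status;
}
```

This also answers Wendy's question about skipping ib_sa_free_multicast(): when the callback was entered with an error, the record was never fully valid, so there is nothing to free.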
* Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage [not found] ` <902D5BF2-159A-4B31-A87F-7491F3C8057F-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2014-08-25 19:51 ` Wendy Cheng [not found] ` <CABgxfbHOD75vLdZ0TtWZbk8ne3kHd_eWObxPHmoJ-D8DjE0bkw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 37+ messages in thread From: Wendy Cheng @ 2014-08-25 19:51 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A On Tue, Aug 19, 2014 at 1:28 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > So that's why in this patch we > > 1) take a mutex to force ib_sa_join_multicast to return and us to set mcast->mc to the proper return value before we process the join completion callback > 2) always clear mcast->mc if there is any error since we can't call ib_sa_multicast_leave > 3) always complete the mcast in case we are waiting on it > 4) only if our status is ENETRESET set our return to 0 so the ib core code knows we acknowledged the event > We don't have IPOIB_MCAST_JOIN_STARTED (and the "done" completion struct) in our code base (MPSS) yet ...I'm *not* n-acking this patch but I find it hard to understand the ramifications. It has nothing to do with this patch - actually the patch itself looks pretty ok (by eyes). The original IPOIB mcast flow, particularly its abnormal error path, confuses me. Is it really possible for ib_sa_join_multicast() to return *after* its callback (ipoib_mcast_sendonly_join_complete and ipoib_mcast_join_complete) ? The mcast->done completion struct looks dangerous as well. I'll let other capable people to do the final call(s). -- Wendy -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage [not found] ` <CABgxfbHOD75vLdZ0TtWZbk8ne3kHd_eWObxPHmoJ-D8DjE0bkw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2014-08-25 20:03 ` Doug Ledford [not found] ` <E3EFCBAC-2D6E-49D3-A556-DBD40701CC5F-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 37+ messages in thread From: Doug Ledford @ 2014-08-25 20:03 UTC (permalink / raw) To: Wendy Cheng Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A > On Aug 25, 2014, at 2:51 PM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > >> On Tue, Aug 19, 2014 at 1:28 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> >> So that's why in this patch we >> >> 1) take a mutex to force ib_sa_join_multicast to return and us to set mcast->mc to the proper return value before we process the join completion callback >> 2) always clear mcast->mc if there is any error since we can't call ib_sa_multicast_leave >> 3) always complete the mcast in case we are waiting on it >> 4) only if our status is ENETRESET set our return to 0 so the ib core code knows we acknowledged the event > > confuses me. Is it really possible for ib_sa_join_multicast() to > return *after* its callback (ipoib_mcast_sendonly_join_complete and > ipoib_mcast_join_complete) ? Yes. They are both on work queues and ib_sa_multicast_join simply fires off another workqueue task. The scheduler is free to start that task instantly if the workqueue isn't busy, and it often does (although not necessarily on the same CPU). Then it is a race to see who finishes first. > The mcast->done completion struct looks > dangerous as well. It *was* dangerous ;-). My patches cleaned it up. 
Before my patches there were failure to completes, then during an early version of my patch there were double completes and failure to clear mcast->mc, then I figured out the race on the callback finishing before ib_sa_multicast_join and added the mutex and things suddenly started working much better ;-) > I'll let other capable people to do the final call(s). > > -- Wendy > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Sent from my iPad -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage [not found] ` <E3EFCBAC-2D6E-49D3-A556-DBD40701CC5F-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2014-08-26 18:04 ` Wendy Cheng 0 siblings, 0 replies; 37+ messages in thread From: Wendy Cheng @ 2014-08-26 18:04 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A On Mon, Aug 25, 2014 at 1:03 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > >> On Aug 25, 2014, at 2:51 PM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: >> >> Is it really possible for ib_sa_join_multicast() to >> return *after* its callback (ipoib_mcast_sendonly_join_complete and >> ipoib_mcast_join_complete) ? > > Yes. They are both on work queues and ib_sa_join_multicast simply fires off another workqueue task. The scheduler is free to start that task instantly if the workqueue isn't busy, and it often does (although not necessarily on the same CPU). Then it is a race to see who finishes first. > Ok, thanks for the explanation. I also googled and found the original patch where the IPOIB_MCAST_JOIN_STARTED was added. This patch now makes sense. Acked-by: Wendy Cheng <wendy.cheng-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> On the other hand, I'm still puzzled why ib_sa_join_multicast() can't be a blocking call (i.e. wait until callback is executed) - why would IPOIB pay the price to work around these nasty issues ? But I guess that is off-topic too much .. BTW, thanks for the work. Our users will be doing if-up-down a lot for power management, patches like these help ! -- Wendy -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ` (2 preceding siblings ...) 2014-08-12 23:38 ` [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage Doug Ledford @ 2014-08-12 23:38 ` Doug Ledford [not found] ` <e1dbcfc25d8930b281aad12699ebf8fa82485b0e.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2014-08-12 23:38 ` [PATCH 5/8] IPoIB: change init sequence ordering Doug Ledford ` (5 subsequent siblings) 9 siblings, 1 reply; 37+ messages in thread From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw) To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A Cc: Doug Ledford

Our mcast_dev_flush routine and our mcast_restart_task can race against each other. In particular, they both hold the priv->lock while manipulating the rbtree and while removing mcast entries from the multicast_list and while adding entries to the remove_list, but they also both drop their locks prior to doing the actual removes. The mcast_dev_flush routine is run entirely under the rtnl lock and so has at least some locking. The actual race condition is like this:

Thread 1                                Thread 2
ifconfig ib0 up
  start multicast join for broadcast
  multicast join completes for broadcast
  start to add more multicast joins
  call mcast_restart_task to add new entries
                                        ifconfig ib0 down
                                        mcast_dev_flush
                                        mcast_leave(mcast A)
  mcast_leave(mcast A)

As mcast_leave calls ib_sa_multicast_leave, and as member in core/multicast.c is ref counted, we run into an unbalanced refcount issue. To avoid stomping on each other's removes, take the rtnl lock specifically when we are deleting the entries from the remove list.
Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 37 ++++++++++++++++++++++---- 1 file changed, 32 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index f5e8da530d9..19e3fe75ebf 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -810,7 +810,10 @@ void ipoib_mcast_dev_flush(struct net_device *dev) spin_unlock_irqrestore(&priv->lock, flags); - /* seperate between the wait to the leave*/ + /* + * make sure the in-flight joins have finished before we attempt + * to leave + */ list_for_each_entry_safe(mcast, tmcast, &remove_list, list) if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) wait_for_completion(&mcast->done); @@ -931,14 +934,38 @@ void ipoib_mcast_restart_task(struct work_struct *work) netif_addr_unlock(dev); local_irq_restore(flags); - /* We have to cancel outside of the spinlock */ + /* + * make sure the in-flight joins have finished before we attempt + * to leave + */ + list_for_each_entry_safe(mcast, tmcast, &remove_list, list) + if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) + wait_for_completion(&mcast->done); + + /* + * We have to cancel outside of the spinlock, but we have to + * take the rtnl lock or else we race with the removal of + * entries from the remove list in mcast_dev_flush as part + * of ipoib_stop() which will call mcast_stop_thread with + * flush == 1 while holding the rtnl lock, and the + * flush_workqueue won't complete until this restart_mcast_task + * completes. So do like the carrier on task and attempt to + * take the rtnl lock, but if we can't before the ADMIN_UP flag + * goes away, then just return and know that the remove list will + * get flushed later by mcast_dev_flush. 
+ */ + while (!rtnl_trylock()) { + if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + return; + else + msleep(20); + } list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { ipoib_mcast_leave(mcast->dev, mcast); ipoib_mcast_free(mcast); } - - if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) - ipoib_mcast_start_thread(dev); + ipoib_mcast_start_thread(dev); + rtnl_unlock(); } #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 37+ messages in thread
[parent not found: <e1dbcfc25d8930b281aad12699ebf8fa82485b0e.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race [not found] ` <e1dbcfc25d8930b281aad12699ebf8fa82485b0e.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2014-08-29 21:53 ` Wendy Cheng [not found] ` <CABgxfbHDuUrdHuLJT2oD07Cy3Ys4_rj-bJ6eR=9+uv0CuPH7_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 37+ messages in thread From: Wendy Cheng @ 2014-08-29 21:53 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > Our mcast_dev_flush routine and our mcast_restart_task can race against > each other. In particular, they both hold the priv->lock while > manipulating the rbtree and while removing mcast entries from the > multicast_list and while adding entries to the remove_list, but they > also both drop their locks prior to doing the actual removes. The > mcast_dev_flush routine is run entirely under the rtnl lock and so has > at least some locking. The actual race condition is like this: > > Thread 1 Thread 2 > ifconfig ib0 up > start multicast join for broadcast > multicast join completes for broadcast > start to add more multicast joins > call mcast_restart_task to add new entries > ifconfig ib0 down > mcast_dev_flush > mcast_leave(mcast A) > mcast_leave(mcast A) > > As mcast_leave calls ib_sa_multicast_leave, and as member in > core/multicast.c is ref counted, we run into an unbalanced refcount > issue. To avoid stomping on each others removes, take the rtnl lock > specifically when we are deleting the entries from the remove list. Isn't "test_and_clear_bit()" atomic so it is unlikely that ib_sa_free_multicast() can run multiple times ? 
638 static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) 639 { 640 struct ipoib_dev_priv *priv = netdev_priv(dev); 641 int ret = 0; 642 643 if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) 644 ib_sa_free_multicast(mcast->mc); 645 646 if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags) ) { -- Wendy > > Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > --- > drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 37 ++++++++++++++++++++++---- > 1 file changed, 32 insertions(+), 5 deletions(-) > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > index f5e8da530d9..19e3fe75ebf 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > @@ -810,7 +810,10 @@ void ipoib_mcast_dev_flush(struct net_device *dev) > > spin_unlock_irqrestore(&priv->lock, flags); > > - /* seperate between the wait to the leave*/ > + /* > + * make sure the in-flight joins have finished before we attempt > + * to leave > + */ > list_for_each_entry_safe(mcast, tmcast, &remove_list, list) > if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) > wait_for_completion(&mcast->done); > @@ -931,14 +934,38 @@ void ipoib_mcast_restart_task(struct work_struct *work) > netif_addr_unlock(dev); > local_irq_restore(flags); > > - /* We have to cancel outside of the spinlock */ > + /* > + * make sure the in-flight joins have finished before we attempt > + * to leave > + */ > + list_for_each_entry_safe(mcast, tmcast, &remove_list, list) > + if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) > + wait_for_completion(&mcast->done); > + > + /* > + * We have to cancel outside of the spinlock, but we have to > + * take the rtnl lock or else we race with the removal of > + * entries from the remove list in mcast_dev_flush as part > + * of ipoib_stop() which will call mcast_stop_thread with > + * flush == 1 while holding the 
rtnl lock, and the > + * flush_workqueue won't complete until this restart_mcast_task > + * completes. So do like the carrier on task and attempt to > + * take the rtnl lock, but if we can't before the ADMIN_UP flag > + * goes away, then just return and know that the remove list will > + * get flushed later by mcast_dev_flush. > + */ > + while (!rtnl_trylock()) { > + if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) > + return; > + else > + msleep(20); > + } > list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { > ipoib_mcast_leave(mcast->dev, mcast); > ipoib_mcast_free(mcast); > } > - > - if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) > - ipoib_mcast_start_thread(dev); > + ipoib_mcast_start_thread(dev); > + rtnl_unlock(); > } > > #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG > -- > 1.9.3 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
[parent not found: <CABgxfbHDuUrdHuLJT2oD07Cy3Ys4_rj-bJ6eR=9+uv0CuPH7_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race [not found] ` <CABgxfbHDuUrdHuLJT2oD07Cy3Ys4_rj-bJ6eR=9+uv0CuPH7_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2014-08-30 15:39 ` Wendy Cheng [not found] ` <CABgxfbGJOdmAn1sokEtisDdnA=r_4mfP=PfqZVsP0cd_oL50dA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2014-09-03 17:49 ` Doug Ledford 1 sibling, 1 reply; 37+ messages in thread From: Wendy Cheng @ 2014-08-30 15:39 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A On Fri, Aug 29, 2014 at 2:53 PM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> Our mcast_dev_flush routine and our mcast_restart_task can race against >> each other. In particular, they both hold the priv->lock while >> manipulating the rbtree and while removing mcast entries from the >> multicast_list and while adding entries to the remove_list, but they >> also both drop their locks prior to doing the actual removes. The >> mcast_dev_flush routine is run entirely under the rtnl lock and so has >> at least some locking. The actual race condition is like this: >> >> Thread 1 Thread 2 >> ifconfig ib0 up >> start multicast join for broadcast >> multicast join completes for broadcast >> start to add more multicast joins >> call mcast_restart_task to add new entries >> ifconfig ib0 down >> mcast_dev_flush >> mcast_leave(mcast A) >> mcast_leave(mcast A) >> >> As mcast_leave calls ib_sa_multicast_leave, and as member in >> core/multicast.c is ref counted, we run into an unbalanced refcount >> issue. To avoid stomping on each others removes, take the rtnl lock >> specifically when we are deleting the entries from the remove list. > > Isn't "test_and_clear_bit()" atomic so it is unlikely that > ib_sa_free_multicast() can run multiple times ? Oops .. how about if the structure itself gets freed ? My bad ! 
However, isn't the remove_list a local list on the caller's stack? And
since the original list entry moving (to remove_list) is protected by
the spin lock (priv->lock), isn't it unlikely that
ib_sa_free_multicast() can operate on the same entry?

The patch itself is harmless though, but adding the rtnl_lock is really
not ideal.

-- Wendy

^ permalink raw reply [flat|nested] 37+ messages in thread
[parent not found: <CABgxfbGJOdmAn1sokEtisDdnA=r_4mfP=PfqZVsP0cd_oL50dA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race [not found] ` <CABgxfbGJOdmAn1sokEtisDdnA=r_4mfP=PfqZVsP0cd_oL50dA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2014-09-03 18:06 ` Doug Ledford [not found] ` <1409767614.26762.7.camel-v+aXH1h/sVwpzh8Nc7Vzg+562jBIR2Zt@public.gmane.org> 0 siblings, 1 reply; 37+ messages in thread From: Doug Ledford @ 2014-09-03 18:06 UTC (permalink / raw) To: Wendy Cheng Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A [-- Attachment #1: Type: text/plain, Size: 3223 bytes --] On Sat, 2014-08-30 at 08:39 -0700, Wendy Cheng wrote: > On Fri, Aug 29, 2014 at 2:53 PM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > > On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > >> Our mcast_dev_flush routine and our mcast_restart_task can race against > >> each other. In particular, they both hold the priv->lock while > >> manipulating the rbtree and while removing mcast entries from the > >> multicast_list and while adding entries to the remove_list, but they > >> also both drop their locks prior to doing the actual removes. The > >> mcast_dev_flush routine is run entirely under the rtnl lock and so has > >> at least some locking. The actual race condition is like this: > >> > >> Thread 1 Thread 2 > >> ifconfig ib0 up > >> start multicast join for broadcast > >> multicast join completes for broadcast > >> start to add more multicast joins > >> call mcast_restart_task to add new entries > >> ifconfig ib0 down > >> mcast_dev_flush > >> mcast_leave(mcast A) > >> mcast_leave(mcast A) > >> > >> As mcast_leave calls ib_sa_multicast_leave, and as member in > >> core/multicast.c is ref counted, we run into an unbalanced refcount > >> issue. To avoid stomping on each others removes, take the rtnl lock > >> specifically when we are deleting the entries from the remove list. 
> > > > Isn't "test_and_clear_bit()" atomic so it is unlikely that > > ib_sa_free_multicast() can run multiple times ? > > Oops .. how about if the structure itself gets freed ? My bad ! Well, just like the last email, the code you are referring to is in the original code, and had other issues. After my patches it does not look like that. > However, isn't that the remove_list a local list on the caller's stack > ? .. and the original list entry moving (to remove_list) is protected > by the spin lock (priv->lock), it is unlikely that the > ib_sa_free_multicast() can operate on the same entry ? Yes, you're right. I had it in my mind that the remove_list was part of the ipoib private dev, not local on the stack. So you are right that we could probably get away with removing the rtnl lock there (although it would need to be in a later patch than the one you are reviewing here...here there would still be a race between the restart task and the downing of the interface because they all still share the same work queue, but once we switch to the per device work queues in patch #6, this can happen in parallel safely with the flush task I think). > The patch itself is harmless though .. but adding the rntl_lock is > really not ideal. > > -- Wendy > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> GPG KeyID: 0E572FDD [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 37+ messages in thread
[parent not found: <1409767614.26762.7.camel-v+aXH1h/sVwpzh8Nc7Vzg+562jBIR2Zt@public.gmane.org>]
* Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race [not found] ` <1409767614.26762.7.camel-v+aXH1h/sVwpzh8Nc7Vzg+562jBIR2Zt@public.gmane.org> @ 2014-09-03 19:45 ` Wendy Cheng 0 siblings, 0 replies; 37+ messages in thread From: Wendy Cheng @ 2014-09-03 19:45 UTC (permalink / raw) To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A On Wed, Sep 3, 2014 at 11:06 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Sat, 2014-08-30 at 08:39 -0700, Wendy Cheng wrote: >> On Fri, Aug 29, 2014 at 2:53 PM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: >> > On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> >> Our mcast_dev_flush routine and our mcast_restart_task can race against >> >> each other. In particular, they both hold the priv->lock while >> >> manipulating the rbtree and while removing mcast entries from the >> >> multicast_list and while adding entries to the remove_list, but they >> >> also both drop their locks prior to doing the actual removes. The >> >> mcast_dev_flush routine is run entirely under the rtnl lock and so has >> >> at least some locking. The actual race condition is like this: >> >> >> >> Thread 1 Thread 2 >> >> ifconfig ib0 up >> >> start multicast join for broadcast >> >> multicast join completes for broadcast >> >> start to add more multicast joins >> >> call mcast_restart_task to add new entries >> >> ifconfig ib0 down >> >> mcast_dev_flush >> >> mcast_leave(mcast A) >> >> mcast_leave(mcast A) >> >> >> >> As mcast_leave calls ib_sa_multicast_leave, and as member in >> >> core/multicast.c is ref counted, we run into an unbalanced refcount >> >> issue. To avoid stomping on each others removes, take the rtnl lock >> >> specifically when we are deleting the entries from the remove list. >> > [snip] .. >> However, isn't that the remove_list a local list on the caller's stack >> ? .. 
and the original list entry moving (to remove_list) is protected >> by the spin lock (priv->lock), it is unlikely that the >> ib_sa_free_multicast() can operate on the same entry ? > > Yes, you're right. I had it in my mind that the remove_list was part of > the ipoib private dev, not local on the stack. So you are right that we > could probably get away with removing the rtnl lock there (although it > would need to be in a later patch than the one you are reviewing > here...here there would still be a race between the restart task and the > downing of the interface because they all still share the same work > queue, but once we switch to the per device work queues in patch #6, > this can happen in parallel safely with the flush task I think). > Yep, that local "remove_list" is easy to miss. BTW, you might want to modify the text description of this patch to make your point clear. -- Wendy -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race [not found] ` <CABgxfbHDuUrdHuLJT2oD07Cy3Ys4_rj-bJ6eR=9+uv0CuPH7_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2014-08-30 15:39 ` Wendy Cheng @ 2014-09-03 17:49 ` Doug Ledford 1 sibling, 0 replies; 37+ messages in thread From: Doug Ledford @ 2014-09-03 17:49 UTC (permalink / raw) To: Wendy Cheng Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A [-- Attachment #1: Type: text/plain, Size: 6050 bytes --] On Fri, 2014-08-29 at 14:53 -0700, Wendy Cheng wrote: > On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > Our mcast_dev_flush routine and our mcast_restart_task can race against > > each other. In particular, they both hold the priv->lock while > > manipulating the rbtree and while removing mcast entries from the > > multicast_list and while adding entries to the remove_list, but they > > also both drop their locks prior to doing the actual removes. The > > mcast_dev_flush routine is run entirely under the rtnl lock and so has > > at least some locking. The actual race condition is like this: > > > > Thread 1 Thread 2 > > ifconfig ib0 up > > start multicast join for broadcast > > multicast join completes for broadcast > > start to add more multicast joins > > call mcast_restart_task to add new entries > > ifconfig ib0 down > > mcast_dev_flush > > mcast_leave(mcast A) > > mcast_leave(mcast A) > > > > As mcast_leave calls ib_sa_multicast_leave, and as member in > > core/multicast.c is ref counted, we run into an unbalanced refcount > > issue. To avoid stomping on each others removes, take the rtnl lock > > specifically when we are deleting the entries from the remove list. > > Isn't "test_and_clear_bit()" atomic so it is unlikely that > ib_sa_free_multicast() can run multiple times ? 
> > 638 static int ipoib_mcast_leave(struct net_device *dev, struct > ipoib_mcast *mcast) > 639 { > 640 struct ipoib_dev_priv *priv = netdev_priv(dev); > 641 int ret = 0; > 642 > 643 if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) > 644 ib_sa_free_multicast(mcast->mc); > 645 > 646 if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, > &mcast->flags) ) { This is the original code, and you would be right, but it had other problems. Even though test_and_clear_bit is atomic, the original usage of MCAST_FLAG_BUSY was really that the mcast join had started, there was no guarantee it had completed (we would set the flag, call ib_sa_multicast_join, and we could then race with our leave where it was possible to leave prior to the join completing). My patches change this around now, and the code only checks for MCAST_FLAG_BUSY as a double check that we aren't leaving an in-process join. > > -- Wendy > > > > > Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > --- > > drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 37 ++++++++++++++++++++++---- > > 1 file changed, 32 insertions(+), 5 deletions(-) > > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > > index f5e8da530d9..19e3fe75ebf 100644 > > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > > @@ -810,7 +810,10 @@ void ipoib_mcast_dev_flush(struct net_device *dev) > > > > spin_unlock_irqrestore(&priv->lock, flags); > > > > - /* seperate between the wait to the leave*/ > > + /* > > + * make sure the in-flight joins have finished before we attempt > > + * to leave > > + */ > > list_for_each_entry_safe(mcast, tmcast, &remove_list, list) > > if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) > > wait_for_completion(&mcast->done); > > @@ -931,14 +934,38 @@ void ipoib_mcast_restart_task(struct work_struct *work) > > netif_addr_unlock(dev); > > 
local_irq_restore(flags); > > > > - /* We have to cancel outside of the spinlock */ > > + /* > > + * make sure the in-flight joins have finished before we attempt > > + * to leave > > + */ > > + list_for_each_entry_safe(mcast, tmcast, &remove_list, list) > > + if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) > > + wait_for_completion(&mcast->done); > > + > > + /* > > + * We have to cancel outside of the spinlock, but we have to > > + * take the rtnl lock or else we race with the removal of > > + * entries from the remove list in mcast_dev_flush as part > > + * of ipoib_stop() which will call mcast_stop_thread with > > + * flush == 1 while holding the rtnl lock, and the > > + * flush_workqueue won't complete until this restart_mcast_task > > + * completes. So do like the carrier on task and attempt to > > + * take the rtnl lock, but if we can't before the ADMIN_UP flag > > + * goes away, then just return and know that the remove list will > > + * get flushed later by mcast_dev_flush. > > + */ > > + while (!rtnl_trylock()) { > > + if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) > > + return; > > + else > > + msleep(20); > > + } > > list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { > > ipoib_mcast_leave(mcast->dev, mcast); > > ipoib_mcast_free(mcast); > > } > > - > > - if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) > > - ipoib_mcast_start_thread(dev); > > + ipoib_mcast_start_thread(dev); > > + rtnl_unlock(); > > } > > > > #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG > > -- > > 1.9.3 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> GPG KeyID: 0E572FDD [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 37+ 
messages in thread
* [PATCH 5/8] IPoIB: change init sequence ordering
  2014-08-12 23:38 ` Doug Ledford
  9 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
Cc: Doug Ledford

In preparation for using per device work queues, we need to move the
start of the neighbor thread task to after ipoib_ib_dev_init and move
the destruction of the neighbor task to before ipoib_ib_dev_cleanup.
Otherwise we will end up freeing our workqueue with work possibly still
on it.
Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 217cb77157d..949948a46d4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1262,15 +1262,13 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) { struct ipoib_dev_priv *priv = netdev_priv(dev); - if (ipoib_neigh_hash_init(priv) < 0) - goto out; /* Allocate RX/TX "rings" to hold queued skbs */ priv->rx_ring = kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring, GFP_KERNEL); if (!priv->rx_ring) { printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", ca->name, ipoib_recvq_size); - goto out_neigh_hash_cleanup; + goto out; } priv->tx_ring = vzalloc(ipoib_sendq_size * sizeof *priv->tx_ring); @@ -1285,16 +1283,24 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) if (ipoib_ib_dev_init(dev, ca, port)) goto out_tx_ring_cleanup; + /* + * Must be after ipoib_ib_dev_init so we can allocate a per + * device wq there and use it here + */ + if (ipoib_neigh_hash_init(priv) < 0) + goto out_dev_uninit; + return 0; +out_dev_uninit: + ipoib_ib_dev_cleanup(); + out_tx_ring_cleanup: vfree(priv->tx_ring); out_rx_ring_cleanup: kfree(priv->rx_ring); -out_neigh_hash_cleanup: - ipoib_neigh_hash_uninit(dev); out: return -ENOMEM; } @@ -1317,6 +1323,12 @@ void ipoib_dev_cleanup(struct net_device *dev) } unregister_netdevice_many(&head); + /* + * Must be before ipoib_ib_dev_cleanup or we delete an in use + * work queue + */ + ipoib_neigh_hash_uninit(dev); + ipoib_ib_dev_cleanup(dev); kfree(priv->rx_ring); @@ -1324,8 +1336,6 @@ void ipoib_dev_cleanup(struct net_device *dev) priv->rx_ring = NULL; priv->tx_ring = NULL; - - ipoib_neigh_hash_uninit(dev); } static const struct 
header_ops ipoib_header_ops = { -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 37+ messages in thread
[parent not found: <ead9800512c1cb412b86cb1de3868c40f07c72be.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 5/8] IPoIB: change init sequence ordering [not found] ` <ead9800512c1cb412b86cb1de3868c40f07c72be.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2014-09-04 12:36 ` Erez Shitrit 0 siblings, 0 replies; 37+ messages in thread From: Erez Shitrit @ 2014-09-04 12:36 UTC (permalink / raw) To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A > In preparation for using per device work queues, we need to move the > start of the neighbor thread task to after ipoib_ib_dev_init and move > the destruction of the neighbor task to before ipoib_ib_dev_cleanup. > Otherwise we will end up freeing our workqueue with work possibly still > on it. > > Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Acked-by: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> > --- > drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +++++++++++++++++------- > 1 file changed, 17 insertions(+), 7 deletions(-) > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 217cb77157d..949948a46d4 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -1262,15 +1262,13 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > > - if (ipoib_neigh_hash_init(priv) < 0) > - goto out; > /* Allocate RX/TX "rings" to hold queued skbs */ > priv->rx_ring = kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring, > GFP_KERNEL); > if (!priv->rx_ring) { > printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", > ca->name, ipoib_recvq_size); > - goto out_neigh_hash_cleanup; > + goto out; > } > > priv->tx_ring = vzalloc(ipoib_sendq_size * sizeof *priv->tx_ring); > @@ -1285,16 +1283,24 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) > if (ipoib_ib_dev_init(dev, ca, port)) > goto out_tx_ring_cleanup; > > + /* > 
+ * Must be after ipoib_ib_dev_init so we can allocate a per > + * device wq there and use it here > + */ > + if (ipoib_neigh_hash_init(priv) < 0) > + goto out_dev_uninit; > + > return 0; > > +out_dev_uninit: > + ipoib_ib_dev_cleanup(); > + > out_tx_ring_cleanup: > vfree(priv->tx_ring); > > out_rx_ring_cleanup: > kfree(priv->rx_ring); > > -out_neigh_hash_cleanup: > - ipoib_neigh_hash_uninit(dev); > out: > return -ENOMEM; > } > @@ -1317,6 +1323,12 @@ void ipoib_dev_cleanup(struct net_device *dev) > } > unregister_netdevice_many(&head); > > + /* > + * Must be before ipoib_ib_dev_cleanup or we delete an in use > + * work queue > + */ > + ipoib_neigh_hash_uninit(dev); > + > ipoib_ib_dev_cleanup(dev); > > kfree(priv->rx_ring); > @@ -1324,8 +1336,6 @@ void ipoib_dev_cleanup(struct net_device *dev) > > priv->rx_ring = NULL; > priv->tx_ring = NULL; > - > - ipoib_neigh_hash_uninit(dev); > } > > static const struct header_ops ipoib_header_ops = { -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 6/8] IPoIB: Use dedicated workqueues per interface
  2014-08-12 23:38 ` Doug Ledford
  9 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
Cc: Doug Ledford

During my recent work on the rtnl lock deadlock in the IPoIB driver, I
saw that even once I fixed the apparent races for a single device, as
soon as that device had any children, new races popped up.  It turns
out that this is because, no matter how well we protect against races
on a single device, all devices use the same workqueue and
flush_workqueue() flushes *everything* from that workqueue.  So we can
have one device in the middle of a down and holding the rtnl lock, and
another totally unrelated device needing to run mcast_restart_task,
which wants the rtnl lock and will loop trying to take it unless it
sees its own FLAG_ADMIN_UP flag go away.  Because the unrelated
interface will never see its own ADMIN_UP flag drop, the interface
going down will deadlock trying to flush the queue.

There are several possible solutions to this problem:

Make carrier_on_task and mcast_restart_task try to take the rtnl for
some set period of time and if they fail, then bail.  This runs the
real risk of dropping work on the floor, which can end up being its own
separate kind of deadlock.

Set some global flag in the driver that says some device is in the
middle of going down, letting all tasks know to bail.  Again, this can
drop work on the floor.
I suppose if our own ADMIN_UP flag doesn't go away, then maybe after a few tries on the rtnl lock we can queue our own task back up as a delayed work and return and avoid dropping work on the floor that way. But I'm not 100% convinced that we won't cause other problems. Or the method this patch attempts to use, which is when we bring an interface up, create a workqueue specifically for that interface, so that when we take it back down, we are flushing only those tasks associated with our interface. In addition, keep the global workqueue, but now limit it to only flush tasks. In this way, the flush tasks can always flush the device specific work queues without having deadlock issues. Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- drivers/infiniband/ulp/ipoib/ipoib.h | 1 + drivers/infiniband/ulp/ipoib/ipoib_cm.c | 18 +++++++++--------- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 6 +++--- drivers/infiniband/ulp/ipoib/ipoib_main.c | 19 ++++++++++++------- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 26 ++++++++++++-------------- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 22 +++++++++++++++++++++- 6 files changed, 58 insertions(+), 34 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 43840971ad8..7bf7339eaef 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -317,6 +317,7 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; + struct workqueue_struct *wq; struct delayed_work mcast_task; struct work_struct carrier_on_task; struct work_struct flush_light; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 933efcea0d0..56959adb6c7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -474,7 +474,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even } 
spin_lock_irq(&priv->lock); - queue_delayed_work(ipoib_workqueue, + queue_delayed_work(priv->wq, &priv->cm.stale_task, IPOIB_CM_RX_DELAY); /* Add this entry to passive ids list head, but do not re-add it * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */ @@ -576,7 +576,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) spin_lock_irqsave(&priv->lock, flags); list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list); ipoib_cm_start_rx_drain(priv); - queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); + queue_work(priv->wq, &priv->cm.rx_reap_task); spin_unlock_irqrestore(&priv->lock, flags); } else ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", @@ -603,7 +603,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) spin_lock_irqsave(&priv->lock, flags); list_move(&p->list, &priv->cm.rx_reap_list); spin_unlock_irqrestore(&priv->lock, flags); - queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); + queue_work(priv->wq, &priv->cm.rx_reap_task); } return; } @@ -827,7 +827,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { list_move(&tx->list, &priv->cm.reap_list); - queue_work(ipoib_workqueue, &priv->cm.reap_task); + queue_work(priv->wq, &priv->cm.reap_task); } clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags); @@ -1255,7 +1255,7 @@ static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { list_move(&tx->list, &priv->cm.reap_list); - queue_work(ipoib_workqueue, &priv->cm.reap_task); + queue_work(priv->wq, &priv->cm.reap_task); } spin_unlock_irqrestore(&priv->lock, flags); @@ -1284,7 +1284,7 @@ struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path tx->dev = dev; list_add(&tx->list, &priv->cm.start_list); set_bit(IPOIB_FLAG_INITIALIZED, &tx->flags); - queue_work(ipoib_workqueue, &priv->cm.start_task); + queue_work(priv->wq, 
&priv->cm.start_task); return tx; } @@ -1295,7 +1295,7 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx) if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { spin_lock_irqsave(&priv->lock, flags); list_move(&tx->list, &priv->cm.reap_list); - queue_work(ipoib_workqueue, &priv->cm.reap_task); + queue_work(priv->wq, &priv->cm.reap_task); ipoib_dbg(priv, "Reap connection for gid %pI6\n", tx->neigh->daddr + 4); tx->neigh = NULL; @@ -1417,7 +1417,7 @@ void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb, skb_queue_tail(&priv->cm.skb_queue, skb); if (e) - queue_work(ipoib_workqueue, &priv->cm.skb_task); + queue_work(priv->wq, &priv->cm.skb_task); } static void ipoib_cm_rx_reap(struct work_struct *work) @@ -1450,7 +1450,7 @@ static void ipoib_cm_stale_task(struct work_struct *work) } if (!list_empty(&priv->cm.passive_ids)) - queue_delayed_work(ipoib_workqueue, + queue_delayed_work(priv->wq, &priv->cm.stale_task, IPOIB_CM_RX_DELAY); spin_unlock_irq(&priv->lock); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 72626c34817..bfd17d41b5f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -655,7 +655,7 @@ void ipoib_reap_ah(struct work_struct *work) __ipoib_reap_ah(dev); if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) - queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, + queue_delayed_work(priv->wq, &priv->ah_reap_task, round_jiffies_relative(HZ)); } @@ -696,7 +696,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush) } clear_bit(IPOIB_STOP_REAPER, &priv->flags); - queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, + queue_delayed_work(priv->wq, &priv->ah_reap_task, round_jiffies_relative(HZ)); if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) @@ -881,7 +881,7 @@ timeout: set_bit(IPOIB_STOP_REAPER, &priv->flags); cancel_delayed_work(&priv->ah_reap_task); if (flush) - flush_workqueue(ipoib_workqueue); + 
flush_workqueue(priv->wq); begin = jiffies; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 949948a46d4..255c8296566 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -839,7 +839,7 @@ static void ipoib_set_mcast_list(struct net_device *dev) return; } - queue_work(ipoib_workqueue, &priv->restart_task); + queue_work(priv->wq, &priv->restart_task); } static u32 ipoib_addr_hash(struct ipoib_neigh_hash *htbl, u8 *daddr) @@ -954,7 +954,7 @@ static void ipoib_reap_neigh(struct work_struct *work) __ipoib_reap_neigh(priv); if (!test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags)) - queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task, + queue_delayed_work(priv->wq, &priv->neigh_reap_task, arp_tbl.gc_interval); } @@ -1133,7 +1133,7 @@ static int ipoib_neigh_hash_init(struct ipoib_dev_priv *priv) /* start garbage collection */ clear_bit(IPOIB_STOP_NEIGH_GC, &priv->flags); - queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task, + queue_delayed_work(priv->wq, &priv->neigh_reap_task, arp_tbl.gc_interval); return 0; @@ -1293,7 +1293,7 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) return 0; out_dev_uninit: - ipoib_ib_dev_cleanup(); + ipoib_ib_dev_cleanup(dev); out_tx_ring_cleanup: vfree(priv->tx_ring); @@ -1646,7 +1646,7 @@ register_failed: /* Stop GC if started before flush */ set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags); cancel_delayed_work(&priv->neigh_reap_task); - flush_workqueue(ipoib_workqueue); + flush_workqueue(priv->wq); event_failed: ipoib_dev_cleanup(priv->dev); @@ -1717,7 +1717,7 @@ static void ipoib_remove_one(struct ib_device *device) /* Stop GC */ set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags); cancel_delayed_work(&priv->neigh_reap_task); - flush_workqueue(ipoib_workqueue); + flush_workqueue(priv->wq); unregister_netdev(priv->dev); free_netdev(priv->dev); @@ -1758,8 +1758,13 @@ static int __init ipoib_init_module(void) * 
unregister_netdev() and linkwatch_event take the rtnl lock, * so flush_scheduled_work() can deadlock during device * removal. + * + * In addition, bringing one device up and another down at the + * same time can deadlock a single workqueue, so we have this + * global fallback workqueue, but we also attempt to open a + * per device workqueue each time we bring an interface up */ - ipoib_workqueue = create_singlethread_workqueue("ipoib"); + ipoib_workqueue = create_singlethread_workqueue("ipoib_flush"); if (!ipoib_workqueue) { ret = -ENOMEM; goto err_fs; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 19e3fe75ebf..969ef420518 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -388,7 +388,7 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work) * the workqueue while holding the rtnl lock, so loop * on trylock until either we get the lock or we see * FLAG_ADMIN_UP go away as that signals that we are bailing - * and can safely ignore the carrier on work + * and can safely ignore the carrier on work. */ while (!rtnl_trylock()) { if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) @@ -432,15 +432,14 @@ static int ipoib_mcast_join_complete(int status, if (!status) { mcast->backoff = 1; if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) - queue_delayed_work(ipoib_workqueue, - &priv->mcast_task, 0); + queue_delayed_work(priv->wq, &priv->mcast_task, 0); /* - * Defer carrier on work to ipoib_workqueue to avoid a + * Defer carrier on work to priv->wq to avoid a * deadlock on rtnl_lock here. 
*/ if (mcast == priv->broadcast) - queue_work(ipoib_workqueue, &priv->carrier_on_task); + queue_work(priv->wq, &priv->carrier_on_task); } else { if (mcast->logcount++ < 20) { if (status == -ETIMEDOUT || status == -EAGAIN) { @@ -465,7 +464,7 @@ out: if (status == -ENETRESET) status = 0; if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags)) - queue_delayed_work(ipoib_workqueue, &priv->mcast_task, + queue_delayed_work(priv->wq, &priv->mcast_task, mcast->backoff * HZ); spin_unlock_irq(&priv->lock); mutex_unlock(&mcast_mutex); @@ -535,8 +534,7 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) - queue_delayed_work(ipoib_workqueue, - &priv->mcast_task, + queue_delayed_work(priv->wq, &priv->mcast_task, mcast->backoff * HZ); } mutex_unlock(&mcast_mutex); @@ -584,8 +582,8 @@ void ipoib_mcast_join_task(struct work_struct *work) ipoib_warn(priv, "failed to allocate broadcast group\n"); mutex_lock(&mcast_mutex); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) - queue_delayed_work(ipoib_workqueue, - &priv->mcast_task, HZ); + queue_delayed_work(priv->wq, &priv->mcast_task, + HZ); mutex_unlock(&mcast_mutex); return; } @@ -652,7 +650,7 @@ int ipoib_mcast_start_thread(struct net_device *dev) mutex_lock(&mcast_mutex); if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) - queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0); + queue_delayed_work(priv->wq, &priv->mcast_task, 0); mutex_unlock(&mcast_mutex); return 0; @@ -670,7 +668,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int flush) mutex_unlock(&mcast_mutex); if (flush) - flush_workqueue(ipoib_workqueue); + flush_workqueue(priv->wq); return 0; } @@ -737,7 +735,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb) __ipoib_mcast_add(dev, mcast); list_add_tail(&mcast->list, &priv->multicast_list); if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) - 
queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0); + queue_delayed_work(priv->wq, &priv->mcast_task, 0); } if (!mcast->ah) { @@ -952,7 +950,7 @@ void ipoib_mcast_restart_task(struct work_struct *work) * completes. So do like the carrier on task and attempt to * take the rtnl lock, but if we can't before the ADMIN_UP flag * goes away, then just return and know that the remove list will - * get flushed later by mcast_dev_flush. + * get flushed later by mcast_stop_thread. */ while (!rtnl_trylock()) { if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index c56d5d44c53..b72a753eb41 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -145,10 +145,20 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) int ret, size; int i; + /* + * the various IPoIB tasks assume they will never race against + * themselves, so always use a single thread workqueue + */ + priv->wq = create_singlethread_workqueue("ipoib_wq"); + if (!priv->wq) { + printk(KERN_WARNING "ipoib: failed to allocate device WQ\n"); + return -ENODEV; + } + priv->pd = ib_alloc_pd(priv->ca); if (IS_ERR(priv->pd)) { printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name); - return -ENODEV; + goto out_free_wq; } priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE); @@ -242,6 +252,10 @@ out_free_mr: out_free_pd: ib_dealloc_pd(priv->pd); + +out_free_wq: + destroy_workqueue(priv->wq); + priv->wq = NULL; return -ENODEV; } @@ -270,6 +284,12 @@ void ipoib_transport_dev_cleanup(struct net_device *dev) if (ib_dealloc_pd(priv->pd)) ipoib_warn(priv, "ib_dealloc_pd failed\n"); + + if (priv->wq) { + flush_workqueue(priv->wq); + destroy_workqueue(priv->wq); + priv->wq = NULL; + } } void ipoib_event(struct ib_event_handler *handler, -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message 
to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 37+ messages in thread
[parent not found: <f7af9c251d722675a549e4a673f46c0f31dfa266.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* RE: [PATCH 6/8] IPoIB: Use dedicated workqueues per interface [not found] ` <f7af9c251d722675a549e4a673f46c0f31dfa266.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2014-08-20 15:01 ` Estrin, Alex 2014-09-04 6:49 ` Erez Shitrit 1 sibling, 0 replies; 37+ messages in thread From: Estrin, Alex @ 2014-08-20 15:01 UTC (permalink / raw) To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A > -----Original Message----- > From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On > Behalf Of Doug Ledford > Sent: Tuesday, August 12, 2014 7:38 PM > To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org > Cc: Doug Ledford > Subject: [PATCH 6/8] IPoIB: Use dedicated workqueues per interface > > During my recent work on the rtnl lock deadlock in the IPoIB driver, I > saw that even once I fixed the apparent races for a single device, as > soon as that device had any children, new races popped up. It turns out > that this is because no matter how well we protect against races on a > single device, the fact that all devices use the same workqueue, and > flush_workqueue() flushes *everything* from that workqueue, we can have > one device in the middle of a down and holding the rtnl lock and another > totally unrelated device needing to run mcast_restart_task, which wants > the rtnl lock and will loop trying to take it unless is sees its own > FLAG_ADMIN_UP flag go away. Because the unrelated interface will never > see its own ADMIN_UP flag drop, the interface going down will deadlock > trying to flush the queue. There are several possible solutions to this > problem: > > Make carrier_on_task and mcast_restart_task try to take the rtnl for > some set period of time and if they fail, then bail. This runs the real > risk of dropping work on the floor, which can end up being its own > separate kind of deadlock. 
> > Set some global flag in the driver that says some device is in the > middle of going down, letting all tasks know to bail. Again, this can > drop work on the floor. I suppose if our own ADMIN_UP flag doesn't go > away, then maybe after a few tries on the rtnl lock we can queue our own > task back up as a delayed work and return and avoid dropping work on the > floor that way. But I'm not 100% convinced that we won't cause other > problems. > > Or the method this patch attempts to use, which is when we bring an > interface up, create a workqueue specifically for that interface, so > that when we take it back down, we are flushing only those tasks > associated with our interface. In addition, keep the global workqueue, > but now limit it to only flush tasks. In this way, the flush tasks can > always flush the device specific work queues without having deadlock > issues. > > Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Workqueues per interface is a great idea. Thanks! 
Acked-by: Alex Estrin <alex.estrin-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> > --- > drivers/infiniband/ulp/ipoib/ipoib.h | 1 + > drivers/infiniband/ulp/ipoib/ipoib_cm.c | 18 +++++++++--------- > drivers/infiniband/ulp/ipoib/ipoib_ib.c | 6 +++--- > drivers/infiniband/ulp/ipoib/ipoib_main.c | 19 ++++++++++++------- > drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 26 ++++++++++++-------------- > drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 22 +++++++++++++++++++++- > 6 files changed, 58 insertions(+), 34 deletions(-) > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h > b/drivers/infiniband/ulp/ipoib/ipoib.h > index 43840971ad8..7bf7339eaef 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib.h > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h > @@ -317,6 +317,7 @@ struct ipoib_dev_priv { > struct list_head multicast_list; > struct rb_root multicast_tree; > > + struct workqueue_struct *wq; > struct delayed_work mcast_task; > struct work_struct carrier_on_task; > struct work_struct flush_light; > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c > b/drivers/infiniband/ulp/ipoib/ipoib_cm.c > index 933efcea0d0..56959adb6c7 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c > @@ -474,7 +474,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct > ib_cm_event *even > } > > spin_lock_irq(&priv->lock); > - queue_delayed_work(ipoib_workqueue, > + queue_delayed_work(priv->wq, > &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > /* Add this entry to passive ids list head, but do not re-add it > * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. 
*/ > @@ -576,7 +576,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc > *wc) > spin_lock_irqsave(&priv->lock, flags); > list_splice_init(&priv->cm.rx_drain_list, &priv- > >cm.rx_reap_list); > ipoib_cm_start_rx_drain(priv); > - queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); > + queue_work(priv->wq, &priv->cm.rx_reap_task); > spin_unlock_irqrestore(&priv->lock, flags); > } else > ipoib_warn(priv, "cm recv completion event with wrid %d (> > %d)\n", > @@ -603,7 +603,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc > *wc) > spin_lock_irqsave(&priv->lock, flags); > list_move(&p->list, &priv->cm.rx_reap_list); > spin_unlock_irqrestore(&priv->lock, flags); > - queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); > + queue_work(priv->wq, &priv->cm.rx_reap_task); > } > return; > } > @@ -827,7 +827,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc > *wc) > > if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { > list_move(&tx->list, &priv->cm.reap_list); > - queue_work(ipoib_workqueue, &priv->cm.reap_task); > + queue_work(priv->wq, &priv->cm.reap_task); > } > > clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags); > @@ -1255,7 +1255,7 @@ static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, > > if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { > list_move(&tx->list, &priv->cm.reap_list); > - queue_work(ipoib_workqueue, &priv->cm.reap_task); > + queue_work(priv->wq, &priv->cm.reap_task); > } > > spin_unlock_irqrestore(&priv->lock, flags); > @@ -1284,7 +1284,7 @@ struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device > *dev, struct ipoib_path > tx->dev = dev; > list_add(&tx->list, &priv->cm.start_list); > set_bit(IPOIB_FLAG_INITIALIZED, &tx->flags); > - queue_work(ipoib_workqueue, &priv->cm.start_task); > + queue_work(priv->wq, &priv->cm.start_task); > return tx; > } > > @@ -1295,7 +1295,7 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx) > if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) 
{ > spin_lock_irqsave(&priv->lock, flags); > list_move(&tx->list, &priv->cm.reap_list); > - queue_work(ipoib_workqueue, &priv->cm.reap_task); > + queue_work(priv->wq, &priv->cm.reap_task); > ipoib_dbg(priv, "Reap connection for gid %pI6\n", > tx->neigh->daddr + 4); > tx->neigh = NULL; > @@ -1417,7 +1417,7 @@ void ipoib_cm_skb_too_long(struct net_device *dev, struct > sk_buff *skb, > > skb_queue_tail(&priv->cm.skb_queue, skb); > if (e) > - queue_work(ipoib_workqueue, &priv->cm.skb_task); > + queue_work(priv->wq, &priv->cm.skb_task); > } > > static void ipoib_cm_rx_reap(struct work_struct *work) > @@ -1450,7 +1450,7 @@ static void ipoib_cm_stale_task(struct work_struct *work) > } > > if (!list_empty(&priv->cm.passive_ids)) > - queue_delayed_work(ipoib_workqueue, > + queue_delayed_work(priv->wq, > &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > spin_unlock_irq(&priv->lock); > } > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > index 72626c34817..bfd17d41b5f 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > @@ -655,7 +655,7 @@ void ipoib_reap_ah(struct work_struct *work) > __ipoib_reap_ah(dev); > > if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, > + queue_delayed_work(priv->wq, &priv->ah_reap_task, > round_jiffies_relative(HZ)); > } > > @@ -696,7 +696,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush) > } > > clear_bit(IPOIB_STOP_REAPER, &priv->flags); > - queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, > + queue_delayed_work(priv->wq, &priv->ah_reap_task, > round_jiffies_relative(HZ)); > > if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) > @@ -881,7 +881,7 @@ timeout: > set_bit(IPOIB_STOP_REAPER, &priv->flags); > cancel_delayed_work(&priv->ah_reap_task); > if (flush) > - flush_workqueue(ipoib_workqueue); > + flush_workqueue(priv->wq); > > begin = jiffies; > > diff --git 
a/drivers/infiniband/ulp/ipoib/ipoib_main.c > b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 949948a46d4..255c8296566 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -839,7 +839,7 @@ static void ipoib_set_mcast_list(struct net_device *dev) > return; > } > > - queue_work(ipoib_workqueue, &priv->restart_task); > + queue_work(priv->wq, &priv->restart_task); > } > > static u32 ipoib_addr_hash(struct ipoib_neigh_hash *htbl, u8 *daddr) > @@ -954,7 +954,7 @@ static void ipoib_reap_neigh(struct work_struct *work) > __ipoib_reap_neigh(priv); > > if (!test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task, > + queue_delayed_work(priv->wq, &priv->neigh_reap_task, > arp_tbl.gc_interval); > } > > @@ -1133,7 +1133,7 @@ static int ipoib_neigh_hash_init(struct ipoib_dev_priv *priv) > > /* start garbage collection */ > clear_bit(IPOIB_STOP_NEIGH_GC, &priv->flags); > - queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task, > + queue_delayed_work(priv->wq, &priv->neigh_reap_task, > arp_tbl.gc_interval); > > return 0; > @@ -1293,7 +1293,7 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device > *ca, int port) > return 0; > > out_dev_uninit: > - ipoib_ib_dev_cleanup(); > + ipoib_ib_dev_cleanup(dev); > > out_tx_ring_cleanup: > vfree(priv->tx_ring); > @@ -1646,7 +1646,7 @@ register_failed: > /* Stop GC if started before flush */ > set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags); > cancel_delayed_work(&priv->neigh_reap_task); > - flush_workqueue(ipoib_workqueue); > + flush_workqueue(priv->wq); > > event_failed: > ipoib_dev_cleanup(priv->dev); > @@ -1717,7 +1717,7 @@ static void ipoib_remove_one(struct ib_device *device) > /* Stop GC */ > set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags); > cancel_delayed_work(&priv->neigh_reap_task); > - flush_workqueue(ipoib_workqueue); > + flush_workqueue(priv->wq); > > unregister_netdev(priv->dev); > free_netdev(priv->dev); > @@ 
-1758,8 +1758,13 @@ static int __init ipoib_init_module(void) > * unregister_netdev() and linkwatch_event take the rtnl lock, > * so flush_scheduled_work() can deadlock during device > * removal. > + * > + * In addition, bringing one device up and another down at the > + * same time can deadlock a single workqueue, so we have this > + * global fallback workqueue, but we also attempt to open a > + * per device workqueue each time we bring an interface up > */ > - ipoib_workqueue = create_singlethread_workqueue("ipoib"); > + ipoib_workqueue = create_singlethread_workqueue("ipoib_flush"); > if (!ipoib_workqueue) { > ret = -ENOMEM; > goto err_fs; > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > index 19e3fe75ebf..969ef420518 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > @@ -388,7 +388,7 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work) > * the workqueue while holding the rtnl lock, so loop > * on trylock until either we get the lock or we see > * FLAG_ADMIN_UP go away as that signals that we are bailing > - * and can safely ignore the carrier on work > + * and can safely ignore the carrier on work. > */ > while (!rtnl_trylock()) { > if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) > @@ -432,15 +432,14 @@ static int ipoib_mcast_join_complete(int status, > if (!status) { > mcast->backoff = 1; > if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, > - &priv->mcast_task, 0); > + queue_delayed_work(priv->wq, &priv->mcast_task, 0); > > /* > - * Defer carrier on work to ipoib_workqueue to avoid a > + * Defer carrier on work to priv->wq to avoid a > * deadlock on rtnl_lock here. 
> */ > if (mcast == priv->broadcast) > - queue_work(ipoib_workqueue, &priv->carrier_on_task); > + queue_work(priv->wq, &priv->carrier_on_task); > } else { > if (mcast->logcount++ < 20) { > if (status == -ETIMEDOUT || status == -EAGAIN) { > @@ -465,7 +464,7 @@ out: > if (status == -ENETRESET) > status = 0; > if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, &priv->mcast_task, > + queue_delayed_work(priv->wq, &priv->mcast_task, > mcast->backoff * HZ); > spin_unlock_irq(&priv->lock); > mutex_unlock(&mcast_mutex); > @@ -535,8 +534,7 @@ static void ipoib_mcast_join(struct net_device *dev, struct > ipoib_mcast *mcast, > mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; > > if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, > - &priv->mcast_task, > + queue_delayed_work(priv->wq, &priv->mcast_task, > mcast->backoff * HZ); > } > mutex_unlock(&mcast_mutex); > @@ -584,8 +582,8 @@ void ipoib_mcast_join_task(struct work_struct *work) > ipoib_warn(priv, "failed to allocate broadcast group\n"); > mutex_lock(&mcast_mutex); > if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, > - &priv->mcast_task, HZ); > + queue_delayed_work(priv->wq, &priv->mcast_task, > + HZ); > mutex_unlock(&mcast_mutex); > return; > } > @@ -652,7 +650,7 @@ int ipoib_mcast_start_thread(struct net_device *dev) > > mutex_lock(&mcast_mutex); > if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0); > + queue_delayed_work(priv->wq, &priv->mcast_task, 0); > mutex_unlock(&mcast_mutex); > > return 0; > @@ -670,7 +668,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int flush) > mutex_unlock(&mcast_mutex); > > if (flush) > - flush_workqueue(ipoib_workqueue); > + flush_workqueue(priv->wq); > > return 0; > } > @@ -737,7 +735,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct > sk_buff *skb) > __ipoib_mcast_add(dev, mcast); > 
list_add_tail(&mcast->list, &priv->multicast_list); > if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0); > + queue_delayed_work(priv->wq, &priv->mcast_task, 0); > } > > if (!mcast->ah) { > @@ -952,7 +950,7 @@ void ipoib_mcast_restart_task(struct work_struct *work) > * completes. So do like the carrier on task and attempt to > * take the rtnl lock, but if we can't before the ADMIN_UP flag > * goes away, then just return and know that the remove list will > - * get flushed later by mcast_dev_flush. > + * get flushed later by mcast_stop_thread. > */ > while (!rtnl_trylock()) { > if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > index c56d5d44c53..b72a753eb41 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > @@ -145,10 +145,20 @@ int ipoib_transport_dev_init(struct net_device *dev, struct > ib_device *ca) > int ret, size; > int i; > > + /* > + * the various IPoIB tasks assume they will never race against > + * themselves, so always use a single thread workqueue > + */ > + priv->wq = create_singlethread_workqueue("ipoib_wq"); > + if (!priv->wq) { > + printk(KERN_WARNING "ipoib: failed to allocate device WQ\n"); > + return -ENODEV; > + } > + > priv->pd = ib_alloc_pd(priv->ca); > if (IS_ERR(priv->pd)) { > printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name); > - return -ENODEV; > + goto out_free_wq; > } > > priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE); > @@ -242,6 +252,10 @@ out_free_mr: > > out_free_pd: > ib_dealloc_pd(priv->pd); > + > +out_free_wq: > + destroy_workqueue(priv->wq); > + priv->wq = NULL; > return -ENODEV; > } > > @@ -270,6 +284,12 @@ void ipoib_transport_dev_cleanup(struct net_device *dev) > > if (ib_dealloc_pd(priv->pd)) > ipoib_warn(priv, "ib_dealloc_pd failed\n"); > + > + if (priv->wq) { > + 
flush_workqueue(priv->wq); > + destroy_workqueue(priv->wq); > + priv->wq = NULL; > + } > } > > void ipoib_event(struct ib_event_handler *handler, > -- > 1.9.3 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 6/8] IPoIB: Use dedicated workqueues per interface [not found] ` <f7af9c251d722675a549e4a673f46c0f31dfa266.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2014-08-20 15:01 ` Estrin, Alex @ 2014-09-04 6:49 ` Erez Shitrit [not found] ` <54080B6B.8050707-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> 1 sibling, 1 reply; 37+ messages in thread From: Erez Shitrit @ 2014-09-04 6:49 UTC (permalink / raw) To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

Hi Doug,

The idea that the "big" ipoib workqueue is only for the IB_EVENT_XXX events looks good to me. One comment here: I don't see where the driver flushes it at all. IMHO, when the driver goes down you should cancel all pending tasks on that global wq prior to the call to destroy_workqueue. The current logic calls destroy_workqueue after remove_one; I think you should cancel the pending work after the ib_unregister_event_handler call in the ipoib_remove_one function, otherwise any tasks still pending in that queue will run after the dev/priv are gone.

Thanks, Erez

> During my recent work on the rtnl lock deadlock in the IPoIB driver, I > saw that even once I fixed the apparent races for a single device, as > soon as that device had any children, new races popped up. It turns out > that this is because no matter how well we protect against races on a > single device, the fact that all devices use the same workqueue, and > flush_workqueue() flushes *everything* from that workqueue, we can have > one device in the middle of a down and holding the rtnl lock and another > totally unrelated device needing to run mcast_restart_task, which wants > the rtnl lock and will loop trying to take it unless is sees its own > FLAG_ADMIN_UP flag go away. Because the unrelated interface will never > see its own ADMIN_UP flag drop, the interface going down will deadlock > trying to flush the queue. 
There are several possible solutions to this > problem: > > Make carrier_on_task and mcast_restart_task try to take the rtnl for > some set period of time and if they fail, then bail. This runs the real > risk of dropping work on the floor, which can end up being its own > separate kind of deadlock. > > Set some global flag in the driver that says some device is in the > middle of going down, letting all tasks know to bail. Again, this can > drop work on the floor. I suppose if our own ADMIN_UP flag doesn't go > away, then maybe after a few tries on the rtnl lock we can queue our own > task back up as a delayed work and return and avoid dropping work on the > floor that way. But I'm not 100% convinced that we won't cause other > problems. > > Or the method this patch attempts to use, which is when we bring an > interface up, create a workqueue specifically for that interface, so > that when we take it back down, we are flushing only those tasks > associated with our interface. In addition, keep the global workqueue, > but now limit it to only flush tasks. In this way, the flush tasks can > always flush the device specific work queues without having deadlock > issues. 
> > Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > --- > drivers/infiniband/ulp/ipoib/ipoib.h | 1 + > drivers/infiniband/ulp/ipoib/ipoib_cm.c | 18 +++++++++--------- > drivers/infiniband/ulp/ipoib/ipoib_ib.c | 6 +++--- > drivers/infiniband/ulp/ipoib/ipoib_main.c | 19 ++++++++++++------- > drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 26 ++++++++++++-------------- > drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 22 +++++++++++++++++++++- > 6 files changed, 58 insertions(+), 34 deletions(-) > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h > index 43840971ad8..7bf7339eaef 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib.h > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h > @@ -317,6 +317,7 @@ struct ipoib_dev_priv { > struct list_head multicast_list; > struct rb_root multicast_tree; > > + struct workqueue_struct *wq; > struct delayed_work mcast_task; > struct work_struct carrier_on_task; > struct work_struct flush_light; > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c > index 933efcea0d0..56959adb6c7 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c > @@ -474,7 +474,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even > } > > spin_lock_irq(&priv->lock); > - queue_delayed_work(ipoib_workqueue, > + queue_delayed_work(priv->wq, > &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > /* Add this entry to passive ids list head, but do not re-add it > * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. 
*/ > @@ -576,7 +576,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) > spin_lock_irqsave(&priv->lock, flags); > list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list); > ipoib_cm_start_rx_drain(priv); > - queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); > + queue_work(priv->wq, &priv->cm.rx_reap_task); > spin_unlock_irqrestore(&priv->lock, flags); > } else > ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", > @@ -603,7 +603,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) > spin_lock_irqsave(&priv->lock, flags); > list_move(&p->list, &priv->cm.rx_reap_list); > spin_unlock_irqrestore(&priv->lock, flags); > - queue_work(ipoib_workqueue, &priv->cm.rx_reap_task); > + queue_work(priv->wq, &priv->cm.rx_reap_task); > } > return; > } > @@ -827,7 +827,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) > > if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { > list_move(&tx->list, &priv->cm.reap_list); > - queue_work(ipoib_workqueue, &priv->cm.reap_task); > + queue_work(priv->wq, &priv->cm.reap_task); > } > > clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags); > @@ -1255,7 +1255,7 @@ static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, > > if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { > list_move(&tx->list, &priv->cm.reap_list); > - queue_work(ipoib_workqueue, &priv->cm.reap_task); > + queue_work(priv->wq, &priv->cm.reap_task); > } > > spin_unlock_irqrestore(&priv->lock, flags); > @@ -1284,7 +1284,7 @@ struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path > tx->dev = dev; > list_add(&tx->list, &priv->cm.start_list); > set_bit(IPOIB_FLAG_INITIALIZED, &tx->flags); > - queue_work(ipoib_workqueue, &priv->cm.start_task); > + queue_work(priv->wq, &priv->cm.start_task); > return tx; > } > > @@ -1295,7 +1295,7 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx) > if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { > 
spin_lock_irqsave(&priv->lock, flags); > list_move(&tx->list, &priv->cm.reap_list); > - queue_work(ipoib_workqueue, &priv->cm.reap_task); > + queue_work(priv->wq, &priv->cm.reap_task); > ipoib_dbg(priv, "Reap connection for gid %pI6\n", > tx->neigh->daddr + 4); > tx->neigh = NULL; > @@ -1417,7 +1417,7 @@ void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb, > > skb_queue_tail(&priv->cm.skb_queue, skb); > if (e) > - queue_work(ipoib_workqueue, &priv->cm.skb_task); > + queue_work(priv->wq, &priv->cm.skb_task); > } > > static void ipoib_cm_rx_reap(struct work_struct *work) > @@ -1450,7 +1450,7 @@ static void ipoib_cm_stale_task(struct work_struct *work) > } > > if (!list_empty(&priv->cm.passive_ids)) > - queue_delayed_work(ipoib_workqueue, > + queue_delayed_work(priv->wq, > &priv->cm.stale_task, IPOIB_CM_RX_DELAY); > spin_unlock_irq(&priv->lock); > } > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > index 72626c34817..bfd17d41b5f 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > @@ -655,7 +655,7 @@ void ipoib_reap_ah(struct work_struct *work) > __ipoib_reap_ah(dev); > > if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, > + queue_delayed_work(priv->wq, &priv->ah_reap_task, > round_jiffies_relative(HZ)); > } > > @@ -696,7 +696,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush) > } > > clear_bit(IPOIB_STOP_REAPER, &priv->flags); > - queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, > + queue_delayed_work(priv->wq, &priv->ah_reap_task, > round_jiffies_relative(HZ)); > > if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) > @@ -881,7 +881,7 @@ timeout: > set_bit(IPOIB_STOP_REAPER, &priv->flags); > cancel_delayed_work(&priv->ah_reap_task); > if (flush) > - flush_workqueue(ipoib_workqueue); > + flush_workqueue(priv->wq); > > begin = jiffies; > > diff --git 
a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 949948a46d4..255c8296566 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -839,7 +839,7 @@ static void ipoib_set_mcast_list(struct net_device *dev) > return; > } > > - queue_work(ipoib_workqueue, &priv->restart_task); > + queue_work(priv->wq, &priv->restart_task); > } > > static u32 ipoib_addr_hash(struct ipoib_neigh_hash *htbl, u8 *daddr) > @@ -954,7 +954,7 @@ static void ipoib_reap_neigh(struct work_struct *work) > __ipoib_reap_neigh(priv); > > if (!test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task, > + queue_delayed_work(priv->wq, &priv->neigh_reap_task, > arp_tbl.gc_interval); > } > > @@ -1133,7 +1133,7 @@ static int ipoib_neigh_hash_init(struct ipoib_dev_priv *priv) > > /* start garbage collection */ > clear_bit(IPOIB_STOP_NEIGH_GC, &priv->flags); > - queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task, > + queue_delayed_work(priv->wq, &priv->neigh_reap_task, > arp_tbl.gc_interval); > > return 0; > @@ -1293,7 +1293,7 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) > return 0; > > out_dev_uninit: > - ipoib_ib_dev_cleanup(); > + ipoib_ib_dev_cleanup(dev); > > out_tx_ring_cleanup: > vfree(priv->tx_ring); > @@ -1646,7 +1646,7 @@ register_failed: > /* Stop GC if started before flush */ > set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags); > cancel_delayed_work(&priv->neigh_reap_task); > - flush_workqueue(ipoib_workqueue); > + flush_workqueue(priv->wq); > > event_failed: > ipoib_dev_cleanup(priv->dev); > @@ -1717,7 +1717,7 @@ static void ipoib_remove_one(struct ib_device *device) > /* Stop GC */ > set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags); > cancel_delayed_work(&priv->neigh_reap_task); > - flush_workqueue(ipoib_workqueue); > + flush_workqueue(priv->wq); > > unregister_netdev(priv->dev); > free_netdev(priv->dev); > @@ 
-1758,8 +1758,13 @@ static int __init ipoib_init_module(void) > * unregister_netdev() and linkwatch_event take the rtnl lock, > * so flush_scheduled_work() can deadlock during device > * removal. > + * > + * In addition, bringing one device up and another down at the > + * same time can deadlock a single workqueue, so we have this > + * global fallback workqueue, but we also attempt to open a > + * per device workqueue each time we bring an interface up > */ > - ipoib_workqueue = create_singlethread_workqueue("ipoib"); > + ipoib_workqueue = create_singlethread_workqueue("ipoib_flush"); > if (!ipoib_workqueue) { > ret = -ENOMEM; > goto err_fs; > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > index 19e3fe75ebf..969ef420518 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > @@ -388,7 +388,7 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work) > * the workqueue while holding the rtnl lock, so loop > * on trylock until either we get the lock or we see > * FLAG_ADMIN_UP go away as that signals that we are bailing > - * and can safely ignore the carrier on work > + * and can safely ignore the carrier on work. > */ > while (!rtnl_trylock()) { > if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) > @@ -432,15 +432,14 @@ static int ipoib_mcast_join_complete(int status, > if (!status) { > mcast->backoff = 1; > if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, > - &priv->mcast_task, 0); > + queue_delayed_work(priv->wq, &priv->mcast_task, 0); > > /* > - * Defer carrier on work to ipoib_workqueue to avoid a > + * Defer carrier on work to priv->wq to avoid a > * deadlock on rtnl_lock here. 
> */ > if (mcast == priv->broadcast) > - queue_work(ipoib_workqueue, &priv->carrier_on_task); > + queue_work(priv->wq, &priv->carrier_on_task); > } else { > if (mcast->logcount++ < 20) { > if (status == -ETIMEDOUT || status == -EAGAIN) { > @@ -465,7 +464,7 @@ out: > if (status == -ENETRESET) > status = 0; > if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, &priv->mcast_task, > + queue_delayed_work(priv->wq, &priv->mcast_task, > mcast->backoff * HZ); > spin_unlock_irq(&priv->lock); > mutex_unlock(&mcast_mutex); > @@ -535,8 +534,7 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, > mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; > > if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, > - &priv->mcast_task, > + queue_delayed_work(priv->wq, &priv->mcast_task, > mcast->backoff * HZ); > } > mutex_unlock(&mcast_mutex); > @@ -584,8 +582,8 @@ void ipoib_mcast_join_task(struct work_struct *work) > ipoib_warn(priv, "failed to allocate broadcast group\n"); > mutex_lock(&mcast_mutex); > if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, > - &priv->mcast_task, HZ); > + queue_delayed_work(priv->wq, &priv->mcast_task, > + HZ); > mutex_unlock(&mcast_mutex); > return; > } > @@ -652,7 +650,7 @@ int ipoib_mcast_start_thread(struct net_device *dev) > > mutex_lock(&mcast_mutex); > if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0); > + queue_delayed_work(priv->wq, &priv->mcast_task, 0); > mutex_unlock(&mcast_mutex); > > return 0; > @@ -670,7 +668,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int flush) > mutex_unlock(&mcast_mutex); > > if (flush) > - flush_workqueue(ipoib_workqueue); > + flush_workqueue(priv->wq); > > return 0; > } > @@ -737,7 +735,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb) > __ipoib_mcast_add(dev, mcast); > 
list_add_tail(&mcast->list, &priv->multicast_list); > if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) > - queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0); > + queue_delayed_work(priv->wq, &priv->mcast_task, 0); > } > > if (!mcast->ah) { > @@ -952,7 +950,7 @@ void ipoib_mcast_restart_task(struct work_struct *work) > * completes. So do like the carrier on task and attempt to > * take the rtnl lock, but if we can't before the ADMIN_UP flag > * goes away, then just return and know that the remove list will > - * get flushed later by mcast_dev_flush. > + * get flushed later by mcast_stop_thread. > */ > while (!rtnl_trylock()) { > if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > index c56d5d44c53..b72a753eb41 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > @@ -145,10 +145,20 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) > int ret, size; > int i; > > + /* > + * the various IPoIB tasks assume they will never race against > + * themselves, so always use a single thread workqueue > + */ > + priv->wq = create_singlethread_workqueue("ipoib_wq"); > + if (!priv->wq) { > + printk(KERN_WARNING "ipoib: failed to allocate device WQ\n"); > + return -ENODEV; > + } > + > priv->pd = ib_alloc_pd(priv->ca); > if (IS_ERR(priv->pd)) { > printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name); > - return -ENODEV; > + goto out_free_wq; > } > > priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE); > @@ -242,6 +252,10 @@ out_free_mr: > > out_free_pd: > ib_dealloc_pd(priv->pd); > + > +out_free_wq: > + destroy_workqueue(priv->wq); > + priv->wq = NULL; > return -ENODEV; > } > > @@ -270,6 +284,12 @@ void ipoib_transport_dev_cleanup(struct net_device *dev) > > if (ib_dealloc_pd(priv->pd)) > ipoib_warn(priv, "ib_dealloc_pd failed\n"); > + > + if (priv->wq) { > + 
flush_workqueue(priv->wq); > + destroy_workqueue(priv->wq); > + priv->wq = NULL; > + } > } > > void ipoib_event(struct ib_event_handler *handler, -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
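The per-device workqueue design argued for in this patch can be modeled in miniature. The sketch below is a cooperative userspace toy, not kernel code: a "workqueue" is just a FIFO drained by its flush function, and flushing a queue from inside one of its own items (the shared-queue deadlock the patch eliminates) is detected and refused instead of hanging. All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WQ_MAX 16

/* A "workqueue" is just a FIFO of work items drained by wq_flush(). */
struct toy_wq {
    void (*items[WQ_MAX])(void *);
    void *args[WQ_MAX];
    int head, tail;
    bool draining;              /* set while wq_flush() is running items */
};

static bool wq_queue(struct toy_wq *wq, void (*fn)(void *), void *arg)
{
    if (wq->tail - wq->head >= WQ_MAX)
        return false;
    wq->items[wq->tail % WQ_MAX] = fn;
    wq->args[wq->tail % WQ_MAX] = arg;
    wq->tail++;
    return true;
}

/* Returns false instead of hanging when an item flushes its own queue. */
static bool wq_flush(struct toy_wq *wq)
{
    if (wq->draining)
        return false;           /* flush from the queue's own worker */
    wq->draining = true;
    while (wq->head != wq->tail) {
        int i = wq->head++ % WQ_MAX;
        wq->items[i](wq->args[i]);
    }
    wq->draining = false;
    return true;
}

/* One queue per device, as in the patch, plus the global flush queue. */
static struct toy_wq flush_wq, dev0_wq, dev1_wq;
static int dev0_done, dev1_done;

static void dev0_work(void *arg) { (void)arg; dev0_done++; }
static void dev1_work(void *arg) { (void)arg; dev1_done++; }

/* A heavy-flush item: runs on flush_wq and flushes one device's queue.
 * It never flushes the queue it runs on, so it cannot self-deadlock. */
static void heavy_flush_dev0(void *arg)
{
    (void)arg;
    assert(wq_flush(&dev0_wq));
}

static bool bad_self_flush_ran, bad_self_flush_ok;

/* The old design in miniature: one shared queue flushed from its own item. */
static void bad_self_flush(void *arg)
{
    (void)arg;
    bad_self_flush_ran = true;
    bad_self_flush_ok = wq_flush(&flush_wq);
}

static int run_demo(void)
{
    wq_queue(&dev0_wq, dev0_work, NULL);
    wq_queue(&dev1_wq, dev1_work, NULL);
    wq_queue(&flush_wq, heavy_flush_dev0, NULL);
    wq_queue(&flush_wq, bad_self_flush, NULL);
    wq_flush(&flush_wq);        /* drives both flush-queue items */
    return dev0_done;
}
```

Note how flushing dev0's queue from the flush queue completes without touching dev1's queue, while the self-flush is the exact situation that would deadlock a real single shared workqueue.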
* Re: [PATCH 6/8] IPoIB: Use dedicated workqueues per interface [not found] ` <54080B6B.8050707-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> @ 2014-09-09 7:09 ` Doug Ledford [not found] ` <540EA7B2.2050805-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 37+ messages in thread From: Doug Ledford @ 2014-09-09 7:09 UTC (permalink / raw) To: Erez Shitrit, linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A On 09/04/2014 02:49 AM, Erez Shitrit wrote: > Hi Doug, > > The idea that "big" ipoib workqueue is only for the IB_EVENT_XXX looks > good to me. > > one comment here: I don't see where the driver flushes it at all. > IMHO, when the driver goes down you should cancel all pending tasks over > that global wq prior the call for the destroy_workqueue. I understand your thought, but I disagree with your conclusion. If we ever get to the portion of the driver down sequence where we are removing the big flush work queue and there are still items on that work queue, then we have already failed and we are going to crash because there are outstanding flushes for a deleted device. The real issue here is that we need to make sure it's not possible to delete a device that has outstanding flushes, and not possible to flush a device in the process of being deleted (Wendy, don't get mad at me, but the rtnl lock jumps out as appropriate for this purpose). > The current logic calls the destroy_workqueue after the remove_one, i > think you should cancel the pending works after the > ib_unregister_event_handler call in the ipoib_remove_one function, The ib_remove_one is device specific while the big flush work queue is not. As such, that can cancel work for devices other than the device being removed with no clear means of getting it back. > otherwise if there are pending tasks in that queue they will schedule > after the dev/priv are gone. I agree that there is a problem here. 
I think what needs to happen is that now that we have a work queue per device, and all flushes come in to us via a work queue that does not belong to the device, it is now possible to handle flushes synchronously. This would allow us to lock around the flush itself and prevent removal until after the flush is complete without getting into the hairy rat hole that is trying to flush the flush workqueue that might result in either flushing or canceling the very device we are working on. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
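The synchronous-flush idea Doug describes (lock around the flush itself, make removal wait until an in-progress flush completes, and have the flush bail out if the device is already gone) can be illustrated with an ordinary mutex. This is a hypothetical userspace model, not the actual IPoIB code, which would use the rtnl lock or similar.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* A device with a lock that serializes flushing against removal. */
struct toy_dev {
    pthread_mutex_t lock;
    bool removed;
    int flushes;
};

static struct toy_dev demo_dev = { PTHREAD_MUTEX_INITIALIZER, false, 0 };

/* Synchronous flush: holds the lock for the whole flush, and bails
 * out (returns false) if the device was already removed. */
static bool dev_flush_sync(struct toy_dev *dev)
{
    bool ok;

    pthread_mutex_lock(&dev->lock);
    ok = !dev->removed;
    if (ok)
        dev->flushes++;     /* the real code would flush priv->wq here */
    pthread_mutex_unlock(&dev->lock);
    return ok;
}

/* Removal blocks until any in-progress flush drops the lock. */
static void dev_remove(struct toy_dev *dev)
{
    pthread_mutex_lock(&dev->lock);
    dev->removed = true;
    pthread_mutex_unlock(&dev->lock);
}
```

Because the flush runs synchronously under the lock rather than being queued, there is never a pending flush item left behind for a deleted device.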
* Re: [PATCH 6/8] IPoIB: Use dedicated workqueues per interface [not found] ` <540EA7B2.2050805-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2014-09-10 17:27 ` Erez Shitrit 0 siblings, 0 replies; 37+ messages in thread From: Erez Shitrit @ 2014-09-10 17:27 UTC (permalink / raw) To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A > On 09/04/2014 02:49 AM, Erez Shitrit wrote: >> Hi Doug, >> >> The idea that "big" ipoib workqueue is only for the IB_EVENT_XXX looks >> good to me. >> >> one comment here: I don't see where the driver flushes it at all. >> IMHO, when the driver goes down you should cancel all pending tasks over >> that global wq prior the call for the destroy_workqueue. > > I understand your thought, but I disagree with your conclusion. If we > ever get to the portion of the driver down sequence where we are > removing the big flush work queue and there are still items on that > work queue, then we have already failed and we are going to crash > because there are outstanding flushes for a deleted device. It is not a rare case: SM events (P_Key, hand-over, etc.) during a driver restart on devices in the cluster will get you there. Removing one device of several on the same host is the rare condition. Anyway, I agree that cancel_work is not the solution. > The real issue here is that we need to make sure it's not possible to > delete a device that has outstanding flushes, and not possible to > flush a device in the process of being deleted (Wendy, don't get mad > at me, but the rtnl lock jumps out as appropriate for this purpose). We can remove the device and drain the workqueue while checking, for each event, that the event doesn't belong to the deleted device (we have the list of all existing priv objects in the system via ib_get_client_data). After that we can be sure that no more work items are left for the deleted device. 
> >> The current logic calls the destroy_workqueue after the remove_one, i >> think you should cancel the pending works after the >> ib_unregister_event_handler call in the ipoib_remove_one function, > > The ib_remove_one is device specific while the big flush work queue is > not. As such, that can cancel work for devices other than the device > being removed with no clear means of getting it back. > >> otherwise if there are pending tasks in that queue they will schedule >> after the dev/priv are gone. > > I agree that there is a problem here. I think what needs to happen is > that now that we have a work queue per device, and all flushes come in > to us via a work queue that does not belong to the device, it is now > possible to handle flushes synchronously. This would allow us to lock > around the flush itself and prevent removal until after the flush is > complete without getting into the hairy rat hole that is trying to > flush the flush workqueue that might result in either flushing or > canceling the very device we are working on. > > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
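Erez's drain-and-check suggestion (drop any queued work whose device no longer appears in the live-device list, rather than executing it against a freed dev/priv) can be sketched like this. Names and data structures are invented for illustration; the real code would walk the client data as he suggests.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DEVS  8
#define MAX_WORK 16

static bool dev_alive[MAX_DEVS];    /* stand-in for the live priv list */
static int  pending[MAX_WORK];      /* device index per queued flush item */
static int  npending;
static int  flushed[MAX_DEVS];      /* flushes actually executed, per device */

static void queue_flush(int dev) { pending[npending++] = dev; }
static void unregister_dev(int dev) { dev_alive[dev] = false; }

/* Drain the queue; work for already-removed devices is dropped rather
 * than executed against a freed dev/priv. Returns items executed. */
static int drain_flush_queue(void)
{
    int executed = 0, i;

    for (i = 0; i < npending; i++) {
        int dev = pending[i];

        if (!dev_alive[dev])
            continue;       /* device gone: running this would touch freed memory */
        flushed[dev]++;
        executed++;
    }
    npending = 0;
    return executed;
}

static int demo(void)
{
    dev_alive[0] = dev_alive[1] = true;
    queue_flush(0);
    queue_flush(1);
    queue_flush(0);
    unregister_dev(1);      /* device 1 goes away with work still queued */
    return drain_flush_queue();
}
```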
* [PATCH 7/8] IPoIB: Make ipoib_mcast_stop_thread flush the workqueue [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ` (5 preceding siblings ...) 2014-08-12 23:38 ` [PATCH 6/8] IPoIB: Use dedicated workqueues per interface Doug Ledford @ 2014-08-12 23:38 ` Doug Ledford [not found] ` <ae3912431eeacd81d920a405a6bdeb3853791b1a.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2014-08-12 23:38 ` [PATCH 8/8] IPoIB: No longer use flush as a parameter Doug Ledford ` (2 subsequent siblings) 9 siblings, 1 reply; 37+ messages in thread From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw) To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A Cc: Doug Ledford We used to pass a flush variable to mcast_stop_thread to indicate if we should flush the workqueue or not. This was due to some code trying to flush a workqueue that it was currently running on, which is a no-no. Now that we have per-device workqueues, and now that ipoib_mcast_restart_task takes advantage of the fact that it is queued on a single-thread workqueue along with all of the ipoib_mcast_join_task instances, and therefore has no need to stop the join task while it runs, we can do away with the flush parameter and unilaterally flush always. 
Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- drivers/infiniband/ulp/ipoib/ipoib.h | 2 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 4 ++-- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++------------ 3 files changed, 12 insertions(+), 15 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 7bf7339eaef..2e6c0d91d66 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -493,7 +493,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb); void ipoib_mcast_restart_task(struct work_struct *work); int ipoib_mcast_start_thread(struct net_device *dev); -int ipoib_mcast_stop_thread(struct net_device *dev, int flush); +int ipoib_mcast_stop_thread(struct net_device *dev); void ipoib_mcast_dev_down(struct net_device *dev); void ipoib_mcast_dev_flush(struct net_device *dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index bfd17d41b5f..66096787119 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -747,7 +747,7 @@ int ipoib_ib_dev_down(struct net_device *dev, int flush) clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); netif_carrier_off(dev); - ipoib_mcast_stop_thread(dev, flush); + ipoib_mcast_stop_thread(dev); ipoib_mcast_dev_flush(dev); ipoib_flush_paths(dev); @@ -1097,7 +1097,7 @@ void ipoib_ib_dev_cleanup(struct net_device *dev) */ ipoib_flush_paths(dev); - ipoib_mcast_stop_thread(dev, 1); + ipoib_mcast_stop_thread(dev); ipoib_mcast_dev_flush(dev); ipoib_transport_dev_cleanup(dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 969ef420518..8a538c010b9 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -656,7 +656,7 @@ int ipoib_mcast_start_thread(struct net_device *dev) 
return 0; } -int ipoib_mcast_stop_thread(struct net_device *dev, int flush) +int ipoib_mcast_stop_thread(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -667,8 +667,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int flush) cancel_delayed_work(&priv->mcast_task); mutex_unlock(&mcast_mutex); - if (flush) - flush_workqueue(priv->wq); + flush_workqueue(priv->wq); return 0; } @@ -846,8 +845,6 @@ void ipoib_mcast_restart_task(struct work_struct *work) ipoib_dbg_mcast(priv, "restarting multicast task\n"); - ipoib_mcast_stop_thread(dev, 0); - local_irq_save(flags); netif_addr_lock(dev); spin_lock(&priv->lock); @@ -944,13 +941,10 @@ void ipoib_mcast_restart_task(struct work_struct *work) * We have to cancel outside of the spinlock, but we have to * take the rtnl lock or else we race with the removal of * entries from the remove list in mcast_dev_flush as part - * of ipoib_stop() which will call mcast_stop_thread with - * flush == 1 while holding the rtnl lock, and the - * flush_workqueue won't complete until this restart_mcast_task - * completes. So do like the carrier on task and attempt to - * take the rtnl lock, but if we can't before the ADMIN_UP flag - * goes away, then just return and know that the remove list will - * get flushed later by mcast_stop_thread. + * of ipoib_stop(). We detect the drop of the ADMIN_UP flag + * to signal that we have hit this particular race, and we + * return since we know we don't need to do anything else + * anyway. 
*/ while (!rtnl_trylock()) { if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) @@ -962,6 +956,9 @@ void ipoib_mcast_restart_task(struct work_struct *work) ipoib_mcast_leave(mcast->dev, mcast); ipoib_mcast_free(mcast); } + /* + * Restart our join task if needed + */ ipoib_mcast_start_thread(dev); rtnl_unlock(); } -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 37+ messages in thread
* RE: [PATCH 7/8] IPoIB: Make ipoib_mcast_stop_thread flush the workqueue [not found] ` <ae3912431eeacd81d920a405a6bdeb3853791b1a.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2014-08-20 15:03 ` Estrin, Alex 0 siblings, 0 replies; 37+ messages in thread From: Estrin, Alex @ 2014-08-20 15:03 UTC (permalink / raw) To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A > -----Original Message----- > From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On > Behalf Of Doug Ledford > Sent: Tuesday, August 12, 2014 7:38 PM > To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org > Cc: Doug Ledford > Subject: [PATCH 7/8] IPoIB: Make ipoib_mcast_stop_thread flush the workqueue > > We used to pass a flush variable to mcast_stop_thread to indicate if we > should flush the workqueue or not. This was due to some code trying to > flush a workqueue that it was currently running on which is a no-no. > Now that we have per-device work queues, and now that > ipoib_mcast_restart_task has taken the fact that it is queued on a > single thread workqueue with all of the ipoib_mcast_join_task's and > therefore has no need to stop the join task while it runs, we can do > away with the flush parameter and unilaterally flush always. 
> > Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Acked-by: Alex Estrin <alex.estrin-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> > --- > drivers/infiniband/ulp/ipoib/ipoib.h | 2 +- > drivers/infiniband/ulp/ipoib/ipoib_ib.c | 4 ++-- > drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++------------ > 3 files changed, 12 insertions(+), 15 deletions(-) > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h > b/drivers/infiniband/ulp/ipoib/ipoib.h > index 7bf7339eaef..2e6c0d91d66 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib.h > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h > @@ -493,7 +493,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct > sk_buff *skb); > > void ipoib_mcast_restart_task(struct work_struct *work); > int ipoib_mcast_start_thread(struct net_device *dev); > -int ipoib_mcast_stop_thread(struct net_device *dev, int flush); > +int ipoib_mcast_stop_thread(struct net_device *dev); > > void ipoib_mcast_dev_down(struct net_device *dev); > void ipoib_mcast_dev_flush(struct net_device *dev); > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > index bfd17d41b5f..66096787119 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > @@ -747,7 +747,7 @@ int ipoib_ib_dev_down(struct net_device *dev, int flush) > clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); > netif_carrier_off(dev); > > - ipoib_mcast_stop_thread(dev, flush); > + ipoib_mcast_stop_thread(dev); > ipoib_mcast_dev_flush(dev); > > ipoib_flush_paths(dev); > @@ -1097,7 +1097,7 @@ void ipoib_ib_dev_cleanup(struct net_device *dev) > */ > ipoib_flush_paths(dev); > > - ipoib_mcast_stop_thread(dev, 1); > + ipoib_mcast_stop_thread(dev); > ipoib_mcast_dev_flush(dev); > > ipoib_transport_dev_cleanup(dev); > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > index 969ef420518..8a538c010b9 100644 > --- 
a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > @@ -656,7 +656,7 @@ int ipoib_mcast_start_thread(struct net_device *dev) > return 0; > } > > -int ipoib_mcast_stop_thread(struct net_device *dev, int flush) > +int ipoib_mcast_stop_thread(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > > @@ -667,8 +667,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int flush) > cancel_delayed_work(&priv->mcast_task); > mutex_unlock(&mcast_mutex); > > - if (flush) > - flush_workqueue(priv->wq); > + flush_workqueue(priv->wq); > > return 0; > } > @@ -846,8 +845,6 @@ void ipoib_mcast_restart_task(struct work_struct *work) > > ipoib_dbg_mcast(priv, "restarting multicast task\n"); > > - ipoib_mcast_stop_thread(dev, 0); > - > local_irq_save(flags); > netif_addr_lock(dev); > spin_lock(&priv->lock); > @@ -944,13 +941,10 @@ void ipoib_mcast_restart_task(struct work_struct *work) > * We have to cancel outside of the spinlock, but we have to > * take the rtnl lock or else we race with the removal of > * entries from the remove list in mcast_dev_flush as part > - * of ipoib_stop() which will call mcast_stop_thread with > - * flush == 1 while holding the rtnl lock, and the > - * flush_workqueue won't complete until this restart_mcast_task > - * completes. So do like the carrier on task and attempt to > - * take the rtnl lock, but if we can't before the ADMIN_UP flag > - * goes away, then just return and know that the remove list will > - * get flushed later by mcast_stop_thread. > + * of ipoib_stop(). We detect the drop of the ADMIN_UP flag > + * to signal that we have hit this particular race, and we > + * return since we know we don't need to do anything else > + * anyway. 
> */ > while (!rtnl_trylock()) { > if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) > @@ -962,6 +956,9 @@ void ipoib_mcast_restart_task(struct work_struct *work) > ipoib_mcast_leave(mcast->dev, mcast); > ipoib_mcast_free(mcast); > } > + /* > + * Restart our join task if needed > + */ > ipoib_mcast_start_thread(dev); > rtnl_unlock(); > } > -- > 1.9.3 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 8/8] IPoIB: No longer use flush as a parameter
       [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2014-08-12 23:38 ` [PATCH 7/8] IPoIB: Make ipoib_mcast_stop_thread flush the workqueue Doug Ledford
@ 2014-08-12 23:38 ` Doug Ledford
       [not found]   ` <ad7bb2b8da52f187cf2978e6a1c77ead32b60de3.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-08-15 22:08 ` [PATCH 0/8] IPoIB: Fix multiple race conditions Wendy Cheng
  2014-09-03 13:52 ` Or Gerlitz
  9 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Doug Ledford

Various places in the IPoIB code had a deadlock related to flushing the
ipoib workqueue.  Now that we have per device workqueues and a specific
flush workqueue, there is no longer a deadlock issue with flushing the
device specific workqueues and we can do so unilaterally.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h      |  6 +++---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c   | 19 +++++++++----------
 drivers/infiniband/ulp/ipoib/ipoib_main.c |  8 ++++----
 3 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 2e6c0d91d66..71b18883ffe 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -478,10 +478,10 @@ void ipoib_ib_dev_flush_heavy(struct work_struct *work);
 void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
-int ipoib_ib_dev_open(struct net_device *dev, int flush);
+int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
-int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev, int flush);
+int ipoib_ib_dev_down(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev);
 void ipoib_pkey_dev_check_presence(struct net_device *dev);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 66096787119..fe65abb5150 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -664,7 +664,7 @@ static void ipoib_ib_tx_timer_func(unsigned long ctx)
 	drain_tx_cq((struct net_device *)ctx);
 }
 
-int ipoib_ib_dev_open(struct net_device *dev, int flush)
+int ipoib_ib_dev_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
@@ -706,7 +706,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush)
 dev_stop:
 	if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
 		napi_enable(&priv->napi);
-	ipoib_ib_dev_stop(dev, flush);
+	ipoib_ib_dev_stop(dev);
 	return -1;
 }
 
@@ -738,7 +738,7 @@ int ipoib_ib_dev_up(struct net_device *dev)
 	return ipoib_mcast_start_thread(dev);
 }
 
-int ipoib_ib_dev_down(struct net_device *dev, int flush)
+int ipoib_ib_dev_down(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
@@ -807,7 +807,7 @@ void ipoib_drain_cq(struct net_device *dev)
 	local_bh_enable();
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev, int flush)
+int ipoib_ib_dev_stop(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -880,8 +880,7 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	if (flush)
-		flush_workqueue(priv->wq);
+	flush_workqueue(priv->wq);
 
 	begin = jiffies;
 
@@ -918,7 +917,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 			    (unsigned long) dev);
 
 	if (dev->flags & IFF_UP) {
-		if (ipoib_ib_dev_open(dev, 1)) {
+		if (ipoib_ib_dev_open(dev)) {
 			ipoib_transport_dev_cleanup(dev);
 			return -ENODEV;
 		}
@@ -1040,12 +1039,12 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
 	}
 
 	if (level >= IPOIB_FLUSH_NORMAL)
-		ipoib_ib_dev_down(dev, 0);
+		ipoib_ib_dev_down(dev);
 
 	if (level == IPOIB_FLUSH_HEAVY) {
 		if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
-			ipoib_ib_dev_stop(dev, 0);
-		if (ipoib_ib_dev_open(dev, 0) != 0)
+			ipoib_ib_dev_stop(dev);
+		if (ipoib_ib_dev_open(dev) != 0)
 			return;
 		if (netif_queue_stopped(dev))
 			netif_start_queue(dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 255c8296566..4e4f6ec59e1 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -108,7 +108,7 @@ int ipoib_open(struct net_device *dev)
 
 	set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
 
-	if (ipoib_ib_dev_open(dev, 1)) {
+	if (ipoib_ib_dev_open(dev)) {
 		if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
 			return 0;
 		goto err_disable;
@@ -139,7 +139,7 @@ int ipoib_open(struct net_device *dev)
 	return 0;
 
 err_stop:
-	ipoib_ib_dev_stop(dev, 1);
+	ipoib_ib_dev_stop(dev);
 
 err_disable:
 	clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
@@ -157,8 +157,8 @@ static int ipoib_stop(struct net_device *dev)
 
 	netif_stop_queue(dev);
 
-	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev, 0);
+	ipoib_ib_dev_down(dev);
+	ipoib_ib_dev_stop(dev);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 37+ messages in thread
[parent not found: <ad7bb2b8da52f187cf2978e6a1c77ead32b60de3.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* RE: [PATCH 8/8] IPoIB: No longer use flush as a parameter
       [not found] ` <ad7bb2b8da52f187cf2978e6a1c77ead32b60de3.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-08-20 15:04 ` Estrin, Alex
  0 siblings, 0 replies; 37+ messages in thread
From: Estrin, Alex @ 2014-08-20 15:04 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A

> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> [mailto:linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On
> Behalf Of Doug Ledford
> Sent: Tuesday, August 12, 2014 7:38 PM
> To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
> Cc: Doug Ledford
> Subject: [PATCH 8/8] IPoIB: No longer use flush as a parameter
>
> Various places in the IPoIB code had a deadlock related to flushing the
> ipoib workqueue.  Now that we have per device workqueues and a specific
> flush workqueue, there is no longer a deadlock issue with flushing the
> device specific workqueues and we can do so unilaterally.
>
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Alex Estrin <alex.estrin-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

> [snip -- full patch body quoted verbatim from the message above]

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/8] IPoIB: Fix multiple race conditions
       [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2014-08-12 23:38 ` [PATCH 8/8] IPoIB: No longer use flush as a parameter Doug Ledford
@ 2014-08-15 22:08 ` Wendy Cheng
       [not found]   ` <CABgxfbEGfiNGUKT4NJi1GoDRouFznxpgogDt5yr47TLfwDB7hA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-09-03 13:52 ` Or Gerlitz
  9 siblings, 1 reply; 37+ messages in thread
From: Wendy Cheng @ 2014-08-15 22:08 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Cheng, Wendy

On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Locking of multicast joins/leaves in the IPoIB layer have been problematic
> for a while.  There have been recent changes to try and make things better,
> including these changes:
>
> bea1e22 IPoIB: Fix use-after-free of multicast object
> a9c8ba5 IPoIB: Fix usage of uninitialized multicast objects
>
> Unfortunately, the following test still fails (miserably) on a plain
> upstream kernel:
>
> pass=0
> ifdown ib0
> while true; do
>     ifconfig ib0 up
>     ifconfig ib0 down
>     echo "Pass $pass"
>     let pass++
> done
>
> This usually fails within 10 to 20 passes, although I did have a lucky
> run make it to 300 or so.  If you happen to have a P_Key child interface,
> it fails even quicker.
> [snip]
>
> Doug Ledford (8):
>   IPoIB: Consolidate rtnl_lock tasks in workqueue
>   IPoIB: Make the carrier_on_task race aware
>   IPoIB: fix MCAST_FLAG_BUSY usage
>   IPoIB: fix mcast_dev_flush/mcast_restart_task race
>   IPoIB: change init sequence ordering
>   IPoIB: Use dedicated workqueues per interface
>   IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
>   IPoIB: No longer use flush as a parameter
>

IPoIB was recently added as a technology preview for Intel Xeon Phi
(currently a PCIe card), which runs embedded Linux (named MPSS) with
the InfiniBand software stack supported via emulation drivers.  One early
piece of feedback from users with large cluster nodes concerns IPoIB's
power consumption.  The root cause of the reported issue has more to do
with how MPSS handles its DMA buffers (vs. how the Linux IB stack works),
so submitting the fix upstream is not planned at this moment (unless
folks are interested in the changes).

However, since this patch set happens to be in the heart of the
reported power issue, we would like to take a closer look to avoid
the MPSS code base deviating too much from future upstream kernel(s).
Questions, comments, and/or acks will follow sometime next week.

-- Wendy

^ permalink raw reply	[flat|nested] 37+ messages in thread
[parent not found: <CABgxfbEGfiNGUKT4NJi1GoDRouFznxpgogDt5yr47TLfwDB7hA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 0/8] IPoIB: Fix multiple race conditions
       [not found] ` <CABgxfbEGfiNGUKT4NJi1GoDRouFznxpgogDt5yr47TLfwDB7hA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-09-03 16:26 ` Wendy Cheng
  0 siblings, 0 replies; 37+ messages in thread
From: Wendy Cheng @ 2014-09-03 16:26 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Cheng, Wendy

On Fri, Aug 15, 2014 at 3:08 PM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
[snip]
>>
>> Doug Ledford (8):
>>   IPoIB: Consolidate rtnl_lock tasks in workqueue
>>   IPoIB: Make the carrier_on_task race aware
>>   IPoIB: fix MCAST_FLAG_BUSY usage
>>   IPoIB: fix mcast_dev_flush/mcast_restart_task race
>>   IPoIB: change init sequence ordering
>>   IPoIB: Use dedicated workqueues per interface
>>   IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
>>   IPoIB: No longer use flush as a parameter
>>
> IPoIB was recently added as a technology preview for Intel Xeon Phi
> (currently a PCIe card), which runs embedded Linux (named MPSS) with
> the InfiniBand software stack supported via emulation drivers.  One early
> piece of feedback from users with large cluster nodes concerns IPoIB's
> power consumption.  The root cause of the reported issue has more to do
> with how MPSS handles its DMA buffers (vs. how the Linux IB stack works),
> so submitting the fix upstream is not planned at this moment (unless
> folks are interested in the changes).
>
> However, since this patch set happens to be in the heart of the
> reported power issue, we would like to take a closer look to avoid
> the MPSS code base deviating too much from future upstream kernel(s).
> Questions, comments, and/or acks will follow sometime next week.
>

I've reviewed the patch set - the first half of the patches looks good.
Patches #5, #6, #7, and #8 are fine if we go with one WQ per device - I
will let others make the final call.

On our system (OFED 1.5.4 based), similar deadlocks were also observed
while the power management issues were being worked on.  Restricted by
other issues specific to our platform, I took advantage of the single
IPoIB workqueue by queuing the if-up(s) and/or if-down(s) onto the
workqueue if one was already in progress, which serialized the logic by
default.  However, I would not mind the one-WQ-per-device approach and
will re-make the changes when this patch set is picked up by the
mainline kernel.

-- Wendy

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [PATCH 0/8] IPoIB: Fix multiple race conditions
       [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (8 preceding siblings ...)
  2014-08-15 22:08 ` [PATCH 0/8] IPoIB: Fix multiple race conditions Wendy Cheng
@ 2014-09-03 13:52 ` Or Gerlitz
       [not found]   ` <54071D14.9040404-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  9 siblings, 1 reply; 37+ messages in thread
From: Or Gerlitz @ 2014-09-03 13:52 UTC (permalink / raw)
  To: Doug Ledford, Erez Shitrit, Saeed Mahameed
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Yevgeny Petrilin

On 8/13/2014 2:38 AM, Doug Ledford wrote:
> Locking of multicast joins/leaves in the IPoIB layer have been problematic
> for a while.  There have been recent changes to try and make things better,
> including these changes:
>
> bea1e22 IPoIB: Fix use-after-free of multicast object
> a9c8ba5 IPoIB: Fix usage of uninitialized multicast objects
>
> Unfortunately, the following test still fails (miserably) on a plain
> upstream kernel:
>
> pass=0
> ifdown ib0
> while true; do
>     ifconfig ib0 up
>     ifconfig ib0 down
>     echo "Pass $pass"
>     let pass++
> done
>
> This usually fails within 10 to 20 passes, although I did have a lucky
> run make it to 300 or so.  If you happen to have a P_Key child interface,
> it fails even quicker.

Hi Doug,

Thanks for looking into that.  I wasn't aware we were doing so badly... I
checked here, and the plan is that Erez Shitrit from Mellanox will also be
providing feedback on the series.

Anyway, just to make sure we're on the same page, people agree that
picking up such a series is too heavy for post merge window, right?  So we
are talking about 3.18, agreed?

Or.

> In tracking down the rtnl deadlock in the multicast code, I ended up
> re-designing the flag usage and clean up just the race conditions in
> the various tasks.  Even that wasn't enough to resolve the issue entirely
> though, as if you ran the test above on multiple interfaces simultaneously,
> it could still deadlock.  So in the end I re-did the workqueue usage too
> so that we now use a workqueue per device (and that includes P_Key devices
> have dedicated workqueues) as well as one global workqueue that does
> nothing but flush tasks.  This turns out to be a much more elegant way
> of handling the workqueues and in fact enabled me to remove all of the
> klunky passing around of flush parameters to tell various functions not
> to flush the workqueue if it would deadlock the workqueue we are running
> from.
>
> Here is my test setup:
>
> 2 InfiniBand physical fabrics: ib0 and ib1
> 2 P_Keys on each fabric: default and 0x8002 on ib0 and 0x8003 on ib1
> 4 total IPoIB devices that I have named mlx4_ib0, mlx4_ib0.8002,
>   mlx4_ib1, and mlx4_ib1.8003
>
> In order to test my final patch set, I logged into my test machine on
> four different virtual terminals, I used ifdown on all of the above
> interfaces to get things in a consistent state, and then I ran the above
> loop, one per terminal per interface simultaneously.
>
> It's worth noting here that when you ifconfig a base interface up, it
> automatically brings up the child interface too, so the ifconfig mlx4_ib0
> up is in fact racing with both ups and downs of mlx4_ib0.8002.  The same
> is true for the mlx4_ib1 interface and its child.  With my patch set in
> place, these loops are currently running without a problem and have passed
> 15,000 up/down cycles per interface.
>
> Doug Ledford (8):
>   IPoIB: Consolidate rtnl_lock tasks in workqueue
>   IPoIB: Make the carrier_on_task race aware
>   IPoIB: fix MCAST_FLAG_BUSY usage
>   IPoIB: fix mcast_dev_flush/mcast_restart_task race
>   IPoIB: change init sequence ordering
>   IPoIB: Use dedicated workqueues per interface
>   IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
>   IPoIB: No longer use flush as a parameter
>
>  drivers/infiniband/ulp/ipoib/ipoib.h           |  19 +-
>  drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  18 +-
>  drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  27 ++-
>  drivers/infiniband/ulp/ipoib/ipoib_main.c      |  49 +++--
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 239 ++++++++++++++++---------
>  drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  22 ++-
>  6 files changed, 240 insertions(+), 134 deletions(-)

^ permalink raw reply	[flat|nested] 37+ messages in thread
[parent not found: <54071D14.9040404-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH 0/8] IPoIB: Fix multiple race conditions
       [not found] ` <54071D14.9040404-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-09-03 18:12 ` Doug Ledford
  0 siblings, 0 replies; 37+ messages in thread
From: Doug Ledford @ 2014-09-03 18:12 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Erez Shitrit, Saeed Mahameed, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A, Yevgeny Petrilin

[-- Attachment #1: Type: text/plain, Size: 4642 bytes --]

On Wed, 2014-09-03 at 16:52 +0300, Or Gerlitz wrote:
> On 8/13/2014 2:38 AM, Doug Ledford wrote:
> > Locking of multicast joins/leaves in the IPoIB layer have been problematic
> > for a while.  There have been recent changes to try and make things better,
> > including these changes:
> >
> > bea1e22 IPoIB: Fix use-after-free of multicast object
> > a9c8ba5 IPoIB: Fix usage of uninitialized multicast objects
> >
> > Unfortunately, the following test still fails (miserably) on a plain
> > upstream kernel:
> >
> > pass=0
> > ifdown ib0
> > while true; do
> >     ifconfig ib0 up
> >     ifconfig ib0 down
> >     echo "Pass $pass"
> >     let pass++
> > done
> >
> > This usually fails within 10 to 20 passes, although I did have a lucky
> > run make it to 300 or so.  If you happen to have a P_Key child interface,
> > it fails even quicker.
>
> Hi Doug,
>
> Thanks for looking into that.  I wasn't aware we were doing so badly... I
> checked here, and the plan is that Erez Shitrit from Mellanox will also
> be providing feedback on the series.
>
> Anyway, just to make sure we're on the same page, people agree that
> picking up such a series is too heavy for post merge window, right?  So we
> are talking about 3.18, agreed?

Given how easy the problem is to reproduce, and how this patch set
positively solves the reproducer, I would have preferred 3.17, and I had
it in to Roland before the merge window closed, but Roland chose to put
it off.  /me boggles.

> Or.
>
> > In tracking down the rtnl deadlock in the multicast code, I ended up
> > re-designing the flag usage and clean up just the race conditions in
> > the various tasks.  Even that wasn't enough to resolve the issue entirely
> > though, as if you ran the test above on multiple interfaces simultaneously,
> > it could still deadlock.  So in the end I re-did the workqueue usage too
> > so that we now use a workqueue per device (and that includes P_Key devices
> > have dedicated workqueues) as well as one global workqueue that does
> > nothing but flush tasks.  This turns out to be a much more elegant way
> > of handling the workqueues and in fact enabled me to remove all of the
> > klunky passing around of flush parameters to tell various functions not
> > to flush the workqueue if it would deadlock the workqueue we are running
> > from.
> >
> > Here is my test setup:
> >
> > 2 InfiniBand physical fabrics: ib0 and ib1
> > 2 P_Keys on each fabric: default and 0x8002 on ib0 and 0x8003 on ib1
> > 4 total IPoIB devices that I have named mlx4_ib0, mlx4_ib0.8002,
> >   mlx4_ib1, and mlx4_ib1.8003
> >
> > In order to test my final patch set, I logged into my test machine on
> > four different virtual terminals, I used ifdown on all of the above
> > interfaces to get things in a consistent state, and then I ran the above
> > loop, one per terminal per interface simultaneously.
> >
> > It's worth noting here that when you ifconfig a base interface up, it
> > automatically brings up the child interface too, so the ifconfig mlx4_ib0
> > up is in fact racing with both ups and downs of mlx4_ib0.8002.  The same
> > is true for the mlx4_ib1 interface and its child.  With my patch set in
> > place, these loops are currently running without a problem and have passed
> > 15,000 up/down cycles per interface.
> >
> > Doug Ledford (8):
> >   IPoIB: Consolidate rtnl_lock tasks in workqueue
> >   IPoIB: Make the carrier_on_task race aware
> >   IPoIB: fix MCAST_FLAG_BUSY usage
> >   IPoIB: fix mcast_dev_flush/mcast_restart_task race
> >   IPoIB: change init sequence ordering
> >   IPoIB: Use dedicated workqueues per interface
> >   IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
> >   IPoIB: No longer use flush as a parameter
> >
> >  drivers/infiniband/ulp/ipoib/ipoib.h           |  19 +-
> >  drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  18 +-
> >  drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  27 ++-
> >  drivers/infiniband/ulp/ipoib/ipoib_main.c      |  49 +++--
> >  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 239 ++++++++++++++++---------
> >  drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  22 ++-
> >  6 files changed, 240 insertions(+), 134 deletions(-)

--
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
	GPG KeyID: 0E572FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread
end of thread, other threads:[~2014-09-10 17:27 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-12 23:38 [PATCH 0/8] IPoIB: Fix multiple race conditions Doug Ledford
     [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-12 23:38   ` [PATCH 1/8] IPoIB: Consolidate rtnl_lock tasks in workqueue Doug Ledford
     [not found]     ` <2394730ce5ae1d46522dca04066293dd842edf16.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-15 22:11       ` Wendy Cheng
2014-09-04 14:28       ` Erez Shitrit
2014-08-12 23:38   ` [PATCH 2/8] IPoIB: Make the carrier_on_task race aware Doug Ledford
     [not found]     ` <d05cdce70f1312f35f8be2d14bafd2a06809b137.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-18 23:26       ` Wendy Cheng
     [not found]         ` <CABgxfbE6edfZZ58=mTvhGqWSkCxsik0XuQPR0L-yayze=803cg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-19 20:32           ` Doug Ledford
     [not found]             ` <2CC1794B-10CD-49A2-8F5D-0C66A6684DBC-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-19 22:05               ` Wendy Cheng
2014-09-04 12:13       ` Erez Shitrit
     [not found]         ` <5408576C.7040609-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-09-09  7:17           ` Doug Ledford
2014-08-12 23:38   ` [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage Doug Ledford
     [not found]     ` <a410e80dc5ca7cfa64229bbbf50c1337317e3bd8.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-19 18:08       ` Wendy Cheng
     [not found]         ` <CABgxfbH-Dt3CpxJKwCAZeHTUyupaA9y_WXVXuxgiPMet26PTQw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-19 20:28           ` Doug Ledford
     [not found]             ` <902D5BF2-159A-4B31-A87F-7491F3C8057F-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-25 19:51               ` Wendy Cheng
     [not found]                 ` <CABgxfbHOD75vLdZ0TtWZbk8ne3kHd_eWObxPHmoJ-D8DjE0bkw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-25 20:03                   ` Doug Ledford
     [not found]                     ` <E3EFCBAC-2D6E-49D3-A556-DBD40701CC5F-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-26 18:04                       ` Wendy Cheng
2014-08-12 23:38   ` [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race Doug Ledford
     [not found]     ` <e1dbcfc25d8930b281aad12699ebf8fa82485b0e.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-29 21:53       ` Wendy Cheng
     [not found]         ` <CABgxfbHDuUrdHuLJT2oD07Cy3Ys4_rj-bJ6eR=9+uv0CuPH7_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-30 15:39           ` Wendy Cheng
     [not found]             ` <CABgxfbGJOdmAn1sokEtisDdnA=r_4mfP=PfqZVsP0cd_oL50dA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-09-03 18:06               ` Doug Ledford
     [not found]                 ` <1409767614.26762.7.camel-v+aXH1h/sVwpzh8Nc7Vzg+562jBIR2Zt@public.gmane.org>
2014-09-03 19:45                   ` Wendy Cheng
2014-09-03 17:49       ` Doug Ledford
2014-08-12 23:38   ` [PATCH 5/8] IPoIB: change init sequence ordering Doug Ledford
     [not found]     ` <ead9800512c1cb412b86cb1de3868c40f07c72be.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-09-04 12:36       ` Erez Shitrit
2014-08-12 23:38   ` [PATCH 6/8] IPoIB: Use dedicated workqueues per interface Doug Ledford
     [not found]     ` <f7af9c251d722675a549e4a673f46c0f31dfa266.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-20 15:01       ` Estrin, Alex
2014-09-04  6:49       ` Erez Shitrit
     [not found]         ` <54080B6B.8050707-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-09-09  7:09           ` Doug Ledford
     [not found]             ` <540EA7B2.2050805-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-09-10 17:27               ` Erez Shitrit
2014-08-12 23:38   ` [PATCH 7/8] IPoIB: Make ipoib_mcast_stop_thread flush the workqueue Doug Ledford
     [not found]     ` <ae3912431eeacd81d920a405a6bdeb3853791b1a.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-20 15:03       ` Estrin, Alex
2014-08-12 23:38   ` [PATCH 8/8] IPoIB: No longer use flush as a parameter Doug Ledford
     [not found]     ` <ad7bb2b8da52f187cf2978e6a1c77ead32b60de3.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-20 15:04       ` Estrin, Alex
2014-08-15 22:08   ` [PATCH 0/8] IPoIB: Fix multiple race conditions Wendy Cheng
     [not found]     ` <CABgxfbEGfiNGUKT4NJi1GoDRouFznxpgogDt5yr47TLfwDB7hA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-09-03 16:26       ` Wendy Cheng
2014-09-03 13:52   ` Or Gerlitz
     [not found]     ` <54071D14.9040404-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-09-03 18:12       ` Doug Ledford