From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matan Azrad Subject: Re: [PATCH v3] net/mlx4: fix dev rmv not detected after port stop Date: Sat, 3 Feb 2018 19:42:14 +0000 Message-ID: References: <1517214877-126768-1-git-send-email-motih@mellanox.com> <20180130093958.GE4256@6wind.com> <20180131091513.GS4256@6wind.com> <20180131104352.GT4256@6wind.com> <20180131143157.GV4256@6wind.com> <20180202195307.GD4256@6wind.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Cc: Shahaf Shuler , Mordechay Haimovsky , "dev@dpdk.org" , "stable@dpdk.org" To: Adrien Mazarguil Return-path: In-Reply-To: <20180202195307.GD4256@6wind.com> Content-Language: en-US List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Hi Adrien > From: Adrien Mazarguil , Sent: Friday, February 2, 2018 9:53 PM > Hi Matan, >=20 > On Wed, Jan 31, 2018 at 05:07:56PM +0000, Matan Azrad wrote: > > Hi Adrien > > > > From: Adrien Mazarguil , Sent: Wednesday, January 31, 2018 4:32 PM > > > Hi Matan, > > > > > > On Wed, Jan 31, 2018 at 01:44:41PM +0000, Matan Azrad wrote: > > > > Hi Adrien > > > > > I don't know what any application does but for me it is a mistake > > > > to stop all event processes in dev_stop(), Maybe for other > > > > application > > > maintainers too. > > > > > > Just like you, I don't know either what all the applications ever > > > written for DPDK expect out of dev_stop(). What's for sure is that > > > currently, LSC/RMV don't occcur afterward, the same way these events > > > do not occur before dev_start(). > > > > Why not? RMV event can occur any time after probe. >=20 > LSC as well (keep in mind this patch modifies the behavior for both event= s). > RMV events may also occur before application has a chance to register a > handler for it, in which case this approach fails to solve the problem it= 's > supposed to solve. Mitigate all you want, the application still can't rel= y on that > event only. Again, the application should register to event when it want to be notified= about it by callback and unregister to it when it doesn't want to. So, if application still didn't has chance to register to it, its ok not to= get notification about it. Obviously, Application can check both RMV and LCS statuses after the first = registration. =20 > > > Any application possibly relying on this fact will break. In such a > > > situation, a conservative approach is better. > > > > If an application should fail to get event in stopped state it may fail= in the > previous code too: > > The interrupt run from host thread so the next race may occur: > > dev_start() : master thread. > > Context switch. > > RMV interrupt started to run callbacks: host thread. > > Context switch. > > dev_stop(): master thread. > > Start reconfiguration: master thread. > > Context switch. > > Callback running. > > > > So, the only thing which can disable callback running after dev_stop() = is > callback unregistration before it. >=20 > After dev_stop() returns, new events cannot be triggered by the PMD which > is what matters. No, from application point of view it can happen even if the PMD will stop = to trigger it from dev_stop() (as the old code did). This is exactly what the above scenario says. =20 >Obviously a callback that already started to run before that > will eventually have to complete. What's your point? It can even start after dev_stop() in spite of the PMD will stop to trigger= the event, read the above scenario again. So, my point is: If application is not safe to get notification after dev_stop() it may fail= in both old and new versions. So every PMD which stops to trigger event from dev_stop() because of your r= easons makes a mistake and fails for its purpose.=20 =20 > There's a race only if an application performs multiple simultaneous cont= rol > operations on the underlying device, but this has always been unsafe (not > only during RMV) because there are no locks, it's documented as such. So, it is probably safe. > > > > > Setting up RMV/LSC callbacks is not the only configuration an > > > > > application usually performs before calling dev_start(). Think > > > > > about setting up flow rules, MAC addresses, VLANs, and so on, > > > > > this on multiple ports before starting them up all at once. > > > > > Previously it could be done in an unspecified order, now they > > > > > have to take special care > > > for RMV/LSC. > > > > > > > > Or maybe there callbacks code are already safe for it. > > > > Or they manages the unregister\register calls in the right places. > > > > > > That's my point, these "maybes" don't argue in favor of changing thin= gs. > > > > What I'm saying is that callbacks should be safe or not registered in t= he right > time. >=20 > I understand that, though it's not a valid counter argument :) With the above scenario it is :) > > > > > Many devops are only safe when called while a device is stopped. > > > > > It's even documented in rte_ethdev.h. > > > > > > > > > > > > > And? > > > > > > ...And applications therefore often do all their configuration in an > > > unspecified order while a port is stopped as a measure of safety. No > > > extra care is taken for RMV/LSC. This uncertainty can be addressed > > > by not modifying the current behavior. > > > > Or they expect to get interrupt and the corner case will come later if = we will > not change it. >=20 > Look, we're throwing opposite use cases at each other in order to make a > point, and I don't see an end to this since we're both stubborn. Let's th= us > assume applications use a bit of both. >=20 > Now we're left with a problem, before this patch neither use cases were > broken. No, again, The above scenario shows it can be broken even before this patch= . =20 > Now it's applied, mine is broken so let's agree something needs to > be done. Either all affected applications need to be updated, or we can > simply revert this and properly fix fail-safe instead. I think no, applications which will fail because of this patch may fail bef= ore this patch. So, probably applications should be safe for this patch. > > > > > So, at least for RMV event, we need the notification also in stoppe= d > state. > > > > > > You sent the rte_eth_dev_is_removed() series. You're aware that PMDs > > > implementing this call benefit from an automatic is_removed() check > > > on all remaining devops whenever some error occur. > > > In short, an application will get notified simply by getting > > > dev_start() (or any other callback) return -EIO and not being able to= use > the device. > > > > Yes, but between dev_stop to dev_start may not be any ethdev API callin= g. >=20 > So what, if an application is not using the device, why does it need to k= now > it's been removed? Data path will not check synchronously the removal status - this is must co= me by asynchronic notification. >If it's that important, why can't it run its own periodic > rte_eth_dev_is_removed() probe? Because ETHDEV have event for it, why to do similar mechanism again? > > > PMDs that do not implement is_removed() (or in addition to it) could > > > also artificially trigger a RMV event after dev_start() is called. > > > As long as the PMD remains quiet while the device is stopped, it's fi= ne. > > > > How can the PMD do it after dev_start()? Initiate alarm in dev start fu= nction > to do it later And entering into race again? >=20 > What race? All the PMD needs to to is trigger an event by registering one > with immediate effect, it won't make any difference to the application. I= f it > performs racy control operations from the handler, it's never been a PMD'= s > problem. See the race scenario above, it should be similar. > Anyway I just provided this idea as a counter argument, it's not really > needed; relying on rte_eth_dev_is_removed() is the safest approach in my > opinion. Not always. We need them both to be really hotplug aware from all sides. > > I think it is not worth the effort and this patch is doing the right th= ing(also > some other PMDs) . >=20 > Well, it's probably too late to revert this patch for 18.02 since one wou= ld also > have to come up with the proper fix for fail-safe, however that doesn't m= ean > breaking things and expecting applications to deal with it because it's n= ever > been properly documented is OK either. Nothing breaking for my opinion. > I'm post-NACKing this patch, it will have to be reverted and a fix submit= ted > for the upcoming 18.02 stable branch. In the meantime, it's not a fix for > mlx4 and as such must not be backported. I can't agree with you about it now. But you know, You are the maintainer :) do whatever you want. Anyway, I think we are all agree that all PMDs should do the same thing and= documentation somewhere must be added. My opinion, as you probably know, is to continue to trigger events even aft= er dev_stop(). Thanks, Matan.=20 > -- > Adrien Mazarguil > 6WIND