From mboxrd@z Thu Jan 1 00:00:00 1970 From: Neil Horman Subject: Re: [PATCH v2] Change alarm cancel function to thread-safe: Date: Sun, 28 Sep 2014 16:47:54 -0400 Message-ID: <20140928204754.GC4012@localhost.localdomain> References: <20140926114630.GA3930@hmsreliant.think-freely.org> <20140926134014.GB3930@hmsreliant.think-freely.org> <20140926150156.GB5619@hmsreliant.think-freely.org> <2601191342CEEE43887BDE71AB9772582137D88E@IRSMSX104.ger.corp.intel.com> <20140926162134.GE5619@hmsreliant.think-freely.org> <2601191342CEEE43887BDE71AB9772582137D95F@IRSMSX104.ger.corp.intel.com> <20140926193905.GH5619@hmsreliant.think-freely.org> <2601191342CEEE43887BDE71AB9772582138410B@IRSMSX104.ger.corp.intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: "dev-VfR2kkLFssw@public.gmane.org" To: "Ananyev, Konstantin" Return-path: Content-Disposition: inline In-Reply-To: <2601191342CEEE43887BDE71AB9772582138410B-kPTMFJFq+rGvNW/NfzhIbrfspsVTdybXVpNB7YpNyf8@public.gmane.org> List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces-VfR2kkLFssw@public.gmane.org Sender: "dev" On Sun, Sep 28, 2014 at 04:12:04PM +0000, Ananyev, Konstantin wrote: > > > > -----Original Message----- > > From: Neil Horman [mailto:nhorman-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org] > > Sent: Friday, September 26, 2014 8:39 PM > > To: Ananyev, Konstantin > > Cc: Wodkowski, PawelX; dev-VfR2kkLFssw@public.gmane.org > > Subject: Re: [dpdk-dev] [PATCH v2] Change alarm cancel function to thread-safe: > > > > On Fri, Sep 26, 2014 at 06:07:14PM +0000, Ananyev, Konstantin wrote: > > > > > > > > > > > As I remember the purpose of the patch was to fix the race condition inside rte_alarm library. > > > > > I believe that the patch provided by Michal & Pawel fixes the issues you discovered. > > > > > If you think, that is not the case, could you please provide a list of remaining issues? > > > > > Excluding ones that you just don't like it, and you are not happy with rte_alarm API in total? > > > > > > > > > > Gladly. As Pawel explained the race, its possible that, after calling > > > > rte_eal_alarm_cancel, an in-flight execution of an alarm callback may still be > > > > running. The problem with that ostensibly is that data which is being accessed > > > > by the callback might be then accessed in parallel with another process leading > > > > to data corruption or some other problem. The issue I have with his patch is > > > > that it doesn't completely close the race. While it does close the race for the > > > > condition in whcih thread B is running the alarm callback while thread A is > > > > executing the cancel operation, it does not close the case for when a single > > > > thread B is running the cancel operation, as the in-flight execution itself is > > > > still active. > > > > > > A bit puzzled here: > > > Are you saying that calling alarm_cancel() for itself inside eal_alarm_callback() might cause a problem? > > > I still don't see how. > > > > > Potentially yes, by the same race condition that exists when using a secondary > > thread to do the cancel call. As I understand it the race that Pawel described > > is as follows: > > > > Thread A Thread B > > alarm_cancel() eal_alarm_callback > > block on alarm spinlock drop spinlock > > run cancel operation execute callback function > > return from cancel > > rte_eal_alarm_set > > > > As Pawel described the problem, there is a desire to not set the new alarm while > > the old alarm is still executing. And his patch accomplishes that for the two > > thread case above just fine > > > > The problem with Pawels patch is that its non functional in the case where the > > cancel happens within Thread B. Lets change the scenario just a little bit: > > > > Thread B Thread C > > eal_alarm_callback > > callback_function > > some_other_common_func > > rte_eal_alarm_cancel(this) > > pthread_signal(Thread C) wake up > > operate on alarm data rte_eal_alarm_set > > > > As I can see, there is an incorrect behaviour in your callback_function example. > It should first finish with " eal_alarm_callback" and only then send a signal to other thread. > Otherwise we can't help it in any way. > But I think, I understand your concern: > after rte_eal_aralm_cancel() finishes, the caller can't clearly distinguish what exactly happen: > 1) alarm was cancelled succesfully. > 2) alarm was not found (already cancelled or executed). > 3) alarm is executing by the same thread and can't be cancelled. > > Basically right now the caller can distinguish that either #1 or #2,3 happened, but can't distinguish between 2 & 3. > Correct? > Yes, this is my concern exactly. > If that's so, then I suppose we can do: make alarm_cancel() to return a negative value for the case #3 (-EINPROGRESS or something). > Something like: > ... > if (ap->executing == 0) { > LIST_REMOVE(ap,next); > rte_free(ap); > count++; > ap = ap_prev; > } else if (pthread_equal(ap->executing_id, pthread_self()) == 0) { > executing++; > } else { > ret = -EINPROGRESS; > } > ... > return ((ret != 0) ? ret : count); > > So the return value will be > 0 for #1, 0 for #2, <0 for #3. > As I remember, you already suggested something similar in one of the previous mails. Yes, I rolled the API changes I suggested in with this model, because I wanted to be able to do precise specification of a timer instance to cancel, but if we're not ready to make that change, I think what you propose above would be suffficient. Theres some question as to weather we would cancel timers that are still pending on a return of -EINPROGRESS, but I think if we document it accordingly, then it can be worked out just fine. Best Neil > Konstantin > > > > > > > > In this scenario the problem is not fixed because when called from within the > > alarm thread, the executing alarm is skipped (as it must be), but that fact is > > invisible to the caller, and because of that its still possible for the same > > origional problem to occur. > > > > Neil > > >