From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matan Barak Subject: Re: [PATCH v3 for-next 02/33] IB/core: Add kref to IB devices Date: Wed, 29 Apr 2015 12:16:30 +0300 Message-ID: <5540A16E.3050603@mellanox.com> References: <1427318422-12004-1-git-send-email-somnath.kotur@emulex.com> <9f65de5e-ed5f-48d2-bff2-03ffbe4f4876@CMEXHTCAS2.ad.emulex.com> <553DF294.4070507@mellanox.com> <553F931F.6000302@mellanox.com> <20150428174312.GB5497@obsidianresearch.com> Mime-Version: 1.0 Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20150428174312.GB5497-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: Jason Gunthorpe Cc: Or Gerlitz , Somnath Kotur , Roland Dreier , "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org" List-Id: linux-rdma@vger.kernel.org On 4/28/2015 8:43 PM, Jason Gunthorpe wrote: > On Tue, Apr 28, 2015 at 05:03:11PM +0300, Matan Barak wrote: > >> The cleanup of roce_gid_cache is done in a different context, so we >> need to make sure the device is still alive while doing so. > > This explanation doesn't look right to me. I don't see anything like > that under roce_gid_cache_cleanup_one ? > > Although, there must be a call to the driver's modify_gid to free > context before freeing, and I don't see that obviously happening.. > Of course there is: roce_gid_cache_client_cleanup_one -> roce_gid_cache_client_cleanup_one_work -> *work* -> roce_gid_cache_client_cleanup_work_handler -> roce_gid_cache_cleanup_one -> free_roce_gid_cache -> write_gid -> modify_gid > However, all the other async work launched doesn't look safe at all. > I think it's safe - roce_gid_cache_client_cleanup_one deactivates the cache (no new events are handled on the cache). Ongoing events are flushed in roce_gid_cache_client_cleanup_one_work (flush_workqueue(roce_gid_mgmt_wq)). When roce_gid_cache_cleanup_one is called - no work will be done or is currently in process on this cache. > So, did you mean that the device must still be alive while all the > other work is running? And the point of this scheme is to guarentee > all the work is flushed? (at least I hope it is, otherwise there are > bigger problems here) > It's not safe to free the client's context in ib_unregister_device while the client isn't done. The obvious solution is to wait in client->remove (like you suggested) until the client has finished cleaning up things. This doesn't fit our case - since client->remove is called under device_lock, but it's possible (for example) that roce_rescan_device_work_handler is currently running and waits to grab this exact mutex - DEAD LOCK. Freeing thing asynchronously is a possible solution - we don't lock all IB devices but just our device. Moreover, it's also more future proof as other clients might want to run other things concurrently. > It is just fundamentally wrong to return from ib_client.remove while > async work is still outstanding, the client is expected to deal with > this internally and synchronously. > > You don't need IB core help to do this. > netdev is taking a similar approach - please take a look at netdev_wait_allrefs > Or: This should have been fixed after Haggai brought it up... > > Jason > Matan -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html