From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 86AAEC4320A for ; Mon, 23 Aug 2021 12:17:09 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 67268613D3 for ; Mon, 23 Aug 2021 12:17:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236893AbhHWMRu (ORCPT ); Mon, 23 Aug 2021 08:17:50 -0400 Received: from new3-smtp.messagingengine.com ([66.111.4.229]:42041 "EHLO new3-smtp.messagingengine.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236080AbhHWMRs (ORCPT ); Mon, 23 Aug 2021 08:17:48 -0400 Received: from compute5.internal (compute5.nyi.internal [10.202.2.45]) by mailnew.nyi.internal (Postfix) with ESMTP id 7B0B9580A2E; Mon, 23 Aug 2021 08:16:58 -0400 (EDT) Received: from mailfrontend2 ([10.202.2.163]) by compute5.internal (MEProxy); Mon, 23 Aug 2021 08:16:58 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-type:date:from:in-reply-to :message-id:mime-version:references:subject:to:x-me-proxy :x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm3; bh=wjZ1Ab 3tT/gw6i3R6kDawbC7G4Vz4VHhTxsGWTGUDG8=; b=pfVnaAHNd58uLPax7rftEg 9vlYkcZb95W1KKAnNYR5Gg5zWR324nmezhQfz4VttC2lhmqlS/1ZHukLQn0pt2G7 vQassDGDi53E4IM4SZGq3v0Sm6MFo2X2k2NnyVqd+oqLm0nYKmfmG25PoS/yYZKr 0D5ynm5q3sadQv0UaTAVwKQOoyBC6z9KRWDPYYu56Ixx/j7yyvdpgSmP5h27vMMB v7oe3Zm2dcgl9xub5s04eFxqiKU3POFyr8Fzhg8aWcBsAOsyd3gnklFLVHIg3UZI MYrq8Dd28HTu8XNpRwDzqJWgkKdFEjjUo2iuhdY+wx4uHDEXUbYrTkRRfkZZovYA == X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedvtddruddthedggeelucetufdoteggodetrfdotf fvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfqfgfvpdfurfetoffkrfgpnffqhgen uceurghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmne cujfgurhepfffhvffukfhfgggtuggjsehttdertddttddvnecuhfhrohhmpefkughoucfu tghhihhmmhgvlhcuoehiughoshgthhesihguohhstghhrdhorhhgqeenucggtffrrghtth gvrhhnpedtffekkeefudffveegueejffejhfetgfeuuefgvedtieehudeuueekhfduheel teenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepmhgrihhlfhhrohhmpehiug hoshgthhesihguohhstghhrdhorhhg X-ME-Proxy: Received: by mail.messagingengine.com (Postfix) with ESMTPA; Mon, 23 Aug 2021 08:16:52 -0400 (EDT) Date: Mon, 23 Aug 2021 15:16:48 +0300 From: Ido Schimmel To: Vladimir Oltean Cc: Nikolay Aleksandrov , Vladimir Oltean , netdev@vger.kernel.org, Jakub Kicinski , "David S. Miller" , Roopa Prabhu , Andrew Lunn , Florian Fainelli , Vivien Didelot , Vadym Kochan , Taras Chornyi , Jiri Pirko , Ido Schimmel , UNGLinuxDriver@microchip.com, Grygorii Strashko , Marek Behun , DENG Qingfang , Kurt Kanzenbach , Hauke Mehrtens , Woojung Huh , Sean Wang , Landen Chao , Claudiu Manoil , Alexandre Belloni , George McCollister , Ioana Ciornei , Saeed Mahameed , Leon Romanovsky , Lars Povlsen , Steen Hegelund , Julian Wiedmann , Karsten Graul , Heiko Carstens , Vasily Gorbik , Christian Borntraeger , Ivan Vecera , Vlad Buslov , Jianbo Liu , Mark Bloch , Roi Dayan , Tobias Waldekranz , Vignesh Raghavendra , Jesse Brandeburg Subject: Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking Message-ID: References: <20210820170639.rvzlh4b6agvrkv2w@skbuf> <65f3529f-06a3-b782-7436-83e167753609@nvidia.com> <20210822133145.jjgryhlkgk4av3c7@skbuf> <20210822174449.nby7gmrjhwv3walp@skbuf> <20210823110046.xuuo37kpsxdbl6c2@skbuf> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20210823110046.xuuo37kpsxdbl6c2@skbuf> Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On Mon, Aug 23, 2021 at 02:00:46PM +0300, Vladimir Oltean wrote: > On Mon, Aug 23, 2021 at 01:47:57PM +0300, Ido Schimmel wrote: > > On Sun, Aug 22, 2021 at 08:44:49PM +0300, Vladimir Oltean wrote: > > > On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote: > > > > On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote: > > > > > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are > > > > > deferred by drivers even from code paths that are initially blocking > > > > > (are running in process context): > > > > > > > > > > br_fdb_add > > > > > -> __br_fdb_add > > > > > -> fdb_add_entry > > > > > -> fdb_notify > > > > > -> br_switchdev_fdb_notify > > > > > > > > > > It seems fairly trivial to move the fdb_notify call outside of the > > > > > atomic section of fdb_add_entry, but with switchdev offering only an > > > > > API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would > > > > > still have to defer these events and are unable to provide > > > > > synchronous feedback to user space (error codes, extack). > > > > > > > > > > The above issues would warrant an attempt to fix a central problem, and > > > > > make switchdev expose an API that is easier to consume rather than > > > > > having drivers implement lateral workarounds. > > > > > > > > > > In this case, we must notice that > > > > > > > > > > (a) switchdev already has the concept of notifiers emitted from the fast > > > > > path that are still processed by drivers from blocking context. This > > > > > is accomplished through the SWITCHDEV_F_DEFER flag which is used by > > > > > e.g. SWITCHDEV_OBJ_ID_HOST_MDB. > > > > > > > > > > (b) the bridge del_nbp() function already calls switchdev_deferred_process(). > > > > > So if we could hook into that, we could have a chance that the > > > > > bridge simply waits for our FDB entry offloading procedure to finish > > > > > before it calls netdev_upper_dev_unlink() - which is almost > > > > > immediately afterwards, and also when switchdev drivers typically > > > > > break their stateful associations between the bridge upper and > > > > > private data. > > > > > > > > > > So it is in fact possible to use switchdev's generic > > > > > switchdev_deferred_enqueue mechanism to get a sleepable callback, and > > > > > from there we can call_switchdev_blocking_notifiers(). > > > > > > > > > > To address all requirements: > > > > > > > > > > - drivers that are unconverted from atomic to blocking still work > > > > > - drivers that currently have a private workqueue are not worse off > > > > > - drivers that want the bridge to wait for their deferred work can use > > > > > the bridge's defer mechanism > > > > > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested > > > > > parties does not get deferred for no reason, because this takes the > > > > > rtnl_mutex and schedules a worker thread for nothing > > > > > > > > > > it looks like we can in fact start off by emitting > > > > > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in > > > > > struct switchdev_notifier_fdb_info called "needs_defer", and any > > > > > interested party can set this to true. > > > > > > > > > > This way: > > > > > > > > > > - unconverted drivers do their work (i.e. schedule their private work > > > > > item) based on the atomic notifier, and do not set "needs_defer" > > > > > - converted drivers only mark "needs_defer" and treat a separate > > > > > notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > > > > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not > > > > > generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > > > > > > > > > Additionally, code paths that are blocking right not, like br_fdb_replay, > > > > > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all > > > > > consumers of the replayed FDB events support that (right now, that is > > > > > DSA and dpaa2-switch). > > > > > > > > > > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set > > > > > needs_defer as appropriate, then the notifiers emitted from process > > > > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > > > > directly, and we would also have fully blocking context all the way > > > > > down, with the opportunity for error propagation and extack. > > > > > > > > IIUC, at this stage all the FDB notifications drivers get are blocking, > > > > either from the work queue (because they were deferred) or directly from > > > > process context. If so, how do we synchronize the two and ensure drivers > > > > get the notifications at the correct order? > > > > > > What does 'at this stage' mean? Does it mean 'assuming the patch we're > > > discussing now gets accepted'? If that's what it means, then 'at this > > > stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE, > > > then would set needs_defer, then would receive the blocking > > > FDB_ADD_TO_DEVICE. > > > > I meant after: > > > > "Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set > > needs_defer as appropriate, then the notifiers emitted from process > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > directly, and we would also have fully blocking context all the way > > down, with the opportunity for error propagation and extack." > > > > IIUC, after the conversion the 'needs_defer' is gone and all the FDB > > events are blocking? Either from syscall context or the workqueue. > > We would not delete 'needs_defer'. It still offers a useful preliminary > filtering mechanism for the fast path (and for br_fdb_replay). In > retrospect, the SWITCHDEV_OBJ_ID_HOST_MDB would also benefit from 'needs_defer' > instead of jumping to blocking context (if we care so much about performance). > > If a FDB event does not need to be processed by anyone (dynamically > learned entry on a switchdev port), the bridge notifies the atomic call > chain for the sake of it, but not the blocking chain. > > > If so, I'm not sure how we synchronize the two. That is, making sure > > that an event from syscall context does not reach drivers before an > > earlier event that was added to the 'deferred' list. > > > > I mean, in syscall context we are holding RTNL so whatever is already on > > the 'deferred' list cannot be dequeued and processed. > > So switchdev_deferred_process() has ASSERT_RTNL. If we call > switchdev_deferred_process() right before adding the blocking FDB entry > in process context (and we already hold rtnl_mutex), I though that would > be enough to ensure we have a synchronization point: Everything that was > scheduled before is flushed now, everything that is scheduled while we > are running will run after we unlock the rtnl_mutex. Is that not the > order we expect? I mean, if there is a fast path FDB entry being learned > / deleted while user space say adds that same FDB entry as static, how > is the relative ordering ensured between the two? I was thinking about the following case: t0 - is added in syscall context under 'hash_lock' t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in response to STP state. Notifications are added to 'deferred' list t2 - switchdev_deferred_process() is called in syscall context t3 - is notified as blocking Updates to the SW FDB are protected by 'hash_lock', but updates to the HW FDB are not. In this case, does not exist in SW, but it will exist in HW. Another case assuming switchdev_deferred_process() is called first: t0 - switchdev_deferred_process() is called in syscall context t1 - is learned under 'hash_lock'. Notification is added to 'deferred' list t2 - is modified in syscall context under 'hash_lock' to t3 - is notified as blocking t4 - is notified as blocking (next time the 'deferred' list is processed) In this case, the HW will have , but SW will have