From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx2.suse.de (mx2.suse.de [195.135.220.15]) (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by ml01.01.org (Postfix) with ESMTPS id ACD28226847A6 for ; Thu, 19 Apr 2018 03:44:36 -0700 (PDT) Date: Thu, 19 Apr 2018 12:44:32 +0200 From: Jan Kara Subject: Re: [PATCH v8 15/18] mm, fs, dax: handle layout changes to pinned dax mappings Message-ID: <20180419104432.7lzk7nbjmwav6ojl@quack2.suse.cz> References: <152246892890.36038.18436540150980653229.stgit@dwillia2-desk3.amr.corp.intel.com> <152246901060.36038.4487158506830998280.stgit@dwillia2-desk3.amr.corp.intel.com> <20180404094656.dssixqvvdcp5jff2@quack2.suse.cz> <20180409164944.6u7i4wgbp6yihvin@quack2.suse.cz> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: linux-nvdimm-bounces@lists.01.org Sender: "Linux-nvdimm" To: Dan Williams Cc: Dave Hansen , Jan Kara , Andrew Morton , Mike Snitzer , Matthew Wilcox , linux-nvdimm , Dave Chinner , Linux Kernel Mailing List , Christoph Hellwig , linux-xfs , Alexander Viro , linux-fsdevel , Josh Triplett , Paul McKenney , "Darrick J. Wong" List-ID: On Fri 13-04-18 15:03:51, Dan Williams wrote: > On Mon, Apr 9, 2018 at 9:51 AM, Dan Williams wrote: > > On Mon, Apr 9, 2018 at 9:49 AM, Jan Kara wrote: > >> On Sat 07-04-18 12:38:24, Dan Williams wrote: > > [..] > >>> I wonder if this can be trivially solved by using srcu. I.e. we don't > >>> need to wait for a global quiescent state, just a > >>> get_user_pages_fast() quiescent state. ...or is that an abuse of the > >>> srcu api? > >> > >> Well, I'd rather use the percpu rwsemaphore (linux/percpu-rwsem.h) than > >> SRCU. It is a more-or-less standard locking mechanism rather than relying > >> on implementation properties of SRCU which is a data structure protection > >> method. And the overhead of percpu rwsemaphore for your use case should be > >> about the same as that of SRCU. > > > > I was just about to ask that. Yes, it seems they would share similar > > properties and it would be better to use the explicit implementation > > rather than a side effect of srcu. > > ...unfortunately: > > BUG: sleeping function called from invalid context at > ./include/linux/percpu-rwsem.h:34 > [..] > Call Trace: > dump_stack+0x85/0xcb > ___might_sleep+0x15b/0x240 > dax_layout_lock+0x18/0x80 > get_user_pages_fast+0xf8/0x140 > > ...and thinking about it more srcu is a better fit. We don't need the > 100% exclusion provided by an rwsem we only need the guarantee that > all cpus that might have been running get_user_pages_fast() have > finished it at least once. > > In my tests synchronize_srcu is a bit slower than unpatched for the > trivial 100 truncate test, but certainly not the 200x latency you were > seeing with syncrhonize_rcu. > > Elapsed time: > 0.006149178 unpatched > 0.009426360 srcu Hum, right. Yesterday I was looking into KSM for a different reason and I've noticed it also does writeprotect pages and deals with races with GUP. And what KSM relies on is: write_protect_page() ... entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); /* * Check that no O_DIRECT or similar I/O is in progress on the * page */ if (page_mapcount(page) + 1 + swapped != page_count(page)) { page used -> bail } And this really works because gup_pte_range() does: page = pte_page(pte); head = compound_head(page); if (!page_cache_get_speculative(head)) goto pte_unmap; if (unlikely(pte_val(pte) != pte_val(*ptep))) { bail } So either write_protect_page() page sees the elevated reference or gup_pte_range() bails because it will see the pte changed. In the truncate path things are a bit different but in principle the same should work - once truncate blocks page faults and unmaps pages from page tables, we can be sure GUP will not grab the page anymore or we'll see elevated page count. So IMO there's no need for any additional locking against the GUP path (but a comment explaining this is highly desirable I guess). Honza -- Jan Kara SUSE Labs, CR _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm