From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dan Williams
Date: Thu, 19 Apr 2018 20:00:09 -0700
Subject: Re: [PATCH v8 15/18] mm, fs, dax: handle layout changes to pinned
 dax mappings
In-Reply-To: <20180419104432.7lzk7nbjmwav6ojl@quack2.suse.cz>
References: <152246892890.36038.18436540150980653229.stgit@dwillia2-desk3.amr.corp.intel.com>
 <152246901060.36038.4487158506830998280.stgit@dwillia2-desk3.amr.corp.intel.com>
 <20180404094656.dssixqvvdcp5jff2@quack2.suse.cz>
 <20180409164944.6u7i4wgbp6yihvin@quack2.suse.cz>
 <20180419104432.7lzk7nbjmwav6ojl@quack2.suse.cz>
To: Jan Kara
Cc: Dave Hansen, Josh Triplett, Mike Snitzer, "Darrick J. Wong",
 Matthew Wilcox, Dave Chinner, Linux Kernel Mailing List,
 Christoph Hellwig, linux-xfs, linux-nvdimm, Alexander Viro,
 linux-fsdevel, Paul McKenney, Andrew Morton
Content-Type: text/plain; charset="us-ascii"

On Thu, Apr 19, 2018 at 3:44 AM, Jan Kara wrote:
> On Fri 13-04-18 15:03:51, Dan Williams wrote:
>> On Mon, Apr 9, 2018 at 9:51 AM, Dan Williams wrote:
>> > On Mon, Apr 9, 2018 at 9:49 AM, Jan Kara wrote:
>> >> On Sat 07-04-18 12:38:24, Dan Williams wrote:
>> > [..]
>> >>> I wonder if this can be trivially solved by using srcu. I.e. we don't
>> >>> need to wait for a global quiescent state, just a
>> >>> get_user_pages_fast() quiescent state. ...or is that an abuse of the
>> >>> srcu api?
>> >>
>> >> Well, I'd rather use the percpu rwsemaphore (linux/percpu-rwsem.h) than
>> >> SRCU. It is a more-or-less standard locking mechanism rather than relying
>> >> on implementation properties of SRCU, which is a data structure protection
>> >> method. And the overhead of the percpu rwsemaphore for your use case
>> >> should be about the same as that of SRCU.
>> >
>> > I was just about to ask that. Yes, it seems they would share similar
>> > properties and it would be better to use the explicit implementation
>> > rather than a side effect of srcu.
>>
>> ...unfortunately:
>>
>>  BUG: sleeping function called from invalid context at
>>  ./include/linux/percpu-rwsem.h:34
>>  [..]
>>  Call Trace:
>>   dump_stack+0x85/0xcb
>>   ___might_sleep+0x15b/0x240
>>   dax_layout_lock+0x18/0x80
>>   get_user_pages_fast+0xf8/0x140
>>
>> ...and thinking about it more, srcu is a better fit. We don't need the
>> 100% exclusion provided by an rwsem, we only need the guarantee that
>> all cpus that might have been running get_user_pages_fast() have
>> finished it at least once.
>>
>> In my tests synchronize_srcu is a bit slower than unpatched for the
>> trivial 100-truncate test, but certainly not the 200x latency you were
>> seeing with synchronize_rcu.
>>
>> Elapsed time:
>> 0.006149178 unpatched
>> 0.009426360 srcu
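
For illustration only, a rough sketch of the shape such an srcu scheme could
take; the "dax_srcu" domain and the helper names below are placeholders for
this sketch, not something taken from the actual patches:

#include <linux/srcu.h>
#include <linux/mm.h>
#include <linux/fs.h>

DEFINE_STATIC_SRCU(dax_srcu);

/* gup_fast side: a cheap, non-sleeping read-side section */
static int dax_gup_fast(unsigned long start, int nr_pages,
			struct page **pages)
{
	int idx, ret;

	idx = srcu_read_lock(&dax_srcu);
	ret = __get_user_pages_fast(start, nr_pages, 1, pages);
	srcu_read_unlock(&dax_srcu, idx);
	return ret;
}

/* truncate / hole-punch side */
static void dax_break_layouts(struct address_space *mapping)
{
	unmap_mapping_range(mapping, 0, 0, 1);
	/*
	 * After this returns, every CPU that was inside dax_gup_fast()
	 * when the ptes were cleared has finished, so any page reference
	 * it took is visible in page_count() and can be waited on.
	 */
	synchronize_srcu(&dax_srcu);
}

The point is only that srcu_read_lock() never sleeps, so it can wrap the
irq-disabled walk in get_user_pages_fast(), which is exactly where
percpu_down_read() tripped the might_sleep() splat above.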
>
> Hum, right. Yesterday I was looking into KSM for a different reason and
> I've noticed it also write-protects pages and deals with races with GUP.
> And what KSM relies on is:
>
> write_protect_page()
>   ...
>   entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
>   /*
>    * Check that no O_DIRECT or similar I/O is in progress on the
>    * page
>    */
>   if (page_mapcount(page) + 1 + swapped != page_count(page)) {
>     page used -> bail

Slick.

>   }
>
> And this really works because gup_pte_range() does:
>
>   page = pte_page(pte);
>   head = compound_head(page);
>
>   if (!page_cache_get_speculative(head))
>     goto pte_unmap;
>
>   if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>     bail

Need to add a similar check to __gup_device_huge_pmd.

>   }
>
> So either write_protect_page() sees the elevated reference or
> gup_pte_range() bails because it will see the pte changed.
>
> In the truncate path things are a bit different, but in principle the same
> should work - once truncate blocks page faults and unmaps pages from page
> tables, we can be sure GUP will not grab the page anymore or we'll see an
> elevated page count. So IMO there's no need for any additional locking
> against the GUP path (but a comment explaining this is highly desirable I
> guess).

Yes, those "pte_val(pte) != pte_val(*ptep)" checks should be documented for
the same reason we require comments on rmb/wmb pairs. I'll take a look,
thanks Jan.
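
To make the truncate-side either/or concrete, a rough sketch of what relying
on it could look like; the helpers (dax_page_idle(), dax_wait_for_pins()) and
the assumption that an unmapped, unpinned dax page sits at a refcount of 1
are illustrative only, not lifted from the series:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/page_ref.h>
#include <linux/delay.h>

static bool dax_page_idle(struct page *page)
{
	/* assumed idle count: only the mapping still holds a reference */
	return page_ref_count(page) == 1;
}

static void dax_wait_for_pins(struct address_space *mapping, struct page *page)
{
	/* caller is assumed to have blocked new page faults already */
	unmap_mapping_range(mapping, 0, 0, 1);

	/*
	 * From here gup_pte_range() either sees the cleared pte and bails,
	 * or its speculative reference is already visible in page_count(),
	 * the same either/or as KSM's mapcount/page_count comparison.
	 */
	while (!dax_page_idle(page))
		msleep(10);
}

Polling is just to keep the sketch short; real code would presumably sleep on
a waitqueue. The part that matters is the ordering: unmap first, then treat
any remaining elevated reference as an in-flight GUP pin.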