Date: Thu, 10 Oct 2019 15:05:42 +0200
From: Peter Zijlstra
To: Thomas Hellström (VMware)
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
        torvalds@linux-foundation.org, kirill@shutemov.name,
        Thomas Hellstrom, Andrew Morton, Matthew Wilcox, Will Deacon,
        Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying, Jérôme Glisse
Subject: Re: [PATCH v5 4/8] mm: Add write-protect and clean utilities for address space ranges
Message-ID: <20191010130542.GP2328@hirez.programming.kicks-ass.net>
References: <20191010124314.40067-1-thomas_os@shipmail.org>
 <20191010124314.40067-5-thomas_os@shipmail.org>
In-Reply-To: <20191010124314.40067-5-thomas_os@shipmail.org>

On Thu, Oct 10, 2019 at 02:43:10PM +0200, Thomas Hellström (VMware) wrote:
> +/**
> + * struct wp_walk - Private struct for pagetable walk callbacks
> + * @range: Range for mmu notifiers
> + * @tlbflush_start: Address of first modified pte
> + * @tlbflush_end: Address of last modified pte + 1
> + * @total: Total number of modified ptes
> + */
> +struct wp_walk {
> +        struct mmu_notifier_range range;
> +        unsigned long tlbflush_start;
> +        unsigned long tlbflush_end;
> +        unsigned long total;
> +};
> +
> +/**
> + * wp_pte - Write-protect a pte
> + * @pte: Pointer to the pte
> + * @addr: The virtual page address
> + * @walk: pagetable walk callback argument
> + *
> + * The function write-protects a pte and records the range in
> + * virtual address space of touched ptes for efficient range TLB flushes.
> + */
> +static int wp_pte(pte_t *pte, unsigned long addr, unsigned long end,
> +                  struct mm_walk *walk)
> +{
> +        struct wp_walk *wpwalk = walk->private;
> +        pte_t ptent = *pte;
> +
> +        if (pte_write(ptent)) {
> +                pte_t old_pte = ptep_modify_prot_start(walk->vma, addr, pte);
> +
> +                ptent = pte_wrprotect(old_pte);
> +                ptep_modify_prot_commit(walk->vma, addr, pte, old_pte, ptent);
> +                wpwalk->total++;
> +                wpwalk->tlbflush_start = min(wpwalk->tlbflush_start, addr);
> +                wpwalk->tlbflush_end = max(wpwalk->tlbflush_end,
> +                                           addr + PAGE_SIZE);
> +        }
> +
> +        return 0;
> +}
> +
> +/*
> + * wp_clean_pre_vma - The pagewalk pre_vma callback.
> + *
> + * The pre_vma callback performs the cache flush, stages the tlb flush
> + * and calls the necessary mmu notifiers.
> + */
> +static int wp_clean_pre_vma(unsigned long start, unsigned long end,
> +                            struct mm_walk *walk)
> +{
> +        struct wp_walk *wpwalk = walk->private;
> +
> +        wpwalk->tlbflush_start = end;
> +        wpwalk->tlbflush_end = start;
> +
> +        mmu_notifier_range_init(&wpwalk->range, MMU_NOTIFY_PROTECTION_PAGE, 0,
> +                                walk->vma, walk->mm, start, end);
> +        mmu_notifier_invalidate_range_start(&wpwalk->range);
> +        flush_cache_range(walk->vma, start, end);
> +
> +        /*
> +         * We're not using tlb_gather_mmu() since typically
> +         * only a small subrange of PTEs are affected, whereas
> +         * tlb_gather_mmu() records the full range.
> +         */
> +        inc_tlb_flush_pending(walk->mm);
> +
> +        return 0;
> +}
> +
> +/*
> + * wp_clean_post_vma - The pagewalk post_vma callback.
> + *
> + * The post_vma callback performs the tlb flush and calls necessary mmu
> + * notifiers.
> + */
> +static void wp_clean_post_vma(struct mm_walk *walk)
> +{
> +        struct wp_walk *wpwalk = walk->private;
> +
> +        if (wpwalk->tlbflush_end > wpwalk->tlbflush_start)
> +                flush_tlb_range(walk->vma, wpwalk->tlbflush_start,
> +                                wpwalk->tlbflush_end);
> +
> +        mmu_notifier_invalidate_range_end(&wpwalk->range);
> +        dec_tlb_flush_pending(walk->mm);
> +}
> +
> +/**
> + * wp_shared_mapping_range - Write-protect all ptes in an address space range
> + * @mapping: The address_space we want to write protect
> + * @first_index: The first page offset in the range
> + * @nr: Number of incremental page offsets to cover
> + *
> + * Note: This function currently skips transhuge page-table entries, since
> + * it's intended for dirty-tracking on the PTE level. It will warn on
> + * encountering transhuge write-enabled entries, though, and can easily be
> + * extended to handle them as well.
> + *
> + * Return: The number of ptes actually write-protected. Note that
> + * already write-protected ptes are not counted.
> + */
> +unsigned long wp_shared_mapping_range(struct address_space *mapping,
> +                                      pgoff_t first_index, pgoff_t nr)
> +{
> +        struct wp_walk wpwalk = { .total = 0 };
> +
> +        i_mmap_lock_read(mapping);
> +        WARN_ON(walk_page_mapping(mapping, first_index, nr, &wp_walk_ops,
> +                                  &wpwalk));
> +        i_mmap_unlock_read(mapping);
> +
> +        return wpwalk.total;
> +}

That's a read lock, which means this code can run concurrently with
itself. What happens if someone does two concurrent
wp_shared_mapping_range() calls on the same mapping?

The thing is, because of pte_wrprotect() the walk that starts last will
see a smaller pte_write range; if it completes first and does
flush_tlb_range(), it will only flush a partial range.

This is exactly what {inc,dec}_tlb_flush_pending() is for, but you're
not using mm_tlb_flush_nested() to detect the situation and do a
bigger flush.

Or if you don't need that, then I'm missing why.
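
To make the race concrete, a possible interleaving (the CPU labels and
address range A..B are mine, purely for illustration):

        CPU0                                CPU1
        wp_shared_mapping_range()
          wp_pte(): wrprotects A..B,
          records A..B for flushing
                                            wp_shared_mapping_range()
                                              wp_pte(): sees A..B already
                                              read-only, records nothing
                                              wp_clean_post_vma(): flushes
                                              nothing for A..B
                                              returns; its caller assumes
                                              no CPU can still write A..B
          wp_clean_post_vma():
          flush_tlb_range(A..B)

Until CPU0 gets around to its flush, other CPUs can still hold stale
writable TLB entries for A..B and dirty those pages behind CPU1's back.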
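
Something like the below (completely untested, my sketch rather than
code from this series) is what I'd expect, mirroring how
tlb_finish_mmu() handles nested flushes: when mm_tlb_flush_nested()
indicates a concurrent flusher, the locally recorded range can't be
trusted, so fall back to flushing the whole range staged in
wp_clean_pre_vma():

static void wp_clean_post_vma(struct mm_walk *walk)
{
        struct wp_walk *wpwalk = walk->private;

        if (mm_tlb_flush_nested(walk->mm))
                /*
                 * A concurrent walk may have write-protected ptes
                 * outside our recorded span, so conservatively flush
                 * the entire range handed to wp_clean_pre_vma().
                 */
                flush_tlb_range(walk->vma, wpwalk->range.start,
                                wpwalk->range.end);
        else if (wpwalk->tlbflush_end > wpwalk->tlbflush_start)
                flush_tlb_range(walk->vma, wpwalk->tlbflush_start,
                                wpwalk->tlbflush_end);

        mmu_notifier_invalidate_range_end(&wpwalk->range);
        dec_tlb_flush_pending(walk->mm);
}

The wide flush only kicks in when two flushers actually overlap in
time, so the common uncontended case keeps the minimal-range flush the
patch is after.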