From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jerome Glisse
Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
Date: Fri, 11 Jan 2019 11:51:42 -0500
Message-ID: <20190111165141.GB3190@redhat.com>
References: <20181217181148.GA3341@redhat.com>
 <20181217183443.GO10600@bombadil.infradead.org>
 <20181218093017.GB18032@quack2.suse.cz>
 <9f43d124-2386-7bfd-d90b-4d0417f51ccd@nvidia.com>
 <20181219020723.GD4347@redhat.com>
 <20181219110856.GA18345@quack2.suse.cz>
 <20190103015533.GA15619@redhat.com>
 <20190103092654.GA31370@quack2.suse.cz>
 <20190103144405.GC3395@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Cc: Jan Kara , Matthew Wilcox , Dave Chinner , Dan Williams ,
 John Hubbard , Andrew Morton , Linux MM , tom@talpey.com, Al Viro ,
 benve@cisco.com, Christoph Hellwig , Christopher Lameter ,
 "Dalessandro, Dennis" , Doug Ledford , Jason Gunthorpe , Michal Hocko ,
 mike.marciniszyn@intel.com, rcampbell@nvidia.com,
 Linux Kernel Mailing List , linux-fsdevel
To: John Hubbard
Return-path:
Content-Disposition: inline
In-Reply-To:
Sender: linux-kernel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Thu, Jan 10, 2019 at 06:59:31PM -0800, John Hubbard wrote:
> On 1/3/19 6:44 AM, Jerome Glisse wrote:
> > On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
> >> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
> >>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
> >>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> >>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:

[...]
> >>> Now page_mkclean:
> >>>
> >>> int page_mkclean(struct page *page)
> >>> {
> >>> 	int cleaned = 0;
> >>> +	int real_mapcount = 0;
> >>> 	struct address_space *mapping;
> >>> 	struct rmap_walk_control rwc = {
> >>> 		.arg = (void *)&cleaned,
> >>> 		.rmap_one = page_mkclean_one,
> >>> 		.invalid_vma = invalid_mkclean_vma,
> >>> +		.mapcount = &real_mapcount,
> >>> 	};
> >>> +	int mapcount1, mapcount2;
> >>>
> >>> 	BUG_ON(!PageLocked(page));
> >>>
> >>> 	if (!page_mapped(page))
> >>> 		return 0;
> >>>
> >>> 	mapping = page_mapping(page);
> >>> 	if (!mapping)
> >>> 		return 0;
> >>>
> >>> +	mapcount1 = page_mapcount(page);
> >>> 	// rmap_walk need to change to count mapping and return value
> >>> 	// in .mapcount easy one
> >>> 	rmap_walk(page, &rwc);
> >>
> >> So what prevents GUP_fast() to grab reference here and the test below would
> >> think the page is not pinned? Or do you assume that every page_mkclean()
> >> call will be protected by PageWriteback (currently it is not) so that
> >> GUP_fast() blocks / bails out?
>
> Continuing this thread, still focusing only on the "how to maintain a
> PageDmaPinned for each page" question (ignoring, for now, what to actually
> *do* in response to that flag being set):
>
> 1. Jan's point above is still a problem: PageWriteback != "page_mkclean is
> happening". This is probably less troubling than the next point, but it
> does undermine all the complicated schemes involving PageWriteback, that
> try to synchronize gup() with page_mkclean().
>
> 2. Also, the mapcount approach here still does not reliably avoid false
> negatives (that is, a page may have been gup'd, but page_mkclean could miss
> that): gup() can always jump in and increment the mapcount, while
> page_mkclean is in the middle of making (wrong) decisions based on that
> mapcount. There's no lock to prevent that.
>
> Again: mapcount can go up *or* down, so I'm not seeing a true solution yet.

Both points are addressed by the solution at the end of this email.
> > So GUP_fast() becomes:
> >
> > GUP_fast_existing() { ... }
> > GUP_fast()
> > {
> >     GUP_fast_existing();
> >
> >     for (i = 0; i < npages; ++i) {
> >         if (PageWriteback(pages[i])) {
> >             // need to force slow path for this page
> >         } else {
> >             SetPageDmaPinned(pages[i]);
> >             atomic_inc(pages[i]->mapcount);
> >         }
> >     }
> > }
> >
> > This is a minor slow down for GUP fast and it takes care of a
> > write back race on behalf of caller. This means that page_mkclean
> > can not see a mapcount value that increase. This simplify thing
> > we can relax that. Note that what this is doing is making sure
> > that GUP_fast never get lucky :) ie never GUP a page that is in
> > the process of being write back but has not yet had its pte
> > updated to reflect that.
> >
> >
> >> But I think that detecting pinned pages with small false positive rate is
> >> OK. The extra page bouncing will cost some performance but if it is rare,
> >> then we are OK. So I think we can go for the simple version of detecting
> >> pinned pages as you mentioned in some earlier email. We just have to be
> >> sure there are no false negatives.
>
> Agree with that sentiment, but there are still false negatives and I'm not
> yet seeing any solutions for that.

So here is the solution:


Is a page pinned? With no false negatives:
==========================================

get_user_page*() aka GUP:
     if (!PageAnon(page)) {
        bool write_back = PageWriteback(page);
        bool page_is_pin = PagePin(page);
        if (write_back && !page_is_pin) {
            /* Wait for write back and re-try GUP */
            ...
            goto retry;
        }
[G1]    smp_rmb();
[G2]    atomic_inc(&page->_mapcount);
[G3]    smp_wmb();
[G4]    SetPagePin(page);
[G5]    smp_wmb();
[G6]    if (!write_back && !page_is_pin && PageWriteback(page)) {
            /* Back off as write back might have missed us */
            atomic_dec(&page->_mapcount);
            /* Wait for write back and re-try GUP */
            ...
            goto retry;
        }
    }

put_user_page() aka PUP:
[P1] if (!PageAnon(page)) atomic_dec(&page->_mapcount);
[P2] put_page(page);

page_mkclean():
[C1] pinned = TestClearPagePin(page);
[C2] smp_mb();
[C3] map_and_pin_count = atomic_read(&page->_mapcount);
[C4] map_count = rmap_walk(page);
[C5] if (pinned && map_count < map_and_pin_count) SetPagePin(page);

So with the above code we store the map and pin count inside the struct page
_mapcount field. The idea is that we can count the number of page table
entries that point to the page when reverse walking all the page mappings in
page_mkclean() [C4]. The issue is that GUP, PUP and page table entry zapping
can all run concurrently with page_mkclean(), so we cannot get the real map
and pin count and the real map count at a given point in time ([C5] for
instance in the above). However we only care about avoiding false negatives,
ie we do not want to report a page as unpinned when in fact it is pinned (it
has an active GUP). Avoiding false positives would be nice, but that would
need heavier-weight synchronization within GUP and PUP (we can mitigate it,
see the section on that below).

With the above scheme a page is _not_ pinned if and only if we have
real_map_count == real_map_and_pin_count at a given point in time. In the
above pseudo code the page is locked within page_mkclean(), thus no new page
table entry can be added, and the number of page mappings can only go down
(because of concurrent pte zapping). So no matter what happens, at [C5] we
have map_count <= real_map_count.

At [C3] we have two cases to consider:
[R1] A concurrent GUP after [C3]: then we do not care what happens at [C5],
     as the GUP will already have set the page pin flag. If it raced before
     [C3] with TestClearPagePin() at [C1], then the map_and_pin_count
     reflects the GUP thanks to the memory barriers [G3] and [C2].
[R2] No concurrent GUP after [C3]: then we only have concurrent PUP to worry
     about, and thus the real_map_and_pin_count can only go down.
So because we first snapshot that value, at [C5] we have:
    real_map_and_pin_count <= map_and_pin_count.

So at [C5] we end up with map_count <= real_map_count and with
real_map_and_pin_count <= map_and_pin_count, but we also always have
real_map_count <= real_map_and_pin_count. So we are in an
a <= b <= c <= d scenario, and if a == d then b == c. So at [C5], if
map_count == map_and_pin_count then we know for sure that
real_map_count == real_map_and_pin_count, and if that is the case then the
page is no longer pinned. So at [C5] we will never miss a pinned page (no
false negative).

Another way to word this: we always under-estimate the real map count and
over-estimate the map and pin count, thus we can never have a false negative
(the map count equal to the map and pin count while in fact the real map
count is lower than the real map and pin count).


PageWriteback() test and ordering with page_mkclean()
=====================================================

In GUP we test the page write back flag to avoid pinning a page that is
undergoing write back. That flag is set after page_mkclean(), so the
filesystem code that checks the pin flag needs a memory barrier:

    int __test_set_page_writeback(struct page *page, bool keep_write,
+                                 bool *use_bounce_page)
    {
        ...
[T1]    TestSetPageWriteback(page);
+[T2]   smp_wmb();
+[T3]   *use_bounce_page = PagePin(page);
        ...
    }

That way if there is a concurrent GUP we either have:
[R1] GUP sees the write back flag set before [G1], so it backs off.
[R2] GUP sees no write back before [G1]: here either GUP sees the write back
     flag at [G6], or [T3] sees the pin flag thanks to the memory barriers
     [G5] and [T2].

So in all cases we never miss a pin or a write back.
Mitigate false positives:
=========================

If false positives are ever an issue, we can improve the situation and
properly account for concurrent pte zapping with the following changes:

page_mkclean():
[C1] pinned = TestClearPagePin(page);
[C2] smp_mb();
[C3] map_and_pin_count = atomic_read(&page->_mapcount);
[C4] map_count = rmap_walk(page, &page_mkclean_one());
[C5] if (pinned && !PagePin(page) && map_count < map_and_pin_count) {
[C6]     map_and_pin_count2 = atomic_read(&page->_mapcount);
[C7]     map_count = rmap_walk(page, &page_map_count(), map_and_pin_count2);
[C8]     if (map_count < map_and_pin_count2)
             SetPagePin(page);
     }

page_map_count():
[M1] if (pte_valid(pte)) {
         map_count++;
     } else if (pte_special_zap(pte)) {
[M2]     unsigned long map_count_at_zap = pte_special_zap_to_value(pte);
[M3]     if (map_count_at_zap <= (map_and_pin_count & MASK))
             map_count++;
     }

And pte zapping of a file backed page will write a special pte entry which
holds the page map and pin count value at the time the pte is zapped. Also
page_mkclean_one() unconditionally replaces those special ptes with pte_none
and ignores them altogether. We only want to detect pte zapping that happens
after [C6] and before [C7] is done. With [M3] we are counting all page table
entries that have been zapped after the map_and_pin_count value we read at
[C6].

Again we have two cases:
[R1] A concurrent GUP after [C6]: then we do not care what happens at [C8],
     as the GUP will already have set the page pin flag.
[R2] No concurrent GUP: then we only have concurrent PUP to worry about. If
     they happen before [C6] they are included in the [C6] map_and_pin_count
     value. If after [C6], then we might miss a page that is no longer
     pinned, ie we over-estimate the map_and_pin_count
     (real_map_and_pin_count < map_and_pin_count at [C8]). So no false
     negatives, just false positives.

Here we get an accurate real_map_count as of [C6], so if the page was no
longer pinned at [C6] time we will correctly detect it and not set the flag
at [C8].
If there is any concurrent GUP, that GUP will set the flag properly.

There is one last thing to note about the above code: the MASK in [M3]. For
a special pte entry we might not have enough bits to store the whole map and
pin count value (on 32 bit archs), so we might expose ourselves to wrap
around. Again we do not care about the [R1] case, as any concurrent GUP will
set the pin flag. So we only care when the only things happening
concurrently are PUP or pte zapping. In both cases it means the map and pin
count is going down, so if there is a wrap around somewhere within
[C7]/page_map_count() we have:
  [t0] page_map_count() executed on some pte
  [t1] page_map_count() executed on another pte after [t0]
With:
  (map_count_t0 & MASK) < (map_count_t1 & MASK)
While in fact:
  map_count_t0 > map_count_t1

So if that happens we will under-estimate the map count, ie we will ignore
some of the concurrent pte zapping and not count it. So again we are only
exposing ourselves to false positives, not false negatives.

---------------------------------------------------------------------

Hope this proves that this solution does work. The false positives are
something that I believe is acceptable: we will get them only when racing
GUP or PUP. For a racing GUP it is safer to have a false positive. For a
racing PUP it would be nice to catch it, but hey, sometimes you just get
unlucky. Note that any other solution will also suffer from false positives,
because either way you are testing the page pin status at a given point in
time, so it can always race with a PUP. So the only difference with any
other solution is how long the false positive race window is.
Cheers,
Jérôme