From: Nadav Amit
Date: Mon, 10 Jul 2017 17:52:25 -0700
Subject: Potential race in TLB flush batching?
To: Mel Gorman, Andy Lutomirski
Cc: "open list:MEMORY MANAGEMENT"

Something bothers me about the TLB flushes batching mechanism that Linux
uses on x86 and I would appreciate your opinion regarding it.

As you know, try_to_unmap_one() can batch TLB invalidations. While doing so,
however, the page-table lock(s) are not held, and I see no indication of the
pending flush saved (and regarded) in the relevant mm-structs.

So, my question: what prevents, at least in theory, the following scenario:

	CPU0				CPU1
	----				----
					user accesses memory using RW PTE
					[PTE now cached in TLB]
	try_to_unmap_one()
	==> ptep_get_and_clear()
	==> set_tlb_ubc_flush_pending()
					mprotect(addr, PROT_READ)
					==> change_pte_range()
					==> [ PTE non-present - no flush ]

					user writes using cached RW PTE
	...

	try_to_unmap_flush()

As you see CPU1 write should have failed, but may succeed.

Now I don't have a PoC since in practice it seems hard to create such a
scenario: try_to_unmap_one() is likely to find the PTE accessed and the PTE
would not be reclaimed.

Yet, isn't it a problem? Am I missing something?

Thanks,
Nadav
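
The batching machinery referred to above lives in mm/rmap.c. In simplified,
paraphrased form (roughly the v4.12 shape, reproduced from memory, so field
and helper names may not match exactly), the PTE is cleared under the PTL but
the flush is only recorded in the reclaiming task and issued later:

	/* Paraphrased sketch of mm/rmap.c, not the verbatim kernel code. */
	static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
	{
		struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

		/*
		 * Called from try_to_unmap_one() right after ptep_get_and_clear(),
		 * still under the PTL. Nothing is flushed here and nothing is
		 * recorded in the mm itself - only in the reclaiming task.
		 */
		cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
		tlb_ubc->flush_required = true;

		/* Remember whether any batched PTE allowed writes. */
		if (writable)
			tlb_ubc->writable = true;
	}

	void try_to_unmap_flush(void)
	{
		struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

		if (!tlb_ubc->flush_required)
			return;

		/*
		 * Runs later in the reclaim path, long after the PTL was dropped.
		 * (The real code also flushes the local CPU when it is in the
		 * mask.) This gap is the window the question above is about.
		 */
		flush_tlb_others(&tlb_ubc->cpumask, NULL, 0, TLB_FLUSH_ALL);
		cpumask_clear(&tlb_ubc->cpumask);
		tlb_ubc->flush_required = false;
		tlb_ubc->writable = false;
	}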

From: Mel Gorman
Date: Tue, 11 Jul 2017 07:41:49 +0100
Subject: Re: Potential race in TLB flush batching?
To: Nadav Amit
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

On Mon, Jul 10, 2017 at 05:52:25PM -0700, Nadav Amit wrote:
> Something bothers me about the TLB flushes batching mechanism that Linux
> uses on x86 and I would appreciate your opinion regarding it.
> 
> As you know, try_to_unmap_one() can batch TLB invalidations. While doing so,
> however, the page-table lock(s) are not held, and I see no indication of the
> pending flush saved (and regarded) in the relevant mm-structs.
> 
> So, my question: what prevents, at least in theory, the following scenario:
> 
> 	CPU0				CPU1
> 	----				----
> 					user accesses memory using RW PTE
> 					[PTE now cached in TLB]
> 	try_to_unmap_one()
> 	==> ptep_get_and_clear()
> 	==> set_tlb_ubc_flush_pending()
> 					mprotect(addr, PROT_READ)
> 					==> change_pte_range()
> 					==> [ PTE non-present - no flush ]
> 
> 					user writes using cached RW PTE
> 	...
> 
> 	try_to_unmap_flush()
> 
> As you see CPU1 write should have failed, but may succeed.
> 
> Now I don't have a PoC since in practice it seems hard to create such a
> scenario: try_to_unmap_one() is likely to find the PTE accessed and the PTE
> would not be reclaimed.
> 

That is the same as a race whereby there is no batching mechanism and the
racing operation happens between a pte clear and a flush as ptep_clear_flush
is not atomic. All that differs is that the race window is a different size.
The application on CPU1 is buggy in that it may or may not succeed the write
but it is buggy regardless of whether a batching mechanism is used or not.

The user accessed the PTE before the mprotect so, at the time of mprotect,
the PTE is either clean or dirty. If it is clean then any subsequent write
would transition the PTE from clean to dirty and an architecture enabling
the batching mechanism must trap a clean->dirty transition for unmapped
entries as commented upon in try_to_unmap_one (and was checked that this
is true for x86 at least). This avoids data corruption due to a lost update.

If the previous access was a write then the batching flushes the page if
any IO is required to avoid any writes after the IO has been initiated
using try_to_unmap_flush_dirty so again there is no data corruption. There
is a window where the TLB entry exists after the unmapping but this exists
regardless of whether we batch or not.

In either case, before a page is freed and potentially allocated to another
process, the TLB is flushed.

> Yet, isn't it a problem? Am I missing something?
> 

It's not a problem as such as it's basically a buggy application that can
only hurt itself. I cannot see a path whereby the cached PTE can be used
to corrupt data by either accessing it after IO has been initiated (lost
data update) or access a physical page that has been allocated to another
process (arbitrary corruption).

-- 
Mel Gorman
SUSE Labs
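
The dirty-page handling mentioned above is a small helper in mm/rmap.c.
Paraphrased from the code of that era (tlb_ubc->writable is the field name
as remembered; details may differ by version), the deferred flush is forced
early only when some batched PTE was writable, before the page can be queued
for IO:

	/* Paraphrased sketch, not the verbatim kernel code. */
	void try_to_unmap_flush_dirty(void)
	{
		struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

		/*
		 * If any batched PTE mapped its page writable, a stale TLB entry
		 * could still allow a write after the PTE was cleared, so flush
		 * before IO is initiated. Clean-only batches keep deferring.
		 */
		if (tlb_ubc->writable)
			try_to_unmap_flush();
	}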

From: Nadav Amit
Date: Tue, 11 Jul 2017 00:30:28 -0700
Subject: Re: Potential race in TLB flush batching?
To: Mel Gorman
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

Mel Gorman wrote:

> On Mon, Jul 10, 2017 at 05:52:25PM -0700, Nadav Amit wrote:
>> Something bothers me about the TLB flushes batching mechanism that Linux
>> uses on x86 and I would appreciate your opinion regarding it.
>> 
>> As you know, try_to_unmap_one() can batch TLB invalidations. While doing so,
>> however, the page-table lock(s) are not held, and I see no indication of the
>> pending flush saved (and regarded) in the relevant mm-structs.
>> 
>> So, my question: what prevents, at least in theory, the following scenario:
>> 
>> 	CPU0				CPU1
>> 	----				----
>> 					user accesses memory using RW PTE
>> 					[PTE now cached in TLB]
>> 	try_to_unmap_one()
>> 	==> ptep_get_and_clear()
>> 	==> set_tlb_ubc_flush_pending()
>> 					mprotect(addr, PROT_READ)
>> 					==> change_pte_range()
>> 					==> [ PTE non-present - no flush ]
>> 
>> 					user writes using cached RW PTE
>> 	...
>> 
>> 	try_to_unmap_flush()
>> 
>> As you see CPU1 write should have failed, but may succeed.
>> 
>> Now I don't have a PoC since in practice it seems hard to create such a
>> scenario: try_to_unmap_one() is likely to find the PTE accessed and the PTE
>> would not be reclaimed.
> 
> That is the same as a race whereby there is no batching mechanism and the
> racing operation happens between a pte clear and a flush as ptep_clear_flush
> is not atomic. All that differs is that the race window is a different size.
> The application on CPU1 is buggy in that it may or may not succeed the write
> but it is buggy regardless of whether a batching mechanism is used or not.

Thanks for your quick and detailed response, but I fail to see how it can
happen without batching. Indeed, the PTE clear and flush are not "atomic",
but without batching they are both performed under the page table lock
(which is acquired in page_vma_mapped_walk and released in
page_vma_mapped_walk_done). Since the lock is taken, other cores should not
be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
and change_pte_range, acquire the lock before accessing the PTEs.

Can you please explain why you consider the application to be buggy?
AFAIU an application can wish to trap certain memory accesses using
userfaultfd or SIGSEGV. For example, it may do it for garbage collection or
sandboxing. To do so, it can use mprotect with PROT_NONE and expect to be
able to trap future accesses to that memory. This use-case is described in
the userfaultfd documentation.

> The user accessed the PTE before the mprotect so, at the time of mprotect,
> the PTE is either clean or dirty. If it is clean then any subsequent write
> would transition the PTE from clean to dirty and an architecture enabling
> the batching mechanism must trap a clean->dirty transition for unmapped
> entries as commented upon in try_to_unmap_one (and was checked that this
> is true for x86 at least). This avoids data corruption due to a lost update.
> 
> If the previous access was a write then the batching flushes the page if
> any IO is required to avoid any writes after the IO has been initiated
> using try_to_unmap_flush_dirty so again there is no data corruption. There
> is a window where the TLB entry exists after the unmapping but this exists
> regardless of whether we batch or not.
> 
> In either case, before a page is freed and potentially allocated to another
> process, the TLB is flushed.

To clarify my concern again - I am not regarding a memory corruption as you
do, but situations in which the application wishes to trap certain memory
accesses but fails to do so. Having said that, I would add that even if an
application has a bug, it may expect this bug not to affect memory that was
previously unmapped (and may be written to permanent storage).

Thanks (again),
Nadav

From: Mel Gorman
Date: Tue, 11 Jul 2017 10:29:35 +0100
Subject: Re: Potential race in TLB flush batching?
To: Nadav Amit
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

On Tue, Jul 11, 2017 at 12:30:28AM -0700, Nadav Amit wrote:
> Mel Gorman wrote:
> 
> > On Mon, Jul 10, 2017 at 05:52:25PM -0700, Nadav Amit wrote:
> >> Something bothers me about the TLB flushes batching mechanism that Linux
> >> uses on x86 and I would appreciate your opinion regarding it.
> >> 
> >> As you know, try_to_unmap_one() can batch TLB invalidations. While doing so,
> >> however, the page-table lock(s) are not held, and I see no indication of the
> >> pending flush saved (and regarded) in the relevant mm-structs.
> >> 
> >> So, my question: what prevents, at least in theory, the following scenario:
> >> 
> >> 	CPU0				CPU1
> >> 	----				----
> >> 					user accesses memory using RW PTE
> >> 					[PTE now cached in TLB]
> >> 	try_to_unmap_one()
> >> 	==> ptep_get_and_clear()
> >> 	==> set_tlb_ubc_flush_pending()
> >> 					mprotect(addr, PROT_READ)
> >> 					==> change_pte_range()
> >> 					==> [ PTE non-present - no flush ]
> >> 
> >> 					user writes using cached RW PTE
> >> 	...
> >> 
> >> 	try_to_unmap_flush()
> >> 
> >> As you see CPU1 write should have failed, but may succeed.
> >> 
> >> Now I don't have a PoC since in practice it seems hard to create such a
> >> scenario: try_to_unmap_one() is likely to find the PTE accessed and the PTE
> >> would not be reclaimed.
> > 
> > That is the same as a race whereby there is no batching mechanism and the
> > racing operation happens between a pte clear and a flush as ptep_clear_flush
> > is not atomic. All that differs is that the race window is a different size.
> > The application on CPU1 is buggy in that it may or may not succeed the write
> > but it is buggy regardless of whether a batching mechanism is used or not.
> 
> Thanks for your quick and detailed response, but I fail to see how it can
> happen without batching. Indeed, the PTE clear and flush are not "atomic",
> but without batching they are both performed under the page table lock
> (which is acquired in page_vma_mapped_walk and released in
> page_vma_mapped_walk_done). Since the lock is taken, other cores should not
> be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
> and change_pte_range, acquire the lock before accessing the PTEs.
> 

I was primarily thinking in terms of memory corruption or data loss.
However, we are still protected although it's not particularly obvious why.

On the reclaim side, we are either reclaiming clean pages (which ignore
the accessed bit) or normal reclaim. If it's clean pages then any parallel
write must update the dirty bit at minimum. If it's normal reclaim then
the accessed bit is checked and if cleared in try_to_unmap_one, it uses a
ptep_clear_flush_young_notify so the TLB gets flushed. We don't reclaim
the page in either as part of page_referenced or try_to_unmap_one but
clearing the accessed bit flushes the TLB.

On the mprotect side then, as the page was first accessed, clearing the
accessed bit incurs a TLB flush on the reclaim side before the second write.
That means any TLB entry that exists cannot have the accessed bit set so
a second write needs to update it.

While it's not clearly documented, I checked with hardware engineers
at the time that an update of the accessed or dirty bit even with a TLB
entry will check the underlying page tables and trap if it's not present
and the subsequent fault will then fail on sigsegv if the VMA protections
no longer allow the write.

So, on one side if ignoring the accessed bit during reclaim, the pages
are clean so any access will set the dirty bit and trap if unmapped in
parallel. On the other side, the accessed bit if set cleared the TLB and
if not set, then the hardware needs to update and again will trap if
unmapped in parallel.

If this guarantee from hardware was ever shown to be wrong or another
architecture wanted to add batching without the same guarantee then mprotect
would need to do a local_flush_tlb if no pages were updated by the mprotect
but right now, this should not be necessary.

> Can you please explain why you consider the application to be buggy?

I considered it a bit dumb to mprotect for READ/NONE and then try writing
the same mapping. However, it will behave as expected.

> AFAIU
> an application can wish to trap certain memory accesses using userfaultfd or
> SIGSEGV. For example, it may do it for garbage collection or sandboxing. To
> do so, it can use mprotect with PROT_NONE and expect to be able to trap
> future accesses to that memory. This use-case is described in the userfaultfd
> documentation.
> 

Such applications are safe due to how the accessed bit is handled by the
software (flushes TLB if clearing young) and hardware (traps if updating
the accessed or dirty bit and the underlying PTE was unmapped even if
there is a TLB entry).

-- 
Mel Gorman
SUSE Labs

From: Nadav Amit
Date: Tue, 11 Jul 2017 03:40:02 -0700
Subject: Re: Potential race in TLB flush batching?
To: Mel Gorman
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

Mel Gorman wrote:

> On Tue, Jul 11, 2017 at 12:30:28AM -0700, Nadav Amit wrote:
>> Mel Gorman wrote:
>> 
>>> On Mon, Jul 10, 2017 at 05:52:25PM -0700, Nadav Amit wrote:
>>>> Something bothers me about the TLB flushes batching mechanism that Linux
>>>> uses on x86 and I would appreciate your opinion regarding it.
>>>> 
>>>> As you know, try_to_unmap_one() can batch TLB invalidations. While doing so,
>>>> however, the page-table lock(s) are not held, and I see no indication of the
>>>> pending flush saved (and regarded) in the relevant mm-structs.
>>>> 
>>>> So, my question: what prevents, at least in theory, the following scenario:
>>>> 
>>>> 	CPU0				CPU1
>>>> 	----				----
>>>> 					user accesses memory using RW PTE
>>>> 					[PTE now cached in TLB]
>>>> 	try_to_unmap_one()
>>>> 	==> ptep_get_and_clear()
>>>> 	==> set_tlb_ubc_flush_pending()
>>>> 					mprotect(addr, PROT_READ)
>>>> 					==> change_pte_range()
>>>> 					==> [ PTE non-present - no flush ]
>>>> 
>>>> 					user writes using cached RW PTE
>>>> 	...
>>>> 
>>>> 	try_to_unmap_flush()
>>>> 
>>>> As you see CPU1 write should have failed, but may succeed.
>>>> 
>>>> Now I don't have a PoC since in practice it seems hard to create such a
>>>> scenario: try_to_unmap_one() is likely to find the PTE accessed and the PTE
>>>> would not be reclaimed.
>>> 
>>> That is the same as a race whereby there is no batching mechanism and the
>>> racing operation happens between a pte clear and a flush as ptep_clear_flush
>>> is not atomic. All that differs is that the race window is a different size.
>>> The application on CPU1 is buggy in that it may or may not succeed the write
>>> but it is buggy regardless of whether a batching mechanism is used or not.
>> 
>> Thanks for your quick and detailed response, but I fail to see how it can
>> happen without batching. Indeed, the PTE clear and flush are not "atomic",
>> but without batching they are both performed under the page table lock
>> (which is acquired in page_vma_mapped_walk and released in
>> page_vma_mapped_walk_done). Since the lock is taken, other cores should not
>> be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
>> and change_pte_range, acquire the lock before accessing the PTEs.
> 
> I was primarily thinking in terms of memory corruption or data loss.
> However, we are still protected although it's not particularly obvious why.
> 
> On the reclaim side, we are either reclaiming clean pages (which ignore
> the accessed bit) or normal reclaim. If it's clean pages then any parallel
> write must update the dirty bit at minimum. If it's normal reclaim then
> the accessed bit is checked and if cleared in try_to_unmap_one, it uses a
> ptep_clear_flush_young_notify so the TLB gets flushed. We don't reclaim
> the page in either as part of page_referenced or try_to_unmap_one but
> clearing the accessed bit flushes the TLB.

Wait. Are you looking at the x86 arch function? The TLB is not flushed when
the access bit is cleared:

int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	/*
	 * On x86 CPUs, clearing the accessed bit without a TLB flush
	 * doesn't cause data corruption. [ It could cause incorrect
	 * page aging and the (mistaken) reclaim of hot pages, but the
	 * chance of that should be relatively low. ]
	 *
	 * So as a performance optimization don't flush the TLB when
	 * clearing the accessed bit, it will eventually be flushed by
	 * a context switch or a VM operation anyway. [ In the rare
	 * event of it not getting flushed for a long time the delay
	 * shouldn't really matter because there's no real memory
	 * pressure for swapout to react to. ]
	 */
	return ptep_test_and_clear_young(vma, address, ptep);
}

> 
> On the mprotect side then, as the page was first accessed, clearing the
> accessed bit incurs a TLB flush on the reclaim side before the second write.
> That means any TLB entry that exists cannot have the accessed bit set so
> a second write needs to update it.
> 
> While it's not clearly documented, I checked with hardware engineers
> at the time that an update of the accessed or dirty bit even with a TLB
> entry will check the underlying page tables and trap if it's not present
> and the subsequent fault will then fail on sigsegv if the VMA protections
> no longer allow the write.
> 
> So, on one side if ignoring the accessed bit during reclaim, the pages
> are clean so any access will set the dirty bit and trap if unmapped in
> parallel. On the other side, the accessed bit if set cleared the TLB and
> if not set, then the hardware needs to update and again will trap if
> unmapped in parallel.

Yet, even regardless of the TLB flush it seems there is still a possible
race:

	CPU0					CPU1
	----					----
	ptep_clear_flush_young_notify
	==> PTE.A==0
						access PTE
						==> PTE.A=1
	ptep_get_and_clear
	change mapping (and PTE)
						Use stale TLB entry

> If this guarantee from hardware was ever shown to be wrong or another
> architecture wanted to add batching without the same guarantee then mprotect
> would need to do a local_flush_tlb if no pages were updated by the mprotect
> but right now, this should not be necessary.
> 
>> Can you please explain why you consider the application to be buggy?
> 
> I considered it a bit dumb to mprotect for READ/NONE and then try writing
> the same mapping. However, it will behave as expected.

I don't think that this is the only scenario. For example, the application
may create a new memory mapping of a different file using mmap at the same
memory address that was used before, just as that memory is reclaimed. The
application can (inadvertently) cause such a scenario by using MAP_FIXED.
But even without MAP_FIXED, running mmap->munmap->mmap can reuse the same
virtual address.

>> AFAIU
>> an application can wish to trap certain memory accesses using userfaultfd or
>> SIGSEGV. For example, it may do it for garbage collection or sandboxing. To
>> do so, it can use mprotect with PROT_NONE and expect to be able to trap
>> future accesses to that memory. This use-case is described in the userfaultfd
>> documentation.
> 
> Such applications are safe due to how the accessed bit is handled by the
> software (flushes TLB if clearing young) and hardware (traps if updating
> the accessed or dirty bit and the underlying PTE was unmapped even if
> there is a TLB entry).

I don't think it is so. And I also think there are many additional
potentially problematic scenarios.

Thanks for your patience,
Nadav

From: Mel Gorman
Date: Tue, 11 Jul 2017 14:20:23 +0100
Subject: Re: Potential race in TLB flush batching?
To: Nadav Amit
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

On Tue, Jul 11, 2017 at 03:40:02AM -0700, Nadav Amit wrote:
> Mel Gorman wrote:
> 
> >>> That is the same as a race whereby there is no batching mechanism and the
> >>> racing operation happens between a pte clear and a flush as ptep_clear_flush
> >>> is not atomic. All that differs is that the race window is a different size.
> >>> The application on CPU1 is buggy in that it may or may not succeed the write
> >>> but it is buggy regardless of whether a batching mechanism is used or not.
> >> 
> >> Thanks for your quick and detailed response, but I fail to see how it can
> >> happen without batching. Indeed, the PTE clear and flush are not "atomic",
> >> but without batching they are both performed under the page table lock
> >> (which is acquired in page_vma_mapped_walk and released in
> >> page_vma_mapped_walk_done). Since the lock is taken, other cores should not
> >> be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
> >> and change_pte_range, acquire the lock before accessing the PTEs.
> > 
> > I was primarily thinking in terms of memory corruption or data loss.
> > However, we are still protected although it's not particularly obvious why.
> > 
> > On the reclaim side, we are either reclaiming clean pages (which ignore
> > the accessed bit) or normal reclaim. If it's clean pages then any parallel
> > write must update the dirty bit at minimum. If it's normal reclaim then
> > the accessed bit is checked and if cleared in try_to_unmap_one, it uses a
> > ptep_clear_flush_young_notify so the TLB gets flushed. We don't reclaim
> > the page in either as part of page_referenced or try_to_unmap_one but
> > clearing the accessed bit flushes the TLB.
> 
> Wait. Are you looking at the x86 arch function? The TLB is not flushed when
> the access bit is cleared:
> 
> int ptep_clear_flush_young(struct vm_area_struct *vma,
> 			   unsigned long address, pte_t *ptep)
> {
> 	/*
> 	 * On x86 CPUs, clearing the accessed bit without a TLB flush
> 	 * doesn't cause data corruption. [ It could cause incorrect
> 	 * page aging and the (mistaken) reclaim of hot pages, but the
> 	 * chance of that should be relatively low. ]
> 	 *
> 	 * So as a performance optimization don't flush the TLB when
> 	 * clearing the accessed bit, it will eventually be flushed by
> 	 * a context switch or a VM operation anyway. [ In the rare
> 	 * event of it not getting flushed for a long time the delay
> 	 * shouldn't really matter because there's no real memory
> 	 * pressure for swapout to react to. ]
> 	 */
> 	return ptep_test_and_clear_young(vma, address, ptep);
> }
> 

I forgot this detail, thanks for correcting me.

> > 
> > On the mprotect side then, as the page was first accessed, clearing the
> > accessed bit incurs a TLB flush on the reclaim side before the second write.
> > That means any TLB entry that exists cannot have the accessed bit set so
> > a second write needs to update it.
> > 
> > While it's not clearly documented, I checked with hardware engineers
> > at the time that an update of the accessed or dirty bit even with a TLB
> > entry will check the underlying page tables and trap if it's not present
> > and the subsequent fault will then fail on sigsegv if the VMA protections
> > no longer allow the write.
> > 
> > So, on one side if ignoring the accessed bit during reclaim, the pages
> > are clean so any access will set the dirty bit and trap if unmapped in
> > parallel. On the other side, the accessed bit if set cleared the TLB and
> > if not set, then the hardware needs to update and again will trap if
> > unmapped in parallel.
> 
> Yet, even regardless of the TLB flush it seems there is still a possible
> race:
> 
> 	CPU0					CPU1
> 	----					----
> 	ptep_clear_flush_young_notify
> 	==> PTE.A==0
> 						access PTE
> 						==> PTE.A=1
> 	ptep_get_and_clear
> 	change mapping (and PTE)
> 						Use stale TLB entry

So I think you're right and this is a potential race.
The first access can be a read or a write as it's a problem if the mprotect
call restricts access.

> > If this guarantee from hardware was ever shown to be wrong or another
> > architecture wanted to add batching without the same guarantee then mprotect
> > would need to do a local_flush_tlb if no pages were updated by the mprotect
> > but right now, this should not be necessary.
> > 
> >> Can you please explain why you consider the application to be buggy?
> > 
> > I considered it a bit dumb to mprotect for READ/NONE and then try writing
> > the same mapping. However, it will behave as expected.
> 
> I don't think that this is the only scenario. For example, the application
> may create a new memory mapping of a different file using mmap at the same
> memory address that was used before, just as that memory is reclaimed.

That requires the existing mapping to be unmapped which will flush the
TLB and parallel mmap/munmap serialises on mmap_sem. The race appears to
be specific to mprotect which avoids the TLB flush if no pages were updated.

> The
> application can (inadvertently) cause such a scenario by using MAP_FIXED.
> But even without MAP_FIXED, running mmap->munmap->mmap can reuse the same
> virtual address.
> 

With flushes in between.

> > Such applications are safe due to how the accessed bit is handled by the
> > software (flushes TLB if clearing young) and hardware (traps if updating
> > the accessed or dirty bit and the underlying PTE was unmapped even if
> > there is a TLB entry).
> 
> I don't think it is so. And I also think there are many additional
> potentially problematic scenarios.
> 

I believe it's specific to mprotect but can be handled by flushing the
local TLB when mprotect updates no pages. Something like this;

---8<---
mm, mprotect: Flush the local TLB if mprotect potentially raced with a parallel reclaim

Nadav Amit identified a theoretical race between page reclaim and mprotect
due to TLB flushes being batched outside of the PTL being held. He described
the race as follows

	CPU0				CPU1
	----				----
					user accesses memory using RW PTE
					[PTE now cached in TLB]
	try_to_unmap_one()
	==> ptep_get_and_clear()
	==> set_tlb_ubc_flush_pending()
					mprotect(addr, PROT_READ)
					==> change_pte_range()
					==> [ PTE non-present - no flush ]

					user writes using cached RW PTE
	...

	try_to_unmap_flush()

The same type of race exists for reads when protecting for PROT_NONE.
This is not a data integrity issue as the TLB is always flushed before any
IO is queued or a page is freed but it is a correctness issue as a process
restricting access with mprotect() may still be able to access the data
after the syscall returns due to a stale TLB entry. Handle this issue by
flushing the local TLB if reclaim is potentially batching TLB flushes and
mprotect altered no pages.

Signed-off-by: Mel Gorman
Cc: stable@vger.kernel.org # v4.4+
---
 mm/internal.h |  5 ++++-
 mm/mprotect.c | 12 ++++++++++--
 mm/rmap.c     | 20 ++++++++++++++++++++
 3 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558412fb..9b7d1a597816 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 void try_to_unmap_flush(void);
 void try_to_unmap_flush_dirty(void);
+void batched_unmap_protection_update(void);
 #else
 static inline void try_to_unmap_flush(void)
 {
@@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
 static inline void try_to_unmap_flush_dirty(void)
 {
 }
-
+static inline void batched_unmap_protection_update()
+{
+}
 #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
 
 extern const struct trace_print_flags pageflag_names[];
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..3de353d4b5fb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -254,9 +254,17 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 					 dirty_accountable, prot_numa);
 	} while (pgd++, addr = next, addr != end);
 
-	/* Only flush the TLB if we actually modified any entries: */
-	if (pages)
+	/*
+	 * Only flush all TLBs if we actually modified any entries. If no
+	 * pages are modified, then call batched_unmap_protection_update
+	 * if the context is a mprotect() syscall.
+	 */
+	if (pages) {
 		flush_tlb_range(vma, start, end);
+	} else {
+		if (!prot_numa)
+			batched_unmap_protection_update();
+	}
 	clear_tlb_flush_pending(mm);
 
 	return pages;
diff --git a/mm/rmap.c b/mm/rmap.c
index d405f0e0ee96..02cb035e4ce6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -643,6 +643,26 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 
 	return should_defer;
 }
+
+/*
+ * This is called after an mprotect update that altered no pages. Batched
+ * unmap releases the PTL before a flush occurs leaving a window where
+ * an mprotect that reduces access rights can still access the page after
+ * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
+ * the local TLB if mprotect updates no pages so that the caller of
+ * mprotect always gets expected behaviour. It's overkill and unnecessary to
+ * flush all TLBs as a separate thread accessing the data that raced with
+ * both reclaim and mprotect as there is no risk of data corruption and
+ * the exact timing of a parallel thread seeing a protection update without
+ * any serialisation on the application side is always uncertain.
+ */
+void batched_unmap_protection_update(void)
+{
+	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+	local_flush_tlb();
+	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
+}
+
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {
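
The userspace-visible behaviour the patch above is trying to guarantee can
be stated as a small test program (an illustrative sketch, not a reproducer
of the race itself, which additionally requires reclaim to unmap the page at
just the right moment): once mprotect(PROT_READ) has returned, a write to
the range must trap.

	#include <setjmp.h>
	#include <signal.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	static sigjmp_buf env;

	static void segv_handler(int sig)
	{
		(void)sig;
		siglongjmp(env, 1);
	}

	int main(void)
	{
		long page = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* Touch the page so an RW PTE exists and is likely cached in the TLB. */
		*(volatile char *)p = 1;

		signal(SIGSEGV, segv_handler);

		if (mprotect(p, page, PROT_READ)) {
			perror("mprotect");
			return 1;
		}

		if (sigsetjmp(env, 1) == 0) {
			/* Must fault; a stale RW TLB entry would let it succeed. */
			*(volatile char *)p = 2;
			puts("write after mprotect(PROT_READ) unexpectedly succeeded");
			return 1;
		}

		puts("write after mprotect(PROT_READ) trapped as expected");
		return 0;
	}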

From: Andy Lutomirski
Date: Tue, 11 Jul 2017 07:58:04 -0700
Subject: Re: Potential race in TLB flush batching?
To: Mel Gorman
Cc: Nadav Amit, Andy Lutomirski, "open list:MEMORY MANAGEMENT"

On Tue, Jul 11, 2017 at 6:20 AM, Mel Gorman wrote:
> +
> +/*
> + * This is called after an mprotect update that altered no pages. Batched
> + * unmap releases the PTL before a flush occurs leaving a window where
> + * an mprotect that reduces access rights can still access the page after
> + * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
> + * the local TLB if mprotect updates no pages so that the caller of
> + * mprotect always gets expected behaviour. It's overkill and unnecessary to
> + * flush all TLBs as a separate thread accessing the data that raced with
> + * both reclaim and mprotect as there is no risk of data corruption and
> + * the exact timing of a parallel thread seeing a protection update without
> + * any serialisation on the application side is always uncertain.
> + */
> +void batched_unmap_protection_update(void)
> +{
> +	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> +	local_flush_tlb();
> +	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
> +}
> +

What about remote CPUs?  You could get migrated right after mprotect()
or the inconsistency could be observed on another CPU.

I also really don't like bypassing arch code like this.  The
implementation of flush_tlb_mm_range() in tip:x86/mm (and slated for
this merge window!) is *very* different from what's there now, and it
is not written in the expectation that some generic code might call
local_tlb_flush() and expect any kind of coherency at all.

I'm also still nervous about situations in which, while a batched
flush is active, a user calls mprotect() and then does something else
that gets confused by the fact that there's an RO PTE and doesn't
flush out the RW TLB entry.  COWing a page, perhaps?

Would a better fix perhaps be to find a way to figure out whether a
batched flush is pending on the mm in question and flush it out if you
do any optimizations based on assuming that the TLB is in any respect
consistent with the page tables?  With the changes in -tip, x86 could,
in principle, supply a function to sync up its TLB state.  That would
require cross-CPU poking at state or an unconditional IPI (that might
end up not flushing anything), but either is doable.

From: Mel Gorman
Date: Tue, 11 Jul 2017 16:53:12 +0100
Subject: Re: Potential race in TLB flush batching?
To: Andy Lutomirski
Cc: Nadav Amit, "open list:MEMORY MANAGEMENT"

On Tue, Jul 11, 2017 at 07:58:04AM -0700, Andrew Lutomirski wrote:
> On Tue, Jul 11, 2017 at 6:20 AM, Mel Gorman wrote:
> > +
> > +/*
> > + * This is called after an mprotect update that altered no pages. Batched
> > + * unmap releases the PTL before a flush occurs leaving a window where
> > + * an mprotect that reduces access rights can still access the page after
> > + * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
> > + * the local TLB if mprotect updates no pages so that the caller of
> > + * mprotect always gets expected behaviour. It's overkill and unnecessary to
> > + * flush all TLBs as a separate thread accessing the data that raced with
> > + * both reclaim and mprotect as there is no risk of data corruption and
> > + * the exact timing of a parallel thread seeing a protection update without
> > + * any serialisation on the application side is always uncertain.
> > + */
> > +void batched_unmap_protection_update(void)
> > +{
> > +	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> > +	local_flush_tlb();
> > +	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
> > +}
> > +
> 
> What about remote CPUs?  You could get migrated right after mprotect()
> or the inconsistency could be observed on another CPU.

If it's migrated then it has also context switched so the TLB entry will
be read for the first time. If the entry is inconsistent for another CPU
accessing the data then it'll potentially successfully access a page that
was just mprotected but this is similar to simply racing with the call
to mprotect itself. The timing isn't exact, nor does it need to be. One
thread accessing data racing with another thread doing mprotect without
any synchronisation in the application is always going to be unreliable.

I'm less certain once PCID tracking is in place and whether it's possible
for a process to be context switching fast enough to allow an access. If
it's possible then batching would require an unconditional flush on mprotect
even if no pages are updated if access is being limited by the mprotect
which would be unfortunate.

> I also really
> don't like bypassing arch code like this.  The implementation of
> flush_tlb_mm_range() in tip:x86/mm (and slated for this merge window!)
> is *very* different from what's there now, and it is not written in
> the expectation that some generic code might call local_tlb_flush()
> and expect any kind of coherency at all.
> 

Assuming that gets merged first then the most straight-forward approach
would be to set up an arch_tlbflush_unmap_batch with just the local CPU set
in the mask or something similar.

> I'm also still nervous about situations in which, while a batched
> flush is active, a user calls mprotect() and then does something else
> that gets confused by the fact that there's an RO PTE and doesn't
> flush out the RW TLB entry.  COWing a page, perhaps?
> 

The race in question only applies if mprotect had no PTEs to update. If
any page was updated then the TLB is flushed before mprotect returns. With
the patch (or a variant on top of your work), at least the local TLB will
be flushed even if no PTEs were updated. This might be more expensive than
it has to be but I expect that mprotects on a range with no PTEs to update
are fairly rare.

> Would a better fix perhaps be to find a way to figure out whether a
> batched flush is pending on the mm in question and flush it out if you
> do any optimizations based on assuming that the TLB is in any respect
> consistent with the page tables?  With the changes in -tip, x86 could,
> in principle, supply a function to sync up its TLB state.  That would
> require cross-CPU poking at state or an unconditional IPI (that might
> end up not flushing anything), but either is doable.

It's potentially doable if a field like tlb_flush_pending was added to
mm_struct that is set when batching starts. I don't think there is a
logical place where it can be cleared as when the TLB gets flushed by
reclaim, it can't rmap again to clear the flag. What would happen is that
the first mprotect after any batching happened at any point in the past
would have to unconditionally flush the TLB and then clear the flag. That
would be a relatively minor hit and cover all the possibilities and should
work unmodified with or without your series applied. Would that be
preferable to you?

-- 
Mel Gorman
SUSE Labs
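
The scheme Mel outlines would amount to something like the sketch below.
The field and helper names are hypothetical (mm_struct has no such flag at
this point in the thread); it is shown only to make the proposal concrete.

	/* Hypothetical sketch of "flush on first mprotect after any batching". */

	static inline void note_deferred_tlb_flush(struct mm_struct *mm)
	{
		/* Would be called from set_tlb_ubc_flush_pending(). */
		mm->tlb_flush_batched = true;	/* hypothetical flag */
	}

	static inline void flush_tlb_batched_pending(struct mm_struct *mm)
	{
		/*
		 * Called by paths that assume the TLB matches the page tables,
		 * e.g. change_protection_range() when it updated no PTEs.
		 * Reclaim never clears the flag itself, so the cost is one
		 * extra flush for the first such operation after batching.
		 */
		if (mm->tlb_flush_batched) {
			flush_tlb_mm(mm);
			mm->tlb_flush_batched = false;
		}
	}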

From: Nadav Amit
Date: Tue, 11 Jul 2017 09:22:47 -0700
Subject: Re: Potential race in TLB flush batching?
To: Mel Gorman
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

Mel Gorman wrote:

> On Tue, Jul 11, 2017 at 03:40:02AM -0700, Nadav Amit wrote:
>> Mel Gorman wrote:
>> 
>>>>> That is the same as a race whereby there is no batching mechanism and the
>>>>> racing operation happens between a pte clear and a flush as ptep_clear_flush
>>>>> is not atomic. All that differs is that the race window is a different size.
>>>>> The application on CPU1 is buggy in that it may or may not succeed the write
>>>>> but it is buggy regardless of whether a batching mechanism is used or not.
>>>> 
>>>> Thanks for your quick and detailed response, but I fail to see how it can
>>>> happen without batching. Indeed, the PTE clear and flush are not "atomic",
>>>> but without batching they are both performed under the page table lock
>>>> (which is acquired in page_vma_mapped_walk and released in
>>>> page_vma_mapped_walk_done). Since the lock is taken, other cores should not
>>>> be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
>>>> and change_pte_range, acquire the lock before accessing the PTEs.
>>> 
>>> I was primarily thinking in terms of memory corruption or data loss.
>>> However, we are still protected although it's not particularly obvious why.
>>> 
>>> On the reclaim side, we are either reclaiming clean pages (which ignore
>>> the accessed bit) or normal reclaim. If it's clean pages then any parallel
>>> write must update the dirty bit at minimum. If it's normal reclaim then
>>> the accessed bit is checked and if cleared in try_to_unmap_one, it uses a
>>> ptep_clear_flush_young_notify so the TLB gets flushed. We don't reclaim
>>> the page in either as part of page_referenced or try_to_unmap_one but
>>> clearing the accessed bit flushes the TLB.
>> 
>> Wait. Are you looking at the x86 arch function? The TLB is not flushed when
>> the access bit is cleared:
>> 
>> int ptep_clear_flush_young(struct vm_area_struct *vma,
>> 			   unsigned long address, pte_t *ptep)
>> {
>> 	/*
>> 	 * On x86 CPUs, clearing the accessed bit without a TLB flush
>> 	 * doesn't cause data corruption. [ It could cause incorrect
>> 	 * page aging and the (mistaken) reclaim of hot pages, but the
>> 	 * chance of that should be relatively low. ]
>> 	 *
>> 	 * So as a performance optimization don't flush the TLB when
>> 	 * clearing the accessed bit, it will eventually be flushed by
>> 	 * a context switch or a VM operation anyway. [ In the rare
>> 	 * event of it not getting flushed for a long time the delay
>> 	 * shouldn't really matter because there's no real memory
>> 	 * pressure for swapout to react to. ]
>> 	 */
>> 	return ptep_test_and_clear_young(vma, address, ptep);
>> }
> 
> I forgot this detail, thanks for correcting me.
> 
>>> On the mprotect side then, as the page was first accessed, clearing the
>>> accessed bit incurs a TLB flush on the reclaim side before the second write.
>>> 
>>> While it's not clearly documented, I checked with hardware engineers
>>> at the time that an update of the accessed or dirty bit even with a TLB
>>> entry will check the underlying page tables and trap if it's not present
>>> and the subsequent fault will then fail on sigsegv if the VMA protections
>>> no longer allow the write.
>>> 
>>> So, on one side if ignoring the accessed bit during reclaim, the pages
>>> are clean so any access will set the dirty bit and trap if unmapped in
>>> parallel. On the other side, the accessed bit if set cleared the TLB and
>>> if not set, then the hardware needs to update and again will trap if
>>> unmapped in parallel.
>> 
>> Yet, even regardless of the TLB flush it seems there is still a possible
>> race:
>> 
>> 	CPU0					CPU1
>> 	----					----
>> 	ptep_clear_flush_young_notify
>> 	==> PTE.A==0
>> 						access PTE
>> 						==> PTE.A=1
>> 	ptep_get_and_clear
>> 	change mapping (and PTE)
>> 						Use stale TLB entry
> 
> So I think you're right and this is a potential race. The first access can
> be a read or a write as it's a problem if the mprotect call restricts
> access.
> 
>>> If this guarantee from hardware was ever shown to be wrong or another
>>> architecture wanted to add batching without the same guarantee then mprotect
>>> would need to do a local_flush_tlb if no pages were updated by the mprotect
>>> but right now, this should not be necessary.
>>> 
>>>> Can you please explain why you consider the application to be buggy?
>>> 
>>> I considered it a bit dumb to mprotect for READ/NONE and then try writing
>>> the same mapping. However, it will behave as expected.
>> 
>> I don't think that this is the only scenario. For example, the application
>> may create a new memory mapping of a different file using mmap at the same
>> memory address that was used before, just as that memory is reclaimed.
> 
> That requires the existing mapping to be unmapped which will flush the
> TLB and parallel mmap/munmap serialises on mmap_sem. The race appears to
> be specific to mprotect which avoids the TLB flush if no pages were updated.

Why? As far as I see the chain of calls during munmap is somewhat like:

do_munmap
=>unmap_region
==>tlb_gather_mmu
===>unmap_vmas
====>unmap_page_range
...
=====>zap_pte_range		- this one batches only present PTEs
===>free_pgtables		- this one is only if page-tables are removed
===>pte_free_tlb
==>tlb_finish_mmu
===>tlb_flush_mmu
====>tlb_flush_mmu_tlbonly

zap_pte_range will check if pte_none and can find it is - if a concurrent
try_to_unmap_one already cleared the PTE. In this case it will not update
the range of the mmu_gather and would not indicate that a flush of the PTE
is needed. Then, tlb_flush_mmu_tlbonly will find that no PTE was cleared
(tlb->end == 0) and avoid flush, or may just flush fewer PTEs than actually
needed.

Due to this behavior, it raises a concern that in other cases as well, when
mmu_gather is used, a PTE flush may be missed.

>> The
>> application can (inadvertently) cause such a scenario by using MAP_FIXED.
>> But even without MAP_FIXED, running mmap->munmap->mmap can reuse the same
>> virtual address.
> 
> With flushes in between.
> 
>>> Such applications are safe due to how the accessed bit is handled by the
>>> software (flushes TLB if clearing young) and hardware (traps if updating
>>> the accessed or dirty bit and the underlying PTE was unmapped even if
>>> there is a TLB entry).
>> 
>> I don't think it is so. And I also think there are many additional
>> potentially problematic scenarios.
> 
> I believe it's specific to mprotect but can be handled by flushing the
> local TLB when mprotect updates no pages. Something like this;
> 
> ---8<---
> mm, mprotect: Flush the local TLB if mprotect potentially raced with a parallel reclaim
> 
> Nadav Amit identified a theoretical race between page reclaim and mprotect
> due to TLB flushes being batched outside of the PTL being held. He described
> the race as follows
> 
> 	CPU0				CPU1
> 	----				----
> 					user accesses memory using RW PTE
> 					[PTE now cached in TLB]
> 	try_to_unmap_one()
> 	==> ptep_get_and_clear()
> 	==> set_tlb_ubc_flush_pending()
> 					mprotect(addr, PROT_READ)
> 					==> change_pte_range()
> 					==> [ PTE non-present - no flush ]
> 
> 					user writes using cached RW PTE
> 	...
> 
> 	try_to_unmap_flush()
> 
> The same type of race exists for reads when protecting for PROT_NONE.
> This is not a data integrity issue as the TLB is always flushed before any
> IO is queued or a page is freed but it is a correctness issue as a process
> restricting access with mprotect() may still be able to access the data
> after the syscall returns due to a stale TLB entry. Handle this issue by
> flushing the local TLB if reclaim is potentially batching TLB flushes and
> mprotect altered no pages.
> 
> Signed-off-by: Mel Gorman
> Cc: stable@vger.kernel.org # v4.4+
> ---
>  mm/internal.h |  5 ++++-
>  mm/mprotect.c | 12 ++++++++++--
>  mm/rmap.c     | 20 ++++++++++++++++++++
>  3 files changed, 34 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 0e4f558412fb..9b7d1a597816 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
>  #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>  void try_to_unmap_flush(void);
>  void try_to_unmap_flush_dirty(void);
> +void batched_unmap_protection_update(void);
>  #else
>  static inline void try_to_unmap_flush(void)
>  {
> @@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
>  static inline void try_to_unmap_flush_dirty(void)
>  {
>  }
> -
> +static inline void batched_unmap_protection_update()
> +{
> +}
>  #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
> 
>  extern const struct trace_print_flags pageflag_names[];
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8edd0d576254..3de353d4b5fb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -254,9 +254,17 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
>  					 dirty_accountable, prot_numa);
>  	} while (pgd++, addr = next, addr != end);
> 
> -	/* Only flush the TLB if we actually modified any entries: */
> -	if (pages)
> +	/*
> +	 * Only flush all TLBs if we actually modified any entries. If no
> +	 * pages are modified, then call batched_unmap_protection_update
> +	 * if the context is a mprotect() syscall.
> + */ > + if (pages) { > flush_tlb_range(vma, start, end); > + } else { > + if (!prot_numa) > + batched_unmap_protection_update(); > + } > clear_tlb_flush_pending(mm); >=20 > return pages; > diff --git a/mm/rmap.c b/mm/rmap.c > index d405f0e0ee96..02cb035e4ce6 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -643,6 +643,26 @@ static bool should_defer_flush(struct mm_struct = *mm, enum ttu_flags flags) >=20 > return should_defer; > } > + > +/* > + * This is called after an mprotect update that altered no pages. = Batched > + * unmap releases the PTL before a flush occurs leaving a window = where > + * an mprotect that reduces access rights can still access the page = after > + * mprotect returns via a stale TLB entry. Avoid this possibility by = flushing > + * the local TLB if mprotect updates no pages so that the the caller = of > + * mprotect always gets expected behaviour. It's overkill and = unnecessary to > + * flush all TLBs as a separate thread accessing the data that raced = with > + * both reclaim and mprotect as there is no risk of data corruption = and > + * the exact timing of a parallel thread seeing a protection update = without > + * any serialisation on the application side is always uncertain. > + */ > +void batched_unmap_protection_update(void) > +{ > + count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); > + local_flush_tlb(); > + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); > +} > + > #else > static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool = writable) > { I don=E2=80=99t think this solution is enough. I am sorry for not = providing a solution, but I don=E2=80=99t see an easy one. Thanks, Nadav -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f71.google.com (mail-oi0-f71.google.com [209.85.218.71]) by kanga.kvack.org (Postfix) with ESMTP id 15E3B6B0538 for ; Tue, 11 Jul 2017 13:24:15 -0400 (EDT) Received: by mail-oi0-f71.google.com with SMTP id d77so407765oig.7 for ; Tue, 11 Jul 2017 10:24:15 -0700 (PDT) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.99]) by mx.google.com with ESMTPS id c141si391009oig.176.2017.07.11.10.24.12 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 11 Jul 2017 10:24:12 -0700 (PDT) Received: from mail-vk0-f47.google.com (mail-vk0-f47.google.com [209.85.213.47]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id ECF3A22C97 for ; Tue, 11 Jul 2017 17:24:11 +0000 (UTC) Received: by mail-vk0-f47.google.com with SMTP id 191so3800055vko.2 for ; Tue, 11 Jul 2017 10:24:11 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20170711155312.637eyzpqeghcgqzp@suse.de> References: <69BBEB97-1B10-4229-9AEF-DE19C26D8DFF@gmail.com> <20170711064149.bg63nvi54ycynxw4@suse.de> <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> From: Andy Lutomirski Date: Tue, 11 Jul 2017 10:23:50 -0700 Message-ID: Subject: Re: Potential race in TLB flush batching? 
Content-Type: text/plain; charset="UTF-8" Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andy Lutomirski , Nadav Amit , "open list:MEMORY MANAGEMENT" On Tue, Jul 11, 2017 at 8:53 AM, Mel Gorman wrote: > On Tue, Jul 11, 2017 at 07:58:04AM -0700, Andrew Lutomirski wrote: >> On Tue, Jul 11, 2017 at 6:20 AM, Mel Gorman wrote: >> > + >> > +/* >> > + * This is called after an mprotect update that altered no pages. Batched >> > + * unmap releases the PTL before a flush occurs leaving a window where >> > + * an mprotect that reduces access rights can still access the page after >> > + * mprotect returns via a stale TLB entry. Avoid this possibility by flushing >> > + * the local TLB if mprotect updates no pages so that the the caller of >> > + * mprotect always gets expected behaviour. It's overkill and unnecessary to >> > + * flush all TLBs as a separate thread accessing the data that raced with >> > + * both reclaim and mprotect as there is no risk of data corruption and >> > + * the exact timing of a parallel thread seeing a protection update without >> > + * any serialisation on the application side is always uncertain. >> > + */ >> > +void batched_unmap_protection_update(void) >> > +{ >> > + count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); >> > + local_flush_tlb(); >> > + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); >> > +} >> > + >> >> What about remote CPUs? You could get migrated right after mprotect() >> or the inconsistency could be observed on another CPU. > > If it's migrated then it has also context switched so the TLB entry will > be read for the first time. I don't think this is true. On current kernels, if the other CPU is running a thread in the same process, then there won't be a flush if we migrate there. In -tip, slated for 4.13, if the other CPU is lazy and is using the current process's page tables, it won't flush if we migrate there and it's not stale (as determined by the real flush APIs, not local_tlb_flush()). With PCID, the kernel will aggressively try to avoid the flush no matter what. > If the entry is inconsistent for another CPU > accessing the data then it'll potentially successfully access a page that > was just mprotected but this is similar to simply racing with the call > to mprotect itself. The timing isn't exact, nor does it need to be. Thread A: mprotect(..., PROT_READ); pthread_mutex_unlock(); Thread B: pthread_mutex_lock(); write to the mprotected address; I think it's unlikely that this exact scenario will affect a conventional C program, but I can see various GC systems and sandboxes being very surprised. > One > thread accessing data racing with another thread doing mprotect without > any synchronisation in the application is always going to be unreliable. As above, there can be synchronization that's entirely invisible to the kernel. >> I also really >> don't like bypassing arch code like this. The implementation of >> flush_tlb_mm_range() in tip:x86/mm (and slated for this merge window!) >> is *very* different from what's there now, and it is not written in >> the expectation that some generic code might call local_tlb_flush() >> and expect any kind of coherency at all. >> > > Assuming that gets merged first then the most straight-forward approach > would be to setup a arch_tlbflush_unmap_batch with just the local CPU set > in the mask or something similar. With what semantics? 
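As an aside, the Thread A/Thread B pattern described above can be written as a
small self-contained test program. This is only an illustration of the
application-level ordering, not a reproducer of the kernel race: a semaphore
stands in for the pthread mutex in the sketch above to keep the handoff
well-defined, and actually hitting the window additionally requires reclaim to
have batched a flush for that page. On an unaffected kernel the write simply
faults.

#include <pthread.h>
#include <semaphore.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;
static sem_t handoff;
static sigjmp_buf fault_jmp;

static void segv_handler(int sig)
{
	(void)sig;
	siglongjmp(fault_jmp, 1);
}

/* "Thread A": reduce access rights, then publish the change */
static void *thread_a(void *arg)
{
	(void)arg;
	if (mprotect(page, getpagesize(), PROT_READ))
		perror("mprotect");
	sem_post(&handoff);
	return NULL;
}

/* "Thread B": wait for the handoff, then write to the now read-only page */
static void *thread_b(void *arg)
{
	(void)arg;
	sem_wait(&handoff);
	if (sigsetjmp(fault_jmp, 1) == 0) {
		page[0] = 1;	/* should fault; a stale TLB entry would let it succeed */
		printf("write succeeded after mprotect(PROT_READ)\n");
	} else {
		printf("write faulted as expected\n");
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	struct sigaction sa;

	page = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return 1;
	page[0] = 0;			/* populate the page with a writable PTE */

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = segv_handler;
	sigaction(SIGSEGV, &sa, NULL);
	sem_init(&handoff, 0, 0);

	pthread_create(&a, NULL, thread_a, NULL);
	pthread_create(&b, NULL, thread_b, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

Built with -pthread this normally prints "write faulted as expected"; the point
of the report is that, with a batched reclaim flush pending in exactly that
window, the other branch becomes reachable even though the threads are properly
synchronised.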
>> Would a better fix perhaps be to find a way to figure out whether a >> batched flush is pending on the mm in question and flush it out if you >> do any optimizations based on assuming that the TLB is in any respect >> consistent with the page tables? With the changes in -tip, x86 could, >> in principle, supply a function to sync up its TLB state. That would >> require cross-CPU poking at state or an inconditional IPI (that might >> end up not flushing anything), but either is doable. > > It's potentially doable if a field like tlb_flush_pending was added > to mm_struct that is set when batching starts. I don't think there is > a logical place where it can be cleared as when the TLB gets flushed by > reclaim, it can't rmap again to clear the flag. What would happen is that > the first mprotect after any batching happened at any point in the past > would have to unconditionally flush the TLB and then clear the flag. That > would be a relatively minor hit and cover all the possibilities and should > work unmodified with or without your series applied. > > Would that be preferable to you? I'm not sure I understand it well enough to know whether I like it. I'm imagining an API that says "I'm about to rely on TLBs being coherent for this mm -- make it so". On x86, this would be roughly equivalent to a flush on the mm minus the mandatory flush part, at least with my patches applied. It would be considerably messier without my patches. But I'd like to make sure that the full extent of the problem is understood before getting too excited about solving it. --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f200.google.com (mail-wr0-f200.google.com [209.85.128.200]) by kanga.kvack.org (Postfix) with ESMTP id A70306810BE for ; Tue, 11 Jul 2017 15:18:27 -0400 (EDT) Received: by mail-wr0-f200.google.com with SMTP id u110so358256wrb.14 for ; Tue, 11 Jul 2017 12:18:27 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id v143si124226wmd.52.2017.07.11.12.18.25 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 11 Jul 2017 12:18:25 -0700 (PDT) Date: Tue, 11 Jul 2017 20:18:23 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? Message-ID: <20170711191823.qthrmdgqcd3rygjk@suse.de> References: <69BBEB97-1B10-4229-9AEF-DE19C26D8DFF@gmail.com> <20170711064149.bg63nvi54ycynxw4@suse.de> <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Nadav Amit , "open list:MEMORY MANAGEMENT" On Tue, Jul 11, 2017 at 10:23:50AM -0700, Andrew Lutomirski wrote: > On Tue, Jul 11, 2017 at 8:53 AM, Mel Gorman wrote: > > On Tue, Jul 11, 2017 at 07:58:04AM -0700, Andrew Lutomirski wrote: > >> On Tue, Jul 11, 2017 at 6:20 AM, Mel Gorman wrote: > >> > + > >> > +/* > >> > + * This is called after an mprotect update that altered no pages. Batched > >> > + * unmap releases the PTL before a flush occurs leaving a window where > >> > + * an mprotect that reduces access rights can still access the page after > >> > + * mprotect returns via a stale TLB entry. 
Avoid this possibility by flushing > >> > + * the local TLB if mprotect updates no pages so that the the caller of > >> > + * mprotect always gets expected behaviour. It's overkill and unnecessary to > >> > + * flush all TLBs as a separate thread accessing the data that raced with > >> > + * both reclaim and mprotect as there is no risk of data corruption and > >> > + * the exact timing of a parallel thread seeing a protection update without > >> > + * any serialisation on the application side is always uncertain. > >> > + */ > >> > +void batched_unmap_protection_update(void) > >> > +{ > >> > + count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); > >> > + local_flush_tlb(); > >> > + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); > >> > +} > >> > + > >> > >> What about remote CPUs? You could get migrated right after mprotect() > >> or the inconsistency could be observed on another CPU. > > > > If it's migrated then it has also context switched so the TLB entry will > > be read for the first time. > > I don't think this is true. On current kernels, if the other CPU is > running a thread in the same process, then there won't be a flush if > we migrate there. True although that would also be covered if a flush happening unconditionally on mprotect (and arguably munmap) if a batched TLB flush took place in the past. It's heavier than it needs to be but it would be trivial to track and only incur a cost if reclaim touched any pages belonging to the process in the past so a relatively rare operation in the normal case. It could be forced by continually keeping a system under memory pressure while looping around mprotect but the worst-case would be similar costs to never batching the flushing at all. > In -tip, slated for 4.13, if the other CPU is lazy > and is using the current process's page tables, it won't flush if we > migrate there and it's not stale (as determined by the real flush > APIs, not local_tlb_flush()). With PCID, the kernel will aggressively > try to avoid the flush no matter what. > I agree that PCID means that flushing needs to be more agressive and there is not much point working on two solutions and assume PCID is merged. > > If the entry is inconsistent for another CPU > > accessing the data then it'll potentially successfully access a page that > > was just mprotected but this is similar to simply racing with the call > > to mprotect itself. The timing isn't exact, nor does it need to be. > > Thread A: > mprotect(..., PROT_READ); > pthread_mutex_unlock(); > > Thread B: > pthread_mutex_lock(); > write to the mprotected address; > > I think it's unlikely that this exact scenario will affect a > conventional C program, but I can see various GC systems and sandboxes > being very surprised. > Maybe. The window is massively wide as the mprotect, unlock, remote wakeup and write all need to complete between the unmap releasing the PTL and the flush taking place. Still, it is theoritically possible. > > >> I also really > >> don't like bypassing arch code like this. The implementation of > >> flush_tlb_mm_range() in tip:x86/mm (and slated for this merge window!) > >> is *very* different from what's there now, and it is not written in > >> the expectation that some generic code might call local_tlb_flush() > >> and expect any kind of coherency at all. > >> > > > > Assuming that gets merged first then the most straight-forward approach > > would be to setup a arch_tlbflush_unmap_batch with just the local CPU set > > in the mask or something similar. > > With what semantics? 
> I'm dropping this idea because the more I think about it, the more I think that a more general flush is needed if TLB batching was used in the past. We could keep active track of mm's with flushes pending but it would be fairly complex, cost in terms of keeping track of mm's needing flushing and ultimately might be more expensive than just flushing immediately. If it's actually unfixable then, even though it's theoritical given the massive amount of activity that has to happen in a very short window, there would be no choice but to remove the TLB batching entirely which would be very unfortunate given that IPIs during reclaim will be very high once again. > >> Would a better fix perhaps be to find a way to figure out whether a > >> batched flush is pending on the mm in question and flush it out if you > >> do any optimizations based on assuming that the TLB is in any respect > >> consistent with the page tables? With the changes in -tip, x86 could, > >> in principle, supply a function to sync up its TLB state. That would > >> require cross-CPU poking at state or an inconditional IPI (that might > >> end up not flushing anything), but either is doable. > > > > It's potentially doable if a field like tlb_flush_pending was added > > to mm_struct that is set when batching starts. I don't think there is > > a logical place where it can be cleared as when the TLB gets flushed by > > reclaim, it can't rmap again to clear the flag. What would happen is that > > the first mprotect after any batching happened at any point in the past > > would have to unconditionally flush the TLB and then clear the flag. That > > would be a relatively minor hit and cover all the possibilities and should > > work unmodified with or without your series applied. > > > > Would that be preferable to you? > > I'm not sure I understand it well enough to know whether I like it. > I'm imagining an API that says "I'm about to rely on TLBs being > coherent for this mm -- make it so". I don't think we should be particularly clever about this and instead just flush the full mm if there is a risk of a parallel batching of flushing is in progress resulting in a stale TLB entry being used. I think tracking mms that are currently batching would end up being costly in terms of memory, fairly complex, or both. Something like this? 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 45cdb27791a3..ab8f7e11c160 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -495,6 +495,10 @@ struct mm_struct { */ bool tlb_flush_pending; #endif +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH + /* See flush_tlb_batched_pending() */ + bool tlb_flush_batched; +#endif struct uprobes_state uprobes_state; #ifdef CONFIG_HUGETLB_PAGE atomic_long_t hugetlb_usage; diff --git a/mm/internal.h b/mm/internal.h index 0e4f558412fb..bf835a5a9854 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq; #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH void try_to_unmap_flush(void); void try_to_unmap_flush_dirty(void); +void flush_tlb_batched_pending(struct mm_struct *mm); #else static inline void try_to_unmap_flush(void) { @@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void) static inline void try_to_unmap_flush_dirty(void) { } - +static inline void mm_tlb_flush_batched(struct mm_struct *mm) +{ +} #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */ extern const struct trace_print_flags pageflag_names[]; diff --git a/mm/memory.c b/mm/memory.c index bb11c474857e..b0c3d1556a94 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1197,6 +1197,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, init_rss_vec(rss); start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl); pte = start_pte; + flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); do { pte_t ptent = *pte; diff --git a/mm/mprotect.c b/mm/mprotect.c index 8edd0d576254..27135b91a4b4 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -61,6 +61,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, if (!pte) return 0; + /* Guard against parallel reclaim batching a TLB flush without PTL */ + flush_tlb_batched_pending(vma->vm_mm); + /* Get target node for single threaded private VMAs */ if (prot_numa && !(vma->vm_flags & VM_SHARED) && atomic_read(&vma->vm_mm->mm_users) == 1) diff --git a/mm/rmap.c b/mm/rmap.c index d405f0e0ee96..52633a124a4e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) return false; /* If remote CPUs need to be flushed then defer batch the flush */ - if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) + if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) { should_defer = true; + mm->tlb_flush_batched = true; + } put_cpu(); return should_defer; } + +/* + * Reclaim batches unmaps pages under the PTL but does not flush the TLB + * TLB prior to releasing the PTL. It's possible a parallel mprotect or + * munmap can race between reclaim unmapping the page and flushing the + * page. If this race occurs, it potentially allows access to data via + * a stale TLB entry. Tracking all mm's that have TLB batching pending + * would be expensive during reclaim so instead track whether TLB batching + * occured in the past and if so then do a full mm flush here. This will + * cost one additional flush per reclaim cycle paid by the first munmap or + * mprotect. This assumes it's called under the PTL to synchronise access + * to mm->tlb_flush_batched. 
+ */ +void flush_tlb_batched_pending(struct mm_struct *mm) +{ + if (mm->tlb_flush_batched) { + flush_tlb_mm(mm); + mm->tlb_flush_batched = false; + } +} #else static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id AAD546810BE for ; Tue, 11 Jul 2017 16:06:52 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id d18so2618898pfe.8 for ; Tue, 11 Jul 2017 13:06:52 -0700 (PDT) Received: from mail-pf0-x244.google.com (mail-pf0-x244.google.com. [2607:f8b0:400e:c00::244]) by mx.google.com with ESMTPS id o89si176843pfk.208.2017.07.11.13.06.51 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 11 Jul 2017 13:06:51 -0700 (PDT) Received: by mail-pf0-x244.google.com with SMTP id c24so314662pfe.1 for ; Tue, 11 Jul 2017 13:06:51 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: Potential race in TLB flush batching? From: Nadav Amit In-Reply-To: <20170711191823.qthrmdgqcd3rygjk@suse.de> Date: Tue, 11 Jul 2017 13:06:48 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <3373F577-F289-4028-B6F6-777D029A7B07@gmail.com> References: <69BBEB97-1B10-4229-9AEF-DE19C26D8DFF@gmail.com> <20170711064149.bg63nvi54ycynxw4@suse.de> <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andy Lutomirski , "open list:MEMORY MANAGEMENT" Mel Gorman wrote: > On Tue, Jul 11, 2017 at 10:23:50AM -0700, Andrew Lutomirski wrote: >> On Tue, Jul 11, 2017 at 8:53 AM, Mel Gorman wrote: >>> On Tue, Jul 11, 2017 at 07:58:04AM -0700, Andrew Lutomirski wrote: >>>> On Tue, Jul 11, 2017 at 6:20 AM, Mel Gorman = wrote: >>>>> + >>>>> +/* >>>>> + * This is called after an mprotect update that altered no pages. = Batched >>>>> + * unmap releases the PTL before a flush occurs leaving a window = where >>>>> + * an mprotect that reduces access rights can still access the = page after >>>>> + * mprotect returns via a stale TLB entry. Avoid this possibility = by flushing >>>>> + * the local TLB if mprotect updates no pages so that the the = caller of >>>>> + * mprotect always gets expected behaviour. It's overkill and = unnecessary to >>>>> + * flush all TLBs as a separate thread accessing the data that = raced with >>>>> + * both reclaim and mprotect as there is no risk of data = corruption and >>>>> + * the exact timing of a parallel thread seeing a protection = update without >>>>> + * any serialisation on the application side is always uncertain. >>>>> + */ >>>>> +void batched_unmap_protection_update(void) >>>>> +{ >>>>> + count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); >>>>> + local_flush_tlb(); >>>>> + trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); >>>>> +} >>>>> + >>>>=20 >>>> What about remote CPUs? You could get migrated right after = mprotect() >>>> or the inconsistency could be observed on another CPU. >>>=20 >>> If it's migrated then it has also context switched so the TLB entry = will >>> be read for the first time. >>=20 >> I don't think this is true. 
On current kernels, if the other CPU is >> running a thread in the same process, then there won't be a flush if >> we migrate there. >=20 > True although that would also be covered if a flush happening = unconditionally > on mprotect (and arguably munmap) if a batched TLB flush took place in = the > past. It's heavier than it needs to be but it would be trivial to = track > and only incur a cost if reclaim touched any pages belonging to the = process > in the past so a relatively rare operation in the normal case. It = could be > forced by continually keeping a system under memory pressure while = looping > around mprotect but the worst-case would be similar costs to never = batching > the flushing at all. >=20 >> In -tip, slated for 4.13, if the other CPU is lazy >> and is using the current process's page tables, it won't flush if we >> migrate there and it's not stale (as determined by the real flush >> APIs, not local_tlb_flush()). With PCID, the kernel will = aggressively >> try to avoid the flush no matter what. >=20 > I agree that PCID means that flushing needs to be more agressive and = there > is not much point working on two solutions and assume PCID is merged. >=20 >>> If the entry is inconsistent for another CPU >>> accessing the data then it'll potentially successfully access a page = that >>> was just mprotected but this is similar to simply racing with the = call >>> to mprotect itself. The timing isn't exact, nor does it need to be. >>=20 >> Thread A: >> mprotect(..., PROT_READ); >> pthread_mutex_unlock(); >>=20 >> Thread B: >> pthread_mutex_lock(); >> write to the mprotected address; >>=20 >> I think it's unlikely that this exact scenario will affect a >> conventional C program, but I can see various GC systems and = sandboxes >> being very surprised. >=20 > Maybe. The window is massively wide as the mprotect, unlock, remote = wakeup > and write all need to complete between the unmap releasing the PTL and > the flush taking place. Still, it is theoritically possible. Consider also virtual machines. A VCPU may be preempted by the = hypervisor right after a PTE change and before the flush - so the time between the = two can be rather large. >>>> I also really >>>> don't like bypassing arch code like this. The implementation of >>>> flush_tlb_mm_range() in tip:x86/mm (and slated for this merge = window!) >>>> is *very* different from what's there now, and it is not written in >>>> the expectation that some generic code might call local_tlb_flush() >>>> and expect any kind of coherency at all. >>>=20 >>> Assuming that gets merged first then the most straight-forward = approach >>> would be to setup a arch_tlbflush_unmap_batch with just the local = CPU set >>> in the mask or something similar. >>=20 >> With what semantics? >=20 > I'm dropping this idea because the more I think about it, the more I = think > that a more general flush is needed if TLB batching was used in the = past. > We could keep active track of mm's with flushes pending but it would = be > fairly complex, cost in terms of keeping track of mm's needing = flushing > and ultimately might be more expensive than just flushing immediately. >=20 > If it's actually unfixable then, even though it's theoritical given = the > massive amount of activity that has to happen in a very short window, = there > would be no choice but to remove the TLB batching entirely which would = be > very unfortunate given that IPIs during reclaim will be very high once = again. 
>=20 >>>> Would a better fix perhaps be to find a way to figure out whether a >>>> batched flush is pending on the mm in question and flush it out if = you >>>> do any optimizations based on assuming that the TLB is in any = respect >>>> consistent with the page tables? With the changes in -tip, x86 = could, >>>> in principle, supply a function to sync up its TLB state. That = would >>>> require cross-CPU poking at state or an inconditional IPI (that = might >>>> end up not flushing anything), but either is doable. >>>=20 >>> It's potentially doable if a field like tlb_flush_pending was added >>> to mm_struct that is set when batching starts. I don't think there = is >>> a logical place where it can be cleared as when the TLB gets flushed = by >>> reclaim, it can't rmap again to clear the flag. What would happen is = that >>> the first mprotect after any batching happened at any point in the = past >>> would have to unconditionally flush the TLB and then clear the flag. = That >>> would be a relatively minor hit and cover all the possibilities and = should >>> work unmodified with or without your series applied. >>>=20 >>> Would that be preferable to you? >>=20 >> I'm not sure I understand it well enough to know whether I like it. >> I'm imagining an API that says "I'm about to rely on TLBs being >> coherent for this mm -- make it so". >=20 > I don't think we should be particularly clever about this and instead = just > flush the full mm if there is a risk of a parallel batching of = flushing is > in progress resulting in a stale TLB entry being used. I think = tracking mms > that are currently batching would end up being costly in terms of = memory, > fairly complex, or both. Something like this? >=20 > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 45cdb27791a3..ab8f7e11c160 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -495,6 +495,10 @@ struct mm_struct { > */ > bool tlb_flush_pending; > #endif > +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH > + /* See flush_tlb_batched_pending() */ > + bool tlb_flush_batched; > +#endif > struct uprobes_state uprobes_state; > #ifdef CONFIG_HUGETLB_PAGE > atomic_long_t hugetlb_usage; > diff --git a/mm/internal.h b/mm/internal.h > index 0e4f558412fb..bf835a5a9854 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq; > #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH > void try_to_unmap_flush(void); > void try_to_unmap_flush_dirty(void); > +void flush_tlb_batched_pending(struct mm_struct *mm); > #else > static inline void try_to_unmap_flush(void) > { > @@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void) > static inline void try_to_unmap_flush_dirty(void) > { > } > - > +static inline void mm_tlb_flush_batched(struct mm_struct *mm) > +{ > +} > #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */ >=20 > extern const struct trace_print_flags pageflag_names[]; > diff --git a/mm/memory.c b/mm/memory.c > index bb11c474857e..b0c3d1556a94 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1197,6 +1197,7 @@ static unsigned long zap_pte_range(struct = mmu_gather *tlb, > init_rss_vec(rss); > start_pte =3D pte_offset_map_lock(mm, pmd, addr, &ptl); > pte =3D start_pte; > + flush_tlb_batched_pending(mm); > arch_enter_lazy_mmu_mode(); > do { > pte_t ptent =3D *pte; > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 8edd0d576254..27135b91a4b4 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -61,6 +61,9 @@ static unsigned long 
change_pte_range(struct = vm_area_struct *vma, pmd_t *pmd, > if (!pte) > return 0; >=20 > + /* Guard against parallel reclaim batching a TLB flush without = PTL */ > + flush_tlb_batched_pending(vma->vm_mm); > + > /* Get target node for single threaded private VMAs */ > if (prot_numa && !(vma->vm_flags & VM_SHARED) && > atomic_read(&vma->vm_mm->mm_users) =3D=3D 1) > diff --git a/mm/rmap.c b/mm/rmap.c > index d405f0e0ee96..52633a124a4e 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct = *mm, enum ttu_flags flags) > return false; >=20 > /* If remote CPUs need to be flushed then defer batch the flush = */ > - if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) > + if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) { > should_defer =3D true; > + mm->tlb_flush_batched =3D true; > + } > put_cpu(); >=20 > return should_defer; > } > + > +/* > + * Reclaim batches unmaps pages under the PTL but does not flush the = TLB > + * TLB prior to releasing the PTL. It's possible a parallel mprotect = or > + * munmap can race between reclaim unmapping the page and flushing = the > + * page. If this race occurs, it potentially allows access to data = via > + * a stale TLB entry. Tracking all mm's that have TLB batching = pending > + * would be expensive during reclaim so instead track whether TLB = batching > + * occured in the past and if so then do a full mm flush here. This = will > + * cost one additional flush per reclaim cycle paid by the first = munmap or > + * mprotect. This assumes it's called under the PTL to synchronise = access > + * to mm->tlb_flush_batched. > + */ > +void flush_tlb_batched_pending(struct mm_struct *mm) > +{ > + if (mm->tlb_flush_batched) { > + flush_tlb_mm(mm); > + mm->tlb_flush_batched =3D false; > + } > +} > #else > static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool = writable) > { I don=E2=80=99t know what is exactly the invariant that is kept, so it = is hard for me to figure out all sort of questions: Should pte_accessible return true if mm->tlb_flush_batch=3D=3Dtrue ? Does madvise_free_pte_range need to be modified as well? How will future code not break anything? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 534DA6810BE for ; Tue, 11 Jul 2017 16:09:26 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id x23so633320wrb.6 for ; Tue, 11 Jul 2017 13:09:26 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id l9si173573wrb.104.2017.07.11.13.09.24 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 11 Jul 2017 13:09:25 -0700 (PDT) Date: Tue, 11 Jul 2017 21:09:23 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? 
Message-ID: <20170711200923.gyaxfjzz3tpvreuq@suse.de> References: <69BBEB97-1B10-4229-9AEF-DE19C26D8DFF@gmail.com> <20170711064149.bg63nvi54ycynxw4@suse.de> <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20170711191823.qthrmdgqcd3rygjk@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Nadav Amit , "open list:MEMORY MANAGEMENT" On Tue, Jul 11, 2017 at 08:18:23PM +0100, Mel Gorman wrote: > I don't think we should be particularly clever about this and instead just > flush the full mm if there is a risk of a parallel batching of flushing is > in progress resulting in a stale TLB entry being used. I think tracking mms > that are currently batching would end up being costly in terms of memory, > fairly complex, or both. Something like this? > mremap and madvise(DONTNEED) would also need to flush. Memory policies are fine as a move_pages call that hits the race will simply fail to migrate a page that is being freed and once migration starts, it'll be flushed so a stale access has no further risk. copy_page_range should also be ok as the old mm is flushed and the new mm cannot have entries yet. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id E661D6B04D3 for ; Tue, 11 Jul 2017 17:09:22 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id u30so982594wrc.9 for ; Tue, 11 Jul 2017 14:09:22 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id p128si352333wmb.40.2017.07.11.14.09.21 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 11 Jul 2017 14:09:21 -0700 (PDT) Date: Tue, 11 Jul 2017 22:09:19 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? Message-ID: <20170711210919.y4odiqtfeb4e3ulz@suse.de> References: <20170711064149.bg63nvi54ycynxw4@suse.de> <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <3373F577-F289-4028-B6F6-777D029A7B07@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <3373F577-F289-4028-B6F6-777D029A7B07@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Nadav Amit Cc: Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Tue, Jul 11, 2017 at 01:06:48PM -0700, Nadav Amit wrote: > > +/* > > + * Reclaim batches unmaps pages under the PTL but does not flush the TLB > > + * TLB prior to releasing the PTL. It's possible a parallel mprotect or > > + * munmap can race between reclaim unmapping the page and flushing the > > + * page. If this race occurs, it potentially allows access to data via > > + * a stale TLB entry. Tracking all mm's that have TLB batching pending > > + * would be expensive during reclaim so instead track whether TLB batching > > + * occured in the past and if so then do a full mm flush here. This will > > + * cost one additional flush per reclaim cycle paid by the first munmap or > > + * mprotect. 
This assumes it's called under the PTL to synchronise access > > + * to mm->tlb_flush_batched. > > + */ > > +void flush_tlb_batched_pending(struct mm_struct *mm) > > +{ > > + if (mm->tlb_flush_batched) { > > + flush_tlb_mm(mm); > > + mm->tlb_flush_batched = false; > > + } > > +} > > #else > > static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable) > > { > > I don???t know what is exactly the invariant that is kept, so it is hard for > me to figure out all sort of questions: > > Should pte_accessible return true if mm->tlb_flush_batch==true ? > It shouldn't be necessary. The contexts where we hit the path are uprobes: elevated page count so no parallel reclaim dax: PTEs are not mapping that would be reclaimed hugetlbfs: Not reclaimed ksm: holds page lock and elevates count so cannot race with reclaim cow: at the time of the flush, the page count is elevated so cannot race with reclaim page_mkclean: only concerned with marking existing ptes clean but in any case, the batching flushes the TLB before issueing any IO so there isn't space for a stable TLB entry to be used for something bad. > Does madvise_free_pte_range need to be modified as well? > Yes, I noticed that out shortly after sending the first version and commented upon it. > How will future code not break anything? > I can't really answer that without a crystal ball. Code dealing with page table updates would need to take some care if it can race with parallel reclaim. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f197.google.com (mail-wr0-f197.google.com [209.85.128.197]) by kanga.kvack.org (Postfix) with ESMTP id 769D16810BE for ; Tue, 11 Jul 2017 17:52:44 -0400 (EDT) Received: by mail-wr0-f197.google.com with SMTP id 23so1171912wry.4 for ; Tue, 11 Jul 2017 14:52:44 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id m63si392447wme.38.2017.07.11.14.52.42 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 11 Jul 2017 14:52:43 -0700 (PDT) Date: Tue, 11 Jul 2017 22:52:41 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? Message-ID: <20170711215240.tdpmwmgwcuerjj3o@suse.de> References: <20170711064149.bg63nvi54ycynxw4@suse.de> <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20170711200923.gyaxfjzz3tpvreuq@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Nadav Amit , "open list:MEMORY MANAGEMENT" On Tue, Jul 11, 2017 at 09:09:23PM +0100, Mel Gorman wrote: > On Tue, Jul 11, 2017 at 08:18:23PM +0100, Mel Gorman wrote: > > I don't think we should be particularly clever about this and instead just > > flush the full mm if there is a risk of a parallel batching of flushing is > > in progress resulting in a stale TLB entry being used. I think tracking mms > > that are currently batching would end up being costly in terms of memory, > > fairly complex, or both. Something like this? > > > > mremap and madvise(DONTNEED) would also need to flush. 
Memory policies are > fine as a move_pages call that hits the race will simply fail to migrate > a page that is being freed and once migration starts, it'll be flushed so > a stale access has no further risk. copy_page_range should also be ok as > the old mm is flushed and the new mm cannot have entries yet. > Adding those results in ---8<--- mm, mprotect: Flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries Nadav Amit identified a theoritical race between page reclaim and mprotect due to TLB flushes being batched outside of the PTL being held. He described the race as follows CPU0 CPU1 ---- ---- user accesses memory using RW PTE [PTE now cached in TLB] try_to_unmap_one() ==> ptep_get_and_clear() ==> set_tlb_ubc_flush_pending() mprotect(addr, PROT_READ) ==> change_pte_range() ==> [ PTE non-present - no flush ] user writes using cached RW PTE ... try_to_unmap_flush() The same type of race exists for reads when protecting for PROT_NONE and also exists for operations that can leave an old TLB entry behind such as munmap, mremap and madvise. For some operations like mprotect, it's not a data integrity issue but it is a correctness issue. For munmap, it's potentially a data integrity issue although the race is massive as an munmap, mmap and return to userspace must all complete between the window when reclaim drops the PTL and flushes the TLB. However, it's theoritically possible so handle this issue by flushing the mm if reclaim is potentially currently batching TLB flushes. Other instances where a flush is required for a present pte should be ok as either page reference counts are elevated preventing parallel reclaim or in the case of page_mkclean there isn't an obvious path that userspace could take advantage of without using the operations that are guarded by this patch. Other users such as gup as a race with reclaim looks just at PTEs. huge page variants should be ok as they don't race with reclaim. mincore only looks at PTEs. userfault also should be ok as if a parallel reclaim takes place, it will either fault the page back in or read some of the data before the flush occurs triggering a fault. 
Signed-off-by: Mel Gorman Cc: stable@vger.kernel.org # v4.4+ --- include/linux/mm_types.h | 4 ++++ mm/internal.h | 5 ++++- mm/madvise.c | 1 + mm/memory.c | 1 + mm/mprotect.c | 3 +++ mm/mremap.c | 1 + mm/rmap.c | 24 +++++++++++++++++++++++- 7 files changed, 37 insertions(+), 2 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 45cdb27791a3..ab8f7e11c160 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -495,6 +495,10 @@ struct mm_struct { */ bool tlb_flush_pending; #endif +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH + /* See flush_tlb_batched_pending() */ + bool tlb_flush_batched; +#endif struct uprobes_state uprobes_state; #ifdef CONFIG_HUGETLB_PAGE atomic_long_t hugetlb_usage; diff --git a/mm/internal.h b/mm/internal.h index 0e4f558412fb..bf835a5a9854 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq; #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH void try_to_unmap_flush(void); void try_to_unmap_flush_dirty(void); +void flush_tlb_batched_pending(struct mm_struct *mm); #else static inline void try_to_unmap_flush(void) { @@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void) static inline void try_to_unmap_flush_dirty(void) { } - +static inline void mm_tlb_flush_batched(struct mm_struct *mm) +{ +} #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */ extern const struct trace_print_flags pageflag_names[]; diff --git a/mm/madvise.c b/mm/madvise.c index 25b78ee4fc2c..75d2cffbe61d 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -320,6 +320,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, tlb_remove_check_page_size_change(tlb, PAGE_SIZE); orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); for (; addr != end; pte++, addr += PAGE_SIZE) { ptent = *pte; diff --git a/mm/memory.c b/mm/memory.c index bb11c474857e..b0c3d1556a94 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1197,6 +1197,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, init_rss_vec(rss); start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl); pte = start_pte; + flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); do { pte_t ptent = *pte; diff --git a/mm/mprotect.c b/mm/mprotect.c index 8edd0d576254..27135b91a4b4 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -61,6 +61,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, if (!pte) return 0; + /* Guard against parallel reclaim batching a TLB flush without PTL */ + flush_tlb_batched_pending(vma->vm_mm); + /* Get target node for single threaded private VMAs */ if (prot_numa && !(vma->vm_flags & VM_SHARED) && atomic_read(&vma->vm_mm->mm_users) == 1) diff --git a/mm/mremap.c b/mm/mremap.c index cd8a1b199ef9..6e3d857458de 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -152,6 +152,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, new_ptl = pte_lockptr(mm, new_pmd); if (new_ptl != old_ptl) spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); + flush_tlb_batched_pending(vma->vm_mm); arch_enter_lazy_mmu_mode(); for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE, diff --git a/mm/rmap.c b/mm/rmap.c index d405f0e0ee96..5a3e4ff9c4a0 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) return false; /* If remote CPUs need to be flushed then defer batch the flush */ - if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) 
+ if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) { should_defer = true; + mm->tlb_flush_batched = true; + } put_cpu(); return should_defer; } + +/* + * Reclaim unmaps pages under the PTL but does not flush the TLB prior to + * releasing the PTL if TLB flushes are batched. It's possible a parallel + * operation such as mprotect or munmap to race between reclaim unmapping + * the page and flushing the page If this race occurs, it potentially allows + * access to data via a stale TLB entry. Tracking all mm's that have TLB + * batching pending would be expensive during reclaim so instead track + * whether TLB batching occured in the past and if so then do a full mmi + * flush here. This will cost one additional flush per reclaim cycle paid + * by the first operation at risk such as mprotect and mumap. This assumes + * it's called under the PTL to synchronise access to mm->tlb_flush_batched. + */ +void flush_tlb_batched_pending(struct mm_struct *mm) +{ + if (mm->tlb_flush_batched) { + flush_tlb_mm(mm); + mm->tlb_flush_batched = false; + } +} #else static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f69.google.com (mail-oi0-f69.google.com [209.85.218.69]) by kanga.kvack.org (Postfix) with ESMTP id 713DB6810BE for ; Tue, 11 Jul 2017 18:08:20 -0400 (EDT) Received: by mail-oi0-f69.google.com with SMTP id t194so407093oif.8 for ; Tue, 11 Jul 2017 15:08:20 -0700 (PDT) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.99]) by mx.google.com with ESMTPS id k132si356554oia.74.2017.07.11.15.08.19 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 11 Jul 2017 15:08:19 -0700 (PDT) Received: from mail-vk0-f49.google.com (mail-vk0-f49.google.com [209.85.213.49]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id B8E7122CAC for ; Tue, 11 Jul 2017 22:08:18 +0000 (UTC) Received: by mail-vk0-f49.google.com with SMTP id 191so3204355vko.2 for ; Tue, 11 Jul 2017 15:08:18 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20170711191823.qthrmdgqcd3rygjk@suse.de> References: <69BBEB97-1B10-4229-9AEF-DE19C26D8DFF@gmail.com> <20170711064149.bg63nvi54ycynxw4@suse.de> <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> From: Andy Lutomirski Date: Tue, 11 Jul 2017 15:07:57 -0700 Message-ID: Subject: Re: Potential race in TLB flush batching? Content-Type: text/plain; charset="UTF-8" Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andy Lutomirski , Nadav Amit , "open list:MEMORY MANAGEMENT" On Tue, Jul 11, 2017 at 12:18 PM, Mel Gorman wrote: I would change this slightly: > +void flush_tlb_batched_pending(struct mm_struct *mm) > +{ > + if (mm->tlb_flush_batched) { > + flush_tlb_mm(mm); How about making this a new helper arch_tlbbatch_flush_one_mm(mm); The idea is that this could be implemented as flush_tlb_mm(mm), but the actual semantics needed are weaker. All that's really needed AFAICS is to make sure that any arch_tlbbatch_add_mm() calls on this mm that have already happened become effective by the time that arch_tlbbatch_flush_one_mm() returns. 
The initial implementation would be this: struct flush_tlb_info info = { .mm = mm, .new_tlb_gen = atomic64_read(&mm->context.tlb_gen); .start = 0, .end = TLB_FLUSH_ALL, }; and the rest is like flush_tlb_mm_range(). flush_tlb_func_common() will already do the right thing, but the comments should probably be updated, too. The benefit would be that, if you just call this on an mm when everything is already flushed, it will still do the IPIs but it won't do the actual flush. A better future implementation could iterate over each cpu in mm_cpumask(), and, using either a new lock or very careful atomics, check whether that CPU really needs flushing. In -tip, all the information needed to figure this out is already there in the percpu state -- it's just not currently set up for remote access. For backports, it would just be flush_tlb_mm(). --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f70.google.com (mail-pg0-f70.google.com [74.125.83.70]) by kanga.kvack.org (Postfix) with ESMTP id A05A66810BE for ; Tue, 11 Jul 2017 18:27:58 -0400 (EDT) Received: by mail-pg0-f70.google.com with SMTP id u5so6026579pgq.14 for ; Tue, 11 Jul 2017 15:27:58 -0700 (PDT) Received: from mail-pg0-x242.google.com (mail-pg0-x242.google.com. [2607:f8b0:400e:c05::242]) by mx.google.com with ESMTPS id h9si419604pln.160.2017.07.11.15.27.57 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 11 Jul 2017 15:27:57 -0700 (PDT) Received: by mail-pg0-x242.google.com with SMTP id y129so621953pgy.3 for ; Tue, 11 Jul 2017 15:27:57 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: Potential race in TLB flush batching? From: Nadav Amit In-Reply-To: <20170711215240.tdpmwmgwcuerjj3o@suse.de> Date: Tue, 11 Jul 2017 15:27:55 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> References: <20170711064149.bg63nvi54ycynxw4@suse.de> <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andy Lutomirski , "open list:MEMORY MANAGEMENT" Mel Gorman wrote: > On Tue, Jul 11, 2017 at 09:09:23PM +0100, Mel Gorman wrote: >> On Tue, Jul 11, 2017 at 08:18:23PM +0100, Mel Gorman wrote: >>> I don't think we should be particularly clever about this and = instead just >>> flush the full mm if there is a risk of a parallel batching of = flushing is >>> in progress resulting in a stale TLB entry being used. I think = tracking mms >>> that are currently batching would end up being costly in terms of = memory, >>> fairly complex, or both. Something like this? >>=20 >> mremap and madvise(DONTNEED) would also need to flush. Memory = policies are >> fine as a move_pages call that hits the race will simply fail to = migrate >> a page that is being freed and once migration starts, it'll be = flushed so >> a stale access has no further risk. copy_page_range should also be ok = as >> the old mm is flushed and the new mm cannot have entries yet. >=20 > Adding those results in You are way too fast for me. 
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>  		return false;
>
>  	/* If remote CPUs need to be flushed then defer batch the flush */
> -	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) {
>  		should_defer = true;
> +		mm->tlb_flush_batched = true;
> +	}

Since mm->tlb_flush_batched is set before the PTE is actually cleared, it
still seems to leave a short window for a race.

CPU0                                CPU1
----                                ----
should_defer_flush
=> mm->tlb_flush_batched=true
                                    flush_tlb_batched_pending (another PT)
                                    => flush TLB
                                    => mm->tlb_flush_batched=false
ptep_get_and_clear
...
                                    flush_tlb_batched_pending (batched PT)
                                    use the stale PTE
...
try_to_unmap_flush

IOW it seems that mm->tlb_flush_batched should be set after the PTE is
cleared (and have some compiler barrier to be on the safe side).

Just to clarify - I don't mean to annoy, but I considered building and
submitting a patch based on some artifacts of a study I conducted, and this
issue drove me crazy.

One more question, please: how does elevated page count or even locking the
page help (as you mention in regard to uprobes and ksm)? Yes, the page will
not be reclaimed, but IIUC try_to_unmap is called before the reference count
is frozen, and the page lock is dropped on each iteration of the loop in
shrink_page_list. In this case, it seems to me that uprobes or ksm may still
not flush the TLB.

Thanks,
Nadav

From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 11 Jul 2017 23:33:11 +0100
From: Mel Gorman
Subject: Re: Potential race in TLB flush batching?
Message-ID: <20170711223311.iu7hxce5swmet6u3@suse.de>
References: <20170711064149.bg63nvi54ycynxw4@suse.de> <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andy Lutomirski
Cc: Nadav Amit , "open list:MEMORY MANAGEMENT"

On Tue, Jul 11, 2017 at 03:07:57PM -0700, Andrew Lutomirski wrote:
> On Tue, Jul 11, 2017 at 12:18 PM, Mel Gorman wrote:
> I would change this slightly:
>
> > +void flush_tlb_batched_pending(struct mm_struct *mm)
> > +{
> > +	if (mm->tlb_flush_batched) {
> > +		flush_tlb_mm(mm);
>
> How about making this a new helper arch_tlbbatch_flush_one_mm(mm);
> The idea is that this could be implemented as flush_tlb_mm(mm), but
> the actual semantics needed are weaker.
All that's really needed > AFAICS is to make sure that any arch_tlbbatch_add_mm() calls on this > mm that have already happened become effective by the time that > arch_tlbbatch_flush_one_mm() returns. > > The initial implementation would be this: > > struct flush_tlb_info info = { > .mm = mm, > .new_tlb_gen = atomic64_read(&mm->context.tlb_gen); > .start = 0, > .end = TLB_FLUSH_ALL, > }; > > and the rest is like flush_tlb_mm_range(). flush_tlb_func_common() > will already do the right thing, but the comments should probably be > updated, too. Yes, from what I remember from your patches and a quick recheck, that should be fine. I'll be leaving it until the morning to actually do the work. It requires that your stuff be upstream first but last time I checked, they were expected in this merge window. > The benefit would be that, if you just call this on an > mm when everything is already flushed, it will still do the IPIs but > it won't do the actual flush. > The benefit is somewhat marginal given that a process that has been partially reclaimed already has taken a hit on any latency requirements it has. However, it's a much better fit with your work in general. > A better future implementation could iterate over each cpu in > mm_cpumask(), and, using either a new lock or very careful atomics, > check whether that CPU really needs flushing. In -tip, all the > information needed to figure this out is already there in the percpu > state -- it's just not currently set up for remote access. > Potentially yes although I'm somewhat wary of adding too much complexity in that path. It'll either be very rare in which case the maintenance cost isn't worth it or the process is being continually thrashed by reclaim in which case saving a few TLB flushes isn't going to prevent performance falling through the floor. > For backports, it would just be flush_tlb_mm(). > Agreed. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id 729E86810BE for ; Tue, 11 Jul 2017 18:34:47 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id t8so6410585pgs.5 for ; Tue, 11 Jul 2017 15:34:47 -0700 (PDT) Received: from mail-pf0-x243.google.com (mail-pf0-x243.google.com. [2607:f8b0:400e:c00::243]) by mx.google.com with ESMTPS id o5si442944pgk.27.2017.07.11.15.34.46 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 11 Jul 2017 15:34:46 -0700 (PDT) Received: by mail-pf0-x243.google.com with SMTP id z6so670236pfk.3 for ; Tue, 11 Jul 2017 15:34:46 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: Potential race in TLB flush batching? 
From: Nadav Amit In-Reply-To: <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> Date: Tue, 11 Jul 2017 15:34:44 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <54C7B456-17EF-442D-8FAC-C8BE9D160750@gmail.com> References: <20170711064149.bg63nvi54ycynxw4@suse.de> <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andy Lutomirski , "open list:MEMORY MANAGEMENT" Nadav Amit wrote: > Mel Gorman wrote: >=20 >> On Tue, Jul 11, 2017 at 09:09:23PM +0100, Mel Gorman wrote: >>> On Tue, Jul 11, 2017 at 08:18:23PM +0100, Mel Gorman wrote: >>>> I don't think we should be particularly clever about this and = instead just >>>> flush the full mm if there is a risk of a parallel batching of = flushing is >>>> in progress resulting in a stale TLB entry being used. I think = tracking mms >>>> that are currently batching would end up being costly in terms of = memory, >>>> fairly complex, or both. Something like this? >>>=20 >>> mremap and madvise(DONTNEED) would also need to flush. Memory = policies are >>> fine as a move_pages call that hits the race will simply fail to = migrate >>> a page that is being freed and once migration starts, it'll be = flushed so >>> a stale access has no further risk. copy_page_range should also be = ok as >>> the old mm is flushed and the new mm cannot have entries yet. >>=20 >> Adding those results in >=20 > You are way too fast for me. >=20 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct = *mm, enum ttu_flags flags) >> return false; >>=20 >> /* If remote CPUs need to be flushed then defer batch the flush = */ >> - if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) >> + if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) { >> should_defer =3D true; >> + mm->tlb_flush_batched =3D true; >> + } >=20 > Since mm->tlb_flush_batched is set before the PTE is actually cleared, = it > still seems to leave a short window for a race. >=20 > CPU0 CPU1 > ---- ---- > should_defer_flush > =3D> mm->tlb_flush_batched=3Dtrue =09 > flush_tlb_batched_pending (another PT) > =3D> flush TLB > =3D> mm->tlb_flush_batched=3Dfalse > ptep_get_and_clear > ... >=20 > flush_tlb_batched_pending (batched PT) > use the stale PTE > ... > try_to_unmap_flush >=20 >=20 > IOW it seems that mm->flush_flush_batched should be set after the PTE = is > cleared (and have some compiler barrier to be on the safe side). I=E2=80=99m actually not sure about that. Without a lock that other = order may be racy as well. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f197.google.com (mail-wr0-f197.google.com [209.85.128.197]) by kanga.kvack.org (Postfix) with ESMTP id 9D3A46B053C for ; Wed, 12 Jul 2017 04:27:36 -0400 (EDT) Received: by mail-wr0-f197.google.com with SMTP id p64so3771664wrc.8 for ; Wed, 12 Jul 2017 01:27:36 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id y21si1638056wmh.132.2017.07.12.01.27.34 for (version=TLS1 cipher=AES128-SHA bits=128/128); Wed, 12 Jul 2017 01:27:35 -0700 (PDT) Date: Wed, 12 Jul 2017 09:27:33 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? Message-ID: <20170712082733.ouf7yx2bnvwwcfms@suse.de> References: <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Nadav Amit Cc: Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Tue, Jul 11, 2017 at 03:27:55PM -0700, Nadav Amit wrote: > Mel Gorman wrote: > > > On Tue, Jul 11, 2017 at 09:09:23PM +0100, Mel Gorman wrote: > >> On Tue, Jul 11, 2017 at 08:18:23PM +0100, Mel Gorman wrote: > >>> I don't think we should be particularly clever about this and instead just > >>> flush the full mm if there is a risk of a parallel batching of flushing is > >>> in progress resulting in a stale TLB entry being used. I think tracking mms > >>> that are currently batching would end up being costly in terms of memory, > >>> fairly complex, or both. Something like this? > >> > >> mremap and madvise(DONTNEED) would also need to flush. Memory policies are > >> fine as a move_pages call that hits the race will simply fail to migrate > >> a page that is being freed and once migration starts, it'll be flushed so > >> a stale access has no further risk. copy_page_range should also be ok as > >> the old mm is flushed and the new mm cannot have entries yet. > > > > Adding those results in > > You are way too fast for me. > > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) > > return false; > > > > /* If remote CPUs need to be flushed then defer batch the flush */ > > - if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) > > + if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) { > > should_defer = true; > > + mm->tlb_flush_batched = true; > > + } > > Since mm->tlb_flush_batched is set before the PTE is actually cleared, it > still seems to leave a short window for a race. > > CPU0 CPU1 > ---- ---- > should_defer_flush > => mm->tlb_flush_batched=true > flush_tlb_batched_pending (another PT) > => flush TLB > => mm->tlb_flush_batched=false > ptep_get_and_clear > ... > > flush_tlb_batched_pending (batched PT) > use the stale PTE > ... > try_to_unmap_flush > > IOW it seems that mm->flush_flush_batched should be set after the PTE is > cleared (and have some compiler barrier to be on the safe side). I'm relying on setting and clearing of tlb_flush_batched is under a PTL that is contended if the race is active. If reclaim is first, it'll take the PTL, set batched while a racing mprotect/munmap/etc spins. On release, the racing mprotect/munmmap immediately calls flush_tlb_batched_pending() before proceeding as normal, finding pte_none with the TLB flushed. If the mprotect/munmap/etc is first, it'll take the PTL, observe that pte_present and handle the flushing itself while reclaim potentially spins. When reclaim acquires the lock, it'll still set set tlb_flush_batched. 
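[ Illustration added for clarity, not part of the original mail: the ordering argument above can be modelled in plain userspace C, with a pthread mutex standing in for the page-table lock, a bool for mm->tlb_flush_batched and a printf for the real TLB flush. It is only a sketch of the same-lock case being described, not kernel code. ]

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-ins: the mutex models the page-table lock (PTL), the bools model
 * mm->tlb_flush_batched and a single PTE. Illustration only. */
struct fake_mm {
	pthread_mutex_t ptl;
	bool tlb_flush_batched;
	bool pte_present;
};

/* Reclaim side: clear the PTE and defer the TLB flush, all under the PTL. */
static void reclaim_unmap(struct fake_mm *mm)
{
	pthread_mutex_lock(&mm->ptl);
	mm->pte_present = false;	/* ptep_get_and_clear() */
	mm->tlb_flush_batched = true;	/* set_tlb_ubc_flush_pending() */
	pthread_mutex_unlock(&mm->ptl);
}

/* mprotect/munmap side: flush anything batched before touching PTEs. */
static void pte_update(struct fake_mm *mm)
{
	pthread_mutex_lock(&mm->ptl);
	if (mm->tlb_flush_batched) {	/* flush_tlb_batched_pending() */
		printf("flush TLB for this mm\n");
		mm->tlb_flush_batched = false;
	}
	/* ... modify PTEs; any stale entry for them is gone by now ... */
	pthread_mutex_unlock(&mm->ptl);
}

int main(void)
{
	struct fake_mm mm = { PTHREAD_MUTEX_INITIALIZER, false, true };

	/* Whichever side takes the lock first, pte_update() never proceeds
	 * while a batched-but-unflushed entry for the same PTE can be used. */
	reclaim_unmap(&mm);
	pte_update(&mm);
	return 0;
}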
As it's PTL that is taken for that field, it is possible for the accesses to be re-ordered but only in the case where a race is not occurring. I'll think some more about whether barriers are necessary but concluded they weren't needed in this instance. Doing the setting/clear+flush under the PTL, the protection is similar to normal page table operations that do not batch the flush. > One more question, please: how does elevated page count or even locking the > page help (as you mention in regard to uprobes and ksm)? Yes, the page will > not be reclaimed, but IIUC try_to_unmap is called before the reference count > is frozen, and the page lock is dropped on each iteration of the loop in > shrink_page_list. In this case, it seems to me that uprobes or ksm may still > not flush the TLB. > If page lock is held then reclaim skips the page entirely and uprobe, ksm and cow holds the page lock for pages that potentially be observed by reclaim. That is the primary protection for those paths. The elevated page count is less relevant but I was keeping it in mind trying to think of cases where a stale TLB entry existed and pointed to the wrong page. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id 8E730440874 for ; Wed, 12 Jul 2017 19:27:27 -0400 (EDT) Received: by mail-pf0-f198.google.com with SMTP id q87so37826551pfk.15 for ; Wed, 12 Jul 2017 16:27:27 -0700 (PDT) Received: from mail-pf0-x244.google.com (mail-pf0-x244.google.com. [2607:f8b0:400e:c00::244]) by mx.google.com with ESMTPS id 3si2987329plz.629.2017.07.12.16.27.26 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 12 Jul 2017 16:27:26 -0700 (PDT) Received: by mail-pf0-x244.google.com with SMTP id e199so4892257pfh.0 for ; Wed, 12 Jul 2017 16:27:26 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: Potential race in TLB flush batching? From: Nadav Amit In-Reply-To: <20170712082733.ouf7yx2bnvwwcfms@suse.de> Date: Wed, 12 Jul 2017 16:27:23 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> References: <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andy Lutomirski , "open list:MEMORY MANAGEMENT" Mel Gorman wrote: > On Tue, Jul 11, 2017 at 03:27:55PM -0700, Nadav Amit wrote: >> Mel Gorman wrote: >>=20 >>> On Tue, Jul 11, 2017 at 09:09:23PM +0100, Mel Gorman wrote: >>>> On Tue, Jul 11, 2017 at 08:18:23PM +0100, Mel Gorman wrote: >>>>> I don't think we should be particularly clever about this and = instead just >>>>> flush the full mm if there is a risk of a parallel batching of = flushing is >>>>> in progress resulting in a stale TLB entry being used. I think = tracking mms >>>>> that are currently batching would end up being costly in terms of = memory, >>>>> fairly complex, or both. Something like this? 
>>>>=20 >>>> mremap and madvise(DONTNEED) would also need to flush. Memory = policies are >>>> fine as a move_pages call that hits the race will simply fail to = migrate >>>> a page that is being freed and once migration starts, it'll be = flushed so >>>> a stale access has no further risk. copy_page_range should also be = ok as >>>> the old mm is flushed and the new mm cannot have entries yet. >>>=20 >>> Adding those results in >>=20 >> You are way too fast for me. >>=20 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -637,12 +637,34 @@ static bool should_defer_flush(struct = mm_struct *mm, enum ttu_flags flags) >>> return false; >>>=20 >>> /* If remote CPUs need to be flushed then defer batch the flush = */ >>> - if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) >>> + if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) { >>> should_defer =3D true; >>> + mm->tlb_flush_batched =3D true; >>> + } >>=20 >> Since mm->tlb_flush_batched is set before the PTE is actually = cleared, it >> still seems to leave a short window for a race. >>=20 >> CPU0 CPU1 >> ---- ---- >> should_defer_flush >> =3D> mm->tlb_flush_batched=3Dtrue =09 >> flush_tlb_batched_pending (another PT) >> =3D> flush TLB >> =3D> mm->tlb_flush_batched=3Dfalse >> ptep_get_and_clear >> ... >>=20 >> flush_tlb_batched_pending (batched PT) >> use the stale PTE >> ... >> try_to_unmap_flush >>=20 >> IOW it seems that mm->flush_flush_batched should be set after the PTE = is >> cleared (and have some compiler barrier to be on the safe side). >=20 > I'm relying on setting and clearing of tlb_flush_batched is under a = PTL > that is contended if the race is active. >=20 > If reclaim is first, it'll take the PTL, set batched while a racing > mprotect/munmap/etc spins. On release, the racing mprotect/munmmap > immediately calls flush_tlb_batched_pending() before proceeding as = normal, > finding pte_none with the TLB flushed. This is the scenario I regarded in my example. Notice that when the = first flush_tlb_batched_pending is called, CPU0 and CPU1 hold different = page-table locks - allowing them to run concurrently. As a result flush_tlb_batched_pending is executed before the PTE was cleared and mm->tlb_flush_batched is cleared. Later, after CPU0 runs = ptep_get_and_clear mm->tlb_flush_batched remains clear, and CPU1 can use the stale PTE. > If the mprotect/munmap/etc is first, it'll take the PTL, observe that > pte_present and handle the flushing itself while reclaim potentially > spins. When reclaim acquires the lock, it'll still set set = tlb_flush_batched. >=20 > As it's PTL that is taken for that field, it is possible for the = accesses > to be re-ordered but only in the case where a race is not occurring. > I'll think some more about whether barriers are necessary but = concluded > they weren't needed in this instance. Doing the setting/clear+flush = under > the PTL, the protection is similar to normal page table operations = that > do not batch the flush. >=20 >> One more question, please: how does elevated page count or even = locking the >> page help (as you mention in regard to uprobes and ksm)? Yes, the = page will >> not be reclaimed, but IIUC try_to_unmap is called before the = reference count >> is frozen, and the page lock is dropped on each iteration of the loop = in >> shrink_page_list. In this case, it seems to me that uprobes or ksm = may still >> not flush the TLB. 
>=20 > If page lock is held then reclaim skips the page entirely and uprobe, > ksm and cow holds the page lock for pages that potentially be observed > by reclaim. That is the primary protection for those paths. It is really hard, at least for me, to track this synchronization = scheme, as each path is protected in different means. I still don=E2=80=99t = understand why it is true, since the loop in shrink_page_list calls = __ClearPageLocked(page) on each iteration, before the actual flush takes place. Actually, I think that based on Andy=E2=80=99s patches there is a = relatively reasonable solution. For each mm we will hold both a = =E2=80=9Cpending_tlb_gen=E2=80=9D (increased under the PT-lock) and an =E2=80=9Cexecuted_tlb_gen=E2=80=9D. = Once flush_tlb_mm_range finishes flushing it will use cmpxchg to update the executed_tlb_gen to the pending_tlb_gen that was prior the flush (the cmpxchg will ensure the TLB gen only goes forward). Then, whenever pending_tlb_gen is different than executed_tlb_gen - a flush is needed. Nadav=20 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f72.google.com (mail-oi0-f72.google.com [209.85.218.72]) by kanga.kvack.org (Postfix) with ESMTP id 3ACC0440874 for ; Wed, 12 Jul 2017 19:36:36 -0400 (EDT) Received: by mail-oi0-f72.google.com with SMTP id f134so2820054oig.14 for ; Wed, 12 Jul 2017 16:36:36 -0700 (PDT) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.99]) by mx.google.com with ESMTPS id u187si2718888oie.76.2017.07.12.16.36.35 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 12 Jul 2017 16:36:35 -0700 (PDT) Received: from mail-ua0-f174.google.com (mail-ua0-f174.google.com [209.85.217.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id C8BED22C9F for ; Wed, 12 Jul 2017 23:36:34 +0000 (UTC) Received: by mail-ua0-f174.google.com with SMTP id g13so5241558uaj.0 for ; Wed, 12 Jul 2017 16:36:34 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> References: <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> From: Andy Lutomirski Date: Wed, 12 Jul 2017 16:36:13 -0700 Message-ID: Subject: Re: Potential race in TLB flush batching? Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: Nadav Amit Cc: Mel Gorman , Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Wed, Jul 12, 2017 at 4:27 PM, Nadav Amit wrote: > > Actually, I think that based on Andy=E2=80=99s patches there is a relativ= ely > reasonable solution. For each mm we will hold both a =E2=80=9Cpending_tlb= _gen=E2=80=9D > (increased under the PT-lock) and an =E2=80=9Cexecuted_tlb_gen=E2=80=9D. = Once > flush_tlb_mm_range finishes flushing it will use cmpxchg to update the > executed_tlb_gen to the pending_tlb_gen that was prior the flush (the > cmpxchg will ensure the TLB gen only goes forward). 
Then, whenever > pending_tlb_gen is different than executed_tlb_gen - a flush is needed. > Why do we need executed_tlb_gen? We already have cpu_tlbstate.ctxs[...].tlb_gen. Or is the idea that executed_tlb_gen guarantees that all cpus in mm_cpumask are at least up to date to executed_tlb_gen? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id 022CA440874 for ; Wed, 12 Jul 2017 19:42:34 -0400 (EDT) Received: by mail-pg0-f72.google.com with SMTP id p15so40499554pgs.7 for ; Wed, 12 Jul 2017 16:42:33 -0700 (PDT) Received: from mail-pg0-x230.google.com (mail-pg0-x230.google.com. [2607:f8b0:400e:c05::230]) by mx.google.com with ESMTPS id e13si2969907pgu.2.2017.07.12.16.42.32 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 12 Jul 2017 16:42:33 -0700 (PDT) Received: by mail-pg0-x230.google.com with SMTP id t186so20334920pgb.1 for ; Wed, 12 Jul 2017 16:42:32 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: Potential race in TLB flush batching? From: Nadav Amit In-Reply-To: Date: Wed, 12 Jul 2017 16:42:30 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <079D9048-0FFD-4A58-90EF-889259EB6ECE@gmail.com> References: <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Mel Gorman , "open list:MEMORY MANAGEMENT" Andy Lutomirski wrote: > On Wed, Jul 12, 2017 at 4:27 PM, Nadav Amit = wrote: >> Actually, I think that based on Andy=E2=80=99s patches there is a = relatively >> reasonable solution. For each mm we will hold both a = =E2=80=9Cpending_tlb_gen=E2=80=9D >> (increased under the PT-lock) and an =E2=80=9Cexecuted_tlb_gen=E2=80=9D= . Once >> flush_tlb_mm_range finishes flushing it will use cmpxchg to update = the >> executed_tlb_gen to the pending_tlb_gen that was prior the flush (the >> cmpxchg will ensure the TLB gen only goes forward). Then, whenever >> pending_tlb_gen is different than executed_tlb_gen - a flush is = needed. >=20 > Why do we need executed_tlb_gen? We already have > cpu_tlbstate.ctxs[...].tlb_gen. Or is the idea that executed_tlb_gen > guarantees that all cpus in mm_cpumask are at least up to date to > executed_tlb_gen? Hm... So actually it may be enough, no? Just compare mm->context.tlb_gen with cpu_tlbstate.ctxs[...].tlb_gen and flush if they are different? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
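[ A rough sketch of the comparison suggested above, added for illustration; the field names follow the -tip tlb_gen work and the ctxs[] index is a placeholder, and as the next message points out this local check alone does not cover a stale entry cached on a different CPU. ]

/* Sketch only: does the local CPU lag behind this mm's TLB generation?
 * "asid" is a placeholder for whatever slot the mm occupies in
 * cpu_tlbstate.ctxs[] on this CPU in the -tip patches. */
static bool local_tlb_out_of_date(struct mm_struct *mm, u16 asid)
{
	u64 mm_gen  = atomic64_read(&mm->context.tlb_gen);
	u64 cpu_gen = this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen);

	return cpu_gen != mm_gen;	/* flush before relying on the TLB */
}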
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f72.google.com (mail-oi0-f72.google.com [209.85.218.72]) by kanga.kvack.org (Postfix) with ESMTP id 54F68440874 for ; Thu, 13 Jul 2017 01:39:15 -0400 (EDT) Received: by mail-oi0-f72.google.com with SMTP id z82so3512812oiz.6 for ; Wed, 12 Jul 2017 22:39:15 -0700 (PDT) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.99]) by mx.google.com with ESMTPS id p5si3130759oig.335.2017.07.12.22.39.14 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 12 Jul 2017 22:39:14 -0700 (PDT) Received: from mail-vk0-f46.google.com (mail-vk0-f46.google.com [209.85.213.46]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 7B2E322C97 for ; Thu, 13 Jul 2017 05:39:13 +0000 (UTC) Received: by mail-vk0-f46.google.com with SMTP id y70so24010837vky.3 for ; Wed, 12 Jul 2017 22:39:13 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <079D9048-0FFD-4A58-90EF-889259EB6ECE@gmail.com> References: <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> <079D9048-0FFD-4A58-90EF-889259EB6ECE@gmail.com> From: Andy Lutomirski Date: Wed, 12 Jul 2017 22:38:51 -0700 Message-ID: Subject: Re: Potential race in TLB flush batching? Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: Nadav Amit Cc: Andy Lutomirski , Mel Gorman , "open list:MEMORY MANAGEMENT" On Wed, Jul 12, 2017 at 4:42 PM, Nadav Amit wrote: > Andy Lutomirski wrote: > >> On Wed, Jul 12, 2017 at 4:27 PM, Nadav Amit wrote= : >>> Actually, I think that based on Andy=E2=80=99s patches there is a relat= ively >>> reasonable solution. For each mm we will hold both a =E2=80=9Cpending_t= lb_gen=E2=80=9D >>> (increased under the PT-lock) and an =E2=80=9Cexecuted_tlb_gen=E2=80=9D= . Once >>> flush_tlb_mm_range finishes flushing it will use cmpxchg to update the >>> executed_tlb_gen to the pending_tlb_gen that was prior the flush (the >>> cmpxchg will ensure the TLB gen only goes forward). Then, whenever >>> pending_tlb_gen is different than executed_tlb_gen - a flush is needed. >> >> Why do we need executed_tlb_gen? We already have >> cpu_tlbstate.ctxs[...].tlb_gen. Or is the idea that executed_tlb_gen >> guarantees that all cpus in mm_cpumask are at least up to date to >> executed_tlb_gen? > > Hm... So actually it may be enough, no? Just compare mm->context.tlb_gen > with cpu_tlbstate.ctxs[...].tlb_gen and flush if they are different? > Wouldn't that still leave the races where the CPU observing the stale TLB entry isn't the CPU that did munmap/mprotect/whatever? I think executed_tlb_gen or similar may really be needed for your approach. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f197.google.com (mail-wr0-f197.google.com [209.85.128.197]) by kanga.kvack.org (Postfix) with ESMTP id 134AF440874 for ; Thu, 13 Jul 2017 02:07:10 -0400 (EDT) Received: by mail-wr0-f197.google.com with SMTP id 77so8043979wrb.11 for ; Wed, 12 Jul 2017 23:07:10 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id 1si3465597wrq.16.2017.07.12.23.07.08 for (version=TLS1 cipher=AES128-SHA bits=128/128); Wed, 12 Jul 2017 23:07:08 -0700 (PDT) Date: Thu, 13 Jul 2017 07:07:06 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? Message-ID: <20170713060706.o2cuko5y6irxwnww@suse.de> References: <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Nadav Amit Cc: Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Wed, Jul 12, 2017 at 04:27:23PM -0700, Nadav Amit wrote: > > If reclaim is first, it'll take the PTL, set batched while a racing > > mprotect/munmap/etc spins. On release, the racing mprotect/munmmap > > immediately calls flush_tlb_batched_pending() before proceeding as normal, > > finding pte_none with the TLB flushed. > > This is the scenario I regarded in my example. Notice that when the first > flush_tlb_batched_pending is called, CPU0 and CPU1 hold different page-table > locks - allowing them to run concurrently. As a result > flush_tlb_batched_pending is executed before the PTE was cleared and > mm->tlb_flush_batched is cleared. Later, after CPU0 runs ptep_get_and_clear > mm->tlb_flush_batched remains clear, and CPU1 can use the stale PTE. > If they hold different PTL locks, it means that reclaim and and the parallel munmap/mprotect/madvise/mremap operation are operating on different regions of an mm or separate mm's and the race should not apply or at the very least is equivalent to not batching the flushes. For multiple parallel operations, munmap/mprotect/mremap are serialised by mmap_sem so there is only one risky operation at a time. For multiple madvise, there is a small window when a page is accessible after madvise returns but it is an advisory call so it's primarily a data integrity concern and the TLB is flushed before the page is either freed or IO starts on the reclaim side. > > If the mprotect/munmap/etc is first, it'll take the PTL, observe that > > pte_present and handle the flushing itself while reclaim potentially > > spins. When reclaim acquires the lock, it'll still set set tlb_flush_batched. > > > > As it's PTL that is taken for that field, it is possible for the accesses > > to be re-ordered but only in the case where a race is not occurring. > > I'll think some more about whether barriers are necessary but concluded > > they weren't needed in this instance. Doing the setting/clear+flush under > > the PTL, the protection is similar to normal page table operations that > > do not batch the flush. 
> > > >> One more question, please: how does elevated page count or even locking the > >> page help (as you mention in regard to uprobes and ksm)? Yes, the page will > >> not be reclaimed, but IIUC try_to_unmap is called before the reference count > >> is frozen, and the page lock is dropped on each iteration of the loop in > >> shrink_page_list. In this case, it seems to me that uprobes or ksm may still > >> not flush the TLB. > > > > If page lock is held then reclaim skips the page entirely and uprobe, > > ksm and cow holds the page lock for pages that potentially be observed > > by reclaim. That is the primary protection for those paths. > > It is really hard, at least for me, to track this synchronization scheme, as > each path is protected in different means. I still don't understand why it > is true, since the loop in shrink_page_list calls __ClearPageLocked(page) on > each iteration, before the actual flush takes place. >
At the point of __ClearPageLocked, reclaim was holding the page lock and has reached the point where there cannot be any other references to the page and it is definitely clean. Any hypothetical TLB entry that exists at this point is read-only and would trap if a write was attempted, and the TLB is flushed before the page is freed, so there is no possibility of the page being reallocated and the TLB entry pointing to unrelated data.
> Actually, I think that based on Andy's patches there is a relatively > reasonable solution.
On top of Andy's work, the patch currently is below. Andy, is that roughly what you had in mind? I didn't think the comments in flush_tlb_func_common needed updating. I would have test results but the test against the tip tree without the patch failed overnight with what looks like filesystem corruption that happened *after* the tests completed, while it was untarring and building the next kernel to test with the patch applied. I'm not sure why yet or how reproducible it is.
---8<--- mm, mprotect: Flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
Nadav Amit identified a theoretical race between page reclaim and mprotect due to TLB flushes being batched outside of the PTL being held. He described the race as follows: CPU0 CPU1 ---- ---- user accesses memory using RW PTE [PTE now cached in TLB] try_to_unmap_one() ==> ptep_get_and_clear() ==> set_tlb_ubc_flush_pending() mprotect(addr, PROT_READ) ==> change_pte_range() ==> [ PTE non-present - no flush ] user writes using cached RW PTE ... try_to_unmap_flush()
The same type of race exists for reads when protecting for PROT_NONE and also exists for operations that can leave an old TLB entry behind, such as munmap, mremap and madvise. For some operations like mprotect, it's not a data integrity issue but it is a correctness issue. For munmap, it's potentially a data integrity issue, although the race is hard to hit as an munmap, mmap and return to userspace must all complete between the point where reclaim drops the PTL and the point where it flushes the TLB. However, it's theoretically possible, so handle this issue by flushing the mm if reclaim is potentially batching TLB flushes at the time. Other instances where a flush is required for a present pte should be ok as either the page lock is held preventing parallel reclaim or a page reference count is elevated preventing a parallel free leading to corruption. In the case of page_mkclean there isn't an obvious path that userspace could take advantage of without using the operations that are guarded by this patch.
Other users such as gup as a race with reclaim looks just at PTEs. huge page variants should be ok as they don't race with reclaim. mincore only looks at PTEs. userfault also should be ok as if a parallel reclaim takes place, it will either fault the page back in or read some of the data before the flush occurs triggering a fault. Signed-off-by: Mel Gorman Cc: stable@vger.kernel.org # v4.4+ diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 6397275008db..1ad93cf26826 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -325,6 +325,7 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch, } extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch); +extern void arch_tlbbatch_flush_one_mm(struct mm_struct *mm); #ifndef CONFIG_PARAVIRT #define flush_tlb_others(mask, info) \ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 2c1b8881e9d3..a72975a517a1 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -455,6 +455,39 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) put_cpu(); } +/* + * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when + * this returns. Using the current mm tlb_gen means the TLB will be up to date + * with respect to the tlb_gen set at arch_tlbbatch_add_mm. If a flush has + * happened since then the IPIs will still be sent but the actual flush is + * avoided. Unfortunately the IPIs are necessary as the per-cpu context + * tlb_gens cannot be safely accessed. + */ +void arch_tlbbatch_flush_one_mm(struct mm_struct *mm) +{ + int cpu; + struct flush_tlb_info info = { + .mm = mm, + .new_tlb_gen = atomic64_read(&mm->context.tlb_gen), + .start = 0, + .end = TLB_FLUSH_ALL, + }; + + cpu = get_cpu(); + + if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { + VM_WARN_ON(irqs_disabled()); + local_irq_disable(); + flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN); + local_irq_enable(); + } + + if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) + flush_tlb_others(mm_cpumask(mm), &info); + + put_cpu(); +} + static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf, size_t count, loff_t *ppos) { diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 45cdb27791a3..ab8f7e11c160 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -495,6 +495,10 @@ struct mm_struct { */ bool tlb_flush_pending; #endif +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH + /* See flush_tlb_batched_pending() */ + bool tlb_flush_batched; +#endif struct uprobes_state uprobes_state; #ifdef CONFIG_HUGETLB_PAGE atomic_long_t hugetlb_usage; diff --git a/mm/internal.h b/mm/internal.h index 0e4f558412fb..9c8a2bfb975c 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq; #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH void try_to_unmap_flush(void); void try_to_unmap_flush_dirty(void); +void flush_tlb_batched_pending(struct mm_struct *mm); #else static inline void try_to_unmap_flush(void) { @@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void) static inline void try_to_unmap_flush_dirty(void) { } - +static inline void flush_tlb_batched_pending(struct mm_struct *mm) +{ +} #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */ extern const struct trace_print_flags pageflag_names[]; diff --git a/mm/madvise.c b/mm/madvise.c index 25b78ee4fc2c..75d2cffbe61d 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -320,6 +320,7 @@ static int 
madvise_free_pte_range(pmd_t *pmd, unsigned long addr, tlb_remove_check_page_size_change(tlb, PAGE_SIZE); orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); for (; addr != end; pte++, addr += PAGE_SIZE) { ptent = *pte; diff --git a/mm/memory.c b/mm/memory.c index bb11c474857e..b0c3d1556a94 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1197,6 +1197,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, init_rss_vec(rss); start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl); pte = start_pte; + flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); do { pte_t ptent = *pte; diff --git a/mm/mprotect.c b/mm/mprotect.c index 8edd0d576254..f42749e6bf4e 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -66,6 +66,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, atomic_read(&vma->vm_mm->mm_users) == 1) target_node = numa_node_id(); + flush_tlb_batched_pending(vma->vm_mm); arch_enter_lazy_mmu_mode(); do { oldpte = *pte; diff --git a/mm/mremap.c b/mm/mremap.c index cd8a1b199ef9..6e3d857458de 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -152,6 +152,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, new_ptl = pte_lockptr(mm, new_pmd); if (new_ptl != old_ptl) spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); + flush_tlb_batched_pending(vma->vm_mm); arch_enter_lazy_mmu_mode(); for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE, diff --git a/mm/rmap.c b/mm/rmap.c index 130c238fe384..7c5c8ef583fa 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -603,6 +603,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable) arch_tlbbatch_add_mm(&tlb_ubc->arch, mm); tlb_ubc->flush_required = true; + mm->tlb_flush_batched = true; /* * If the PTE was dirty then it's best to assume it's writable. The @@ -631,6 +632,29 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) return should_defer; } + +/* + * Reclaim unmaps pages under the PTL but does not flush the TLB prior to + * releasing the PTL if TLB flushes are batched. It's possible a parallel + * operation such as mprotect or munmap to race between reclaim unmapping + * the page and flushing the page If this race occurs, it potentially allows + * access to data via a stale TLB entry. Tracking all mm's that have TLB + * batching in flight would be expensive during reclaim so instead track + * whether TLB batching occured in the past and if so then do a flush here + * if required. This will cost one additional flush per reclaim cycle paid + * by the first operation at risk such as mprotect and mumap. + * + * This must be called under the PTL so that accesses to tlb_flush_batched + * that is potentially a "reclaim vs mprotect/munmap/etc" race will + * synchronise via the PTL. + */ +void flush_tlb_batched_pending(struct mm_struct *mm) +{ + if (mm->tlb_flush_batched) { + arch_tlbbatch_flush_one_mm(mm); + mm->tlb_flush_batched = false; + } +} #else static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
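[ A sketch added for reference, not part of Mel's patch above: for stable backports that predate arch_tlbbatch_flush_one_mm(), the earlier version quoted at the top of the thread and Andy's backport note suggest the helper can simply fall back to a full flush_tlb_mm(). ]

/* Backport sketch: same logic as the patch's flush_tlb_batched_pending(),
 * but using the heavyweight flush_tlb_mm() that every kernel provides. */
void flush_tlb_batched_pending(struct mm_struct *mm)
{
	if (mm->tlb_flush_batched) {
		flush_tlb_mm(mm);		/* flush every entry for this mm */
		mm->tlb_flush_batched = false;	/* still under the PTL */
	}
}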
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id 0A5FD440874 for ; Thu, 13 Jul 2017 12:05:05 -0400 (EDT) Received: by mail-pg0-f72.google.com with SMTP id 125so62857334pgi.2 for ; Thu, 13 Jul 2017 09:05:05 -0700 (PDT) Received: from mail-pg0-x244.google.com (mail-pg0-x244.google.com. [2607:f8b0:400e:c05::244]) by mx.google.com with ESMTPS id 1si4630574pgp.88.2017.07.13.09.05.03 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 13 Jul 2017 09:05:04 -0700 (PDT) Received: by mail-pg0-x244.google.com with SMTP id j186so7441708pge.1 for ; Thu, 13 Jul 2017 09:05:03 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: Potential race in TLB flush batching? From: Nadav Amit In-Reply-To: Date: Thu, 13 Jul 2017 09:05:01 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <5F1E23BA-B0DE-4FF0-AB8F-C22936263EAA@gmail.com> References: <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> <079D9048-0FFD-4A58-90EF-889259EB6ECE@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Mel Gorman , "open list:MEMORY MANAGEMENT" Andy Lutomirski wrote: > On Wed, Jul 12, 2017 at 4:42 PM, Nadav Amit = wrote: >> Andy Lutomirski wrote: >>=20 >>> On Wed, Jul 12, 2017 at 4:27 PM, Nadav Amit = wrote: >>>> Actually, I think that based on Andy=E2=80=99s patches there is a = relatively >>>> reasonable solution. For each mm we will hold both a = =E2=80=9Cpending_tlb_gen=E2=80=9D >>>> (increased under the PT-lock) and an =E2=80=9Cexecuted_tlb_gen=E2=80=9D= . Once >>>> flush_tlb_mm_range finishes flushing it will use cmpxchg to update = the >>>> executed_tlb_gen to the pending_tlb_gen that was prior the flush = (the >>>> cmpxchg will ensure the TLB gen only goes forward). Then, whenever >>>> pending_tlb_gen is different than executed_tlb_gen - a flush is = needed. >>>=20 >>> Why do we need executed_tlb_gen? We already have >>> cpu_tlbstate.ctxs[...].tlb_gen. Or is the idea that = executed_tlb_gen >>> guarantees that all cpus in mm_cpumask are at least up to date to >>> executed_tlb_gen? >>=20 >> Hm... So actually it may be enough, no? Just compare = mm->context.tlb_gen >> with cpu_tlbstate.ctxs[...].tlb_gen and flush if they are different? >=20 > Wouldn't that still leave the races where the CPU observing the stale > TLB entry isn't the CPU that did munmap/mprotect/whatever? I think > executed_tlb_gen or similar may really be needed for your approach. Yes, you are right. This approach requires a counter that is only updated after the flush is completed by all cores. This way you ensure there is no CPU that did not complete the flush. Does it make sense?= -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
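[ A sketch of the two-counter scheme described above, added for illustration; the mm->context.pending_tlb_gen and mm->context.executed_tlb_gen fields are hypothetical names taken from the discussion, assumed to be atomic64_t. This is not a proposed patch. ]

/* Under the page-table lock, when a flush is queued for this mm. */
static inline u64 tlb_gen_queue_flush(struct mm_struct *mm)
{
	return atomic64_inc_return(&mm->context.pending_tlb_gen);
}

/* After flush_tlb_mm_range() has completed on every CPU in mm_cpumask().
 * cmpxchg so the executed generation only ever moves forward. */
static inline void tlb_gen_flush_done(struct mm_struct *mm, u64 done_gen)
{
	u64 old = atomic64_read(&mm->context.executed_tlb_gen);

	while (old < done_gen) {
		u64 prev = atomic64_cmpxchg(&mm->context.executed_tlb_gen,
					    old, done_gen);
		if (prev == old)
			break;
		old = prev;
	}
}

/* "pending != executed" means some CPU may still hold a stale entry. */
static inline bool tlb_gen_flush_needed(struct mm_struct *mm)
{
	return atomic64_read(&mm->context.pending_tlb_gen) !=
	       atomic64_read(&mm->context.executed_tlb_gen);
}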
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f72.google.com (mail-oi0-f72.google.com [209.85.218.72]) by kanga.kvack.org (Postfix) with ESMTP id 15378440874 for ; Thu, 13 Jul 2017 12:07:13 -0400 (EDT) Received: by mail-oi0-f72.google.com with SMTP id a142so4500576oii.5 for ; Thu, 13 Jul 2017 09:07:13 -0700 (PDT) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.99]) by mx.google.com with ESMTPS id w10si4120380oib.92.2017.07.13.09.07.12 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 13 Jul 2017 09:07:12 -0700 (PDT) Received: from mail-ua0-f181.google.com (mail-ua0-f181.google.com [209.85.217.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 6705C22C98 for ; Thu, 13 Jul 2017 16:07:11 +0000 (UTC) Received: by mail-ua0-f181.google.com with SMTP id z22so36659582uah.1 for ; Thu, 13 Jul 2017 09:07:11 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <5F1E23BA-B0DE-4FF0-AB8F-C22936263EAA@gmail.com> References: <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> <079D9048-0FFD-4A58-90EF-889259EB6ECE@gmail.com> <5F1E23BA-B0DE-4FF0-AB8F-C22936263EAA@gmail.com> From: Andy Lutomirski Date: Thu, 13 Jul 2017 09:06:49 -0700 Message-ID: Subject: Re: Potential race in TLB flush batching? Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: Nadav Amit Cc: Andy Lutomirski , Mel Gorman , "open list:MEMORY MANAGEMENT" On Thu, Jul 13, 2017 at 9:05 AM, Nadav Amit wrote: > Andy Lutomirski wrote: > >> On Wed, Jul 12, 2017 at 4:42 PM, Nadav Amit wrote= : >>> Andy Lutomirski wrote: >>> >>>> On Wed, Jul 12, 2017 at 4:27 PM, Nadav Amit wro= te: >>>>> Actually, I think that based on Andy=E2=80=99s patches there is a rel= atively >>>>> reasonable solution. For each mm we will hold both a =E2=80=9Cpending= _tlb_gen=E2=80=9D >>>>> (increased under the PT-lock) and an =E2=80=9Cexecuted_tlb_gen=E2=80= =9D. Once >>>>> flush_tlb_mm_range finishes flushing it will use cmpxchg to update th= e >>>>> executed_tlb_gen to the pending_tlb_gen that was prior the flush (the >>>>> cmpxchg will ensure the TLB gen only goes forward). Then, whenever >>>>> pending_tlb_gen is different than executed_tlb_gen - a flush is neede= d. >>>> >>>> Why do we need executed_tlb_gen? We already have >>>> cpu_tlbstate.ctxs[...].tlb_gen. Or is the idea that executed_tlb_gen >>>> guarantees that all cpus in mm_cpumask are at least up to date to >>>> executed_tlb_gen? >>> >>> Hm... So actually it may be enough, no? Just compare mm->context.tlb_ge= n >>> with cpu_tlbstate.ctxs[...].tlb_gen and flush if they are different? >> >> Wouldn't that still leave the races where the CPU observing the stale >> TLB entry isn't the CPU that did munmap/mprotect/whatever? I think >> executed_tlb_gen or similar may really be needed for your approach. > > Yes, you are right. > > This approach requires a counter that is only updated after the flush is > completed by all cores. This way you ensure there is no CPU that did not > complete the flush. > > Does it make sense? 
Yes. It could be a delta on top of Mel's patch. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f72.google.com (mail-oi0-f72.google.com [209.85.218.72]) by kanga.kvack.org (Postfix) with ESMTP id 11544440874 for ; Thu, 13 Jul 2017 12:08:45 -0400 (EDT) Received: by mail-oi0-f72.google.com with SMTP id n2so4489056oig.12 for ; Thu, 13 Jul 2017 09:08:45 -0700 (PDT) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.99]) by mx.google.com with ESMTPS id v8si4274294oie.78.2017.07.13.09.08.44 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 13 Jul 2017 09:08:44 -0700 (PDT) Received: from mail-vk0-f46.google.com (mail-vk0-f46.google.com [209.85.213.46]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 599B322C97 for ; Thu, 13 Jul 2017 16:08:43 +0000 (UTC) Received: by mail-vk0-f46.google.com with SMTP id r126so32727157vkg.0 for ; Thu, 13 Jul 2017 09:08:43 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20170713060706.o2cuko5y6irxwnww@suse.de> References: <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> <20170713060706.o2cuko5y6irxwnww@suse.de> From: Andy Lutomirski Date: Thu, 13 Jul 2017 09:08:21 -0700 Message-ID: Subject: Re: Potential race in TLB flush batching? Content-Type: text/plain; charset="UTF-8" Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Nadav Amit , Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Wed, Jul 12, 2017 at 11:07 PM, Mel Gorman wrote: > --- a/arch/x86/mm/tlb.c > +++ b/arch/x86/mm/tlb.c > @@ -455,6 +455,39 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) > put_cpu(); > } > > +/* > + * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when s/are up to date/have flushed the TLBs/ perhaps? Can you update this comment in arch/x86/include/asm/tlbflush.h: * - Fully flush a single mm. .mm will be set, .end will be * TLB_FLUSH_ALL, and .new_tlb_gen will be the tlb_gen to * which the IPI sender is trying to catch us up. by adding something like: This can also happen due to arch_tlbflush_flush_one_mm(), in which case it's quite likely that most or all CPUs are already up to date. Thanks, Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f69.google.com (mail-wm0-f69.google.com [74.125.82.69]) by kanga.kvack.org (Postfix) with ESMTP id B500F440874 for ; Thu, 13 Jul 2017 13:07:15 -0400 (EDT) Received: by mail-wm0-f69.google.com with SMTP id b189so5322012wmb.12 for ; Thu, 13 Jul 2017 10:07:15 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id u10si5820276wma.89.2017.07.13.10.07.13 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 13 Jul 2017 10:07:14 -0700 (PDT) Date: Thu, 13 Jul 2017 18:07:12 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? Message-ID: <20170713170712.4iriw5lncoulcgda@suse.de> References: <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> <20170713060706.o2cuko5y6irxwnww@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Nadav Amit , "open list:MEMORY MANAGEMENT" On Thu, Jul 13, 2017 at 09:08:21AM -0700, Andrew Lutomirski wrote: > On Wed, Jul 12, 2017 at 11:07 PM, Mel Gorman wrote: > > --- a/arch/x86/mm/tlb.c > > +++ b/arch/x86/mm/tlb.c > > @@ -455,6 +455,39 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) > > put_cpu(); > > } > > > > +/* > > + * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when > > s/are up to date/have flushed the TLBs/ perhaps? > > > Can you update this comment in arch/x86/include/asm/tlbflush.h: > > * - Fully flush a single mm. .mm will be set, .end will be > * TLB_FLUSH_ALL, and .new_tlb_gen will be the tlb_gen to > * which the IPI sender is trying to catch us up. > > by adding something like: This can also happen due to > arch_tlbflush_flush_one_mm(), in which case it's quite likely that > most or all CPUs are already up to date. > No problem, thanks. Care to ack the patch below? If so, I'll send it to Ingo with x86 and linux-mm cc'd after some tests complete (hopefully successfully). It's fairly x86 specific and makes sense to go in with the rest of the pcid and mm tlb_gen stuff rather than via Andrew's tree even through it touches core mm. ---8<--- mm, mprotect: Flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries Nadav Amit identified a theoritical race between page reclaim and mprotect due to TLB flushes being batched outside of the PTL being held. He described the race as follows CPU0 CPU1 ---- ---- user accesses memory using RW PTE [PTE now cached in TLB] try_to_unmap_one() ==> ptep_get_and_clear() ==> set_tlb_ubc_flush_pending() mprotect(addr, PROT_READ) ==> change_pte_range() ==> [ PTE non-present - no flush ] user writes using cached RW PTE ... try_to_unmap_flush() The same type of race exists for reads when protecting for PROT_NONE and also exists for operations that can leave an old TLB entry behind such as munmap, mremap and madvise. For some operations like mprotect, it's not necessarily a data integrity issue but it is a correctness issue as there is a window where an mprotect that limits access still allows access. For munmap, it's potentially a data integrity issue although the race is massive as an munmap, mmap and return to userspace must all complete between the window when reclaim drops the PTL and flushes the TLB. However, it's theoritically possible so handle this issue by flushing the mm if reclaim is potentially currently batching TLB flushes. 
Other instances where a flush is required for a present pte should be ok as either the page lock is held preventing parallel reclaim or a page reference count is elevated preventing a parallel free leading to corruption. In the case of page_mkclean there isn't an obvious path that userspace could take advantage of without using the operations that are guarded by this patch. Other users such as gup as a race with reclaim looks just at PTEs. huge page variants should be ok as they don't race with reclaim. mincore only looks at PTEs. userfault also should be ok as if a parallel reclaim takes place, it will either fault the page back in or read some of the data before the flush occurs triggering a fault. Signed-off-by: Mel Gorman Cc: stable@vger.kernel.org # v4.4+ --- arch/x86/include/asm/tlbflush.h | 6 +++++- arch/x86/mm/tlb.c | 33 +++++++++++++++++++++++++++++++++ include/linux/mm_types.h | 4 ++++ mm/internal.h | 5 ++++- mm/madvise.c | 1 + mm/memory.c | 1 + mm/mprotect.c | 1 + mm/mremap.c | 1 + mm/rmap.c | 24 ++++++++++++++++++++++++ 9 files changed, 74 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index d23e61dc0640..1849e8da7a27 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -294,7 +294,10 @@ struct flush_tlb_info { * * - Fully flush a single mm. .mm will be set, .end will be * TLB_FLUSH_ALL, and .new_tlb_gen will be the tlb_gen to - * which the IPI sender is trying to catch us up. + * which the IPI sender is trying to catch us up. This can + * also happen due to arch_tlbflush_flush_one_mm(), in which + * case it's quite likely that most or all CPUs are already + * up to date. * * - Partially flush a single mm. .mm will be set, .start and * .end will indicate the range, and .new_tlb_gen will be set @@ -339,6 +342,7 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch, } extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch); +extern void arch_tlbbatch_flush_one_mm(struct mm_struct *mm); #ifndef CONFIG_PARAVIRT #define flush_tlb_others(mask, info) \ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 63a5b451c128..248063dc5be8 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -505,6 +505,39 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) put_cpu(); } +/* + * Ensure that any arch_tlbbatch_add_mm calls on this mm have flushed the TLB + * when this returns. Using the current mm tlb_gen means the TLB will be up + * to date with respect to the tlb_gen set at arch_tlbbatch_add_mm. If a + * flush has happened since then the IPIs will still be sent but the actual + * flush is avoided. Unfortunately the IPIs are necessary as the per-cpu + * context tlb_gens cannot be safely accessed. 
+ */ +void arch_tlbbatch_flush_one_mm(struct mm_struct *mm) +{ + int cpu; + struct flush_tlb_info info = { + .mm = mm, + .new_tlb_gen = atomic64_read(&mm->context.tlb_gen), + .start = 0, + .end = TLB_FLUSH_ALL, + }; + + cpu = get_cpu(); + + if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { + VM_WARN_ON(irqs_disabled()); + local_irq_disable(); + flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN); + local_irq_enable(); + } + + if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) + flush_tlb_others(mm_cpumask(mm), &info); + + put_cpu(); +} + static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf, size_t count, loff_t *ppos) { diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 45cdb27791a3..ab8f7e11c160 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -495,6 +495,10 @@ struct mm_struct { */ bool tlb_flush_pending; #endif +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH + /* See flush_tlb_batched_pending() */ + bool tlb_flush_batched; +#endif struct uprobes_state uprobes_state; #ifdef CONFIG_HUGETLB_PAGE atomic_long_t hugetlb_usage; diff --git a/mm/internal.h b/mm/internal.h index 0e4f558412fb..9c8a2bfb975c 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq; #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH void try_to_unmap_flush(void); void try_to_unmap_flush_dirty(void); +void flush_tlb_batched_pending(struct mm_struct *mm); #else static inline void try_to_unmap_flush(void) { @@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void) static inline void try_to_unmap_flush_dirty(void) { } - +static inline void flush_tlb_batched_pending(struct mm_struct *mm) +{ +} #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */ extern const struct trace_print_flags pageflag_names[]; diff --git a/mm/madvise.c b/mm/madvise.c index 25b78ee4fc2c..75d2cffbe61d 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -320,6 +320,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, tlb_remove_check_page_size_change(tlb, PAGE_SIZE); orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); for (; addr != end; pte++, addr += PAGE_SIZE) { ptent = *pte; diff --git a/mm/memory.c b/mm/memory.c index bb11c474857e..b0c3d1556a94 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1197,6 +1197,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, init_rss_vec(rss); start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl); pte = start_pte; + flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); do { pte_t ptent = *pte; diff --git a/mm/mprotect.c b/mm/mprotect.c index 8edd0d576254..f42749e6bf4e 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -66,6 +66,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, atomic_read(&vma->vm_mm->mm_users) == 1) target_node = numa_node_id(); + flush_tlb_batched_pending(vma->vm_mm); arch_enter_lazy_mmu_mode(); do { oldpte = *pte; diff --git a/mm/mremap.c b/mm/mremap.c index cd8a1b199ef9..6e3d857458de 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -152,6 +152,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, new_ptl = pte_lockptr(mm, new_pmd); if (new_ptl != old_ptl) spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); + flush_tlb_batched_pending(vma->vm_mm); arch_enter_lazy_mmu_mode(); for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE, diff --git a/mm/rmap.c b/mm/rmap.c index 130c238fe384..7c5c8ef583fa 100644 --- a/mm/rmap.c +++ 
b/mm/rmap.c @@ -603,6 +603,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable) arch_tlbbatch_add_mm(&tlb_ubc->arch, mm); tlb_ubc->flush_required = true; + mm->tlb_flush_batched = true; /* * If the PTE was dirty then it's best to assume it's writable. The @@ -631,6 +632,29 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) return should_defer; } + +/* + * Reclaim unmaps pages under the PTL but does not flush the TLB prior to + * releasing the PTL if TLB flushes are batched. It's possible a parallel + * operation such as mprotect or munmap to race between reclaim unmapping + * the page and flushing the page If this race occurs, it potentially allows + * access to data via a stale TLB entry. Tracking all mm's that have TLB + * batching in flight would be expensive during reclaim so instead track + * whether TLB batching occured in the past and if so then do a flush here + * if required. This will cost one additional flush per reclaim cycle paid + * by the first operation at risk such as mprotect and mumap. + * + * This must be called under the PTL so that accesses to tlb_flush_batched + * that is potentially a "reclaim vs mprotect/munmap/etc" race will + * synchronise via the PTL. + */ +void flush_tlb_batched_pending(struct mm_struct *mm) +{ + if (mm->tlb_flush_batched) { + arch_tlbbatch_flush_one_mm(mm); + mm->tlb_flush_batched = false; + } +} #else static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f70.google.com (mail-oi0-f70.google.com [209.85.218.70]) by kanga.kvack.org (Postfix) with ESMTP id 98A9F440874 for ; Thu, 13 Jul 2017 13:15:38 -0400 (EDT) Received: by mail-oi0-f70.google.com with SMTP id b130so4600241oii.9 for ; Thu, 13 Jul 2017 10:15:38 -0700 (PDT) Received: from mail.kernel.org (mail.kernel.org. [198.145.29.99]) by mx.google.com with ESMTPS id w186si4569951oiw.25.2017.07.13.10.15.37 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 13 Jul 2017 10:15:37 -0700 (PDT) Received: from mail-ua0-f175.google.com (mail-ua0-f175.google.com [209.85.217.175]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 1CE2922CA1 for ; Thu, 13 Jul 2017 17:15:37 +0000 (UTC) Received: by mail-ua0-f175.google.com with SMTP id z22so37869666uah.1 for ; Thu, 13 Jul 2017 10:15:37 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20170713170712.4iriw5lncoulcgda@suse.de> References: <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> <20170713060706.o2cuko5y6irxwnww@suse.de> <20170713170712.4iriw5lncoulcgda@suse.de> From: Andy Lutomirski Date: Thu, 13 Jul 2017 10:15:15 -0700 Message-ID: Subject: Re: Potential race in TLB flush batching? 
Content-Type: text/plain; charset="UTF-8" Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andy Lutomirski , Nadav Amit , "open list:MEMORY MANAGEMENT" On Thu, Jul 13, 2017 at 10:07 AM, Mel Gorman wrote: > On Thu, Jul 13, 2017 at 09:08:21AM -0700, Andrew Lutomirski wrote: >> On Wed, Jul 12, 2017 at 11:07 PM, Mel Gorman wrote: >> > --- a/arch/x86/mm/tlb.c >> > +++ b/arch/x86/mm/tlb.c >> > @@ -455,6 +455,39 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) >> > put_cpu(); >> > } >> > >> > +/* >> > + * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when >> >> s/are up to date/have flushed the TLBs/ perhaps? >> >> >> Can you update this comment in arch/x86/include/asm/tlbflush.h: >> >> * - Fully flush a single mm. .mm will be set, .end will be >> * TLB_FLUSH_ALL, and .new_tlb_gen will be the tlb_gen to >> * which the IPI sender is trying to catch us up. >> >> by adding something like: This can also happen due to >> arch_tlbflush_flush_one_mm(), in which case it's quite likely that >> most or all CPUs are already up to date. >> > > No problem, thanks. Care to ack the patch below? If so, I'll send it > to Ingo with x86 and linux-mm cc'd after some tests complete (hopefully > successfully). It's fairly x86 specific and makes sense to go in with the > rest of the pcid and mm tlb_gen stuff rather than via Andrew's tree even > through it touches core mm. Acked-by: Andy Lutomirski # for the x86 parts When you send to Ingo, you might want to change arch_tlbbatch_flush_one_mm to arch_tlbbatch_flush_one_mm(), because otherwise he'll probably do it for you :) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id A64D0440874 for ; Thu, 13 Jul 2017 14:23:53 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id 23so9771664wry.4 for ; Thu, 13 Jul 2017 11:23:53 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id o188si45277wmd.102.2017.07.13.11.23.52 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 13 Jul 2017 11:23:52 -0700 (PDT) Date: Thu, 13 Jul 2017 19:23:50 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? 
Message-ID: <20170713182350.n64dmnkgbiivikmh@suse.de> References: <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> <20170713060706.o2cuko5y6irxwnww@suse.de> <20170713170712.4iriw5lncoulcgda@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Nadav Amit , "open list:MEMORY MANAGEMENT" On Thu, Jul 13, 2017 at 10:15:15AM -0700, Andrew Lutomirski wrote: > On Thu, Jul 13, 2017 at 10:07 AM, Mel Gorman wrote: > > On Thu, Jul 13, 2017 at 09:08:21AM -0700, Andrew Lutomirski wrote: > >> On Wed, Jul 12, 2017 at 11:07 PM, Mel Gorman wrote: > >> > --- a/arch/x86/mm/tlb.c > >> > +++ b/arch/x86/mm/tlb.c > >> > @@ -455,6 +455,39 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) > >> > put_cpu(); > >> > } > >> > > >> > +/* > >> > + * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when > >> > >> s/are up to date/have flushed the TLBs/ perhaps? > >> > >> > >> Can you update this comment in arch/x86/include/asm/tlbflush.h: > >> > >> * - Fully flush a single mm. .mm will be set, .end will be > >> * TLB_FLUSH_ALL, and .new_tlb_gen will be the tlb_gen to > >> * which the IPI sender is trying to catch us up. > >> > >> by adding something like: This can also happen due to > >> arch_tlbflush_flush_one_mm(), in which case it's quite likely that > >> most or all CPUs are already up to date. > >> > > > > No problem, thanks. Care to ack the patch below? If so, I'll send it > > to Ingo with x86 and linux-mm cc'd after some tests complete (hopefully > > successfully). It's fairly x86 specific and makes sense to go in with the > > rest of the pcid and mm tlb_gen stuff rather than via Andrew's tree even > > through it touches core mm. > > Acked-by: Andy Lutomirski # for the x86 parts > > When you send to Ingo, you might want to change > arch_tlbbatch_flush_one_mm to arch_tlbbatch_flush_one_mm(), because > otherwise he'll probably do it for you :) *cringe*. I fixed it up. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-it0-f70.google.com (mail-it0-f70.google.com [209.85.214.70]) by kanga.kvack.org (Postfix) with ESMTP id 8C90D4408E5 for ; Fri, 14 Jul 2017 03:00:50 -0400 (EDT) Received: by mail-it0-f70.google.com with SMTP id 188so96718190itx.9 for ; Fri, 14 Jul 2017 00:00:50 -0700 (PDT) Received: from gate.crashing.org (gate.crashing.org. [63.228.1.57]) by mx.google.com with ESMTPS id u126si1462190itg.48.2017.07.14.00.00.49 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 14 Jul 2017 00:00:49 -0700 (PDT) Message-ID: <1500015641.2865.81.camel@kernel.crashing.org> Subject: Re: Potential race in TLB flush batching? 
From: Benjamin Herrenschmidt Date: Fri, 14 Jul 2017 17:00:41 +1000 In-Reply-To: References: <69BBEB97-1B10-4229-9AEF-DE19C26D8DFF@gmail.com> <20170711064149.bg63nvi54ycynxw4@suse.de> <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski , Mel Gorman Cc: Nadav Amit , linux-mm@kvack.org On Tue, 2017-07-11 at 15:07 -0700, Andy Lutomirski wrote: > On Tue, Jul 11, 2017 at 12:18 PM, Mel Gorman wrote: > > I would change this slightly: > > > +void flush_tlb_batched_pending(struct mm_struct *mm) > > +{ > > +A A A A A A if (mm->tlb_flush_batched) { > > +A A A A A A A A A A A A A A flush_tlb_mm(mm); > > How about making this a new helper arch_tlbbatch_flush_one_mm(mm); > The idea is that this could be implemented as flush_tlb_mm(mm), but > the actual semantics needed are weaker.A All that's really needed > AFAICS is to make sure that any arch_tlbbatch_add_mm() calls on this > mm that have already happened become effective by the time that > arch_tlbbatch_flush_one_mm() returns. Jumping in ... I just discovered that 'new' batching stuff... is it documented anywhere ? We already had some form of batching via the mmu_gather, now there's a different somewhat orthogonal and it's completely unclear what it's about and why we couldn't use what we already had. Also what assumptions it makes if I want to port it to my arch.... The page table management code was messy enough without yet another undocumented batching mechanism that isn't quite the one we already had... Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f69.google.com (mail-wm0-f69.google.com [74.125.82.69]) by kanga.kvack.org (Postfix) with ESMTP id D1059440905 for ; Fri, 14 Jul 2017 04:31:17 -0400 (EDT) Received: by mail-wm0-f69.google.com with SMTP id c81so8148796wmd.10 for ; Fri, 14 Jul 2017 01:31:17 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id l202si1713012wmb.153.2017.07.14.01.31.16 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 14 Jul 2017 01:31:16 -0700 (PDT) Date: Fri, 14 Jul 2017 09:31:14 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? 
Message-ID: <20170714083114.zhaz3pszrklnrn52@suse.de>
References: <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <1500015641.2865.81.camel@kernel.crashing.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <1500015641.2865.81.camel@kernel.crashing.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Benjamin Herrenschmidt
Cc: Andy Lutomirski , Nadav Amit , linux-mm@kvack.org

On Fri, Jul 14, 2017 at 05:00:41PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2017-07-11 at 15:07 -0700, Andy Lutomirski wrote:
> > On Tue, Jul 11, 2017 at 12:18 PM, Mel Gorman wrote:
> >
> > I would change this slightly:
> >
> > > +void flush_tlb_batched_pending(struct mm_struct *mm)
> > > +{
> > > +	if (mm->tlb_flush_batched) {
> > > +		flush_tlb_mm(mm);
> >
> > How about making this a new helper arch_tlbbatch_flush_one_mm(mm);
> > The idea is that this could be implemented as flush_tlb_mm(mm), but
> > the actual semantics needed are weaker. All that's really needed
> > AFAICS is to make sure that any arch_tlbbatch_add_mm() calls on this
> > mm that have already happened become effective by the time that
> > arch_tlbbatch_flush_one_mm() returns.
>
> Jumping in ... I just discovered that 'new' batching stuff... is it
> documented anywhere ?
>

This should be a new thread. The original commit log has many of the
details and the comments have others. It's clearer what the boundaries are
and what is needed from an architecture with Andy's work on top, which
right now is easier to see from tip/x86/mm.

> We already had some form of batching via the mmu_gather, now there's a
> different somewhat orthogonal and it's completely unclear what it's
> about and why we couldn't use what we already had. Also what
> assumptions it makes if I want to port it to my arch....
>

The batching in this context is more about mm's than individual pages and
was done this way as the number of mm's to track was potentially unbound.
At the time of implementation, tracking individual pages and the extra bits
for mmu_gather was overkill and fairly complex due to the need to
potentially restart when the gather structure filled.

It may also be only a gain on a limited number of architectures depending
on exactly how an architecture handles flushing. At the time, batching this
for x86 in the worst-case scenario where all pages being reclaimed were
mapped from multiple threads knocked 24.4% off elapsed run time and 29% off
system CPU but only on multi-socket NUMA machines. On UMA, it was barely
noticeable. For some workloads where only a few pages are mapped or the
mapped pages on the LRU are relatively sparse, it'll make no difference.

The worst-case situation is extremely IPI intensive on x86 where many IPIs
were being sent for each unmap. It's only worth even considering if you see
that the time spent sending IPIs for flushes is a large portion of reclaim.

--
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-it0-f72.google.com (mail-it0-f72.google.com [209.85.214.72]) by kanga.kvack.org (Postfix) with ESMTP id 3D8D2440905 for ; Fri, 14 Jul 2017 05:03:21 -0400 (EDT) Received: by mail-it0-f72.google.com with SMTP id k192so103000261ith.0 for ; Fri, 14 Jul 2017 02:03:21 -0700 (PDT) Received: from gate.crashing.org (gate.crashing.org. [63.228.1.57]) by mx.google.com with ESMTPS id p132si1672318itb.11.2017.07.14.02.03.19 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 14 Jul 2017 02:03:20 -0700 (PDT) Message-ID: <1500022977.2865.88.camel@kernel.crashing.org> Subject: Re: Potential race in TLB flush batching? From: Benjamin Herrenschmidt Date: Fri, 14 Jul 2017 19:02:57 +1000 In-Reply-To: <20170714083114.zhaz3pszrklnrn52@suse.de> References: <20170711092935.bogdb4oja6v7kilq@suse.de> <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <1500015641.2865.81.camel@kernel.crashing.org> <20170714083114.zhaz3pszrklnrn52@suse.de> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andy Lutomirski , Nadav Amit , linux-mm@kvack.org, "Aneesh Kumar K.V" On Fri, 2017-07-14 at 09:31 +0100, Mel Gorman wrote: > It may also be only a gain on a limited number of architectures depending > on exactly how an architecture handles flushing. At the time, batching > this for x86 in the worse-case scenario where all pages being reclaimed > were mapped from multiple threads knocked 24.4% off elapsed run time and > 29% off system CPU but only on multi-socket NUMA machines. On UMA, it was > barely noticable. For some workloads where only a few pages are mapped or > the mapped pages on the LRU are relatively sparese, it'll make no difference. > > The worst-case situation is extremely IPI intensive on x86 where many > IPIs were being sent for each unmap. It's only worth even considering if > you see that the time spent sending IPIs for flushes is a large portion > of reclaim. Ok, it would be interesting to see how that compares to powerpc with its HW tlb invalidation broadcasts. We tend to hate them and prefer IPIs in most cases but maybe not *this* case .. (mostly we find that IPI + local inval is better for large scale invals, such as full mm on exit/fork etc...). In the meantime I found the original commits, we'll dig and see if it's useful for us. Cheers, Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f197.google.com (mail-wr0-f197.google.com [209.85.128.197]) by kanga.kvack.org (Postfix) with ESMTP id A1A28440905 for ; Fri, 14 Jul 2017 05:27:50 -0400 (EDT) Received: by mail-wr0-f197.google.com with SMTP id g46so10835114wrd.3 for ; Fri, 14 Jul 2017 02:27:50 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id x73si1817680wma.0.2017.07.14.02.27.49 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 14 Jul 2017 02:27:49 -0700 (PDT) Date: Fri, 14 Jul 2017 10:27:47 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? 
Message-ID: <20170714092747.ebytils6c65zporo@suse.de> References: <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <1500015641.2865.81.camel@kernel.crashing.org> <20170714083114.zhaz3pszrklnrn52@suse.de> <1500022977.2865.88.camel@kernel.crashing.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1500022977.2865.88.camel@kernel.crashing.org> Sender: owner-linux-mm@kvack.org List-ID: To: Benjamin Herrenschmidt Cc: Andy Lutomirski , Nadav Amit , linux-mm@kvack.org, "Aneesh Kumar K.V" On Fri, Jul 14, 2017 at 07:02:57PM +1000, Benjamin Herrenschmidt wrote: > On Fri, 2017-07-14 at 09:31 +0100, Mel Gorman wrote: > > It may also be only a gain on a limited number of architectures depending > > on exactly how an architecture handles flushing. At the time, batching > > this for x86 in the worse-case scenario where all pages being reclaimed > > were mapped from multiple threads knocked 24.4% off elapsed run time and > > 29% off system CPU but only on multi-socket NUMA machines. On UMA, it was > > barely noticable. For some workloads where only a few pages are mapped or > > the mapped pages on the LRU are relatively sparese, it'll make no difference. > > > > The worst-case situation is extremely IPI intensive on x86 where many > > IPIs were being sent for each unmap. It's only worth even considering if > > you see that the time spent sending IPIs for flushes is a large portion > > of reclaim. > > Ok, it would be interesting to see how that compares to powerpc with > its HW tlb invalidation broadcasts. We tend to hate them and prefer > IPIs in most cases but maybe not *this* case .. (mostly we find that > IPI + local inval is better for large scale invals, such as full mm on > exit/fork etc...). > > In the meantime I found the original commits, we'll dig and see if it's > useful for us. > I would suggest that it is based on top of Andy's work that is currently in Linus' tree for 4.13-rc1 as the core/arch boundary is a lot clearer. While there is other work pending on top related to mm and generation counters, that is primarily important for addressing the race which ppc64 may not need if you always flush to clear the accessed bit (or equivalent). The main thing to watch for is that if an accessed or young bit is being set for the first time that the arch check the underlying PTE and trap if it's invalid. If that holds and there is a flush when the young bit is cleared then you probably do not need the arch hook that closes the race. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f70.google.com (mail-oi0-f70.google.com [209.85.218.70]) by kanga.kvack.org (Postfix) with ESMTP id 8F3B5440941 for ; Fri, 14 Jul 2017 18:21:27 -0400 (EDT) Received: by mail-oi0-f70.google.com with SMTP id 6so7630390oik.11 for ; Fri, 14 Jul 2017 15:21:27 -0700 (PDT) Received: from mail.kernel.org (mail.kernel.org. 
[198.145.29.99]) by mx.google.com with ESMTPS id f137si7418060oib.237.2017.07.14.15.21.26 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 14 Jul 2017 15:21:26 -0700 (PDT) Received: from mail-vk0-f41.google.com (mail-vk0-f41.google.com [209.85.213.41]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id AFD6222D3B for ; Fri, 14 Jul 2017 22:21:25 +0000 (UTC) Received: by mail-vk0-f41.google.com with SMTP id y70so53292018vky.3 for ; Fri, 14 Jul 2017 15:21:25 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20170714092747.ebytils6c65zporo@suse.de> References: <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <1500015641.2865.81.camel@kernel.crashing.org> <20170714083114.zhaz3pszrklnrn52@suse.de> <1500022977.2865.88.camel@kernel.crashing.org> <20170714092747.ebytils6c65zporo@suse.de> From: Andy Lutomirski Date: Fri, 14 Jul 2017 15:21:03 -0700 Message-ID: Subject: Re: Potential race in TLB flush batching? Content-Type: text/plain; charset="UTF-8" Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Benjamin Herrenschmidt , Andy Lutomirski , Nadav Amit , "linux-mm@kvack.org" , "Aneesh Kumar K.V" On Fri, Jul 14, 2017 at 2:27 AM, Mel Gorman wrote: > On Fri, Jul 14, 2017 at 07:02:57PM +1000, Benjamin Herrenschmidt wrote: >> On Fri, 2017-07-14 at 09:31 +0100, Mel Gorman wrote: >> > It may also be only a gain on a limited number of architectures depending >> > on exactly how an architecture handles flushing. At the time, batching >> > this for x86 in the worse-case scenario where all pages being reclaimed >> > were mapped from multiple threads knocked 24.4% off elapsed run time and >> > 29% off system CPU but only on multi-socket NUMA machines. On UMA, it was >> > barely noticable. For some workloads where only a few pages are mapped or >> > the mapped pages on the LRU are relatively sparese, it'll make no difference. >> > >> > The worst-case situation is extremely IPI intensive on x86 where many >> > IPIs were being sent for each unmap. It's only worth even considering if >> > you see that the time spent sending IPIs for flushes is a large portion >> > of reclaim. >> >> Ok, it would be interesting to see how that compares to powerpc with >> its HW tlb invalidation broadcasts. We tend to hate them and prefer >> IPIs in most cases but maybe not *this* case .. (mostly we find that >> IPI + local inval is better for large scale invals, such as full mm on >> exit/fork etc...). >> >> In the meantime I found the original commits, we'll dig and see if it's >> useful for us. >> > > I would suggest that it is based on top of Andy's work that is currently in > Linus' tree for 4.13-rc1 as the core/arch boundary is a lot clearer. While > there is other work pending on top related to mm and generation counters, > that is primarily important for addressing the race which ppc64 may not > need if you always flush to clear the accessed bit (or equivalent). The > main thing to watch for is that if an accessed or young bit is being set > for the first time that the arch check the underlying PTE and trap if it's > invalid. If that holds and there is a flush when the young bit is cleared > then you probably do not need the arch hook that closes the race. > Ben, if you could read the API in tip:x86/mm + Mel's patch, it would be fantastic. 
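
(For context, the arch-facing surface being discussed is small. The sketch below is a deliberately conservative, hypothetical port: the three hook names, the Kconfig symbol and the cpumask-only batch structure follow Mel's patch and tip/x86/mm, while the local-flush placeholder and the flush_tlb_mm() fallback are only assumptions about what a simple non-x86 implementation might do.)

	/* An architecture opts in with CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
	struct arch_tlbflush_unmap_batch {
		struct cpumask cpumask;	/* CPUs that may cache the unmapped PTEs */
	};

	/* Called from set_tlb_ubc_flush_pending() with the page table lock held */
	static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
						struct mm_struct *mm)
	{
		cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
	}

	static void ipi_flush_local(void *unused)
	{
		/* placeholder for the arch's "drop all local user TLB entries" primitive */
		local_flush_all_user_tlb();
	}

	/* Called from try_to_unmap_flush() before the pages are freed or written back */
	void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
	{
		preempt_disable();
		/* smp_call_function_many() skips the local CPU, so flush it explicitly */
		if (cpumask_test_cpu(smp_processor_id(), &batch->cpumask))
			ipi_flush_local(NULL);
		smp_call_function_many(&batch->cpumask, ipi_flush_local, NULL, 1);
		preempt_enable();
		cpumask_clear(&batch->cpumask);
	}

	/* Called from flush_tlb_batched_pending(); a full mm flush is always sufficient */
	void arch_tlbbatch_flush_one_mm(struct mm_struct *mm)
	{
		flush_tlb_mm(mm);
	}
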
I'd like to know whether a non-x86 non-mm person can understand the API (arch_tlbbatch_add_mm, arch_tlbbatch_flush, and arch_tlbbatch_flush_one_mm) well enough to implement it. I'd also like to know for real that it makes sense outside of x86. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id 4917F440941 for ; Fri, 14 Jul 2017 19:16:49 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id v26so101956532pfa.0 for ; Fri, 14 Jul 2017 16:16:49 -0700 (PDT) Received: from mail-pg0-x242.google.com (mail-pg0-x242.google.com. [2607:f8b0:400e:c05::242]) by mx.google.com with ESMTPS id k33si7770821pld.481.2017.07.14.16.16.47 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 14 Jul 2017 16:16:47 -0700 (PDT) Received: by mail-pg0-x242.google.com with SMTP id j186so12017368pge.1 for ; Fri, 14 Jul 2017 16:16:47 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: Potential race in TLB flush batching? From: Nadav Amit In-Reply-To: <20170713060706.o2cuko5y6irxwnww@suse.de> Date: Fri, 14 Jul 2017 16:16:44 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: References: <20170711132023.wdfpjxwtbqpi3wp2@suse.de> <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> <20170713060706.o2cuko5y6irxwnww@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andy Lutomirski , "open list:MEMORY MANAGEMENT" Mel Gorman wrote: > On Wed, Jul 12, 2017 at 04:27:23PM -0700, Nadav Amit wrote: >>> If reclaim is first, it'll take the PTL, set batched while a racing >>> mprotect/munmap/etc spins. On release, the racing mprotect/munmmap >>> immediately calls flush_tlb_batched_pending() before proceeding as = normal, >>> finding pte_none with the TLB flushed. >>=20 >> This is the scenario I regarded in my example. Notice that when the = first >> flush_tlb_batched_pending is called, CPU0 and CPU1 hold different = page-table >> locks - allowing them to run concurrently. As a result >> flush_tlb_batched_pending is executed before the PTE was cleared and >> mm->tlb_flush_batched is cleared. Later, after CPU0 runs = ptep_get_and_clear >> mm->tlb_flush_batched remains clear, and CPU1 can use the stale PTE. >=20 > If they hold different PTL locks, it means that reclaim and and the = parallel > munmap/mprotect/madvise/mremap operation are operating on different = regions > of an mm or separate mm's and the race should not apply or at the very > least is equivalent to not batching the flushes. For multiple parallel > operations, munmap/mprotect/mremap are serialised by mmap_sem so there > is only one risky operation at a time. For multiple madvise, there is = a > small window when a page is accessible after madvise returns but it is = an > advisory call so it's primarily a data integrity concern and the TLB = is > flushed before the page is either freed or IO starts on the reclaim = side. I think there is some miscommunication. 
Perhaps one detail was missing:

CPU0					CPU1
----					----
should_defer_flush
=> mm->tlb_flush_batched=true
					flush_tlb_batched_pending (another PT)
					=> flush TLB
					=> mm->tlb_flush_batched=false

					Access PTE (and cache in TLB)
ptep_get_and_clear(PTE)
...
					flush_tlb_batched_pending (batched PT)
					[ no flush since tlb_flush_batched=false ]
					use the stale PTE
...
try_to_unmap_flush

There are only 2 CPUs and both regard the same address-space. CPU0 reclaims
a page from this address-space. Just between setting tlb_flush_batched and
the actual clearing of the PTE, the process on CPU1 runs munmap and calls
flush_tlb_batched_pending. This can happen if CPU1 regards a different
page-table.

So CPU1 flushes the TLB and clears the tlb_flush_batched indication. Note,
however, that CPU0 still did not clear the PTE so CPU1 can access this PTE
and cache it. Then, after CPU0 clears the PTE, the process on CPU1 can try
to munmap the region that includes the cleared PTE. However, now it does
not flush the TLB.

> +/*
> + * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when
> + * this returns. Using the current mm tlb_gen means the TLB will be up to date
> + * with respect to the tlb_gen set at arch_tlbbatch_add_mm. If a flush has
> + * happened since then the IPIs will still be sent but the actual flush is
> + * avoided. Unfortunately the IPIs are necessary as the per-cpu context
> + * tlb_gens cannot be safely accessed.
> + */
> +void arch_tlbbatch_flush_one_mm(struct mm_struct *mm)
> +{
> +	int cpu;
> +	struct flush_tlb_info info = {
> +		.mm = mm,
> +		.new_tlb_gen = atomic64_read(&mm->context.tlb_gen),
> +		.start = 0,
> +		.end = TLB_FLUSH_ALL,
> +	};
> +
> +	cpu = get_cpu();
> +
> +	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> +		VM_WARN_ON(irqs_disabled());
> +		local_irq_disable();
> +		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
> +		local_irq_enable();
> +	}
> +
> +	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> +		flush_tlb_others(mm_cpumask(mm), &info);
> +
> +	put_cpu();
> +}
> +

It is a shame that after Andy collapsed all the different flushing flows,
you create a new one. How about squashing this untested one to yours?

-- >8 --

Subject: x86/mm: refactor flush_tlb_mm_range and arch_tlbbatch_flush_one_mm

flush_tlb_mm_range() and arch_tlbbatch_flush_one_mm() share a lot of mutual
code. After the recent work on combining the x86 TLB userspace entries
flushes, it is a shame to break them into different code-paths again.

Refactor the mutual code into perform_tlb_flush().
Signed-off-by: Nadav Amit --- arch/x86/mm/tlb.c | 48 +++++++++++++++++++----------------------------- 1 file changed, 19 insertions(+), 29 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 248063dc5be8..56e00443a6cf 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -404,17 +404,30 @@ void native_flush_tlb_others(const struct cpumask = *cpumask, */ static unsigned long tlb_single_page_flush_ceiling __read_mostly =3D = 33; =20 +static void perform_tlb_flush(struct mm_struct *mm, struct = flush_tlb_info *info) +{ + int cpu =3D get_cpu(); + + if (info->mm =3D=3D this_cpu_read(cpu_tlbstate.loaded_mm)) { + VM_WARN_ON(irqs_disabled()); + local_irq_disable(); + flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN); + local_irq_enable(); + } + + if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) + flush_tlb_others(mm_cpumask(mm), info); + + put_cpu(); +} + void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned long vmflag) { - int cpu; - struct flush_tlb_info info =3D { .mm =3D mm, }; =20 - cpu =3D get_cpu(); - /* This is also a barrier that synchronizes with switch_mm(). */ info.new_tlb_gen =3D inc_mm_tlb_gen(mm); =20 @@ -429,17 +442,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, = unsigned long start, info.end =3D TLB_FLUSH_ALL; } =20 - if (mm =3D=3D this_cpu_read(cpu_tlbstate.loaded_mm)) { - VM_WARN_ON(irqs_disabled()); - local_irq_disable(); - flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN); - local_irq_enable(); - } - - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), &info); - - put_cpu(); + perform_tlb_flush(mm, &info); } =20 =20 @@ -515,7 +518,6 @@ void arch_tlbbatch_flush(struct = arch_tlbflush_unmap_batch *batch) */ void arch_tlbbatch_flush_one_mm(struct mm_struct *mm) { - int cpu; struct flush_tlb_info info =3D { .mm =3D mm, .new_tlb_gen =3D atomic64_read(&mm->context.tlb_gen), @@ -523,19 +525,7 @@ void arch_tlbbatch_flush_one_mm(struct mm_struct = *mm) .end =3D TLB_FLUSH_ALL, }; =20 - cpu =3D get_cpu(); - - if (mm =3D=3D this_cpu_read(cpu_tlbstate.loaded_mm)) { - VM_WARN_ON(irqs_disabled()); - local_irq_disable(); - flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN); - local_irq_enable(); - } - - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), &info); - - put_cpu(); + perform_tlb_flush(mm, &info); } =20 static ssize_t tlbflush_read_file(struct file *file, char __user = *user_buf,= -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f69.google.com (mail-wm0-f69.google.com [74.125.82.69]) by kanga.kvack.org (Postfix) with ESMTP id E0D606B05FE for ; Sat, 15 Jul 2017 11:55:23 -0400 (EDT) Received: by mail-wm0-f69.google.com with SMTP id t3so13507990wme.9 for ; Sat, 15 Jul 2017 08:55:23 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id b7si5373776wma.134.2017.07.15.08.55.20 for (version=TLS1 cipher=AES128-SHA bits=128/128); Sat, 15 Jul 2017 08:55:20 -0700 (PDT) Date: Sat, 15 Jul 2017 16:55:18 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? 
Message-ID: <20170715155518.ok2q62efc2vurqk5@suse.de> References: <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> <20170713060706.o2cuko5y6irxwnww@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Nadav Amit Cc: Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Fri, Jul 14, 2017 at 04:16:44PM -0700, Nadav Amit wrote: > Mel Gorman wrote: > > > On Wed, Jul 12, 2017 at 04:27:23PM -0700, Nadav Amit wrote: > >>> If reclaim is first, it'll take the PTL, set batched while a racing > >>> mprotect/munmap/etc spins. On release, the racing mprotect/munmmap > >>> immediately calls flush_tlb_batched_pending() before proceeding as normal, > >>> finding pte_none with the TLB flushed. > >> > >> This is the scenario I regarded in my example. Notice that when the first > >> flush_tlb_batched_pending is called, CPU0 and CPU1 hold different page-table > >> locks - allowing them to run concurrently. As a result > >> flush_tlb_batched_pending is executed before the PTE was cleared and > >> mm->tlb_flush_batched is cleared. Later, after CPU0 runs ptep_get_and_clear > >> mm->tlb_flush_batched remains clear, and CPU1 can use the stale PTE. > > > > If they hold different PTL locks, it means that reclaim and and the parallel > > munmap/mprotect/madvise/mremap operation are operating on different regions > > of an mm or separate mm's and the race should not apply or at the very > > least is equivalent to not batching the flushes. For multiple parallel > > operations, munmap/mprotect/mremap are serialised by mmap_sem so there > > is only one risky operation at a time. For multiple madvise, there is a > > small window when a page is accessible after madvise returns but it is an > > advisory call so it's primarily a data integrity concern and the TLB is > > flushed before the page is either freed or IO starts on the reclaim side. > > I think there is some miscommunication. Perhaps one detail was missing: > > CPU0 CPU1 > ---- ---- > should_defer_flush > => mm->tlb_flush_batched=true > flush_tlb_batched_pending (another PT) > => flush TLB > => mm->tlb_flush_batched=false > > Access PTE (and cache in TLB) > ptep_get_and_clear(PTE) > ... > > flush_tlb_batched_pending (batched PT) > [ no flush since tlb_flush_batched=false ] > use the stale PTE > ... > try_to_unmap_flush > > There are only 2 CPUs and both regard the same address-space. CPU0 reclaim a > page from this address-space. Just between setting tlb_flush_batch and the > actual clearing of the PTE, the process on CPU1 runs munmap and calls > flush_tlb_batched_pending. This can happen if CPU1 regards a different > page-table. > If both regard the same address-space then they have the same page table so there is a disconnect between the first and last sentence in your paragraph above. On CPU 0, the setting of tlb_flush_batched and ptep_get_and_clear is also reversed as the sequence is pteval = ptep_get_and_clear(mm, address, pvmw.pte); set_tlb_ubc_flush_pending(mm, pte_dirty(pteval)); Additional barriers should not be needed as within the critical section that can race, it's protected by the lock and with Andy's code, there is a full barrier before the setting of tlb_flush_batched. 
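
(A condensed sketch of the ordering described above, for reference; it is pieced together from try_to_unmap_one() and the pte_offset_map_lock() callers the patch touches, not a verbatim excerpt of either.)

	/* Reclaim side (try_to_unmap_one), under the page table lock */
	pteval = ptep_get_and_clear(mm, address, pvmw.pte);	/* clear the PTE first */
	set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));	/* then mark the mm as batched */

	/* mprotect/munmap/madvise/mremap side, under the same page table lock */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	flush_tlb_batched_pending(mm);	/* flush now if reclaim batched one earlier */
	/* ... examine and modify PTEs without trusting a stale batched mapping ... */
	pte_unmap_unlock(pte, ptl);

Because both sides serialise on the same PTL, the racing operation either runs first and finds the PTE still present (so its own flush covers the TLB entry), or runs second and finds tlb_flush_batched set (so flush_tlb_batched_pending() flushes before the stale entry can be relied upon).
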
With Andy's code, there may be a need for a compiler barrier but I can rethink about that and add it during the backport to -stable if necessary. So the setting happens after the clear and if they share the same address space and collide then they both share the same PTL so are protected from each other. If there are separate address spaces using a shared mapping then the same race does not occur. > > +/* > > + * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when > > + * this returns. Using the current mm tlb_gen means the TLB will be up to date > > + * with respect to the tlb_gen set at arch_tlbbatch_add_mm. If a flush has > > + * happened since then the IPIs will still be sent but the actual flush is > > + * avoided. Unfortunately the IPIs are necessary as the per-cpu context > > + * tlb_gens cannot be safely accessed. > > + */ > > +void arch_tlbbatch_flush_one_mm(struct mm_struct *mm) > > +{ > > + int cpu; > > + struct flush_tlb_info info = { > > + .mm = mm, > > + .new_tlb_gen = atomic64_read(&mm->context.tlb_gen), > > + .start = 0, > > + .end = TLB_FLUSH_ALL, > > + }; > > + > > + cpu = get_cpu(); > > + > > + if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { > > + VM_WARN_ON(irqs_disabled()); > > + local_irq_disable(); > > + flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN); > > + local_irq_enable(); > > + } > > + > > + if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) > > + flush_tlb_others(mm_cpumask(mm), &info); > > + > > + put_cpu(); > > +} > > + > > It is a shame that after Andy collapsed all the different flushing flows, > you create a new one. How about squashing this untested one to yours? > The patch looks fine to be but when writing the patch, I wondered why the original code disabled preemption before inc_mm_tlb_gen. I didn't spot the reason for it but given the importance of properly synchronising with switch_mm, I played it safe. 
However, this should be ok on top and maintain the existing sequences diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 248063dc5be8..cbd8621a0bee 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -404,6 +404,21 @@ void native_flush_tlb_others(const struct cpumask *cpumask, */ static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33; +static void flush_tlb_mm_common(struct flush_tlb_info *info, int cpu) +{ + struct mm_struct *mm = info->mm; + + if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { + VM_WARN_ON(irqs_disabled()); + local_irq_disable(); + flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN); + local_irq_enable(); + } + + if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) + flush_tlb_others(mm_cpumask(mm), info); +} + void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned long vmflag) { @@ -429,15 +444,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, info.end = TLB_FLUSH_ALL; } - if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { - VM_WARN_ON(irqs_disabled()); - local_irq_disable(); - flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN); - local_irq_enable(); - } - - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), &info); + flush_tlb_mm_common(&info, cpu); put_cpu(); } @@ -515,7 +522,6 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) */ void arch_tlbbatch_flush_one_mm(struct mm_struct *mm) { - int cpu; struct flush_tlb_info info = { .mm = mm, .new_tlb_gen = atomic64_read(&mm->context.tlb_gen), @@ -523,17 +529,7 @@ void arch_tlbbatch_flush_one_mm(struct mm_struct *mm) .end = TLB_FLUSH_ALL, }; - cpu = get_cpu(); - - if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { - VM_WARN_ON(irqs_disabled()); - local_irq_disable(); - flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN); - local_irq_enable(); - } - - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), &info); + flush_tlb_mm_common(&info, get_cpu()); put_cpu(); } -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f69.google.com (mail-oi0-f69.google.com [209.85.218.69]) by kanga.kvack.org (Postfix) with ESMTP id 15EF96B05FE for ; Sat, 15 Jul 2017 12:41:59 -0400 (EDT) Received: by mail-oi0-f69.google.com with SMTP id 191so8767134oii.4 for ; Sat, 15 Jul 2017 09:41:59 -0700 (PDT) Received: from mail.kernel.org (mail.kernel.org. 
[198.145.29.99]) by mx.google.com with ESMTPS id y23si7737055oia.336.2017.07.15.09.41.58 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 15 Jul 2017 09:41:58 -0700 (PDT) Received: from mail-ua0-f169.google.com (mail-ua0-f169.google.com [209.85.217.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 81DFC22D40 for ; Sat, 15 Jul 2017 16:41:57 +0000 (UTC) Received: by mail-ua0-f169.google.com with SMTP id z22so66317174uah.1 for ; Sat, 15 Jul 2017 09:41:57 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20170715155518.ok2q62efc2vurqk5@suse.de> References: <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> <20170713060706.o2cuko5y6irxwnww@suse.de> <20170715155518.ok2q62efc2vurqk5@suse.de> From: Andy Lutomirski Date: Sat, 15 Jul 2017 09:41:35 -0700 Message-ID: Subject: Re: Potential race in TLB flush batching? Content-Type: text/plain; charset="UTF-8" Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Nadav Amit , Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Sat, Jul 15, 2017 at 8:55 AM, Mel Gorman wrote: > The patch looks fine to be but when writing the patch, I wondered why the > original code disabled preemption before inc_mm_tlb_gen. I didn't spot > the reason for it but given the importance of properly synchronising with > switch_mm, I played it safe. However, this should be ok on top and > maintain the existing sequences LGTM. You could also fold it into your patch or even put it before your patch, too. FWIW, I didn't have any real reason to inc_mm_tlb_gen() with preemption disabled. I think I did it because the code it replaced was also called with preemption off. That being said, it's effectively a single instruction, so it barely matters latency-wise. (Hmm. Would there be a performance downside if a thread got preempted between inc_mm_tlb_gen() and doing the flush? It could arbitrarily delay the IPIs, which would give a big window for something else to flush and maybe make our IPIs unnecessary. Whether that's a win or a loss isn't so clear to me.) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72]) by kanga.kvack.org (Postfix) with ESMTP id 4AFCE6B0279 for ; Mon, 17 Jul 2017 03:49:45 -0400 (EDT) Received: by mail-wm0-f72.google.com with SMTP id b20so18493752wmd.6 for ; Mon, 17 Jul 2017 00:49:45 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id i65si9018606wmg.60.2017.07.17.00.49.43 for (version=TLS1 cipher=AES128-SHA bits=128/128); Mon, 17 Jul 2017 00:49:43 -0700 (PDT) Date: Mon, 17 Jul 2017 08:49:41 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? 
Message-ID: <20170717074941.sti4dqm3ysy5upen@suse.de> References: <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> <20170713060706.o2cuko5y6irxwnww@suse.de> <20170715155518.ok2q62efc2vurqk5@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Nadav Amit , "open list:MEMORY MANAGEMENT" On Sat, Jul 15, 2017 at 09:41:35AM -0700, Andrew Lutomirski wrote: > On Sat, Jul 15, 2017 at 8:55 AM, Mel Gorman wrote: > > The patch looks fine to be but when writing the patch, I wondered why the > > original code disabled preemption before inc_mm_tlb_gen. I didn't spot > > the reason for it but given the importance of properly synchronising with > > switch_mm, I played it safe. However, this should be ok on top and > > maintain the existing sequences > > LGTM. You could also fold it into your patch or even put it before > your patch, too. > Thanks. > FWIW, I didn't have any real reason to inc_mm_tlb_gen() with > preemption disabled. I think I did it because the code it replaced > was also called with preemption off. That being said, it's > effectively a single instruction, so it barely matters latency-wise. > (Hmm. Would there be a performance downside if a thread got preempted > between inc_mm_tlb_gen() and doing the flush? There isn't a preemption point until the point where irqs are disabled/enabled for the local TLB flush so it doesn't really matter. It can still be preempted by an interrupt but that's not surprising. I don't think it matters that much either way so I'll leave it at it is. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id 310DB6B0292 for ; Tue, 18 Jul 2017 17:28:33 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id s64so31855732pfa.1 for ; Tue, 18 Jul 2017 14:28:33 -0700 (PDT) Received: from mail-pg0-x244.google.com (mail-pg0-x244.google.com. [2607:f8b0:400e:c05::244]) by mx.google.com with ESMTPS id b60si2590315plc.594.2017.07.18.14.28.31 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 18 Jul 2017 14:28:31 -0700 (PDT) Received: by mail-pg0-x244.google.com with SMTP id v190so4319231pgv.1 for ; Tue, 18 Jul 2017 14:28:31 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: Potential race in TLB flush batching? 
From: Nadav Amit In-Reply-To: <20170715155518.ok2q62efc2vurqk5@suse.de> Date: Tue, 18 Jul 2017 14:28:27 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: References: <20170711155312.637eyzpqeghcgqzp@suse.de> <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> <20170713060706.o2cuko5y6irxwnww@suse.de> <20170715155518.ok2q62efc2vurqk5@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andy Lutomirski , "open list:MEMORY MANAGEMENT" Mel Gorman wrote: > On Fri, Jul 14, 2017 at 04:16:44PM -0700, Nadav Amit wrote: >> Mel Gorman wrote: >>=20 >>> On Wed, Jul 12, 2017 at 04:27:23PM -0700, Nadav Amit wrote: >>>>> If reclaim is first, it'll take the PTL, set batched while a = racing >>>>> mprotect/munmap/etc spins. On release, the racing mprotect/munmmap >>>>> immediately calls flush_tlb_batched_pending() before proceeding as = normal, >>>>> finding pte_none with the TLB flushed. >>>>=20 >>>> This is the scenario I regarded in my example. Notice that when the = first >>>> flush_tlb_batched_pending is called, CPU0 and CPU1 hold different = page-table >>>> locks - allowing them to run concurrently. As a result >>>> flush_tlb_batched_pending is executed before the PTE was cleared = and >>>> mm->tlb_flush_batched is cleared. Later, after CPU0 runs = ptep_get_and_clear >>>> mm->tlb_flush_batched remains clear, and CPU1 can use the stale = PTE. >>>=20 >>> If they hold different PTL locks, it means that reclaim and and the = parallel >>> munmap/mprotect/madvise/mremap operation are operating on different = regions >>> of an mm or separate mm's and the race should not apply or at the = very >>> least is equivalent to not batching the flushes. For multiple = parallel >>> operations, munmap/mprotect/mremap are serialised by mmap_sem so = there >>> is only one risky operation at a time. For multiple madvise, there = is a >>> small window when a page is accessible after madvise returns but it = is an >>> advisory call so it's primarily a data integrity concern and the TLB = is >>> flushed before the page is either freed or IO starts on the reclaim = side. >>=20 >> I think there is some miscommunication. Perhaps one detail was = missing: >>=20 >> CPU0 CPU1 >> ---- ---- >> should_defer_flush >> =3D> mm->tlb_flush_batched=3Dtrue =09 >> flush_tlb_batched_pending (another PT) >> =3D> flush TLB >> =3D> mm->tlb_flush_batched=3Dfalse >>=20 >> Access PTE (and cache in TLB) >> ptep_get_and_clear(PTE) >> ... >>=20 >> flush_tlb_batched_pending (batched PT) >> [ no flush since tlb_flush_batched=3Dfalse= ] >> use the stale PTE >> ... >> try_to_unmap_flush >>=20 >> There are only 2 CPUs and both regard the same address-space. CPU0 = reclaim a >> page from this address-space. Just between setting tlb_flush_batch = and the >> actual clearing of the PTE, the process on CPU1 runs munmap and calls >> flush_tlb_batched_pending. This can happen if CPU1 regards a = different >> page-table. >=20 > If both regard the same address-space then they have the same page = table so > there is a disconnect between the first and last sentence in your = paragraph > above. 
On CPU 0, the setting of tlb_flush_batched and = ptep_get_and_clear > is also reversed as the sequence is >=20 > pteval =3D ptep_get_and_clear(mm, address, = pvmw.pte); > set_tlb_ubc_flush_pending(mm, = pte_dirty(pteval)); >=20 > Additional barriers should not be needed as within the critical = section > that can race, it's protected by the lock and with Andy's code, there = is > a full barrier before the setting of tlb_flush_batched. With Andy's = code, > there may be a need for a compiler barrier but I can rethink about = that > and add it during the backport to -stable if necessary. >=20 > So the setting happens after the clear and if they share the same = address > space and collide then they both share the same PTL so are protected = from > each other. >=20 > If there are separate address spaces using a shared mapping then the > same race does not occur. I missed the fact you reverted the two operations since the previous = version of the patch. This specific scenario should be solved with this patch. But in general, I think there is a need for a simple locking scheme. Otherwise, people (like me) would be afraid to make any changes to the = code, and additional missing TLB flushes would exist. For example, I suspect = that a user may trigger insert_pfn() or insert_page(), and rely on their = output. While it makes little sense, the user can try to insert the page on the = same address of another page. If the other page was already reclaimed the operation should succeed and otherwise fail. But it may succeed while = the other page is going through reclamation, resulting in: CPU0 CPU1 ---- ---- =09 ptep_clear_flush_notify() - access memory using a PTE [ PTE cached in TLB ] try_to_unmap_one() =3D=3D> ptep_get_and_clear() =3D=3D= false insert_page() =3D=3D> pte_none() =3D true [retval =3D 0] - access memory using a stale PTE Additional potential situations can be caused, IIUC, by = mcopy_atomic_pte(), mfill_zeropage_pte(), shmem_mcopy_atomic_pte(). Even more importantly, I suspect there is an additional similar but unrelated problem. clear_refs_write() can be used with = CLEAR_REFS_SOFT_DIRTY to write-protect PTEs. However, it batches TLB flushes, while only = holding mmap_sem for read, and without any indication in mm that TLB flushes are pending. As a result, concurrent operation such as KSM=E2=80=99s = write_protect_page() or page_mkclean_one() can consider the page write-protected while in fact = it is still accessible - since the TLB flush was deferred. As a result, they = may mishandle the PTE without flushing the page. In the case of page_mkclean_one(), I suspect it may even lead to memory corruption. I = admit that in x86 there are some mitigating factors that would make such = =E2=80=9Cattack=E2=80=9D complicated, but it still seems wrong to me, no? Thanks, Nadav -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72]) by kanga.kvack.org (Postfix) with ESMTP id 9D93B6B0279 for ; Wed, 19 Jul 2017 03:41:34 -0400 (EDT) Received: by mail-wm0-f72.google.com with SMTP id k69so1161876wmc.14 for ; Wed, 19 Jul 2017 00:41:34 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id b137si4536723wmf.61.2017.07.19.00.41.32 for (version=TLS1 cipher=AES128-SHA bits=128/128); Wed, 19 Jul 2017 00:41:32 -0700 (PDT) Date: Wed, 19 Jul 2017 08:41:31 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? Message-ID: <20170719074131.75wexoal3fiyoxw5@suse.de> References: <20170711191823.qthrmdgqcd3rygjk@suse.de> <20170711200923.gyaxfjzz3tpvreuq@suse.de> <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> <20170713060706.o2cuko5y6irxwnww@suse.de> <20170715155518.ok2q62efc2vurqk5@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Nadav Amit Cc: Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Tue, Jul 18, 2017 at 02:28:27PM -0700, Nadav Amit wrote: > > If there are separate address spaces using a shared mapping then the > > same race does not occur. > > I missed the fact you reverted the two operations since the previous version > of the patch. This specific scenario should be solved with this patch. > > But in general, I think there is a need for a simple locking scheme. Such as? > Otherwise, people (like me) would be afraid to make any changes to the code, > and additional missing TLB flushes would exist. For example, I suspect that > a user may trigger insert_pfn() or insert_page(), and rely on their output. That API is for device drivers to insert pages (which may not be RAM) directly into userspace and the pages are not on the LRU so not subject to the same races. > While it makes little sense, the user can try to insert the page on the same > address of another page. Even if a drivers was dumb enough to do so, the second insert should fail on a !pte_none() test. > If the other page was already reclaimed the > operation should succeed and otherwise fail. But it may succeed while the > other page is going through reclamation, resulting in: > It doesn't go through reclaim as the page isn't on the LRU until the last mmap or the driver frees the page. > CPU0 CPU1 > ---- ---- > ptep_clear_flush_notify() > - access memory using a PTE > [ PTE cached in TLB ] > try_to_unmap_one() > ==> ptep_get_and_clear() == false > insert_page() > ==> pte_none() = true > [retval = 0] > > - access memory using a stale PTE That race assumes that the page was on the LRU and the VMAs in question are VM_MIXEDMAP or VM_PFNMAP. If the region is unmapped and a new mapping put in place, the last patch ensures the region is flushed. > Additional potential situations can be caused, IIUC, by mcopy_atomic_pte(), > mfill_zeropage_pte(), shmem_mcopy_atomic_pte(). > I didn't dig into the exact locking for userfaultfd because largely it doesn't matter. The operations are copy operations which means that any stale TLB is being used to read data only. If the page is reclaimed then a fault is raised. If data is read for a short duration before the TLB flush then it still doesn't matter because there is no data integrity issue. The TLB will be flushed if an operation occurs that could leak the wrong data. > Even more importantly, I suspect there is an additional similar but > unrelated problem. clear_refs_write() can be used with CLEAR_REFS_SOFT_DIRTY > to write-protect PTEs. However, it batches TLB flushes, while only holding > mmap_sem for read, and without any indication in mm that TLB flushes are > pending. 

Again, consider whether there is a data integrity issue. A TLB entry
existing after an unmap is not in itself dangerous. There is always some
degree of race between when a PTE is unmapped and the IPIs for the flush
are delivered.

> As a result, concurrent operation such as KSM's write_protect_page() or

write_protect_page operates under the page lock and cannot race with reclaim.

> page_mkclean_one() can consider the page write-protected while in fact it is
> still accessible - since the TLB flush was deferred.

As long as it's flushed before any IO occurs that would lose a data update,
it's not a data integrity issue.

> As a result, they may
> mishandle the PTE without flushing the page. In the case of
> page_mkclean_one(), I suspect it may even lead to memory corruption. I admit
> that in x86 there are some mitigating factors that would make such "attack"
> complicated, but it still seems wrong to me, no?
>

I worry that you're beginning to see races everywhere. I admit that the
rules and protections here are varied and complex but it's worth keeping
in mind that data integrity is the key concern (no false reads to wrong
data, no lost writes) and the first race you identified found some problems
here. However, with or without batching, there is always a delay between
when a PTE is cleared and when the TLB entries are removed.

--
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nadav Amit
Date: Wed, 19 Jul 2017 12:41:01 -0700
Subject: Re: Potential race in TLB flush batching?
In-Reply-To: <20170719074131.75wexoal3fiyoxw5@suse.de>
To: Mel Gorman
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

Mel Gorman wrote:

> On Tue, Jul 18, 2017 at 02:28:27PM -0700, Nadav Amit wrote:
>>> If there are separate address spaces using a shared mapping then the
>>> same race does not occur.
>>
>> I missed the fact you reverted the two operations since the previous version
>> of the patch. This specific scenario should be solved with this patch.
>>
>> But in general, I think there is a need for a simple locking scheme.
>
> Such as?

Something like:

	bool is_potentially_stale_pte(pte_t pte, pgprot_t prot, int lock_state);

which would get the current PTE, the protection bits that the user is
interested in, and whether mmap_sem is taken read/write/none.

It would return whether this PTE may be potentially stale and needs to be
invalidated. Obviously, any code that removes protection or unmaps needs to
be updated for this information to be correct.

[snip]

>> As a result, concurrent operation such as KSM's write_protect_page() or
>
> write_protect_page operates under the page lock and cannot race with reclaim.

I still do not understand this claim. IIUC, reclaim can unmap the page in
some page table, decide not to reclaim the page and release the page-lock
before flush.

>> page_mkclean_one() can consider the page write-protected while in fact it is
>> still accessible - since the TLB flush was deferred.
>
> As long as it's flushed before any IO occurs that would lose a data update,
> it's not a data integrity issue.
>
>> As a result, they may
>> mishandle the PTE without flushing the page. In the case of
>> page_mkclean_one(), I suspect it may even lead to memory corruption. I admit
>> that in x86 there are some mitigating factors that would make such "attack"
>> complicated, but it still seems wrong to me, no?
>
> I worry that you're beginning to see races everywhere. I admit that the
> rules and protections here are varied and complex but it's worth keeping
> in mind that data integrity is the key concern (no false reads to wrong
> data, no lost writes) and the first race you identified found some problems
> here. However, with or without batching, there is always a delay between
> when a PTE is cleared and when the TLB entries are removed.

Sure, but usually the delay occurs while the page-table lock is taken so
there is no race.

Now, it is not fair to call me paranoid, considering that these races are
real - I confirmed that at least two can happen in practice. There are many
possibilities for concurrent TLB batching and you cannot expect developers
to consider all of them. I don't think many people are capable of doing the
voodoo tricks of avoiding a TLB flush if the page-lock is taken or the VMA
is anonymous. I doubt that these tricks work and anyhow IMHO they are likely
to fail in the future since they are undocumented and complicated.

As for "data integrity is the key concern" - violating the memory management
API can cause data integrity issues for programs. It may not cause the OS to
crash, but it should not be acceptable either, and may potentially raise
security concerns. If you think that the current behavior is ok, let the
documentation and man pages clarify that mprotect may not protect, madvise
may not advise and so on.

And although you would use it against me, I would say: Nobody knew that TLB
flushing could be so complicated.
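Purely as an illustration of the interface proposed at the top of this mail,
a sketch of how it might look and be called; apart from the proposed name
and arguments, everything here (the simplified types, the lock_state values,
the stub logic, the hypothetical flush helper in the usage comment) is
invented.

/* Illustrative sketch only; simplified stand-ins, not kernel types. */
#include <stdbool.h>

typedef unsigned long pte_t;
typedef unsigned long pgprot_t;

enum { MMAP_SEM_NONE, MMAP_SEM_READ, MMAP_SEM_WRITE };  /* hypothetical states */

#define PROT_BIT_WRITE 0x2UL                            /* invented bit for the demo */

bool is_potentially_stale_pte(pte_t pte, pgprot_t prot, int lock_state)
{
        /*
         * A real implementation would consult per-mm "flush pending" state
         * maintained by every path that clears or demotes PTEs while
         * deferring the flush.  This stub only errs on the safe side.
         */
        if (lock_state != MMAP_SEM_WRITE)
                return true;            /* a concurrent batched unmap may be in flight */

        return (pte & prot) != prot;    /* PTE no longer grants prot, but a TLB entry might */
}

/*
 * Intended use, e.g. in a page_mkclean_one()-style caller that is about to
 * rely on the write bit being clear:
 *
 *      if (is_potentially_stale_pte(pte, PROT_BIT_WRITE, MMAP_SEM_NONE))
 *              flush_tlb_for(pte);     // hypothetical flush helper
 */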
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mel Gorman
Date: Wed, 19 Jul 2017 20:58:20 +0100
Subject: Re: Potential race in TLB flush batching?
Message-ID: <20170719195820.drtfmweuhdc4eca6@suse.de>
To: Nadav Amit
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

On Wed, Jul 19, 2017 at 12:41:01PM -0700, Nadav Amit wrote:
> Mel Gorman wrote:
> [snip]
> > Such as?
>
> Something like:
>
> 	bool is_potentially_stale_pte(pte_t pte, pgprot_t prot, int lock_state);
>
> which would get the current PTE, the protection bits that the user is
> interested in, and whether mmap_sem is taken read/write/none.
>

From a PTE you cannot know the state of mmap_sem because you can rmap
back to multiple mm's for shared mappings. It's also fairly heavy handed.
Technically, you could lock on the basis of the VMA but that has other
consequences for scalability. The staleness is also a factor because
it's a case of "does the staleness matter". Sometimes it does, sometimes
it doesn't. mmap_sem even if it could be used does not always tell us
the right information either because it can matter whether we are racing
against a userspace reference or a kernel operation.

It's possible your idea could be made work, but right now I'm not seeing a
solution that handles every corner case. I asked to hear what your ideas
were because anything I thought of that could batch TLB flushing in the
general case had flaws that did not improve over what is already there.

> [snip]
>
> >> As a result, concurrent operation such as KSM's write_protect_page() or
> >
> > write_protect_page operates under the page lock and cannot race with reclaim.
>
> I still do not understand this claim. IIUC, reclaim can unmap the page in
> some page table, decide not to reclaim the page and release the page-lock
> before flush.
>

shrink_page_list is the caller of try_to_unmap in reclaim context. It
has this check

	if (!trylock_page(page))
		goto keep;

For pages it cannot lock, they get put back on the LRU and recycled instead
of reclaimed. Hence, if KSM or anything else holds the page lock, reclaim
can't unmap it.

> >> page_mkclean_one() can consider the page write-protected while in fact it is
> >> still accessible - since the TLB flush was deferred.
> >
> > As long as it's flushed before any IO occurs that would lose a data update,
> > it's not a data integrity issue.
> >
> >> As a result, they may
> >> mishandle the PTE without flushing the page. In the case of
> >> page_mkclean_one(), I suspect it may even lead to memory corruption.
> >> I admit that in x86 there are some mitigating factors that would make
> >> such "attack" complicated, but it still seems wrong to me, no?
> >
> > I worry that you're beginning to see races everywhere. [snip]
> > However, with or without batching, there is always a delay between
> > when a PTE is cleared and when the TLB entries are removed.
>
> Sure, but usually the delay occurs while the page-table lock is taken so
> there is no race.
>
> Now, it is not fair to call me paranoid, considering that these races are
> real - I confirmed that at least two can happen in practice.

It's less an accusation of paranoia and more a caution that the fact that
pte_clear_flush is not atomic means that it can be difficult to find what
races matter and what ones don't.

> As for "data integrity is the key concern" - violating the memory management
> API can cause data integrity issues for programs.

The madvise one should be fixed too. It could also be "fixed" by
removing all batching but the performance cost will be sufficiently high
that there will be pressure to find an alternative.

> It may not cause the OS to
> crash, but it should not be acceptable either, and may potentially raise
> security concerns. If you think that the current behavior is ok, let the
> documentation and man pages clarify that mprotect may not protect, madvise
> may not advise and so on.
>

The madvise one should be fixed, not least because it allows a case
whereby userspace thinks it has initialised a structure that is actually
in a page that is freed after a TLB is flushed, resulting in a lost
write. It wouldn't cause any issues with shared or file-backed mappings
but it is a problem for anonymous.

> And although you would use it against me, I would say: Nobody knew that TLB
> flushing could be so complicated.
>

There is no question that the area is complicated.

--
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nadav Amit
Date: Wed, 19 Jul 2017 13:20:01 -0700
Subject: Re: Potential race in TLB flush batching?
Message-Id: <4BD983A1-724B-4FD7-B502-55351717BC5F@gmail.com>
In-Reply-To: <20170719195820.drtfmweuhdc4eca6@suse.de>
To: Mel Gorman
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

Mel Gorman wrote:

> On Wed, Jul 19, 2017 at 12:41:01PM -0700, Nadav Amit wrote:
>> Something like:
>>
>> 	bool is_potentially_stale_pte(pte_t pte, pgprot_t prot, int lock_state);
>>
>> which would get the current PTE, the protection bits that the user is
>> interested in, and whether mmap_sem is taken read/write/none.
>
> From a PTE you cannot know the state of mmap_sem because you can rmap
> back to multiple mm's for shared mappings. It's also fairly heavy handed.
> Technically, you could lock on the basis of the VMA but that has other
> consequences for scalability. The staleness is also a factor because
> it's a case of "does the staleness matter". Sometimes it does, sometimes
> it doesn't. mmap_sem even if it could be used does not always tell us
> the right information either because it can matter whether we are racing
> against a userspace reference or a kernel operation.
>
> It's possible your idea could be made work, but right now I'm not seeing a
> solution that handles every corner case. I asked to hear what your ideas
> were because anything I thought of that could batch TLB flushing in the
> general case had flaws that did not improve over what is already there.

I don't disagree with what you say - perhaps my scheme is too simplistic.
But the bottom line, if you cannot form simple rules for when TLB needs to
be flushed, what are the chances others would get it right?

> [snip]
>
>>>> As a result, concurrent operation such as KSM's write_protect_page() or
>>>
>>> write_protect_page operates under the page lock and cannot race with reclaim.
>>
>> I still do not understand this claim. IIUC, reclaim can unmap the page in
>> some page table, decide not to reclaim the page and release the page-lock
>> before flush.
>
> shrink_page_list is the caller of try_to_unmap in reclaim context. It
> has this check
>
> 	if (!trylock_page(page))
> 		goto keep;
>
> For pages it cannot lock, they get put back on the LRU and recycled instead
> of reclaimed. Hence, if KSM or anything else holds the page lock, reclaim
> can't unmap it.

Yes, of course, since KSM does not batch TLB flushes. I regarded the other
direction - first try_to_unmap() removes the PTE (but still does not flush),
unlocks the page, and then KSM acquires the page lock and calls
write_protect_page().
It finds out the PTE is not present and does not flush the TLB.

[snip]

>> And although you would use it against me, I would say: Nobody knew that TLB
>> flushing could be so complicated.
>
> There is no question that the area is complicated.

My comment was actually an unfunny joke... Never mind.

Thanks,
Nadav

p.s.: Thanks for your patience.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mel Gorman
Date: Wed, 19 Jul 2017 22:47:08 +0100
Subject: Re: Potential race in TLB flush batching?
Message-ID: <20170719214708.wuzq3di6rt43txtn@suse.de>
In-Reply-To: <4BD983A1-724B-4FD7-B502-55351717BC5F@gmail.com>
To: Nadav Amit
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

On Wed, Jul 19, 2017 at 01:20:01PM -0700, Nadav Amit wrote:
> > From a PTE you cannot know the state of mmap_sem because you can rmap
> > back to multiple mm's for shared mappings. It's also fairly heavy handed.
> > Technically, you could lock on the basis of the VMA but that has other
> > consequences for scalability. The staleness is also a factor because
> > it's a case of "does the staleness matter". Sometimes it does, sometimes
> > it doesn't. mmap_sem even if it could be used does not always tell us
> > the right information either because it can matter whether we are racing
> > against a userspace reference or a kernel operation.
> >
> > It's possible your idea could be made work, but right now I'm not seeing a
> > solution that handles every corner case. I asked to hear what your ideas
> > were because anything I thought of that could batch TLB flushing in the
> > general case had flaws that did not improve over what is already there.
>
> I don't disagree with what you say - perhaps my scheme is too simplistic.
> But the bottom line, if you cannot form simple rules for when TLB needs to
> be flushed, what are the chances others would get it right?
>

Broad rule is "flush before the page is freed/reallocated for clean pages
or any IO is initiated for dirty pages" with a lot of details that are not
documented. Often it's the PTL and flush with it held that protects the
majority of cases but it's not universal as the page lock and mmap_sem
play important roles depending on the context and AFAIK, that's also
not documented.

> > shrink_page_list is the caller of try_to_unmap in reclaim context. It
> > has this check
> >
> > 	if (!trylock_page(page))
> > 		goto keep;
> >
> > For pages it cannot lock, they get put back on the LRU and recycled instead
> > of reclaimed. Hence, if KSM or anything else holds the page lock, reclaim
> > can't unmap it.
>
> Yes, of course, since KSM does not batch TLB flushes. I regarded the other
> direction - first try_to_unmap() removes the PTE (but still does not flush),
> unlocks the page, and then KSM acquires the page lock and calls
> write_protect_page(). It finds out the PTE is not present and does not flush
> the TLB.
>

When KSM acquires the page lock, it then acquires the PTL where the
cleared PTE is observed directly and skipped.

> > There is no question that the area is complicated.
>
> My comment was actually an unfunny joke... Never mind.
>
> Thanks,
> Nadav
>
> p.s.: Thanks for your patience.
>

No need for thanks. As you pointed out yourself, you have been identifying
races.

--
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nadav Amit
Date: Wed, 19 Jul 2017 15:19:00 -0700
Subject: Re: Potential race in TLB flush batching?
Message-Id: <3D1386AD-7875-40B9-8C6F-DE02CF8A45A1@gmail.com>
In-Reply-To: <20170719214708.wuzq3di6rt43txtn@suse.de>
To: Mel Gorman
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

Mel Gorman wrote:

> On Wed, Jul 19, 2017 at 01:20:01PM -0700, Nadav Amit wrote:
> [snip]
>> Yes, of course, since KSM does not batch TLB flushes. I regarded the other
>> direction - first try_to_unmap() removes the PTE (but still does not flush),
>> unlocks the page, and then KSM acquires the page lock and calls
>> write_protect_page(). It finds out the PTE is not present and does not flush
>> the TLB.
>
> When KSM acquires the page lock, it then acquires the PTL where the
> cleared PTE is observed directly and skipped.

I don't see why. Let's try again - CPU0 reclaims while CPU1 deduplicates:

  CPU0                                  CPU1
  ----                                  ----
  shrink_page_list()

  => try_to_unmap()
  ==> try_to_unmap_one()
      [ unmaps from some page-tables ]

  [ try_to_unmap returns false;
    page not reclaimed ]

  => keep_locked: unlock_page()

  [ TLB flush deferred ]
                                        try_to_merge_one_page()
                                        => trylock_page()
                                        => write_protect_page()
                                        ==> acquire ptl
                                            [ PTE non-present -> no PTE change
                                              and no flush ]
                                        ==> release ptl
                                        ==> replace_page()

At this point, while replace_page() is running, CPU0 may still not have
flushed the TLBs. Another CPU (CPU2) may hold a stale PTE, which is not
write-protected. It can therefore write to that page while replace_page() is
running, resulting in memory corruption.

No?

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mel Gorman
Date: Wed, 19 Jul 2017 23:59:50 +0100
Subject: Re: Potential race in TLB flush batching?
Message-ID: <20170719225950.wfpfzpc6llwlyxdo@suse.de>
In-Reply-To: <3D1386AD-7875-40B9-8C6F-DE02CF8A45A1@gmail.com>
To: Nadav Amit
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

On Wed, Jul 19, 2017 at 03:19:00PM -0700, Nadav Amit wrote:
> >> Yes, of course, since KSM does not batch TLB flushes. I regarded the other
> >> direction - first try_to_unmap() removes the PTE (but still does not flush),
> >> unlocks the page, and then KSM acquires the page lock and calls
> >> write_protect_page(). It finds out the PTE is not present and does not flush
> >> the TLB.
> >
> > When KSM acquires the page lock, it then acquires the PTL where the
> > cleared PTE is observed directly and skipped.
>
> I don't see why.
> Let's try again - CPU0 reclaims while CPU1 deduplicates:
>
>   CPU0                                  CPU1
>   ----                                  ----
>   shrink_page_list()
>
>   => try_to_unmap()
>   ==> try_to_unmap_one()
>       [ unmaps from some page-tables ]
>
>   [ try_to_unmap returns false;
>     page not reclaimed ]
>
>   => keep_locked: unlock_page()
>
>   [ TLB flush deferred ]
>                                         try_to_merge_one_page()
>                                         => trylock_page()
>                                         => write_protect_page()
>                                         ==> acquire ptl
>                                             [ PTE non-present -> no PTE change
>                                               and no flush ]
>                                         ==> release ptl
>                                         ==> replace_page()
>
> At this point, while replace_page() is running, CPU0 may still not have
> flushed the TLBs. Another CPU (CPU2) may hold a stale PTE, which is not
> write-protected. It can therefore write to that page while replace_page() is
> running, resulting in memory corruption.
>
> No?
>

KSM is not my strong point so it's reaching the point where others more
familiar with that code need to be involved.

If try_to_unmap returns false on CPU0 then at least one unmap attempt
failed and the page is not reclaimed. For those that were unmapped, they
will get flushed in the near future. When KSM operates on CPU1, it'll skip
the unmapped pages under the PTL so stale TLB entries are not relevant as
the mapped entries are still pointing to a valid page and ksm misses a merge
opportunity.

If it write protects a page, ksm unconditionally flushes the PTE
on clearing the PTE so again, there is no stale entry anywhere. For CPU2,
it'll either reference a PTE that was unmapped in which case it'll fault
once CPU0 flushes the TLB and until then it's safe to read and write as
long as the TLB is flushed before the page is freed or IO is initiated which
reclaim already handles. If CPU2 references a page that was still mapped
then it'll be fine until KSM unmaps and flushes the page before going
further so any reference after KSM starts the critical operation will trap
a fault.

--
Mel Gorman
SUSE Labs
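The distinction drawn above - KSM flushing at the moment it clears a PTE
versus reclaim clearing PTEs and deferring the flush - in sketch form. All
types and helpers here are stand-ins; the real kernel counterparts are only
named in the comments.

/* Sketch only; not kernel code. */
typedef unsigned long pte_t;
struct mm_struct { int dummy; };

static void local_flush(pte_t *ptep) { (void)ptep; }                  /* flush_tlb_page() stand-in */
static void queue_deferred_flush(struct mm_struct *mm) { (void)mm; }  /* set_tlb_ubc_flush_pending() stand-in */

/* KSM-style: the PTE change and the flush are one step, under the PTL. */
static pte_t clear_pte_synchronously(pte_t *ptep)
{
        pte_t old = *ptep;              /* ptep_clear_flush() analogue */
        *ptep = 0;
        local_flush(ptep);              /* no window in which a stale entry matters */
        return old;
}

/* Reclaim-style: the PTE is cleared but the flush is batched for later. */
static pte_t clear_pte_batched(struct mm_struct *mm, pte_t *ptep)
{
        pte_t old = *ptep;              /* ptep_get_and_clear() analogue */
        *ptep = 0;
        queue_deferred_flush(mm);       /* stale entries may survive until the batch flush */
        return old;
}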

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nadav Amit
Date: Wed, 19 Jul 2017 16:39:07 -0700
Subject: Re: Potential race in TLB flush batching?
Message-Id: <4DC97890-9FFA-4BA4-B300-B679BAB2136D@gmail.com>
In-Reply-To: <20170719225950.wfpfzpc6llwlyxdo@suse.de>
To: Mel Gorman
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

Mel Gorman wrote:

> On Wed, Jul 19, 2017 at 03:19:00PM -0700, Nadav Amit wrote:
> [snip]
>
> KSM is not my strong point so it's reaching the point where others more
> familiar with that code need to be involved.

Do not assume for a second that I really know what is going on over there.

> If try_to_unmap returns false on CPU0 then at least one unmap attempt
> failed and the page is not reclaimed.

Actually, try_to_unmap() may even return true, and the page would still not
be reclaimed - for example if page_has_private() and freeing the buffers
fails. In this case, the page would be unlocked as well.

> For those that were unmapped, they
> will get flushed in the near future. When KSM operates on CPU1, it'll skip
> the unmapped pages under the PTL so stale TLB entries are not relevant as
> the mapped entries are still pointing to a valid page and ksm misses a merge
> opportunity.

This is the case I regarded, but I do not understand your point. The whole
problem is that CPU1 would skip the unmapped pages under the PTL. As it
skips them it does not flush them from the TLB. And as a result,
replace_page() may happen before the TLB is flushed by CPU0.

> If it write protects a page, ksm unconditionally flushes the PTE
> on clearing the PTE so again, there is no stale entry anywhere.
> For CPU2,
> it'll either reference a PTE that was unmapped in which case it'll fault
> once CPU0 flushes the TLB and until then it's safe to read and write as
> long as the TLB is flushed before the page is freed or IO is initiated which
> reclaim already handles.

In my scenario the page is not freed and there is no I/O in the reclaim
path. The TLB flush of CPU0 in my scenario is just deferred while the
page-table lock is not held. As I mentioned before, this time-period can be
potentially very long in a virtual machine. CPU2 referenced a PTE that
was unmapped by CPU0 (reclaim path) but not CPU1 (ksm path).

ksm, IIUC, would not expect modifications of the page during replace_page.
Eventually it would flush the TLB (after changing the PTE to point to the
deduplicated page). But in the meanwhile, another CPU may use stale PTEs for
writes, and those writes would be lost after the page is deduplicated.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mel Gorman
Date: Thu, 20 Jul 2017 08:43:42 +0100
Subject: Re: Potential race in TLB flush batching?
Message-ID: <20170720074342.otez35bme5gytnxl@suse.de>
In-Reply-To: <4DC97890-9FFA-4BA4-B300-B679BAB2136D@gmail.com>
To: Nadav Amit
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

On Wed, Jul 19, 2017 at 04:39:07PM -0700, Nadav Amit wrote:
> > If try_to_unmap returns false on CPU0 then at least one unmap attempt
> > failed and the page is not reclaimed.
>
> Actually, try_to_unmap() may even return true, and the page would still not
> be reclaimed - for example if page_has_private() and freeing the buffers
> fails. In this case, the page would be unlocked as well.
>

I'm not seeing the relevance from the perspective of a stale TLB being
used to corrupt memory or access the wrong data.

> > For those that were unmapped, they
> > will get flushed in the near future. When KSM operates on CPU1, it'll skip
> > the unmapped pages under the PTL so stale TLB entries are not relevant as
> > the mapped entries are still pointing to a valid page and ksm misses a merge
> > opportunity.
>
> This is the case I regarded, but I do not understand your point. The whole
> problem is that CPU1 would skip the unmapped pages under the PTL. As it
> skips them it does not flush them from the TLB. And as a result,
> replace_page() may happen before the TLB is flushed by CPU0.
>

At the time of the unlock_page on the reclaim side, any unmapping that
will happen before the flush has taken place. If KSM starts between the
unlock_page and the tlb flush then it'll skip any of the PTEs that were
previously unmapped with stale entries so there is no relevant stale TLB
entry to work with.

> > If it write protects a page, ksm unconditionally flushes the PTE
> > on clearing the PTE so again, there is no stale entry anywhere. For CPU2,
> > it'll either reference a PTE that was unmapped in which case it'll fault
> > once CPU0 flushes the TLB and until then it's safe to read and write as
> > long as the TLB is flushed before the page is freed or IO is initiated which
> > reclaim already handles.
>
> In my scenario the page is not freed and there is no I/O in the reclaim
> path. The TLB flush of CPU0 in my scenario is just deferred while the
> page-table lock is not held. As I mentioned before, this time-period can be
> potentially very long in a virtual machine. CPU2 referenced a PTE that
> was unmapped by CPU0 (reclaim path) but not CPU1 (ksm path).
>
> ksm, IIUC, would not expect modifications of the page during replace_page.

Indeed not but it'll either find no PTE in which case it won't allow a
stale PTE entry to exist and even when it finds a PTE, it flushes the TLB
unconditionally to avoid any writes taking place. It holds the page lock
while setting up the sharing so no parallel fault can reinsert the page
and no parallel writes can take place that would result in false sharing.

--
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nadav Amit
Date: Fri, 21 Jul 2017 18:19:22 -0700
Subject: Re: Potential race in TLB flush batching?
In-Reply-To: <20170720074342.otez35bme5gytnxl@suse.de>
To: Mel Gorman
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

Mel Gorman wrote:

> On Wed, Jul 19, 2017 at 04:39:07PM -0700, Nadav Amit wrote:
> [snip]
>
> At the time of the unlock_page on the reclaim side, any unmapping that
> will happen before the flush has taken place. If KSM starts between the
> unlock_page and the tlb flush then it'll skip any of the PTEs that were
> previously unmapped with stale entries so there is no relevant stale TLB
> entry to work with.

I don't see where this skipping happens, but let's put this scenario aside
for a second. Here is a similar scenario that causes memory corruption. I
actually created and tested it (although I needed to hack the kernel to add
some artificial latency before the actual flushes and before the actual
deduplication of KSM).

We are going to cause KSM to deduplicate a page, and after page comparison
but before the page is actually replaced, to use a stale PTE entry to
overwrite the page. As a result KSM will lose a write, causing memory
corruption.

For this race we need 4 CPUs:

CPU0: Caches a writable and dirty PTE entry, and uses the stale value for
write later.

CPU1: Runs madvise_free on the range that includes the PTE. It would clear
the dirty-bit. It batches TLB flushes.

CPU2: Writes 4 to /proc/PID/clear_refs, clearing the PTEs soft-dirty. We
care about the fact that it clears the PTE write-bit, and of course, batches
TLB flushes.

CPU3: Runs KSM. Our purpose is to pass the following test in
write_protect_page():

	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))

Since it will avoid TLB flush. And we want to do it while the PTE is stale.
Later, and before replacing the page, we would be able to change the page.

Note that all the operations that CPU1-3 perform can happen in parallel since
they only acquire mmap_sem for read.

We start with two identical pages. Everything below regards the same
page/PTE.

  CPU0             CPU1             CPU2             CPU3
  ----             ----             ----             ----
  Write the same
  value on page

  [cache PTE as
   dirty in TLB]

                   MADV_FREE
                   pte_mkclean()

                                    4 > clear_refs
                                    pte_wrprotect()

                                                     write_protect_page()
                                                     [ success, no flush ]

                                                     pages_identical()
                                                     [ ok ]

  Write to page
  different value

  [Ok, using stale
   PTE]

                                                     replace_page()

Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0
already wrote on the page, but KSM ignored this write, and it got lost.

Now to reiterate my point: It is really hard to get TLB batching right
without some clear policy. And it should be important, since such issues can
cause memory corruption and have security implications (if somebody manages
to get the timing right).

Regards,
Nadav
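For reference, the operations attributed to CPU1 and CPU2 in the table above
are driven through ordinary user interfaces: madvise(MADV_FREE) and writing
"4" to /proc/PID/clear_refs. A minimal, single-threaded illustration
follows; it only shows the interfaces involved and does not by itself
reproduce the race.

/*
 * Single-threaded illustration of the operations attributed to CPU0-2
 * above: write a value, MADV_FREE the range, clear soft-dirty bits, then
 * write again.  Not a reproducer.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_FREE
#define MADV_FREE 8                     /* Linux value, in case older headers lack it */
#endif

int main(void)
{
        size_t len = 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;

        memset(p, 0xaa, len);                   /* CPU0's "write the same value" */

        if (madvise(p, len, MADV_FREE))         /* CPU1: clears the PTE dirty bit, flush batched */
                perror("madvise");

        int fd = open("/proc/self/clear_refs", O_WRONLY);
        if (fd >= 0) {
                if (write(fd, "4", 1) != 1)     /* CPU2: clear soft-dirty, write-protect PTEs */
                        perror("write");
                close(fd);
        }

        p[0] = 0x55;                            /* CPU0's later write, via a possibly stale PTE */
        munmap(p, len);
        return 0;
}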

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mel Gorman
Date: Mon, 24 Jul 2017 10:58:32 +0100
Subject: Re: Potential race in TLB flush batching?
Message-ID: <20170724095832.vgvku6vlxkv75r3k@suse.de>
To: Nadav Amit
Cc: Andy Lutomirski, Minchan Kim, "open list:MEMORY MANAGEMENT"

On Fri, Jul 21, 2017 at 06:19:22PM -0700, Nadav Amit wrote:
> I don't see where this skipping happens, but let's put this scenario aside
> for a second. Here is a similar scenario that causes memory corruption. I
> actually created and tested it (although I needed to hack the kernel to add
> some artificial latency before the actual flushes and before the actual
> deduplication of KSM).
>
> [snip]
>
> Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0
> already wrote on the page, but KSM ignored this write, and it got lost.
>

Ok, as you say you have reproduced this with corruption, I would suggest
one path for dealing with it although you'll need to pass it by the
original authors.

When unmapping ranges, there is a check for dirty PTEs in
zap_pte_range() that forces a flush for dirty PTEs which aims to avoid
writable stale PTEs from CPU0 in a scenario like you laid out above.

madvise_free misses a similar class of check so I'm adding Minchan Kim
to the cc as the original author of much of that code. Minchan Kim will
need to confirm but it appears that two modifications would be required.
The first should pass in the mmu_gather structure to
madvise_free_pte_range (at minimum) and force flush the TLB under the
PTL if a dirty PTE is encountered. The second is that it should consider
flushing the full affected range as madvise_free holds mmap_sem for
read-only to avoid problems with two parallel madv_free operations. The
second is optional because there are other ways it could also be handled
that may have lower overhead.

Soft dirty page handling may need similar protections.

> Now to reiterate my point: It is really hard to get TLB batching right
> without some clear policy. And it should be important, since such issues can
> cause memory corruption and have security implications (if somebody manages
> to get the timing right).
>

Basically it comes down to when batching TLB flushes, care must be taken
when dealing with dirty PTEs that writable TLB entries do not leak data. The
reclaim TLB batching *should* still be ok as it allows stale entries to exist
but only up until the point where IO is queued to prevent data being
lost. I'm not aware of this being formally documented in the past. It's
possible that you could extend the mmu_gather API to track that state
and handle it properly in the general case so that as long as someone uses
that API properly they'll be protected.

--
Mel Gorman
SUSE Labs
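A rough sketch of the shape of the first change suggested above
(force-flushing under the PTL when madvise_free encounters a dirty PTE),
loosely modeled on the force-flush idiom mentioned for zap_pte_range(). This
is not the actual patch: the types and helpers are simplified stand-ins and
the real function has a different signature.

/* Sketch only: simplified stand-ins for kernel types and helpers. */
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long pte_t;
struct mmu_gather { int dummy; };

static bool pte_dirty(pte_t pte)     { return pte & 0x2; }
static pte_t pte_mkclean(pte_t pte)  { return pte & ~0x2UL; }
static void tlb_flush_mmu(struct mmu_gather *tlb) { (void)tlb; }  /* real flush in the kernel */

/*
 * madvise_free_pte_range()-like loop: if any dirty PTE was seen, a writable
 * entry may still be cached in some TLB, so flush before the PTL is dropped
 * rather than leaving it to the deferred batch.
 */
static void madv_free_range_sketch(struct mmu_gather *tlb, pte_t *ptes, size_t n)
{
        bool saw_dirty = false;

        /* assume the page-table lock is held across this loop */
        for (size_t i = 0; i < n; i++) {
                if (pte_dirty(ptes[i]))
                        saw_dirty = true;       /* a writable entry may still be cached */
                ptes[i] = pte_mkclean(ptes[i]);
        }

        /*
         * Force the flush before the PTL is released so nothing can observe
         * the clean PTEs while a stale dirty/writable TLB entry survives.
         */
        if (saw_dirty)
                tlb_flush_mmu(tlb);
}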

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nadav Amit
Date: Mon, 24 Jul 2017 12:46:10 -0700
Subject: Re: Potential race in TLB flush batching?
Message-Id: <16AF238B-2710-4FC3-A983-2DCFDD43AB7F@gmail.com>
In-Reply-To: <20170724095832.vgvku6vlxkv75r3k@suse.de>
To: Mel Gorman
Cc: Andy Lutomirski, Minchan Kim, "open list:MEMORY MANAGEMENT"

Mel Gorman wrote:

> On Fri, Jul 21, 2017 at 06:19:22PM -0700, Nadav Amit wrote:
>> [snip]
>>=20 >> Note that all the operations the CPU1-3 perform canhappen in parallel = since >> they only acquire mmap_sem for read. >>=20 >> We start with two identical pages. Everything below regards the same >> page/PTE. >>=20 >> CPU0 CPU1 CPU2 CPU3 >> ---- ---- ---- ---- >> Write the same >> value on page >>=20 >> [cache PTE as >> dirty in TLB] >>=20 >> MADV_FREE >> pte_mkclean() >> =09 >> 4 > clear_refs >> pte_wrprotect() >>=20 >> write_protect_page() >> [ success, no flush ] >>=20 >> pages_indentical() >> [ ok ] >>=20 >> Write to page >> different value >>=20 >> [Ok, using stale >> PTE] >>=20 >> replace_page() >>=20 >>=20 >> Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. = CPU0 >> already wrote on the page, but KSM ignored this write, and it got = lost. >=20 > Ok, as you say you have reproduced this with corruption, I would = suggest > one path for dealing with it although you'll need to pass it by the > original authors. >=20 > When unmapping ranges, there is a check for dirty PTEs in > zap_pte_range() that forces a flush for dirty PTEs which aims to avoid > writable stale PTEs from CPU0 in a scenario like you laid out above. >=20 > madvise_free misses a similar class of check so I'm adding Minchan Kim > to the cc as the original author of much of that code. Minchan Kim = will > need to confirm but it appears that two modifications would be = required. > The first should pass in the mmu_gather structure to > madvise_free_pte_range (at minimum) and force flush the TLB under the > PTL if a dirty PTE is encountered. The second is that it should = consider > flushing the full affected range as madvise_free holds mmap_sem for > read-only to avoid problems with two parallel madv_free operations. = The > second is optional because there are other ways it could also be = handled > that may have lower overhead. >=20 > Soft dirty page handling may need similar protections. The problem, in my mind, is that KSM conditionally invalidates the PTEs despite potentially pending flushes. Forcing flushes under the ptl = instead of batching may have some significant performance impact. BTW: let me know if you need my PoC. >=20 >> Now to reiterate my point: It is really hard to get TLB batching = right >> without some clear policy. And it should be important, since such = issues can >> cause memory corruption and have security implications (if somebody = manages >> to get the timing right). >=20 > Basically it comes down to when batching TLB flushes, care must be = taken > when dealing with dirty PTEs that writable TLB entries do not leak = data. The > reclaim TLB batching *should* still be ok as it allows stale entries = to exist > but only up until the point where IO is queued to prevent data being > lost. I'm not aware of this being formally documented in the past. = It's > possible that you could extent the mmu_gather API to track that state > and handle it properly in the general case so as long as someone uses > that API properly that they'll be protected. I had a brief look on FreeBSD. Basically, AFAIU, the scheme is that if = there are any pending invalidations to the address space, they must be carried before related operations finish. It is similar to what I proposed = before: increase a =E2=80=9Cpending flush=E2=80=9D counter for the mm when = updating the entries, and update =E2=80=9Cdone flush=E2=80=9D counter once the invalidation is = done. When the kernel makes decisions or conditional flush based on a PTE value - it needs to wait for the flushes to be finished. 
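A minimal sketch of that pending/done counter idea, purely for illustration:
none of these fields or helpers exist in the kernel, the names are invented,
and the barriers only hint at where ordering against the PTE updates would
have to be enforced in real code.

#include <linux/atomic.h>

struct flush_counters {                 /* would live in mm_struct */
        atomic_t flushes_pending;       /* bumped when a flush is deferred */
        atomic_t flushes_done;          /* bumped when a deferred flush completes */
};

/* Called where a PTE change is made but its TLB flush is batched/deferred. */
static inline void flush_deferred(struct flush_counters *c)
{
        atomic_inc(&c->flushes_pending);
        smp_mb__after_atomic();         /* order against later PTE samples */
}

/* Called once the deferred invalidation has actually been performed. */
static inline void flush_completed(struct flush_counters *c)
{
        smp_mb__before_atomic();        /* flush must be visible before the count */
        atomic_inc(&c->flushes_done);
}

/*
 * Called before trusting PTE bits for a correctness decision, e.g. the
 * pte_write()/pte_dirty() test in KSM's write_protect_page(): wait until
 * every flush that was pending at the time of the check has completed.
 */
static inline void wait_for_pending_flushes(struct flush_counters *c)
{
        int snap = atomic_read(&c->flushes_pending);

        while (atomic_read(&c->flushes_done) - snap < 0)
                cpu_relax();
}

Whether per-mm counters like these or a single pending flag is the right
trade-off is exactly the kind of refinement meant below.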
Obviously, such scheme can be = further refined.=20 Thanks again, Nadav= -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id 16D246B0292 for ; Tue, 25 Jul 2017 03:37:52 -0400 (EDT) Received: by mail-pf0-f198.google.com with SMTP id q87so150223241pfk.15 for ; Tue, 25 Jul 2017 00:37:52 -0700 (PDT) Received: from lgeamrelo11.lge.com (LGEAMRELO11.lge.com. [156.147.23.51]) by mx.google.com with ESMTP id t17si1105871pfg.662.2017.07.25.00.37.50 for ; Tue, 25 Jul 2017 00:37:50 -0700 (PDT) Date: Tue, 25 Jul 2017 16:37:48 +0900 From: Minchan Kim Subject: Re: Potential race in TLB flush batching? Message-ID: <20170725073748.GB22652@bbox> References: <20170719195820.drtfmweuhdc4eca6@suse.de> <4BD983A1-724B-4FD7-B502-55351717BC5F@gmail.com> <20170719214708.wuzq3di6rt43txtn@suse.de> <3D1386AD-7875-40B9-8C6F-DE02CF8A45A1@gmail.com> <20170719225950.wfpfzpc6llwlyxdo@suse.de> <4DC97890-9FFA-4BA4-B300-B679BAB2136D@gmail.com> <20170720074342.otez35bme5gytnxl@suse.de> <20170724095832.vgvku6vlxkv75r3k@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170724095832.vgvku6vlxkv75r3k@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Nadav Amit , Andy Lutomirski , "open list:MEMORY MANAGEMENT" Hi Mel, On Mon, Jul 24, 2017 at 10:58:32AM +0100, Mel Gorman wrote: > On Fri, Jul 21, 2017 at 06:19:22PM -0700, Nadav Amit wrote: > > > At the time of the unlock_page on the reclaim side, any unmapping that > > > will happen before the flush has taken place. If KSM starts between the > > > unlock_page and the tlb flush then it'll skip any of the PTEs that were > > > previously unmapped with stale entries so there is no relevant stale TLB > > > entry to work with. > > > > I don???t see where this skipping happens, but let???s put this scenario aside > > for a second. Here is a similar scenario that causes memory corruption. I > > actually created and tested it (although I needed to hack the kernel to add > > some artificial latency before the actual flushes and before the actual > > dedupliaction of KSM). > > > > We are going to cause KSM to deduplicate a page, and after page comparison > > but before the page is actually replaced, to use a stale PTE entry to > > overwrite the page. As a result KSM will lose a write, causing memory > > corruption. > > > > For this race we need 4 CPUs: > > > > CPU0: Caches a writable and dirty PTE entry, and uses the stale value for > > write later. > > > > CPU1: Runs madvise_free on the range that includes the PTE. It would clear > > the dirty-bit. It batches TLB flushes. > > > > CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty. We > > care about the fact that it clears the PTE write-bit, and of course, batches > > TLB flushes. > > > > CPU3: Runs KSM. Our purpose is to pass the following test in > > write_protect_page(): > > > > if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) || > > (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) > > > > Since it will avoid TLB flush. And we want to do it while the PTE is stale. > > Later, and before replacing the page, we would be able to change the page. 
> > > > Note that all the operations the CPU1-3 perform canhappen in parallel since > > they only acquire mmap_sem for read. > > > > We start with two identical pages. Everything below regards the same > > page/PTE. > > > > CPU0 CPU1 CPU2 CPU3 > > ---- ---- ---- ---- > > Write the same > > value on page > > > > [cache PTE as > > dirty in TLB] > > > > MADV_FREE > > pte_mkclean() > > > > 4 > clear_refs > > pte_wrprotect() > > > > write_protect_page() > > [ success, no flush ] > > > > pages_indentical() > > [ ok ] > > > > Write to page > > different value > > > > [Ok, using stale > > PTE] > > > > replace_page() > > > > > > Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0 > > already wrote on the page, but KSM ignored this write, and it got lost. > > > > Ok, as you say you have reproduced this with corruption, I would suggest > one path for dealing with it although you'll need to pass it by the > original authors. > > When unmapping ranges, there is a check for dirty PTEs in > zap_pte_range() that forces a flush for dirty PTEs which aims to avoid > writable stale PTEs from CPU0 in a scenario like you laid out above. > > madvise_free misses a similar class of check so I'm adding Minchan Kim > to the cc as the original author of much of that code. Minchan Kim will > need to confirm but it appears that two modifications would be required. > The first should pass in the mmu_gather structure to > madvise_free_pte_range (at minimum) and force flush the TLB under the > PTL if a dirty PTE is encountered. The second is that it should consider OTL: I couldn't read this lengthy discussion so I miss miss something. About MADV_FREE, I do not understand why it should flush TLB in MADV_FREE context. MADV_FREE's semantic allows "write(ie, dirty)" so if other thread in parallel which has stale pte does "store" to make the pte dirty, it's okay since try_to_unmap_one in shrink_page_list catches the dirty. In above example, I think KSM should flush the TLB, not MADV_FREE and soft dirty page hander. Maybe, I miss something clear, Could you explain it in detail? > flushing the full affected range as madvise_free holds mmap_sem for > read-only to avoid problems with two parallel madv_free operations. The > second is optional because there are other ways it could also be handled > that may have lower overhead. Ditto. I cannot understand. Why does two parallel MADV_FREE have a problem? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f197.google.com (mail-wr0-f197.google.com [209.85.128.197]) by kanga.kvack.org (Postfix) with ESMTP id A5CAF6B0292 for ; Tue, 25 Jul 2017 04:51:35 -0400 (EDT) Received: by mail-wr0-f197.google.com with SMTP id l3so27511177wrc.12 for ; Tue, 25 Jul 2017 01:51:35 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id k62si7035979wmb.117.2017.07.25.01.51.34 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 25 Jul 2017 01:51:34 -0700 (PDT) Date: Tue, 25 Jul 2017 09:51:32 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? 
Message-ID: <20170725085132.iysanhtqkgopegob@suse.de> References: <20170719195820.drtfmweuhdc4eca6@suse.de> <4BD983A1-724B-4FD7-B502-55351717BC5F@gmail.com> <20170719214708.wuzq3di6rt43txtn@suse.de> <3D1386AD-7875-40B9-8C6F-DE02CF8A45A1@gmail.com> <20170719225950.wfpfzpc6llwlyxdo@suse.de> <4DC97890-9FFA-4BA4-B300-B679BAB2136D@gmail.com> <20170720074342.otez35bme5gytnxl@suse.de> <20170724095832.vgvku6vlxkv75r3k@suse.de> <20170725073748.GB22652@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20170725073748.GB22652@bbox> Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Nadav Amit , Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Tue, Jul 25, 2017 at 04:37:48PM +0900, Minchan Kim wrote: > > Ok, as you say you have reproduced this with corruption, I would suggest > > one path for dealing with it although you'll need to pass it by the > > original authors. > > > > When unmapping ranges, there is a check for dirty PTEs in > > zap_pte_range() that forces a flush for dirty PTEs which aims to avoid > > writable stale PTEs from CPU0 in a scenario like you laid out above. > > > > madvise_free misses a similar class of check so I'm adding Minchan Kim > > to the cc as the original author of much of that code. Minchan Kim will > > need to confirm but it appears that two modifications would be required. > > The first should pass in the mmu_gather structure to > > madvise_free_pte_range (at minimum) and force flush the TLB under the > > PTL if a dirty PTE is encountered. The second is that it should consider > > OTL: I couldn't read this lengthy discussion so I miss miss something. > > About MADV_FREE, I do not understand why it should flush TLB in MADV_FREE > context. MADV_FREE's semantic allows "write(ie, dirty)" so if other thread > in parallel which has stale pte does "store" to make the pte dirty, > it's okay since try_to_unmap_one in shrink_page_list catches the dirty. > In try_to_unmap_one it's fine. It's not necessarily fine in KSM. Given that the key is that data corruption is avoided, you could argue with a comment that madv_free doesn't necesssarily have to flush it as long as KSM does even if it's clean due to batching. > In above example, I think KSM should flush the TLB, not MADV_FREE and > soft dirty page hander. > That would also be acceptable. > > flushing the full affected range as madvise_free holds mmap_sem for > > read-only to avoid problems with two parallel madv_free operations. The > > second is optional because there are other ways it could also be handled > > that may have lower overhead. > > Ditto. I cannot understand. Why does two parallel MADV_FREE have a problem? > Like madvise(), madv_free can potentially return with a stale PTE visible to the caller that observed a pte_none at the time of madv_free and uses a stale PTE that potentially allows a lost write. It's debatable whether this matters considering that madv_free to a region means that parallel writers can lose their update anyway. It's less of a concern than the KSM angle outlined in Nadav's example which he was able to artifically reproduce by slowing operations to increase the race window. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
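[Editorial sketch] For reference, the MADV_FREE semantic that makes a lost
write a real bug can be stated with a small user-space example. This is only
an illustration of the invariant, not a reproducer for the race (that needs
the timing hacks posted later in the thread). It assumes Linux 4.5 or later
for MADV_FREE; the fallback #define is an assumption for older headers.

#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8     /* generic value from asm-generic/mman-common.h */
#endif

int main(void)
{
        size_t len = 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        assert(p != MAP_FAILED);

        memset(p, 0xaa, len);           /* dirty the page */

        if (madvise(p, len, MADV_FREE)) /* kernel may now drop it lazily */
                perror("madvise");

        p[0] = 0x55;                    /* write *after* MADV_FREE */

        /*
         * Whatever reclaim does later, a read here must see 0x55: the page
         * is only allowed to be discarded if it stayed clean after the
         * MADV_FREE call.
         */
        printf("p[0] = 0x%x\n", (unsigned char)p[0]);
        return 0;
}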
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id 138A86B02B4 for ; Tue, 25 Jul 2017 05:11:18 -0400 (EDT) Received: by mail-pg0-f69.google.com with SMTP id g14so176701822pgu.9 for ; Tue, 25 Jul 2017 02:11:18 -0700 (PDT) Received: from lgeamrelo12.lge.com (LGEAMRELO12.lge.com. [156.147.23.52]) by mx.google.com with ESMTP id w22si8376026plk.820.2017.07.25.02.11.16 for ; Tue, 25 Jul 2017 02:11:17 -0700 (PDT) Date: Tue, 25 Jul 2017 18:11:15 +0900 From: Minchan Kim Subject: Re: Potential race in TLB flush batching? Message-ID: <20170725091115.GA22920@bbox> References: <4BD983A1-724B-4FD7-B502-55351717BC5F@gmail.com> <20170719214708.wuzq3di6rt43txtn@suse.de> <3D1386AD-7875-40B9-8C6F-DE02CF8A45A1@gmail.com> <20170719225950.wfpfzpc6llwlyxdo@suse.de> <4DC97890-9FFA-4BA4-B300-B679BAB2136D@gmail.com> <20170720074342.otez35bme5gytnxl@suse.de> <20170724095832.vgvku6vlxkv75r3k@suse.de> <20170725073748.GB22652@bbox> <20170725085132.iysanhtqkgopegob@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170725085132.iysanhtqkgopegob@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Nadav Amit , Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Tue, Jul 25, 2017 at 09:51:32AM +0100, Mel Gorman wrote: > On Tue, Jul 25, 2017 at 04:37:48PM +0900, Minchan Kim wrote: > > > Ok, as you say you have reproduced this with corruption, I would suggest > > > one path for dealing with it although you'll need to pass it by the > > > original authors. > > > > > > When unmapping ranges, there is a check for dirty PTEs in > > > zap_pte_range() that forces a flush for dirty PTEs which aims to avoid > > > writable stale PTEs from CPU0 in a scenario like you laid out above. > > > > > > madvise_free misses a similar class of check so I'm adding Minchan Kim > > > to the cc as the original author of much of that code. Minchan Kim will > > > need to confirm but it appears that two modifications would be required. > > > The first should pass in the mmu_gather structure to > > > madvise_free_pte_range (at minimum) and force flush the TLB under the > > > PTL if a dirty PTE is encountered. The second is that it should consider > > > > OTL: I couldn't read this lengthy discussion so I miss miss something. > > > > About MADV_FREE, I do not understand why it should flush TLB in MADV_FREE > > context. MADV_FREE's semantic allows "write(ie, dirty)" so if other thread > > in parallel which has stale pte does "store" to make the pte dirty, > > it's okay since try_to_unmap_one in shrink_page_list catches the dirty. > > > > In try_to_unmap_one it's fine. It's not necessarily fine in KSM. Given > that the key is that data corruption is avoided, you could argue with a > comment that madv_free doesn't necesssarily have to flush it as long as > KSM does even if it's clean due to batching. Yes, I think it should be done in side where have a concern. Maybe, mm_struct can carry a flag which indicates someone is doing the TLB bacthing and then KSM side can flush it by the flag. It would reduce unncessary flushing. > > > In above example, I think KSM should flush the TLB, not MADV_FREE and > > soft dirty page hander. > > > > That would also be acceptable. > > > > flushing the full affected range as madvise_free holds mmap_sem for > > > read-only to avoid problems with two parallel madv_free operations. 
The > > > second is optional because there are other ways it could also be handled > > > that may have lower overhead. > > > > Ditto. I cannot understand. Why does two parallel MADV_FREE have a problem? > > > > Like madvise(), madv_free can potentially return with a stale PTE visible > to the caller that observed a pte_none at the time of madv_free and uses > a stale PTE that potentially allows a lost write. It's debatable whether That is the part I cannot understand. How does it lost "the write"? MADV_FREE doesn't discard the memory so finally, the write should be done sometime. Could you tell me more? Thanks. > this matters considering that madv_free to a region means that parallel > writers can lose their update anyway. It's less of a concern than the > KSM angle outlined in Nadav's example which he was able to artifically > reproduce by slowing operations to increase the race window. > > -- > Mel Gorman > SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f199.google.com (mail-wr0-f199.google.com [209.85.128.199]) by kanga.kvack.org (Postfix) with ESMTP id AB6E86B0292 for ; Tue, 25 Jul 2017 06:10:14 -0400 (EDT) Received: by mail-wr0-f199.google.com with SMTP id v102so27774719wrb.2 for ; Tue, 25 Jul 2017 03:10:14 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id v6si3606461wrv.93.2017.07.25.03.10.13 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 25 Jul 2017 03:10:13 -0700 (PDT) Date: Tue, 25 Jul 2017 11:10:06 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? Message-ID: <20170725100722.2dxnmgypmwnrfawp@suse.de> References: <20170719214708.wuzq3di6rt43txtn@suse.de> <3D1386AD-7875-40B9-8C6F-DE02CF8A45A1@gmail.com> <20170719225950.wfpfzpc6llwlyxdo@suse.de> <4DC97890-9FFA-4BA4-B300-B679BAB2136D@gmail.com> <20170720074342.otez35bme5gytnxl@suse.de> <20170724095832.vgvku6vlxkv75r3k@suse.de> <20170725073748.GB22652@bbox> <20170725085132.iysanhtqkgopegob@suse.de> <20170725091115.GA22920@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20170725091115.GA22920@bbox> Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Nadav Amit , Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Tue, Jul 25, 2017 at 06:11:15PM +0900, Minchan Kim wrote: > On Tue, Jul 25, 2017 at 09:51:32AM +0100, Mel Gorman wrote: > > On Tue, Jul 25, 2017 at 04:37:48PM +0900, Minchan Kim wrote: > > > > Ok, as you say you have reproduced this with corruption, I would suggest > > > > one path for dealing with it although you'll need to pass it by the > > > > original authors. > > > > > > > > When unmapping ranges, there is a check for dirty PTEs in > > > > zap_pte_range() that forces a flush for dirty PTEs which aims to avoid > > > > writable stale PTEs from CPU0 in a scenario like you laid out above. > > > > > > > > madvise_free misses a similar class of check so I'm adding Minchan Kim > > > > to the cc as the original author of much of that code. Minchan Kim will > > > > need to confirm but it appears that two modifications would be required. > > > > The first should pass in the mmu_gather structure to > > > > madvise_free_pte_range (at minimum) and force flush the TLB under the > > > > PTL if a dirty PTE is encountered. 
The second is that it should consider > > > > > > OTL: I couldn't read this lengthy discussion so I miss miss something. > > > > > > About MADV_FREE, I do not understand why it should flush TLB in MADV_FREE > > > context. MADV_FREE's semantic allows "write(ie, dirty)" so if other thread > > > in parallel which has stale pte does "store" to make the pte dirty, > > > it's okay since try_to_unmap_one in shrink_page_list catches the dirty. > > > > > > > In try_to_unmap_one it's fine. It's not necessarily fine in KSM. Given > > that the key is that data corruption is avoided, you could argue with a > > comment that madv_free doesn't necesssarily have to flush it as long as > > KSM does even if it's clean due to batching. > > Yes, I think it should be done in side where have a concern. > Maybe, mm_struct can carry a flag which indicates someone is > doing the TLB bacthing and then KSM side can flush it by the flag. > It would reduce unncessary flushing. > If you're confident that it's only necessary on the KSM side to avoid the problem then I'm ok with that. Update KSM in that case with a comment explaining the madv_free race and why the flush is unconditionally necessary. madv_free only came up because it was a critical part of having KSM miss a TLB flush. > > Like madvise(), madv_free can potentially return with a stale PTE visible > > to the caller that observed a pte_none at the time of madv_free and uses > > a stale PTE that potentially allows a lost write. It's debatable whether > > That is the part I cannot understand. > How does it lost "the write"? MADV_FREE doesn't discard the memory so > finally, the write should be done sometime. > Could you tell me more? > I'm relying on the fact you are the madv_free author to determine if it's really necessary. The race in question is CPU 0 running madv_free and updating some PTEs while CPU 1 is also running madv_free and looking at the same PTEs. CPU 1 may have writable TLB entries for a page but fail the pte_dirty check (because CPU 0 has updated it already) and potentially fail to flush. Hence, when madv_free on CPU 1 returns, there are still potentially writable TLB entries and the underlying PTE is still present so that a subsequent write does not necessarily propagate the dirty bit to the underlying PTE any more. Reclaim at some unknown time at the future may then see that the PTE is still clean and discard the page even though a write has happened in the meantime. I think this is possible but I could have missed some protection in madv_free that prevents it happening. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id 20CA16B025F for ; Wed, 26 Jul 2017 01:43:12 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id 125so204637443pgi.2 for ; Tue, 25 Jul 2017 22:43:12 -0700 (PDT) Received: from lgeamrelo13.lge.com (LGEAMRELO13.lge.com. [156.147.23.53]) by mx.google.com with ESMTP id q26si8948517pfi.408.2017.07.25.22.43.10 for ; Tue, 25 Jul 2017 22:43:10 -0700 (PDT) Date: Wed, 26 Jul 2017 14:43:06 +0900 From: Minchan Kim Subject: Re: Potential race in TLB flush batching? 
Message-ID: <20170726054306.GA11100@bbox> References: <3D1386AD-7875-40B9-8C6F-DE02CF8A45A1@gmail.com> <20170719225950.wfpfzpc6llwlyxdo@suse.de> <4DC97890-9FFA-4BA4-B300-B679BAB2136D@gmail.com> <20170720074342.otez35bme5gytnxl@suse.de> <20170724095832.vgvku6vlxkv75r3k@suse.de> <20170725073748.GB22652@bbox> <20170725085132.iysanhtqkgopegob@suse.de> <20170725091115.GA22920@bbox> <20170725100722.2dxnmgypmwnrfawp@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170725100722.2dxnmgypmwnrfawp@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Nadav Amit , Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Tue, Jul 25, 2017 at 11:10:06AM +0100, Mel Gorman wrote: > On Tue, Jul 25, 2017 at 06:11:15PM +0900, Minchan Kim wrote: > > On Tue, Jul 25, 2017 at 09:51:32AM +0100, Mel Gorman wrote: > > > On Tue, Jul 25, 2017 at 04:37:48PM +0900, Minchan Kim wrote: > > > > > Ok, as you say you have reproduced this with corruption, I would suggest > > > > > one path for dealing with it although you'll need to pass it by the > > > > > original authors. > > > > > > > > > > When unmapping ranges, there is a check for dirty PTEs in > > > > > zap_pte_range() that forces a flush for dirty PTEs which aims to avoid > > > > > writable stale PTEs from CPU0 in a scenario like you laid out above. > > > > > > > > > > madvise_free misses a similar class of check so I'm adding Minchan Kim > > > > > to the cc as the original author of much of that code. Minchan Kim will > > > > > need to confirm but it appears that two modifications would be required. > > > > > The first should pass in the mmu_gather structure to > > > > > madvise_free_pte_range (at minimum) and force flush the TLB under the > > > > > PTL if a dirty PTE is encountered. The second is that it should consider > > > > > > > > OTL: I couldn't read this lengthy discussion so I miss miss something. > > > > > > > > About MADV_FREE, I do not understand why it should flush TLB in MADV_FREE > > > > context. MADV_FREE's semantic allows "write(ie, dirty)" so if other thread > > > > in parallel which has stale pte does "store" to make the pte dirty, > > > > it's okay since try_to_unmap_one in shrink_page_list catches the dirty. > > > > > > > > > > In try_to_unmap_one it's fine. It's not necessarily fine in KSM. Given > > > that the key is that data corruption is avoided, you could argue with a > > > comment that madv_free doesn't necesssarily have to flush it as long as > > > KSM does even if it's clean due to batching. > > > > Yes, I think it should be done in side where have a concern. > > Maybe, mm_struct can carry a flag which indicates someone is > > doing the TLB bacthing and then KSM side can flush it by the flag. > > It would reduce unncessary flushing. > > > > If you're confident that it's only necessary on the KSM side to avoid the > problem then I'm ok with that. Update KSM in that case with a comment > explaining the madv_free race and why the flush is unconditionally > necessary. madv_free only came up because it was a critical part of having > KSM miss a TLB flush. > > > > Like madvise(), madv_free can potentially return with a stale PTE visible > > > to the caller that observed a pte_none at the time of madv_free and uses > > > a stale PTE that potentially allows a lost write. It's debatable whether > > > > That is the part I cannot understand. > > How does it lost "the write"? MADV_FREE doesn't discard the memory so > > finally, the write should be done sometime. 
> > Could you tell me more? > > > > I'm relying on the fact you are the madv_free author to determine if > it's really necessary. The race in question is CPU 0 running madv_free > and updating some PTEs while CPU 1 is also running madv_free and looking > at the same PTEs. CPU 1 may have writable TLB entries for a page but fail > the pte_dirty check (because CPU 0 has updated it already) and potentially > fail to flush. Hence, when madv_free on CPU 1 returns, there are still > potentially writable TLB entries and the underlying PTE is still present > so that a subsequent write does not necessarily propagate the dirty bit > to the underlying PTE any more. Reclaim at some unknown time at the future > may then see that the PTE is still clean and discard the page even though > a write has happened in the meantime. I think this is possible but I could > have missed some protection in madv_free that prevents it happening. Thanks for the detail. You didn't miss anything. It can happen and then it's really bug. IOW, if application does write something after madv_free, it must see the written value, not zero. How about adding [set|clear]_tlb_flush_pending in tlb batchin interface? With it, when tlb_finish_mmu is called, we can know we skip the flush but there is pending flush, so flush focefully to avoid madv_dontneed as well as madv_free scenario. Also, KSM can know it through mm_tlb_flush_pending? If it's acceptable, need to look into soft dirty to use [set|clear]_tlb _flush_pending or TLB gathering API. To show my intention: diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 8afa4335e5b2..fffd4d86d0c4 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -113,7 +113,7 @@ struct mmu_gather { #define HAVE_GENERIC_MMU_GATHER void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end); -void tlb_flush_mmu(struct mmu_gather *tlb); +bool tlb_flush_mmu(struct mmu_gather *tlb); void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end); extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, diff --git a/mm/ksm.c b/mm/ksm.c index 4dc92f138786..0fbbd5d234d5 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1037,8 +1037,9 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?")) goto out_unlock; - if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) || - (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) { + if ((pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) || + (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) || + mm_tlb_flush_pending(mm)) { pte_t entry; swapped = PageSwapCache(page); diff --git a/mm/memory.c b/mm/memory.c index ea9f28e44b81..d5c5e6497c70 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -239,12 +239,13 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long tlb->page_size = 0; __tlb_reset_range(tlb); + set_tlb_flush_pending(tlb->mm); } -static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) +static bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) { if (!tlb->end) - return; + return false; tlb_flush(tlb); mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end); @@ -252,6 +253,7 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) tlb_table_flush(tlb); #endif __tlb_reset_range(tlb); + return true; } static void tlb_flush_mmu_free(struct mmu_gather *tlb) @@ -265,10 +267,16 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb) tlb->active = 
&tlb->local; } -void tlb_flush_mmu(struct mmu_gather *tlb) +/* + * returns true if tlb flush really happens + */ +bool tlb_flush_mmu(struct mmu_gather *tlb) { - tlb_flush_mmu_tlbonly(tlb); + bool ret; + + ret = tlb_flush_mmu_tlbonly(tlb); tlb_flush_mmu_free(tlb); + return ret; } /* tlb_finish_mmu @@ -278,8 +286,11 @@ void tlb_flush_mmu(struct mmu_gather *tlb) void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) { struct mmu_gather_batch *batch, *next; + bool flushed = tlb_flush_mmu(tlb); - tlb_flush_mmu(tlb); + clear_tlb_flush_pending(tlb->mm); + if (!flushed && mm_tlb_flush_pending(tlb->mm)) + flush_tlb_mm_range(tlb->mm, start, end, 0UL); /* keep the page table cache within bounds */ check_pgt_cache(); > > -- > Mel Gorman > SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f69.google.com (mail-wm0-f69.google.com [74.125.82.69]) by kanga.kvack.org (Postfix) with ESMTP id 57CB26B025F for ; Wed, 26 Jul 2017 05:22:32 -0400 (EDT) Received: by mail-wm0-f69.google.com with SMTP id x64so9206631wmg.11 for ; Wed, 26 Jul 2017 02:22:32 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id 98si8119930wrl.5.2017.07.26.02.22.30 for (version=TLS1 cipher=AES128-SHA bits=128/128); Wed, 26 Jul 2017 02:22:31 -0700 (PDT) Date: Wed, 26 Jul 2017 10:22:28 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? Message-ID: <20170726092228.pyjxamxweslgaemi@suse.de> References: <20170719225950.wfpfzpc6llwlyxdo@suse.de> <4DC97890-9FFA-4BA4-B300-B679BAB2136D@gmail.com> <20170720074342.otez35bme5gytnxl@suse.de> <20170724095832.vgvku6vlxkv75r3k@suse.de> <20170725073748.GB22652@bbox> <20170725085132.iysanhtqkgopegob@suse.de> <20170725091115.GA22920@bbox> <20170725100722.2dxnmgypmwnrfawp@suse.de> <20170726054306.GA11100@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20170726054306.GA11100@bbox> Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Nadav Amit , Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote: > > I'm relying on the fact you are the madv_free author to determine if > > it's really necessary. The race in question is CPU 0 running madv_free > > and updating some PTEs while CPU 1 is also running madv_free and looking > > at the same PTEs. CPU 1 may have writable TLB entries for a page but fail > > the pte_dirty check (because CPU 0 has updated it already) and potentially > > fail to flush. Hence, when madv_free on CPU 1 returns, there are still > > potentially writable TLB entries and the underlying PTE is still present > > so that a subsequent write does not necessarily propagate the dirty bit > > to the underlying PTE any more. Reclaim at some unknown time at the future > > may then see that the PTE is still clean and discard the page even though > > a write has happened in the meantime. I think this is possible but I could > > have missed some protection in madv_free that prevents it happening. > > Thanks for the detail. You didn't miss anything. It can happen and then > it's really bug. IOW, if application does write something after madv_free, > it must see the written value, not zero. > > How about adding [set|clear]_tlb_flush_pending in tlb batchin interface? 
> With it, when tlb_finish_mmu is called, we can know we skip the flush > but there is pending flush, so flush focefully to avoid madv_dontneed > as well as madv_free scenario. > I *think* this is ok as it's simply more expensive on the KSM side in the event of a race but no other harmful change is made assuming that KSM is the only race-prone. The check for mm_tlb_flush_pending also happens under the PTL so there should be sufficient protection from the mm struct update being visible at teh right time. Check using the test program from "mm: Always flush VMA ranges affected by zap_page_range v2" if it handles the madvise case as well as that would give some degree of safety. Make sure it's tested against 4.13-rc2 instead of mmotm which already includes the madv_dontneed fix. If yours works for both then it supersedes the mmotm patch. It would also be interesting if Nadav would use his slowdown hack to see if he can still force the corruption. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id 6C01D6B0292 for ; Wed, 26 Jul 2017 15:18:41 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id c87so18821730pfd.14 for ; Wed, 26 Jul 2017 12:18:41 -0700 (PDT) Received: from mail-pf0-x242.google.com (mail-pf0-x242.google.com. [2607:f8b0:400e:c00::242]) by mx.google.com with ESMTPS id v4si4143977pgv.275.2017.07.26.12.18.39 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 26 Jul 2017 12:18:39 -0700 (PDT) Received: by mail-pf0-x242.google.com with SMTP id m21so1818692pfj.3 for ; Wed, 26 Jul 2017 12:18:39 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: Potential race in TLB flush batching? From: Nadav Amit In-Reply-To: <20170726092228.pyjxamxweslgaemi@suse.de> Date: Wed, 26 Jul 2017 12:18:37 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: References: <20170719225950.wfpfzpc6llwlyxdo@suse.de> <4DC97890-9FFA-4BA4-B300-B679BAB2136D@gmail.com> <20170720074342.otez35bme5gytnxl@suse.de> <20170724095832.vgvku6vlxkv75r3k@suse.de> <20170725073748.GB22652@bbox> <20170725085132.iysanhtqkgopegob@suse.de> <20170725091115.GA22920@bbox> <20170725100722.2dxnmgypmwnrfawp@suse.de> <20170726054306.GA11100@bbox> <20170726092228.pyjxamxweslgaemi@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Minchan Kim , Andy Lutomirski , "open list:MEMORY MANAGEMENT" Mel Gorman wrote: > On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote: >>> I'm relying on the fact you are the madv_free author to determine if >>> it's really necessary. The race in question is CPU 0 running = madv_free >>> and updating some PTEs while CPU 1 is also running madv_free and = looking >>> at the same PTEs. CPU 1 may have writable TLB entries for a page but = fail >>> the pte_dirty check (because CPU 0 has updated it already) and = potentially >>> fail to flush. Hence, when madv_free on CPU 1 returns, there are = still >>> potentially writable TLB entries and the underlying PTE is still = present >>> so that a subsequent write does not necessarily propagate the dirty = bit >>> to the underlying PTE any more. 
Reclaim at some unknown time at the = future >>> may then see that the PTE is still clean and discard the page even = though >>> a write has happened in the meantime. I think this is possible but I = could >>> have missed some protection in madv_free that prevents it happening. >>=20 >> Thanks for the detail. You didn't miss anything. It can happen and = then >> it's really bug. IOW, if application does write something after = madv_free, >> it must see the written value, not zero. >>=20 >> How about adding [set|clear]_tlb_flush_pending in tlb batchin = interface? >> With it, when tlb_finish_mmu is called, we can know we skip the flush >> but there is pending flush, so flush focefully to avoid madv_dontneed >> as well as madv_free scenario. >=20 > I *think* this is ok as it's simply more expensive on the KSM side in > the event of a race but no other harmful change is made assuming that > KSM is the only race-prone. The check for mm_tlb_flush_pending also > happens under the PTL so there should be sufficient protection from = the > mm struct update being visible at teh right time. >=20 > Check using the test program from "mm: Always flush VMA ranges = affected > by zap_page_range v2" if it handles the madvise case as well as that > would give some degree of safety. Make sure it's tested against = 4.13-rc2 > instead of mmotm which already includes the madv_dontneed fix. If = yours > works for both then it supersedes the mmotm patch. >=20 > It would also be interesting if Nadav would use his slowdown hack to = see > if he can still force the corruption. The proposed fix for the KSM side is likely to work (I will try later), = but on the tlb_finish_mmu() side, I think there is a problem, since if any = TLB flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not be executed. This means that tlb_finish_mmu() may flush one TLB entry, = leave another one stale and not flush it. Note also that the use of set/clear_tlb_flush_pending() is only = applicable following my pending fix that changes the pending indication from bool = to atomic_t. For the record here is my test, followed by the patch to add latency. = There are some magic numbers that may not apply to your system (I got tired of trying to time the system). If you run the test in a VM, the pause-loop exiting can potentially prevent the issue from appearing. -- #include #include #include #include #include #include #include #include #include #include #include #include #define PAGE_SIZE (4096) #define N_PAGES (65536ull * 16) #define CHANGED_VAL (7) #define BASE_VAL (9) #define max(a,b) \ ({ __typeof__ (a) _a =3D (a); \ __typeof__ (b) _b =3D (b); \ _a > _b ? 
_a : _b; }) #define STEP_HELPERS_RUN (1) #define STEP_DONTNEED_DONE (2) #define STEP_ACCESS_PAUSED (4) volatile int sync_step =3D STEP_ACCESS_PAUSED; volatile char *p; int dirty_fd, ksm_sharing_fd, ksm_run_fd; uint64_t soft_dirty_time, madvise_time, soft_dirty_delta, madvise_delta; static inline unsigned long rdtsc() { unsigned long hi, lo; __asm__ __volatile__ ("rdtsc" : "=3Da"(lo), "=3Dd"(hi)); return lo | (hi << 32); } static inline void wait_rdtsc(unsigned long cycles) { unsigned long tsc =3D rdtsc(); while (rdtsc() - tsc < cycles) __asm__ __volatile__ ("rep nop" ::: "memory"); } static void break_sharing(void) { char buf[20]; pwrite(ksm_run_fd, "2", 1, 0); printf("waiting for page sharing to be broken\n"); do { pread(ksm_sharing_fd, buf, sizeof(buf), 0); } while (strtoul(buf, NULL, sizeof(buf))); } static inline void wait_step(unsigned int step) { while (!(sync_step & step)) asm volatile ("rep nop":::"memory"); } static void *big_madvise_thread(void *ign) { while (1) { uint64_t tsc; wait_step(STEP_HELPERS_RUN); wait_rdtsc(madvise_delta); tsc =3D rdtsc(); madvise((void*)p, PAGE_SIZE * N_PAGES, MADV_FREE); madvise_time =3D rdtsc() - tsc; sync_step =3D STEP_DONTNEED_DONE; } } static void *soft_dirty_thread(void *ign) { while (1) { int r; uint64_t tsc; wait_step(STEP_HELPERS_RUN | STEP_DONTNEED_DONE); wait_rdtsc(soft_dirty_delta); tsc =3D rdtsc(); r =3D pwrite(dirty_fd, "4", 1, 0); assert(r =3D=3D 1); soft_dirty_time =3D rdtsc() - tsc; wait_step(STEP_DONTNEED_DONE); sync_step =3D STEP_ACCESS_PAUSED; } } void main(void) { pthread_t aux_thread, aux_thread2; char pathname[256]; long i; volatile char c; sprintf(pathname, "/proc/%d/clear_refs", getpid()); dirty_fd =3D open(pathname, O_RDWR); ksm_sharing_fd =3D open("/sys/kernel/mm/ksm/pages_sharing", = O_RDONLY); assert(ksm_sharing_fd >=3D 0); ksm_run_fd =3D open("/sys/kernel/mm/ksm/run", O_RDWR); assert(ksm_run_fd >=3D 0); pwrite(ksm_run_fd, "0", 1, 0); p =3D mmap(0, PAGE_SIZE * N_PAGES, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); assert(p !=3D MAP_FAILED); madvise((void*)p, PAGE_SIZE * N_PAGES, MADV_MERGEABLE); memset((void*)p, BASE_VAL, PAGE_SIZE * 2); for (i =3D 2; i < N_PAGES; i++) c =3D p[PAGE_SIZE * i]; pthread_create(&aux_thread, NULL, big_madvise_thread, NULL); pthread_create(&aux_thread2, NULL, soft_dirty_thread, NULL); while (1) { break_sharing(); *(p + 64) =3D BASE_VAL; // cache in TLB and = break KSM pwrite(ksm_run_fd, "1", 1, 0); wait_rdtsc(0x8000000ull); sync_step =3D STEP_HELPERS_RUN; wait_rdtsc(0x4000000ull); *(p+64) =3D CHANGED_VAL; wait_step(STEP_ACCESS_PAUSED); // wait for TLB = to be flushed if (*(p+64) !=3D CHANGED_VAL || *(p + PAGE_SIZE + 64) =3D=3D CHANGED_VAL) { printf("KSM error\n"); exit(EXIT_FAILURE); } printf("No failure yet\n"); soft_dirty_delta =3D max(0, (long)madvise_time - = (long)soft_dirty_time); madvise_delta =3D max(0, (long)soft_dirty_time - = (long)madvise_time); } } -- 8< -- Subject: [PATCH] TLB flush delay to trigger failure --- fs/proc/task_mmu.c | 2 ++ mm/ksm.c | 2 ++ mm/madvise.c | 2 ++ 3 files changed, 6 insertions(+) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 520802da059c..c13259251210 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -16,6 +16,7 @@ #include #include #include +#include =20 #include #include @@ -1076,6 +1077,7 @@ static ssize_t clear_refs_write(struct file *file, = const char __user *buf, walk_page_range(0, mm->highest_vm_end, = &clear_refs_walk); if (type =3D=3D CLEAR_REFS_SOFT_DIRTY) mmu_notifier_invalidate_range_end(mm, 0, -1); + msleep(5); 
flush_tlb_mm(mm); up_read(&mm->mmap_sem); out_mm: diff --git a/mm/ksm.c b/mm/ksm.c index 216184af0e19..317adbb48b0f 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -39,6 +39,7 @@ #include #include #include +#include =20 #include #include "internal.h" @@ -960,6 +961,7 @@ static int replace_page(struct vm_area_struct *vma, = struct page *page, mmun_end =3D addr + PAGE_SIZE; mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); =20 + msleep(5); ptep =3D pte_offset_map_lock(mm, pmd, addr, &ptl); if (!pte_same(*ptep, orig_pte)) { pte_unmap_unlock(ptep, ptl); diff --git a/mm/madvise.c b/mm/madvise.c index 25b78ee4fc2c..e4c852360f2c 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -23,6 +23,7 @@ #include #include #include +#include =20 #include =20 @@ -472,6 +473,7 @@ static int madvise_free_single_vma(struct = vm_area_struct *vma, mmu_notifier_invalidate_range_start(mm, start, end); madvise_free_page_range(&tlb, vma, start, end); mmu_notifier_invalidate_range_end(mm, start, end); + msleep(5); tlb_finish_mmu(&tlb, start, end); =20 return 0;= -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id D0AA16B025F for ; Wed, 26 Jul 2017 19:40:30 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id 72so119834258pfl.12 for ; Wed, 26 Jul 2017 16:40:30 -0700 (PDT) Received: from lgeamrelo12.lge.com (LGEAMRELO12.lge.com. [156.147.23.52]) by mx.google.com with ESMTP id s138si10259678pgs.270.2017.07.26.16.40.27 for ; Wed, 26 Jul 2017 16:40:28 -0700 (PDT) Date: Thu, 27 Jul 2017 08:40:25 +0900 From: Minchan Kim Subject: Re: Potential race in TLB flush batching? Message-ID: <20170726234025.GA4491@bbox> References: <20170720074342.otez35bme5gytnxl@suse.de> <20170724095832.vgvku6vlxkv75r3k@suse.de> <20170725073748.GB22652@bbox> <20170725085132.iysanhtqkgopegob@suse.de> <20170725091115.GA22920@bbox> <20170725100722.2dxnmgypmwnrfawp@suse.de> <20170726054306.GA11100@bbox> <20170726092228.pyjxamxweslgaemi@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Nadav Amit Cc: Mel Gorman , Andy Lutomirski , "open list:MEMORY MANAGEMENT" Hello Nadav, On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote: > Mel Gorman wrote: > > > On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote: > >>> I'm relying on the fact you are the madv_free author to determine if > >>> it's really necessary. The race in question is CPU 0 running madv_free > >>> and updating some PTEs while CPU 1 is also running madv_free and looking > >>> at the same PTEs. CPU 1 may have writable TLB entries for a page but fail > >>> the pte_dirty check (because CPU 0 has updated it already) and potentially > >>> fail to flush. Hence, when madv_free on CPU 1 returns, there are still > >>> potentially writable TLB entries and the underlying PTE is still present > >>> so that a subsequent write does not necessarily propagate the dirty bit > >>> to the underlying PTE any more. Reclaim at some unknown time at the future > >>> may then see that the PTE is still clean and discard the page even though > >>> a write has happened in the meantime. 
I think this is possible but I could > >>> have missed some protection in madv_free that prevents it happening. > >> > >> Thanks for the detail. You didn't miss anything. It can happen and then > >> it's really bug. IOW, if application does write something after madv_free, > >> it must see the written value, not zero. > >> > >> How about adding [set|clear]_tlb_flush_pending in tlb batchin interface? > >> With it, when tlb_finish_mmu is called, we can know we skip the flush > >> but there is pending flush, so flush focefully to avoid madv_dontneed > >> as well as madv_free scenario. > > > > I *think* this is ok as it's simply more expensive on the KSM side in > > the event of a race but no other harmful change is made assuming that > > KSM is the only race-prone. The check for mm_tlb_flush_pending also > > happens under the PTL so there should be sufficient protection from the > > mm struct update being visible at teh right time. > > > > Check using the test program from "mm: Always flush VMA ranges affected > > by zap_page_range v2" if it handles the madvise case as well as that > > would give some degree of safety. Make sure it's tested against 4.13-rc2 > > instead of mmotm which already includes the madv_dontneed fix. If yours > > works for both then it supersedes the mmotm patch. > > > > It would also be interesting if Nadav would use his slowdown hack to see > > if he can still force the corruption. > > The proposed fix for the KSM side is likely to work (I will try later), but > on the tlb_finish_mmu() side, I think there is a problem, since if any TLB > flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not be > executed. This means that tlb_finish_mmu() may flush one TLB entry, leave > another one stale and not flush it. Okay, I will change that part like this to avoid partial flush problem. diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 1c42d69490e4..87d0ebac6605 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) * The barriers below prevent the compiler from re-ordering the instructions * around the memory barriers that are already present in the code. */ -static inline bool mm_tlb_flush_pending(struct mm_struct *mm) +static inline int mm_tlb_flush_pending(struct mm_struct *mm) { + int nr_pending; + barrier(); - return atomic_read(&mm->tlb_flush_pending) > 0; + nr_pending = atomic_read(&mm->tlb_flush_pending); + return nr_pending; } static inline void set_tlb_flush_pending(struct mm_struct *mm) { diff --git a/mm/memory.c b/mm/memory.c index d5c5e6497c70..b5320e96ec51 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb) void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) { struct mmu_gather_batch *batch, *next; - bool flushed = tlb_flush_mmu(tlb); + if (!tlb->fullmm && !tlb->need_flush_all && + mm_tlb_flush_pending(tlb->mm) > 1) { + tlb->start = min(start, tlb->start); + tlb->end = max(end, tlb->end); + } + + tlb_flush_mmu(tlb); clear_tlb_flush_pending(tlb->mm); - if (!flushed && mm_tlb_flush_pending(tlb->mm)) - flush_tlb_mm_range(tlb->mm, start, end, 0UL); /* keep the page table cache within bounds */ check_pgt_cache(); > > Note also that the use of set/clear_tlb_flush_pending() is only applicable > following my pending fix that changes the pending indication from bool to > atomic_t. Sure, I saw it in current mmots. Without your good job, my patch never work. 
:) Thanks for the head up. > > For the record here is my test, followed by the patch to add latency. There > are some magic numbers that may not apply to your system (I got tired of > trying to time the system). If you run the test in a VM, the pause-loop > exiting can potentially prevent the issue from appearing. Thanks for the sharing. I will try it, too. > > -- > > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > > #define PAGE_SIZE (4096) > #define N_PAGES (65536ull * 16) > > #define CHANGED_VAL (7) > #define BASE_VAL (9) > > #define max(a,b) \ > ({ __typeof__ (a) _a = (a); \ > __typeof__ (b) _b = (b); \ > _a > _b ? _a : _b; }) > > #define STEP_HELPERS_RUN (1) > #define STEP_DONTNEED_DONE (2) > #define STEP_ACCESS_PAUSED (4) > > volatile int sync_step = STEP_ACCESS_PAUSED; > volatile char *p; > int dirty_fd, ksm_sharing_fd, ksm_run_fd; > uint64_t soft_dirty_time, madvise_time, soft_dirty_delta, madvise_delta; > > static inline unsigned long rdtsc() > { > unsigned long hi, lo; > > __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi)); > return lo | (hi << 32); > } > > static inline void wait_rdtsc(unsigned long cycles) > { > unsigned long tsc = rdtsc(); > > while (rdtsc() - tsc < cycles) > __asm__ __volatile__ ("rep nop" ::: "memory"); > } > > static void break_sharing(void) > { > char buf[20]; > > pwrite(ksm_run_fd, "2", 1, 0); > > printf("waiting for page sharing to be broken\n"); > do { > pread(ksm_sharing_fd, buf, sizeof(buf), 0); > } while (strtoul(buf, NULL, sizeof(buf))); > } > > > static inline void wait_step(unsigned int step) > { > while (!(sync_step & step)) > asm volatile ("rep nop":::"memory"); > } > > static void *big_madvise_thread(void *ign) > { > while (1) { > uint64_t tsc; > > wait_step(STEP_HELPERS_RUN); > wait_rdtsc(madvise_delta); > tsc = rdtsc(); > madvise((void*)p, PAGE_SIZE * N_PAGES, MADV_FREE); > madvise_time = rdtsc() - tsc; > sync_step = STEP_DONTNEED_DONE; > } > } > > static void *soft_dirty_thread(void *ign) > { > while (1) { > int r; > uint64_t tsc; > > wait_step(STEP_HELPERS_RUN | STEP_DONTNEED_DONE); > wait_rdtsc(soft_dirty_delta); > > tsc = rdtsc(); > r = pwrite(dirty_fd, "4", 1, 0); > assert(r == 1); > soft_dirty_time = rdtsc() - tsc; > wait_step(STEP_DONTNEED_DONE); > sync_step = STEP_ACCESS_PAUSED; > } > } > > void main(void) > { > pthread_t aux_thread, aux_thread2; > char pathname[256]; > long i; > volatile char c; > > sprintf(pathname, "/proc/%d/clear_refs", getpid()); > dirty_fd = open(pathname, O_RDWR); > > ksm_sharing_fd = open("/sys/kernel/mm/ksm/pages_sharing", O_RDONLY); > assert(ksm_sharing_fd >= 0); > > ksm_run_fd = open("/sys/kernel/mm/ksm/run", O_RDWR); > assert(ksm_run_fd >= 0); > > pwrite(ksm_run_fd, "0", 1, 0); > > p = mmap(0, PAGE_SIZE * N_PAGES, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); > assert(p != MAP_FAILED); > madvise((void*)p, PAGE_SIZE * N_PAGES, MADV_MERGEABLE); > > memset((void*)p, BASE_VAL, PAGE_SIZE * 2); > for (i = 2; i < N_PAGES; i++) > c = p[PAGE_SIZE * i]; > > pthread_create(&aux_thread, NULL, big_madvise_thread, NULL); > pthread_create(&aux_thread2, NULL, soft_dirty_thread, NULL); > > while (1) { > break_sharing(); > *(p + 64) = BASE_VAL; // cache in TLB and break KSM > pwrite(ksm_run_fd, "1", 1, 0); > > wait_rdtsc(0x8000000ull); > sync_step = STEP_HELPERS_RUN; > wait_rdtsc(0x4000000ull); > > *(p+64) = CHANGED_VAL; > > wait_step(STEP_ACCESS_PAUSED); // wait for TLB to be flushed > if (*(p+64) != CHANGED_VAL || > *(p + 
PAGE_SIZE + 64) == CHANGED_VAL) { > printf("KSM error\n"); > exit(EXIT_FAILURE); > } > > printf("No failure yet\n"); > > soft_dirty_delta = max(0, (long)madvise_time - (long)soft_dirty_time); > madvise_delta = max(0, (long)soft_dirty_time - (long)madvise_time); > } > } > > -- 8< -- > > Subject: [PATCH] TLB flush delay to trigger failure > > --- > fs/proc/task_mmu.c | 2 ++ > mm/ksm.c | 2 ++ > mm/madvise.c | 2 ++ > 3 files changed, 6 insertions(+) > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 520802da059c..c13259251210 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -16,6 +16,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -1076,6 +1077,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, > walk_page_range(0, mm->highest_vm_end, &clear_refs_walk); > if (type == CLEAR_REFS_SOFT_DIRTY) > mmu_notifier_invalidate_range_end(mm, 0, -1); > + msleep(5); > flush_tlb_mm(mm); > up_read(&mm->mmap_sem); > out_mm: > diff --git a/mm/ksm.c b/mm/ksm.c > index 216184af0e19..317adbb48b0f 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -39,6 +39,7 @@ > #include > #include > #include > +#include > > #include > #include "internal.h" > @@ -960,6 +961,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > mmun_end = addr + PAGE_SIZE; > mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); > > + msleep(5); > ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); > if (!pte_same(*ptep, orig_pte)) { > pte_unmap_unlock(ptep, ptl); > diff --git a/mm/madvise.c b/mm/madvise.c > index 25b78ee4fc2c..e4c852360f2c 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -23,6 +23,7 @@ > #include > #include > #include > +#include > > #include > > @@ -472,6 +473,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma, > mmu_notifier_invalidate_range_start(mm, start, end); > madvise_free_page_range(&tlb, vma, start, end); > mmu_notifier_invalidate_range_end(mm, start, end); > + msleep(5); > tlb_finish_mmu(&tlb, start, end); > > return 0; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id 039E56B025F for ; Wed, 26 Jul 2017 19:44:58 -0400 (EDT) Received: by mail-pg0-f69.google.com with SMTP id v190so226092021pgv.12 for ; Wed, 26 Jul 2017 16:44:57 -0700 (PDT) Received: from lgeamrelo11.lge.com (LGEAMRELO11.lge.com. [156.147.23.51]) by mx.google.com with ESMTP id 1si10848160plj.403.2017.07.26.16.44.56 for ; Wed, 26 Jul 2017 16:44:57 -0700 (PDT) Date: Thu, 27 Jul 2017 08:44:54 +0900 From: Minchan Kim Subject: Re: Potential race in TLB flush batching? 
Message-ID: <20170726234454.GB4491@bbox> References: <4DC97890-9FFA-4BA4-B300-B679BAB2136D@gmail.com> <20170720074342.otez35bme5gytnxl@suse.de> <20170724095832.vgvku6vlxkv75r3k@suse.de> <20170725073748.GB22652@bbox> <20170725085132.iysanhtqkgopegob@suse.de> <20170725091115.GA22920@bbox> <20170725100722.2dxnmgypmwnrfawp@suse.de> <20170726054306.GA11100@bbox> <20170726092228.pyjxamxweslgaemi@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170726092228.pyjxamxweslgaemi@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Nadav Amit , Andy Lutomirski , "open list:MEMORY MANAGEMENT" Hi Mel, On Wed, Jul 26, 2017 at 10:22:28AM +0100, Mel Gorman wrote: > On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote: > > > I'm relying on the fact you are the madv_free author to determine if > > > it's really necessary. The race in question is CPU 0 running madv_free > > > and updating some PTEs while CPU 1 is also running madv_free and looking > > > at the same PTEs. CPU 1 may have writable TLB entries for a page but fail > > > the pte_dirty check (because CPU 0 has updated it already) and potentially > > > fail to flush. Hence, when madv_free on CPU 1 returns, there are still > > > potentially writable TLB entries and the underlying PTE is still present > > > so that a subsequent write does not necessarily propagate the dirty bit > > > to the underlying PTE any more. Reclaim at some unknown time at the future > > > may then see that the PTE is still clean and discard the page even though > > > a write has happened in the meantime. I think this is possible but I could > > > have missed some protection in madv_free that prevents it happening. > > > > Thanks for the detail. You didn't miss anything. It can happen and then > > it's really bug. IOW, if application does write something after madv_free, > > it must see the written value, not zero. > > > > How about adding [set|clear]_tlb_flush_pending in tlb batchin interface? > > With it, when tlb_finish_mmu is called, we can know we skip the flush > > but there is pending flush, so flush focefully to avoid madv_dontneed > > as well as madv_free scenario. > > > > I *think* this is ok as it's simply more expensive on the KSM side in > the event of a race but no other harmful change is made assuming that > KSM is the only race-prone. The check for mm_tlb_flush_pending also > happens under the PTL so there should be sufficient protection from the > mm struct update being visible at teh right time. > > Check using the test program from "mm: Always flush VMA ranges affected > by zap_page_range v2" if it handles the madvise case as well as that > would give some degree of safety. Make sure it's tested against 4.13-rc2 > instead of mmotm which already includes the madv_dontneed fix. If yours > works for both then it supersedes the mmotm patch. Okay, I will test it on 4.13-rc2 + Nadav's atomic tlb_flush_pending + my patch fixed partial flush problem pointed out by Nadav. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
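The user-visible contract being defended in the exchange above is that a write issued after madvise(MADV_FREE) must never be lost to reclaim. Below is a minimal userspace illustration of that contract, not a reproducer for the race itself (that needs the msleep() timing hack posted earlier in the thread); it assumes a kernel and libc that expose MADV_FREE (Linux 4.5+), and the 4096-byte length and marker values are arbitrary.

#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	const size_t len = 4096;
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return EXIT_FAILURE;

	memset(p, 0xaa, len);		/* dirty the page */
	madvise(p, len, MADV_FREE);	/* page may now be reclaimed lazily */
	p[0] = 0x55;			/* writing again cancels the lazy free */

	/*
	 * Even if reclaim runs after this point, the page must not be
	 * discarded and p[0] must read back 0x55, never zero.  The race
	 * discussed above could leave the PTE (and a remote TLB entry)
	 * looking clean despite the write, letting reclaim drop the page.
	 */
	assert(p[0] == 0x55);
	return EXIT_SUCCESS;
}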
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id D790C6B025F for ; Wed, 26 Jul 2017 20:09:12 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id a2so231320526pgn.15 for ; Wed, 26 Jul 2017 17:09:12 -0700 (PDT) Received: from mail-pf0-x241.google.com (mail-pf0-x241.google.com. [2607:f8b0:400e:c00::241]) by mx.google.com with ESMTPS id f33si7612510plf.725.2017.07.26.17.09.11 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 26 Jul 2017 17:09:11 -0700 (PDT) Received: by mail-pf0-x241.google.com with SMTP id g69so7629784pfe.1 for ; Wed, 26 Jul 2017 17:09:11 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: Potential race in TLB flush batching? From: Nadav Amit In-Reply-To: <20170726234025.GA4491@bbox> Date: Wed, 26 Jul 2017 17:09:09 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <60FF1876-AC4F-49BB-BC36-A144C3B6EA9E@gmail.com> References: <20170720074342.otez35bme5gytnxl@suse.de> <20170724095832.vgvku6vlxkv75r3k@suse.de> <20170725073748.GB22652@bbox> <20170725085132.iysanhtqkgopegob@suse.de> <20170725091115.GA22920@bbox> <20170725100722.2dxnmgypmwnrfawp@suse.de> <20170726054306.GA11100@bbox> <20170726092228.pyjxamxweslgaemi@suse.de> <20170726234025.GA4491@bbox> Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Mel Gorman , Andy Lutomirski , "open list:MEMORY MANAGEMENT" Minchan Kim wrote: > Hello Nadav, >=20 > On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote: >> Mel Gorman wrote: >>=20 >>> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote: >>>>> I'm relying on the fact you are the madv_free author to determine = if >>>>> it's really necessary. The race in question is CPU 0 running = madv_free >>>>> and updating some PTEs while CPU 1 is also running madv_free and = looking >>>>> at the same PTEs. CPU 1 may have writable TLB entries for a page = but fail >>>>> the pte_dirty check (because CPU 0 has updated it already) and = potentially >>>>> fail to flush. Hence, when madv_free on CPU 1 returns, there are = still >>>>> potentially writable TLB entries and the underlying PTE is still = present >>>>> so that a subsequent write does not necessarily propagate the = dirty bit >>>>> to the underlying PTE any more. Reclaim at some unknown time at = the future >>>>> may then see that the PTE is still clean and discard the page even = though >>>>> a write has happened in the meantime. I think this is possible but = I could >>>>> have missed some protection in madv_free that prevents it = happening. >>>>=20 >>>> Thanks for the detail. You didn't miss anything. It can happen and = then >>>> it's really bug. IOW, if application does write something after = madv_free, >>>> it must see the written value, not zero. >>>>=20 >>>> How about adding [set|clear]_tlb_flush_pending in tlb batchin = interface? >>>> With it, when tlb_finish_mmu is called, we can know we skip the = flush >>>> but there is pending flush, so flush focefully to avoid = madv_dontneed >>>> as well as madv_free scenario. >>>=20 >>> I *think* this is ok as it's simply more expensive on the KSM side = in >>> the event of a race but no other harmful change is made assuming = that >>> KSM is the only race-prone. 
The check for mm_tlb_flush_pending also >>> happens under the PTL so there should be sufficient protection from = the >>> mm struct update being visible at teh right time. >>>=20 >>> Check using the test program from "mm: Always flush VMA ranges = affected >>> by zap_page_range v2" if it handles the madvise case as well as that >>> would give some degree of safety. Make sure it's tested against = 4.13-rc2 >>> instead of mmotm which already includes the madv_dontneed fix. If = yours >>> works for both then it supersedes the mmotm patch. >>>=20 >>> It would also be interesting if Nadav would use his slowdown hack to = see >>> if he can still force the corruption. >>=20 >> The proposed fix for the KSM side is likely to work (I will try = later), but >> on the tlb_finish_mmu() side, I think there is a problem, since if = any TLB >> flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not = be >> executed. This means that tlb_finish_mmu() may flush one TLB entry, = leave >> another one stale and not flush it. >=20 > Okay, I will change that part like this to avoid partial flush = problem. >=20 > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 1c42d69490e4..87d0ebac6605 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct = mm_struct *mm) > * The barriers below prevent the compiler from re-ordering the = instructions > * around the memory barriers that are already present in the code. > */ > -static inline bool mm_tlb_flush_pending(struct mm_struct *mm) > +static inline int mm_tlb_flush_pending(struct mm_struct *mm) > { > + int nr_pending; > + > barrier(); > - return atomic_read(&mm->tlb_flush_pending) > 0; > + nr_pending =3D atomic_read(&mm->tlb_flush_pending); > + return nr_pending; > } > static inline void set_tlb_flush_pending(struct mm_struct *mm) > { > diff --git a/mm/memory.c b/mm/memory.c > index d5c5e6497c70..b5320e96ec51 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb) > void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, = unsigned long end) > { > struct mmu_gather_batch *batch, *next; > - bool flushed =3D tlb_flush_mmu(tlb); >=20 > + if (!tlb->fullmm && !tlb->need_flush_all && > + mm_tlb_flush_pending(tlb->mm) > 1) { I saw you noticed my comment about the access of the flag without a = lock. I must say it feels strange that a memory barrier would be needed here, = but that what I understood from the documentation. > + tlb->start =3D min(start, tlb->start); > + tlb->end =3D max(end, tlb->end); Err=E2=80=A6 You open-code mmu_gather which is arch-specific. It appears = that all of them have start and end members, but not need_flush_all. Besides, I am = not sure whether they regard start and end the same way. > + } > + > + tlb_flush_mmu(tlb); > clear_tlb_flush_pending(tlb->mm); > - if (!flushed && mm_tlb_flush_pending(tlb->mm)) > - flush_tlb_mm_range(tlb->mm, start, end, 0UL); >=20 > /* keep the page table cache within bounds */ > check_pgt_cache(); >> Note also that the use of set/clear_tlb_flush_pending() is only = applicable >> following my pending fix that changes the pending indication from = bool to >> atomic_t. >=20 > Sure, I saw it in current mmots. Without your good job, my patch never = work. :) > Thanks for the head up. Thanks, I really appreciate it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
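To spell out the counter semantics behind the "mm_tlb_flush_pending(tlb->mm) > 1" test quoted above: each batched operation raises the pending count for the mm before it starts clearing PTEs and lowers it once its own flush has been issued, so a thread that has already incremented the counter sees a value above one exactly when some other thread still has a deferred flush outstanding. The following is a standalone model of that bookkeeping using C11 atomics rather than the kernel's atomic_t, deliberately ignoring the memory-ordering details debated later in the thread; struct mm_model is a made-up stand-in for mm_struct.

#include <stdatomic.h>

struct mm_model {
	atomic_int tlb_flush_pending;	/* deferred TLB flushes in flight for this mm */
};

/* Called before a batched unmap starts clearing PTEs. */
static void set_tlb_flush_pending(struct mm_model *mm)
{
	atomic_fetch_add(&mm->tlb_flush_pending, 1);
}

/* Called once the corresponding TLB flush has actually been issued. */
static void clear_tlb_flush_pending(struct mm_model *mm)
{
	atomic_fetch_sub(&mm->tlb_flush_pending, 1);
}

static int mm_tlb_flush_pending(struct mm_model *mm)
{
	return atomic_load(&mm->tlb_flush_pending);
}

/*
 * A thread that has done set_tlb_flush_pending() itself checks for a
 * count greater than one to detect a concurrent operation whose flush
 * it may need to cover; that is the "> 1" in the tlb_finish_mmu() hunk
 * quoted above.
 */
static int someone_else_has_flush_pending(struct mm_model *mm)
{
	return mm_tlb_flush_pending(mm) > 1;
}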
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id 6F0526B025F for ; Wed, 26 Jul 2017 20:34:38 -0400 (EDT) Received: by mail-pg0-f69.google.com with SMTP id u7so188648058pgo.6 for ; Wed, 26 Jul 2017 17:34:38 -0700 (PDT) Received: from lgeamrelo12.lge.com (LGEAMRELO12.lge.com. [156.147.23.52]) by mx.google.com with ESMTP id c3si10781350pld.94.2017.07.26.17.34.35 for ; Wed, 26 Jul 2017 17:34:36 -0700 (PDT) Date: Thu, 27 Jul 2017 09:34:34 +0900 From: Minchan Kim Subject: Re: Potential race in TLB flush batching? Message-ID: <20170727003434.GA537@bbox> References: <20170724095832.vgvku6vlxkv75r3k@suse.de> <20170725073748.GB22652@bbox> <20170725085132.iysanhtqkgopegob@suse.de> <20170725091115.GA22920@bbox> <20170725100722.2dxnmgypmwnrfawp@suse.de> <20170726054306.GA11100@bbox> <20170726092228.pyjxamxweslgaemi@suse.de> <20170726234025.GA4491@bbox> <60FF1876-AC4F-49BB-BC36-A144C3B6EA9E@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <60FF1876-AC4F-49BB-BC36-A144C3B6EA9E@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Nadav Amit Cc: Mel Gorman , Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Wed, Jul 26, 2017 at 05:09:09PM -0700, Nadav Amit wrote: > Minchan Kim wrote: > > > Hello Nadav, > > > > On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote: > >> Mel Gorman wrote: > >> > >>> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote: > >>>>> I'm relying on the fact you are the madv_free author to determine if > >>>>> it's really necessary. The race in question is CPU 0 running madv_free > >>>>> and updating some PTEs while CPU 1 is also running madv_free and looking > >>>>> at the same PTEs. CPU 1 may have writable TLB entries for a page but fail > >>>>> the pte_dirty check (because CPU 0 has updated it already) and potentially > >>>>> fail to flush. Hence, when madv_free on CPU 1 returns, there are still > >>>>> potentially writable TLB entries and the underlying PTE is still present > >>>>> so that a subsequent write does not necessarily propagate the dirty bit > >>>>> to the underlying PTE any more. Reclaim at some unknown time at the future > >>>>> may then see that the PTE is still clean and discard the page even though > >>>>> a write has happened in the meantime. I think this is possible but I could > >>>>> have missed some protection in madv_free that prevents it happening. > >>>> > >>>> Thanks for the detail. You didn't miss anything. It can happen and then > >>>> it's really bug. IOW, if application does write something after madv_free, > >>>> it must see the written value, not zero. > >>>> > >>>> How about adding [set|clear]_tlb_flush_pending in tlb batchin interface? > >>>> With it, when tlb_finish_mmu is called, we can know we skip the flush > >>>> but there is pending flush, so flush focefully to avoid madv_dontneed > >>>> as well as madv_free scenario. > >>> > >>> I *think* this is ok as it's simply more expensive on the KSM side in > >>> the event of a race but no other harmful change is made assuming that > >>> KSM is the only race-prone. The check for mm_tlb_flush_pending also > >>> happens under the PTL so there should be sufficient protection from the > >>> mm struct update being visible at teh right time. 
> >>> > >>> Check using the test program from "mm: Always flush VMA ranges affected > >>> by zap_page_range v2" if it handles the madvise case as well as that > >>> would give some degree of safety. Make sure it's tested against 4.13-rc2 > >>> instead of mmotm which already includes the madv_dontneed fix. If yours > >>> works for both then it supersedes the mmotm patch. > >>> > >>> It would also be interesting if Nadav would use his slowdown hack to see > >>> if he can still force the corruption. > >> > >> The proposed fix for the KSM side is likely to work (I will try later), but > >> on the tlb_finish_mmu() side, I think there is a problem, since if any TLB > >> flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not be > >> executed. This means that tlb_finish_mmu() may flush one TLB entry, leave > >> another one stale and not flush it. > > > > Okay, I will change that part like this to avoid partial flush problem. > > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > > index 1c42d69490e4..87d0ebac6605 100644 > > --- a/include/linux/mm_types.h > > +++ b/include/linux/mm_types.h > > @@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) > > * The barriers below prevent the compiler from re-ordering the instructions > > * around the memory barriers that are already present in the code. > > */ > > -static inline bool mm_tlb_flush_pending(struct mm_struct *mm) > > +static inline int mm_tlb_flush_pending(struct mm_struct *mm) > > { > > + int nr_pending; > > + > > barrier(); > > - return atomic_read(&mm->tlb_flush_pending) > 0; > > + nr_pending = atomic_read(&mm->tlb_flush_pending); > > + return nr_pending; > > } > > static inline void set_tlb_flush_pending(struct mm_struct *mm) > > { > > diff --git a/mm/memory.c b/mm/memory.c > > index d5c5e6497c70..b5320e96ec51 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb) > > void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) > > { > > struct mmu_gather_batch *batch, *next; > > - bool flushed = tlb_flush_mmu(tlb); > > > > + if (!tlb->fullmm && !tlb->need_flush_all && > > + mm_tlb_flush_pending(tlb->mm) > 1) { > > I saw you noticed my comment about the access of the flag without a lock. I > must say it feels strange that a memory barrier would be needed here, but > that what I understood from the documentation. I saw your recent barriers fix patch, too. [PATCH v2 2/2] mm: migrate: fix barriers around tlb_flush_pending As I commented out in there, I hope to use below here without being aware of complex barrier stuff. Instead, mm_tlb_flush_pending should call the right barrier inside. mm_tlb_flush_pending(tlb->mm, false:no-pte-locked) > 1 > > > + tlb->start = min(start, tlb->start); > > + tlb->end = max(end, tlb->end); > > Erra?| You open-code mmu_gather which is arch-specific. It appears that all of > them have start and end members, but not need_flush_all. Besides, I am not When I see tlb_gather_mmu which is not arch-specific, it intializes need_flush_all to zero so it would be no harmful although some of architecture doesn't set the flag. Please correct me if I miss something. > sure whether they regard start and end the same way. I understand your worry but my patch takes longer range by min/max so I cannot imagine how it breaks. During looking the code, I found __tlb_adjust_range so better to use it rather than open-code. 
diff --git a/mm/memory.c b/mm/memory.c index b5320e96ec51..b23188daa396 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -288,10 +288,8 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long e struct mmu_gather_batch *batch, *next; if (!tlb->fullmm && !tlb->need_flush_all && - mm_tlb_flush_pending(tlb->mm) > 1) { - tlb->start = min(start, tlb->start); - tlb->end = max(end, tlb->end); - } + mm_tlb_flush_pending(tlb->mm) > 1) + __tlb_adjust_range(tlb->mm, start, end - start); tlb_flush_mmu(tlb); clear_tlb_flush_pending(tlb->mm); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f197.google.com (mail-pf0-f197.google.com [209.85.192.197]) by kanga.kvack.org (Postfix) with ESMTP id DCF0C6B025F for ; Wed, 26 Jul 2017 20:49:01 -0400 (EDT) Received: by mail-pf0-f197.google.com with SMTP id g9so13565995pfk.13 for ; Wed, 26 Jul 2017 17:49:01 -0700 (PDT) Received: from mail-pg0-x244.google.com (mail-pg0-x244.google.com. [2607:f8b0:400e:c05::244]) by mx.google.com with ESMTPS id m4si10485198pgs.108.2017.07.26.17.49.00 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 26 Jul 2017 17:49:00 -0700 (PDT) Received: by mail-pg0-x244.google.com with SMTP id v190so18806417pgv.1 for ; Wed, 26 Jul 2017 17:49:00 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: Potential race in TLB flush batching? From: Nadav Amit In-Reply-To: <20170727003434.GA537@bbox> Date: Wed, 26 Jul 2017 17:48:58 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <77AFE0A4-FE3D-4E05-B248-30ADE2F184EF@gmail.com> References: <20170724095832.vgvku6vlxkv75r3k@suse.de> <20170725073748.GB22652@bbox> <20170725085132.iysanhtqkgopegob@suse.de> <20170725091115.GA22920@bbox> <20170725100722.2dxnmgypmwnrfawp@suse.de> <20170726054306.GA11100@bbox> <20170726092228.pyjxamxweslgaemi@suse.de> <20170726234025.GA4491@bbox> <60FF1876-AC4F-49BB-BC36-A144C3B6EA9E@gmail.com> <20170727003434.GA537@bbox> Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Mel Gorman , Andy Lutomirski , "open list:MEMORY MANAGEMENT" Minchan Kim wrote: > On Wed, Jul 26, 2017 at 05:09:09PM -0700, Nadav Amit wrote: >> Minchan Kim wrote: >>=20 >>> Hello Nadav, >>>=20 >>> On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote: >>>> Mel Gorman wrote: >>>>=20 >>>>> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote: >>>>>>> I'm relying on the fact you are the madv_free author to = determine if >>>>>>> it's really necessary. The race in question is CPU 0 running = madv_free >>>>>>> and updating some PTEs while CPU 1 is also running madv_free and = looking >>>>>>> at the same PTEs. CPU 1 may have writable TLB entries for a page = but fail >>>>>>> the pte_dirty check (because CPU 0 has updated it already) and = potentially >>>>>>> fail to flush. Hence, when madv_free on CPU 1 returns, there are = still >>>>>>> potentially writable TLB entries and the underlying PTE is still = present >>>>>>> so that a subsequent write does not necessarily propagate the = dirty bit >>>>>>> to the underlying PTE any more. Reclaim at some unknown time at = the future >>>>>>> may then see that the PTE is still clean and discard the page = even though >>>>>>> a write has happened in the meantime. 
I think this is possible = but I could >>>>>>> have missed some protection in madv_free that prevents it = happening. >>>>>>=20 >>>>>> Thanks for the detail. You didn't miss anything. It can happen = and then >>>>>> it's really bug. IOW, if application does write something after = madv_free, >>>>>> it must see the written value, not zero. >>>>>>=20 >>>>>> How about adding [set|clear]_tlb_flush_pending in tlb batchin = interface? >>>>>> With it, when tlb_finish_mmu is called, we can know we skip the = flush >>>>>> but there is pending flush, so flush focefully to avoid = madv_dontneed >>>>>> as well as madv_free scenario. >>>>>=20 >>>>> I *think* this is ok as it's simply more expensive on the KSM side = in >>>>> the event of a race but no other harmful change is made assuming = that >>>>> KSM is the only race-prone. The check for mm_tlb_flush_pending = also >>>>> happens under the PTL so there should be sufficient protection = from the >>>>> mm struct update being visible at teh right time. >>>>>=20 >>>>> Check using the test program from "mm: Always flush VMA ranges = affected >>>>> by zap_page_range v2" if it handles the madvise case as well as = that >>>>> would give some degree of safety. Make sure it's tested against = 4.13-rc2 >>>>> instead of mmotm which already includes the madv_dontneed fix. If = yours >>>>> works for both then it supersedes the mmotm patch. >>>>>=20 >>>>> It would also be interesting if Nadav would use his slowdown hack = to see >>>>> if he can still force the corruption. >>>>=20 >>>> The proposed fix for the KSM side is likely to work (I will try = later), but >>>> on the tlb_finish_mmu() side, I think there is a problem, since if = any TLB >>>> flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will = not be >>>> executed. This means that tlb_finish_mmu() may flush one TLB entry, = leave >>>> another one stale and not flush it. >>>=20 >>> Okay, I will change that part like this to avoid partial flush = problem. >>>=20 >>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >>> index 1c42d69490e4..87d0ebac6605 100644 >>> --- a/include/linux/mm_types.h >>> +++ b/include/linux/mm_types.h >>> @@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct = mm_struct *mm) >>> * The barriers below prevent the compiler from re-ordering the = instructions >>> * around the memory barriers that are already present in the code. >>> */ >>> -static inline bool mm_tlb_flush_pending(struct mm_struct *mm) >>> +static inline int mm_tlb_flush_pending(struct mm_struct *mm) >>> { >>> + int nr_pending; >>> + >>> barrier(); >>> - return atomic_read(&mm->tlb_flush_pending) > 0; >>> + nr_pending =3D atomic_read(&mm->tlb_flush_pending); >>> + return nr_pending; >>> } >>> static inline void set_tlb_flush_pending(struct mm_struct *mm) >>> { >>> diff --git a/mm/memory.c b/mm/memory.c >>> index d5c5e6497c70..b5320e96ec51 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb) >>> void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, = unsigned long end) >>> { >>> struct mmu_gather_batch *batch, *next; >>> - bool flushed =3D tlb_flush_mmu(tlb); >>>=20 >>> + if (!tlb->fullmm && !tlb->need_flush_all && >>> + mm_tlb_flush_pending(tlb->mm) > 1) { >>=20 >> I saw you noticed my comment about the access of the flag without a = lock. I >> must say it feels strange that a memory barrier would be needed here, = but >> that what I understood from the documentation. 
>=20 > I saw your recent barriers fix patch, too. > [PATCH v2 2/2] mm: migrate: fix barriers around tlb_flush_pending >=20 > As I commented out in there, I hope to use below here without being > aware of complex barrier stuff. Instead, mm_tlb_flush_pending should > call the right barrier inside. >=20 > mm_tlb_flush_pending(tlb->mm, false:no-pte-locked) > 1 I will address it in v3. >=20 >>> + tlb->start =3D min(start, tlb->start); >>> + tlb->end =3D max(end, tlb->end); >>=20 >> Err=E2=80=A6 You open-code mmu_gather which is arch-specific. It = appears that all of >> them have start and end members, but not need_flush_all. Besides, I = am not >=20 > When I see tlb_gather_mmu which is not arch-specific, it intializes > need_flush_all to zero so it would be no harmful although some of > architecture doesn't set the flag. > Please correct me if I miss something. Oh.. my bad. I missed the fact that this code is under =E2=80=9C#ifdef HAVE_GENERIC_MMU_GATHER=E2=80=9D. But that means that arch-specific = tlb_finish_mmu() implementations (s390, arm) may need to be modified as well. >> sure whether they regard start and end the same way. >=20 > I understand your worry but my patch takes longer range by min/max > so I cannot imagine how it breaks. During looking the code, I found > __tlb_adjust_range so better to use it rather than open-code. >=20 >=20 > diff --git a/mm/memory.c b/mm/memory.c > index b5320e96ec51..b23188daa396 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -288,10 +288,8 @@ void tlb_finish_mmu(struct mmu_gather *tlb, = unsigned long start, unsigned long e > struct mmu_gather_batch *batch, *next; >=20 > if (!tlb->fullmm && !tlb->need_flush_all && > - mm_tlb_flush_pending(tlb->mm) > 1) { > - tlb->start =3D min(start, tlb->start); > - tlb->end =3D max(end, tlb->end); > - } > + mm_tlb_flush_pending(tlb->mm) > 1) > + __tlb_adjust_range(tlb->mm, start, end - start); >=20 > tlb_flush_mmu(tlb); > clear_tlb_flush_pending(tlb->mm); This one is better, especially as I now understand it is only for the generic MMU gather (which I missed before). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id 9E2EE6B025F for ; Wed, 26 Jul 2017 21:13:19 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id y129so159657392pgy.1 for ; Wed, 26 Jul 2017 18:13:19 -0700 (PDT) Received: from mail-pg0-x244.google.com (mail-pg0-x244.google.com. [2607:f8b0:400e:c05::244]) by mx.google.com with ESMTPS id m10si7718961pgc.959.2017.07.26.18.13.18 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 26 Jul 2017 18:13:18 -0700 (PDT) Received: by mail-pg0-x244.google.com with SMTP id k190so3104148pgk.4 for ; Wed, 26 Jul 2017 18:13:18 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: Potential race in TLB flush batching? 
From: Nadav Amit In-Reply-To: <77AFE0A4-FE3D-4E05-B248-30ADE2F184EF@gmail.com> Date: Wed, 26 Jul 2017 18:13:15 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: References: <20170724095832.vgvku6vlxkv75r3k@suse.de> <20170725073748.GB22652@bbox> <20170725085132.iysanhtqkgopegob@suse.de> <20170725091115.GA22920@bbox> <20170725100722.2dxnmgypmwnrfawp@suse.de> <20170726054306.GA11100@bbox> <20170726092228.pyjxamxweslgaemi@suse.de> <20170726234025.GA4491@bbox> <60FF1876-AC4F-49BB-BC36-A144C3B6EA9E@gmail.com> <20170727003434.GA537@bbox> <77AFE0A4-FE3D-4E05-B248-30ADE2F184EF@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Mel Gorman , Andy Lutomirski , "open list:MEMORY MANAGEMENT" Nadav Amit wrote: > Minchan Kim wrote: >=20 >> On Wed, Jul 26, 2017 at 05:09:09PM -0700, Nadav Amit wrote: >>> Minchan Kim wrote: >>>=20 >>>> Hello Nadav, >>>>=20 >>>> On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote: >>>>> Mel Gorman wrote: >>>>>=20 >>>>>> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote: >>>>>>>> I'm relying on the fact you are the madv_free author to = determine if >>>>>>>> it's really necessary. The race in question is CPU 0 running = madv_free >>>>>>>> and updating some PTEs while CPU 1 is also running madv_free = and looking >>>>>>>> at the same PTEs. CPU 1 may have writable TLB entries for a = page but fail >>>>>>>> the pte_dirty check (because CPU 0 has updated it already) and = potentially >>>>>>>> fail to flush. Hence, when madv_free on CPU 1 returns, there = are still >>>>>>>> potentially writable TLB entries and the underlying PTE is = still present >>>>>>>> so that a subsequent write does not necessarily propagate the = dirty bit >>>>>>>> to the underlying PTE any more. Reclaim at some unknown time at = the future >>>>>>>> may then see that the PTE is still clean and discard the page = even though >>>>>>>> a write has happened in the meantime. I think this is possible = but I could >>>>>>>> have missed some protection in madv_free that prevents it = happening. >>>>>>>=20 >>>>>>> Thanks for the detail. You didn't miss anything. It can happen = and then >>>>>>> it's really bug. IOW, if application does write something after = madv_free, >>>>>>> it must see the written value, not zero. >>>>>>>=20 >>>>>>> How about adding [set|clear]_tlb_flush_pending in tlb batchin = interface? >>>>>>> With it, when tlb_finish_mmu is called, we can know we skip the = flush >>>>>>> but there is pending flush, so flush focefully to avoid = madv_dontneed >>>>>>> as well as madv_free scenario. >>>>>>=20 >>>>>> I *think* this is ok as it's simply more expensive on the KSM = side in >>>>>> the event of a race but no other harmful change is made assuming = that >>>>>> KSM is the only race-prone. The check for mm_tlb_flush_pending = also >>>>>> happens under the PTL so there should be sufficient protection = from the >>>>>> mm struct update being visible at teh right time. >>>>>>=20 >>>>>> Check using the test program from "mm: Always flush VMA ranges = affected >>>>>> by zap_page_range v2" if it handles the madvise case as well as = that >>>>>> would give some degree of safety. Make sure it's tested against = 4.13-rc2 >>>>>> instead of mmotm which already includes the madv_dontneed fix. If = yours >>>>>> works for both then it supersedes the mmotm patch. >>>>>>=20 >>>>>> It would also be interesting if Nadav would use his slowdown hack = to see >>>>>> if he can still force the corruption. 
>>>>>=20 >>>>> The proposed fix for the KSM side is likely to work (I will try = later), but >>>>> on the tlb_finish_mmu() side, I think there is a problem, since if = any TLB >>>>> flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will = not be >>>>> executed. This means that tlb_finish_mmu() may flush one TLB = entry, leave >>>>> another one stale and not flush it. >>>>=20 >>>> Okay, I will change that part like this to avoid partial flush = problem. >>>>=20 >>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >>>> index 1c42d69490e4..87d0ebac6605 100644 >>>> --- a/include/linux/mm_types.h >>>> +++ b/include/linux/mm_types.h >>>> @@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct = mm_struct *mm) >>>> * The barriers below prevent the compiler from re-ordering the = instructions >>>> * around the memory barriers that are already present in the code. >>>> */ >>>> -static inline bool mm_tlb_flush_pending(struct mm_struct *mm) >>>> +static inline int mm_tlb_flush_pending(struct mm_struct *mm) >>>> { >>>> + int nr_pending; >>>> + >>>> barrier(); >>>> - return atomic_read(&mm->tlb_flush_pending) > 0; >>>> + nr_pending =3D atomic_read(&mm->tlb_flush_pending); >>>> + return nr_pending; >>>> } >>>> static inline void set_tlb_flush_pending(struct mm_struct *mm) >>>> { >>>> diff --git a/mm/memory.c b/mm/memory.c >>>> index d5c5e6497c70..b5320e96ec51 100644 >>>> --- a/mm/memory.c >>>> +++ b/mm/memory.c >>>> @@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb) >>>> void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, = unsigned long end) >>>> { >>>> struct mmu_gather_batch *batch, *next; >>>> - bool flushed =3D tlb_flush_mmu(tlb); >>>>=20 >>>> + if (!tlb->fullmm && !tlb->need_flush_all && >>>> + mm_tlb_flush_pending(tlb->mm) > 1) { >>>=20 >>> I saw you noticed my comment about the access of the flag without a = lock. I >>> must say it feels strange that a memory barrier would be needed = here, but >>> that what I understood from the documentation. >>=20 >> I saw your recent barriers fix patch, too. >> [PATCH v2 2/2] mm: migrate: fix barriers around tlb_flush_pending >>=20 >> As I commented out in there, I hope to use below here without being >> aware of complex barrier stuff. Instead, mm_tlb_flush_pending should >> call the right barrier inside. >>=20 >> mm_tlb_flush_pending(tlb->mm, false:no-pte-locked) > 1 >=20 > I will address it in v3. >=20 >=20 >>>> + tlb->start =3D min(start, tlb->start); >>>> + tlb->end =3D max(end, tlb->end); >>>=20 >>> Err=E2=80=A6 You open-code mmu_gather which is arch-specific. It = appears that all of >>> them have start and end members, but not need_flush_all. Besides, I = am not >>=20 >> When I see tlb_gather_mmu which is not arch-specific, it intializes >> need_flush_all to zero so it would be no harmful although some of >> architecture doesn't set the flag. >> Please correct me if I miss something. >=20 > Oh.. my bad. I missed the fact that this code is under =E2=80=9C#ifdef > HAVE_GENERIC_MMU_GATHER=E2=80=9D. But that means that arch-specific = tlb_finish_mmu() > implementations (s390, arm) may need to be modified as well. >=20 >>> sure whether they regard start and end the same way. >>=20 >> I understand your worry but my patch takes longer range by min/max >> so I cannot imagine how it breaks. During looking the code, I found >> __tlb_adjust_range so better to use it rather than open-code. 
>>=20 >>=20 >> diff --git a/mm/memory.c b/mm/memory.c >> index b5320e96ec51..b23188daa396 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -288,10 +288,8 @@ void tlb_finish_mmu(struct mmu_gather *tlb, = unsigned long start, unsigned long e >> struct mmu_gather_batch *batch, *next; >>=20 >> if (!tlb->fullmm && !tlb->need_flush_all && >> - mm_tlb_flush_pending(tlb->mm) > 1) { >> - tlb->start =3D min(start, tlb->start); >> - tlb->end =3D max(end, tlb->end); >> - } >> + mm_tlb_flush_pending(tlb->mm) > 1) >> + __tlb_adjust_range(tlb->mm, start, end - start); >>=20 >> tlb_flush_mmu(tlb); >> clear_tlb_flush_pending(tlb->mm); >=20 > This one is better, especially as I now understand it is only for the > generic MMU gather (which I missed before). There is one issue I forgot: pte_accessible() on x86 regards mm_tlb_flush_pending() as an indication for NUMA migration. But now the = code does not make too much sense: if ((pte_flags(a) & _PAGE_PROTNONE) && mm_tlb_flush_pending(mm)) Either we remove the _PAGE_PROTNONE check or we need to use the atomic = field to count separately pending flushes due to migration and due to other reasons. The first option is safer, but Mel objected to it, because of = the performance implications. The second one requires some thought on how to build a single counter for multiple reasons and avoid a potential = overflow. Thoughts? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id 3E8526B025F for ; Thu, 27 Jul 2017 03:04:24 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id e9so106390728pga.5 for ; Thu, 27 Jul 2017 00:04:24 -0700 (PDT) Received: from lgeamrelo13.lge.com (LGEAMRELO13.lge.com. [156.147.23.53]) by mx.google.com with ESMTP id z82si1824624pfd.327.2017.07.27.00.04.22 for ; Thu, 27 Jul 2017 00:04:23 -0700 (PDT) Date: Thu, 27 Jul 2017 16:04:20 +0900 From: Minchan Kim Subject: Re: Potential race in TLB flush batching? Message-ID: <20170727070420.GA1052@bbox> References: <20170725091115.GA22920@bbox> <20170725100722.2dxnmgypmwnrfawp@suse.de> <20170726054306.GA11100@bbox> <20170726092228.pyjxamxweslgaemi@suse.de> <20170726234025.GA4491@bbox> <60FF1876-AC4F-49BB-BC36-A144C3B6EA9E@gmail.com> <20170727003434.GA537@bbox> <77AFE0A4-FE3D-4E05-B248-30ADE2F184EF@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Nadav Amit Cc: Mel Gorman , Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Wed, Jul 26, 2017 at 06:13:15PM -0700, Nadav Amit wrote: > Nadav Amit wrote: > > > Minchan Kim wrote: > > > >> On Wed, Jul 26, 2017 at 05:09:09PM -0700, Nadav Amit wrote: > >>> Minchan Kim wrote: > >>> > >>>> Hello Nadav, > >>>> > >>>> On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote: > >>>>> Mel Gorman wrote: > >>>>> > >>>>>> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote: > >>>>>>>> I'm relying on the fact you are the madv_free author to determine if > >>>>>>>> it's really necessary. The race in question is CPU 0 running madv_free > >>>>>>>> and updating some PTEs while CPU 1 is also running madv_free and looking > >>>>>>>> at the same PTEs. 
CPU 1 may have writable TLB entries for a page but fail > >>>>>>>> the pte_dirty check (because CPU 0 has updated it already) and potentially > >>>>>>>> fail to flush. Hence, when madv_free on CPU 1 returns, there are still > >>>>>>>> potentially writable TLB entries and the underlying PTE is still present > >>>>>>>> so that a subsequent write does not necessarily propagate the dirty bit > >>>>>>>> to the underlying PTE any more. Reclaim at some unknown time at the future > >>>>>>>> may then see that the PTE is still clean and discard the page even though > >>>>>>>> a write has happened in the meantime. I think this is possible but I could > >>>>>>>> have missed some protection in madv_free that prevents it happening. > >>>>>>> > >>>>>>> Thanks for the detail. You didn't miss anything. It can happen and then > >>>>>>> it's really bug. IOW, if application does write something after madv_free, > >>>>>>> it must see the written value, not zero. > >>>>>>> > >>>>>>> How about adding [set|clear]_tlb_flush_pending in tlb batchin interface? > >>>>>>> With it, when tlb_finish_mmu is called, we can know we skip the flush > >>>>>>> but there is pending flush, so flush focefully to avoid madv_dontneed > >>>>>>> as well as madv_free scenario. > >>>>>> > >>>>>> I *think* this is ok as it's simply more expensive on the KSM side in > >>>>>> the event of a race but no other harmful change is made assuming that > >>>>>> KSM is the only race-prone. The check for mm_tlb_flush_pending also > >>>>>> happens under the PTL so there should be sufficient protection from the > >>>>>> mm struct update being visible at teh right time. > >>>>>> > >>>>>> Check using the test program from "mm: Always flush VMA ranges affected > >>>>>> by zap_page_range v2" if it handles the madvise case as well as that > >>>>>> would give some degree of safety. Make sure it's tested against 4.13-rc2 > >>>>>> instead of mmotm which already includes the madv_dontneed fix. If yours > >>>>>> works for both then it supersedes the mmotm patch. > >>>>>> > >>>>>> It would also be interesting if Nadav would use his slowdown hack to see > >>>>>> if he can still force the corruption. > >>>>> > >>>>> The proposed fix for the KSM side is likely to work (I will try later), but > >>>>> on the tlb_finish_mmu() side, I think there is a problem, since if any TLB > >>>>> flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not be > >>>>> executed. This means that tlb_finish_mmu() may flush one TLB entry, leave > >>>>> another one stale and not flush it. > >>>> > >>>> Okay, I will change that part like this to avoid partial flush problem. > >>>> > >>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > >>>> index 1c42d69490e4..87d0ebac6605 100644 > >>>> --- a/include/linux/mm_types.h > >>>> +++ b/include/linux/mm_types.h > >>>> @@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) > >>>> * The barriers below prevent the compiler from re-ordering the instructions > >>>> * around the memory barriers that are already present in the code. 
> >>>> */ > >>>> -static inline bool mm_tlb_flush_pending(struct mm_struct *mm) > >>>> +static inline int mm_tlb_flush_pending(struct mm_struct *mm) > >>>> { > >>>> + int nr_pending; > >>>> + > >>>> barrier(); > >>>> - return atomic_read(&mm->tlb_flush_pending) > 0; > >>>> + nr_pending = atomic_read(&mm->tlb_flush_pending); > >>>> + return nr_pending; > >>>> } > >>>> static inline void set_tlb_flush_pending(struct mm_struct *mm) > >>>> { > >>>> diff --git a/mm/memory.c b/mm/memory.c > >>>> index d5c5e6497c70..b5320e96ec51 100644 > >>>> --- a/mm/memory.c > >>>> +++ b/mm/memory.c > >>>> @@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb) > >>>> void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end) > >>>> { > >>>> struct mmu_gather_batch *batch, *next; > >>>> - bool flushed = tlb_flush_mmu(tlb); > >>>> > >>>> + if (!tlb->fullmm && !tlb->need_flush_all && > >>>> + mm_tlb_flush_pending(tlb->mm) > 1) { > >>> > >>> I saw you noticed my comment about the access of the flag without a lock. I > >>> must say it feels strange that a memory barrier would be needed here, but > >>> that what I understood from the documentation. > >> > >> I saw your recent barriers fix patch, too. > >> [PATCH v2 2/2] mm: migrate: fix barriers around tlb_flush_pending > >> > >> As I commented out in there, I hope to use below here without being > >> aware of complex barrier stuff. Instead, mm_tlb_flush_pending should > >> call the right barrier inside. > >> > >> mm_tlb_flush_pending(tlb->mm, false:no-pte-locked) > 1 > > > > I will address it in v3. > > > > > >>>> + tlb->start = min(start, tlb->start); > >>>> + tlb->end = max(end, tlb->end); > >>> > >>> Erra?| You open-code mmu_gather which is arch-specific. It appears that all of > >>> them have start and end members, but not need_flush_all. Besides, I am not > >> > >> When I see tlb_gather_mmu which is not arch-specific, it intializes > >> need_flush_all to zero so it would be no harmful although some of > >> architecture doesn't set the flag. > >> Please correct me if I miss something. > > > > Oh.. my bad. I missed the fact that this code is under a??#ifdef > > HAVE_GENERIC_MMU_GATHERa??. But that means that arch-specific tlb_finish_mmu() > > implementations (s390, arm) may need to be modified as well. > > > >>> sure whether they regard start and end the same way. > >> > >> I understand your worry but my patch takes longer range by min/max > >> so I cannot imagine how it breaks. During looking the code, I found > >> __tlb_adjust_range so better to use it rather than open-code. > >> > >> > >> diff --git a/mm/memory.c b/mm/memory.c > >> index b5320e96ec51..b23188daa396 100644 > >> --- a/mm/memory.c > >> +++ b/mm/memory.c > >> @@ -288,10 +288,8 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long e > >> struct mmu_gather_batch *batch, *next; > >> > >> if (!tlb->fullmm && !tlb->need_flush_all && > >> - mm_tlb_flush_pending(tlb->mm) > 1) { > >> - tlb->start = min(start, tlb->start); > >> - tlb->end = max(end, tlb->end); > >> - } > >> + mm_tlb_flush_pending(tlb->mm) > 1) > >> + __tlb_adjust_range(tlb->mm, start, end - start); > >> > >> tlb_flush_mmu(tlb); > >> clear_tlb_flush_pending(tlb->mm); > > > > This one is better, especially as I now understand it is only for the > > generic MMU gather (which I missed before). > > There is one issue I forgot: pte_accessible() on x86 regards > mm_tlb_flush_pending() as an indication for NUMA migration. 
But now the code > does not make too much sense: > > if ((pte_flags(a) & _PAGE_PROTNONE) && > mm_tlb_flush_pending(mm)) > > Either we remove the _PAGE_PROTNONE check or we need to use the atomic field > to count separately pending flushes due to migration and due to other > reasons. The first option is safer, but Mel objected to it, because of the > performance implications. The second one requires some thought on how to > build a single counter for multiple reasons and avoid a potential overflow. > > Thoughts? > I'm really new for the autoNUMA so not sure I understand your concern If your concern is that increasing places where add up pending count, autoNUMA performance might be hurt. Right? If so, above _PAGE_PROTNONE check will filter out most of cases? Maybe, Mel could answer. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f71.google.com (mail-wm0-f71.google.com [74.125.82.71]) by kanga.kvack.org (Postfix) with ESMTP id 5EA986B02B4 for ; Thu, 27 Jul 2017 03:21:16 -0400 (EDT) Received: by mail-wm0-f71.google.com with SMTP id g71so12622387wmg.13 for ; Thu, 27 Jul 2017 00:21:16 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id q8si1318739wmd.8.2017.07.27.00.21.15 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 27 Jul 2017 00:21:15 -0700 (PDT) Date: Thu, 27 Jul 2017 08:21:13 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? Message-ID: <20170727072113.dpv2nsqaft3inpru@suse.de> References: <20170725100722.2dxnmgypmwnrfawp@suse.de> <20170726054306.GA11100@bbox> <20170726092228.pyjxamxweslgaemi@suse.de> <20170726234025.GA4491@bbox> <60FF1876-AC4F-49BB-BC36-A144C3B6EA9E@gmail.com> <20170727003434.GA537@bbox> <77AFE0A4-FE3D-4E05-B248-30ADE2F184EF@gmail.com> <20170727070420.GA1052@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20170727070420.GA1052@bbox> Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Nadav Amit , Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Thu, Jul 27, 2017 at 04:04:20PM +0900, Minchan Kim wrote: > > There is one issue I forgot: pte_accessible() on x86 regards > > mm_tlb_flush_pending() as an indication for NUMA migration. But now the code > > does not make too much sense: > > > > if ((pte_flags(a) & _PAGE_PROTNONE) && > > mm_tlb_flush_pending(mm)) > > > > Either we remove the _PAGE_PROTNONE check or we need to use the atomic field > > to count separately pending flushes due to migration and due to other > > reasons. The first option is safer, but Mel objected to it, because of the > > performance implications. The second one requires some thought on how to > > build a single counter for multiple reasons and avoid a potential overflow. > > > > Thoughts? > > > > I'm really new for the autoNUMA so not sure I understand your concern > If your concern is that increasing places where add up pending count, > autoNUMA performance might be hurt. Right? > If so, above _PAGE_PROTNONE check will filter out most of cases? > Maybe, Mel could answer. I'm not sure what I'm being asked. In the case above, the TLB flush pending is only relevant against autonuma-related races so only those PTEs are checked to limit overhead. 
It could be checked on every PTE but it's adding more compiler barriers or more atomic reads which do not appear necessary. If the check is removed, a comment should be added explaining why every PTE has to be checked. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id E80B76B04B3 for ; Thu, 27 Jul 2017 12:04:15 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id w187so16630107pgb.10 for ; Thu, 27 Jul 2017 09:04:15 -0700 (PDT) Received: from mail-pg0-x242.google.com (mail-pg0-x242.google.com. [2607:f8b0:400e:c05::242]) by mx.google.com with ESMTPS id m9si11784569plk.240.2017.07.27.09.04.14 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 27 Jul 2017 09:04:14 -0700 (PDT) Received: by mail-pg0-x242.google.com with SMTP id 125so7495413pgi.5 for ; Thu, 27 Jul 2017 09:04:14 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: Potential race in TLB flush batching? From: Nadav Amit In-Reply-To: <20170727072113.dpv2nsqaft3inpru@suse.de> Date: Thu, 27 Jul 2017 09:04:11 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <68D28CCA-10CC-48F8-A38F-B682A98A4BA5@gmail.com> References: <20170725100722.2dxnmgypmwnrfawp@suse.de> <20170726054306.GA11100@bbox> <20170726092228.pyjxamxweslgaemi@suse.de> <20170726234025.GA4491@bbox> <60FF1876-AC4F-49BB-BC36-A144C3B6EA9E@gmail.com> <20170727003434.GA537@bbox> <77AFE0A4-FE3D-4E05-B248-30ADE2F184EF@gmail.com> <20170727070420.GA1052@bbox> <20170727072113.dpv2nsqaft3inpru@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Minchan Kim , Andy Lutomirski , "open list:MEMORY MANAGEMENT" Mel Gorman wrote: > On Thu, Jul 27, 2017 at 04:04:20PM +0900, Minchan Kim wrote: >>> There is one issue I forgot: pte_accessible() on x86 regards >>> mm_tlb_flush_pending() as an indication for NUMA migration. But now = the code >>> does not make too much sense: >>>=20 >>> if ((pte_flags(a) & _PAGE_PROTNONE) && >>> mm_tlb_flush_pending(mm)) >>>=20 >>> Either we remove the _PAGE_PROTNONE check or we need to use the = atomic field >>> to count separately pending flushes due to migration and due to = other >>> reasons. The first option is safer, but Mel objected to it, because = of the >>> performance implications. The second one requires some thought on = how to >>> build a single counter for multiple reasons and avoid a potential = overflow. >>>=20 >>> Thoughts? >>=20 >> I'm really new for the autoNUMA so not sure I understand your concern >> If your concern is that increasing places where add up pending count, >> autoNUMA performance might be hurt. Right? >> If so, above _PAGE_PROTNONE check will filter out most of cases? >> Maybe, Mel could answer. >=20 > I'm not sure what I'm being asked. In the case above, the TLB flush = pending > is only relevant against autonuma-related races so only those PTEs are > checked to limit overhead. It could be checked on every PTE but it's > adding more compiler barriers or more atomic reads which do not appear > necessary. If the check is removed, a comment should be added = explaining > why every PTE has to be checked. 
I considered breaking tlb_flush_pending to two: tlb_flush_pending_numa = and tlb_flush_pending_other (they can share one atomic64_t field). This way, pte_accessible() would only consider =E2=80=9Ctlb_flush_pending_numa", = and the changes that Minchan proposed would not increase the number unnecessary = TLB flushes. However, considering the complexity of the TLB flushes scheme, and the = fact I am not fully convinced all of these TLB flushes are indeed = unnecessary, I will put it aside. Nadav= -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id AD76B6B0496 for ; Thu, 27 Jul 2017 13:36:19 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id u89so36015323wrc.1 for ; Thu, 27 Jul 2017 10:36:19 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id d43si20307132wrd.85.2017.07.27.10.36.18 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 27 Jul 2017 10:36:18 -0700 (PDT) Date: Thu, 27 Jul 2017 18:36:13 +0100 From: Mel Gorman Subject: Re: Potential race in TLB flush batching? Message-ID: <20170727173613.g3vz2dv3fcxrsnf7@suse.de> References: <20170726092228.pyjxamxweslgaemi@suse.de> <20170726234025.GA4491@bbox> <60FF1876-AC4F-49BB-BC36-A144C3B6EA9E@gmail.com> <20170727003434.GA537@bbox> <77AFE0A4-FE3D-4E05-B248-30ADE2F184EF@gmail.com> <20170727070420.GA1052@bbox> <20170727072113.dpv2nsqaft3inpru@suse.de> <68D28CCA-10CC-48F8-A38F-B682A98A4BA5@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <68D28CCA-10CC-48F8-A38F-B682A98A4BA5@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Nadav Amit Cc: Minchan Kim , Andy Lutomirski , "open list:MEMORY MANAGEMENT" On Thu, Jul 27, 2017 at 09:04:11AM -0700, Nadav Amit wrote: > Mel Gorman wrote: > > > On Thu, Jul 27, 2017 at 04:04:20PM +0900, Minchan Kim wrote: > >>> There is one issue I forgot: pte_accessible() on x86 regards > >>> mm_tlb_flush_pending() as an indication for NUMA migration. But now the code > >>> does not make too much sense: > >>> > >>> if ((pte_flags(a) & _PAGE_PROTNONE) && > >>> mm_tlb_flush_pending(mm)) > >>> > >>> Either we remove the _PAGE_PROTNONE check or we need to use the atomic field > >>> to count separately pending flushes due to migration and due to other > >>> reasons. The first option is safer, but Mel objected to it, because of the > >>> performance implications. The second one requires some thought on how to > >>> build a single counter for multiple reasons and avoid a potential overflow. > >>> > >>> Thoughts? > >> > >> I'm really new for the autoNUMA so not sure I understand your concern > >> If your concern is that increasing places where add up pending count, > >> autoNUMA performance might be hurt. Right? > >> If so, above _PAGE_PROTNONE check will filter out most of cases? > >> Maybe, Mel could answer. > > > > I'm not sure what I'm being asked. In the case above, the TLB flush pending > > is only relevant against autonuma-related races so only those PTEs are > > checked to limit overhead. It could be checked on every PTE but it's > > adding more compiler barriers or more atomic reads which do not appear > > necessary. 
If the check is removed, a comment should be added explaining > > why every PTE has to be checked. > > I considered breaking tlb_flush_pending to two: tlb_flush_pending_numa and > tlb_flush_pending_other (they can share one atomic64_t field). This way, > pte_accessible() would only consider "tlb_flush_pending_numa", and the > changes that Minchan proposed would not increase the number unnecessary TLB > flushes. > > However, considering the complexity of the TLB flushes scheme, and the fact > I am not fully convinced all of these TLB flushes are indeed unnecessary, I > will put it aside. > Ok, I understand now. With a second set/clear of mm_tlb_flush_pending, it is necessary to remove the PROT_NUMA check from pte_accessible because it's no longer change_prot_range that is the only user of concern. At this time, I do not see value in adding two pending fields because it's a maintenance headache and an API that would be harder to get right. It's also not clear it would add any performance advantage and even if it did, it's the type of complexity that would need hard data supporting it. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org
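For reference, the x86 helper the last few messages argue about looked roughly like the sketch below at the time (a lightly abridged rendering of pte_accessible() from arch/x86/include/asm/pgtable.h of that era, not a verbatim copy). The change Mel describes would drop the _PAGE_PROTNONE qualifier, so that any cleared-but-not-yet-flushed PTE is still treated as accessible, not only those cleared by NUMA hinting protection changes.

static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
{
	/* A present PTE may be cached in any TLB. */
	if (pte_flags(a) & _PAGE_PRESENT)
		return true;

	/*
	 * A PROT_NONE (NUMA-hinting) PTE whose mm still has a TLB flush
	 * pending may be cached writable in another CPU's TLB, so it has
	 * to be treated as accessible until that flush completes.  With
	 * batched reclaim flushes also raising the pending count, this
	 * "cleared but not yet flushed" state is no longer unique to NUMA
	 * hinting, which is why removing the _PAGE_PROTNONE test is being
	 * discussed above.
	 */
	if ((pte_flags(a) & _PAGE_PROTNONE) &&
	    mm_tlb_flush_pending(mm))
		return true;

	return false;
}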