From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f197.google.com (mail-pf0-f197.google.com [209.85.192.197]) by kanga.kvack.org (Postfix) with ESMTP id 4D1F56B025F for ; Wed, 19 Jul 2017 16:20:06 -0400 (EDT) Received: by mail-pf0-f197.google.com with SMTP id y62so10756938pfa.3 for ; Wed, 19 Jul 2017 13:20:06 -0700 (PDT) Received: from mail-pg0-x230.google.com (mail-pg0-x230.google.com. [2607:f8b0:400e:c05::230]) by mx.google.com with ESMTPS id o12si504296pfa.545.2017.07.19.13.20.05 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 19 Jul 2017 13:20:05 -0700 (PDT) Received: by mail-pg0-x230.google.com with SMTP id 123so4808181pgj.1 for ; Wed, 19 Jul 2017 13:20:05 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: Potential race in TLB flush batching? From: Nadav Amit In-Reply-To: <20170719195820.drtfmweuhdc4eca6@suse.de> Date: Wed, 19 Jul 2017 13:20:01 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <4BD983A1-724B-4FD7-B502-55351717BC5F@gmail.com> References: <20170711215240.tdpmwmgwcuerjj3o@suse.de> <9ECCACFE-6006-4C19-8FC0-C387EB5F3BEE@gmail.com> <20170712082733.ouf7yx2bnvwwcfms@suse.de> <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> <20170713060706.o2cuko5y6irxwnww@suse.de> <20170715155518.ok2q62efc2vurqk5@suse.de> <20170719074131.75wexoal3fiyoxw5@suse.de> <20170719195820.drtfmweuhdc4eca6@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andy Lutomirski , "open list:MEMORY MANAGEMENT" Mel Gorman wrote: > On Wed, Jul 19, 2017 at 12:41:01PM -0700, Nadav Amit wrote: >> Mel Gorman wrote: >>=20 >>> On Tue, Jul 18, 2017 at 02:28:27PM -0700, Nadav Amit wrote: >>>>> If there are separate address spaces using a shared mapping then = the >>>>> same race does not occur. >>>>=20 >>>> I missed the fact you reverted the two operations since the = previous version >>>> of the patch. This specific scenario should be solved with this = patch. >>>>=20 >>>> But in general, I think there is a need for a simple locking = scheme. >>>=20 >>> Such as? >>=20 >> Something like: >>=20 >> bool is_potentially_stale_pte(pte_t pte, pgprot_t prot, int = lock_state); >>=20 >> which would get the current PTE, the protection bits that the user is >> interested in, and whether mmap_sem is taken read/write/none. >=20 > =46rom a PTE you cannot know the state of mmap_sem because you can = rmap > back to multiple mm's for shared mappings. It's also fairly heavy = handed. > Technically, you could lock on the basis of the VMA but that has other > consequences for scalability. The staleness is also a factor because > it's a case of "does the staleness matter". Sometimes it does, = sometimes > it doesn't. mmap_sem even if it could be used does not always tell us > the right information either because it can matter whether we are = racing > against a userspace reference or a kernel operation. >=20 > It's possible your idea could be made work, but right now I'm not = seeing a > solution that handles every corner case. I asked to hear what your = ideas > were because anything I thought of that could batch TLB flushing in = the > general case had flaws that did not improve over what is already = there. I don=E2=80=99t disagree with what you say - perhaps my scheme is too = simplistic. But the bottom line, if you cannot form simple rules for when TLB needs = to be flushed, what are the chances others would get it right? >> [snip] >>=20 >>>> As a result, concurrent operation such as KSM???s = write_protect_page() or >>>=20 >>> write_protect_page operates under the page lock and cannot race with = reclaim. >>=20 >> I still do not understand this claim. IIUC, reclaim can unmap the = page in >> some page table, decide not to reclaim the page and release the = page-lock >> before flush. >=20 > shrink_page_list is the caller of try_to_unmap in reclaim context. It > has this check >=20 > if (!trylock_page(page)) > goto keep; >=20 > For pages it cannot lock, they get put back on the LRU and recycled = instead > of reclaimed. Hence, if KSM or anything else holds the page lock, = reclaim > can't unmap it. Yes, of course, since KSM does not batch TLB flushes. I regarded the = other direction - first try_to_unmap() removes the PTE (but still does not = flush), unlocks the page, and then KSM acquires the page lock and calls write_protect_page(). It finds out the PTE is not present and does not = flush the TLB. >>>> page_mkclean_one() can consider the page write-protected while in = fact it is >>>> still accessible - since the TLB flush was deferred. >>>=20 >>> As long as it's flushed before any IO occurs that would lose a data = update, >>> it's not a data integrity issue. >>>=20 >>>> As a result, they may >>>> mishandle the PTE without flushing the page. In the case of >>>> page_mkclean_one(), I suspect it may even lead to memory = corruption. I admit >>>> that in x86 there are some mitigating factors that would make such = ???attack??? >>>> complicated, but it still seems wrong to me, no? >>>=20 >>> I worry that you're beginning to see races everywhere. I admit that = the >>> rules and protections here are varied and complex but it's worth = keeping >>> in mind that data integrity is the key concern (no false reads to = wrong >>> data, no lost writes) and the first race you identified found some = problems >>> here. However, with or without batching, there is always a delay = between >>> when a PTE is cleared and when the TLB entries are removed. >>=20 >> Sure, but usually the delay occurs while the page-table lock is taken = so >> there is no race. >>=20 >> Now, it is not fair to call me a paranoid, considering that these = races are >> real - I confirmed that at least two can happen in practice. >=20 > It's less an accusation of paranoia and more a caution that the fact = that > pte_clear_flush is not atomic means that it can be difficult to find = what > races matter and what ones don't. >=20 >> As for ???data integrity is the key concern??? - violating the memory = management >> API can cause data integrity issues for programs. >=20 > The madvise one should be fixed too. It could also be "fixed" by > removing all batching but the performance cost will be sufficiently = high > that there will be pressure to find an alternative. >=20 >> It may not cause the OS to >> crash, but it should not be acceptable either, and may potentially = raise >> security concerns. If you think that the current behavior is ok, let = the >> documentation and man pages clarify that mprotect may not protect, = madvise >> may not advise and so on. >=20 > The madvise one should be fixed, not because because it allows a case > whereby userspace thinks it has initialised a structure that is = actually > in a page that is freed after a TLB is flushed resulting in a lost > write. It wouldn't cause any issues with shared or file-backed = mappings > but it is a problem for anonymous. >=20 >> And although you would use it against me, I would say: Nobody knew = that TLB >> flushing could be so complicated. >=20 > There is no question that the area is complicated. My comment was actually an unfunny joke... Never mind. Thanks, Nadav p.s.: Thanks for your patience. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org