From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.1 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,NICE_REPLY_A, SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9F9B4C43462 for ; Thu, 15 Apr 2021 10:27:02 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 2243D610E6 for ; Thu, 15 Apr 2021 10:27:02 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 2243D610E6 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id A76BC6B006C; Thu, 15 Apr 2021 06:27:01 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 9FFCB6B0070; Thu, 15 Apr 2021 06:27:01 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 82B766B0071; Thu, 15 Apr 2021 06:27:01 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0199.hostedemail.com [216.40.44.199]) by kanga.kvack.org (Postfix) with ESMTP id 5F0966B006C for ; Thu, 15 Apr 2021 06:27:01 -0400 (EDT) Received: from smtpin07.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 0D7151E18 for ; Thu, 15 Apr 2021 10:27:01 +0000 (UTC) X-FDA: 78034223442.07.2985608 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf18.hostedemail.com (Postfix) with ESMTP id 7BE962000266 for ; Thu, 15 Apr 2021 10:27:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1618482420; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=8cpKb0ASp+K7zcTMzICSE9fwIk8Icc72x4Rwh+Gm6ZU=; b=B34Uw0nyaEBVNiCQkVKnRYtdn0+1THHtgjLZ1BOVY544Qq+r7rvTjXPr55JfQlsbBK8fzb gt89YrV2KsSFD59Er9odEoxmkcNrZAs850a3i50ZCSC94u95cMzUPC+6aRA94hcEH3V39Y 66zNci0w5OR8g3bINn2nQubuFLy7gh0= Received: from mail-wr1-f69.google.com (mail-wr1-f69.google.com [209.85.221.69]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-402-bmOiR7RMMV2MfKhHB7TyLg-1; Thu, 15 Apr 2021 06:26:56 -0400 X-MC-Unique: bmOiR7RMMV2MfKhHB7TyLg-1 Received: by mail-wr1-f69.google.com with SMTP id v20-20020a5d59140000b02901028c7a1f7dso2532828wrd.18 for ; Thu, 15 Apr 2021 03:26:56 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:references:organization:subject :message-id:date:user-agent:mime-version:in-reply-to :content-language:content-transfer-encoding; bh=8cpKb0ASp+K7zcTMzICSE9fwIk8Icc72x4Rwh+Gm6ZU=; b=IZgPktPFRDqJfiiWHu06gKw+02mWNaMK/9d4r4tCRGVsElcOHx2kkhXgJOmVpdL68g pJ+zVfRegIxh536ENoz0Aylj9C+R9LPga1zTRHuK3q31O0IVfYD2WieChBsFawW26U1+ 5aHEE/0KMpHq/h94QCMkJ3Rlf/s9glV579rdnCy2+dPYBpM/+WfSvE25yWnYeTIVbzzf ghhXz0Vtmr3A7BYufiiAxtvT1skZX/UI2mz80F8j7Gc/V+EtIawL6zi2W6Onmd0JCbKR omgMxTN32ejyxRfgtaTt6ztbVchBtBnw3K3lb/vBIVbHyPc26EcP8yAPKwQT8L310pDG mgCA== X-Gm-Message-State: AOAM531kNmi0dSbAZwaksIyoMSm41w8wA7mPwzEIwq280yn1XPqH4fKr 4akc5TmDO1DDbwvEJxSW7BG3Espfiq9CFW4gGAkGMqfU7dmfUSo6EUezmgflhXYxnOO7z03uwvi TRYVXPGVkMA0= X-Received: by 2002:adf:f186:: with SMTP id h6mr2569779wro.89.1618482415120; Thu, 15 Apr 2021 03:26:55 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwv7Q012I0Vz8MHqIPLa4Xkx+hS8m+nH/TPxEyn3JS7V7NiE9yOuvehr7qVZbxLbS4BFGfVVQ== X-Received: by 2002:adf:f186:: with SMTP id h6mr2569712wro.89.1618482414741; Thu, 15 Apr 2021 03:26:54 -0700 (PDT) Received: from [192.168.3.132] (p5b0c6392.dip0.t-ipconnect.de. [91.12.99.146]) by smtp.gmail.com with ESMTPSA id z17sm2386902wro.1.2021.04.15.03.26.53 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Thu, 15 Apr 2021 03:26:54 -0700 (PDT) From: David Hildenbrand To: Jann Horn Cc: kernel list , Linux-MM , Andrew Morton , Arnd Bergmann , Michal Hocko , Oscar Salvador , Matthew Wilcox , Andrea Arcangeli , Minchan Kim , Jason Gunthorpe , Dave Hansen , Hugh Dickins , Rik van Riel , "Michael S . Tsirkin" , "Kirill A . Shutemov" , Vlastimil Babka , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Chris Zankel , Max Filippov , Mike Kravetz , Peter Xu , Rolf Eike Beer , linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, linux-xtensa@linux-xtensa.org, linux-arch , Linux API References: <20210317110644.25343-1-david@redhat.com> <20210317110644.25343-3-david@redhat.com> <2bab28c7-08c0-7ff0-c70e-9bf94da05ce1@redhat.com> <26227fc6-3e7b-4e69-f69d-4dc2a67ecfe8@redhat.com> <54165ffe-dbf7-377a-a710-d15be4701f20@redhat.com> <5f49b60c-957d-8cb4-de7a-7c855dc72942@redhat.com> Organization: Red Hat Subject: Re: [PATCH v1 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory Message-ID: <273f9c4b-2a1a-23e7-53b9-0da5441895b3@redhat.com> Date: Thu, 15 Apr 2021 12:26:52 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1 MIME-Version: 1.0 In-Reply-To: <5f49b60c-957d-8cb4-de7a-7c855dc72942@redhat.com> Authentication-Results: relay.mimecast.com; auth=pass smtp.auth=CUSA124A263 smtp.mailfrom=david@redhat.com X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US X-Rspamd-Queue-Id: 7BE962000266 X-Stat-Signature: 8tn8h9zjysxjbn5ea8yma85yuzxycoyg X-Rspamd-Server: rspam02 Received-SPF: none (redhat.com>: No applicable sender policy available) receiver=imf18; identity=mailfrom; envelope-from=""; helo=us-smtp-delivery-124.mimecast.com; client-ip=170.10.133.124 X-HE-DKIM-Result: pass/pass X-HE-Tag: 1618482421-921708 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 07.04.21 12:31, David Hildenbrand wrote: > On 30.03.21 18:31, David Hildenbrand wrote: >> On 30.03.21 18:30, David Hildenbrand wrote: >>> On 30.03.21 18:21, Jann Horn wrote: >>>> On Tue, Mar 30, 2021 at 5:01 PM David Hildenbrand = wrote: >>>>>>> +long faultin_vma_page_range(struct vm_area_struct *vma, unsigned= long start, >>>>>>> + unsigned long end, bool write, int *l= ocked) >>>>>>> +{ >>>>>>> + struct mm_struct *mm =3D vma->vm_mm; >>>>>>> + unsigned long nr_pages =3D (end - start) / PAGE_SIZE; >>>>>>> + int gup_flags; >>>>>>> + >>>>>>> + VM_BUG_ON(!PAGE_ALIGNED(start)); >>>>>>> + VM_BUG_ON(!PAGE_ALIGNED(end)); >>>>>>> + VM_BUG_ON_VMA(start < vma->vm_start, vma); >>>>>>> + VM_BUG_ON_VMA(end > vma->vm_end, vma); >>>>>>> + mmap_assert_locked(mm); >>>>>>> + >>>>>>> + /* >>>>>>> + * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT wh= en we hit >>>>>>> + * a poisoned page. >>>>>>> + * FOLL_POPULATE: Always populate memory with VM_LOCKONFA= ULT. >>>>>>> + * !FOLL_FORCE: Require proper access permissions. >>>>>>> + */ >>>>>>> + gup_flags =3D FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | F= OLL_HWPOISON; >>>>>>> + if (write) >>>>>>> + gup_flags |=3D FOLL_WRITE; >>>>>>> + >>>>>>> + /* >>>>>>> + * See check_vma_flags(): Will return -EFAULT on incompat= ible mappings >>>>>>> + * or with insufficient permissions. >>>>>>> + */ >>>>>>> + return __get_user_pages(mm, start, nr_pages, gup_flags, >>>>>>> + NULL, NULL, locked); >>>>>> >>>>>> You mentioned in the commit message that you don't want to actuall= y >>>>>> dirty all the file pages and force writeback; but doesn't >>>>>> POPULATE_WRITE still do exactly that? In follow_page_pte(), if >>>>>> FOLL_TOUCH and FOLL_WRITE are set, we mark the page as dirty: >>>>> >>>>> Well, I mention that POPULATE_READ explicitly doesn't do that. I >>>>> primarily set it because populate_vma_page_range() also sets it. >>>>> >>>>> Is it safe to *not* set it? IOW, fault something writable into a pa= ge >>>>> table (where the CPU could dirty it without additional page faults) >>>>> without marking it accessed? For me, this made logically sense. Thu= s I >>>>> also understood why populate_vma_page_range() set it. >>>> >>>> FOLL_TOUCH doesn't have anything to do with installing the PTE - it >>>> essentially means "the caller of get_user_pages wants to read/write >>>> the contents of the returned page, so please do the same things you >>>> would do if userspace was accessing the page". So in particular, if >>>> you look up a page via get_user_pages() with FOLL_WRITE|FOLL_TOUCH, >>>> that tells the MM subsystem "I will be writing into this page direct= ly >>>> from the kernel, bypassing the userspace page tables, so please mark >>>> it as dirty now so that it will be properly written back later". Par= t >>>> of that is that it marks the page as recently used, which has an >>>> effect on LRU pageout behavior, I think - as far as I understand, th= at >>>> is why populate_vma_page_range() uses FOLL_TOUCH. >>>> >>>> If you look at __get_user_pages(), you can see that it is split up >>>> into two major parts: faultin_page() for creating PTEs, and >>>> follow_page_mask() for grabbing pages from PTEs. faultin_page() >>>> ignores FOLL_TOUCH completely; only follow_page_mask() uses it. >>>> >>>> In a way I guess maybe you do want the "mark as recently accessed" >>>> part that FOLL_TOUCH would give you without FOLL_WRITE? But I think >>>> you very much don't want the dirtying that FOLL_TOUCH|FOLL_WRITE lea= ds >>>> to. Maybe the ideal approach would be to add a new FOLL flag to say = "I >>>> only want to mark as recently used, I don't want to dirty". Or maybe >>>> it's enough to just leave out the FOLL_TOUCH entirely, I don't know. >>> >>> Any thoughts why populate_vma_page_range() does it? >> >> Sorry, I missed the explanation above - thanks! >=20 > Looking into the details, adjusting the FOLL_TOUCH logic won't make too > much of a difference for MADV_POPULATE_WRITE I guess. AFAIKs, the > biggest impact of FOLL_TOUCH is actually with FOLL_FORCE - which we are > not using, but populate_vma_page_range() is. >=20 >=20 > If a page was not faulted in yet, > faultin_page(FOLL_WRITE)->handle_mm_fault(FAULT_FLAG_WRITE) will alread= y > mark the PTE/PMD/... dirty and accessed. One example is > handle_pte_fault(). We will mark the page accessed again via FOLL_TOUCH= , > which doesn't seem to be strictly required. >=20 >=20 > If the page was already faulted in, we have three cases: >=20 > 1. Page faulted in writable. The page should already be dirty (otherwis= e > we would be in trouble I guess). We will mark it accessed. >=20 > 2. Page faulted in readable. handle_mm_fault() will fault it in writabl= e > and set the page dirty. >=20 > 3. Page faulted in readable and we have FOLL_FORCE. We mark the page > dirty and accessed. >=20 >=20 > So doing a MADV_POPULATE_WRITE, whereby we prefault page tables > writable, doesn't seem to fly without marking the pages dirty. That's > one reason why I included MADV_POPULATE_READ. >=20 > We could >=20 > a) Drop FOLL_TOUCH. We are not marking the page accessed, which would > mean it gets evicted rather earlier than later. >=20 > b) Introduce FOLL_ACCESSED which won't do the dirtying. But then, the > pages are already dirty as explained above, so there isn't a real > observable change. >=20 > c) Keep it as is: Mark the page accessed and dirty. As it's already > dirty, that does not seem to be a real issue. >=20 > Am I missing something obvious? Thanks! >=20 I did some more digging. I think there are cases for shared mappings=20 where we can have pte_write() but not pte_dirty(). One example seems to be mm/memory.c:copy_present_pte() , used during fork= . IIUC, this means that the child process can write to these pages, but=20 won't mark the PTEs dirty -- as there won't be a write fault. I'd assume=20 we'd need pte_mkclean(pte_wrprotect(pte)), but I'm fairly new to that cod= e. (Similarly, we do an pte_mkold() without revoking any protection,=20 meaning we won't catch read accesses.) Maybe the logic here is that if the PTE was writable in some parent, it=20 was also dirty in some parent (at least in the one originally mapping it=20 writable). When evicting/writeback'ing file pages, we'll have go over=20 the rmap and zap all entries of any page tables either way; it's=20 sufficient if one PTE entry is dirty. I can spot that even=20 zap_pte_range() will sync the dirty flag back to the page. Am I right or is there some other magic going on? :) --=20 Thanks, David / dhildenb