From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vlastimil Babka
To: Milan Broz, Michal Hocko
Cc: linux-mm@kvack.org, Linux Kernel Mailing List, Mikulas Patocka
Subject: Re: Very slow unlockall()
Date: Wed, 10 Feb 2021 16:18:50 +0100
Message-ID: <273db3a6-28b1-6605-1743-ef86e7eb2b72@suse.cz>
References: <70885d37-62b7-748b-29df-9e94f3291736@gmail.com> <20210108134140.GA9883@dhcp22.suse.cz> <9474cd07-676a-56ed-1942-5090e0b9a82f@suse.cz> <6eebb858-d517-b70d-9202-f4e84221ed89@suse.cz>

On 2/1/21 8:19 PM, Milan Broz wrote:
> On 01/02/2021 19:55, Vlastimil Babka wrote:
>> On 2/1/21 7:00 PM, Milan Broz wrote:
>>> On 01/02/2021 14:08, Vlastimil Babka wrote:
>>>> On 1/8/21 3:39 PM, Milan Broz wrote:
>>>>> On 08/01/2021 14:41, Michal Hocko wrote:
>>>>>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code
>>>>>>> and someone tried to use it with a hardened memory allocator library.
>>>>>>>
>>>>>>> Execution time was increased to extreme (minutes) and as we found, the problem
>>>>>>> is in munlockall().
>>>>>>>
>>>>>>> Here is a plain reproducer for the core without any external code - it takes
>>>>>>> unlocking on a Fedora rawhide kernel more than 30 seconds!
>>>>>>> I can reproduce it on 5.10 kernels and Linus' git.
>>>>>>>
>>>>>>> The reproducer below tries to mmap a large amount of memory with PROT_NONE (later never used).
>>>>>>> The real code of course does something more useful but the problem is the same.
>>>>>>>
>>>>>>> #include <stdio.h>
>>>>>>> #include <stdlib.h>
>>>>>>> #include <unistd.h>
>>>>>>> #include <sys/mman.h>
>>>>>>>
>>>>>>> int main (int argc, char *argv[])
>>>>>>> {
>>>>>>>         void *p = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>
>> So, this is a 2TB memory area, but PROT_NONE means it's never actually populated,
>> although mlockall(MCL_CURRENT) should do that. Once you put PROT_READ |
>> PROT_WRITE there, the mlockall() starts taking ages.
>>
>> So does that reflect your use case? munlockall() with large PROT_NONE areas? If
>> so, munlock_vma_pages_range() is indeed not optimized for that, but I would
>> expect such a scenario to be uncommon, so better clarify first.
>
> It is just a simple reproducer of the underlying problem, as suggested here
> https://gitlab.com/cryptsetup/cryptsetup/-/issues/617#note_478342301
>
> We use mlockall() in cryptsetup and with hardened malloc it slows down unlock significantly.
> (For the real case problem please read the whole issue report above.)

OK, finally read through the bug report, and learned two things:

1) the PROT_NONE is indeed an intentional part of the reproducer

2) Linux mailing lists still have a bad reputation and people avoid them.
That's sad :( Well, thanks for overcoming that :)

Daniel there says "I think the Linux kernel implementation of mlockall is
quite broken and tries to lock all the reserved PROT_NONE regions in advance
which doesn't make any sense."
From my testing this doesn't seem to be the case, as the mlockall() part is
very fast, so I don't think it faults in and mlocks PROT_NONE areas. It only
starts to be slow when changed to PROT_READ|PROT_WRITE. But the munlockall()
part is slow even with PROT_NONE as we don't skip the PROT_NONE areas there.
We probably can't just skip them, as they might actually contain mlocked pages
if those were faulted first with PROT_READ/PROT_WRITE and only then changed to
PROT_NONE.

And the munlock (munlock_vma_pages_range()) is slow, because it uses
follow_page_mask() in a loop incrementing addresses by PAGE_SIZE, so that's
always traversing all levels of page tables from scratch. Funnily enough,
speeding this up was my first linux-mm series years ago. But the speedup only
works if pte's are present, which is not the case for unpopulated PROT_NONE
areas. That use case was unexpected back then. We should probably convert this
code to a proper page table walk. If there are large areas with unpopulated
pmd entries (or even higher levels) we would traverse them very quickly.