References: <92cbfe3b-f3d1-a8e1-7eb9-bab735e782f6@rasmusvillemoes.dk>
 <20211007101527.GA26288@duo.ucw.cz> <202110071111.DF87B4EE3@keescook>
 <202110081344.FE6A7A82@keescook>
 <26f9db1e-69e9-1a54-6d49-45c0c180067c@redhat.com>
 <3563a3e8-b971-b604-7388-766ecfce4634@redhat.com>
In-Reply-To: <3563a3e8-b971-b604-7388-766ecfce4634@redhat.com>
From: Suren Baghdasaryan
Date: Fri, 15 Oct 2021 11:33:22 -0700
Subject: Re: [PATCH v10 3/3] mm: add anonymous vma name refcounting
To: David Hildenbrand
Cc: Michal Hocko, Kees Cook, Pavel Machek, Rasmus Villemoes, John Hubbard,
 Andrew Morton, Colin Cross, Sumit Semwal, Dave Hansen, Matthew Wilcox,
 "Kirill A . Shutemov", Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
 Al Viro, Randy Dunlap, Kalesh Singh, Peter Xu, rppt@kernel.org,
 Peter Zijlstra, Catalin Marinas, vincenzo.frascino@arm.com,
 Chinwen Chang (張錦文), Axel Rasmussen, Andrea Arcangeli, Jann Horn,
 apopple@nvidia.com, Yu Zhao, Will Deacon, fenghua.yu@intel.com,
 thunder.leizhen@huawei.com, Hugh Dickins, feng.tang@intel.com,
 Jason Gunthorpe, Roman Gushchin, Thomas Gleixner, krisman@collabora.com,
 Chris Hyser, Peter Collingbourne, "Eric W.
 Biederman", Jens Axboe, legion@kernel.org, Rolf Eike Beer, Cyrill Gorcunov,
 Muchun Song, Viresh Kumar, Thomas Cedeno, sashal@kernel.org,
 cxfcosmos@gmail.com, LKML, linux-fsdevel@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-mm, kernel-team
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Oct 15, 2021 at 9:39 AM David Hildenbrand wrote:
>
> >>>
> >>> 1. Forking a process with anonymous vmas named using memfd is 5-15%
> >>> slower than with prctl (depends on the number of VMAs in the process
> >>> being forked). Profiling shows that i_mmap_lock_write() dominates
> >>> dup_mmap(). Exit path is also slower by roughly 9% with
> >>> free_pgtables() and fput() dominating exit_mmap(). Fork performance is
> >>> important for Android because almost all processes are forked from
> >>> zygote, therefore this limitation already makes this approach
> >>> prohibitive.
> >>
> >> Interesting, naturally I wonder if that can be optimized.
> >
> > Maybe but it looks like we simply do additional things for file-backed
> > memory, which seems natural. The call to i_mmap_lock_write() is from
> > here: https://elixir.bootlin.com/linux/latest/source/kernel/fork.c#L565
> >
> >>
> >>>
> >>> 2. mremap() usage to grow the mapping has an issue when used with memfds:
> >>>
> >>>     fd = memfd_create(name, MFD_ALLOW_SEALING);
> >>>     ftruncate(fd, size_bytes);
> >>>     ptr = mmap(NULL, size_bytes, prot, MAP_PRIVATE, fd, 0);
> >>>     close(fd);
> >>>     ptr = mremap(ptr, size_bytes, size_bytes * 2, MREMAP_MAYMOVE);
> >>>     touch_mem(ptr, size_bytes * 2);
> >>>
> >>> This would generate a SIGBUS in touch_mem(). I believe it's because
> >>> ftruncate() specified the size to be size_bytes and we are accessing
> >>> more than that after remapping. prctl() does not have this limitation
> >>> and we do have a usecase for growing a named VMA.
> >>
> >> Can't you simply size the memfd much larger?
> >> I mean, it doesn't really cost much, does it?
> >
> > If we knew beforehand the max size it can reach then that would
> > be possible. I would really hate to miscalculate here and cause a
> > simple memory access to generate signals. Tracking such corner cases
> > in the field is not an easy task and I would rather avoid the
> > possibility of it.
>
> The question would be if you cannot simply add some extremely large
> number, because the file size itself doesn't really matter for memfd IIRC.
>
> Having that said, without trying it out, I wouldn't know off the top of
> my head if mremap would work that way on an already closed fd that has
> a sufficient size :/ If you have the example still somewhere, I would be
> interested if that would work in general.

Yes, I tried a simple test like this and it works:

    fd = memfd_create(name, MFD_ALLOW_SEALING);
    ftruncate(fd, size_bytes * 2);
    ptr = mmap(NULL, size_bytes, prot, MAP_PRIVATE, fd, 0);
    close(fd);
    ptr = mremap(ptr, size_bytes, size_bytes * 2, MREMAP_MAYMOVE);
    touch_mem(ptr, size_bytes * 2);

I understand your suggestion but it's just another hoop we have to
jump through to make this work and feels unnatural from userspace POV.
Also virtual address space exhaustion might be an issue for 32-bit
userspace with this approach.

>
> [...]
>
> >>
> >>>
> >>> 4. There is a usecase in the Android userspace where vma naming
> >>> happens after memory was allocated. Bionic linker does in-memory
> >>> relocations and then names some relocated sections.
> >>
> >> Would renaming a memfd be an option or is that "too late" ?
> >
> > My understanding is that linker allocates space to load and relocate
> > the code, performs the relocations in that space and then names some
> > of the regions after that. Whether it can be redesigned to allocate
> > multiple named regions and perform the relocation between them I did
> > not really try since it would be a project by itself.
> >
> > TBH, at some point I just look at the amount of required changes (both
> > kernel and userspace) and new limitations that userspace has to adhere
> > to for fitting memfds to my usecase, and I feel that it's just not
> > worth it. In the end we end up using the same refcounted strings with
> > vma->vm_file->f_count as the refcount and name stored in
> > vma->vm_file->f_path.dentry but with more overhead.
>
> Yes, but it's glued to files which naturally have names :)

Yeah, I understand your motivations and that's why I'm exploring these
possibilities but it proves to be just too costly for a feature as
simple as naming a vma :)

>
> Again, I appreciate that you looked into alternatives! I can see the
> late renaming could be the biggest blocker if user space cannot be
> adjusted easily to be compatible with that using memfds.

Yeah, it would definitely be hard for Android to adopt this. If there
are no objections to the current approach I would like to respin
another version with the CONFIG option added sometime early next week.
If anyone has objections, please let me know.
Thanks,
Suren.

>
> --
> Thanks,
>
> David / dhildenb
>
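
For anyone who wants to reproduce the two memfd + mremap() cases discussed
above, here is a minimal sketch in C. It is not the test that was actually
run for this thread; the helper name grow_and_touch(), the memfd name
"example-vma-name" and the page-stride touch loop (standing in for
touch_mem()) are assumptions made for illustration. The access runs in a
forked child so that a SIGBUS can be reported instead of killing the caller.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch series): create a memfd of
 * file_bytes, map map_bytes of it MAP_PRIVATE, close the fd, grow the
 * mapping to grown_bytes with mremap() and write one byte per page.
 *
 * Returns 0 if every page could be touched, the signal number if the
 * child was killed (e.g. SIGBUS), or -1 on a setup failure.
 */
int grow_and_touch(size_t file_bytes, size_t map_bytes, size_t grown_bytes)
{
    assert(map_bytes > 0 && map_bytes <= grown_bytes);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: same call sequence as the snippets in the thread. */
        int fd = memfd_create("example-vma-name", MFD_ALLOW_SEALING);
        if (fd < 0 || ftruncate(fd, (off_t)file_bytes) != 0)
            _exit(1);

        void *ptr = mmap(NULL, map_bytes, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE, fd, 0);
        if (ptr == MAP_FAILED)
            _exit(1);
        close(fd);

        ptr = mremap(ptr, map_bytes, grown_bytes, MREMAP_MAYMOVE);
        if (ptr == MAP_FAILED)
            _exit(1);

        /* Stand-in for touch_mem(): write to the first byte of each page. */
        volatile char *bytes = ptr;
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        for (size_t off = 0; off < grown_bytes; off += page)
            bytes[off] = 1;
        _exit(0);
    }

    /* Parent: translate the child's fate into a return value. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return WEXITSTATUS(status) == 0 ? 0 : -1;
}
```

With sz a multiple of the page size, grow_and_touch(sz, sz, 2 * sz)
corresponds to the sequence in item 2 and, going by the behaviour described
above, should return SIGBUS, while grow_and_touch(2 * sz, sz, 2 * sz)
corresponds to the oversized-memfd test and should return 0.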