From: Vishal Annapurve
Date: Thu, 3 Nov 2022 21:57:11 +0530
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
To: "Kirill A . Shutemov"
Cc: Sean Christopherson, Chao Peng, kvm@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
 linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini,
 Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
 Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86@kernel.org,
 "H . Peter Anvin", Hugh Dickins, Jeff Layton, "J . Bruce Fields",
 Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
 "Maciej S . Szmigiero", Vlastimil Babka, Yu Zhang, luto@kernel.org,
 jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com,
 david@redhat.com, aarcange@redhat.com, ddutile@redhat.com,
 dhildenb@redhat.com, Quentin Perret, Michael Roth, mhocko@suse.com,
 Muchun Song, wei.w.wang@intel.com
References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com>
 <20220915142913.2213336-2-chao.p.peng@linux.intel.com>
 <20221021134711.GA3607894@chaop.bj.intel.com>
 <20221024145928.66uehsokp7bpa2st@box.shutemov.name>
In-Reply-To: <20221024145928.66uehsokp7bpa2st@box.shutemov.name>
X-Mailing-List: linux-api@vger.kernel.org

On Mon, Oct 24, 2022 at 8:30 PM Kirill A . Shutemov wrote:
>
> On Fri, Oct 21, 2022 at 04:18:14PM +0000, Sean Christopherson wrote:
> > On Fri, Oct 21, 2022, Chao Peng wrote:
> > > >
> > > > In the context of userspace inaccessible memfd, what would be a
> > > > suggested way to enforce NUMA memory policy for physical memory
> > > > allocation? mbind[1] won't work here in absence of virtual address
> > > > range.
> > >
> > > How about set_mempolicy():
> > > https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
> >
> > Andy Lutomirski brought this up in an off-list discussion way back when the whole
> > private-fd thing was first being proposed.
> >
> > : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses. If
> > : we want to support them for TDX private memory, we either need TDX private
> > : memory to have an HVA or we need file-based equivalents. Arguably we should add
> > : fmove_pages and fbind syscalls anyway, since the current API is quite awkward
> > : even for tools like numactl.
>
> Yeah, we definitely have gaps in API wrt NUMA, but I don't think it be
> addressed in the initial submission.
>
> BTW, it is not regression comparing to old KVM slots, if the memory is
> backed by memfd or other file:
>
> MBIND(2)
>     The specified policy will be ignored for any MAP_SHARED mappings in the
>     specified memory range. Rather the pages will be allocated according to
>     the memory policy of the thread that caused the page to be allocated.
>     Again, this may not be the thread that called mbind().
>
> It is not clear how to define fbind(2) semantics, considering that multiple
> processes may compete for the same region of page cache.
>
> Should it be per-inode or per-fd? Or maybe per-range in inode/fd?
>

David's analysis on mempolicy with shmem seems to be right. set_policy
on a virtual address range does seem to change the shared policy for
the inode, irrespective of the mapping type.

Maybe having a way to set NUMA policy per-range in the inode would be
on par with what we can do today via mbind on virtual address ranges.

> fmove_pages(2) should be relatively straight forward, since it is
> best-effort and does not guarantee that the page will not be moved
> somewhere else just after return from the syscall.
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
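
For reference, here is a minimal userspace sketch of what "mbind on
virtual address ranges" looks like today for an ordinary memfd. It is
only an illustration, not code from this series: the helper names
(sketch_map_memfd, sketch_bind_to_node0) and the SKETCH_MPOL_BIND
constant are made up for the example, node 0 is assumed to exist, and
the raw syscall is used so that libnuma is not needed. The point is that
the policy call needs an address range from mmap(), which is exactly
what a userspace inaccessible memfd does not provide.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Same value as MPOL_BIND in the UAPI header <linux/mempolicy.h>. */
#define SKETCH_MPOL_BIND 2UL

/*
 * Hypothetical helper: create a memfd of `len` bytes and map it MAP_SHARED.
 * Returns the fd (>= 0) and stores the mapping in *addr, or returns -errno.
 */
static int sketch_map_memfd(size_t len, void **addr)
{
	int fd = memfd_create("numa-sketch", 0);

	if (fd < 0)
		return -errno;
	if (ftruncate(fd, (off_t)len) < 0) {
		int err = errno;

		close(fd);
		return -err;
	}
	*addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (*addr == MAP_FAILED) {
		int err = errno;

		close(fd);
		return -err;
	}
	return fd;
}

/*
 * Hypothetical helper: request that pages in [addr, addr + len) come from
 * NUMA node 0. Returns 0 on success or -errno (for example -ENOSYS on a
 * kernel built without NUMA support, or -EPERM in a restricted sandbox).
 */
static int sketch_bind_to_node0(void *addr, size_t len)
{
	unsigned long nodemask = 1UL;	/* bit 0 set: node 0 only */

	if (syscall(SYS_mbind, addr, len, SKETCH_MPOL_BIND, &nodemask,
		    8 * sizeof(nodemask), 0UL) != 0)
		return -errno;
	return 0;
}
```

A caller would map the memfd first, then call sketch_bind_to_node0() on
the returned address before touching the pages. Whether the policy ends
up attached to the inode or only to the mapping is the question being
discussed above, so the sketch makes no claim either way.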