From mboxrd@z Thu Jan 1 00:00:00 1970
References: <159643094279.4062302.17779410714418721328.stgit@dwillia2-desk3.amr.corp.intel.com>
From: Dan Williams
Date: Wed, 19 Aug 2020 18:53:57 -0700
Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges
To: David Hildenbrand
CC: Andrew Morton, Ard Biesheuvel, Mike Rapoport, Borislav Petkov,
 David Airlie, Will Deacon, Catalin Marinas, Ard Biesheuvel,
 Joao Martins, Tom Lendacky, "Rafael J. Wysocki", Jonathan Cameron,
 X86 ML, "H. Peter Anvin", Thomas Gleixner, Greg Kroah-Hartman,
 Pavel Tatashin, Peter Zijlstra, Ben Skeggs, Benjamin Herrenschmidt,
 Jason Gunthorpe, Jia He, Ingo Molnar, Dave Hansen, Paul Mackerras,
 Brice Goglin, Michael Ellerman, "Rafael J. Wysocki", Daniel Vetter,
 Andy Lutomirski, "Rafael J. Wysocki", Linux MM, linux-nvdimm,
 Linux Kernel Mailing List, Linux ACPI, Maling list - DRI developers
List-Id: "Linux-nvdimm developer list."

On Mon, Aug 3, 2020 at 12:48 AM David Hildenbrand wrote:
>
> [...]
>
> > Well, no v5.8-rc8 to line this up for v5.9, so next best is early
> > integration into -mm before other collisions develop.
> >
> > Chatted with Justin offline and it currently appears that the missing
> > numa information is the fault of the platform firmware to populate all
> > the necessary NUMA data in the NFIT.
>
> I'm planning on looking at some bits of this series this week, but some
> questions upfront ...
>
> >
> > ---
> > Cover:
> >
> > The device-dax facility allows an address range to be directly mapped
> > through a chardev, or optionally hotplugged to the core kernel page
> > allocator as System-RAM. It is the mechanism for converting persistent
> > memory (pmem) to be used as another volatile memory pool i.e. the
> > current Memory Tiering hot topic on linux-mm.
> >
> > In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
> > it, but that labeling mechanism is not available / applicable to
> > soft-reserved ("EFI specific purpose") memory [3]. This series provides
> > a sysfs-mechanism for the daxctl utility to enable provisioning of
> > volatile-soft-reserved memory ranges.
> >
> > The motivations for this facility are:
> >
> > 1/ Allow performance differentiated memory ranges to be split between
> > kernel-managed and directly-accessed use cases.
> >
> > 2/ Allow physical memory to be provisioned along performance relevant
> > address boundaries. For example, divide a memory-side cache [4] along
> > cache-color boundaries.
> >
> > 3/ Parcel out soft-reserved memory to VMs using device-dax as a security
> > / permissions boundary [5]. Specifically I have seen people (ab)using
> > memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
> > device-dax interface on custom address ranges. A follow-on for the VM
> > use case is to teach device-dax to dynamically allocate 'struct page' at
> > runtime to reduce the duplication of 'struct page' space in both the
> > guest and the host kernel for the same physical pages.
> >
> I think I am missing some important pieces. Bear with me.

No worries, also bear with me, I'm going to be offline intermittently
until at least mid-September. Hopefully Joao and/or Vishal can jump in
on this discussion.

>
> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
> automatically used in the buddy during boot, but remains untouched
> (similar to pmem). But as it involves ACPI as well, it could also be
> used on arm64 (-e820), correct?

Correct, arm64 also gets the EFI support for enumerating memory this way.
However, I would clarify that whether soft-reserved is given to the
buddy allocator by default or not is the kernel's policy choice;
"buddy-by-default" is ok and is what will happen anyway with older
kernels on platforms that enumerate a memory range this way.

> 2. Soft-reserved memory is volatile RAM with differing performance
> characteristics ("performance differentiated memory"). What would be
> examples of such memory?

Likely the most prominent one that drove the creation of the "EFI
Specific Purpose" attribute bit is high-bandwidth memory. One concrete
example of that was a platform called Knights Landing [1] that ended up
shipping firmware that lied to the OS about the latency characteristics
of the memory to try to reverse engineer OS behavior to not allocate
from that memory range by default. With the EFI attribute, firmware
performance tables can tell the truth about the performance
characteristics of the memory range *and* indicate that the OS should
not use it for general purpose allocations by default.

[1]: https://software.intel.com/content/www/us/en/develop/blogs/an-intro-to-mcdram-high-bandwidth-memory-on-knights-landing.html

> Like, memory that is faster than RAM (scratch
> pad), or slower (pmem)? Or both? :)

Both, but note that PMEM is already hard-reserved by default.
Soft-reserved is about a memory range that, for example, an
administrator may want to reserve 100% for a weather simulation where
if even a small amount of memory was stolen for the page cache the
application may not meet its performance targets. It could also be a
memory range that is so slow that only applications with higher latency
tolerances would be prepared to consume it. In other words the
soft-reserved memory can be used to indicate memory that is either too
precious, or too slow for general purpose OS allocations.

> Is it a valid use case to use pmem
> in a hypervisor to back this memory?

Depends on the pmem.
That performance capability is indicated by the ACPI HMAT, not the EFI
soft-reserved designation.

> 3. There seem to be use cases where "soft-reserved" memory is used via
> DAX. What is an example use case? I assume it's *not* to treat it like
> PMEM but instead e.g., use it as a fast buffer inside applications or
> similar.

Right, in that weather-simulation example that application could just
mmap /dev/daxX.Y and never worry about contending for the "fast memory"
resource on the platform. Alternatively, if that resource needs to be
shared and/or over-committed, then kernel memory-management services are
needed and that dax-device can be assigned to kmem.

> 4. There seem to be use cases where some part of "soft-reserved" memory
> is used via DAX, some other is given to the buddy. What is an example
> use case? Is this really necessary or only some theoretical use case?

It's as necessary as pmem namespace partitioning, or the inclusion of
dax-kmem upstream in the first place. In that kmem case the motivation
was that some users want a portion of pmem provisioned for storage and
some for volatile usage. The motivation is similar here: platform
firmware can only identify memory attributes on coarse boundaries;
finer grained provisioning decisions are up to the administrator /
platform-owner and the kernel is just a facilitator of that policy.

>
> 5. The "provisioned along performance relevant address boundaries." part
> is unclear to me. Can you give an example of how this would look like
> from user space? Like, split that memory in blocks of size X with
> alignment Y and give them to separate applications?

One example of platform address boundaries is the memory address
ranges that alias in a direct-mapped memory-side-cache. In the
direct-map-cache aliasing may repeat every N GBs where N is the ratio
of far-to-near memory. ("Near memory" == cache, "Far memory" == backing
memory). Also refer back to the background in the page allocator
shuffling patches [2].
With this partitioning mechanism you could, for one example use case,
assign different VMs to exclusive colors in the memory side cache.

[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e900a918b098

> 6. If you add such memory to the buddy, is there any way the system can
> differentiate it from other memory? E.g., via fake/other NUMA nodes?

NUMA node numbers are how performance differentiated memory ranges are
enumerated. The expectation is that all distinct performance memory
targets have unique ACPI proximity domains and Linux NUMA node numbers
as a result.

> Also, can you give examples of how kmem-added memory is represented in
> /proc/iomem for a) pmem and b) soft-reserved memory after this series
> (skimming over the patches, I think there is a change for pmem, right?)?

I don't expect a change. The only difference is the parent resource
will be marked "Soft Reserved" instead of "Persistent Memory".

> I am really wondering if it's the right approach to squeeze this into
> our pmem/nvdimm infrastructure just because it's easy to do. E.g., man
> "ndctl" - "ndctl - Manage "libnvdimm" subsystem devices (Non-volatile
> Memory)" speaks explicitly about non-volatile memory.

In fact it's not squeezed into PMEM infrastructure. dax-kmem and
device-dax are independent of PMEM. PMEM is one source of potential
device-dax instances, soft-reserved memory is another orthogonal
source. This is why device-dax needs its own userspace policy directed
partitioning mechanism because there is no PMEM to store the
configuration for partitioned high-bandwidth memory. The userspace
tooling for this mechanism is targeted for a tool called daxctl that
has no PMEM dependencies. Look to Joao's use case that is using this
infrastructure independent of PMEM with manual soft-reservations
specified on the kernel command-line.
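
To make the cache-color answer to question 5 a bit more concrete, here
is a rough sketch of the arithmetic, not anything taken from the series.
It assumes the simplest possible direct-mapped layout, where the cache
slot is the physical address modulo the near-memory size, so aliasing
repeats once per near-memory-sized window. Real platforms may interleave
or hash addresses differently. The function names, the 16 GiB / 64 GiB
sizes and the 4-way split are all hypothetical.

```python
"""Sketch: which far-memory ranges share one "color" of a direct-mapped
memory-side cache. All names and sizes are illustrative assumptions."""

GIB = 1 << 30


def _check(near_bytes, num_colors):
    if near_bytes <= 0 or num_colors <= 0 or near_bytes % num_colors:
        raise ValueError("near_bytes must be a positive multiple of num_colors")


def cache_color(phys_addr, near_bytes, num_colors):
    """Return the cache color (0..num_colors-1) that phys_addr falls in.

    Assumes the cache slot is simply phys_addr modulo the near-memory size.
    """
    _check(near_bytes, num_colors)
    color_bytes = near_bytes // num_colors
    return (phys_addr % near_bytes) // color_bytes


def color_ranges(base, far_bytes, near_bytes, num_colors, color):
    """List the (start, end) ranges, end exclusive, that map to one color."""
    _check(near_bytes, num_colors)
    if base % near_bytes or far_bytes % near_bytes:
        raise ValueError("sketch assumes base and size aligned to near_bytes")
    if not 0 <= color < num_colors:
        raise ValueError("color out of range")
    color_bytes = near_bytes // num_colors
    # One slice per alias window: every window of near_bytes maps onto the
    # whole cache, so the same offset inside each window shares cache slots.
    return [
        (window + color * color_bytes, window + (color + 1) * color_bytes)
        for window in range(base, base + far_bytes, near_bytes)
    ]


if __name__ == "__main__":
    # Hypothetical platform: 16 GiB of near memory caching 64 GiB of far
    # memory that starts at 16 GiB, split into 4 colors (e.g. one per VM).
    for start, end in color_ranges(16 * GIB, 64 * GIB, 16 * GIB, 4, color=1):
        print(f"{start:#x}-{end - 1:#x}")
```

Under these assumptions one color is not a single contiguous block but
one slice per alias window, which is why provisioning along such
boundaries needs finer-grained control than the coarse range that
platform firmware reports.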
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, MENTIONS_GIT_HOSTING,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id F1CAFC433E3 for ; Thu, 20 Aug 2020 01:54:12 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 95C12207DA for ; Thu, 20 Aug 2020 01:54:12 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=intel-com.20150623.gappssmtp.com header.i=@intel-com.20150623.gappssmtp.com header.b="lZ/K4yVK" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 95C12207DA Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 0457E6B0036; Wed, 19 Aug 2020 21:54:12 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id F35E16B0037; Wed, 19 Aug 2020 21:54:11 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E001F6B0055; Wed, 19 Aug 2020 21:54:11 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0005.hostedemail.com [216.40.44.5]) by kanga.kvack.org (Postfix) with ESMTP id C2CE66B0036 for ; Wed, 19 Aug 2020 21:54:11 -0400 (EDT) Received: from smtpin25.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 81A201DFB for ; Thu, 20 Aug 2020 01:54:11 +0000 (UTC) X-FDA: 77169276702.25.train56_570965b2702c Received: from 
filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin25.hostedemail.com (Postfix) with ESMTP id 4BDA21804E3A0 for ; Thu, 20 Aug 2020 01:54:11 +0000 (UTC) X-HE-Tag: train56_570965b2702c X-Filterd-Recvd-Size: 12941 Received: from mail-ej1-f68.google.com (mail-ej1-f68.google.com [209.85.218.68]) by imf28.hostedemail.com (Postfix) with ESMTP for ; Thu, 20 Aug 2020 01:54:10 +0000 (UTC) Received: by mail-ej1-f68.google.com with SMTP id f24so731706ejx.6 for ; Wed, 19 Aug 2020 18:54:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=intel-com.20150623.gappssmtp.com; s=20150623; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=M2856byAPxR8wC0hiTcV4Eu6Vwp5zHAzX+5RXykLdZ8=; b=lZ/K4yVK00w07TAWoz4NrJ7bh2SQj0QlEXGy+16Uvv+KVxUgjz3B3Iv7EDFWqB1Pdr DuqS0vl9GBWbYe7Tv0iQX9ujEvHAFPuKsnBXPYGFMmNAlAMmAGkWwywfwrjoKzj7KHaq 9x9L9GAliRR0hzTYRqnJnP0AEzhhN86hMJ/tfa5N1nZQYhNMWXesPPOB/n1sQBmB9iFO 1gldSVsbODRKjEwdNvNll8hWpTbiqQ6XVTWltCogzN6iGXyVjZR5g3DVI13dx5mRUp7V PQNTIep7dDdU8pcLAQzgJrYbrZcCxhZ6TJ0/WHGSPo2V8rXXnajEPWLIdfd2u3CPbZ4m C4iQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=M2856byAPxR8wC0hiTcV4Eu6Vwp5zHAzX+5RXykLdZ8=; b=ncbToiZ3arOf6k6H4qvlJeMIns+W8M0Id6+ul8lDyzhfGrSbC/rF8DSQ0LR8L/E2jW m6qWpEGDuDsluaa3SAZoxyNq9ydKpjh0jR2tSO7HerwNh+INqXbQ08O75sAAP2dbpCJ9 wRww7Zovv/i0IJsw3OHRJtcOxCUsGokRlFkmHl5IngDYKXWx5uyk0QxiQBejCKW5a3EE gLDkaH3juoqg5x72qZLj8BwJl2hGSP9OUCLaSjmd7O+9v2kwZ+ZLEidIZQkYQxMYJ6hA ffeq0rsEBE69X5xb0SFlO+hvenBe6fDQuNdVNZSgOKLFdfdbsugqchTLS+qGm1Cl4xEt Uw8g== X-Gm-Message-State: AOAM531Iwouhh0j8d/lKckS7cvts1i6XpLFijNv05oMSbmY4tfp1CiA6 m28tzY/0sEgyv8pMUmVrWg78V9aGJ7zfuEPbxWwdaw== X-Google-Smtp-Source: ABdhPJzvu8YTM6DfS52cWqhkmKxDgLU8iC/mKftCXZJ19EZWyjZICrR61eKB8x5hnqIA1r8PziL6kB6a/+obAqDoPt8= X-Received: by 2002:a17:906:413:: with SMTP id d19mr1123427eja.523.1597888448722; Wed, 
19 Aug 2020 18:54:08 -0700 (PDT) MIME-Version: 1.0 References: <159643094279.4062302.17779410714418721328.stgit@dwillia2-desk3.amr.corp.intel.com> In-Reply-To: From: Dan Williams Date: Wed, 19 Aug 2020 18:53:57 -0700 Message-ID: Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges To: David Hildenbrand Cc: Andrew Morton , Ira Weiny , Ard Biesheuvel , Mike Rapoport , Borislav Petkov , Vishal Verma , David Airlie , Will Deacon , Catalin Marinas , Ard Biesheuvel , Joao Martins , Tom Lendacky , Dave Jiang , "Rafael J. Wysocki" , Jonathan Cameron , Wei Yang , X86 ML , "H. Peter Anvin" , Thomas Gleixner , Greg Kroah-Hartman , Pavel Tatashin , Peter Zijlstra , Ben Skeggs , Benjamin Herrenschmidt , Jason Gunthorpe , Jia He , Ingo Molnar , Dave Hansen , Paul Mackerras , Brice Goglin , Jeff Moyer , Michael Ellerman , "Rafael J. Wysocki" , Daniel Vetter , Andy Lutomirski , "Rafael J. Wysocki" , Linux MM , linux-nvdimm , Linux Kernel Mailing List , Linux ACPI , Maling list - DRI developers Content-Type: text/plain; charset="UTF-8" X-Rspamd-Queue-Id: 4BDA21804E3A0 X-Spamd-Result: default: False [0.00 / 100.00] X-Rspamd-Server: rspam02 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Mon, Aug 3, 2020 at 12:48 AM David Hildenbrand wrote: > > [...] > > > Well, no v5.8-rc8 to line this up for v5.9, so next best is early > > integration into -mm before other collisions develop. > > > > Chatted with Justin offline and it currently appears that the missing > > numa information is the fault of the platform firmware to populate all > > the necessary NUMA data in the NFIT. > > I'm planning on looking at some bits of this series this week, but some > questions upfront ... 
> >
> > ---
> > Cover:
> >
> > The device-dax facility allows an address range to be directly mapped
> > through a chardev, or optionally hotplugged to the core kernel page
> > allocator as System-RAM. It is the mechanism for converting persistent
> > memory (pmem) to be used as another volatile memory pool i.e. the
> > current Memory Tiering hot topic on linux-mm.
> >
> > In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
> > it, but that labeling mechanism is not available / applicable to
> > soft-reserved ("EFI specific purpose") memory [3]. This series provides
> > a sysfs-mechanism for the daxctl utility to enable provisioning of
> > volatile-soft-reserved memory ranges.
> >
> > The motivations for this facility are:
> >
> > 1/ Allow performance differentiated memory ranges to be split between
> > kernel-managed and directly-accessed use cases.
> >
> > 2/ Allow physical memory to be provisioned along performance relevant
> > address boundaries. For example, divide a memory-side cache [4] along
> > cache-color boundaries.
> >
> > 3/ Parcel out soft-reserved memory to VMs using device-dax as a security
> > / permissions boundary [5]. Specifically I have seen people (ab)using
> > memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
> > device-dax interface on custom address ranges. A follow-on for the VM
> > use case is to teach device-dax to dynamically allocate 'struct page' at
> > runtime to reduce the duplication of 'struct page' space in both the
> > guest and the host kernel for the same physical pages.
> >
> I think I am missing some important pieces. Bear with me.

No worries, also bear with me, I'm going to be offline intermittently
until at least mid-September. Hopefully Joao and/or Vishal can jump in
on this discussion.

>
> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
> automatically used in the buddy during boot, but remains untouched
> (similar to pmem).
> But as it involves ACPI as well, it could also be
> used on arm64 (-e820), correct?

Correct, arm64 also gets the EFI support for enumerating memory this
way. However, I would clarify that whether soft-reserved memory is
given to the buddy allocator by default or not is the kernel's policy
choice; "buddy-by-default" is ok and is what will happen anyway with
older kernels on platforms that enumerate a memory range this way.

> 2. Soft-reserved memory is volatile RAM with differing performance
> characteristics ("performance differentiated memory"). What would be
> examples of such memory?

Likely the most prominent one that drove the creation of the "EFI
Specific Purpose" attribute bit is high-bandwidth memory. One concrete
example of that was a platform called Knights Landing [1] that ended up
shipping firmware that lied to the OS about the latency characteristics
of the memory, in an attempt to steer the OS away from allocating from
that memory range by default. With the EFI attribute, firmware
performance tables can tell the truth about the performance
characteristics of the memory range *and* indicate that the OS should
not use it for general purpose allocations by default.

[1]: https://software.intel.com/content/www/us/en/develop/blogs/an-intro-to-mcdram-high-bandwidth-memory-on-knights-landing.html

> Like, memory that is faster than RAM (scratch
> pad), or slower (pmem)? Or both? :)

Both, but note that PMEM is already hard-reserved by default.
Soft-reserved is about a memory range that, for example, an
administrator may want to reserve 100% for a weather simulation where,
if even a small amount of memory was stolen for the page cache, the
application may not meet its performance targets. It could also be a
memory range that is so slow that only applications with higher latency
tolerances would be prepared to consume it. In other words,
soft-reserved memory can be used to indicate memory that is either too
precious or too slow for general purpose OS allocations.
> Is it a valid use case to use pmem
> in a hypervisor to back this memory?

Depends on the pmem. That performance capability is indicated by the
ACPI HMAT, not the EFI soft-reserved designation.

> 3. There seem to be use cases where "soft-reserved" memory is used via
> DAX. What is an example use case? I assume it's *not* to treat it like
> PMEM but instead e.g., use it as a fast buffer inside applications or
> similar.

Right, in that weather-simulation example that application could just
mmap /dev/daxX.Y and never worry about contending for the "fast memory"
resource on the platform. Alternatively, if that resource needs to be
shared and/or over-committed, then kernel memory-management services
are needed and that dax-device can be assigned to kmem.

> 4. There seem to be use cases where some part of "soft-reserved" memory
> is used via DAX, some other is given to the buddy. What is an example
> use case? Is this really necessary or only some theoretical use case?

It's as necessary as pmem namespace partitioning, or the inclusion of
dax-kmem upstream in the first place. In that kmem case the motivation
was that some users want a portion of pmem provisioned for storage and
some for volatile usage. The motivation is similar here: platform
firmware can only identify memory attributes on coarse boundaries;
finer-grained provisioning decisions are up to the administrator /
platform-owner, and the kernel is just a facilitator of that policy.

>
> 5. The "provisioned along performance relevant address boundaries." part
> is unclear to me. Can you give an example of how this would look like
> from user space? Like, split that memory in blocks of size X with
> alignment Y and give them to separate applications?

One example of platform address boundaries is the memory address ranges
that alias in a direct-mapped memory-side-cache. In the direct-mapped
cache, aliasing may repeat every N GBs where N is the ratio of
far-to-near memory.
("Near memory" == cache, "Far memory" == backing memory). Also refer
back to the background in the page allocator shuffling patches [2].
With this partitioning mechanism you could, for one example use case,
assign different VMs to exclusive colors in the memory side cache.

[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e900a918b098

> 6. If you add such memory to the buddy, is there any way the system can
> differentiate it from other memory? E.g., via fake/other NUMA nodes?

NUMA node numbers are how performance differentiated memory ranges are
enumerated. The expectation is that all distinct performance memory
targets have unique ACPI proximity domains and Linux NUMA node numbers
as a result.

> Also, can you give examples of how kmem-added memory is represented in
> /proc/iomem for a) pmem and b) soft-reserved memory after this series
> (skimming over the patches, I think there is a change for pmem, right?)?

I don't expect a change. The only difference is the parent resource
will be marked "Soft Reserved" instead of "Persistent Memory".

> I am really wondering if it's the right approach to squeeze this into
> our pmem/nvdimm infrastructure just because it's easy to do. E.g., man
> "ndctl" - "ndctl - Manage "libnvdimm" subsystem devices (Non-volatile
> Memory)" speaks explicitly about non-volatile memory.

In fact it's not squeezed into PMEM infrastructure. dax-kmem and
device-dax are independent of PMEM. PMEM is one source of potential
device-dax instances, soft-reserved memory is another orthogonal
source. This is why device-dax needs its own userspace policy directed
partitioning mechanism, because there is no PMEM to store the
configuration for partitioned high-bandwidth memory. The userspace
tooling for this mechanism is targeted for a tool called daxctl that
has no PMEM dependencies.
Look to Joao's use case that is using this infrastructure independent of PMEM with manual soft-reservations specified on the kernel command-line.
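
As a rough illustration of the "just mmap /dev/daxX.Y" case from the answer to question 3, here is a minimal Python sketch. The 2 MiB alignment, the helper names (round_down, map_dax) and the example device path are assumptions made for illustration and are not taken from the series. The sketch relies on device-dax expecting a shared mapping whose length is a multiple of the device alignment.

```python
import mmap
import os

# Assumed device alignment (2 MiB). The real value is a property of the
# device; treat this constant as a placeholder.
ALIGN = 2 * 1024 * 1024


def round_down(length, align=ALIGN):
    """Round a byte count down to a multiple of the alignment."""
    return length - (length % align)


def map_dax(path, length, align=ALIGN):
    """Map `length` bytes of a character device, shared and read/write.

    Hypothetical helper: `path` would be something like /dev/dax0.0.
    """
    if length <= 0 or length % align:
        raise ValueError("length must be a positive multiple of the alignment")
    fd = os.open(path, os.O_RDWR)
    try:
        # MAP_SHARED: stores go straight to the backing memory range.
        return mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        # The mapping keeps its own reference, so the descriptor can go.
        os.close(fd)
```

An application would then call something like map_dax("/dev/dax0.0", round_down(size)) once at startup and use the returned buffer as its private pool, with no page-cache or buddy-allocator contention for that range.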
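
To make the cache-color idea from the answer to question 5 more concrete, here is a small sketch under a deliberately simplified model. It assumes a direct-mapped memory-side cache in which the cache slot is the physical address modulo the near-memory capacity, and it splits that cache into equally sized "colors". The model, the sizes, and the function names are assumptions for illustration, not something the series defines.

```python
GiB = 1 << 30


def cache_color(addr, near_size, ncolors):
    """Color of a far-memory address in the assumed direct-mapped model."""
    if near_size % ncolors:
        raise ValueError("near_size must divide evenly into colors")
    slice_size = near_size // ncolors
    # Addresses that differ by a multiple of near_size land in the same
    # cache slot, hence the modulo.
    return (addr % near_size) // slice_size


def color_ranges(color, near_size, far_size, ncolors):
    """List (start, length) far-memory ranges that map to one color."""
    if near_size % ncolors or far_size % near_size:
        raise ValueError("sizes must divide evenly")
    if not 0 <= color < ncolors:
        raise ValueError("color out of range")
    slice_size = near_size // ncolors
    return [(base + color * slice_size, slice_size)
            for base in range(0, far_size, near_size)]
```

With 4 GiB of near memory, 16 GiB of far memory and 4 colors, color_ranges(1, 4 * GiB, 16 * GiB, 4) yields four 1 GiB ranges starting at 1, 5, 9 and 13 GiB. Giving each VM the ranges of a single color would, in this model, keep the VMs from evicting each other's cache lines.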