To: Mike Rapoport
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton, Oscar Salvador, Michal Hocko, Mike Kravetz, Dave Hansen, Matthew Wilcox, Anshuman Khandual, Muchun Song, Pavel Tatashin, Jonathan Corbet, Stephen Rothwell, linux-doc@vger.kernel.org
References: <20210525102604.8770-1-david@redhat.com>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: [PATCH v1] memory-hotplug.rst: complete admin-guide overhaul
Message-ID: <385d2bd0-8857-9d40-c8f9-c302f0b56e12@redhat.com>
Date: Tue, 8 Jun 2021 14:05:50 +0200

Hi Mike,

thanks for the review!

>>
>>  :Created: Jul 28 2007
>>  :Updated: Add some details about locking internals: Aug 20 2018
>> +:Updated: Complete overhaul: May 18 2021
>
> I'd drop all three, we have git log...

Agreed.

>
>>
>> -This document is about memory hotplug including how-to-use and current status.
>> -Because Memory Hotplug is still under development, contents of this text will
>> -be changed often.
>> +This document describes generic Linux support for memory hot(un)plug with
>> +a focus on System RAM, including ZONE_MOVABLE support.
>>
>>  .. contents:: :local:
>>
>> -.. note::
>> -
>> -    (1) x86_64's has special implementation for memory hotplug.
>> -        This text does not describe it.
>> -    (2) This text assumes that sysfs is mounted at ``/sys``.
>> -
>> -
>>  Introduction
>>  ============
>>
>> -Purpose of memory hotplug
>> --------------------------
>> +Memory hot(un)plug allows for increasing and decreasing the physical memory
>
> Maybe: the size of physical memory
>

Agreed.

>> +available to a machine at runtime. In the simplest case, it consists of
>> +physically plugging or unplugging a DIMM at runtime, coordinated with the
>> +operating system.
>>
>> -Memory Hotplug allows users to increase/decrease the amount of memory.
>> -Generally, there are two purposes.
>> +Memory hot(un)plug is used for various purposes:
>>
>> -(A) For changing the amount of memory.
>> -    This is to allow a feature like capacity on demand.
>> -(B) For installing/removing DIMMs or NUMA-nodes physically.
>> -    This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc.
>> +(A) The physical memory available to a machine can be adjusted at runtime,
>> +    up- or downgrading the memory capacity. This dynamic memory
>> +    resizing, sometimes referred to as "capacity on demand", is frequently
>> +    used with virtual machines and logical partitions.
>
> I like more the bulleted lists, so you just put * or - in the beginning of
> the line and then you don't need to know neither the alphabet nor how to
> count :)
>
> More seriously, if the letters (or numbers) have no particular meaning it's
> easier to maintain a list with neutral bullets.

Makes sense! It doesn't really help in this case.

>
>> -(A) is required by highly virtualized environments and (B) is required by
>> -hardware which supports memory power management.
>> +(B) Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime.
>> +    One example is replacing failing memory modules.
>>
>> -Linux memory hotplug is designed for both purpose.
>> +(C) Reducing memory consumption either by physically unplugging
>> +    memory modules or by logically unplugging (parts of) memory modules
>> +    from Linux.
>
> It feels like some part of explanation is missing. My understanding of the
> above paragraph is "we remove a DIMM and thus the memory consumption
> drops".
> My guess is that you refer here to VM environments, in this case some more
> details would help.

It was actually supposed to be "Reducing energy consumption" -- which
will make more sense :) Thanks for catching that!

>
>> -Phases of memory hotplug
>> +Further, the basic memory hot(un)plug infrastructure in Linux is nowadays
>> +also used to expose PMEM, other performance-differentiated
>
>                       ^ persistent memory (PMEM)
>
>> +memory and reserved memory regions as ordinary system RAM to Linux.
>> +
>> +Phases of Memory Hotplug
>>  ------------------------
>>
>> -There are 2 phases in Memory Hotplug:
>> +Memory hotplug consists of two phases:
>> +
>> +(1) Adding the memory to Linux
>> +(2) Onlining memory blocks
>>
>> -  1) Physical Memory Hotplug phase
>> -  2) Logical Memory Hotplug phase.
>> +In the first phase, metadata (such as the memmap) is allocated, page tables
>> +for the direct mapping are allocated and initialized, and memory blocks
>
> User/administrator should not care about memmap or direct map and these
> details are better suited for Documentation/vm but since we don't have it
> how about:
>
> ... metadata, such as the memory map and page tables for the direct map,
> are allocated and initialized, ...

Admins will have to know/care about the "memmap" terminology, because we
now have features that use that name (for example, "memmap_on_memory")

So I'll tweak it to

".. metadata, such as the memory map ("memmap") and page tables for the
direct map, are allocated and initialized, ..."

>
>> +are created; the latter also creates sysfs files for managing
>
> The reader doesn't know what are memory blocks in this context yet. I'd
> suggest to move "Unit of Memory Hot(Un)Plug" before the phases.

Makes sense.

[...]

>> +Unit of Memory Hot(Un)Plug
>
> Units?

Or rather "Memory Hot(Un)Plug Granularity"

[...]

>> -Kernel Configuration
>> -====================
>> +There are various ways how Linux is notified about memory hotplug events
>> +such that it can start adding hotplugged memory. This description is
>> +mostly limited to mechanisms present on physical machines; mechanisms specific
>> +to virtual machines or logical partitions are not described.
>
> ... This description is limited to systems that support ACPI; mechanisms
> specific to other firmware interfaces or virtual machines are not
> described.

Ack

[...]

>> -Under each memory block, you can see 5 files:
>> +The Linux kernel can be configured to automatically online added memory
>> +blocks and drivers automatically trigger offlining of memory blocks
>> +when trying hotunplug of memory. Memory blocks can only be removed once offlining
>
> ... and drivers may trigger offlining of memory blocks when they attempt of
> hotunplug the memory.
>

I'll rephrase to "... when attempting hotunplug of memory".

[...]

>>
>> -.. note::
>> +One can explicitly request to associate it with ZONE_MOVABLE by::
>
> s/it/added memory block/

I'll use "an offline memory block".

[...]

>>
>> -        /sys/devices/system/memory/memory9/node0 -> ../../node/node0
>> +The kernel can be configured to try auto-onlining of newly added memory blocks.
>> +If disabled, the memory blocks will stay offline until explicitly onlined
>
>   ^ If this feature is disabled
>

Ack

[...]

>> -        /sys/devices/system/memory/probe
>> +In the current implementation, Linux's memory offlining will try migrating
>> +all movable pages off the affected memory block. As most kernel allocations,
>> +such as page tables, are unmovable, page migration can fail and, therefore,
>> +inhibit memory offlining from succeeding.
>>
>> -You can tell the physical address of new memory to the kernel by::
>> +Having the memory provided by memory block managed by ZONE_MOVABLE severely
>
>                                                        significantly ^

Indeed

>
>> +increases memory offlining reliability; still, memory offlining can fail in
>> +some corner cases.
>>
>> -        % echo start_address_of_new_memory > /sys/devices/system/memory/probe
>> +Further, memory offlining might retry for a long time (or even forever),
>> +until aborted by the user.
>>
>> -Then, [start_address_of_new_memory, start_address_of_new_memory +
>> -memory_block_size] memory range is hot-added. In this case, hotplug script is
>> -not called (in current implementation). You'll have to online memory by
>> -yourself. Please see :ref:`memory_hotplug_how_to_online_memory`.
>> +Offlining of a memory block can be triggered via::
>>
>> -Logical Memory hot-add phase
>> -============================
>> +        % echo offline > /sys/devices/system/memory/memoryXXX/state
>>
>> -State of memory
>> ----------------
>> +Or alternatively::
>>
>> -To see (online/offline) state of a memory block, read 'state' file::
>> +        % echo 0 > /sys/devices/system/memory/memoryXXX/online
>>
>> -        % cat /sys/device/system/memory/memoryXXX/state
>> +If offline succeeds, the state of the memory block is changed to be "offline".
>> +If it fails, an error will be returned by the kernel.
>
> I think elaborating here how the error is returned would be nice.

I *think* it's returned via the system call that tries modifying the file.

"If it fails, an error will be returned by the kernel via the system call
that triggered the modification of the respective file."

>
>> +Observing the State of Memory Blocks
>> +------------------------------------
>>
>> -- If the memory block is online, you'll read "online".
>> -- If the memory block is offline, you'll read "offline".
>> +The state (online/offline/going-offline) of a memory block can be observed
>> +either via::
>>
>> +        % cat /sys/device/system/memory/memoryXXX/state
>>
>> -.. _memory_hotplug_how_to_online_memory:
>> +Or alternatively (1/0) via::
>>
>> -How to online memory
>> ---------------------
>> +        % cat /sys/device/system/memory/memoryXXX/online
>>
>> -When the memory is hot-added, the kernel decides whether or not to "online"
>> -it according to the policy which can be read from "auto_online_blocks" file::
>> +For an online memory block, the managing zone van be observed via::
>
>                                           typo: ^ can

Thanks

>>
>> -        % cat /sys/devices/system/memory/auto_online_blocks
>> +        % cat /sys/device/system/memory/memoryXXX/valid_zones
>>
>> -The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
>> -option. If it is disabled the default is "offline" which means the newly added
>> -memory is not in a ready-to-use state and you have to "online" the newly added
>> -memory blocks manually. Automatic onlining can be requested by writing "online"
>> -to "auto_online_blocks" file::
>> +Configuring Memory Hot(Un)Plug
>> +==============================
>>
>> -        % echo online > /sys/devices/system/memory/auto_online_blocks
>> +There are various ways how admins can configure memory hot(un)plug and interact
>
>                              ^ system administrators

Ack

[...]

>> -Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by::
>> +.. note::
>>
>> -        % echo online_kernel > /sys/devices/system/memory/memoryXXX/state
>> +  With CONFIG_MEMORY_FAILURE, two additional files ``hard_offline_page`` and
>
> When the kernel is built with CONFIG_MEMORY_FAILURE option enabled
>
> Maybe add a subsection about the configuration options that define sysfs
> behaviour and group all the notes there as simple paragraphs?
>

I'll move the notes for "auto_online_blocks" and "probe" right to the
description.
I'll leave the note for hard_offline_page and
soft_offline_page in the (last remaining) note as they are not really
related to memory hot(un)plug.

[...]

>> +``uevent``          read-write: generic uevent file for devices.
>> +``valid_zones``     read-only: shows by which zone memory provided by an
>> +                    online memory block is managed, and by which zone memory
>> +                    provided by an offline memory block could be managed when
>> +                    onlining.
>
> Sounds a bit awkward to me. Maybe
>
> when a block is online shows the zone it belongs to; when a block is offline
> shows what zone will manage it when the block will be onlined.
>

Ack

>>
>> -Now, a boot option for making a memory block which consists of migratable pages
>> -is supported. By specifying "kernelcore=" or "movablecore=" boot option, you can
>> -create ZONE_MOVABLE...a zone which is just used for movable pages.
>> -(See also Documentation/admin-guide/kernel-parameters.rst)
>> +                    For online memory blocks, ``DMA``, ``DMA32``, ``Normal``,
>> +                    ``Movable`` and ``none`` may be returned. ``none`` indicates
>
> Highmem? Or we don't support hotplug on 32 bits?

We only support 64-bit:

config MEMORY_HOTPLUG
        ...
        depends on 64BIT || BROKEN

Worth a comment in the document "Introduction":

"Linux only supports memory hot(un)plug on selected 64-bit
architectures, such as x86_64, aarch64, ppc64, s390x and ia64."

I can spot that sh also enables it -- but I never even tested it or saw
any BUG reports related to it, so I'll not mention it for now explicitly
in the document.

>
>> +                    that memory provided by a memory block is managed by
>> +                    multiple zones or spans multiple nodes; such memory blocks
>> +                    cannot be offlined. ``Movable`` indicates ZONE_MOVABLE.
>> +                    Other values indicate a kernel zone.
>>
>> -Assume the system has "TOTAL" amount of memory at boot time, this boot option
>> -creates ZONE_MOVABLE as following.
>> +                    For offline memory blocks, the first column shows the
>> +                    zone the kernel would select when onlining the memory block
>> +                    right now without further specifying a zone.
>> +=================== ============================================================
>>
>> -1) When kernelcore=YYYY boot option is used,
>> -   Size of memory not for movable pages (not for offline) is YYYY.
>> -   Size of memory for movable pages (for offline) is TOTAL-YYYY.
>> +.. note::
>>
>> -2) When movablecore=ZZZZ boot option is used,
>> -   Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
>> -   Size of memory for movable pages (for offline) is ZZZZ.
>> +  ``valid_zones`` is only available with CONFIG_MEMORY_HOTREMOVE.
>>
>>  .. note::
>>
>> -   Unfortunately, there is no information to show which memory block belongs
>> -   to ZONE_MOVABLE. This is TBD.
>> +  If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed
>> +  via symbolic links located in the ``/sys/devices/system/node/node*``
>> +  directories.
>> +
>> +  For example::
>> +
>> +        /sys/devices/system/node/node0/memory9 -> ../../memory/memory9
>> +
>> +  A backlink will also be created::
>> +
>> +        /sys/devices/system/memory/memory9/node0 -> ../../node/node0
>>
>> -   Memory offlining can fail when dissolving a free huge page on ZONE_MOVABLE
>> -   and the feature of freeing unused vmemmap pages associated with each hugetlb
>> -   page is enabled.
>> +Cmdline Parameters
>
> Command line

Ack, will adjust all "cmdline" instances.

Hope I didn't miss any feedback, will do another pass to make sure I
considered everything. Thanks!

--
Thanks,

David / dhildenb
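
For illustration only, and not part of the patch: a minimal shell sketch of
how the offlining error discussed above would surface to a user, assuming
(as suggested in the reply) that the kernel reports the failure through the
write to the ``state`` file, and that sysfs is mounted at ``/sys``. The
variable MEMORY_SYSFS_ROOT and the three function names are hypothetical
helpers introduced here; only the ``state`` and ``valid_zones`` files come
from the text.

```shell
#!/bin/sh
# Sketch under stated assumptions; MEMORY_SYSFS_ROOT and the memblock_*
# helpers are hypothetical names, not an existing tool or kernel interface.
MEMORY_SYSFS_ROOT="${MEMORY_SYSFS_ROOT:-/sys/devices/system/memory}"

# Print the state of memory block $1 (for example "memory9").
memblock_state() {
    cat "$MEMORY_SYSFS_ROOT/$1/state"
}

# Print the first column of valid_zones for memory block $1. For an offline
# block this is the zone the kernel would pick when onlining without
# specifying a zone.
memblock_default_zone() {
    read -r first _ < "$MEMORY_SYSFS_ROOT/$1/valid_zones"
    echo "$first"
}

# Request offlining of memory block $1. A failed write to "state" makes
# printf (or the redirection) return non-zero, which is what we check.
memblock_offline() {
    if { printf 'offline\n' > "$MEMORY_SYSFS_ROOT/$1/state"; } 2>/dev/null; then
        echo "$1: now $(memblock_state "$1")"
    else
        echo "$1: offlining failed" >&2
        return 1
    fi
}
```

On a real system this would be run as root, for example
``memblock_offline memory9``. Because offlining might retry for a long time,
a caller may want to wrap the call so that it can be aborted.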