From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 8 Jun 2021 12:12:09 +0200
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Oscar Salvador
Cc: linux-kernel@vger.kernel.org, Andrew Morton, Vitaly Kuznetsov,
 "Michael S. Tsirkin", Jason Wang, Marek Kedzierski, Hui Zhu,
 Pankaj Gupta, Wei Yang, Michal Hocko, Dan Williams,
 Anshuman Khandual, Dave Hansen, Vlastimil Babka, Mike Rapoport,
 "Rafael J. Wysocki", Len Brown, Pavel Tatashin,
 virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
 linux-acpi@vger.kernel.org
Subject: Re: [PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory groups
Message-ID: <9ab50bc0-1714-67c4-ea9a-79e7d315315b@redhat.com>
In-Reply-To: <20210608094244.GA22894@linux>
References: <20210607195430.48228-1-david@redhat.com> <20210608094244.GA22894@linux>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US

On 08.06.21 11:42, Oscar Salvador wrote:
> On Mon, Jun 07, 2021 at 09:54:18PM +0200, David Hildenbrand wrote:
>> Hi,
>>
>> this series aims at improving in-kernel auto-online support. It
>> tackles the fundamental problems that:
> 
> Hi David,
> 
> the idea sounds good to me, and I like that this series takes away
> part of the responsibility from the user to know where the memory
> should go. I think the kernel is a much better fit for that, as it has
> all the required information to balance things.
> 
> I also glanced over the series and, besides some things here and
> there, the whole approach looks sane.
> I plan to have a look into it in a few days; I just have some
> high-level questions for the time being:

Hi Oscar,

> 
>> 1) We can create zone imbalances when onlining all memory blindly to
>>    ZONE_MOVABLE, in the worst case crashing the system. We have to
>>    know upfront how much memory we are going to hotplug such that we
>>    can safely enable auto-onlining of all hotplugged memory to
>>    ZONE_MOVABLE via "online_movable". This is far from practical and
>>    only applicable in limited setups -- like inside VMs under the
>>    RHV/oVirt hypervisor, which will never hotplug more than 3 times
>>    the boot memory (and the limitation is only in place due to the
>>    Linux limitation).
> 
> Could you give more insight about the problems created by zone
> imbalances (e.g., a lot of movable memory and little kernel memory)?

I just updated memory-hotplug.rst exactly for that purpose :)

https://lkml.kernel.org/r/20210525102604.8770-1-david@redhat.com

There, safe zone ratios and "usually well known values" are also given.
I can link it in the next cover letter.

> 
>> 2) We see more setups that implement dynamic VM resizing,
>>    hot(un)plugging memory to resize VM memory. In these setups, we
>>    might hotplug a lot of memory, but it might happen in various
>>    small steps in both directions (e.g., 2 GiB -> 8 GiB -> 4 GiB ->
>>    16 GiB ...). virtio-mem is the primary driver of this upstream
>>    right now, performing such dynamic resizing NUMA-aware via
>>    multiple virtio-mem devices.
>>
>>    Onlining all hotplugged memory to ZONE_NORMAL means we basically
>>    have no hotunplug guarantees. Onlining all to ZONE_MOVABLE means
>>    we can easily run into zone imbalances when growing a VM. We want
>>    a mixture, and we want as much memory as reasonable/configured in
>>    ZONE_MOVABLE.
>>
>> 3) Memory devices consist of 1..X memory block devices; however, the
>>    kernel doesn't really track the relationship. Consequently, user
>>    space also has no idea. We want to make per-device decisions. As
>>    one example, for memory hotunplug it doesn't make sense to use a
>>    mixture of zones within a single DIMM: we want all MOVABLE if
>>    possible, otherwise all !MOVABLE, because any !MOVABLE part will
>>    easily block the DIMM from getting hotunplugged. As another
>>    example, virtio-mem operates on individual units that span 1..X
>>    memory blocks. Similar to a DIMM, we want a unit to either be all
>>    MOVABLE or !MOVABLE. Further, we want as much memory of a
>>    virtio-mem device to be MOVABLE as possible.
> 
> So, a virtio-mem unit could be seen as a DIMM, right?

It's a bit more complicated. Each individual unit (e.g., a 128 MiB
memory block) is the smallest granularity we can add/remove of that
device. So such a unit is somewhat like a DIMM. However, all "units" of
the device can interact -- it's a single memory device.

> 
>> 4) We want memory onlining to be done right from the kernel while
>>    adding memory; for example, this is required for fast memory
>>    hotplug for drivers that add individual memory blocks, like
>>    virtio-mem. We want a way to configure a policy in the kernel and
>>    avoid implementing advanced policies in user space.
> 
> "we want memory onlining to be done right from the kernel while adding
> memory"
> 
> Is that not always the case when a driver adds memory? The user has no
> interaction with that, right?

Well, with auto-onlining in the kernel disabled, user space has to do
the onlining -- for example, via udev rules right now in major
distributions.

But there are also users that always want to online manually in user
space to select a zone. Most prominently, standby memory on s390x, but
also in some cases dax/kmem memory. These two are really corner cases,
though. In general, we want hotplugged memory to be onlined immediately.

> 
>> The auto-onlining support we have in the kernel is not sufficient.
>> All we have is a) online everything movable (online_movable),
>> b) online everything !movable (online_kernel), and c) keep zones
>> contiguous (online). This series allows configuring c) to instead
>> mean "online movable if possible according to the configuration,
>> driven by a maximum MOVABLE:KERNEL ratio" -- a new onlining policy.
>>
>> This series does 3 things:
>>
>> 1) Introduces the "auto-movable" online policy that initially
>>    operates on individual memory blocks only. It uses a maximum
>>    MOVABLE:KERNEL ratio to make a decision whether a memory block
>>    will be onlined to ZONE_MOVABLE or not. However, in the basic
>>    form, hotplugged KERNEL memory does not allow for more MOVABLE
>>    memory (details in the patches). CMA memory is treated like
>>    MOVABLE memory.
> 
> How would a user know which ratio is sane? Could we add some info to
> the documentation that kinda sets some "basic" rules?

Again, this currently resides in the memory-hotplug.rst overhaul.
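Until that documentation lands, the arithmetic behind the ratio can be
sketched as follows. This is a toy illustration only, not the kernel's
actual implementation; the numbers match the 4 GiB / 301% DIMM example
quoted later in this mail:

```shell
#!/bin/sh
# Toy sketch of the basic "auto-movable" decision: hotplugged memory may
# be onlined to ZONE_MOVABLE only while
#     MOVABLE <= KERNEL * ratio / 100
# With auto_movable_ratio=301 and 4 GiB (4096 MiB) of KERNEL boot
# memory, up to 12328 MiB may become MOVABLE -- enough for three
# 4 GiB DIMMs (12288 MiB), but not for a fourth.
ratio=301
kernel_mib=4096
max_movable_mib=$(( kernel_mib * ratio / 100 ))
echo "up to ${max_movable_mib} MiB may be onlined MOVABLE"
```

That is why, in the example layout below, DIMMs 0-2 end up MOVABLE while
DIMM 3 and everything after it go to ZONE_NORMAL.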
> 
>> 2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem)
>>    memory groups and uses group information to make decisions in the
>>    "auto-movable" online policy across memory blocks of a single
>>    memory device (modeled as a memory group).
> 
> So, the distinction being that a DIMM cannot grow larger, but we can
> add more memory to a virtio-mem unit? I feel I am missing some insight
> here.

Right, the relevant patch contains more info.

You either plug or unplug a DIMM (or a NUMA node which spans multiple
DIMMs) -- both are ACPI memory devices that span multiple physical
regions. You cannot unplug parts of a DIMM or grow it. "Static", as
also expressed by the ACPI code (it "adds" and "removes" all memory
device memory in one go).

virtio-mem behaves differently, as it's a single physical memory region
in which we dynamically add or remove memory. The granularity in which
we add/remove memory from Linux is a "unit". In the simplest case, it's
just a single memory block (e.g., 128 MiB). So it's a memory device
that can grow/shrink in the given unit -- "dynamic".

> 
>> 3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
>>    allowing ZONE_NORMAL memory within a dynamic memory group to allow
>>    for more ZONE_MOVABLE memory within the same memory group. The
>>    target use case is dynamic VM resizing using virtio-mem.
> 
> Sorry, I got lost in this one. Care to explain a bit more?

The virtio-mem example below should make this a bit clearer (in
addition to the relevant patch), especially in contrast to static
memory devices like DIMMs. The key is that a single virtio-mem device
is a "dynamic memory group" in which memory can get added/removed
dynamically in a given unit granularity. And we want to special-case
that type of device to have as much memory of a virtio-mem device
MOVABLE as possible (and configured).
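As an aside for readers following along: which zone each memory block
ended up in can be inspected at runtime through the standard memory
hotplug sysfs interface (this assumes a kernel with memory hotplug
enabled and a populated /sys/devices/system/memory):

```shell
#!/bin/sh
# Print the state and zones of every memory block. For an online block,
# "valid_zones" reports the zone it belongs to; for an offline block,
# the zones it could be onlined to.
SYSMEM=/sys/devices/system/memory
if [ -d "$SYSMEM" ]; then
    for b in "$SYSMEM"/memory*/; do
        [ -f "${b}valid_zones" ] || continue
        printf '%s: state=%s zones=%s\n' \
            "$(basename "$b")" "$(cat "${b}state")" \
            "$(cat "${b}valid_zones")"
    done
else
    echo "no memory hotplug sysfs interface on this system"
fi
```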
> 
>> The target usage will be:
>>
>> 1) Linux boots with "mhp_default_online_type=offline"
>>
>> 2) User space (e.g., a systemd unit) configures memory onlining
>>    (according to a config file and system properties), for example:
>>    * Setting memory_hotplug.online_policy=auto-movable
>>    * Setting memory_hotplug.auto_movable_ratio=301
>>    * Setting memory_hotplug.auto_movable_numa_aware=true
> 
> I think we would need to document those in order to let the user know
> what is best for them, e.g., when we want to enable
> auto_movable_numa_aware, etc.

Yes, as mentioned, a memory-hotplug.rst update will follow once the
overhaul is done. The respective patch contains more information.

> 
>> For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured
>> ratio of 301% results in the following layout:
>> Memory block 1-15:    DMA32   (early)
>> Memory block 32-47:   Normal  (early)
>> Memory block 48-79:   Movable (DIMM 0)
>> Memory block 80-111:  Movable (DIMM 1)
>> Memory block 112-143: Movable (DIMM 2)
>> Memory block 144-175: Normal  (DIMM 3)
>> Memory block 176-207: Normal  (DIMM 4)
>> ... all Normal
>> (-> hotplugged Normal memory does not allow for more Movable memory)
> 
> Uhm, I am sorry for being dense here:
> 
> On x86_64, 4 GiB = 32 sections (of 128 MiB each). Why do the memory
> blocks span from #1 to #47?

Sorry, it's actually "Memory block 0-15", which gives us 0-15 and 32-47
== 32 memory blocks corresponding to boot memory. Note that the absent
memory blocks 16-31 should correspond to the PCI hole.

Thanks Oscar!

-- 
Thanks,

David / dhildenb