From: Dan Williams
Date: Thu, 24 Sep 2020 06:54:49 -0700
Subject: Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res
To: David Hildenbrand
Cc: Joao Martins, Andrew Morton, Dave Hansen, Pavel Tatashin,
 Peter Zijlstra, Ard Biesheuvel, Linux MM, linux-nvdimm,
 Linux Kernel Mailing List, Linux ACPI, Mailing list - DRI developers
References:
 <159643094279.4062302.17779410714418721328.stgit@dwillia2-desk3.amr.corp.intel.com>
 <159643100485.4062302.976628339798536960.stgit@dwillia2-desk3.amr.corp.intel.com>
 <17686fcc-202e-0982-d0de-54d5349cfb5d@oracle.com>
 <9acc6148-72eb-7016-dba9-46fa87ded5a5@redhat.com>

On Thu, Sep 24, 2020 at 12:26 AM David Hildenbrand wrote:
>
> On 23.09.20 23:41, Dan Williams wrote:
> > On Wed, Sep 23, 2020 at 1:04 AM David Hildenbrand wrote:
> >>
> >> On 08.09.20 17:33, Joao Martins wrote:
> >>> [Sorry for the late response]
> >>>
> >>> On 8/21/20 11:06 AM, David Hildenbrand wrote:
> >>>> On 03.08.20 07:03, Dan Williams wrote:
> >>>>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev)
> >>>>>  	 * could be mixed in a node with faster memory, causing
> >>>>>  	 * unavoidable performance issues.
> >>>>>  	 */
> >>>>> -	numa_node = dev_dax->target_node;
> >>>>>  	if (numa_node < 0) {
> >>>>>  		dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
> >>>>>  				numa_node);
> >>>>>  		return -EINVAL;
> >>>>>  	}
> >>>>>
> >>>>> -	/* Hotplug starting at the beginning of the next block: */
> >>>>> -	kmem_start = ALIGN(range->start, memory_block_size_bytes());
> >>>>> -
> >>>>> -	kmem_size = range_len(range);
> >>>>> -	/* Adjust the size down to compensate for moving up kmem_start: */
> >>>>> -	kmem_size -= kmem_start - range->start;
> >>>>> -	/* Align the size down to cover only complete blocks: */
> >>>>> -	kmem_size &= ~(memory_block_size_bytes() - 1);
> >>>>> -	kmem_end = kmem_start + kmem_size;
> >>>>> -
> >>>>> -	new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
> >>>>> -	if (!new_res_name)
> >>>>> +	res_name = kstrdup(dev_name(dev), GFP_KERNEL);
> >>>>> +	if (!res_name)
> >>>>>  		return -ENOMEM;
> >>>>>
> >>>>> -	/* Region is permanently reserved if hotremove fails. */
> >>>>> -	new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
> >>>>> -	if (!new_res) {
> >>>>> -		dev_warn(dev, "could not reserve region [%pa-%pa]\n",
> >>>>> -				&kmem_start, &kmem_end);
> >>>>> -		kfree(new_res_name);
> >>>>> +	res = request_mem_region(range.start, range_len(&range), res_name);
> >>>>
> >>>> I think our range could be empty after aligning. I assume
> >>>> request_mem_region() would check that, but maybe we could report a
> >>>> better error/warning in that case.
> >>>>
> >>> dax_kmem_range() already returns a memory-block-aligned @range, but
> >>> IIUC request_mem_region() isn't checking for that. Having said that,
> >>> the returned @res wouldn't be different from the passed range.start.
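> >>>
> >>> (For context, dax_kmem_range() in this series looks roughly like the
> >>> below - sketched from the series, so details may differ:
> >>>
> >>> 	static struct range dax_kmem_range(struct dev_dax *dev_dax)
> >>> 	{
> >>> 		struct range range;
> >>>
> >>> 		/* memory-block align the hotplug range */
> >>> 		range.start = ALIGN(dev_dax->range.start,
> >>> 				memory_block_size_bytes());
> >>> 		range.end = ALIGN_DOWN(dev_dax->range.end + 1,
> >>> 				memory_block_size_bytes()) - 1;
> >>> 		return range;
> >>> 	}
> >>>
> >>> i.e. if the device range is smaller than a memory block, start can
> >>> end up past end, which is the empty-after-aligning case you mention.)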
> >>>
> >>>>>  	/*
> >>>>>  	 * Ensure that future kexec'd kernels will not treat this as RAM
> >>>>>  	 * automatically.
> >>>>>  	 */
> >>>>> -	rc = add_memory_driver_managed(numa_node, new_res->start,
> >>>>> -				       resource_size(new_res), kmem_name);
> >>>>> +	rc = add_memory_driver_managed(numa_node, res->start,
> >>>>> +				       resource_size(res), kmem_name);
> >>>>> +
> >>>>> +	res->flags |= IORESOURCE_BUSY;
> >>>>
> >>>> Hm, I don't think that's correct. Any specific reason why to mark the
> >>>> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could
> >>>> suddenly stumble over it - and e.g., similarly kexec code when trying to
> >>>> find memory for placing kexec images. I think we should leave this
> >>>> !BUSY, just as it is right now.
> >>>>
> >>> Agreed.
> >>>
> >>>>>  	if (rc) {
> >>>>> -		release_resource(new_res);
> >>>>> -		kfree(new_res);
> >>>>> -		kfree(new_res_name);
> >>>>> +		release_mem_region(range.start, range_len(&range));
> >>>>> +		kfree(res_name);
> >>>>>  		return rc;
> >>>>>  	}
> >>>>> -	dev_dax->dax_kmem_res = new_res;
> >>>>> +
> >>>>> +	dev_set_drvdata(dev, res_name);
> >>>>>
> >>>>>  	return 0;
> >>>>> }
> >>>>>
> >>>>> #ifdef CONFIG_MEMORY_HOTREMOVE
> >>>>> -static int dev_dax_kmem_remove(struct device *dev)
> >>>>> +static void dax_kmem_release(struct dev_dax *dev_dax)
> >>>>> {
> >>>>> -	struct dev_dax *dev_dax = to_dev_dax(dev);
> >>>>> -	struct resource *res = dev_dax->dax_kmem_res;
> >>>>> -	resource_size_t kmem_start = res->start;
> >>>>> -	resource_size_t kmem_size = resource_size(res);
> >>>>> -	const char *res_name = res->name;
> >>>>>  	int rc;
> >>>>> +	struct device *dev = &dev_dax->dev;
> >>>>> +	const char *res_name = dev_get_drvdata(dev);
> >>>>> +	struct range range = dax_kmem_range(dev_dax);
> >>>>>
> >>>>>  	/*
> >>>>>  	 * We have one shot for removing memory, if some memory blocks were not
> >>>>>  	 * offline prior to calling this function remove_memory() will fail, and
> >>>>>  	 * there is no way to hotremove this memory until reboot because device
> >>>>> -	 * unbind will succeed even if we return failure.
> >>>>> +	 * unbind will proceed regardless of the remove_memory result.
> >>>>>  	 */
> >>>>> -	rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
> >>>>> -	if (rc) {
> >>>>> -		any_hotremove_failed = true;
> >>>>> -		dev_err(dev,
> >>>>> -			"DAX region %pR cannot be hotremoved until the next reboot\n",
> >>>>> -			res);
> >>>>> -		return rc;
> >>>>> +	rc = remove_memory(dev_dax->target_node, range.start, range_len(&range));
> >>>>> +	if (rc == 0) {
> >>>>
> >>>> if (!rc) ?
> >>>>
> >>> Better off would be to keep the old order:
> >>>
> >>> 	if (rc) {
> >>> 		any_hotremove_failed = true;
> >>> 		dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n",
> >>> 			range.start, range.end);
> >>> 		return;
> >>> 	}
> >>>
> >>> 	release_mem_region(range.start, range_len(&range));
> >>> 	dev_set_drvdata(dev, NULL);
> >>> 	kfree(res_name);
> >>> 	return;
> >>>
> >>>
> >>>>> +	release_mem_region(range.start, range_len(&range));
> >>>>
> >>>> remove_memory() does a release_mem_region_adjustable(). Don't you
> >>>> actually want to release the *unaligned* region you requested?
> >>>>
> >>> Isn't it what we're doing here?
> >>> (The release_mem_region_adjustable() is using the same
> >>> dax_kmem-aligned range and there's no split/adjust.)
> >>>
> >>> Meaning right now (+ parent marked as !BUSY), if I am understanding
> >>> this correctly:
> >>>
> >>> request_mem_region(range.start, range_len)
> >>>   __request_region(iomem_res, range.start, range_len) -> alloc @parent
> >>> add_memory_driver_managed(parent.start, resource_size(parent))
> >>>   __request_region(parent.start, resource_size(parent)) -> alloc @child
> >>>
> >>> [...]
> >>>
> >>> remove_memory(range.start, range_len)
> >>>   request_mem_region_adjustable(range.start, range_len)
> >>>     __release_region(range.start, range_len) -> remove @child
> >>>
> >>> release_mem_region(range.start, range_len)
> >>>   __release_region(range.start, range_len) -> doesn't remove @parent because !BUSY?
> >>>
> >>> The add/removal of this relies on !BUSY. But now I am wondering if the
> >>> parent remaining unreleased is deliberate even on CONFIG_MEMORY_HOTREMOVE=y.
> >>>
> >>> Joao
> >>>
> >>
> >> Thinking about it, if we don't set the parent resource BUSY (which is
> >> what I think is the right way of doing things), and don't want to store
> >> the parent resource pointer, we could add something like
> >> lookup_resource() - e.g., lookup_mem_resource() - however, searching
> >> properly in the whole hierarchy (instead of only the first level), and
> >> traversing down to the last hierarchy. Then it would be as simple as
> >>
> >> remove_memory(range.start, range_len)
> >> res = lookup_mem_resource(range.start);
> >> release_resource(res);
> >
> > Another thought... I notice that you've taught
> > register_memory_resource() an IORESOURCE_MEM_DRIVER_MANAGED special
> > case. Let's just make the assumption of add_memory_driver_managed()
> > that it is the driver's responsibility to mark the range busy before
> > calling, and the driver's responsibility to release the region. I.e.
> > validate (rather than request) that the range is busy in
> > register_memory_resource(), and teach release_memory_resource() to
> > skip releasing the region when the memory is marked driver managed.
> > That would let dax_kmem drop its manipulation of the 'busy' flag, which
> > is a layering violation no matter how many comments we put around it.
>
> IIUC, that won't work for virtio-mem, whereby the parent resource spans
> multiple possible (future) add_memory_driver_managed() calls and is
> (just like for kmem) a pure indication of which device memory ranges
> belong to it.
>
> For example, when exposing 2GB via a virtio-mem device with max 4GB:
>
> (/proc/iomem)
> 240000000-33fffffff : virtio0
>   240000000-2bfffffff : System RAM (virtio_mem)
>
> And after hotplugging an additional 2GB:
>
> 240000000-33fffffff : virtio0
>   240000000-33fffffff : System RAM (virtio_mem)
>
> So marking "virtio0" always BUSY (especially right from the start) would
> be wrong.

I'm not suggesting to busy the whole "virtio" range, just the portion
that's about to be passed to add_memory_driver_managed().

> The assumption is that anything that's IORESOURCE_SYSTEM_RAM
> and IORESOURCE_BUSY is currently added to the system as system RAM (e.g.,
> after add_memory() and friends, or during boot).
>
> I do agree that manually clearing the busy flag is ugly. What we most
> probably want is a request_mem_region() that performs similar checks (no
> overlaps with existing BUSY resources), but doesn't set the region busy.
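>
> Something along these lines, perhaps (a hypothetical sketch - the
> function name is made up, it only illustrates the semantics):
>
> 	/*
> 	 * Conflict-check against existing BUSY resources the way
> 	 * request_mem_region() does, but leave the new resource
> 	 * !IORESOURCE_BUSY so that add_memory_driver_managed() can
> 	 * claim a BUSY child resource inside it later.
> 	 */
> 	res = request_mem_region_nonbusy(range.start, range_len(&range),
> 			res_name);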
I can't see that working without some way to export and hold the
resource lock until some agent can atomically claim the range.
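To illustrate the race, a rough sketch (overlaps_busy() is a made-up
helper; resource_lock is static to kernel/resource.c today):

	read_lock(&resource_lock);
	conflict = overlaps_busy(&iomem_resource, start, n);
	read_unlock(&resource_lock);
	/* <-- another caller can request_mem_region() in this window */
	if (!conflict)
		insert_resource(&iomem_resource, res);

The check and the claim need to happen under one hold of the lock, and
nothing exports that lock to drivers today.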