From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B728FC4332F for ; Thu, 6 Oct 2022 15:52:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231709AbiJFPwR (ORCPT ); Thu, 6 Oct 2022 11:52:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57566 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231918AbiJFPwP (ORCPT ); Thu, 6 Oct 2022 11:52:15 -0400 Received: from mail-qv1-xf43.google.com (mail-qv1-xf43.google.com [IPv6:2607:f8b0:4864:20::f43]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4B164A99C7 for ; Thu, 6 Oct 2022 08:52:10 -0700 (PDT) Received: by mail-qv1-xf43.google.com with SMTP id mx8so1420137qvb.8 for ; Thu, 06 Oct 2022 08:52:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=B2o9ZROGbxRs9xRX9MabbA1puhx8ft3d7HPuUza0guk=; b=RYSBceZ0muIaRCBnYmDykL9LYKu+e3R4EZPUnVaDyxdOBxLXh62ZZX+qtbXiUvB2KA C96dPC7tpStP1WiFf1liG1RPn/KTWcuHSG2+doC55oQXqJ8KiSYqnw5URojRZAAftvaw MSkn5LgBqm6qyA1eNu4xyDpRXXtUofDfiP3k2SIpNvOr2zOoVhHQq/vSlgXplZHHCqrs fqJ+WPPLHGuS9gwhKeGG3wBRPBc/Pg2fgJr+/mzJn+93QkHgWfFCjNcXss6oJGyv85OH OeMwFLZkqXbLHZ8oQFtMrGjoSy32L8bhYSMJW0HgEeUS3mXmuoOXBOxll/FgW7vdIcP1 HdSA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=B2o9ZROGbxRs9xRX9MabbA1puhx8ft3d7HPuUza0guk=; b=gcwiGPIuwhv+5XA3PkcEdGSN87wyykcJslmWfky5IBstz2wNqs77He0hrDYrU32Mhm itlBNB1nPqgaOOOnSCkPehccXQS3wk93yjPIJ1yHeceXTAbN4va78/1VStLTqclG5w8J 13DlKI6W3hQlVCP3h46qdfS605j7TfEHmf1EnNVNbcTmooAVJKFgxOzFIPbt+4eG8zeN Au9ILpsS6omnp/smn5wxGCJUH5Te+qNiV7+5J2DFn0f+So8j9RvHYqJYRKTnSPo8xIeY 2RdbI4slh5NokUpQRvJq6ZUDlgL6AiAViaeMPlZZxdOTaC1cGyzNLw5i8oKeDigxM+9g wheA== X-Gm-Message-State: ACrzQf3Bwbw/FJT9oyJm1pHLjiPlorPzYCCeA/QwFnW0rgW8hxzklfMD HtdMkmv1WRmBkurFqqnnXQ== X-Google-Smtp-Source: AMsMyM5LE7C4YkdlkzG3xoy2DYZgeYK7gW+oqsQnxBOQpfG+wwSP7qDYn1VdRTw0dmJHUpWvBSrfsw== X-Received: by 2002:ad4:5dee:0:b0:4ac:b757:ccd with SMTP id jn14-20020ad45dee000000b004acb7570ccdmr702228qvb.76.1665071529265; Thu, 06 Oct 2022 08:52:09 -0700 (PDT) Received: from fedora (pool-173-79-56-208.washdc.fios.verizon.net. [173.79.56.208]) by smtp.gmail.com with ESMTPSA id o15-20020a05620a22cf00b006bb78d095c5sm18201578qki.79.2022.10.06.08.52.08 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 06 Oct 2022 08:52:08 -0700 (PDT) Date: Thu, 6 Oct 2022 11:52:06 -0400 From: Gregory Price To: Jonathan Cameron Cc: qemu-devel@nongnu.org, linux-cxl@vger.kernel.org, Alison Schofield , Davidlohr Bueso , "a.manzanares@samsung.com" , Ben Widawsky Subject: Re: [PATCH RFC] hw/cxl: type 3 devices can now present volatile or persistent memory Message-ID: References: <20221006000103.49542-1-gregory.price@memverge.com> <20221006094557.000035ab@huawei.com> <20221006095007.00001271@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20221006095007.00001271@huawei.com> Precedence: bulk List-ID: X-Mailing-List: linux-cxl@vger.kernel.org On Thu, Oct 06, 2022 at 09:50:07AM +0100, Jonathan Cameron wrote: > On Thu, 6 Oct 2022 09:45:57 +0100 > Jonathan Cameron wrote: > > > Great to see this. > > > > Missing Signed-off by so we can't apply this (no developer certificate of > > origin) Probably want your from address to match that and I've no idea > > if your fully name is Gourry (apologies if it is!) > > > > +CC linux-cxl as most folks who will be interested in this hang out there > > and are more likely to noticed it than burried on qemu-devel. > > Also +CC some others who will be interested in this discussion. > > > > Please also cc the listed CXL QEMU maintainers (myself and Ben W) - see > > MAINTAINERS > > Sorry about that, this is my literal first patch submission to the kernel devlists ever, I'm still getting my environment set up (just learning mutt as I type this). Gourry is a nickname I go by, but I see pseudonyms are not particularly welcome on the lists, so no worries. In addition, I'm relatively new to the QEMU, CXL, and PCIe ecosystem, so please excuse some of my ignorance of the overall state of things - doing my best to pickup as quickly as possible. > I forgot to ask. How are you testing this? > > One of the blockers for volatile support was that we had no means to poke > it properly as the kernel doesn't yet support volatile capacity and > no one has done the relevant work in EDK2 or similar to do it before the kernel boots. > There has been some work on EDK2 support for ARM N2 FVPs from > Saanta Pattanayak, but not upstream eyt. > https://lpc.events/event/16/contributions/1254/ > > Thanks, > > Jonathan > Presently this is untested for exactly the reasons you described. I had originally intended to shoot off a cover letter with this patch discussing the issue, but I managed to much that up... so lets just plop that in here: [PATCH RFC] hw/cxl: Volatile Type-3 CXL Devices The idea behind this patch is simple: Not all type-3 devices are persistent memory devices, so add a simple boolean (is_pmem) to the attributes of type-3 devices and have the mailbox-utility report memory accordingly (persistent_capacity if is_pmem, else volatile_capacity). In an emulated system, there isn't really a difference between the devices except for the backing storage (file vs ram), but having an emulation mode for each is useful for the sake of testing fabric manager/orchestrator corner cases. At the moment this can be tested with `cxl list` and it will show the ram capacity as requested, instead of pmem capacity. There are a few issues with this patch set that i'm not able to discern the right answer to just yet, and was hoping for some input: 1) The PCI device type is set prior to realize/attributes, and is currently still set to PCI_CLASS_STORAGE_EXPRESS. Should this instead be PCI_CLASS_MEMORY_CXL when presenting as a simple memory expander? 1a) If so, that means we'll probably need to make type-3 into an abstract class of devices and explicitly create devices that represents vmem and pmem devices separately. That could probably be done within cxl_type3.c since most of the code will be shared anyway. But this is confusing, because a pmem device can be made to operate in volatile mode... so should it ALSO be assigned as PCI_CLASS_MEMORY_CXL, and a volatile device simply be a pmem device with persistence turned off? Many design questions here. 1b) If not, does this have an appreciable affect on system-setup from the perspective of the EFI/Bios? It should be possible, (see #2) to have the EFI/bios setup memory expanders and just present them to the kernel as additional, available memory as if more ram is simply present. 2) EDK2 sets the memory area as a reserved, and the memory is not configured by the system as ram. I'm fairly sure edk2 just doesn't support this yet, but there's a chicken/egg problem. If the device isn't there, there's nothing to test against... if there's nothing to test against, no one will write the support. So I figure we should kick start the process (probably by getting it wrong on the first go around!) 3) Upstream linux drivers haven't touched ram configurations yet. I just configured this with Dan Williams yesterday on IRC. My understanding is that it's been worked on but nothing has been upstreamed, in part because there are only a very small set of devices available to developers at the moment. 4) Should (is_pmem) be defaulted to 'true' (or invert and make the flag (is_vmem)), and should pmem devices also present their memory as volatile capacity? I think they should, because all pmem devices should be capable of running in a volatile mode. > > > > Would it be possible to introduce this support in a fashion that allowed > > us to transition to devices presenting both without needing a change of > > interface? With hindsight the memdev naming of the existing parameter is > > less than helpful, but one path to doing this would be to allow for > > an alternative memdev-volatile= parameter. For now we can keep the patch > > small by only supporting either memdev or memdev-volatile My understanding of the current design is the the memdev backing itself is irrelevant. You can totally have file-memdevs backing a volatile reagion and a ram-memdev backing a persistent region. By adding more/alternative flags that can conflict with each other, we will also create a boondoggle on intialization. I had first considered making Type3Dev into an abstract device and instantiating separate explicit pmem and vmem devices, but ultimatly that just looked like 2 empty structures wrapping Type3Dev and setting the internal is_pmem boolean... lol. > > > > That way when we want to come along and add mixed devices later, we simply > > enable supporting the specification of two different memory backends. > > > > The tricky bit then would be to support variable capacity (devices can > > support a region of memory which may be persistent or non persistent > > depending on settings - see Set Partition Info). > > > > So my gut feeling is we may eventually allow all of: > > > > 1) single memory backend for persistent or volatile. > > 2) two memory backends, covering a persistent region and a volatile region > > (likely you may want actual non volatile backend for the persistent > > region but not the volatile one). > > 3) single memory backend in which we can allow for the Set Partition Info stuff > > to work. > > Seems a little odd to use two memory backends. Of what use is it to the software developers, it should be completely transparent to them, right? The only thing I can think of is maybe reset mechanics for volatile regions being set differently than persistent regions, but even then it seems simple enough to just emulate the behavior and use a single backing device. > > As we move beyond just one backend that is presented as persistent we need > > to define the interface... > > > > I can think of far too many options! Let me lay out a few for comment. > > > > A) Array of memory backends, separate control of configurable size > > + starting persistent size, starting volatile size. > > Default of one memory backend gives you current non-volatile only. > > To get volatile only you provide one memory backend and set > > persistent size to 0 and volatile size to size of memory backend. > > > > B) Two memory backends, (later add some trickery around presented size) > > memdev as defined is persistent memory (maybe we deprecate that interface > > and make it memdev-persistent) memdev-volatile as volatile. I had considered this, but - at least in my head - you're describing a multi-logical device. The device we're looking at right now is a basic Single-Logical Device which can be either persistent or volatile. This goes back to my question: Should we be designing this SLD as a component which can slot into an MLD, or are there special features of a MLD sub-device that aren't present on an SLD? Basically I just don't see the need for multiple backends, as opposed to multiple devices which are separately manageable. If I'm misunderstanding something, please let me know. > > > > Steps to add functionality. > > 1. Add your code to enable volatile but use 'which memdev is supplied' to > > act as the flag. For now supply both is an error. > > 2. Enable static volatile + non-volatile devices (both memdevs supplied). > > 3. (hackish) Add a control for amount of partionable capacity. Cheat > > and remove that capacity from both memdevs. Changing partition just > > messes without how much of each memdev is used. We'd need to be a little > > careful to ensure we wipe data etc. > > > > You code mostly looks good to me so I think discussion here is going to be > > all about the interface and how we build one more suitable for the future! > > > > Jonathan > > > > > > > > Feels like I'm > > > Example command lines > > > --------------------- > > > -A very simple setup with just one directly attached CXL Type 3 device:: > > > +A very simple setup with just one directly attached CXL Type 3 Persistent Memory device:: > > > > > > qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \ > > > ... > > > @@ -308,7 +308,18 @@ A very simple setup with just one directly attached CXL Type 3 device:: > > > -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \ > > > -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \ > > > -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \ > > > - -device cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \ > > > + -device cxl-type3,bus=root_port13,pmem=true,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \ > > > + -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G > > > > Can't do this as it breaks any existing setup. If a flag makes sense (I don't think > > it is flexible enough) then would need to be volatile so it not being present > > will correspond to false and current situation. > > this would be another reason to simply default the pmem flag to true. It would allow existing setups to continue as-is. I can make this swap, but i think the above discussions were more important to get started (and why this is an RFC). > > > diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c > > > index bc1bb18844..3ed4dfeb69 100644 > > > --- a/hw/cxl/cxl-mailbox-utils.c > > > +++ b/hw/cxl/cxl-mailbox-utils.c > > > @@ -138,7 +138,7 @@ static ret_code cmd_firmware_update_get_info(struct cxl_cmd *cmd, > > > } QEMU_PACKED *fw_info; > > > QEMU_BUILD_BUG_ON(sizeof(*fw_info) != 0x50); > > > > > > - if (cxl_dstate->pmem_size < (256 << 20)) { > > > + if (cxl_dstate->mem_size < (256 << 20)) { > > > > Huh. I wonder why this check is here in the first place? Looks very odd given why > > should we be checking the memory size in a firmware update based command. > > > > Probably cut and paste error long ago. > > I had actually wondered about all these 256MB alignment checks, and whether they should actually be sunken into the device creation rather than strewn around. I would have to go digging through the spec to see whether there is explicit error handling at each level if a device presents an out-of-spec size. Something for another patch. > > > @@ -338,10 +340,10 @@ static void ct3_class_init(ObjectClass *oc, void *data) > > > pc->class_id = PCI_CLASS_STORAGE_EXPRESS; > > > pc->vendor_id = PCI_VENDOR_ID_INTEL; > > > pc->device_id = 0xd93; /* LVF for now */ > > > - pc->revision = 1; > > > + pc->revision = 2; > > Given the driver isn't checking this and your code change isn't breaking > > older versions I would not update the revision for this. > > Will drop that from the patch