From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-cxl-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id E0A09C433F5
	for <linux-cxl@archiver.kernel.org>; Wed, 16 Mar 2022 16:50:42 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S239504AbiCPQvz (ORCPT <rfc822;linux-cxl@archiver.kernel.org>);
        Wed, 16 Mar 2022 12:51:55 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45754 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S238414AbiCPQvz (ORCPT
        <rfc822;linux-cxl@vger.kernel.org>); Wed, 16 Mar 2022 12:51:55 -0400
Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D556937BCF
        for <linux-cxl@vger.kernel.org>; Wed, 16 Mar 2022 09:50:39 -0700 (PDT)
Received: from fraeml711-chm.china.huawei.com (unknown [172.18.147.207])
        by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4KJbnm4bHQz6H6hp;
        Thu, 17 Mar 2022 00:48:44 +0800 (CST)
Received: from lhreml710-chm.china.huawei.com (10.201.108.61) by
 fraeml711-chm.china.huawei.com (10.206.15.60) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.2375.24; Wed, 16 Mar 2022 17:50:37 +0100
Received: from localhost (10.202.226.41) by lhreml710-chm.china.huawei.com
 (10.201.108.61) with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2308.21; Wed, 16 Mar
 2022 16:50:36 +0000
Date:   Wed, 16 Mar 2022 16:50:34 +0000
From:   Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To:     Peter Xu <peterx@redhat.com>
CC:     "Michael S. Tsirkin" <mst@redhat.com>,
        Peter Maydell <peter.maydell@linaro.org>,
        Ben Widawsky <ben.widawsky@intel.com>, <qemu-devel@nongnu.org>,
        Samarth Saxena <samarths@cadence.com>,
        Chris Browy <cbrowy@avery-design.com>, <linuxarm@huawei.com>,
        <linux-cxl@vger.kernel.org>, Markus Armbruster <armbru@redhat.com>,
        "Shreyas Shah" <shreyas.shah@elastics.cloud>,
        Saransh Gupta1 <saransh@ibm.com>,
        Shameerali Kolothum Thodi 
        <shameerali.kolothum.thodi@huawei.com>,
        "Marcel Apfelbaum" <marcel@redhat.com>,
        Igor Mammedov <imammedo@redhat.com>,
        "Dan Williams" <dan.j.williams@intel.com>,
        Alex =?ISO-8859-1?Q?Benn=E9e?= <alex.bennee@linaro.org>,
        Philippe =?ISO-8859-1?Q?Mathieu-Daud=E9?= <f4bug@amsat.org>,
        Paolo Bonzini <pbonzini@redhat.com>,
        David Hildenbrand <david@redhat.com>
Subject: Re: [PATCH v7 00/46] CXl 2.0 emulation Support
Message-ID: <20220316165034.000037e7@Huawei.com>
In-Reply-To: <Yimwjtd8SsVLOU5e@xz-m1.local>
References: <20220306174137.5707-1-Jonathan.Cameron@huawei.com>
        <20220306163119-mutt-send-email-mst@kernel.org>
        <20220307093918.00002f20@Huawei.com>
        <YihiHNxVjDFI0Z8r@xz-m1.local>
        <20220309112827.00002c73@Huawei.com>
        <Yimwjtd8SsVLOU5e@xz-m1.local>
Organization: Huawei Technologies Research and Development (UK) Ltd.
X-Mailer: Claws Mail 4.0.0 (GTK+ 3.24.29; i686-w64-mingw32)
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
X-Originating-IP: [10.202.226.41]
X-ClientProxiedBy: lhreml744-chm.china.huawei.com (10.201.108.194) To
 lhreml710-chm.china.huawei.com (10.201.108.61)
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID: <linux-cxl.vger.kernel.org>
X-Mailing-List: linux-cxl@vger.kernel.org

On Thu, 10 Mar 2022 16:02:22 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Wed, Mar 09, 2022 at 11:28:27AM +0000, Jonathan Cameron wrote:
> > Hi Peter,  
> 
> Hi, Jonathan,
> 
> >   
> > > 
> > > https://lore.kernel.org/qemu-devel/20220306174137.5707-35-Jonathan.Cameron@huawei.com/
> > > 
> > > Having mr->ops set but with memory_access_is_direct() returning true sounds
> > > weird to me.
> > > 
> > > Sorry to have no understanding of the whole picture, but.. could you share
> > > more on what's the interleaving requirement on the proxying, and why it
> > > can't be done with adding some IO memory regions as sub-regions upon the
> > > file one?  
> > 
> > The proxying requirement is simply a means to read/write to a computed address
> > within a memory region. There may well be a better way to do that.
> > 
> > If I understand your suggestion correctly you would need a very high
> > number of IO memory regions to be created dynamically when particular sets of
> > registers across multiple devices in the topology are all programmed.
> > 
> > The interleave can be 256 bytes across up to 16x, many terabyte, devices.
> > So assuming a simple set of 16 1TB devices I think you'd need about 4x10^9
> > IO regions.  Even for a minimal useful test case of largest interleave
> > set of 16x 256MB devices (256MB is minimum size the specification allows per
> > decoded region at the device) and 16 way interleave we'd need 10^6 IO regions.
> > Any idea if that approach would scale sensibly to this number of regions?
> > 
> > There are also complexities to getting all the information in one place to
> > work out which IO memory region maps where in PA space. Current solution is
> > to do that mapping in the same way the hardware does which is hierarchical,
> > so we walk the path to the device, picking directions based on each interleave
> > decoder that we meet.
> > Obviously this is a bit slow but I only really care about correctness at the
> > moment.  I can think of various approaches to speeding it up but I'm not sure
> > if we will ever care about performance.
> > 
> > https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/cxl/cxl-host.c#L131
> > has the logic for that and as you can see it's fairly simple because we are always
> > going down the topology following the decoders.
> > 
> > Below I have mapped out an algorithm I think would work for doing it with
> > IO memory regions as subregions.
> > 
> > We could fake the whole thing by limiting ourselves to small host
> > memory windows which are always directly backed, but then I wouldn't
> > achieve the main aim of this which is to provide a test base for the OS code.
> > To do that I need real interleave so I can seed the files with test patterns
> > and verify the accesses hit the correct locations. Emulating what the hardware
> > is actually doing on a device by device basis is the easiest way I have
> > come up with to do that.
> > 
> > Let me try to provide some more background so you hopefully don't have
> > to have read the specs to follow what is going on!
> > There is an example of a directly connected (no switches) topology in the
> > docs
> > 
> > https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/docs/system/devices/cxl.rst
> > 
> > The overall picture is we have a large number of CXL Type 3 memory devices,
> > which at runtime (by OS at boot/on hotplug) are configured into various
> > interleaving sets with hierarchical decoding at the host + host bridge
> > + switch levels. For test setups I probably need to go to around 32 devices
> > so I can hit various configurations simultaneously.
> > No individual device has visibility of the full interleave setup - hence
> > the walk in the existing code through the various decoders to find the
> > final Device Physical address.
> > 
> > At the host level the host provides a set of Physical Address windows with
> > a fixed interleave decoding across the different host bridges in the system
> > (CXL Fixed Memory windows, CFMWs)
> > On a real system these have to be large enough to allow for any memory
> > devices that might be hotplugged and all possible configurations (so
> > with 2 host bridges you need at least 3 windows in the many TB range,
> > much worse as the number of host bridges goes up). It'll be worse than
> > this when we have QoS groups, but the current Qemu code just puts all
> > the windows in group 0.  Hence my first thought of just putting memory
> > behind those doesn't scale (a similar approach to this was in the
> > earliest versions of this patch set - though the full access path
> > wasn't wired up).
> > 
> > The interleave granularity can be in powers of 2 from 256 bytes to 16 kbytes.
> > 
> > Next each host bridge has programmable address decoders which take the
> > incoming (often already interleaved) memory access and direct them to
> > appropriate root ports.  The root ports can be connected to a switch
> > which has additional address decoders in the upstream port to decide
> > which downstream port to route to.  Note we currently only support 1 level
> > of switches but it's easy to make this algorithm recursive to support
> > multiple switch levels (currently the kernel proposals only support 1 level)
> > 
> > Finally the End Point with the actual memory receives the interleaved request and
> > takes the full address and (for power of 2 decoding - we don't yet support
> > 3,6 and 12 way which is more complex and there is no kernel support yet)
> > it drops a few address bits and adds an offset for the decoder used to
> > calculate its own device physical address.  Note device will support
> > multiple interleave sets for different parts of its file once we add
> > multiple decoder support (on the todo list).
> > 
> > So the current solution is straightforward (with the exception of that
> > proxying) because it follows the same decoding as used in real hardware
> > to route the memory accesses. As a result we get a read/write to a
> > device physical address and hence proxy that.  If any of the decoders
> > along the path are not configured then we error out at that stage.
> > 
> > To create the equivalent as IO subregions I think we'd have to do the
> > following (this might be mediated by some central entity that
> > doesn't currently exist, or done on demand from whichever CXL device
> > happens to have its decoder set up last)
> > 
> > 1) Wait for a decoder commit (enable) on any component. Goto 2.
> > 2) Walk the topology (up to host decoder, down to memory device)
> >    If a complete interleaving path has been configured -
> >    i.e. we have committed decoders all the way to the memory
> >    device goto step 3, otherwise return to step 1 to wait for
> >    more decoders to be committed.
> > 3) For the memory region being supplied by the memory device,
> >    add subregions to map the device physical address (address
> >    in the file) for each interleave stride to the appropriate
> >    host Physical Address.
> > 4) Return to step 1 to wait for more decoders to commit.
> > 
> > So summary is we can do it with IO regions, but there are a lot of them
> > and the setup is somewhat complex as we don't have one single point in
> > time where we know all the necessary information is available to compute
> > the right addresses.
> > 
> > Looking forward to your suggestions if I haven't caused more confusion!  

Hi Peter,

> 
> Thanks for the write up - I must confess they're a lot! :)
> 
> I only learned what CXL is today, and I'm not very experienced on
> device modeling either, so please bear with me with stupid questions..
> 
> IIUC so far CXL traps these memory accesses using CXLFixedWindow.mr.
> That's a normal IO region, which looks very reasonable.
> 
> However I'm confused why patch "RFC: softmmu/memory: Add ops to
> memory_region_ram_init_from_file" helped.
> 
> Per my knowledge, all the memory accesses upon this CFMW window are trapped
> using this IO region already.  There can be multiple memory file objects
> underneath, and when read/write happens the object will be decoded from
> cxl_cfmws_find_device() as you referenced.

Yes.

> 
> However I see nowhere that these memory objects got mapped as sub-regions
> into parent (CXLFixedWindow.mr).  Then I don't understand why they cannot
> be trapped.

As you note, they aren't mapped into the parent mr, hence we are trapping.
The parent mem_ops are responsible for decoding the 'which device' +
'what address in device memory space'. Once we've gotten that info
the question is how do I actually do the access?
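
To make the 'which device' + 'what address' step concrete, here is a
minimal sketch of a single-level, power-of-2 interleave decode. It is
not the code in cxl-host.c: the names (interleave_decode, decode_result,
dpa_base) are made up for illustration, and it flattens what the real
topology does hierarchically into one step.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: which interleave target, and where in it. */
struct decode_result {
    unsigned target;   /* index of the device in the interleave set */
    uint64_t dpa;      /* device physical address (offset in the file) */
};

/*
 * hpa_offset: offset of the access within the host window
 * ways:       number of interleave targets, power of 2 only
 * gran:       interleave granularity in bytes, power of 2
 * dpa_base:   offset added for the decoder used on the device
 */
static struct decode_result interleave_decode(uint64_t hpa_offset,
                                              unsigned ways,
                                              uint64_t gran,
                                              uint64_t dpa_base)
{
    struct decode_result r;

    assert(ways != 0 && (ways & (ways - 1)) == 0);
    assert(gran != 0 && (gran & (gran - 1)) == 0);

    /* The bits just above the granule offset select the target. */
    r.target = (unsigned)((hpa_offset / gran) % ways);

    /* Drop those selector bits, keep the rest, then add the base. */
    r.dpa = dpa_base + (hpa_offset / (gran * ways)) * gran
                     + (hpa_offset % gran);
    return r;
}
```

With 4 ways at 256 byte granularity, offset 0x523 lands on target 1 at
device address 0x123 (plus whatever base the decoder adds).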

Mapping as subregions seems unwise due to the huge number required.
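
For a rough sense of scale, assuming one subregion per interleave
stride at the 256 byte granularity mentioned earlier, the count per
device is just size / granularity. The helper name below is
hypothetical and only there to show the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of granule-sized subregions needed to
 * map one device of dev_bytes when interleaving at gran_bytes.
 */
static uint64_t strides_per_device(uint64_t dev_bytes, uint64_t gran_bytes)
{
    assert(gran_bytes != 0 && dev_bytes % gran_bytes == 0);
    return dev_bytes / gran_bytes;
}
```

A 1TB device gives 2^40 / 2^8 = 2^32 (about 4x10^9) strides and a 256MB
device gives 2^28 / 2^8 = 2^20 (about 10^6). Those are per device, so a
full 16 device set needs 16 times as many.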

> 
> To ask in another way: what will happen if you simply revert this RFC
> patch?  What will go wrong?

The call to memory_region_dispatch_read()
https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/mem/cxl_type3.c#L556

would call memory_region_access_valid(), which calls
mr->ops->valid.accepts(). That is set to unassigned_mem_accepts(),
so you get a MEMTX_DECODE_ERROR back and an exception in the
guest.

That wouldn't happen with a non-proxied access to the RAM, as
those paths never use the ops: memory_access_is_direct() is checked
and a plain memcpy is used without any involvement of the ops.
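
The difference between the two paths can be shown with a toy model.
This is a deliberately simplified sketch, not QEMU's code: every toy_*
name is invented here, and the only behaviour it borrows is that a
dispatch checks an accepts() hook first, while a direct access goes
straight to memcpy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum toy_result { TOY_OK = 0, TOY_DECODE_ERROR = 1 };

struct toy_region;

struct toy_ops {
    /* Optional validity hook; NULL means every access is accepted. */
    bool (*accepts)(struct toy_region *mr, uint64_t addr, unsigned size);
    uint64_t (*read)(struct toy_region *mr, uint64_t addr, unsigned size);
};

struct toy_region {
    const struct toy_ops *ops;
    uint8_t *ram;
    uint64_t len;
};

/* Stand-in for the default ops of a RAM region: rejects everything. */
static bool toy_unassigned_accepts(struct toy_region *mr, uint64_t addr,
                                   unsigned size)
{
    (void)mr; (void)addr; (void)size;
    return false;
}

static const struct toy_ops toy_unassigned_ops = {
    .accepts = toy_unassigned_accepts,
};

/* Ops that simply read the backing RAM, with no accepts() hook. */
static uint64_t toy_ram_read(struct toy_region *mr, uint64_t addr,
                             unsigned size)
{
    uint64_t val = 0;
    memcpy(&val, mr->ram + addr, size);
    return val;
}

static const struct toy_ops toy_ram_ops = { .read = toy_ram_read };

/* Direct path: never looks at mr->ops. */
static enum toy_result toy_direct_read(struct toy_region *mr, uint64_t addr,
                                       unsigned size, uint64_t *val)
{
    assert(size <= sizeof(*val) && addr + size <= mr->len);
    *val = 0;
    memcpy(val, mr->ram + addr, size);
    return TOY_OK;
}

/* Dispatch path: validity check first, then the read callback. */
static enum toy_result toy_dispatch_read(struct toy_region *mr, uint64_t addr,
                                         unsigned size, uint64_t *val)
{
    assert(size <= sizeof(*val) && addr + size <= mr->len);
    if (mr->ops->accepts && !mr->ops->accepts(mr, addr, size)) {
        return TOY_DECODE_ERROR;
    }
    *val = mr->ops->read(mr, addr, size);
    return TOY_OK;
}
```

With toy_unassigned_ops the direct read succeeds but the dispatch read
fails; swapping in toy_ram_ops makes the dispatch read return the same
data as the direct one, which is the effect the RFC patch is after.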

Is there a better way to proxy those writes to the backing files?

I was fishing a bit in the dark here and saw the existing ops defined
for a different purpose for VFIO

4a2e242bbb ("memory: Don't use memcpy for ram_device regions")

and those allowed memory_region_dispatch_write() to work.

Hence the RFC marking on that patch :)

Thanks,

Jonathan


> 
> Thanks,
> 

