From: Mina Almasry <almasrymina@google.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>,
	Michael Sammler <sammler@google.com>,
	 Vishal Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	 Ira Weiny <ira.weiny@intel.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	nvdimm@lists.linux.dev,  linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1] virtio_pmem: populate numa information
Date: Fri, 11 Nov 2022 13:54:58 -0800
Message-ID: <CAHS8izOYYV+dz3vPdbkipt1i1XAU-mvJOn6c_z-NJJwzUtWzDg@mail.gmail.com>
In-Reply-To: <6359ab83d6e4d_4da3294d0@dwillia2-xfh.jf.intel.com.notmuch>

On Wed, Oct 26, 2022 at 2:50 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Pankaj Gupta wrote:
> > > > > Compute the numa information for a virtio_pmem device from the memory
> > > > > range of the device. Previously, the target_node was always 0 since
> > > > > the ndr_desc.target_node field was never explicitly set. The code for
> > > > > computing the numa node is taken from cxl_pmem_region_probe in
> > > > > drivers/cxl/pmem.c.
> > > > >
> > > > > Signed-off-by: Michael Sammler <sammler@google.com>

Tested-by: Mina Almasry <almasrymina@google.com>

I don't have much expertise on this driver, but with the help of this
patch I was able to get memory tiering [1] emulation going in qemu. As
far as I know there is no other way to emulate memory tiering, so I
would love to see this patch (or an equivalent) merged, if possible.

Here is the setup I use for memory tiering emulation:

In qemu, I added these options:
      -object memory-backend-file,id=m4,share=on,mem-path="$path_to_virtio_pmem_file",size=2G \
      -smp 2,sockets=2,maxcpus=2 \
      -numa node,nodeid=0,memdev=m0 \
      -numa node,nodeid=1,memdev=m1 \
      -numa node,nodeid=2,memdev=m2,initiator=0 \
      -numa node,nodeid=3,initiator=0 \
      -device virtio-pmem-pci,memdev=m4,id=nvdimm1
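
For completeness, the surrounding command line looks roughly like the
sketch below. The m0-m2 backends, their sizes, the cpu-to-node bindings
and the maxmem headroom (which virtio-pmem needs in order to map the
backing file) are my assumptions; only the m4/virtio-pmem lines come
from the fragment above:

      qemu-system-x86_64 \
        -m 3G,maxmem=6G \
        -smp 2,sockets=2,maxcpus=2 \
        -object memory-backend-ram,id=m0,size=1G \
        -object memory-backend-ram,id=m1,size=1G \
        -object memory-backend-ram,id=m2,size=1G \
        -object memory-backend-file,id=m4,share=on,mem-path="$path_to_virtio_pmem_file",size=2G \
        -numa node,nodeid=0,cpus=0,memdev=m0 \
        -numa node,nodeid=1,cpus=1,memdev=m1 \
        -numa node,nodeid=2,memdev=m2,initiator=0 \
        -numa node,nodeid=3,initiator=0 \
        -device virtio-pmem-pci,memdev=m4,id=nvdimm1
      # (disk/kernel/console options omitted)

Note that node 3 intentionally has no memdev: it is a memoryless node
at boot, and the virtio-pmem memory is hotplugged into it later.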

On boot, ran these commands:
    # Reconfigure the pmem namespace to devdax mode (output discarded):
    ndctl_static create-namespace -e namespace0.0 -m devdax -f &> /dev/null
    # Rebind the resulting dax device from device_dax to the kmem
    # driver, which hotplugs its memory into the target node:
    echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
    echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
    # Online any memory blocks that the hotplug left offline:
    for i in `ls /sys/devices/system/memory/`; do
      state=$(cat "/sys/devices/system/memory/$i/state" 2>/dev/null)
      if [ "$state" == "offline" ]; then
        echo online_movable > "/sys/devices/system/memory/$i/state"
      fi
    done
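
Before onlining, the device's target node can be sanity-checked via the
target_node attribute the dax bus exposes in sysfs (to my knowledge;
the exact path may vary):

    cat /sys/bus/dax/devices/dax0.0/target_node
    # with the qemu config above, this prints 3 once the patch is applied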

Without this CL, the memory is always onlined into node 0 and does not
get a separate memory tier. With this CL and the qemu config above, the
memory is onlined into node 3 and placed in its own memory tier, which
enables qemu-based development:

==> /sys/devices/virtual/memory_tiering/memory_tier22/nodelist <==
3
==> /sys/devices/virtual/memory_tiering/memory_tier4/nodelist <==
0-2
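
(The listing above is each tier's nodelist, dumped with something like:

    head /sys/devices/virtual/memory_tiering/memory_tier*/nodelist

Nodes 0-2 share the default DRAM tier while node 3, backed by
virtio-pmem, lands in its own slower tier.)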

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers

> > > > > ---
> > > > >  drivers/nvdimm/virtio_pmem.c | 11 +++++++++--
> > > > >  1 file changed, 9 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> > > > > index 20da455d2ef6..a92eb172f0e7 100644
> > > > > --- a/drivers/nvdimm/virtio_pmem.c
> > > > > +++ b/drivers/nvdimm/virtio_pmem.c
> > > > > @@ -32,7 +32,6 @@ static int init_vq(struct virtio_pmem *vpmem)
> > > > >  static int virtio_pmem_probe(struct virtio_device *vdev)
> > > > >  {
> > > > >         struct nd_region_desc ndr_desc = {};
> > > > > -       int nid = dev_to_node(&vdev->dev);
> > > > >         struct nd_region *nd_region;
> > > > >         struct virtio_pmem *vpmem;
> > > > >         struct resource res;
> > > > > @@ -79,7 +78,15 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
> > > > >         dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus);
> > > > >
> > > > >         ndr_desc.res = &res;
> > > > > -       ndr_desc.numa_node = nid;
> > > > > +
> > > > > +       ndr_desc.numa_node = memory_add_physaddr_to_nid(res.start);
> > > > > +       ndr_desc.target_node = phys_to_target_node(res.start);
> > > > > +       if (ndr_desc.target_node == NUMA_NO_NODE) {
> > > > > +               ndr_desc.target_node = ndr_desc.numa_node;
> > > > > +               dev_dbg(&vdev->dev, "changing target node from %d to %d",
> > > > > +                       NUMA_NO_NODE, ndr_desc.target_node);
> > > > > +       }
> > > >
> > > > Since this memory later gets hotplugged using devm_memremap_pages(), I
> > > > don't see whether 'target_node' is used in the fsdax case?
> > > >
> > > > It seems to me "target_node" is used mainly for a volatile range above
> > > > persistent memory (e.g. the kmem driver?).
> > > >
> > > I am not sure if 'target_node' is used in the fsdax case, but it is
> > > indeed used by the devdax/kmem driver when hotplugging the memory (see
> > > 'dev_dax_kmem_probe' and '__dax_pmem_probe').
> >
> > Yes, but not currently for FS_DAX iiuc.
>
> The target_node is only used by the dax_kmem driver. In the FSDAX case
> the memory (persistent or otherwise) is mapped behind a block-device.
> That block-device has affinity to a CPU initiator, but that memory does
> not itself have any NUMA affinity or identity as a target.
>
> So:
>
> block-device NUMA node == closest CPU initiator node to the device
>
> dax-device target node == memory only NUMA node target, after onlining
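
To make Dan's breakdown concrete, here is a simplified sketch of how
dax_kmem consumes target_node, paraphrased from drivers/dax/kmem.c
(not verbatim; the details vary across kernel versions):

/*
 * Simplified sketch of dev_dax_kmem_probe() from drivers/dax/kmem.c.
 */
static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
{
	/* the memory-only target node computed when the region was created */
	int numa_node = dev_dax->target_node;

	if (numa_node < 0) {
		dev_warn(&dev_dax->dev,
			 "rejecting DAX region with invalid node: %d\n",
			 numa_node);
		return -EINVAL;
	}

	/*
	 * ... each range of the device is then hotplugged into numa_node,
	 * roughly:
	 *
	 *   add_memory_driver_managed(numa_node, range.start,
	 *                             range_len(&range), kmem_name, ...);
	 */
	return 0;
}

So without this patch the virtio-pmem-backed dax device ends up with
target_node 0 and its memory is onlined into node 0, which matches what
I observed above.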
