On Mon, Sep 27, 2021 at 08:39:30PM +0300, Max Gurtovoy wrote:
>
> On 9/27/2021 11:09 AM, Stefan Hajnoczi wrote:
> > On Sun, Sep 26, 2021 at 05:55:18PM +0300, Max Gurtovoy wrote:
> > > To optimize performance, set the affinity of the block device tagset
> > > according to the virtio device affinity.
> > >
> > > Signed-off-by: Max Gurtovoy
> > > ---
> > >  drivers/block/virtio_blk.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > > index 9b3bd083b411..1c68c3e0ebf9 100644
> > > --- a/drivers/block/virtio_blk.c
> > > +++ b/drivers/block/virtio_blk.c
> > > @@ -774,7 +774,7 @@ static int virtblk_probe(struct virtio_device *vdev)
> > >  	memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
> > >  	vblk->tag_set.ops = &virtio_mq_ops;
> > >  	vblk->tag_set.queue_depth = queue_depth;
> > > -	vblk->tag_set.numa_node = NUMA_NO_NODE;
> > > +	vblk->tag_set.numa_node = virtio_dev_to_node(vdev);
> > >  	vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> > >  	vblk->tag_set.cmd_size =
> > >  		sizeof(struct virtblk_req) +
> > I implemented NUMA affinity in the past and could not demonstrate a
> > performance improvement:
> > https://lists.linuxfoundation.org/pipermail/virtualization/2020-June/048248.html
> >
> > The pathological case is when a guest with vNUMA has the virtio-blk-pci
> > device on the "wrong" host NUMA node. Then memory accesses should cross
> > NUMA nodes. Still, it didn't seem to matter.
>
> I think the reason you didn't see any improvement is since you didn't use
> the right device for the node query. See my patch 1/2.

That doesn't seem to be the case. Please see
drivers/base/core.c:device_add():

	/* use parent numa_node */
	if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
		set_dev_node(dev, dev_to_node(parent));

IMO it's cleaner to use dev_to_node(&vdev->dev) than to directly access
the parent. Have I missed something?

>
> I can try integrating these patches in my series and fix it.
>
> BTW, we might not see a big improvement because of other bottlenecks but
> this is known perf optimization we use often in block storage drivers.

Let's see benchmark results. Otherwise this is just dead code that adds
complexity.

Stefan
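
As a minimal sketch of the two node-lookup styles being compared in this
thread: the helper names below are hypothetical, and patch 1/2's actual
virtio_dev_to_node() is not shown here, so its shape is only an assumption
for illustration.

#include <linux/virtio.h>
#include <linux/device.h>
#include <linux/numa.h>

/*
 * Hypothetical helper that walks to the parent device (e.g. the
 * virtio-pci device) and queries its NUMA node explicitly.
 */
static int virtio_dev_to_node_via_parent(struct virtio_device *vdev)
{
	struct device *parent = vdev->dev.parent;

	return parent ? dev_to_node(parent) : NUMA_NO_NODE;
}

/*
 * The simpler form suggested in the reply: device_add() has already
 * copied the parent's node into the virtio device when it was
 * NUMA_NO_NODE, so the virtio device itself can be queried directly.
 */
static int virtio_dev_to_node_direct(struct virtio_device *vdev)
{
	return dev_to_node(&vdev->dev);
}

Under that assumption, virtblk_probe() could assign
vblk->tag_set.numa_node from either helper; the direct form simply avoids
reimplementing the parent fallback that the driver core already performs.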