On Mon, Sep 27, 2021 at 08:39:30PM +0300, Max Gurtovoy wrote:
>
> On 9/27/2021 11:09 AM, Stefan Hajnoczi wrote:
> > On Sun, Sep 26, 2021 at 05:55:18PM +0300, Max Gurtovoy wrote:
> > > To optimize performance, set the affinity of the block device tagset
> > > according to the virtio device affinity.
> > >
> > > Signed-off-by: Max Gurtovoy
> > > ---
> > >  drivers/block/virtio_blk.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > > index 9b3bd083b411..1c68c3e0ebf9 100644
> > > --- a/drivers/block/virtio_blk.c
> > > +++ b/drivers/block/virtio_blk.c
> > > @@ -774,7 +774,7 @@ static int virtblk_probe(struct virtio_device *vdev)
> > >  	memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
> > >  	vblk->tag_set.ops = &virtio_mq_ops;
> > >  	vblk->tag_set.queue_depth = queue_depth;
> > > -	vblk->tag_set.numa_node = NUMA_NO_NODE;
> > > +	vblk->tag_set.numa_node = virtio_dev_to_node(vdev);
> > >  	vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> > >  	vblk->tag_set.cmd_size =
> > >  		sizeof(struct virtblk_req) +
> > I implemented NUMA affinity in the past and could not demonstrate a
> > performance improvement:
> > https://lists.linuxfoundation.org/pipermail/virtualization/2020-June/048248.html
> >
> > The pathological case is when a guest with vNUMA has the virtio-blk-pci
> > device on the "wrong" host NUMA node. Then memory accesses should cross
> > NUMA nodes. Still, it didn't seem to matter.
>
> I think the reason you didn't see any improvement is since you didn't use
> the right device for the node query. See my patch 1/2.

That doesn't seem to be the case. Please see
drivers/base/core.c:device_add():

	/* use parent numa_node */
	if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
		set_dev_node(dev, dev_to_node(parent));

IMO it's cleaner to use dev_to_node(&vdev->dev) than to directly access
the parent. Have I missed something?

>
> I can try integrating these patches in my series and fix it.
>
> BTW, we might not see a big improvement because of other bottlenecks but
> this is known perf optimization we use often in block storage drivers.

Let's see benchmark results. Otherwise this is just dead code that adds
complexity.

Stefan
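
As a minimal sketch of the two node-lookup styles being compared in this
thread: the helper names below are hypothetical, and patch 1/2's actual
virtio_dev_to_node() is not shown here, so its shape is only an assumption
for illustration.

#include <linux/virtio.h>
#include <linux/device.h>
#include <linux/numa.h>

/*
 * Hypothetical helper that walks to the parent device (e.g. the
 * virtio-pci device) and queries its NUMA node explicitly.
 */
static int virtio_dev_to_node_via_parent(struct virtio_device *vdev)
{
	struct device *parent = vdev->dev.parent;

	return parent ? dev_to_node(parent) : NUMA_NO_NODE;
}

/*
 * The simpler form suggested in the reply: device_add() has already
 * copied the parent's node into the virtio device when it was
 * NUMA_NO_NODE, so the virtio device itself can be queried directly.
 */
static int virtio_dev_to_node_direct(struct virtio_device *vdev)
{
	return dev_to_node(&vdev->dev);
}

Under that assumption, virtblk_probe() could assign
vblk->tag_set.numa_node from either helper; the direct form simply avoids
reimplementing the parent fallback that the driver core already performs.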