Re: [PATCH 2/2] virtio-blk: set NUMA affinity for a tagset

From: Leon Romanovsky <leon@kernel.org>
To: Max Gurtovoy <mgurtovoy@nvidia.com>
Cc: mst@redhat.com, virtualization@lists.linux-foundation.org,
	kvm@vger.kernel.org, stefanha@redhat.com, oren@nvidia.com,
	nitzanc@nvidia.com, israelr@nvidia.com, hch@infradead.org,
	linux-block@vger.kernel.org, axboe@kernel.dk,
	Yaron Gepstein <yarong@nvidia.com>,
	Jason Gunthorpe <jgg@nvidia.com>
Subject: Re: [PATCH 2/2] virtio-blk: set NUMA affinity for a tagset
Date: Tue, 28 Sep 2021 19:27:42 +0300
Message-ID: <YVNCflMxWh4m7ewU@unreal>
In-Reply-To: <f8de7c19-9f04-a458-6c1d-8133a83aa93f@nvidia.com>

On Tue, Sep 28, 2021 at 06:59:15PM +0300, Max Gurtovoy wrote:
> 
> On 9/27/2021 9:23 PM, Leon Romanovsky wrote:
> > On Mon, Sep 27, 2021 at 08:25:09PM +0300, Max Gurtovoy wrote:
> > > On 9/27/2021 2:34 PM, Leon Romanovsky wrote:
> > > > On Sun, Sep 26, 2021 at 05:55:18PM +0300, Max Gurtovoy wrote:
> > > > > To optimize performance, set the affinity of the block device tagset
> > > > > according to the virtio device affinity.
> > > > > 
> > > > > Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> > > > > ---
> > > > >    drivers/block/virtio_blk.c | 2 +-
> > > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > > > > index 9b3bd083b411..1c68c3e0ebf9 100644
> > > > > --- a/drivers/block/virtio_blk.c
> > > > > +++ b/drivers/block/virtio_blk.c
> > > > > @@ -774,7 +774,7 @@ static int virtblk_probe(struct virtio_device *vdev)
> > > > >    	memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
> > > > >    	vblk->tag_set.ops = &virtio_mq_ops;
> > > > >    	vblk->tag_set.queue_depth = queue_depth;
> > > > > -	vblk->tag_set.numa_node = NUMA_NO_NODE;
> > > > > +	vblk->tag_set.numa_node = virtio_dev_to_node(vdev);
> > > > I'm afraid that by doing this, you will increase the chances of seeing OOM,
> > > > because with NUMA_NO_NODE, MM will try to allocate memory from the whole
> > > > system, while in the latter mode only from a specific NUMA node, which can
> > > > be depleted.
> > > This is a common methodology we use in the block layer and in the NVMe
> > > subsystem, and we aren't afraid of the OOM issue you raised.
> > There are many reasons for that, but we are talking about virtio here
> > and not about NVMe.
> 
> Ok. what reasons ?

For example, NVMe devices are physical devices that rely on DMA operations,
PCI connectivity, etc. to operate. Such devices can indeed benefit from
NUMA locality hints. After all, they are physically connected to a
specific NUMA node.

In our case, virtio-blk is a software interface that doesn't have all
these limitations. On the contrary, a virtio-blk device can be created on
one CPU and later moved to be close to QEMU, which can run on another NUMA
node.

Also, this patch increases the chance of hitting OOM by a factor equal to
the number of NUMA nodes. Before your patch, virtio_blk could allocate from
X memory; after your patch, it will be X/NUM_NUMA_NODES.
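
To make that arithmetic concrete, here is a rough user-space sketch, not
kernel code. It assumes memory is split evenly across the nodes and that a
node-bound allocation never falls back to another node; both are
simplifications, and the struct and function names are made up for
illustration only.

```c
#include <assert.h>

/*
 * Hypothetical model (illustration only, not a kernel structure):
 * total system memory X, split evenly across nr_nodes NUMA nodes.
 */
struct mem_model {
	unsigned long long total_bytes;	/* X: memory in the whole system */
	unsigned int nr_nodes;		/* number of NUMA nodes */
};

/* Pool reachable when no node is requested: any node may serve it. */
unsigned long long pool_any_node(const struct mem_model *m)
{
	return m->total_bytes;
}

/*
 * Pool reachable when allocations are pinned to a single node.
 * Assumes strict binding with no fallback to other nodes.
 */
unsigned long long pool_single_node(const struct mem_model *m)
{
	assert(m->nr_nodes > 0);
	return m->total_bytes / m->nr_nodes;
}
```

Under these assumptions, a system with 64 GiB spread over 4 nodes gives a
64 GiB pool in the first case and a 16 GiB pool in the second.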

In addition, it may well hurt performance.

So yes, post v2, but as Stefan and I asked, please provide supporting
performance results, because what was done for another subsystem won't
necessarily apply here.

Thanks