From: Denis Plotnikov <dplotnikov@virtuozzo.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: "kwolf@redhat.com" <kwolf@redhat.com>,
	Denis Lunev <den@virtuozzo.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"mreitz@redhat.com" <mreitz@redhat.com>,
	Roman Kagan <rkagan@virtuozzo.com>,
	"stefanha@redhat.com" <stefanha@redhat.com>,
	"kraxel@redhat.com" <kraxel@redhat.com>
Subject: Re: [PATCH] virtio: fix IO request length in virtio SCSI/block #PSBM-78839
Date: Thu, 5 Dec 2019 07:59:54 +0000
Message-ID: <f0a0906b-0a76-fe28-331c-f08e00014cef@virtuozzo.com>
In-Reply-To: <c39d90bc-92c0-6b53-b7a6-992ec69bb0b0@virtuozzo.com>

Ping!

On 25.11.2019 12:16, Denis Plotnikov wrote:
>
>
> On 06.11.2019 15:03, Michael S. Tsirkin wrote:
>> On Thu, Oct 24, 2019 at 11:34:34AM +0000, Denis Lunev wrote:
>>> On 10/24/19 12:28 AM, Michael S. Tsirkin wrote:
>>>> On Fri, Oct 18, 2019 at 02:55:47PM +0300, Denis Plotnikov wrote:
>>>>> From: "Denis V. Lunev" <den@openvz.org>
>>>>>
>>>>> Linux guests submit IO requests no longer than PAGE_SIZE * max_seg,
>>>>> where max_seg is the field reported by the SCSI controller. Thus a
>>>>> typical sequential read of 1 MB results in the following IO pattern
>>>>> from the guest:
>>>>>    8,16   1    15754     2.766095122  2071  D   R 2095104 + 1008 [dd]
>>>>>    8,16   1    15755     2.766108785  2071  D   R 2096112 + 1008 [dd]
>>>>>    8,16   1    15756     2.766113486  2071  D   R 2097120 + 32 [dd]
>>>>>    8,16   1    15757     2.767668961     0  C   R 2095104 + 1008 [0]
>>>>>    8,16   1    15758     2.768534315     0  C   R 2096112 + 1008 [0]
>>>>>    8,16   1    15759     2.768539782     0  C   R 2097120 + 32 [0]
>>>>> The IO was generated by
>>>>>    dd if=/dev/sda of=/dev/null bs=1024 iflag=direct
>>>>>
>>>>> This effectively means that on rotational disks we will observe
>>>>> 3 IOPS for each 2 MBs processed. This definitely negatively affects
>>>>> both guest and host IO performance.
>>>>>
>>>>> The cure is relatively simple: we should report the longer
>>>>> scatter-gather capability of the SCSI controller. Fortunately the
>>>>> situation here is very good: the VirtIO transport layer can
>>>>> accommodate 1024 items in one request while we are using only 128,
>>>>> and this has been the case almost from the very beginning. 2 items
>>>>> are dedicated to request metadata, thus we should publish
>>>>> VIRTQUEUE_MAX_SIZE - 2 as max_seg.
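>>>>>
>>>>> A rough sketch of the idea as it looks for virtio-block (seg_max is
>>>>> the real virtio_blk_config field; the exact line here is
>>>>> illustrative, not the final patch):
>>>>>
>>>>>     /* advertise the full SG capacity minus the 2 descriptors
>>>>>      * reserved for the request header and status */
>>>>>     virtio_stl_p(vdev, &blkcfg.seg_max, VIRTQUEUE_MAX_SIZE - 2);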
>>>>>
>>>>> The following pattern is observed after the patch:
>>>>>    8,16   1     9921     2.662721340  2063  D   R 2095104 + 1024 [dd]
>>>>>    8,16   1     9922     2.662737585  2063  D   R 2096128 + 1024 [dd]
>>>>>    8,16   1     9923     2.665188167     0  C   R 2095104 + 1024 [0]
>>>>>    8,16   1     9924     2.665198777     0  C   R 2096128 + 1024 [0]
>>>>> which is much better.
>>>>>
>>>>> The dark side of this patch is that we are tweaking a guest-visible
>>>>> parameter, though this should be relatively safe, as the
>>>>> transport-layer support mentioned above has been present in
>>>>> QEMU/host Linux for a very long time. The patch adds a configurable
>>>>> property for VirtIO SCSI with a new default, and a hardcoded value
>>>>> for VirtIO block, which does not provide a good configuration
>>>>> framework.
>>>>>
>>>>> Unfortunately the commit cannot be applied as-is. For the real cure
>>>>> we need the guest to be fixed to accommodate that queue length,
>>>>> which is done only in the latest 4.14 kernel. Thus we are going to
>>>>> expose the property and tweak it at the machine type level.
>>>>>
>>>>> The problem with the old kernels is that they have the restriction
>>>>> max_segments <= virtqueue_size, which causes the guest to crash if
>>>>> it is violated.
>>>> This isn't just in the guests: virtio spec also seems to imply this,
>>>> or at least be vague on this point.
>>>>
>>>> So I think it'll need a feature bit.
>>>> Doing that in a safe way will also allow compatibility with old
>>>> guests.
>>>>
>>>> The only downside is it's a bit more work as we need to
>>>> spec this out and add guest support.
>>>>
>>>>> To fix the case described above on the old kernels we can increase
>>>>> virtqueue_size to 256 and max_segments to 254. The pitfall here is
>>>>> that seabios only allows virtqueue sizes up to 128; however, the
>>>>> seabios patch extending that value to 256 is pending.
>>>> And the fix here is just to limit large vq size to virtio 1.0.
>>>> In that mode it's fine I think:
>>>>
>>>>
>>>>     /* check if the queue is available */
>>>>     if (vp->use_modern) {
>>>>         num = vp_read(&vp->common, virtio_pci_common_cfg, queue_size);
>>>>         if (num > MAX_QUEUE_NUM) {
>>>>             vp_write(&vp->common, virtio_pci_common_cfg, queue_size,
>>>>                      MAX_QUEUE_NUM);
>>>>             num = vp_read(&vp->common, virtio_pci_common_cfg,
>>>>                           queue_size);
>>>>         }
>>>>     } else {
>>>>         num = vp_read(&vp->legacy, virtio_pci_legacy, queue_num);
>>>>     }
> The same seabios snippet, but more detailed:
>
> vp_find_vq()
> {
>    ...
>    /* check if the queue is available */
>    if (vp->use_modern) {
>        num = vp_read(&vp->common, virtio_pci_common_cfg, queue_size);
>        if (num > MAX_QUEUE_NUM) {
>            vp_write(&vp->common, virtio_pci_common_cfg, queue_size,
>                     MAX_QUEUE_NUM);
>            num = vp_read(&vp->common, virtio_pci_common_cfg, queue_size);
>        }
>    } else {
>        num = vp_read(&vp->legacy, virtio_pci_legacy, queue_num);
>    }
>    if (!num) {
>        dprintf(1, "ERROR: queue size is 0\n");
>        goto fail;
>    }
>    if (num > MAX_QUEUE_NUM) {
>        dprintf(1, "ERROR: queue size %d > %d\n", num, MAX_QUEUE_NUM);
>        goto fail;
>    }
> ...
> }
>
> It turned out that the problem is here, but not because of the seabios
> code: the virtqueue size is written, and then an incorrect value is
> re-read.
> Thanks to Roman Kagan (rkagan@virtuozzo.com) for investigating the
> root cause of the problem.
>
> As the code shows, for modern devices seabios reads the queue size
> and, if it is greater than seabios can support, reduces the queue size
> to the maximum value seabios supports.
>
> This doesn't work.
>
> The reason is that the size is read from the virtio device,
>
> virtio_pci_common_read()
> {
>     ...
>     case VIRTIO_PCI_COMMON_Q_SIZE:
>         val = virtio_queue_get_num(vdev, vdev->queue_sel);
>         break;
>     ...
> }
>
> but is written to the proxy
>
> virtio_pci_common_write()
> {
>     ...
>     case VIRTIO_PCI_COMMON_Q_SIZE:
>         proxy->vqs[vdev->queue_sel].num = val;
>         break;
>    ...
> }.
>
> In the final stage of the size setting, the value is propagated from
> the proxy to the device when the virtqueue is enabled:
>
> virtio_pci_common_write()
> {
>     ...
>     case VIRTIO_PCI_COMMON_Q_ENABLE:
>         virtio_queue_set_num(vdev, vdev->queue_sel,
>                              proxy->vqs[vdev->queue_sel].num);
>         virtio_queue_set_rings(vdev, vdev->queue_sel,
>                    ((uint64_t)proxy->vqs[vdev->queue_sel].desc[1]) << 32 |
>                    proxy->vqs[vdev->queue_sel].desc[0],
>                    ((uint64_t)proxy->vqs[vdev->queue_sel].avail[1]) << 32 |
>                    proxy->vqs[vdev->queue_sel].avail[0],
>                    ((uint64_t)proxy->vqs[vdev->queue_sel].used[1]) << 32 |
>                    proxy->vqs[vdev->queue_sel].used[0]);
>         proxy->vqs[vdev->queue_sel].enabled = 1;
>         break;
>     ...
> }.
>
> So we have the following workflow:
> suppose the device has virtqueue size = 256 and seabios has
> MAX_QUEUE_NUM = 128. In that case seabios works like this:
>
> 1. in vp_modern mode, read the size (256)
> 2. 256 > 128
> 3. write virtqueue size = 128
> 4. re-read virtqueue size = 256 !!!
> 5. fail because of the check
>     if (num > MAX_QUEUE_NUM) {
>         dprintf(1, "ERROR: queue size %d > %d\n", num, MAX_QUEUE_NUM);
>         goto fail;
>     }
>
> To fix the issue, we need to read and write the virtqueue size in the
> same place.
> Should we do it with the proxy?
> Is there any reason to read from the device and write to the proxy?
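>
> A minimal sketch of one possible fix, assuming we settle on the proxy
> as the single source of truth (illustrative, untested):
>
> virtio_pci_common_read()
> {
>     ...
>     case VIRTIO_PCI_COMMON_Q_SIZE:
>         /* read back what the driver wrote, i.e. the proxy value */
>         val = proxy->vqs[vdev->queue_sel].num;
>         break;
>     ...
> }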
>
> Furthermore, the size setting has a few flaws (a rough validation
> sketch follows):
>
> 1. The size being set should be a power of 2
> 2. The size being set should be less than or equal to the device's
>    virtqueue size (and be greater than 2?)
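>
> A hypothetical guard in the VIRTIO_PCI_COMMON_Q_SIZE case of
> virtio_pci_common_write() could cover both points (untested sketch):
>
>     case VIRTIO_PCI_COMMON_Q_SIZE:
>         /* reject sizes that are too small, above the device
>          * maximum, or not a power of 2 */
>         if (val < 2 || val > VIRTQUEUE_MAX_SIZE ||
>             (val & (val - 1)) != 0) {
>             break;
>         }
>         proxy->vqs[vdev->queue_sel].num = val;
>         break;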
>
> Denis
>>> you mean to put code like this into virtio_pci_realize() inside QEMU?
>>>
>>> If no, can you pls clarify which component should be touched.
>>>
>>> Den
>> I mean:
>>   - add an API to change the default queue size
>>   - add a validate-features callback; in there, check for the modern
>>     flag set in the features and increase the queue size
>>
>> Maybe all this is too much work; we could block this for transitional
>> devices, but your patch does not do that: you need to check that
>> legacy is enabled, not that modern is not disabled.
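>
> A hedged sketch of that shape (validate_features is a real
> VirtioDeviceClass callback; the body below is illustrative and
> untested):
>
> static int virtio_blk_validate_features(VirtIODevice *vdev)
> {
>     /* grow the queue only for guests that negotiated VIRTIO 1.0 */
>     if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
>         virtio_queue_set_num(vdev, 0, 256);
>     }
>     return 0;
> }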
>>
>>
>>
>

