From: "manish.mishra" <manish.mishra@nutanix.com>
To: Peter Xu <peterx@redhat.com>
Cc: "Jason Wang" <jasowang@redhat.com>,
	"Hyman Huang" <huangy81@chinatelecom.cn>,
	qemu-devel <qemu-devel@nongnu.org>,
	"Eduardo Habkost" <eduardo@habkost.net>,
	"David Hildenbrand" <david@redhat.com>,
	"Juan Quintela" <quintela@redhat.com>,
	"Richard Henderson" <richard.henderson@linaro.org>,
	"Markus ArmBruster" <armbru@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@redhat.com>
Subject: Re: [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle
Date: Mon, 13 Jun 2022 21:03:24 +0530	[thread overview]
Message-ID: <2081c641-80ea-00a9-b42d-3ef4cbf6b387@nutanix.com> (raw)
In-Reply-To: <YqdKsNEkWsS3XDvf@xz-m1.local>


On 13/06/22 8:03 pm, Peter Xu wrote:
> On Mon, Jun 13, 2022 at 03:28:34PM +0530, manish.mishra wrote:
>> On 26/05/22 8:21 am, Jason Wang wrote:
>>> On Wed, May 25, 2022 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
>>>> On Wed, May 25, 2022 at 11:38:26PM +0800, Hyman Huang wrote:
>>>>>> 2. Also, this algorithm only controls or limits the dirty rate caused by
>>>>>> guest writes. Some memory dirtying is done by virtio-based devices and is
>>>>>> accounted only at the qemu level, so it may not show up in the dirty
>>>>>> rings; do we have a plan for that in the future? This is not an issue
>>>>>> for auto-converge, as it slows the whole VM, but the dirty rate limit
>>>>>> only slows guest writes.
>>>>>>
>>>>>   From the migration point of view, the time spent migrating memory is far
>>>>> greater than the time spent migrating devices emulated by qemu. I think we
>>>>> can address that once migrating devices costs time of the same magnitude as
>>>>> migrating memory.
>>>>>
>>>>> As for auto-converge, it throttles the vcpu by kicking it and forcing it to
>>>>> sleep periodically. The two seem to differ little in their internal
>>>>> mechanism, but auto-converge is rather "aggressive" when applying
>>>>> restraint. I'll read the auto-converge implementation code and look into
>>>>> the problem you point out.
>>>> This doesn't seem to be virtio-specific; it applies to any device DMA
>>>> writing to guest mem (if not including vfio).  But indeed virtio can
>>>> normally be faster.
>>>>
>>>> I'm also curious how fast device DMA could dirty memory.  This is a
>>>> question for all vcpu-based throttling approaches (including the
>>>> quota-based approach that was proposed on the KVM list).  Maybe for kernel
>>>> virtio drivers we can make an easier estimate?
>>> As you said below, it really depends on the speed of the backend.
>>>
>>>>    My guess is it'll be
>>>> much harder for DPDK-in-guest (aka userspace drivers) because IIUC that
>>>> could use a large chunk of guest mem.
>>> Probably, for vhost-user backend, it could be ~20Mpps or even higher.
>> Sorry for the late response on this. We experimented with I/O on a virtio-scsi based disk.
> Thanks for trying this and sharing it out.
>
>> We saw a dirty rate of ~500MBps on my system, and most of that was not tracked
>> as kvm_dirty_log. Also, for reference, I am attaching the test we used to avoid
>> tracking in KVM (see attached file).
> The number looks sane as it seems to be the sequential bandwidth for a
> disk, though I'm not 100% sure it'll work as expected, since you mmap()ed
> the region with private pages rather than shared. Given that, I'm wondering
> whether the following will happen (also based on the fact that you mapped
> twice the size of guest mem, as you mentioned in the comment):
>
>    (1) Swap-out will start to trigger after you have already read a lot of
>        data into memory: pages read earlier will be swapped out to disk (and
>        hopefully the swap device does not reside on the same virtio-scsi
>        disk, or it'll be an even more complicated scenario of mixed IOs..);
>        meanwhile, when you finish a round and start to read from offset 0
>        again, swap-in will start to happen too.  Swapping can already slow
>        things down, and I'm wondering whether the 500MB/s was really caused
>        by the swap-out rather than by backend disk reads.  More below.
>
>    (2) Another attribute of private pages, AFAICT, is that once a page has
>        been read it does not need to be read again from the virtio-scsi
>        disk.  In other words, I'm wondering whether, starting from the 2nd
>        iteration, your program won't trigger any DMA at all but will purely
>        be torturing the swap device.
>
> Maybe changing MAP_PRIVATE to MAP_SHARED would better emulate what we want
> to measure, but I'm also not 100% sure whether it would be accurate..
>
> Thanks,
>
Thanks Peter. Yes, I agree MAP_SHARED should be used here; sorry I missed that 😁.

Yes, my purpose in using a file larger than RAM_SIZE was to cause frequent
page cache flushes and re-population of page-cache pages, not to trigger
swapping. I checked that swapping was disabled on my VM; maybe MAP_PRIVATE
made no difference because the mapping was read-only.

I tested again with MAP_SHARED and it still comes to around ~500MBps.
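
For reference, the change from the attached test (quoted below) is just the
mmap flag; a minimal sketch of the MAP_SHARED mapping, using the same fd/size
variables as in the test:

    /* MAP_SHARED instead of MAP_PRIVATE, as suggested above */
    buff = (char *)mmap(0, size, PROT_READ, MAP_SHARED, fd, 0);

At ~500MBps with 4KB pages that is roughly 500 * 1024 / 4 = 128,000 pages
written per second by the backend, which, as noted earlier, was largely not
visible as kvm_dirty_log.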

Thanks

Manish Mishra

>>> Thanks
>>>
>>>> [copy Jason too]
>>>>
>>>> --
>>>> Peter Xu
>>>>
>> #include <fcntl.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <sys/mman.h>
>> #include <sys/stat.h>
>> #include <sys/time.h>
>> #include <time.h>
>> #include <unistd.h>
>>
>> #define PAGE_SIZE 4096
>> #define GB (1024 * 1024 * 1024)
>>
>> int main()
>> {
>>      char *buff;
>>      size_t size;
>>      struct stat stat;
>>      // Use a file at least double the RAM size to
>>      // achieve the maximum possible dirty rate.
>>      const char * file_name = "file_10_gb";
>>      int fd;
>>      size_t i = 0, count = 0;
>>      struct timespec ts1, ts0;
>>      double time_diff;
>>
>>      fd = open(file_name, O_RDONLY);
>>      if (fd == -1) {
>>         perror("Error opening file");
>>         exit(1);
>>      }
>>
>>      fstat(fd, &stat);
>>      size = stat.st_size;
>>      printf("File size %ld\n", (long)size);
>>
>>      buff = (char *)mmap(0, size, PROT_READ, MAP_PRIVATE, fd, 0);
>>      if (buff == MAP_FAILED) {
>>         perror("Mmap Error");
>>         exit(1);
>>      }
>>
>>      (void)clock_gettime(CLOCK_MONOTONIC, &ts0);
>>
>>      while(1) {
>>         // volatile so the page read below is not optimized away.
>>         volatile char c;
>>
>>         i = (i + PAGE_SIZE) % size;
>>         c = buff[i];
>>         count++;
>>         // Compute and print the rate every 10K pages.
>>         if (count % 10000 == 0) {
>>            (void)clock_gettime(CLOCK_MONOTONIC, &ts1);
>>            time_diff = ((double)ts1.tv_sec + ts1.tv_nsec * 1.0e-9) -
>>                        ((double)ts0.tv_sec + ts0.tv_nsec * 1.0e-9);
>>            // Rate is reported in GiB per second.
>>            printf("Expected Dirty rate %f\n", (10000.0 * PAGE_SIZE) / GB / time_diff);
>>            ts0 = ts1;
>>         }
>>      }
>>
>>      close(fd);
>>      return 0;
>> }
>
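
P.S. A note on reading the output of the test quoted above: the printf reports
the rate in GiB/s, and each sample covers 10000 pages * 4096 bytes ≈ 0.038 GiB,
so a printed value of about 0.49 corresponds to a sample interval of roughly
0.078 s, i.e. the ~500MBps figure mentioned earlier.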

