* [Qemu-devel] is there a limit on the number of in-flight I/O operations?
@ 2014-07-18 14:58 Chris Friesen
  2014-07-18 15:24 ` Paolo Bonzini
  2014-07-18 15:54 ` Andrey Korolyov
  0 siblings, 2 replies; 40+ messages in thread
From: Chris Friesen @ 2014-07-18 14:58 UTC (permalink / raw)
  To: qemu-devel

Hi,

I've recently run up against an interesting issue where I had a number 
of guests running and when I started doing heavy disk I/O on a virtio 
disk (backed via ceph rbd) the memory consumption spiked and triggered 
the OOM-killer.

I want to reserve some memory for I/O, but I don't know how much it can 
use in the worst-case.

Is there a limit on the number of in-flight I/O operations?  (Preferably 
as a configurable option, but even hard-coded would be good to know as 
well.)

Thanks,
Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-18 14:58 [Qemu-devel] is there a limit on the number of in-flight I/O operations? Chris Friesen
@ 2014-07-18 15:24 ` Paolo Bonzini
  2014-07-18 16:22   ` Chris Friesen
  2014-07-18 15:54 ` Andrey Korolyov
  1 sibling, 1 reply; 40+ messages in thread
From: Paolo Bonzini @ 2014-07-18 15:24 UTC (permalink / raw)
  To: Chris Friesen, qemu-devel

On 18/07/2014 16:58, Chris Friesen wrote:
> 
> I've recently run up against an interesting issue where I had a number
> of guests running and when I started doing heavy disk I/O on a virtio
> disk (backed via ceph rbd) the memory consumption spiked and triggered
> the OOM-killer.
> 
> I want to reserve some memory for I/O, but I don't know how much it can
> use in the worst-case.
> 
> Is there a limit on the number of in-flight I/O operations?  (Preferably
> as a configurable option, but even hard-coded would be good to know as
> well.)

For rbd, there is no such limit in QEMU except the size of the virtio
ring buffer, but librbd may add limits of its own.

For files, there's no limit if you use aio=threads, but the more I/O
operations you trigger, the more threads QEMU will create.  After 10
seconds of idleness the threads are destroyed.

Also for files, the limit is 128 per disk if you use aio=native.  You
can only change it by recompiling.
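
For reference, the aio mode is selected per drive; something along these
lines (the file paths are made up, the other options are standard syntax):

-drive file=/path/to/disk.img,if=virtio,cache=none,aio=threads
-drive file=/path/to/disk.img,if=virtio,cache=none,aio=native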

Paolo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-18 14:58 [Qemu-devel] is there a limit on the number of in-flight I/O operations? Chris Friesen
  2014-07-18 15:24 ` Paolo Bonzini
@ 2014-07-18 15:54 ` Andrey Korolyov
  2014-07-18 16:26   ` Chris Friesen
  1 sibling, 1 reply; 40+ messages in thread
From: Andrey Korolyov @ 2014-07-18 15:54 UTC (permalink / raw)
  To: Chris Friesen; +Cc: qemu-devel

On Fri, Jul 18, 2014 at 6:58 PM, Chris Friesen
<chris.friesen@windriver.com> wrote:
> Hi,
>
> I've recently run up against an interesting issue where I had a number of
> guests running and when I started doing heavy disk I/O on a virtio disk
> (backed via ceph rbd) the memory consumption spiked and triggered the
> OOM-killer.
>
> I want to reserve some memory for I/O, but I don't know how much it can use
> in the worst-case.
>
> Is there a limit on the number of in-flight I/O operations?  (Preferably as
> a configurable option, but even hard-coded would be good to know as well.)
>
> Thanks,
> Chris
>

Hi, are you using per-VM cgroups, or did this happen on a bare system?
The Ceph backend has a writeback cache setting; you may be hitting it,
but it would have to be set enormously large for that.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-18 15:24 ` Paolo Bonzini
@ 2014-07-18 16:22   ` Chris Friesen
  2014-07-18 20:13     ` Paolo Bonzini
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Friesen @ 2014-07-18 16:22 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 07/18/2014 09:24 AM, Paolo Bonzini wrote:
> Il 18/07/2014 16:58, Chris Friesen ha scritto:
>>
>> I've recently run up against an interesting issue where I had a number
>> of guests running and when I started doing heavy disk I/O on a virtio
>> disk (backed via ceph rbd) the memory consumption spiked and triggered
>> the OOM-killer.
>>
>> I want to reserve some memory for I/O, but I don't know how much it can
>> use in the worst-case.
>>
>> Is there a limit on the number of in-flight I/O operations?  (Preferably
>> as a configurable option, but even hard-coded would be good to know as
>> well.)
>
> For rbd, there is no such limit in QEMU except the size of the virtio
> ring buffer, but librbd may add limits of its own.
>
> For files, there's no limit if you use aio=threads, but the more I/O
> operations you trigger the more threads QEMU will create.  After 10
> seconds of idle the threads will be destroyed.
>
> Also for files, the limit is 128 per disk if you use aio=native.  You
> can only change it by recompilation only.

Has anyone looked at enforcing some limits?  I'm okay with throttling 
performance if necessary, but I really don't want to trigger the OOM-killer.

I could do it locally, but it'd be nice to have some pointers as to 
where to start.

Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-18 15:54 ` Andrey Korolyov
@ 2014-07-18 16:26   ` Chris Friesen
  2014-07-18 16:30     ` Andrey Korolyov
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Friesen @ 2014-07-18 16:26 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: qemu-devel

On 07/18/2014 09:54 AM, Andrey Korolyov wrote:
> On Fri, Jul 18, 2014 at 6:58 PM, Chris Friesen
> <chris.friesen@windriver.com> wrote:
>> Hi,
>>
>> I've recently run up against an interesting issue where I had a number of
>> guests running and when I started doing heavy disk I/O on a virtio disk
>> (backed via ceph rbd) the memory consumption spiked and triggered the
>> OOM-killer.
>>
>> I want to reserve some memory for I/O, but I don't know how much it can use
>> in the worst-case.
>>
>> Is there a limit on the number of in-flight I/O operations?  (Preferably as
>> a configurable option, but even hard-coded would be good to know as well.)
>>
>> Thanks,
>> Chris
>>
>
> Hi, are you using per-vm cgroups or it was happened on bare system?
> Ceph backend have writeback cache setting, may be you hitting it but
> it must be set enormously large then.
>

This is without cgroups.  (I think we had tried cgroups and ran into 
some issues.)  Would cgroups even help with iSCSI/rbd/etc?

The "-drive" parameter in qemu was using "cache=none" for the VMs in 
question.  But I'm assuming it keeps the buffer around until acked by 
the far end in order to be able to handle retries.

Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-18 16:26   ` Chris Friesen
@ 2014-07-18 16:30     ` Andrey Korolyov
  2014-07-18 16:46       ` Chris Friesen
  0 siblings, 1 reply; 40+ messages in thread
From: Andrey Korolyov @ 2014-07-18 16:30 UTC (permalink / raw)
  To: Chris Friesen; +Cc: qemu-devel

On Fri, Jul 18, 2014 at 8:26 PM, Chris Friesen
<chris.friesen@windriver.com> wrote:
> On 07/18/2014 09:54 AM, Andrey Korolyov wrote:
>>
>> On Fri, Jul 18, 2014 at 6:58 PM, Chris Friesen
>> <chris.friesen@windriver.com> wrote:
>>>
>>> Hi,
>>>
>>> I've recently run up against an interesting issue where I had a number of
>>> guests running and when I started doing heavy disk I/O on a virtio disk
>>> (backed via ceph rbd) the memory consumption spiked and triggered the
>>> OOM-killer.
>>>
>>> I want to reserve some memory for I/O, but I don't know how much it can
>>> use
>>> in the worst-case.
>>>
>>> Is there a limit on the number of in-flight I/O operations?  (Preferably
>>> as
>>> a configurable option, but even hard-coded would be good to know as
>>> well.)
>>>
>>> Thanks,
>>> Chris
>>>
>>
>> Hi, are you using per-vm cgroups or it was happened on bare system?
>> Ceph backend have writeback cache setting, may be you hitting it but
>> it must be set enormously large then.
>>
>
> This is without cgroups.  (I think we had tried cgroups and ran into some
> issues.)  Would cgroups even help with iSCSI/rbd/etc?
>
> The "-drive" parameter in qemu was using "cache=none" for the VMs in
> question.  But I'm assuming it keeps the buffer around until acked by the
> far end in order to be able to handle retries.
>
> Chris
>
>

This is probably a bug even if legitimate mechanisms are causing it:
the peak memory footprint of an emulator should be predictable. I've
never hit something like this on any kind of workload; I'll try to
reproduce it myself.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-18 16:30     ` Andrey Korolyov
@ 2014-07-18 16:46       ` Chris Friesen
  0 siblings, 0 replies; 40+ messages in thread
From: Chris Friesen @ 2014-07-18 16:46 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: qemu-devel

On 07/18/2014 10:30 AM, Andrey Korolyov wrote:
> On Fri, Jul 18, 2014 at 8:26 PM, Chris Friesen
> <chris.friesen@windriver.com> wrote:
>> On 07/18/2014 09:54 AM, Andrey Korolyov wrote:
>>>
>>> On Fri, Jul 18, 2014 at 6:58 PM, Chris Friesen
>>> <chris.friesen@windriver.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I've recently run up against an interesting issue where I had a number of
>>>> guests running and when I started doing heavy disk I/O on a virtio disk
>>>> (backed via ceph rbd) the memory consumption spiked and triggered the
>>>> OOM-killer.
>>>>
>>>> I want to reserve some memory for I/O, but I don't know how much it can
>>>> use
>>>> in the worst-case.
>>>>
>>>> Is there a limit on the number of in-flight I/O operations?  (Preferably
>>>> as
>>>> a configurable option, but even hard-coded would be good to know as
>>>> well.)
>>>>
>>>> Thanks,
>>>> Chris
>>>>
>>>
>>> Hi, are you using per-vm cgroups or it was happened on bare system?
>>> Ceph backend have writeback cache setting, may be you hitting it but
>>> it must be set enormously large then.
>>>
>>
>> This is without cgroups.  (I think we had tried cgroups and ran into some
>> issues.)  Would cgroups even help with iSCSI/rbd/etc?
>>
>> The "-drive" parameter in qemu was using "cache=none" for the VMs in
>> question.  But I'm assuming it keeps the buffer around until acked by the
>> far end in order to be able to handle retries.
>>
>> Chris
>>
>>
>
> This is probably a bug even if the legitimate mechanisms causing it -
> peak memory footprint for an emulator should be predictable. Never hit
> something like this on any kind of workload, will try to reproduce by
> myself.

The drive parameter would have looked something like this:

-drive 
file=rbd:volumes/volume-7c1427d4-0758-4384-9431-653aab24a690:auth_supported=none:mon_host=192.168.205.3\:6789\;192.168.205.4\:6789\;192.168.205.5\:6789,if=none,id=drive-virtio-disk0,format=raw,serial=7c1427d4-0758-4384-9431-653aab24a690,cache=none 
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1

When we started running dbench in the guest the qemu RSS jumped 
significantly.  Also, it stayed at the higher value even after the test 
was stopped--which is not ideal behaviour.

Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-18 16:22   ` Chris Friesen
@ 2014-07-18 20:13     ` Paolo Bonzini
  2014-07-18 22:48       ` Chris Friesen
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Bonzini @ 2014-07-18 20:13 UTC (permalink / raw)
  To: Chris Friesen, qemu-devel

On 18/07/2014 18:22, Chris Friesen wrote:
> On 07/18/2014 09:24 AM, Paolo Bonzini wrote:
>> Il 18/07/2014 16:58, Chris Friesen ha scritto:
>>>
>>> I've recently run up against an interesting issue where I had a number
>>> of guests running and when I started doing heavy disk I/O on a virtio
>>> disk (backed via ceph rbd) the memory consumption spiked and triggered
>>> the OOM-killer.
>>>
>>> I want to reserve some memory for I/O, but I don't know how much it can
>>> use in the worst-case.
>>>
>>> Is there a limit on the number of in-flight I/O operations?  (Preferably
>>> as a configurable option, but even hard-coded would be good to know as
>>> well.)
>>
>> For rbd, there is no such limit in QEMU except the size of the virtio
>> ring buffer, but librbd may add limits of its own.
>>
>> For files, there's no limit if you use aio=threads, but the more I/O
>> operations you trigger the more threads QEMU will create.  After 10
>> seconds of idle the threads will be destroyed.
>>
>> Also for files, the limit is 128 per disk if you use aio=native.  You
>> can only change it by recompilation only.
> 
> Has anyone looked at enforcing some limits?  I'm okay with throttling
> performance if necessary, but I really don't want to trigger the
> OOM-killer.

I forgot about "-drive ...,iops_max=NNN". :)

Paolo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-18 20:13     ` Paolo Bonzini
@ 2014-07-18 22:48       ` Chris Friesen
  2014-07-19  5:49         ` Paolo Bonzini
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Friesen @ 2014-07-18 22:48 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 07/18/2014 02:13 PM, Paolo Bonzini wrote:
> Il 18/07/2014 18:22, Chris Friesen ha scritto:
>> On 07/18/2014 09:24 AM, Paolo Bonzini wrote:
>>> Il 18/07/2014 16:58, Chris Friesen ha scritto:
>>>>
>>>> I've recently run up against an interesting issue where I had a number
>>>> of guests running and when I started doing heavy disk I/O on a virtio
>>>> disk (backed via ceph rbd) the memory consumption spiked and triggered
>>>> the OOM-killer.
>>>>
>>>> I want to reserve some memory for I/O, but I don't know how much it can
>>>> use in the worst-case.
>>>>
>>>> Is there a limit on the number of in-flight I/O operations?  (Preferably
>>>> as a configurable option, but even hard-coded would be good to know as
>>>> well.)
>>>
>>> For rbd, there is no such limit in QEMU except the size of the virtio
>>> ring buffer, but librbd may add limits of its own.
>>>
>>> For files, there's no limit if you use aio=threads, but the more I/O
>>> operations you trigger the more threads QEMU will create.  After 10
>>> seconds of idle the threads will be destroyed.
>>>
>>> Also for files, the limit is 128 per disk if you use aio=native.  You
>>> can only change it by recompilation only.
>>
>> Has anyone looked at enforcing some limits?  I'm okay with throttling
>> performance if necessary, but I really don't want to trigger the
>> OOM-killer.
>
> I forgot about "-drive ...,iops_max=NNN". :)

I'm not sure it's actually useful though, since it specifies the max IO 
operations per second, not the maximum number of in-flight operations.

If the far end can't keep up with the requested rate of operations, then 
the number of in-flight operations is going to grow even if you throttle 
the number of operations per second.

Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-18 22:48       ` Chris Friesen
@ 2014-07-19  5:49         ` Paolo Bonzini
  2014-07-19  6:27           ` Chris Friesen
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Bonzini @ 2014-07-19  5:49 UTC (permalink / raw)
  To: Chris Friesen, qemu-devel

On 19/07/2014 00:48, Chris Friesen wrote:
>>>
>>
>> I forgot about "-drive ...,iops_max=NNN". :)
> 
> I'm not sure it's actually useful though, since it specifies the max IO
> operations per second, not the maximum number of in-flight operations.

No, that's "-drive iops=NNN".  QEMU implements a leaky bucket algorithm,
where "iops_max" gives the size of the bucket and "iops" gives the
refill rate.
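
As a rough illustration of the accounting (just a sketch of the idea, not
the actual throttling code; the names are made up):

/* Leaky-bucket sketch: "bucket_max" plays the role of iops_max (burst
 * size) and "leak_rate" plays the role of iops (refill/drain rate). */
typedef struct {
    double level;       /* I/O units currently accumulated in the bucket */
    double bucket_max;  /* burst allowance (think iops_max)              */
    double leak_rate;   /* units drained per second (think iops)         */
    double last;        /* time of the previous update, in seconds       */
} LeakyBucket;

/* Returns 1 if one more request may be submitted now, 0 if it must wait. */
static int bucket_allow(LeakyBucket *b, double now)
{
    b->level -= (now - b->last) * b->leak_rate;  /* drain for elapsed time */
    if (b->level < 0) {
        b->level = 0;
    }
    b->last = now;

    if (b->level + 1 > b->bucket_max) {
        return 0;        /* bucket full: throttle the request */
    }
    b->level += 1;       /* account for this request */
    return 1;
}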

Paolo

> If the far end can't keep up with the requested rate of operations, then
> the number of in-flight operations is going to grow even if you throttle
> the number of operations per second.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-19  5:49         ` Paolo Bonzini
@ 2014-07-19  6:27           ` Chris Friesen
  2014-07-19  7:23             ` Paolo Bonzini
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Friesen @ 2014-07-19  6:27 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 07/18/2014 11:49 PM, Paolo Bonzini wrote:
> Il 19/07/2014 00:48, Chris Friesen ha scritto:
>>>>
>>>
>>> I forgot about "-drive ...,iops_max=NNN". :)
>>
>> I'm not sure it's actually useful though, since it specifies the max IO
>> operations per second, not the maximum number of in-flight operations.
>
> No, that's "-drive iops=NNN".  QEMU implements a leaky bucket algorithm,
> where "iops_max" gives the size of the bucket and "iops" gives the
> refill rate.

Does it track in-flight operations though?  Or just how many operations 
can be requested in a given amount of time?

If it tracks how many operations can be requested, then if the "iops" 
parameter is larger than what the server can maintain then the number of 
in-flight operations could still grow indefinitely.

I suppose I'll have to check the code. :)

Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-19  6:27           ` Chris Friesen
@ 2014-07-19  7:23             ` Paolo Bonzini
  2014-07-19  8:45               ` Benoît Canet
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Bonzini @ 2014-07-19  7:23 UTC (permalink / raw)
  To: Chris Friesen, qemu-devel, Benoît Canet

On 19/07/2014 08:27, Chris Friesen wrote:
> Does it track in-flight operations though?  Or just how many operations
> can be requested in a given amount of time?

It should track in-flight operations.  However, I'm not sure it supports
the iops=0 case properly, since I do not see anything in
tracked_request_end that ceases the accounting of the current operation.
Benoit, can you answer?

Paolo

> If it tracks how many operations can be requested, then if the "iops"
> parameter is larger than what the server can maintain then the number of
> in-flight operations could still grow indefinitely.
> 
> I suppose I'll have to check the code. :)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-19  7:23             ` Paolo Bonzini
@ 2014-07-19  8:45               ` Benoît Canet
  2014-07-21 14:59                 ` Chris Friesen
  0 siblings, 1 reply; 40+ messages in thread
From: Benoît Canet @ 2014-07-19  8:45 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Chris Friesen, qemu-devel

The Saturday 19 Jul 2014 à 09:23:50 (+0200), Paolo Bonzini wrote :
> Il 19/07/2014 08:27, Chris Friesen ha scritto:
> > Does it track in-flight operations though?  Or just how many operations
> > can be requested in a given amount of time?
> 
> It should track in flight operations.  However, I'm not sure it supports
> the iops=0 case properly, since I do not see anything in
> tracked_request_end that ceases the accounting of the current operation.
>  Benoit, can you answer?
> 
> Paolo

I think in the throttling case the number of in-flight operations is limited by
the emulated hardware queue. Otherwise requests would pile up and throttling would be
ineffective.

So this number should be around #define VIRTIO_PCI_QUEUE_MAX 64, or something like that.

> 
> > If it tracks how many operations can be requested, then if the "iops"
> > parameter is larger than what the server can maintain then the number of
> > in-flight operations could still grow indefinitely.
> > 
> > I suppose I'll have to check the code. :)
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-19  8:45               ` Benoît Canet
@ 2014-07-21 14:59                 ` Chris Friesen
  2014-07-21 15:15                   ` Benoît Canet
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Friesen @ 2014-07-21 14:59 UTC (permalink / raw)
  To: Benoît Canet, Paolo Bonzini; +Cc: qemu-devel

On 07/19/2014 02:45 AM, Benoît Canet wrote:

> I think in the throttling case the number of in flight operation is limited by
> the emulated hardware queue. Else request would pile up and throttling would be
> inefective.
>
> So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like than that.

Okay, that makes sense.  Do you know how much data can be written as 
part of a single operation?  We're using 2MB hugepages for the guest 
memory, and we saw the qemu RSS numbers jump from 25-30MB during normal 
operation up to 120-180MB when running dbench.  I'd like to know what 
the worst-case would be.

Thanks,
Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-21 14:59                 ` Chris Friesen
@ 2014-07-21 15:15                   ` Benoît Canet
  2014-07-21 15:35                     ` Chris Friesen
  0 siblings, 1 reply; 40+ messages in thread
From: Benoît Canet @ 2014-07-21 15:15 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Benoît Canet, Paolo Bonzini, qemu-devel

The Monday 21 Jul 2014 à 08:59:45 (-0600), Chris Friesen wrote :
> On 07/19/2014 02:45 AM, Benoît Canet wrote:
> 
> >I think in the throttling case the number of in flight operation is limited by
> >the emulated hardware queue. Else request would pile up and throttling would be
> >inefective.
> >
> >So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like than that.
> 
> Okay, that makes sense.  Do you know how much data can be written as part of
> a single operation?  We're using 2MB hugepages for the guest memory, and we
> saw the qemu RSS numbers jump from 25-30MB during normal operation up to
> 120-180MB when running dbench.  I'd like to know what the worst-case would
> be.

I think Linux has a limit of 512KB for I/O size, or something like that.
So the guest would have at most VIRTIO_PCI_QUEUE_MAX * 512KB of in-flight buffers.

> 
> Thanks,
> Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-21 15:15                   ` Benoît Canet
@ 2014-07-21 15:35                     ` Chris Friesen
  2014-07-21 15:54                       ` Benoît Canet
                                         ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Chris Friesen @ 2014-07-21 15:35 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Paolo Bonzini, qemu-devel

On 07/21/2014 09:15 AM, Benoît Canet wrote:
> The Monday 21 Jul 2014 à 08:59:45 (-0600), Chris Friesen wrote :
>> On 07/19/2014 02:45 AM, Benoît Canet wrote:
>>
>>> I think in the throttling case the number of in flight operation is limited by
>>> the emulated hardware queue. Else request would pile up and throttling would be
>>> inefective.
>>>
>>> So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like than that.
>>
>> Okay, that makes sense.  Do you know how much data can be written as part of
>> a single operation?  We're using 2MB hugepages for the guest memory, and we
>> saw the qemu RSS numbers jump from 25-30MB during normal operation up to
>> 120-180MB when running dbench.  I'd like to know what the worst-case would
>> be.
>
> I think Linux as a limit of 512Kb for io size or something like that.
> So the guest would have VIRTIO_PCI_QUEUE_MAX * 512Kb of in flight buffers at max.

Those numbers don't line up with what we were seeing...we saw a 120MB+ 
jump when running dbench, which would map to more like 2MB per operation 
assuming a limit of VIRTIO_PCI_QUEUE_MAX (i.e. 64).

Unless there are other buffers involved that we haven't factored in...
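
Putting rough numbers on that comparison:

  expected ceiling:     64 requests * 512KB   =  32MB
  observed RSS jump:    ~120MB or more
  implied per request:  120MB / 64 requests  ~=  1.9MB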

Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-21 15:35                     ` Chris Friesen
@ 2014-07-21 15:54                       ` Benoît Canet
  2014-07-21 16:10                       ` Benoît Canet
  2014-07-21 19:47                       ` Benoît Canet
  2 siblings, 0 replies; 40+ messages in thread
From: Benoît Canet @ 2014-07-21 15:54 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Benoît Canet, Paolo Bonzini, qemu-devel

The Monday 21 Jul 2014 à 09:35:29 (-0600), Chris Friesen wrote :
> On 07/21/2014 09:15 AM, Benoît Canet wrote:
> >The Monday 21 Jul 2014 à 08:59:45 (-0600), Chris Friesen wrote :
> >>On 07/19/2014 02:45 AM, Benoît Canet wrote:
> >>
> >>>I think in the throttling case the number of in flight operation is limited by
> >>>the emulated hardware queue. Else request would pile up and throttling would be
> >>>inefective.
> >>>
> >>>So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like than that.
> >>
> >>Okay, that makes sense.  Do you know how much data can be written as part of
> >>a single operation?  We're using 2MB hugepages for the guest memory, and we
> >>saw the qemu RSS numbers jump from 25-30MB during normal operation up to
> >>120-180MB when running dbench.  I'd like to know what the worst-case would
> >>be.
> >
> >I think Linux as a limit of 512Kb for io size or something like that.
> >So the guest would have VIRTIO_PCI_QUEUE_MAX * 512Kb of in flight buffers at max.
> 
> Those numbers don't line up with what we were seeing...we saw a 120MB+ jump
> when running dbench, which would map to more like 2MB per operation assuming
> a limit of VIRTIO_PCI_QUEUE_MAX (i.e. 64).
> 
> Unless there are other buffers involved that we haven't factored in...

VIRTIO_PCI_QUEUE_MAX would account for the in-flight requests between the guest and QEMU.
What's left to study is whether the Linux guest could itself have some additional in-flight requests.
I guess so.

Best regards

Benoît

> 
> Chris
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-21 15:35                     ` Chris Friesen
  2014-07-21 15:54                       ` Benoît Canet
@ 2014-07-21 16:10                       ` Benoît Canet
  2014-08-23  0:59                         ` Chris Friesen
  2014-07-21 19:47                       ` Benoît Canet
  2 siblings, 1 reply; 40+ messages in thread
From: Benoît Canet @ 2014-07-21 16:10 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Benoît Canet, Paolo Bonzini, qemu-devel

The Monday 21 Jul 2014 à 09:35:29 (-0600), Chris Friesen wrote :
> On 07/21/2014 09:15 AM, Benoît Canet wrote:
> >The Monday 21 Jul 2014 à 08:59:45 (-0600), Chris Friesen wrote :
> >>On 07/19/2014 02:45 AM, Benoît Canet wrote:
> >>
> >>>I think in the throttling case the number of in flight operation is limited by
> >>>the emulated hardware queue. Else request would pile up and throttling would be
> >>>inefective.
> >>>
> >>>So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like than that.
> >>
> >>Okay, that makes sense.  Do you know how much data can be written as part of
> >>a single operation?  We're using 2MB hugepages for the guest memory, and we
> >>saw the qemu RSS numbers jump from 25-30MB during normal operation up to
> >>120-180MB when running dbench.  I'd like to know what the worst-case would

Sorry, I didn't understand this part at first read.

In the Linux guest, can you monitor:
benoit@Laure:~$ cat /sys/class/block/xyz/inflight ?

This would give us a fairly precise count of the requests actually in flight between the guest and QEMU.

Best regards

Benoît

> >>be.
> >
> >I think Linux as a limit of 512Kb for io size or something like that.
> >So the guest would have VIRTIO_PCI_QUEUE_MAX * 512Kb of in flight buffers at max.
> 
> Those numbers don't line up with what we were seeing...we saw a 120MB+ jump
> when running dbench, which would map to more like 2MB per operation assuming
> a limit of VIRTIO_PCI_QUEUE_MAX (i.e. 64).
> 
> Unless there are other buffers involved that we haven't factored in...
> 
> Chris
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-21 15:35                     ` Chris Friesen
  2014-07-21 15:54                       ` Benoît Canet
  2014-07-21 16:10                       ` Benoît Canet
@ 2014-07-21 19:47                       ` Benoît Canet
  2014-07-21 21:12                         ` Chris Friesen
  2 siblings, 1 reply; 40+ messages in thread
From: Benoît Canet @ 2014-07-21 19:47 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Benoît Canet, Paolo Bonzini, qemu-devel

The Monday 21 Jul 2014 à 09:35:29 (-0600), Chris Friesen wrote :
> On 07/21/2014 09:15 AM, Benoît Canet wrote:
> >The Monday 21 Jul 2014 à 08:59:45 (-0600), Chris Friesen wrote :
> >>On 07/19/2014 02:45 AM, Benoît Canet wrote:
> >>
> >>>I think in the throttling case the number of in flight operation is limited by
> >>>the emulated hardware queue. Else request would pile up and throttling would be
> >>>inefective.
> >>>
> >>>So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like than that.
> >>
> >>Okay, that makes sense.  Do you know how much data can be written as part of
> >>a single operation?  We're using 2MB hugepages for the guest memory, and we
> >>saw the qemu RSS numbers jump from 25-30MB during normal operation up to
> >>120-180MB when running dbench.  I'd like to know what the worst-case would
> >>be.

At first QEMU starts only a handful of I/O threads.
When I/O activity kicks in, additional I/O threads are started and can show up in
memory measurements because the same physical memory is referenced by multiple threads.

Are you sure this is not the case you are seeing?

The H option of ps on the host could help.

Best regards

Benoît

> >
> >I think Linux as a limit of 512Kb for io size or something like that.
> >So the guest would have VIRTIO_PCI_QUEUE_MAX * 512Kb of in flight buffers at max.
> 
> Those numbers don't line up with what we were seeing...we saw a 120MB+ jump
> when running dbench, which would map to more like 2MB per operation assuming
> a limit of VIRTIO_PCI_QUEUE_MAX (i.e. 64).
> 
> Unless there are other buffers involved that we haven't factored in...
> 
> Chris
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-21 19:47                       ` Benoît Canet
@ 2014-07-21 21:12                         ` Chris Friesen
  2014-07-21 22:04                           ` Benoît Canet
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Friesen @ 2014-07-21 21:12 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Paolo Bonzini, qemu-devel

On 07/21/2014 01:47 PM, Benoît Canet wrote:
> The Monday 21 Jul 2014 à 09:35:29 (-0600), Chris Friesen wrote :
>> On 07/21/2014 09:15 AM, Benoît Canet wrote:
>>> The Monday 21 Jul 2014 à 08:59:45 (-0600), Chris Friesen wrote :
>>>> On 07/19/2014 02:45 AM, Benoît Canet wrote:
>>>>
>>>>> I think in the throttling case the number of in flight operation is limited by
>>>>> the emulated hardware queue. Else request would pile up and throttling would be
>>>>> inefective.
>>>>>
>>>>> So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like than that.
>>>>
>>>> Okay, that makes sense.  Do you know how much data can be written as part of
>>>> a single operation?  We're using 2MB hugepages for the guest memory, and we
>>>> saw the qemu RSS numbers jump from 25-30MB during normal operation up to
>>>> 120-180MB when running dbench.  I'd like to know what the worst-case would
>>>> be.
>
> At first start QEMU start only a bunch of IO threads.
> When IOs activity kick some other IO threads will be started and can show up in
> memory measurement because the same physical memory will be referenced by multiple threads.
>
> Are you sure this is not the case you are seeing ?
>
> The H option of ps on the host could help.

I'm pretty sure that this wouldn't change the RES (aka RSS) value, since 
all those threads are sharing the same address space.

If we create IO threads on activity though, we could end up causing some 
overhead due to the per-thread stack (8MB by default).  Do we have a 
limit on how many IO threads could get created?

Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-21 21:12                         ` Chris Friesen
@ 2014-07-21 22:04                           ` Benoît Canet
  0 siblings, 0 replies; 40+ messages in thread
From: Benoît Canet @ 2014-07-21 22:04 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Benoît Canet, Paolo Bonzini, qemu-devel

The Monday 21 Jul 2014 à 15:12:31 (-0600), Chris Friesen wrote :
> On 07/21/2014 01:47 PM, Benoît Canet wrote:
> >The Monday 21 Jul 2014 à 09:35:29 (-0600), Chris Friesen wrote :
> >>On 07/21/2014 09:15 AM, Benoît Canet wrote:
> >>>The Monday 21 Jul 2014 à 08:59:45 (-0600), Chris Friesen wrote :
> >>>>On 07/19/2014 02:45 AM, Benoît Canet wrote:
> >>>>
> >>>>>I think in the throttling case the number of in flight operation is limited by
> >>>>>the emulated hardware queue. Else request would pile up and throttling would be
> >>>>>inefective.
> >>>>>
> >>>>>So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like than that.
> >>>>
> >>>>Okay, that makes sense.  Do you know how much data can be written as part of
> >>>>a single operation?  We're using 2MB hugepages for the guest memory, and we
> >>>>saw the qemu RSS numbers jump from 25-30MB during normal operation up to
> >>>>120-180MB when running dbench.  I'd like to know what the worst-case would
> >>>>be.
> >
> >At first start QEMU start only a bunch of IO threads.
> >When IOs activity kick some other IO threads will be started and can show up in
> >memory measurement because the same physical memory will be referenced by multiple threads.
> >
> >Are you sure this is not the case you are seeing ?
> >
> >The H option of ps on the host could help.
> 
> I'm pretty sure that this wouldn't change the RES (aka RSS) value, since all
> those threads are sharing the same address space.
> 
> If we create IO threads on activity though, we could end up causing some
> overhead due to the per-thread stack (8MB by default).  Do we have a limit
> on how many IO threads could get created?
> 
> Chris
> 

thread-pool.c:
pool->max_threads = 64;
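
Combined with the 8MB default stack size mentioned above, that puts a rough
worst case for thread stacks alone at:

  64 threads * 8MB  =  512MB of virtual address space

though only the stack pages a thread actually touches end up counted in RSS,
so the resident contribution should be much smaller.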

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-07-21 16:10                       ` Benoît Canet
@ 2014-08-23  0:59                         ` Chris Friesen
  2014-08-23  7:56                           ` Benoît Canet
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Friesen @ 2014-08-23  0:59 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Paolo Bonzini, qemu-devel

On 07/21/2014 10:10 AM, Benoît Canet wrote:
> The Monday 21 Jul 2014 à 09:35:29 (-0600), Chris Friesen wrote :
>> On 07/21/2014 09:15 AM, Benoît Canet wrote:
>>> The Monday 21 Jul 2014 à 08:59:45 (-0600), Chris Friesen wrote :
>>>> On 07/19/2014 02:45 AM, Benoît Canet wrote:
>>>>
>>>>> I think in the throttling case the number of in flight operation is limited by
>>>>> the emulated hardware queue. Else request would pile up and throttling would be
>>>>> inefective.
>>>>>
>>>>> So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like than that.
>>>>
>>>> Okay, that makes sense.  Do you know how much data can be written as part of
>>>> a single operation?  We're using 2MB hugepages for the guest memory, and we
>>>> saw the qemu RSS numbers jump from 25-30MB during normal operation up to
>>>> 120-180MB when running dbench.  I'd like to know what the worst-case would
>
> Sorry I didn't understood this part at first read.
>
> In the linux guest can you monitor:
> benoit@Laure:~$ cat /sys/class/block/xyz/inflight ?
>
> This would give us a faily precise number of the requests actually in flight between the guest and qemu.


After a bit of a break I'm looking at this again.

While doing "dd if=/dev/zero of=testfile bs=1M count=700" in the guest, 
I got a max "inflight" value of 181.  This seems quite a bit higher than 
VIRTIO_PCI_QUEUE_MAX.

I've seen throughput as high as ~210 MB/sec, which also kicked the RSS 
numbers up above 200MB.

I tried dropping VIRTIO_PCI_QUEUE_MAX down to 32 (it didn't seem to work 
at all for values much less than that, though I didn't bother getting an 
exact value) and it didn't really make any difference, I saw inflight 
values as high as 177.

Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-08-23  0:59                         ` Chris Friesen
@ 2014-08-23  7:56                           ` Benoît Canet
  2014-08-25 15:12                             ` Chris Friesen
  2014-08-25 21:50                             ` Chris Friesen
  0 siblings, 2 replies; 40+ messages in thread
From: Benoît Canet @ 2014-08-23  7:56 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Benoît Canet, Paolo Bonzini, qemu-devel

The Friday 22 Aug 2014 à 18:59:38 (-0600), Chris Friesen wrote :
> On 07/21/2014 10:10 AM, Benoît Canet wrote:
> >The Monday 21 Jul 2014 à 09:35:29 (-0600), Chris Friesen wrote :
> >>On 07/21/2014 09:15 AM, Benoît Canet wrote:
> >>>The Monday 21 Jul 2014 à 08:59:45 (-0600), Chris Friesen wrote :
> >>>>On 07/19/2014 02:45 AM, Benoît Canet wrote:
> >>>>
> >>>>>I think in the throttling case the number of in flight operation is limited by
> >>>>>the emulated hardware queue. Else request would pile up and throttling would be
> >>>>>inefective.
> >>>>>
> >>>>>So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like than that.
> >>>>
> >>>>Okay, that makes sense.  Do you know how much data can be written as part of
> >>>>a single operation?  We're using 2MB hugepages for the guest memory, and we
> >>>>saw the qemu RSS numbers jump from 25-30MB during normal operation up to
> >>>>120-180MB when running dbench.  I'd like to know what the worst-case would
> >
> >Sorry I didn't understood this part at first read.
> >
> >In the linux guest can you monitor:
> >benoit@Laure:~$ cat /sys/class/block/xyz/inflight ?
> >
> >This would give us a faily precise number of the requests actually in flight between the guest and qemu.
> 
> 
> After a bit of a break I'm looking at this again.
> 

Strange.

I would use dd with the flag oflag=nocache to make sure the writes do not
go through the guest cache, though.

Best regards

Benoît

> While doing "dd if=/dev/zero of=testfile bs=1M count=700" in the guest, I
> got a max "inflight" value of 181.  This seems quite a bit higher than
> VIRTIO_PCI_QUEUE_MAX.
> 
> I've seen throughput as high as ~210 MB/sec, which also kicked the RSS
> numbers up above 200MB.
> 
> I tried dropping VIRTIO_PCI_QUEUE_MAX down to 32 (it didn't seem to work at
> all for values much less than that, though I didn't bother getting an exact
> value) and it didn't really make any difference, I saw inflight values as
> high as 177.
> 
> Chris
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-08-23  7:56                           ` Benoît Canet
@ 2014-08-25 15:12                             ` Chris Friesen
  2014-08-25 17:43                               ` Chris Friesen
  2015-08-27 16:33                               ` Stefan Hajnoczi
  2014-08-25 21:50                             ` Chris Friesen
  1 sibling, 2 replies; 40+ messages in thread
From: Chris Friesen @ 2014-08-25 15:12 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Paolo Bonzini, qemu-devel

On 08/23/2014 01:56 AM, Benoît Canet wrote:
> The Friday 22 Aug 2014 à 18:59:38 (-0600), Chris Friesen wrote :
>> On 07/21/2014 10:10 AM, Benoît Canet wrote:
>>> The Monday 21 Jul 2014 à 09:35:29 (-0600), Chris Friesen wrote :
>>>> On 07/21/2014 09:15 AM, Benoît Canet wrote:
>>>>> The Monday 21 Jul 2014 à 08:59:45 (-0600), Chris Friesen wrote :
>>>>>> On 07/19/2014 02:45 AM, Benoît Canet wrote:
>>>>>>
>>>>>>> I think in the throttling case the number of in flight operation is limited by
>>>>>>> the emulated hardware queue. Else request would pile up and throttling would be
>>>>>>> inefective.
>>>>>>>
>>>>>>> So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like than that.
>>>>>>
>>>>>> Okay, that makes sense.  Do you know how much data can be written as part of
>>>>>> a single operation?  We're using 2MB hugepages for the guest memory, and we
>>>>>> saw the qemu RSS numbers jump from 25-30MB during normal operation up to
>>>>>> 120-180MB when running dbench.  I'd like to know what the worst-case would
>>>
>>> Sorry I didn't understood this part at first read.
>>>
>>> In the linux guest can you monitor:
>>> benoit@Laure:~$ cat /sys/class/block/xyz/inflight ?
>>>
>>> This would give us a faily precise number of the requests actually in flight between the guest and qemu.
>>
>>
>> After a bit of a break I'm looking at this again.
>>
>
> Strange.
>
> I would use dd with the flag oflag=nocache to make sure the write request
> does not do in the guest cache though.

I set up another test, checking the inflight value every second.

Running just "dd if=/dev/zero of=testfile2 bs=1M count=700 
oflag=nocache&" gave a bit over 100 inflight requests.

If I simultaneously run "dd if=testfile of=/dev/null bs=1M count=700 
oflag=nocache&" then then number of inflight write requests peaks at 176.

I should point out that the above numbers are with qemu 1.7.0, with a 
ceph storage backend.  qemu is started with

-drive file=rbd:cinder-volumes/.........

Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-08-25 15:12                             ` Chris Friesen
@ 2014-08-25 17:43                               ` Chris Friesen
  2015-08-27 16:37                                 ` Stefan Hajnoczi
  2015-08-27 16:33                               ` Stefan Hajnoczi
  1 sibling, 1 reply; 40+ messages in thread
From: Chris Friesen @ 2014-08-25 17:43 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Paolo Bonzini, qemu-devel

On 08/25/2014 09:12 AM, Chris Friesen wrote:

> I set up another test, checking the inflight value every second.
>
> Running just "dd if=/dev/zero of=testfile2 bs=1M count=700
> oflag=nocache&" gave a bit over 100 inflight requests.
>
> If I simultaneously run "dd if=testfile of=/dev/null bs=1M count=700
> oflag=nocache&" then then number of inflight write requests peaks at 176.
>
> I should point out that the above numbers are with qemu 1.7.0, with a
> ceph storage backend.  qemu is started with
>
> -drive file=rbd:cinder-volumes/.........

From a stacktrace that I added, it looks like the writes are coming in 
via virtio_blk_handle_output().

Looking at virtio_blk_device_init() I see it calling 
virtio_add_queue(vdev, 128, virtio_blk_handle_output);

I wondered if that 128 had anything to do with the number of inflight 
requests, so I tried recompiling with 16 instead. I still saw the number 
of inflight requests go up to 178 and the guest took a kernel panic in 
virtqueue_add_buf() so that wasn't very successful. :)

Following the code path in virtio_blk_handle_write() it looks like it 
will bundle up to 32 writes into a single large iovec-based "multiwrite" 
operation.  But from there on down I don't see a limit on how many 
writes can be outstanding at any one time.  Still checking the code 
further up the virtio call chain.

Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-08-23  7:56                           ` Benoît Canet
  2014-08-25 15:12                             ` Chris Friesen
@ 2014-08-25 21:50                             ` Chris Friesen
  2014-08-27  5:43                               ` Chris Friesen
                                                 ` (2 more replies)
  1 sibling, 3 replies; 40+ messages in thread
From: Chris Friesen @ 2014-08-25 21:50 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Paolo Bonzini, qemu-devel

On 08/23/2014 01:56 AM, Benoît Canet wrote:
> The Friday 22 Aug 2014 à 18:59:38 (-0600), Chris Friesen wrote :
>> On 07/21/2014 10:10 AM, Benoît Canet wrote:
>>> The Monday 21 Jul 2014 à 09:35:29 (-0600), Chris Friesen wrote :
>>>> On 07/21/2014 09:15 AM, Benoît Canet wrote:
>>>>> The Monday 21 Jul 2014 à 08:59:45 (-0600), Chris Friesen wrote :
>>>>>> On 07/19/2014 02:45 AM, Benoît Canet wrote:
>>>>>>
>>>>>>> I think in the throttling case the number of in flight operation is limited by
>>>>>>> the emulated hardware queue. Else request would pile up and throttling would be
>>>>>>> inefective.
>>>>>>>
>>>>>>> So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like than that.
>>>>>>
>>>>>> Okay, that makes sense.  Do you know how much data can be written as part of
>>>>>> a single operation?  We're using 2MB hugepages for the guest memory, and we
>>>>>> saw the qemu RSS numbers jump from 25-30MB during normal operation up to
>>>>>> 120-180MB when running dbench.  I'd like to know what the worst-case would
>>>
>>> Sorry I didn't understood this part at first read.
>>>
>>> In the linux guest can you monitor:
>>> benoit@Laure:~$ cat /sys/class/block/xyz/inflight ?
>>>
>>> This would give us a faily precise number of the requests actually in flight between the guest and qemu.
>>
>>
>> After a bit of a break I'm looking at this again.
>>
>
> Strange.
>
> I would use dd with the flag oflag=nocache to make sure the write request
> does not do in the guest cache though.
>
> Best regards
>
> Benoît
>
>> While doing "dd if=/dev/zero of=testfile bs=1M count=700" in the guest, I
>> got a max "inflight" value of 181.  This seems quite a bit higher than
>> VIRTIO_PCI_QUEUE_MAX.
>>
>> I've seen throughput as high as ~210 MB/sec, which also kicked the RSS
>> numbers up above 200MB.
>>
>> I tried dropping VIRTIO_PCI_QUEUE_MAX down to 32 (it didn't seem to work at
>> all for values much less than that, though I didn't bother getting an exact
>> value) and it didn't really make any difference, I saw inflight values as
>> high as 177.

I think I might have a glimmering of what's going on.  Someone please 
correct me if I get something wrong.

I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with 
respect to max inflight operations, and neither does virtio-blk calling 
virtio_add_queue() with a queue size of 128.

I think what's happening is that virtio_blk_handle_output() spins, 
pulling data off the 128-entry queue and calling 
virtio_blk_handle_request().  At this point that queue entry can be 
reused, so the queue size isn't really relevant.

In virtio_blk_handle_write() we add the request to a MultiReqBuffer and 
every 32 writes we'll call virtio_submit_multiwrite() which calls down 
into bdrv_aio_multiwrite().  That tries to merge requests and then for 
each resulting request calls bdrv_aio_writev() which ends up calling 
qemu_rbd_aio_writev(), which calls rbd_start_aio().

rbd_start_aio() allocates a buffer and converts from iovec to a single 
buffer.  This buffer stays allocated until the request is acked, which 
is where the bulk of the memory overhead with rbd is coming from (has 
anyone considered adding iovec support to rbd to avoid this extra copy?).
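
As an illustration of the copy involved (a sketch of the pattern only, not
the actual rbd_start_aio() code):

#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/* Flatten an iovec into one contiguous buffer.  In the rbd case the
 * contiguous copy lives until the request is acked, so each in-flight
 * write roughly doubles the memory it holds. */
static char *flatten_iovec(const struct iovec *iov, int iovcnt, size_t *total)
{
    size_t size = 0, off = 0;
    char *buf;
    int i;

    for (i = 0; i < iovcnt; i++) {
        size += iov[i].iov_len;
    }
    buf = malloc(size);              /* freed only when the request completes */
    if (!buf) {
        return NULL;
    }
    for (i = 0; i < iovcnt; i++) {
        memcpy(buf + off, iov[i].iov_base, iov[i].iov_len);
        off += iov[i].iov_len;
    }
    *total = size;
    return buf;
}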

The only limit I see in the whole call chain from 
virtio_blk_handle_request() on down is the call to 
bdrv_io_limits_intercept() in bdrv_co_do_writev().  However, that 
doesn't provide any limit on the absolute number of inflight operations, 
only on operations/sec.  If the ceph server cluster can't keep up with 
the aggregate load, then the number of inflight operations can still 
grow indefinitely.
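
For what it's worth, the kind of absolute cap I have in mind would amount to
something like the sketch below (hypothetical; the names are invented and
nothing like this exists in the call chain described above):

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             in_flight;
    int             max_in_flight;
} InflightLimit;

static InflightLimit limit = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .cond = PTHREAD_COND_INITIALIZER,
    .in_flight = 0,
    .max_in_flight = 64,
};

/* Called before submitting a request; blocks once the cap is reached,
 * which is what bounds memory instead of letting requests pile up. */
static void inflight_acquire(InflightLimit *l)
{
    pthread_mutex_lock(&l->lock);
    while (l->in_flight >= l->max_in_flight) {
        pthread_cond_wait(&l->cond, &l->lock);
    }
    l->in_flight++;
    pthread_mutex_unlock(&l->lock);
}

/* Called from the completion callback; frees a slot for the next request. */
static void inflight_release(InflightLimit *l)
{
    pthread_mutex_lock(&l->lock);
    l->in_flight--;
    pthread_cond_signal(&l->cond);
    pthread_mutex_unlock(&l->lock);
}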

Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-08-25 21:50                             ` Chris Friesen
@ 2014-08-27  5:43                               ` Chris Friesen
  2015-05-14 13:42                                 ` Andrey Korolyov
  2015-08-27 16:48                               ` Stefan Hajnoczi
  2015-08-27 16:49                               ` Stefan Hajnoczi
  2 siblings, 1 reply; 40+ messages in thread
From: Chris Friesen @ 2014-08-27  5:43 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Paolo Bonzini, qemu-devel

On 08/25/2014 03:50 PM, Chris Friesen wrote:

> I think I might have a glimmering of what's going on.  Someone please
> correct me if I get something wrong.
>
> I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with
> respect to max inflight operations, and neither does virtio-blk calling
> virtio_add_queue() with a queue size of 128.
>
> I think what's happening is that virtio_blk_handle_output() spins,
> pulling data off the 128-entry queue and calling
> virtio_blk_handle_request().  At this point that queue entry can be
> reused, so the queue size isn't really relevant.
>
> In virtio_blk_handle_write() we add the request to a MultiReqBuffer and
> every 32 writes we'll call virtio_submit_multiwrite() which calls down
> into bdrv_aio_multiwrite().  That tries to merge requests and then for
> each resulting request calls bdrv_aio_writev() which ends up calling
> qemu_rbd_aio_writev(), which calls rbd_start_aio().
>
> rbd_start_aio() allocates a buffer and converts from iovec to a single
> buffer.  This buffer stays allocated until the request is acked, which
> is where the bulk of the memory overhead with rbd is coming from (has
> anyone considered adding iovec support to rbd to avoid this extra copy?).
>
> The only limit I see in the whole call chain from
> virtio_blk_handle_request() on down is the call to
> bdrv_io_limits_intercept() in bdrv_co_do_writev().  However, that
> doesn't provide any limit on the absolute number of inflight operations,
> only on operations/sec.  If the ceph server cluster can't keep up with
> the aggregate load, then the number of inflight operations can still
> grow indefinitely.
>
> Chris

I was a bit concerned that I'd need to extend the IO throttling code to 
support a limit on total inflight bytes, but it doesn't look like that 
will be necessary.

It seems that using mallopt() to set the trim/mmap thresholds to 128K is 
enough to minimize the increase in RSS and also drop it back down after 
an I/O burst.  For now this looks like it should be sufficient for our 
purposes.
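
For reference, the tuning amounts to something like this (M_TRIM_THRESHOLD
and M_MMAP_THRESHOLD are the standard glibc mallopt() parameters; the
function wrapping them here is only illustrative):

#include <malloc.h>

/* Lower the glibc malloc thresholds so freed I/O buffers are returned to
 * the kernel promptly instead of lingering in the heap and inflating RSS. */
static void tune_malloc_for_io_bursts(void)
{
    mallopt(M_TRIM_THRESHOLD, 128 * 1024);  /* trim the heap above 128K free */
    mallopt(M_MMAP_THRESHOLD, 128 * 1024);  /* serve blocks >= 128K via mmap */
}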

I'm actually a bit surprised I didn't have to go lower, but it seems to 
work for both "dd" and dbench testcases so we'll give it a try.

Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-08-27  5:43                               ` Chris Friesen
@ 2015-05-14 13:42                                 ` Andrey Korolyov
  2015-08-26 17:10                                   ` Andrey Korolyov
  0 siblings, 1 reply; 40+ messages in thread
From: Andrey Korolyov @ 2015-05-14 13:42 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Benoît Canet, Paolo Bonzini, qemu-devel

On Wed, Aug 27, 2014 at 9:43 AM, Chris Friesen
<chris.friesen@windriver.com> wrote:
> On 08/25/2014 03:50 PM, Chris Friesen wrote:
>
>> I think I might have a glimmering of what's going on.  Someone please
>> correct me if I get something wrong.
>>
>> I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with
>> respect to max inflight operations, and neither does virtio-blk calling
>> virtio_add_queue() with a queue size of 128.
>>
>> I think what's happening is that virtio_blk_handle_output() spins,
>> pulling data off the 128-entry queue and calling
>> virtio_blk_handle_request().  At this point that queue entry can be
>> reused, so the queue size isn't really relevant.
>>
>> In virtio_blk_handle_write() we add the request to a MultiReqBuffer and
>> every 32 writes we'll call virtio_submit_multiwrite() which calls down
>> into bdrv_aio_multiwrite().  That tries to merge requests and then for
>> each resulting request calls bdrv_aio_writev() which ends up calling
>> qemu_rbd_aio_writev(), which calls rbd_start_aio().
>>
>> rbd_start_aio() allocates a buffer and converts from iovec to a single
>> buffer.  This buffer stays allocated until the request is acked, which
>> is where the bulk of the memory overhead with rbd is coming from (has
>> anyone considered adding iovec support to rbd to avoid this extra copy?).
>>
>> The only limit I see in the whole call chain from
>> virtio_blk_handle_request() on down is the call to
>> bdrv_io_limits_intercept() in bdrv_co_do_writev().  However, that
>> doesn't provide any limit on the absolute number of inflight operations,
>> only on operations/sec.  If the ceph server cluster can't keep up with
>> the aggregate load, then the number of inflight operations can still
>> grow indefinitely.
>>
>> Chris
>
>
> I was a bit concerned that I'd need to extend the IO throttling code to
> support a limit on total inflight bytes, but it doesn't look like that will
> be necessary.
>
> It seems that using mallopt() to set the trim/mmap thresholds to 128K is
> enough to minimize the increase in RSS and also drop it back down after an
> I/O burst.  For now this looks like it should be sufficient for our
> purposes.
>
> I'm actually a bit surprised I didn't have to go lower, but it seems to work
> for both "dd" and dbench testcases so we'll give it a try.
>
> Chris
>

Bumping this...

For now, we occasionally suffer from an unlimited cache growth issue
which can be observed on all post-1.4 versions of qemu with the rbd
backend in writeback mode and a certain pattern of guest operations.
The issue is confirmed for virtio and can be re-triggered by issuing
an excessive amount of write requests without the acks returned from
the emulator's cache being completed in time. Since most applications
behave in the right way, the OOM issue is very rare (and we developed
an ugly workaround for such situations long ago). If anybody is
interested in fixing this, I can send a prepared image for reproduction
or instructions to make one, whichever is preferable.

Thanks!

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2015-05-14 13:42                                 ` Andrey Korolyov
@ 2015-08-26 17:10                                   ` Andrey Korolyov
  2015-08-26 23:31                                     ` Josh Durgin
  0 siblings, 1 reply; 40+ messages in thread
From: Andrey Korolyov @ 2015-08-26 17:10 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Benoît Canet, Paolo Bonzini, Josh Durgin, qemu-devel

On Thu, May 14, 2015 at 4:42 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
> On Wed, Aug 27, 2014 at 9:43 AM, Chris Friesen
> <chris.friesen@windriver.com> wrote:
>> On 08/25/2014 03:50 PM, Chris Friesen wrote:
>>
>>> I think I might have a glimmering of what's going on.  Someone please
>>> correct me if I get something wrong.
>>>
>>> I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with
>>> respect to max inflight operations, and neither does virtio-blk calling
>>> virtio_add_queue() with a queue size of 128.
>>>
>>> I think what's happening is that virtio_blk_handle_output() spins,
>>> pulling data off the 128-entry queue and calling
>>> virtio_blk_handle_request().  At this point that queue entry can be
>>> reused, so the queue size isn't really relevant.
>>>
>>> In virtio_blk_handle_write() we add the request to a MultiReqBuffer and
>>> every 32 writes we'll call virtio_submit_multiwrite() which calls down
>>> into bdrv_aio_multiwrite().  That tries to merge requests and then for
>>> each resulting request calls bdrv_aio_writev() which ends up calling
>>> qemu_rbd_aio_writev(), which calls rbd_start_aio().
>>>
>>> rbd_start_aio() allocates a buffer and converts from iovec to a single
>>> buffer.  This buffer stays allocated until the request is acked, which
>>> is where the bulk of the memory overhead with rbd is coming from (has
>>> anyone considered adding iovec support to rbd to avoid this extra copy?).
>>>
>>> The only limit I see in the whole call chain from
>>> virtio_blk_handle_request() on down is the call to
>>> bdrv_io_limits_intercept() in bdrv_co_do_writev().  However, that
>>> doesn't provide any limit on the absolute number of inflight operations,
>>> only on operations/sec.  If the ceph server cluster can't keep up with
>>> the aggregate load, then the number of inflight operations can still
>>> grow indefinitely.
>>>
>>> Chris
>>
>>
>> I was a bit concerned that I'd need to extend the IO throttling code to
>> support a limit on total inflight bytes, but it doesn't look like that will
>> be necessary.
>>
>> It seems that using mallopt() to set the trim/mmap thresholds to 128K is
>> enough to minimize the increase in RSS and also drop it back down after an
>> I/O burst.  For now this looks like it should be sufficient for our
>> purposes.
>>
>> I'm actually a bit surprised I didn't have to go lower, but it seems to work
>> for both "dd" and dbench testcases so we'll give it a try.
>>
>> Chris
>>
>
> Bumping this...
>
> For now, we are rarely suffering with an unlimited cache growth issue
> which can be observed on all post-1.4 versions of qemu with rbd
> backend in a writeback mode and certain pattern of a guest operations.
> The issue is confirmed for virtio and can be re-triggered by issuing
> excessive amount of write requests without completing returned acks
> from a emulator` cache timely. Since most applications behave in a
> right way, the oom issue is very rare (and we developed an ugly
> workaround for such situations long ago). If anybody is interested in
> fixing this, I can send a prepared image for a reproduction or
> instructions to make one, whichever is preferable.
>
> Thanks!

A gentle bump: at least for the rbd backend with a writethrough or
writeback cache, unbounded growth can be achieved with a lot of large
unfinished ops, which can be considered a DoS. In the wild it is usually
triggered by poorly written applications, like proprietary KV databases
or MSSQL under Windows, but regular applications, primarily OSS
databases, can easily push RSS growth into the hundreds of megabytes.
There is probably no straightforward way to limit the in-flight request
size by re-chunking it, since a malicious guest could inflate it to very
high numbers, but it is acceptable to crash such a guest; protecting
real-world workloads with a simple in-flight op count limiter looks like
the more achievable option.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2015-08-26 17:10                                   ` Andrey Korolyov
@ 2015-08-26 23:31                                     ` Josh Durgin
  2015-08-26 23:47                                       ` Andrey Korolyov
  0 siblings, 1 reply; 40+ messages in thread
From: Josh Durgin @ 2015-08-26 23:31 UTC (permalink / raw)
  To: Andrey Korolyov, Chris Friesen
  Cc: Benoît Canet, Paolo Bonzini, qemu-devel

On 08/26/2015 10:10 AM, Andrey Korolyov wrote:
> On Thu, May 14, 2015 at 4:42 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>> On Wed, Aug 27, 2014 at 9:43 AM, Chris Friesen
>> <chris.friesen@windriver.com> wrote:
>>> On 08/25/2014 03:50 PM, Chris Friesen wrote:
>>>
>>>> I think I might have a glimmering of what's going on.  Someone please
>>>> correct me if I get something wrong.
>>>>
>>>> I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with
>>>> respect to max inflight operations, and neither does virtio-blk calling
>>>> virtio_add_queue() with a queue size of 128.
>>>>
>>>> I think what's happening is that virtio_blk_handle_output() spins,
>>>> pulling data off the 128-entry queue and calling
>>>> virtio_blk_handle_request().  At this point that queue entry can be
>>>> reused, so the queue size isn't really relevant.
>>>>
>>>> In virtio_blk_handle_write() we add the request to a MultiReqBuffer and
>>>> every 32 writes we'll call virtio_submit_multiwrite() which calls down
>>>> into bdrv_aio_multiwrite().  That tries to merge requests and then for
>>>> each resulting request calls bdrv_aio_writev() which ends up calling
>>>> qemu_rbd_aio_writev(), which calls rbd_start_aio().
>>>>
>>>> rbd_start_aio() allocates a buffer and converts from iovec to a single
>>>> buffer.  This buffer stays allocated until the request is acked, which
>>>> is where the bulk of the memory overhead with rbd is coming from (has
>>>> anyone considered adding iovec support to rbd to avoid this extra copy?).
>>>>
>>>> The only limit I see in the whole call chain from
>>>> virtio_blk_handle_request() on down is the call to
>>>> bdrv_io_limits_intercept() in bdrv_co_do_writev().  However, that
>>>> doesn't provide any limit on the absolute number of inflight operations,
>>>> only on operations/sec.  If the ceph server cluster can't keep up with
>>>> the aggregate load, then the number of inflight operations can still
>>>> grow indefinitely.
>>>>
>>>> Chris
>>>
>>>
>>> I was a bit concerned that I'd need to extend the IO throttling code to
>>> support a limit on total inflight bytes, but it doesn't look like that will
>>> be necessary.
>>>
>>> It seems that using mallopt() to set the trim/mmap thresholds to 128K is
>>> enough to minimize the increase in RSS and also drop it back down after an
>>> I/O burst.  For now this looks like it should be sufficient for our
>>> purposes.
>>>
>>> I'm actually a bit surprised I didn't have to go lower, but it seems to work
>>> for both "dd" and dbench testcases so we'll give it a try.
>>>
>>> Chris
>>>
>>
>> Bumping this...
>>
>> For now, we are rarely suffering with an unlimited cache growth issue
>> which can be observed on all post-1.4 versions of qemu with rbd
>> backend in a writeback mode and certain pattern of a guest operations.
>> The issue is confirmed for virtio and can be re-triggered by issuing
>> excessive amount of write requests without completing returned acks
>> from a emulator` cache timely. Since most applications behave in a
>> right way, the oom issue is very rare (and we developed an ugly
>> workaround for such situations long ago). If anybody is interested in
>> fixing this, I can send a prepared image for a reproduction or
>> instructions to make one, whichever is preferable.
>>
>> Thanks!
>
> A gentle bump: for at least rbd backend with writethrough/writeback
> cache it is possible to achieve unlimited growth with lot of large
> unfinished ops, what can be considered as a DoS. Usually it is
> triggered by poorly written applications in the wild, like proprietary
> KV databases or MSSQL under Windows, but regular applications,
> primarily OSS databases, can trigger the RSS growth for hundreds of
> megabytes just easily. There is probably no straight way to limit
> in-flight request size by re-chunking it, as supposedly malicious
> guest can inflate it up to very high numbers, but it`s fine to crash
> such a guest, saving real-world stuff with simple in-flight op count
> limiter looks like more achievable option.

Hey, sorry I missed this thread before.

What version of ceph are you running? There was an issue with ceph
0.80.8 and earlier that could cause lots of extra memory usage by rbd's
cache (even in writethrough mode) due to copy-on-write triggering 
whole-object (default 4MB) reads, and sticking those in the cache without
proper throttling [1]. I'm wondering if this could be causing the large
RSS growth you're seeing.
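
To put a rough number on that effect (a back-of-the-envelope sketch; the
128 in-flight writes below are an assumption for illustration, not a
measurement from this thread): each small write to a not-yet-copied
object of a clone can drag a whole 4MB parent object into the cache, so
a burst of small writes quickly turns into hundreds of megabytes of
cached data.

/* Illustrative worst-case estimate of rbd cache growth caused by
 * copy-on-write whole-object reads.  All numbers are assumptions. */
#include <stdio.h>

int main(void)
{
    const unsigned long object_size = 4UL << 20; /* default rbd object size: 4 MB */
    const unsigned long write_size  = 4UL << 10; /* a single 4 KB guest write */
    const unsigned long inflight    = 128;       /* assumed number of in-flight writes */

    /* Each 4 KB write hitting an unallocated object of a clone may read
     * the entire 4 MB parent object into the cache before completing. */
    unsigned long cached = inflight * object_size;

    printf("%lu writes of %lu KB -> up to %lu MB pulled into the cache\n",
           inflight, write_size >> 10, cached >> 20);
    return 0;
}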

In-flight requests do have buffers and structures allocated for them in
librbd, but these should have lower overhead than cow. If these are the
problem, it seems to me a generic limit on in-flight ops in qemu would
be a reasonable fix. Other backends have resources tied up by in-flight
ops as well.
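
For what that could look like, here is a minimal standalone sketch of
such a limiter (not QEMU code - the names and the pthread-based blocking
are purely illustrative, and a real implementation inside QEMU would sit
in the block layer and use coroutines rather than blocking a thread):

#include <pthread.h>

/* Hypothetical in-flight op limiter: submission blocks once a fixed
 * number of requests is outstanding; completions wake the submitter. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    unsigned        inflight;
    unsigned        max_inflight;
} OpLimiter;

static void op_limiter_init(OpLimiter *l, unsigned max_inflight)
{
    pthread_mutex_init(&l->lock, NULL);
    pthread_cond_init(&l->cond, NULL);
    l->inflight = 0;
    l->max_inflight = max_inflight;   /* e.g. 64; an assumed default */
}

/* Call before handing a request to the backend (librbd, aio, ...). */
static void op_limiter_acquire(OpLimiter *l)
{
    pthread_mutex_lock(&l->lock);
    while (l->inflight >= l->max_inflight) {
        pthread_cond_wait(&l->cond, &l->lock);
    }
    l->inflight++;
    pthread_mutex_unlock(&l->lock);
}

/* Call from the completion callback. */
static void op_limiter_release(OpLimiter *l)
{
    pthread_mutex_lock(&l->lock);
    l->inflight--;
    pthread_cond_signal(&l->cond);
    pthread_mutex_unlock(&l->lock);
}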

Josh

[1] https://github.com/ceph/ceph/pull/3410

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2015-08-26 23:31                                     ` Josh Durgin
@ 2015-08-26 23:47                                       ` Andrey Korolyov
  2015-08-27  0:56                                         ` Josh Durgin
  0 siblings, 1 reply; 40+ messages in thread
From: Andrey Korolyov @ 2015-08-26 23:47 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Benoît Canet, Chris Friesen, qemu-devel, Paolo Bonzini

On Thu, Aug 27, 2015 at 2:31 AM, Josh Durgin <jdurgin@redhat.com> wrote:
> On 08/26/2015 10:10 AM, Andrey Korolyov wrote:
>>
>> On Thu, May 14, 2015 at 4:42 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>
>>> On Wed, Aug 27, 2014 at 9:43 AM, Chris Friesen
>>> <chris.friesen@windriver.com> wrote:
>>>>
>>>> On 08/25/2014 03:50 PM, Chris Friesen wrote:
>>>>
>>>>> I think I might have a glimmering of what's going on.  Someone please
>>>>> correct me if I get something wrong.
>>>>>
>>>>> I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with
>>>>> respect to max inflight operations, and neither does virtio-blk calling
>>>>> virtio_add_queue() with a queue size of 128.
>>>>>
>>>>> I think what's happening is that virtio_blk_handle_output() spins,
>>>>> pulling data off the 128-entry queue and calling
>>>>> virtio_blk_handle_request().  At this point that queue entry can be
>>>>> reused, so the queue size isn't really relevant.
>>>>>
>>>>> In virtio_blk_handle_write() we add the request to a MultiReqBuffer and
>>>>> every 32 writes we'll call virtio_submit_multiwrite() which calls down
>>>>> into bdrv_aio_multiwrite().  That tries to merge requests and then for
>>>>> each resulting request calls bdrv_aio_writev() which ends up calling
>>>>> qemu_rbd_aio_writev(), which calls rbd_start_aio().
>>>>>
>>>>> rbd_start_aio() allocates a buffer and converts from iovec to a single
>>>>> buffer.  This buffer stays allocated until the request is acked, which
>>>>> is where the bulk of the memory overhead with rbd is coming from (has
>>>>> anyone considered adding iovec support to rbd to avoid this extra
>>>>> copy?).
>>>>>
>>>>> The only limit I see in the whole call chain from
>>>>> virtio_blk_handle_request() on down is the call to
>>>>> bdrv_io_limits_intercept() in bdrv_co_do_writev().  However, that
>>>>> doesn't provide any limit on the absolute number of inflight
>>>>> operations,
>>>>> only on operations/sec.  If the ceph server cluster can't keep up with
>>>>> the aggregate load, then the number of inflight operations can still
>>>>> grow indefinitely.
>>>>>
>>>>> Chris
>>>>
>>>>
>>>>
>>>> I was a bit concerned that I'd need to extend the IO throttling code to
>>>> support a limit on total inflight bytes, but it doesn't look like that
>>>> will
>>>> be necessary.
>>>>
>>>> It seems that using mallopt() to set the trim/mmap thresholds to 128K is
>>>> enough to minimize the increase in RSS and also drop it back down after
>>>> an
>>>> I/O burst.  For now this looks like it should be sufficient for our
>>>> purposes.
>>>>
>>>> I'm actually a bit surprised I didn't have to go lower, but it seems to
>>>> work
>>>> for both "dd" and dbench testcases so we'll give it a try.
>>>>
>>>> Chris
>>>>
>>>
>>> Bumping this...
>>>
>>> For now, we are rarely suffering with an unlimited cache growth issue
>>> which can be observed on all post-1.4 versions of qemu with rbd
>>> backend in a writeback mode and certain pattern of a guest operations.
>>> The issue is confirmed for virtio and can be re-triggered by issuing
>>> excessive amount of write requests without completing returned acks
>>> from a emulator` cache timely. Since most applications behave in a
>>> right way, the oom issue is very rare (and we developed an ugly
>>> workaround for such situations long ago). If anybody is interested in
>>> fixing this, I can send a prepared image for a reproduction or
>>> instructions to make one, whichever is preferable.
>>>
>>> Thanks!
>>
>>
>> A gentle bump: for at least rbd backend with writethrough/writeback
>> cache it is possible to achieve unlimited growth with lot of large
>> unfinished ops, what can be considered as a DoS. Usually it is
>> triggered by poorly written applications in the wild, like proprietary
>> KV databases or MSSQL under Windows, but regular applications,
>> primarily OSS databases, can trigger the RSS growth for hundreds of
>> megabytes just easily. There is probably no straight way to limit
>> in-flight request size by re-chunking it, as supposedly malicious
>> guest can inflate it up to very high numbers, but it`s fine to crash
>> such a guest, saving real-world stuff with simple in-flight op count
>> limiter looks like more achievable option.
>
>
> Hey, sorry I missed this thread before.
>
> What version of ceph are you running? There was an issue with ceph
> 0.80.8 and earlier that could cause lots of extra memory usage by rbd's
> cache (even in writethrough mode) due to copy-on-write triggering
> whole-object (default 4MB) reads, and sticking those in the cache without
> proper throttling [1]. I'm wondering if this could be causing the large
> RSS growth you're seeing.
>
> In-flight requests do have buffers and structures allocated for them in
> librbd, but these should have lower overhead than cow. If these are the
> problem, it seems to me a generic limit on in flight ops in qemu would
> be a reasonable fix. Other backends have resources tied up by in-flight
> ops as well.
>
> Josh
>
> [1] https://github.com/ceph/ceph/pull/3410
>
>
>

I honestly believe that this is the second case. I have had your pull
in my dumpling branch since mid-February, but the number of 'near-oom'
events over the last few months has stayed the same as in earlier
times, ranging from a hundred megabytes to a gigabyte above the
theoretical ceiling of the VM's consumption. Since the issue is very
reactive by nature, e.g. the RSS can grow fast, shrink fast and
eventually hit the cgroup limit, I have only a bare reproducer and a
couple of indirect symptoms which are driving my thoughts in the
direction above - there is still no direct confirmation that unfinished
disk requests always cause unbounded additional memory allocation.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2015-08-26 23:47                                       ` Andrey Korolyov
@ 2015-08-27  0:56                                         ` Josh Durgin
  0 siblings, 0 replies; 40+ messages in thread
From: Josh Durgin @ 2015-08-27  0:56 UTC (permalink / raw)
  To: Andrey Korolyov
  Cc: Benoît Canet, Chris Friesen, qemu-devel, Paolo Bonzini

On 08/26/2015 04:47 PM, Andrey Korolyov wrote:
> On Thu, Aug 27, 2015 at 2:31 AM, Josh Durgin <jdurgin@redhat.com> wrote:
>> On 08/26/2015 10:10 AM, Andrey Korolyov wrote:
>>>
>>> On Thu, May 14, 2015 at 4:42 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
>>>>
>>>> On Wed, Aug 27, 2014 at 9:43 AM, Chris Friesen
>>>> <chris.friesen@windriver.com> wrote:
>>>>>
>>>>> On 08/25/2014 03:50 PM, Chris Friesen wrote:
>>>>>
>>>>>> I think I might have a glimmering of what's going on.  Someone please
>>>>>> correct me if I get something wrong.
>>>>>>
>>>>>> I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with
>>>>>> respect to max inflight operations, and neither does virtio-blk calling
>>>>>> virtio_add_queue() with a queue size of 128.
>>>>>>
>>>>>> I think what's happening is that virtio_blk_handle_output() spins,
>>>>>> pulling data off the 128-entry queue and calling
>>>>>> virtio_blk_handle_request().  At this point that queue entry can be
>>>>>> reused, so the queue size isn't really relevant.
>>>>>>
>>>>>> In virtio_blk_handle_write() we add the request to a MultiReqBuffer and
>>>>>> every 32 writes we'll call virtio_submit_multiwrite() which calls down
>>>>>> into bdrv_aio_multiwrite().  That tries to merge requests and then for
>>>>>> each resulting request calls bdrv_aio_writev() which ends up calling
>>>>>> qemu_rbd_aio_writev(), which calls rbd_start_aio().
>>>>>>
>>>>>> rbd_start_aio() allocates a buffer and converts from iovec to a single
>>>>>> buffer.  This buffer stays allocated until the request is acked, which
>>>>>> is where the bulk of the memory overhead with rbd is coming from (has
>>>>>> anyone considered adding iovec support to rbd to avoid this extra
>>>>>> copy?).
>>>>>>
>>>>>> The only limit I see in the whole call chain from
>>>>>> virtio_blk_handle_request() on down is the call to
>>>>>> bdrv_io_limits_intercept() in bdrv_co_do_writev().  However, that
>>>>>> doesn't provide any limit on the absolute number of inflight
>>>>>> operations,
>>>>>> only on operations/sec.  If the ceph server cluster can't keep up with
>>>>>> the aggregate load, then the number of inflight operations can still
>>>>>> grow indefinitely.
>>>>>>
>>>>>> Chris
>>>>>
>>>>>
>>>>>
>>>>> I was a bit concerned that I'd need to extend the IO throttling code to
>>>>> support a limit on total inflight bytes, but it doesn't look like that
>>>>> will
>>>>> be necessary.
>>>>>
>>>>> It seems that using mallopt() to set the trim/mmap thresholds to 128K is
>>>>> enough to minimize the increase in RSS and also drop it back down after
>>>>> an
>>>>> I/O burst.  For now this looks like it should be sufficient for our
>>>>> purposes.
>>>>>
>>>>> I'm actually a bit surprised I didn't have to go lower, but it seems to
>>>>> work
>>>>> for both "dd" and dbench testcases so we'll give it a try.
>>>>>
>>>>> Chris
>>>>>
>>>>
>>>> Bumping this...
>>>>
>>>> For now, we are rarely suffering with an unlimited cache growth issue
>>>> which can be observed on all post-1.4 versions of qemu with rbd
>>>> backend in a writeback mode and certain pattern of a guest operations.
>>>> The issue is confirmed for virtio and can be re-triggered by issuing
>>>> excessive amount of write requests without completing returned acks
>>>> from a emulator` cache timely. Since most applications behave in a
>>>> right way, the oom issue is very rare (and we developed an ugly
>>>> workaround for such situations long ago). If anybody is interested in
>>>> fixing this, I can send a prepared image for a reproduction or
>>>> instructions to make one, whichever is preferable.
>>>>
>>>> Thanks!
>>>
>>>
>>> A gentle bump: for at least rbd backend with writethrough/writeback
>>> cache it is possible to achieve unlimited growth with lot of large
>>> unfinished ops, what can be considered as a DoS. Usually it is
>>> triggered by poorly written applications in the wild, like proprietary
>>> KV databases or MSSQL under Windows, but regular applications,
>>> primarily OSS databases, can trigger the RSS growth for hundreds of
>>> megabytes just easily. There is probably no straight way to limit
>>> in-flight request size by re-chunking it, as supposedly malicious
>>> guest can inflate it up to very high numbers, but it`s fine to crash
>>> such a guest, saving real-world stuff with simple in-flight op count
>>> limiter looks like more achievable option.
>>
>>
>> Hey, sorry I missed this thread before.
>>
>> What version of ceph are you running? There was an issue with ceph
>> 0.80.8 and earlier that could cause lots of extra memory usage by rbd's
>> cache (even in writethrough mode) due to copy-on-write triggering
>> whole-object (default 4MB) reads, and sticking those in the cache without
>> proper throttling [1]. I'm wondering if this could be causing the large
>> RSS growth you're seeing.
>>
>> In-flight requests do have buffers and structures allocated for them in
>> librbd, but these should have lower overhead than cow. If these are the
>> problem, it seems to me a generic limit on in flight ops in qemu would
>> be a reasonable fix. Other backends have resources tied up by in-flight
>> ops as well.
>>
>> Josh
>>
>> [1] https://github.com/ceph/ceph/pull/3410
>>
>>
>>
>
> I honestly believe that this is the second case. I have your pull in
> mine dumpling branch since mid-February, but amount of 'near-oom to
> handle' events was still the same over last few months compared to
> earlier times, with range from hundred megabytes to gigabyte compared
> to the theoretical top of the VM` consumption. Since the nature of the
> issue is a very reactive, e.g. RSS image can grow fast and shrink fast
> and eventually hit the cgroup limit, I have only a bare reproducer and
> a couple of indirect symptoms which are driving my thoughts in a
> direction as above - there is still no direct confirmation that
> unfinished disk requests are always causing infinite additional memory
> allocation.

Could you run massif on one of these guests with a problematic workload
to see where most of the memory is being used?

Like in this bug report, where it pointed to reads for cow as the
culprit:

http://tracker.ceph.com/issues/6494#note-1

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-08-25 15:12                             ` Chris Friesen
  2014-08-25 17:43                               ` Chris Friesen
@ 2015-08-27 16:33                               ` Stefan Hajnoczi
  1 sibling, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2015-08-27 16:33 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Benoît Canet, Paolo Bonzini, qemu-devel

On Mon, Aug 25, 2014 at 09:12:54AM -0600, Chris Friesen wrote:
> On 08/23/2014 01:56 AM, Benoît Canet wrote:
> >The Friday 22 Aug 2014 à 18:59:38 (-0600), Chris Friesen wrote :
> >>On 07/21/2014 10:10 AM, Benoît Canet wrote:
> >>>The Monday 21 Jul 2014 à 09:35:29 (-0600), Chris Friesen wrote :
> >>>>On 07/21/2014 09:15 AM, Benoît Canet wrote:
> >>>>>The Monday 21 Jul 2014 à 08:59:45 (-0600), Chris Friesen wrote :
> >>>>>>On 07/19/2014 02:45 AM, Benoît Canet wrote:
> >>>>>>
> >>>>>>>I think in the throttling case the number of in flight operation is limited by
> >>>>>>>the emulated hardware queue. Else request would pile up and throttling would be
> >>>>>>>inefective.
> >>>>>>>
> >>>>>>>So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like than that.
> >>>>>>
> >>>>>>Okay, that makes sense.  Do you know how much data can be written as part of
> >>>>>>a single operation?  We're using 2MB hugepages for the guest memory, and we
> >>>>>>saw the qemu RSS numbers jump from 25-30MB during normal operation up to
> >>>>>>120-180MB when running dbench.  I'd like to know what the worst-case would
> >>>
> >>>Sorry I didn't understood this part at first read.
> >>>
> >>>In the linux guest can you monitor:
> >>>benoit@Laure:~$ cat /sys/class/block/xyz/inflight ?
> >>>
> >>>This would give us a faily precise number of the requests actually in flight between the guest and qemu.
> >>
> >>
> >>After a bit of a break I'm looking at this again.
> >>
> >
> >Strange.
> >
> >I would use dd with the flag oflag=nocache to make sure the write request
> >does not do in the guest cache though.
> 
> I set up another test, checking the inflight value every second.
> 
> Running just "dd if=/dev/zero of=testfile2 bs=1M count=700 oflag=nocache&"
> gave a bit over 100 inflight requests.
> 
> If I simultaneously run "dd if=testfile of=/dev/null bs=1M count=700
> oflag=nocache&" then then number of inflight write requests peaks at 176.

Please use oflag=direct bs=4k, because oflag=nocache does not give an
accurate measure of the number of I/O requests.  It's also a good idea
to reduce bs, since the block queue limits in the Linux guest driver are
around 128 KB, so it cannot submit 1 MB in a single request.

oflag=nocache uses fadvise(POSIX_ADV_DONTNEED) as shown by strace(1):

read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
fadvise64(0, 2866688, 4096, POSIX_FADV_DONTNEED) = 0

That's not the same as bypassing the page cache!

All I/O is still going through the page cache; dd is only telling the
kernel that it no longer cares about the cached memory *afterwards*.
This means the kernel can still do readahead/writebehind to some extent,
so the inflight count does not reflect the true number of I/O operations
going to the disk!
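
The same distinction is visible from a small C program (a sketch; the
file names and sizes are placeholders): O_DIRECT bypasses the page cache
at submission time, while POSIX_FADV_DONTNEED only drops pages that have
already gone through the cache.

#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char page[4096];
    memset(page, 0, sizeof(page));

    /* What oflag=nocache amounts to: a normal buffered write, then a hint
     * to drop the cached pages afterwards - the write itself still goes
     * through the page cache. */
    int buffered = open("testfile-buffered", O_WRONLY | O_CREAT, 0644);
    if (buffered < 0 || write(buffered, page, sizeof(page)) != sizeof(page)) {
        return 1;
    }
    posix_fadvise(buffered, 0, sizeof(page), POSIX_FADV_DONTNEED);
    close(buffered);

    /* With O_DIRECT the page cache is bypassed at submission time, so each
     * write is a real I/O request.  Buffers must be suitably aligned. */
    void *aligned;
    if (posix_memalign(&aligned, 4096, 4096) != 0) {
        return 1;
    }
    memset(aligned, 0, 4096);
    int direct = open("testfile-direct", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (direct < 0 || write(direct, aligned, 4096) != 4096) {
        return 1;
    }
    close(direct);
    free(aligned);
    return 0;
}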

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-08-25 17:43                               ` Chris Friesen
@ 2015-08-27 16:37                                 ` Stefan Hajnoczi
  0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2015-08-27 16:37 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Benoît Canet, Paolo Bonzini, qemu-devel

On Mon, Aug 25, 2014 at 11:43:38AM -0600, Chris Friesen wrote:
> On 08/25/2014 09:12 AM, Chris Friesen wrote:
> 
> >I set up another test, checking the inflight value every second.
> >
> >Running just "dd if=/dev/zero of=testfile2 bs=1M count=700
> >oflag=nocache&" gave a bit over 100 inflight requests.
> >
> >If I simultaneously run "dd if=testfile of=/dev/null bs=1M count=700
> >oflag=nocache&" then then number of inflight write requests peaks at 176.
> >
> >I should point out that the above numbers are with qemu 1.7.0, with a
> >ceph storage backend.  qemu is started with
> >
> >-drive file=rbd:cinder-volumes/.........
> 
> From a stacktrace that I added it looks like the writes are coming in via
> virtio_blk_handle_output().
> 
> Looking at virtio_blk_device_init() I see it calling virtio_add_queue(vdev,
> 128, virtio_blk_handle_output);
> 
> I wondered if that 128 had anything to do with the number of inflight
> requests, so I tried recompiling with 16 instead. I still saw the number of
> inflight requests go up to 178 and the guest took a kernel panic in
> virtqueue_add_buf() so that wasn't very successful. :)
> 
> Following the code path in virtio_blk_handle_write() it looks like it will
> bundle up to 32 writes into a single large iovec-based "multiwrite"
> operation.  But from there on down I don't see a limit on how many writes
> can be outstanding at any one time.  Still checking the code further up the
> virtio call chain.

Yes, virtio-blk does write merging.  Since QEMU 2.4.0 it also does read
request merging.

I suggest using the fio benchmark tool with the following job file to
try submitting 256 I/O requests at the same time:

[randread]
blocksize=4k
filename=/dev/vda
rw=randread
direct=1
ioengine=libaio
iodepth=256
runtime=120

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-08-25 21:50                             ` Chris Friesen
  2014-08-27  5:43                               ` Chris Friesen
@ 2015-08-27 16:48                               ` Stefan Hajnoczi
  2015-08-27 17:05                                 ` Stefan Hajnoczi
  2015-08-27 16:49                               ` Stefan Hajnoczi
  2 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2015-08-27 16:48 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Benoît Canet, Paolo Bonzini, qemu-devel

On Mon, Aug 25, 2014 at 03:50:02PM -0600, Chris Friesen wrote:
> On 08/23/2014 01:56 AM, Benoît Canet wrote:
> >The Friday 22 Aug 2014 à 18:59:38 (-0600), Chris Friesen wrote :
> >>On 07/21/2014 10:10 AM, Benoît Canet wrote:
> >>>The Monday 21 Jul 2014 à 09:35:29 (-0600), Chris Friesen wrote :
> >>>>On 07/21/2014 09:15 AM, Benoît Canet wrote:
> >>>>>The Monday 21 Jul 2014 à 08:59:45 (-0600), Chris Friesen wrote :
> >>>>>>On 07/19/2014 02:45 AM, Benoît Canet wrote:
> >>>>>>
> >>>>>>>I think in the throttling case the number of in flight operation is limited by
> >>>>>>>the emulated hardware queue. Else request would pile up and throttling would be
> >>>>>>>inefective.
> >>>>>>>
> >>>>>>>So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like than that.
> >>>>>>
> >>>>>>Okay, that makes sense.  Do you know how much data can be written as part of
> >>>>>>a single operation?  We're using 2MB hugepages for the guest memory, and we
> >>>>>>saw the qemu RSS numbers jump from 25-30MB during normal operation up to
> >>>>>>120-180MB when running dbench.  I'd like to know what the worst-case would
> >>>
> >>>Sorry I didn't understood this part at first read.
> >>>
> >>>In the linux guest can you monitor:
> >>>benoit@Laure:~$ cat /sys/class/block/xyz/inflight ?
> >>>
> >>>This would give us a faily precise number of the requests actually in flight between the guest and qemu.
> >>
> >>
> >>After a bit of a break I'm looking at this again.
> >>
> >
> >Strange.
> >
> >I would use dd with the flag oflag=nocache to make sure the write request
> >does not do in the guest cache though.
> >
> >Best regards
> >
> >Benoît
> >
> >>While doing "dd if=/dev/zero of=testfile bs=1M count=700" in the guest, I
> >>got a max "inflight" value of 181.  This seems quite a bit higher than
> >>VIRTIO_PCI_QUEUE_MAX.
> >>
> >>I've seen throughput as high as ~210 MB/sec, which also kicked the RSS
> >>numbers up above 200MB.
> >>
> >>I tried dropping VIRTIO_PCI_QUEUE_MAX down to 32 (it didn't seem to work at
> >>all for values much less than that, though I didn't bother getting an exact
> >>value) and it didn't really make any difference, I saw inflight values as
> >>high as 177.
> 
> I think I might have a glimmering of what's going on.  Someone please
> correct me if I get something wrong.
> 
> I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with respect
> to max inflight operations, and neither does virtio-blk calling
> virtio_add_queue() with a queue size of 128.
> 
> I think what's happening is that virtio_blk_handle_output() spins, pulling
> data off the 128-entry queue and calling virtio_blk_handle_request().  At
> this point that queue entry can be reused, so the queue size isn't really
> relevant.

The number of pending virtio-blk requests is finite.  You missed the
vring descriptor table where buffer descriptors live - that's what
prevents the guest from issuing an infinite number of pending requests.

You are correct that the host moves along the "avail" queue, but the
actual buffer descriptors in the vring (struct vring_desc) stay put
until the request completion is processed by the guest driver from the
"used" ring.

Each virtio-blk request takes at least 2 vring descriptors (data buffer
+ request status byte).  I think 3 is common in practice because drivers
like to submit struct virtio_blk_outhdr in its own descriptor.

So we have a limit of 128 / 2 = 64 I/O requests or 128 / 3 = 42 I/O
requests.

If you rerun the tests with the fio job file I posted, the results
should show that only 64 or 42 requests are pending at any given time.
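
Spelled out as a tiny program (the struct layout follows the virtio
spec; the 2- and 3-descriptor splits per request are the cases described
above, not something fixed by the spec):

#include <stdint.h>
#include <stdio.h>

/* Descriptor layout from the virtio specification. */
struct vring_desc {
    uint64_t addr;   /* guest-physical address of the buffer */
    uint32_t len;
    uint16_t flags;
    uint16_t next;   /* index of the next descriptor in the chain */
};

int main(void)
{
    const int queue_size = 128;   /* virtio-blk vring size used by QEMU */

    /* A request is a chain of descriptors: at least a data buffer plus the
     * one-byte status field (2), commonly 3 when struct virtio_blk_outhdr
     * is submitted in its own descriptor. */
    printf("2 descriptors/request -> %d in-flight requests\n", queue_size / 2);
    printf("3 descriptors/request -> %d in-flight requests\n", queue_size / 3);
    return 0;
}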

Stefan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2014-08-25 21:50                             ` Chris Friesen
  2014-08-27  5:43                               ` Chris Friesen
  2015-08-27 16:48                               ` Stefan Hajnoczi
@ 2015-08-27 16:49                               ` Stefan Hajnoczi
  2015-08-28  0:31                                 ` Josh Durgin
  2 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2015-08-27 16:49 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Benoît Canet, Paolo Bonzini, qemu-devel

On Mon, Aug 25, 2014 at 03:50:02PM -0600, Chris Friesen wrote:
> The only limit I see in the whole call chain from
> virtio_blk_handle_request() on down is the call to
> bdrv_io_limits_intercept() in bdrv_co_do_writev().  However, that doesn't
> provide any limit on the absolute number of inflight operations, only on
> operations/sec.  If the ceph server cluster can't keep up with the aggregate
> load, then the number of inflight operations can still grow indefinitely.

We probably shouldn't rely on QEMU I/O throttling to keep memory usage
reasonable.

Instead rbd should be adjusted to support iovecs as you suggested.  That
way no bounce buffers are needed.
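
For context, the bounce-buffer behaviour being discussed amounts to
something like the following simplified sketch (standalone C, not the
actual block/rbd.c code): the scattered guest buffers are flattened into
one contiguous allocation that stays alive until librbd acknowledges the
request, which is exactly the per-request memory an iovec-based librbd
API would avoid.

#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/* Simplified sketch of what the thread describes rbd_start_aio() doing:
 * copy an iovec into a single bounce buffer before handing it to librbd. */
static char *flatten_iov(const struct iovec *iov, int iovcnt, size_t *total)
{
    size_t len = 0;
    for (int i = 0; i < iovcnt; i++) {
        len += iov[i].iov_len;
    }

    char *bounce = malloc(len);   /* lives until the rbd completion fires */
    if (!bounce) {
        return NULL;
    }

    size_t off = 0;
    for (int i = 0; i < iovcnt; i++) {
        memcpy(bounce + off, iov[i].iov_base, iov[i].iov_len);
        off += iov[i].iov_len;
    }

    *total = len;
    return bounce;   /* would be passed to the rbd write call and freed
                        only once the request is acked */
}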

Stefan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2015-08-27 16:48                               ` Stefan Hajnoczi
@ 2015-08-27 17:05                                 ` Stefan Hajnoczi
  0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2015-08-27 17:05 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Benoît Canet, Paolo Bonzini, qemu-devel

On Thu, Aug 27, 2015 at 5:48 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Mon, Aug 25, 2014 at 03:50:02PM -0600, Chris Friesen wrote:
>> On 08/23/2014 01:56 AM, Benoît Canet wrote:
>> >The Friday 22 Aug 2014 à 18:59:38 (-0600), Chris Friesen wrote :
>> >>On 07/21/2014 10:10 AM, Benoît Canet wrote:
>> >>>The Monday 21 Jul 2014 à 09:35:29 (-0600), Chris Friesen wrote :
>> >>>>On 07/21/2014 09:15 AM, Benoît Canet wrote:
>> >>>>>The Monday 21 Jul 2014 à 08:59:45 (-0600), Chris Friesen wrote :
>> >>>>>>On 07/19/2014 02:45 AM, Benoît Canet wrote:
>> >>>>>>
>> >>>>>>>I think in the throttling case the number of in flight operation is limited by
>> >>>>>>>the emulated hardware queue. Else request would pile up and throttling would be
>> >>>>>>>inefective.
>> >>>>>>>
>> >>>>>>>So this number should be around: #define VIRTIO_PCI_QUEUE_MAX 64 or something like than that.
>> >>>>>>
>> >>>>>>Okay, that makes sense.  Do you know how much data can be written as part of
>> >>>>>>a single operation?  We're using 2MB hugepages for the guest memory, and we
>> >>>>>>saw the qemu RSS numbers jump from 25-30MB during normal operation up to
>> >>>>>>120-180MB when running dbench.  I'd like to know what the worst-case would
>> >>>
>> >>>Sorry I didn't understood this part at first read.
>> >>>
>> >>>In the linux guest can you monitor:
>> >>>benoit@Laure:~$ cat /sys/class/block/xyz/inflight ?
>> >>>
>> >>>This would give us a faily precise number of the requests actually in flight between the guest and qemu.
>> >>
>> >>
>> >>After a bit of a break I'm looking at this again.
>> >>
>> >
>> >Strange.
>> >
>> >I would use dd with the flag oflag=nocache to make sure the write request
>> >does not do in the guest cache though.
>> >
>> >Best regards
>> >
>> >Benoît
>> >
>> >>While doing "dd if=/dev/zero of=testfile bs=1M count=700" in the guest, I
>> >>got a max "inflight" value of 181.  This seems quite a bit higher than
>> >>VIRTIO_PCI_QUEUE_MAX.
>> >>
>> >>I've seen throughput as high as ~210 MB/sec, which also kicked the RSS
>> >>numbers up above 200MB.
>> >>
>> >>I tried dropping VIRTIO_PCI_QUEUE_MAX down to 32 (it didn't seem to work at
>> >>all for values much less than that, though I didn't bother getting an exact
>> >>value) and it didn't really make any difference, I saw inflight values as
>> >>high as 177.
>>
>> I think I might have a glimmering of what's going on.  Someone please
>> correct me if I get something wrong.
>>
>> I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with respect
>> to max inflight operations, and neither does virtio-blk calling
>> virtio_add_queue() with a queue size of 128.
>>
>> I think what's happening is that virtio_blk_handle_output() spins, pulling
>> data off the 128-entry queue and calling virtio_blk_handle_request().  At
>> this point that queue entry can be reused, so the queue size isn't really
>> relevant.
>
> The number of pending virtio-blk requests is finite.  You missed the
> vring descriptor table where buffer descriptors live - that's what
> prevents the guest from issuing an infinite number of pending requests.
>
> You are correct that the host moves along the "avail" queue, the actual
> buffer descriptors in the vring (struct vring_desc) stay put until
> request completion is processed by the guest driver from the "used"
> ring.
>
> Each virtio-blk request takes at least 2 vring descriptors (data buffer
> + request status byte).  I think 3 is common in practice because drivers
> like to submit struct virtio_blk_outhdr in its own descriptor.
>
> So we have a limit of 128 / 2 = 64 I/O requests or 128 / 3 = 42 I/O
> requests.
>
> If you rerun the tests with the fio job file I posted, the results
> should show that only 64 or 42 requests are pending at any given time.

By the way, there's one more case: indirect vring descriptors.  If
indirect vring descriptors are used them each request takes just 1
descriptor in the vring descriptor table.  In that case up to 128
in-flight I/O requests are supported.

Stefan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2015-08-27 16:49                               ` Stefan Hajnoczi
@ 2015-08-28  0:31                                 ` Josh Durgin
  2015-08-28  8:31                                   ` Andrey Korolyov
  0 siblings, 1 reply; 40+ messages in thread
From: Josh Durgin @ 2015-08-28  0:31 UTC (permalink / raw)
  To: Stefan Hajnoczi, Chris Friesen
  Cc: Benoît Canet, Paolo Bonzini, Andrey Korolyov, qemu-devel

On 08/27/2015 09:49 AM, Stefan Hajnoczi wrote:
> On Mon, Aug 25, 2014 at 03:50:02PM -0600, Chris Friesen wrote:
>> The only limit I see in the whole call chain from
>> virtio_blk_handle_request() on down is the call to
>> bdrv_io_limits_intercept() in bdrv_co_do_writev().  However, that doesn't
>> provide any limit on the absolute number of inflight operations, only on
>> operations/sec.  If the ceph server cluster can't keep up with the aggregate
>> load, then the number of inflight operations can still grow indefinitely.
>
> We probably shouldn't rely on QEMU I/O throttling to keep memory usage
> reasonable.

Agreed.

> Instead rbd should be adjusted to support iovecs as you suggested.  That
> way no bounce buffers are needed.

Yeah, this is pretty simple to do. Internally librbd has
iovec-equivalents. I'm not sure this is the main source of extra memory
usage, though.

I suspect the main culprit here is rbd cache letting itself burst too
large, rather than the bounce buffers.

Andrey, does this still occur with caching off?

Josh

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
  2015-08-28  0:31                                 ` Josh Durgin
@ 2015-08-28  8:31                                   ` Andrey Korolyov
  0 siblings, 0 replies; 40+ messages in thread
From: Andrey Korolyov @ 2015-08-28  8:31 UTC (permalink / raw)
  To: Josh Durgin
  Cc: Benoît Canet, Stefan Hajnoczi, Paolo Bonzini, qemu-devel,
	Chris Friesen

On Fri, Aug 28, 2015 at 3:31 AM, Josh Durgin <jdurgin@redhat.com> wrote:
> On 08/27/2015 09:49 AM, Stefan Hajnoczi wrote:
>>
>> On Mon, Aug 25, 2014 at 03:50:02PM -0600, Chris Friesen wrote:
>>>
>>> The only limit I see in the whole call chain from
>>> virtio_blk_handle_request() on down is the call to
>>> bdrv_io_limits_intercept() in bdrv_co_do_writev().  However, that doesn't
>>> provide any limit on the absolute number of inflight operations, only on
>>> operations/sec.  If the ceph server cluster can't keep up with the
>>> aggregate
>>> load, then the number of inflight operations can still grow indefinitely.
>>
>>
>> We probably shouldn't rely on QEMU I/O throttling to keep memory usage
>> reasonable.
>
>
> Agreed.
>
>> Instead rbd should be adjusted to support iovecs as you suggested.  That
>> way no bounce buffers are needed.
>
>
> Yeah, this is pretty simple to do. Internally librbd has iovec-equivalents.
> I'm not sure this is the main source of extra memory usage
> though.
>
> I suspect the main culprit here is rbd cache letting itself burst too
> large, rather than the bounce buffers.
>
> Andrey, does this still occur with caching off?

No, disabling the cache completely has helped perfectly so far, which
cannot be said for the wt/wb modes. As I mentioned, the consumption
curve is reactive by nature, so I'll try to catch the burst under
valgrind, but I cannot promise anything.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
       [not found] <1000957815.25879188.1441820902018.JavaMail.zimbra@redhat.com>
@ 2015-09-09 18:51 ` Jason Dillaman
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Dillaman @ 2015-09-09 18:51 UTC (permalink / raw)
  To: andrey; +Cc: qemu-devel

>> Bumping this...
>>
>> For now, we are rarely suffering with an unlimited cache growth issue
>> which can be observed on all post-1.4 versions of qemu with rbd
>> backend in a writeback mode and certain pattern of a guest operations.
>> The issue is confirmed for virtio and can be re-triggered by issuing
>> excessive amount of write requests without completing returned acks
>> from a emulator` cache timely. Since most applications behave in a
>> right way, the oom issue is very rare (and we developed an ugly
>> workaround for such situations long ago). If anybody is interested in
>> fixing this, I can send a prepared image for a reproduction or
>> instructions to make one, whichever is preferable.
>>
>> Thanks!
>
>A gentle bump: for at least rbd backend with writethrough/writeback
>cache it is possible to achieve unlimited growth with lot of large
>unfinished ops, what can be considered as a DoS. Usually it is
>triggered by poorly written applications in the wild, like proprietary
>KV databases or MSSQL under Windows, but regular applications,
>primarily OSS databases, can trigger the RSS growth for hundreds of
>megabytes just easily. There is probably no straight way to limit
>in-flight request size by re-chunking it, as supposedly malicious
>guest can inflate it up to very high numbers, but it`s fine to crash
>such a guest, saving real-world stuff with simple in-flight op count
>limiter looks like more achievable option.

Any chance you can provide the reproducer VM image via ceph-post-file [1]?  Using the latest Firefly release with QEMU 2.3.1, I was unable to reproduce unlimited growth while hammering the VM with a randwrite fio job with iodepth=256, blocksize=4k.   

[1] http://ceph.com/docs/master/man/8/ceph-post-file/

-- Jason

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2015-09-09 18:52 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-18 14:58 [Qemu-devel] is there a limit on the number of in-flight I/O operations? Chris Friesen
2014-07-18 15:24 ` Paolo Bonzini
2014-07-18 16:22   ` Chris Friesen
2014-07-18 20:13     ` Paolo Bonzini
2014-07-18 22:48       ` Chris Friesen
2014-07-19  5:49         ` Paolo Bonzini
2014-07-19  6:27           ` Chris Friesen
2014-07-19  7:23             ` Paolo Bonzini
2014-07-19  8:45               ` Benoît Canet
2014-07-21 14:59                 ` Chris Friesen
2014-07-21 15:15                   ` Benoît Canet
2014-07-21 15:35                     ` Chris Friesen
2014-07-21 15:54                       ` Benoît Canet
2014-07-21 16:10                       ` Benoît Canet
2014-08-23  0:59                         ` Chris Friesen
2014-08-23  7:56                           ` Benoît Canet
2014-08-25 15:12                             ` Chris Friesen
2014-08-25 17:43                               ` Chris Friesen
2015-08-27 16:37                                 ` Stefan Hajnoczi
2015-08-27 16:33                               ` Stefan Hajnoczi
2014-08-25 21:50                             ` Chris Friesen
2014-08-27  5:43                               ` Chris Friesen
2015-05-14 13:42                                 ` Andrey Korolyov
2015-08-26 17:10                                   ` Andrey Korolyov
2015-08-26 23:31                                     ` Josh Durgin
2015-08-26 23:47                                       ` Andrey Korolyov
2015-08-27  0:56                                         ` Josh Durgin
2015-08-27 16:48                               ` Stefan Hajnoczi
2015-08-27 17:05                                 ` Stefan Hajnoczi
2015-08-27 16:49                               ` Stefan Hajnoczi
2015-08-28  0:31                                 ` Josh Durgin
2015-08-28  8:31                                   ` Andrey Korolyov
2014-07-21 19:47                       ` Benoît Canet
2014-07-21 21:12                         ` Chris Friesen
2014-07-21 22:04                           ` Benoît Canet
2014-07-18 15:54 ` Andrey Korolyov
2014-07-18 16:26   ` Chris Friesen
2014-07-18 16:30     ` Andrey Korolyov
2014-07-18 16:46       ` Chris Friesen
     [not found] <1000957815.25879188.1441820902018.JavaMail.zimbra@redhat.com>
2015-09-09 18:51 ` Jason Dillaman
