From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <5168105C.5040605@linux.vnet.ibm.com>
Date: Fri, 12 Apr 2013 09:47:08 -0400
From: "Michael R. Hines"
MIME-Version: 1.0
References: <20130411071927.GA17063@redhat.com> <5166B6B1.2030003@linux.vnet.ibm.com> <20130411134820.GA24942@redhat.com> <5166C19A.1040402@linux.vnet.ibm.com> <20130411143718.GC24942@redhat.com> <5166CDAD.8060807@redhat.com> <20130411145632.GA2280@redhat.com> <5166F7AE.8070209@linux.vnet.ibm.com> <20130411191533.GA25515@redhat.com> <51671DFF.80904@linux.vnet.ibm.com> <20130412104802.GA23467@redhat.com>
In-Reply-To: <20130412104802.GA23467@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
To: "Michael S. Tsirkin"
Cc: aliguori@us.ibm.com, qemu-devel@nongnu.org, owasserm@redhat.com, abali@us.ibm.com, mrhines@us.ibm.com, gokul@us.ibm.com, Paolo Bonzini

On 04/12/2013 06:48 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 04:33:03PM -0400, Michael R. Hines wrote:
>> On 04/11/2013 03:15 PM, Michael S. Tsirkin wrote:
>>> On Thu, Apr 11, 2013 at 01:49:34PM -0400, Michael R. Hines wrote:
>>>> On 04/11/2013 10:56 AM, Michael S. Tsirkin wrote:
>>>>> On Thu, Apr 11, 2013 at 04:50:21PM +0200, Paolo Bonzini wrote:
>>>>>> On 11/04/2013 16:37, Michael S. Tsirkin wrote:
>>>>>>> pg1 -> pin -> req -> res -> rdma -> done
>>>>>>> pg2 -> pin -> req -> res -> rdma -> done
>>>>>>> pg3 -> pin -> req -> res -> rdma -> done
>>>>>>> pg4 -> pin -> req -> res -> rdma -> done
>>>>>>> pg5 -> pin -> req -> res -> rdma -> done
>>>>>>>
>>>>>>> It's like an assembly line, see? So while software does the
>>>>>>> registration round-trip dance, hardware is processing RDMA requests
>>>>>>> for previous chunks.
>>>>>> Does this only affect the implementation, or also the wire protocol?
>>>>> It affects the wire protocol.
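The assembly-line overlap described in the quoted diagram can be sketched as a two-stage producer/consumer: a control thread performs the pin/req/res registration handshake for each chunk while a separate thread issues RDMA writes for chunks that are already registered. This is a minimal illustration only; the thread and queue structure is hypothetical and is not QEMU's actual implementation.

```python
# Sketch of the "assembly line": registration (control path) overlaps
# with RDMA writes (data path) for previously registered chunks.
# Stage names follow the pg -> pin -> req -> res -> rdma -> done diagram;
# everything else here is an illustrative assumption.
import queue
import threading

def register_chunks(chunks, ready):
    # Control path: pin -> req -> res for each chunk, then hand it off.
    # (A real implementation would call ibv_reg_mr and exchange the rkey
    # with the destination here; we only simulate the handoff.)
    for chunk in chunks:
        ready.put(chunk)
    ready.put(None)  # sentinel: no more chunks

def rdma_writer(ready, done):
    # Data path: issue RDMA writes for registered chunks as they arrive,
    # without waiting for later registrations to complete.
    while (chunk := ready.get()) is not None:
        done.append(chunk)

chunks = [f"pg{i}" for i in range(1, 5)]
ready, done = queue.Queue(), []
t1 = threading.Thread(target=register_chunks, args=(chunks, ready))
t2 = threading.Thread(target=rdma_writer, args=(ready, done))
t1.start(); t2.start(); t1.join(); t2.join()
print(done)  # chunks finish in order, but the two stages ran concurrently
```

The point of the diagram is exactly this decoupling: the registration round-trip for chunk N happens while the RDMA write for chunk N-1 is in flight, so control latency is hidden behind data transfer.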
>>>> I *do* believe chunked registration was a *very* useful request by
>>>> the community, and I want to thank you for convincing me to implement it.
>>>>
>>>> But, with all due respect, pipelining is a "solution looking for a problem".
>>> The problem is bad performance, isn't it?
>>> If it weren't, we'd use chunk-based all the time.
>>>
>>>> Improving the protocol does not help the behavior of any well-known
>>>> workloads, because it is based on the idea that the memory footprint
>>>> of a VM would *rapidly* grow and shrink during the steady-state
>>>> iteration rounds while the migration is taking place.
>>> What gave you that idea? Not at all. It is based on the idea
>>> of doing control actions in parallel with data transfers,
>>> so that control latency does not degrade performance.
>> Again, this parallelization is trying to solve a problem that
>> doesn't exist.
>>
>> As I've described before, I re-executed the worst-case memory stress hog
>> tests with RDMA *after* the bulk-phase round completes and determined
>> that RDMA throughput remains unaffected because most of the memory
>> was already registered in advance.
>>
>>>> This simply does not happen - workloads don't behave that way - they
>>>> either grow really big or shrink really small, and they settle that
>>>> way for a reasonable amount of time before the load on the
>>>> application changes at some future point.
>>>>
>>>> - Michael
>>> What is the bottleneck for chunk-based? Can you tell me that? Find out,
>>> and you will maybe see pipelining will help.
>>>
>>> Basically to me, when you describe the protocol in detail the problems
>>> become apparent.
>>>
>>> I think you worry too much about what the guest does, what APIs are
>>> exposed from the migration core and the specifics of the workload. Build
>>> a sane protocol for data transfers and layer the workload on top.
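To make the latency argument in this exchange concrete, here is a back-of-envelope model of chunk-based transfer time with and without pipelining. Serialized, every chunk pays the full registration round-trip before its RDMA write can start; pipelined, the round-trips overlap with the previous chunks' writes, so roughly only the first one is exposed. All the numbers below are illustrative assumptions, not measurements from either protocol version.

```python
# Back-of-envelope model of why serialized registration round-trips hurt
# chunk-based throughput, and why pipelining hides them.

def serialized_time(n_chunks, rtt, xfer):
    # Each chunk waits for its pin/req/res round-trip (rtt) before its
    # RDMA write (xfer) can begin.
    return n_chunks * (rtt + xfer)

def pipelined_time(n_chunks, rtt, xfer):
    # Control round-trips overlap with data transfers of earlier chunks,
    # so only the first chunk's round-trip is exposed (assumes rtt <= xfer,
    # i.e. the data path is the bottleneck).
    return rtt + n_chunks * xfer

if __name__ == "__main__":
    # Hypothetical: 1000 chunks, 0.5 ms registration round-trip,
    # 1.0 ms RDMA write per chunk.
    n, rtt, xfer = 1000, 0.5, 1.0
    s = serialized_time(n, rtt, xfer)   # 1000 * 1.5  = 1500.0 ms
    p = pipelined_time(n, rtt, xfer)    # 0.5 + 1000  = 1000.5 ms
    print(f"serialized: {s:.1f} ms, pipelined: {p:.1f} ms")
```

Under these assumed numbers the round-trip cost inflates total transfer time by about 50%; how large the effect is in practice depends entirely on the real ratio of registration latency to per-chunk transfer time, which is what the "what is the bottleneck" question above is asking for.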
>>>
>> What is the point in enhancing a protocol to solve a problem that will
>> never manifest?
>>
>> We're trying to overlap two *completely different use cases* that
>> are completely unrelated:
>>
>> 1. Static overcommit
>> 2. Dynamic, fine-grained overcommit (at small time scales... seconds
>>    or minutes)
>>
>> #1 happens all the time. Cram a bunch of virtual machines with fixed
>> workloads and fixed writable working sets into the same place, and
>> you're good to go.
>>
>> #2 never happens. Ever. It just doesn't happen, and the enhancements
>> you've described are trying to protect against #2, when we should
>> really be focused on #1.
>>
>> It is not standard practice for a workload to expect high overcommit
>> performance in the *middle* of a relocation, and nobody in the industry
>> that I have met over the years has expressed any desire to do so.
>>
> Depends on who you talk to, I guess. Almost everyone
> overcommits to some level. They might not know it.
> It depends on the amount of overcommit. You pin all (at least non-zero)
> memory eventually, breaking memory overcommit completely. If I
> overcommit by 4 kilobytes, do you expect performance to go completely
> down? It does not make sense.
>
>> Workloads just don't behave that way.
>>
>> Dynamic registration does an excellent job at overcommitment for #1
>> because most of the registrations are done at the very beginning and
>> can be further optimized to cause little or no performance loss by
>> simply issuing the registrations before the migration ever begins.
> How does it? You pin all the VM's memory eventually.
> You said your tests have the OOM killer triggering.
>

That's because of cgroups memory limitations, not the protocol.
InfiniBand was never designed to work with cgroups - that's a kernel
problem, not a QEMU problem or a protocol problem. Why do we have to
worry about that exactly?
>> Performance for #2 even with dynamic registration is excellent and I am
>> not experiencing any problems associated with it.
> Well, previously you said the reverse. You keep vaguely speaking about
> performance. We care about these metrics:
>
> 1. total migration time, measured by:
>
>    time
>    ssh dest qemu -incoming &; echo migrate > monitor
>    time
>
> 2. min allowed downtime that lets migration converge
>
> 3. average host CPU utilization during migration,
>    on source and destination
>
> 4. max real memory used by qemu
>
> Can you fill this table for TCP and the two protocol versions?
>
> If dynamic works as well as static, this is a good reason to drop the
> static one. As the next step, fix the dynamic to unregister
> memory (this is required for _GIFT anyway). When you do this
> it is possible that pipelining is required.

First, yes, I'm happy to fill out the table - let me address Paolo's
last requested changes first (including the COMPRESS fix).

Second, there are not two protocol versions. That's incorrect. There is
only one protocol, which can operate in different ways, as any protocol
can. It has different command types, not all of which need to be used
at the same time.

Third, as I've explained, I strongly, strongly disagree with
unregistering memory, for all of the aforementioned reasons - workloads
do not operate in such a manner that they can tolerate memory being
pulled out from underneath them at such fine-grained time scales in the
*middle* of a relocation, and I will not commit to writing a solution
for a problem that doesn't exist.

If you can prove (through some kind of analysis) that workloads would
benefit from this kind of fine-grained memory overcommit by having
cgroups swap out memory to disk underneath them without their
permission, I would happily reconsider my position.

- Michael

>> So, we're discussing a non-issue.
>>
>> - Michael
>>
> There are two issues.
>
> 1. You have two protocols already and this does not make sense in
>    version 1 of the patch. You said dynamic is slow so I pointed out
>    ways to improve it. Now you say it's as fast as static? So drop
>    static then. At no point does it make sense to have management
>    commands to play with low level protocol details.
>
>>
>> Overcommit has two