Message-ID: <5147780C.1080800@linux.vnet.ibm.com>
Date: Mon, 18 Mar 2013 16:24:44 -0400
From: "Michael R. Hines"
MIME-Version: 1.0
References: <1363576743-6146-1-git-send-email-mrhines@linux.vnet.ibm.com> <1363576743-6146-4-git-send-email-mrhines@linux.vnet.ibm.com> <20130318104013.GE5267@redhat.com>
In-Reply-To: <20130318104013.GE5267@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
To: "Michael S.
Tsirkin"
Cc: aliguori@us.ibm.com, qemu-devel@nongnu.org, owasserm@redhat.com, abali@us.ibm.com, mrhines@us.ibm.com, gokul@us.ibm.com, pbonzini@redhat.com

On 03/18/2013 06:40 AM, Michael S. Tsirkin wrote:
> I think there are two things here, API documentation and protocol
> documentation, protocol documentation still needs some more work. Also
> if what I understand from this document is correct this breaks memory
> overcommit on destination which needs to be fixed.
>
> I think something chunk-based on the destination side is required as
> well. You also can't trust the source to tell you the chunk size it
> could be malicious and ask for too much. Maybe source gives chunk size
> hint and destination responds with what it wants to use.

Do we allow ballooning *during* the live migration? Is that necessary?
Would it be sufficient to inform the destination which pages are
ballooned and then register only the ones that the VM actually owns?

> Is there any feature and/or version negotiation? How are we going to
> handle compatibility when we extend the protocol?

You mean, on top of the protocol versioning that's already built in to
QEMUFile, inside qemu_savevm_state_begin()? Should I piggy-back an
additional protocol version number before QEMUFile sends its version
number?

> So how does destination know it's ok to send anything to source? I
> suspect this is wrong. When using CM you must post on RQ before
> completing the connection negotiation, not after it's done.

This is already handled by the RDMA connection manager (librdmacm).
The library already has functions like listen() and accept(), the same
way that TCP does. Once these functions return success, we have a
guarantee that both sides of the connection have already posted the
work requests needed to drive the migration.

>> +2. We transmit an empty SEND to let the sender know that
>> + we are *ready* to receive some bytes from QEMUFileRDMA.
>> + These bytes will come in the form of a another SEND.
> Using an empty message seems somewhat hacky, a fixed header in the
> message would let you do more things if protocol is ever extended.

Great idea. I'll add a struct RDMAHeader, including a version number,
to each SEND message in the next RFC. (Until now, there were *only*
QEMUFile bytes, nothing else, so I didn't have any reason for a formal
structure.)

> OK to summarize flow control: at any time there's either 0 or 1
> outstanding buffers in RQ. At each time only one side can talk.
> Destination always goes first, then source, etc. At each time a single
> send message can be passed. Just FYI, this means you are often at 0
> buffers in RQ and IIRC 0 buffers is a worst-case path for infiniband.
> It's better to keep at least 1 buffers in RQ at all times, so prepost
> 2 initially so it would fluctuate between 1 and 2.

That's correct. Having 0 buffers is not possible - sending a message
with 0 buffers would throw an error. The "protocol" as I described it
ensures that there is always one buffer posted before waiting for
another message to arrive.

I avoided "better" flow control because the non-live state is so small
in comparison to the pc.ram contents that would be sent. The non-live
state is in the range of kilobytes, so it seemed silly to have more
rigorous flow control.

>> +Migration of pc.ram:
>> +===============================
>> +
>> +At the beginning of the migration, (migration-rdma.c),
>> +the sender and the receiver populate the list of RAMBlocks
>> +to be registered with each other into a structure.
> Could you add the packet format here as well please?
> Need to document endian-ness etc.

There is no packet format for pc.ram. It's just bytes - raw RDMA
writes of each 4K page, because the memory must be registered before
the RDMA write can begin. (As discussed, there will be a format for
SEND, though - so I'll take care of that in my next RFC.)
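
For illustration only, a fixed SEND header along the lines discussed
above might look like the sketch below. Only the name RDMAHeader and
the idea of a version number come from this thread; the other fields,
the message types, the 12-byte wire size and the choice of big-endian
(network) byte order are assumptions made for the example, not the
format from the patch series. A READY message with a zero-length
payload would stand in for the current empty SEND.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed header carried at the start of every SEND.
 * Field names and layout are illustrative, not from the patch. */
typedef struct RDMAHeader {
    uint32_t version; /* protocol version, starting at 1 */
    uint32_t type;    /* one of the RDMA_MSG_* values below */
    uint32_t len;     /* number of payload bytes after the header */
} RDMAHeader;

/* Hypothetical message types. */
enum {
    RDMA_MSG_READY     = 1, /* "ready to receive", no payload */
    RDMA_MSG_QEMU_FILE = 2  /* payload is a run of QEMUFile bytes */
};

#define RDMA_HEADER_WIRE_SIZE 12

/* Store a 32-bit value in big-endian order, independent of host. */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Load a 32-bit big-endian value. */
static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Serialize the header into buf, which must hold at least
 * RDMA_HEADER_WIRE_SIZE bytes. */
static void rdma_header_encode(const RDMAHeader *h, uint8_t *buf)
{
    put_be32(buf,     h->version);
    put_be32(buf + 4, h->type);
    put_be32(buf + 8, h->len);
}

/* Parse a header from buf. Returns 0 on success, or -1 if the buffer
 * is too short or the version is 0 or newer than max_version. */
static int rdma_header_decode(const uint8_t *buf, size_t buflen,
                              uint32_t max_version, RDMAHeader *out)
{
    if (buflen < RDMA_HEADER_WIRE_SIZE) {
        return -1;
    }
    out->version = get_be32(buf);
    out->type    = get_be32(buf + 4);
    out->len     = get_be32(buf + 8);
    if (out->version == 0 || out->version > max_version) {
        return -1;
    }
    return 0;
}
```

Encoding byte by byte avoids any dependence on host endianness or
struct padding, which is one way to address the endian-ness question
for SEND messages. A receiver in this sketch simply rejects any version
it does not understand, and would also need to check len against the
size of its posted buffer.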

> Yes but we also need to report errors detected during migration. Need
> to document how this is done. We also need to report success.

Acknowledged - I'll add more verbosity to the different error
conditions.

- Michael R. Hines