From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:52348) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a7KK1-00015t-8S for qemu-devel@nongnu.org; Fri, 11 Dec 2015 04:49:06 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1a7KJx-00008m-Uf for qemu-devel@nongnu.org; Fri, 11 Dec 2015 04:49:05 -0500
Received: from szxga03-in.huawei.com ([119.145.14.66]:8005) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a7KJw-00008Q-IT for qemu-devel@nongnu.org; Fri, 11 Dec 2015 04:49:01 -0500
References: <1448357149-17572-1-git-send-email-zhang.zhanghailiang@huawei.com> <1448357149-17572-26-git-send-email-zhang.zhanghailiang@huawei.com> <20151210190113.GL2570@work-vm>
From: Hailiang Zhang
Message-ID: <566A9BF3.5080302@huawei.com>
Date: Fri, 11 Dec 2015 17:48:35 +0800
MIME-Version: 1.0
In-Reply-To: <20151210190113.GL2570@work-vm>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH COLO-Frame v11 25/39] COLO: implement default failover treatment
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: "Dr. David Alan Gilbert"
Cc: lizhijian@cn.fujitsu.com, quintela@redhat.com, yunhong.jiang@intel.com, eddie.dong@intel.com, peter.huangpeng@huawei.com, qemu-devel@nongnu.org, arei.gonglei@huawei.com, stefanha@redhat.com, amit.shah@redhat.com, hongyang.yang@easystack.cn

On 2015/12/11 3:01, Dr. David Alan Gilbert wrote:
> * zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
>> If we detect some error in colo, we will wait for some time,
>> hoping users also detect it. If users don't issue failover command.
>> We will go into default failover procedure, which the PVM will takeover
>> work while SVM is exit in default.
>
> I'm not sure this is needed; especially on the SVM. I don't see any harm
> in the SVM waiting forever to be told what to do - it could be told to
> failover or quit; I don't see any benefit to it automatically exiting.
>
> In the primary, I can see if you didn't have some automated error
> detection system then I can understand it (but I think it's rare);
> but you really would want to make that failover delay configurable
> so that you could turn it off in a system that did have failure detection;
> because automatically restarting the primary after it had caused a failover
> to the secondary would be very bad.

Yes, automatically restarting the PVM may cause split-brain.
I'll drop this patch temporarily.

Thanks.
Hailiang

>>
>> Signed-off-by: zhanghailiang
>> Signed-off-by: Li Zhijian
>> ---
>>  migration/colo.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 46 insertions(+)
>>
>> diff --git a/migration/colo.c b/migration/colo.c
>> index f31e957..1e6d3dd 100644
>> --- a/migration/colo.c
>> +++ b/migration/colo.c
>> @@ -19,6 +19,14 @@
>>  #include "qemu/sockets.h"
>>  #include "migration/failover.h"
>>
>> +/*
>> + * The delay time before qemu begin the procedure of default failover treatment.
>> + * Unit: ms
>> + * Fix me: This value should be able to change by command
>> + * 'migrate-set-parameters'
>> + */
>> +#define DEFAULT_FAILOVER_DELAY 2000
>> +
>>  /* colo buffer */
>>  #define COLO_BUFFER_BASE_SIZE (4 * 1024 * 1024)
>>
>> @@ -264,6 +272,7 @@ static void colo_process_checkpoint(MigrationState *s)
>>  {
>>      QEMUSizedBuffer *buffer = NULL;
>>      int64_t current_time, checkpoint_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>> +    int64_t error_time;
>>      int ret = 0;
>>      uint64_t value;
>>
>> @@ -322,8 +331,25 @@ static void colo_process_checkpoint(MigrationState *s)
>>      }
>>
>>  out:
>> +    current_time = error_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>>      if (ret < 0) {
>>          error_report("%s: %s", __func__, strerror(-ret));
>> +        /* Give users time to get involved in this verdict */
>> +        while (current_time - error_time <= DEFAULT_FAILOVER_DELAY) {
>> +            if (failover_request_is_active()) {
>> +                error_report("Primary VM will take over work");
>> +                break;
>> +            }
>> +            usleep(100 * 1000);
>> +            current_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>> +        }
>> +
>> +        qemu_mutex_lock_iothread();
>> +        if (!failover_request_is_active()) {
>> +            error_report("Primary VM will take over work in default");
>> +            failover_request_active(NULL);
>> +        }
>> +        qemu_mutex_unlock_iothread();
>>      }
>>
>>      qsb_free(buffer);
>> @@ -384,6 +410,7 @@ void *colo_process_incoming_thread(void *opaque)
>>      QEMUFile *fb = NULL;
>>      QEMUSizedBuffer *buffer = NULL; /* Cache incoming device state */
>>      uint64_t total_size;
>> +    int64_t error_time, current_time;
>>      int ret = 0;
>>      uint64_t value;
>>
>> @@ -499,9 +526,28 @@ void *colo_process_incoming_thread(void *opaque)
>>      }
>>
>>  out:
>> +    current_time = error_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>>      if (ret < 0) {
>>          error_report("colo incoming thread will exit, detect error: %s",
>>                       strerror(-ret));
>> +        /* Give users time to get involved in this verdict */
>> +        while (current_time - error_time <= DEFAULT_FAILOVER_DELAY) {
>> +            if (failover_request_is_active()) {
>> +                error_report("Secondary VM will take over work");
>> +                break;
>> +            }
>> +            usleep(100 * 1000);
>> +            current_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
>> +        }
>> +        /* check flag again*/
>> +        if (!failover_request_is_active()) {
>> +            /*
>> +             * We assume that Primary VM is still alive according to
>> +             * heartbeat, just kill Secondary VM
>> +             */
>> +            error_report("SVM is going to exit in default!");
>> +            exit(1);
>> +        }
>>      }
>>
>>      if (fb) {
>> --
>> 1.8.3.1
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
> .
>
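
For reference, the configurable delay asked for in the review could look roughly like the sketch below. It is not part of the patch and not QEMU code: wait_for_failover(), now_ms(), sleep_ms() and the failover_requested flag are hypothetical stand-ins for the polling loop, qemu_clock_get_ms() and failover_request_is_active(). The assumption made here is that a negative delay means "default failover is turned off, keep waiting until told what to do", which is the behaviour suggested for systems that have their own failure detection.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the state behind failover_request_is_active(). */
static bool failover_requested;

/* Monotonic time in milliseconds (stand-in for qemu_clock_get_ms()). */
static int64_t now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Sleep for the given number of milliseconds. */
static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };

    nanosleep(&ts, NULL);
}

enum failover_wait_result {
    FAILOVER_REQUESTED,   /* a failover request arrived while waiting */
    FAILOVER_TIMED_OUT    /* the delay expired without any request */
};

/*
 * Poll for a failover request.
 *   delay_ms <  0: default failover disabled, wait until a request arrives.
 *   delay_ms >= 0: give up after delay_ms milliseconds.
 */
static enum failover_wait_result wait_for_failover(int64_t delay_ms)
{
    int64_t start = now_ms();

    while (!failover_requested) {
        if (delay_ms >= 0 && now_ms() - start >= delay_ms) {
            return FAILOVER_TIMED_OUT;
        }
        sleep_ms(10);
    }
    return FAILOVER_REQUESTED;
}
```

With something like this, the caller would only start the default failover procedure when the result is FAILOVER_TIMED_OUT, and a negative delay would never produce that result.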