All of lore.kernel.org
 help / color / mirror / Atom feed
From: Wen Congyang <wency@cn.fujitsu.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	zhanghailiang <zhang.zhanghailiang@huawei.com>
Cc: <peter.huangpeng@huawei.com>, <qemu-devel@nongnu.org>,
	<quintela@redhat.com>, <amit.shah@redhat.com>,
	<eblake@redhat.com>, <berrange@redhat.com>,
	<eddie.dong@intel.com>, <yunhong.jiang@intel.com>,
	<lizhijian@cn.fujitsu.com>, <arei.gonglei@huawei.com>,
	<david@gibson.dropbear.id.au>, <netfilter-devel@vger.kernel.org>
Subject: Re: [PATCH COLO-Frame v5 00/29] COarse-grain LOck-stepping(COLO) Virtual Machines for Non-stop Service
Date: Fri, 29 May 2015 20:34:18 +0800	[thread overview]
Message-ID: <55685CCA.2010604@cn.fujitsu.com> (raw)
In-Reply-To: <20150529084249.GC2127@work-vm>

On 05/29/2015 04:42 PM, Dr. David Alan Gilbert wrote:
> * zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
>> On 2015/5/29 9:29, Wen Congyang wrote:
>>> On 05/29/2015 12:24 AM, Dr. David Alan Gilbert wrote:
>>>> * zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
>>>>> This is the 5th version of COLO, here is only COLO frame part, include: VM checkpoint,
>>>>> failover, proxy API, block replication API, not include block replication.
>>>>> The block part has been sent by wencongyang:
>>>>> "[Qemu-devel] [PATCH COLO-Block v5 00/15] Block replication for continuous checkpoints"
>>>>>
>>>>> we have finished some new features and optimization on COLO (As a development branch in github),
>>>>> but for easy of review, it is better to keep it simple now, so we will not add too much new
>>>>> codes into this frame patch set before it been totally reviewed.
>>>>>
>>>>> You can get the latest integrated qemu colo patches from github (Include Block part):
>>>>> https://github.com/coloft/qemu/commits/colo-v1.2-basic
>>>>> https://github.com/coloft/qemu/commits/colo-v1.2-developing (more features)
>>>>>
>>>>> Please NOTE the difference between these two branch.
>>>>> colo-v1.2-basic is exactly same with this patch series, which has basic features of COLO.
>>>>> Compared with colo-v1.2-basic, colo-v1.2-developing has some optimization in the
>>>>> process of checkpoint, including:
>>>>>    1) separate ram and device save/load process to reduce size of extra memory
>>>>>       used during checkpoint
>>>>>    2) live migrate part of dirty pages to slave during sleep time.
>>>>> Besides, we add some statistic info in colo-v1.2-developing, which you can get these stat
>>>>> info by using command 'info migrate'.
>>>>
>>>>
>>>> Hi,
>>>>   I have that running now.
>>>>
>>>> Some notes:
>>>>   1) The colo-proxy is working OK until qemu quits, and then it gets an RCU problem; see below
>>>>   2) I've attached some minor tweaks that were needed to build with the 4.1rc kernel I'm using;
>>>>      they're very minor changes and I don't think related to (1).
>>>>   3) I've also included some minor fixups I needed to get the -developing world
>>>>      to build;  my compiler is fussy about unused variables etc - but I think the code
>>>>      in ram_save_complete in your -developing patch is wrong because there are two
>>>>      'pages' variables and the one in the inner loop is the only one changed.
>>
>> Oops, i will fix them. thank you for pointing out this low grade mistake. :)
> 
> No problem; we all make them.
> 
>>>>   4) I've started trying simple benchmarks and tests now:
>>>>     a) With a simple web server most requests have very little overhead, the comparison
>>>>        matches most of the time;  I do get quite large spikes (0.04s->1.05s) which I guess
>>>>        corresponds to when a checkpoint happens, but I'm not sure why the spike is so big,
>>>>        since the downtime isn't that big.
>>
>> Have you disabled DEBUG for colo proxy? I turned it on in default, is this related?
> 
> Yes, I've turned that off, I still get the big spikes; not looked why yet.
> 
>>>>     b) I tried something with more dynamic pages - the front page of a simple bugzilla
>>>>        install;  it failed the comparison every time; it took me a while to figure out
>>
>> Failed comprison ? Do you mean the net packets in these two sides are always inconsistent?
> 
> Yes.
> 
>>>>        why, but it generates a unique token in it's javascript each time (for a password reset
>>>>        link), and I guess the randomness used by that doesn't match on the two hosts.
>>>>        It surprised me, because I didn't expect this page to have much randomness
>>>>        in.
>>>>
>>>>   4a is really nice - it shows the benefit of COLO over the simple checkpointing;
>>>> checkpoints happen very rarely.
>>>>
>>>> The colo-proxy rcu problem I hit shows as rcu-stalls in both primary and secondary
>>>> after the qemu quits; the backtrace of the qemu stack is:
>>>
>>> How to reproduce it? Use monitor command quit to quit qemu? Or kill the qemu?
>>>
>>>>
>>>> [<ffffffff810d8c0c>] wait_rcu_gp+0x5c/0x80
>>>> [<ffffffff810ddb05>] synchronize_rcu+0x45/0xd0
>>>> [<ffffffffa0a251e5>] colo_node_release+0x35/0x50 [nfnetlink_colo]
>>>> [<ffffffffa0a25795>] colonl_close_event+0xe5/0x160 [nfnetlink_colo]
>>>> [<ffffffff81090c96>] notifier_call_chain+0x66/0x90
>>>> [<ffffffff8109154c>] atomic_notifier_call_chain+0x6c/0x110
>>>> [<ffffffff815eee07>] netlink_release+0x5b7/0x7f0
>>>> [<ffffffff815878bf>] sock_release+0x1f/0x90
>>>> [<ffffffff81587942>] sock_close+0x12/0x20
>>>> [<ffffffff812193c3>] __fput+0xd3/0x210
>>>> [<ffffffff8121954e>] ____fput+0xe/0x10
>>>> [<ffffffff8108d9f7>] task_work_run+0xb7/0xf0
>>>> [<ffffffff81002d4d>] do_notify_resume+0x8d/0xa0
>>>> [<ffffffff81722b66>] int_signal+0x12/0x17
>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> Thanks for your test. The backtrace is very useful, and we will fix it soon.
>>>
>>
>> Yes, it is a bug, the callback function colonl_close_event() is called when holding
>> rcu lock:
>> netlink_release
>>     ->atomic_notifier_call_chain
>>          ->rcu_read_lock();
>>          ->notifier_call_chain
>>             ->ret = nb->notifier_call(nb, val, v);
>> And here it is wrong to call synchronize_rcu which will lead to sleep.
>> Besides, there is another function might lead to sleep, kthread_stop which is called
>> in destroy_notify_cb.
>>
>>>>
>>>> that's with both the 423a8e268acbe3e644a16c15bc79603cfe9eb084 from yesterday and
>>>> older e58e5152b74945871b00a88164901c0d46e6365e tags on colo-proxy.
>>>> I'm not sure of the right fix; perhaps it might be possible to replace the
>>>> synchronize_rcu in colo_node_release by a call_rcu that does the kfree later?
>>>
>>> I agree with it.
>>
>> That is a good solution, i will fix both of the above problems.
> 
> Thanks,

We have fix this problem, and test it. The patch is pushed to github, please try it.

Thanks
Wen Congyang

> 
> Dave
> 
>>
>> Thanks,
>> zhanghailiang
>>
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Dave
>>>>
>>>>>
>>>
>>>
>>> .
>>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> .
> 


  reply	other threads:[~2015-05-29 12:30 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-05-21  8:12 [PATCH COLO-Frame v5 00/29] COarse-grain LOck-stepping(COLO) Virtual Machines for Non-stop Service zhanghailiang
2015-05-21  8:12 ` [Qemu-devel] " zhanghailiang
2015-05-21  8:12 ` [Qemu-devel] [PATCH COLO-Frame v5 01/29] configure: Add parameter for configure to enable/disable COLO support zhanghailiang
2015-05-21  8:12 ` [Qemu-devel] [PATCH COLO-Frame v5 02/29] migration: Introduce capability 'colo' to migration zhanghailiang
2015-05-21  8:12 ` [Qemu-devel] [PATCH COLO-Frame v5 03/29] COLO: migrate colo related info to slave zhanghailiang
2015-05-21  8:12 ` [Qemu-devel] [PATCH COLO-Frame v5 04/29] migration: Integrate COLO checkpoint process into migration zhanghailiang
2015-05-21  8:12 ` [Qemu-devel] [PATCH COLO-Frame v5 05/29] migration: Integrate COLO checkpoint process into loadvm zhanghailiang
2015-05-21  8:12 ` [Qemu-devel] [PATCH COLO-Frame v5 06/29] COLO: Implement colo checkpoint protocol zhanghailiang
2015-05-21  8:12 ` [Qemu-devel] [PATCH COLO-Frame v5 07/29] COLO: Add a new RunState RUN_STATE_COLO zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 08/29] QEMUSizedBuffer: Introduce two help functions for qsb zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 09/29] COLO: Save VM state to slave when do checkpoint zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 10/29] COLO RAM: Load PVM's dirty page into SVM's RAM cache temporarily zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 11/29] COLO VMstate: Load VM state into qsb before restore it zhanghailiang
2015-06-05 18:02   ` Dr. David Alan Gilbert
2015-06-09  2:19     ` zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 12/29] arch_init: Start to trace dirty pages of SVM zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 13/29] COLO RAM: Flush cached RAM into SVM's memory zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 14/29] COLO failover: Introduce a new command to trigger a failover zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 15/29] COLO failover: Implement COLO master/slave failover work zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 16/29] COLO failover: Don't do failover during loading VM's state zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 17/29] COLO: Add new command parameter 'colo_nicname' 'colo_script' for net zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 18/29] COLO NIC: Init/remove colo nic devices when add/cleanup tap devices zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 19/29] COLO NIC: Implement colo nic device interface configure() zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 20/29] COLO NIC : Implement colo nic init/destroy function zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 21/29] COLO NIC: Some init work related with proxy module zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 22/29] COLO: Handle nfnetlink message from " zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 23/29] COLO: Do checkpoint according to the result of packets comparation zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 24/29] COLO: Improve checkpoint efficiency by do additional periodic checkpoint zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 25/29] COLO: Add colo-set-checkpoint-period command zhanghailiang
2015-06-05 18:45   ` Dr. David Alan Gilbert
2015-06-09  3:28     ` zhanghailiang
2015-06-09  8:01       ` Dr. David Alan Gilbert
2015-06-09 10:14         ` zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 26/29] COLO NIC: Implement NIC checkpoint and failover zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 27/29] COLO: Disable qdev hotplug when VM is in COLO mode zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 28/29] COLO: Implement shutdown checkpoint zhanghailiang
2015-05-21  8:13 ` [Qemu-devel] [PATCH COLO-Frame v5 29/29] COLO: Add block replication into colo process zhanghailiang
2015-05-21 11:30 ` [PATCH COLO-Frame v5 00/29] COarse-grain LOck-stepping(COLO) Virtual Machines for Non-stop Service Dr. David Alan Gilbert
2015-05-21 11:30   ` [Qemu-devel] " Dr. David Alan Gilbert
2015-05-22  6:26   ` zhanghailiang
2015-05-22  6:26     ` [Qemu-devel] " zhanghailiang
2015-05-28 16:24 ` Dr. David Alan Gilbert
2015-05-28 16:24   ` [Qemu-devel] " Dr. David Alan Gilbert
2015-05-29  1:29   ` Wen Congyang
2015-05-29  1:29     ` [Qemu-devel] " Wen Congyang
2015-05-29  8:01     ` Dr. David Alan Gilbert
2015-05-29  8:01       ` [Qemu-devel] " Dr. David Alan Gilbert
2015-05-29  8:06     ` zhanghailiang
2015-05-29  8:06       ` [Qemu-devel] " zhanghailiang
2015-05-29  8:42       ` Dr. David Alan Gilbert
2015-05-29  8:42         ` [Qemu-devel] " Dr. David Alan Gilbert
2015-05-29 12:34         ` Wen Congyang [this message]
2015-05-29 15:12           ` Dr. David Alan Gilbert
2015-05-29 15:12             ` [Qemu-devel] " Dr. David Alan Gilbert
2015-06-01  1:41         ` Wen Congyang
2015-06-01  1:41           ` [Qemu-devel] " Wen Congyang
2015-06-01  9:16           ` Dr. David Alan Gilbert
2015-06-01  9:16             ` [Qemu-devel] " Dr. David Alan Gilbert
2015-06-02  3:51   ` Wen Congyang
2015-06-02  3:51     ` [Qemu-devel] " Wen Congyang
2015-06-02  8:02     ` Dr. David Alan Gilbert
2015-06-02  8:02       ` [Qemu-devel] " Dr. David Alan Gilbert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=55685CCA.2010604@cn.fujitsu.com \
    --to=wency@cn.fujitsu.com \
    --cc=amit.shah@redhat.com \
    --cc=arei.gonglei@huawei.com \
    --cc=berrange@redhat.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=dgilbert@redhat.com \
    --cc=eblake@redhat.com \
    --cc=eddie.dong@intel.com \
    --cc=lizhijian@cn.fujitsu.com \
    --cc=netfilter-devel@vger.kernel.org \
    --cc=peter.huangpeng@huawei.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=yunhong.jiang@intel.com \
    --cc=zhang.zhanghailiang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.