Hi Dave, Zhang,

Thanks for your help. I will try your recommendations.

Best Regards,
Daniel Cho

On Wed, Dec 4, 2019 at 4:32 PM, Zhang, Chen wrote:

> On 12/3/2019 9:25 PM, Dr. David Alan Gilbert wrote:
> > * Daniel Cho (danielcho@qnap.com) wrote:
> >> Hi Dave,
> >>
> >> We could use the existing interface to add the netfilter and chardev;
> >> it might not have the problem you described.
> >>
> >> However, if the netfilter and chardev must be on the primary at the
> >> start, does that mean we cannot dynamically enable the COLO feature
> >> on a VM?
> > I wasn't expecting that to be possible - I'd expect you to be able
> > to start in a state that looks the same as a primary+failed secondary;
> > but I'm not sure.
>
> Current COLO (with Lukas's patch) can support dynamically setting up a
> COLO system. This state is the same as when the secondary has triggered
> failover: the primary node needs to find a new secondary node to form a
> new COLO system.
>
> >> We tried to change this chardev to prevent the primary VM from getting
> >> stuck waiting for the secondary VM.
> >>
> >> -chardev socket,id=compare1,host=127.0.0.1,port=9004,server,wait \
> >>
> >> to
> >>
> >> -chardev socket,id=compare1,host=127.0.0.1,port=9004,server,nowait \
> >>
> >> But this makes the primary VM's network stop working (it can't get an
> >> IP) until it starts connecting with the secondary VM.
>
> I think you need to check whether port 9004 is already connected to the
> pair node.
>
> > I'm not sure of the answer to this; I've not tried doing it - I'm not
> > sure anyone has!
> > But, the colo components do track the state of tcp connections, so I'm
> > expecting that they have to already exist to have the state of those
> > connections available for when you start the secondary.
>
> Yes, you are right.
>
> For this state, we don't need to sync the state of tcp connections,
> because after failover (or with just a solo COLO primary node), we have
> emptied all the tcp connection state in the COLO module.
>
> In the process of building a new COLO pair, we will sync all the VM
> state to the secondary node and rebuild the tracking state in the COLO
> module.
>
> >> Also, the primary VM takes very different boot times with and without
> >> the netfilter / chardev.
> >> Without netfilter / chardev : about 1 min
> >> With netfilter / chardev : about 5 mins
> >> Is this an issue?
> > that sounds like it needs investigating.
> >
> > Dave
>
> Yes. In previous COLO use cases, we needed to make the primary node and
> the secondary node boot at the same time.
> I didn't expect such a big difference from the netfilter/chardev.
> I think you can try booting without the netfilter/chardev and, after the
> VM has booted, rebuild the netfilter/chardev related work with the
> chardev server set to nowait.
>
> Thanks
> Zhang Chen
>
> >> Best regards,
> >> Daniel Cho
> >>
> >> On Mon, Dec 2, 2019 at 5:58 PM, Dr. David Alan Gilbert wrote:
> >>
> >>> * Daniel Cho (danielcho@qnap.com) wrote:
> >>>> Hi Zhang,
> >>>>
> >>>> We use the qemu-4.1.0 release in this case.
> >>>>
> >>>> I think we need to use block mirror to sync the disk to the secondary
> >>>> node first, then stop the primary VM and build the COLO system.
> >>>>
> >>>> At the moment of stopping, you need to add some netfilter and chardev
> >>>> socket nodes for COLO; maybe you need to re-check this part.
> >>>>
> >>>> Our test already follows those steps. Let me describe the test flow
> >>>> and the issues in detail.
> >>>>
> >>>> Step 1:
> >>>>
> >>>> Create the primary VM without any netfilter or chardev for COLO, and
> >>>> ping the primary VM continually from another host.
> >>>>
> >>>> Step 2:
> >>>>
> >>>> Create the secondary VM (with the same devices/drives as the primary
> >>>> VM), and do the block mirror sync ( ping to primary VM works
> >>>> normally )
> >>>>
> >>>> Step 3:
> >>>>
> >>>> After the block mirror sync finishes, add the netfilter and chardev
> >>>> to the primary VM and secondary VM for COLO ( *Can't* ping the
> >>>> primary VM, but those packets will be received later )
> >>>>
> >>>> Step 4:
> >>>>
> >>>> Start migrating the primary VM to the secondary VM; the primary VM
> >>>> and secondary VM are both running ( ping to primary VM works and the
> >>>> packets from step 3 are received )
> >>>>
> >>>> Between Step 3 and Step 4, it takes 10~20 seconds in our environment.
> >>>>
> >>>> I could imagine this issue (delayed reply packets) is caused by the
> >>>> COLO proxy being set up in a temporary state,
> >>>> but we thought 10~20 seconds might be a little long. (If the primary
> >>>> VM is already doing some jobs, it might lose data.)
> >>>>
> >>>> Could we reduce this time? Or does the delay depend on the VM?
> >>> I think you need to set up the netfilter and chardev on the primary at
> >>> the start; the filter contains the state of the TCP connections working
> >>> with the VM, so adding it later can't gain that state for existing
> >>> connections.
> >>>
> >>> Dave
> >>>
> >>>> Best Regards,
> >>>>
> >>>> Daniel Cho.
> >>>>
> >>>> On Sat, Nov 30, 2019 at 2:04 AM, Zhang, Chen wrote:
> >>>>
> >>>>> *From:* Daniel Cho
> >>>>> *Sent:* Friday, November 29, 2019 10:43 AM
> >>>>> *To:* Zhang, Chen
> >>>>> *Cc:* Dr. David Alan Gilbert; lukasstraub2@web.de;
> >>>>> qemu-devel@nongnu.org
> >>>>> *Subject:* Re: Network connection with COLO VM
> >>>>>
> >>>>> Hi David, Zhang,
> >>>>>
> >>>>> Thanks for replying to my question.
> >>>>>
> >>>>> We now know why this issue occurs.
> >>>>>
> >>>>> As you said, the COLO VM's network needs colo-proxy to control
> >>>>> packets, so the filter should be set on the guest's interface to
> >>>>> solve the problem.
> >>>>>
> >>>>> But we found another problem: when we enable the fault-tolerance
> >>>>> feature on the guest (primary VM is running, secondary VM is paused),
> >>>>> the guest's network does not respond to any request for a while (in
> >>>>> our environment about 20~30 secs) after the secondary VM runs.
> >>>>>
> >>>>> Is this a normal situation, or a known issue?
> >>>>>
> >>>>> Our test runs the primary VM for a while, then creates the secondary
> >>>>> VM to enable the COLO feature.
> >>>>>
> >>>>> Hi Daniel,
> >>>>>
> >>>>> Happy to hear you have solved the ssh disconnection issue.
> >>>>>
> >>>>> Do you use Lukas’s patch in this case?
> >>>>>
> >>>>> I think we need to use block mirror to sync the disk to the secondary
> >>>>> node first, then stop the primary VM and build the COLO system.
> >>>>>
> >>>>> At the moment of stopping, you need to add some netfilter and chardev
> >>>>> socket nodes for COLO; maybe you need to re-check this part.
> >>>>>
> >>>>> Best Regards,
> >>>>>
> >>>>> Daniel Cho
> >>>>>
> >>>>> On Thu, Nov 28, 2019 at 9:26 AM, Zhang, Chen wrote:
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Dr. David Alan Gilbert
> >>>>>> Sent: Wednesday, November 27, 2019 6:51 PM
> >>>>>> To: Daniel Cho; Zhang, Chen; lukasstraub2@web.de
> >>>>>> Cc: qemu-devel@nongnu.org
> >>>>>> Subject: Re: Network connection with COLO VM
> >>>>>>
> >>>>>> * Daniel Cho (danielcho@qnap.com) wrote:
> >>>>>>> Hello everyone,
> >>>>>>>
> >>>>>>> Could we ssh to a colo VM (meaning PVM & SVM are both started)?
> >>>>>>>
> >>>>>> Lets cc in Zhang Chen and Lukas Straub.
> >>>>> Thanks Dave.
> >>>>>
> >>>>>>> SSH connects to the colo VM for a while, but then it disconnects
> >>>>>>> with the error
> >>>>>>> *client_loop: send disconnect: Broken pipe*
> >>>>>>>
> >>>>>>> It seems the colo VM cannot keep a network session.
> >>>>>>>
> >>>>>>> Is this a known issue?
> >>>>>> That sounds like the COLO proxy is getting upset; it's supposed to
> >>>>>> compare packets sent by the primary and secondary and only send one
> >>>>>> to the outside - you shouldn't be talking directly to the guest, but
> >>>>>> always via the proxy. See docs/colo-proxy.txt
> >>>>>>
> >>>>> Hi Daniel,
> >>>>>
> >>>>> I have tried ssh to a COLO guest for 8 hours, and this issue did not
> >>>>> occur. Please check your network/qemu configuration.
> >>>>> But I found another problem that may be related to this issue: if
> >>>>> there is no network communication for a period of time (maybe 10 min),
> >>>>> the first message sent to the guest may be delayed (maybe 1-5 sec).
> >>>>> I will try to fix it when I have time.
> >>>>>
> >>>>> Thanks
> >>>>> Zhang Chen
> >>>>>
> >>>>>> Dave
> >>>>>>
> >>>>>>> Best Regards,
> >>>>>>> Daniel Cho
> >>>>>> --
> >>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>>>
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
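
On the suggestion in this thread to check whether port 9004 is already connected to the pair node: a minimal sketch, assuming a Linux host and IPv4 only, is to read the kernel's TCP table instead of opening a test connection, since a probe connection could itself be accepted by the chardev server socket. The function name `has_established_peer` is hypothetical and not part of QEMU; 9004 is the compare1 chardev port from the command line quoted in the thread.

```python
# Sketch only: inspects text in the format of Linux /proc/net/tcp (IPv4).
# has_established_peer is a hypothetical helper, not a QEMU API.

TCP_ESTABLISHED = "01"  # state code for ESTABLISHED in /proc/net/tcp


def has_established_peer(proc_net_tcp: str, port: int) -> bool:
    """Return True if an ESTABLISHED socket has `port` as its local port."""
    for line in proc_net_tcp.splitlines()[1:]:  # first row is the header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated rows
        # fields[1] is "HEXIP:HEXPORT" for the local end, fields[3] the state
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port and fields[3] == TCP_ESTABLISHED:
            return True
    return False
```

On the primary host this could be called as `has_established_peer(open("/proc/net/tcp").read(), 9004)`. A socket that is only listening (state `0A`) is reported as not connected, which would match the "nowait" case before the secondary has attached.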