Re: Network connection with COLO VM

From: Daniel Cho <danielcho@qnap.com>
To: "Zhang, Chen" <chen.zhang@intel.com>
Cc: "lukasstraub2@web.de" <lukasstraub2@web.de>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Subject: Re: Network connection with COLO VM
Date: Fri, 6 Dec 2019 14:31:13 +0800
Message-ID: <CA+XQNE6+KALuWf1NqOg_KjET1XcBsudb9tSBFGJiEN_-JxVbtw@mail.gmail.com>
In-Reply-To: <f6bf1e64-a66e-9df8-04f6-b543753eda79@intel.com>


Hi Dave, Zhang,

Thanks for your help. I will try your recommendations.

Best regards,
Daniel Cho

Zhang, Chen <chen.zhang@intel.com> wrote on Wed, Dec 4, 2019 at 4:32 PM:

>
> On 12/3/2019 9:25 PM, Dr. David Alan Gilbert wrote:
> > * Daniel Cho (danielcho@qnap.com) wrote:
> >> Hi Dave,
> >>
> >> We could use the existing interface to add the netfilter and chardev, so
> >> it might not have the problem you described.
> >>
> >> However, if the netfilter and chardev must be on the primary from the
> >> start, does that mean we cannot enable the COLO feature on a VM
> >> dynamically?
> > I wasn't expecting that to be possible - I'd expect you to be able
> > to start in a state that looks the same as a primary+failed secondary;
> > but I'm not sure.
>
> Current COLO (with Lukas's patch) supports setting up a COLO system
> dynamically.
>
> This state is the same as when the secondary has triggered failover: the
> primary node needs to find a new secondary node to form a new COLO system.
>
>
> >> We tried changing this chardev so that the primary VM does not get stuck
> >> waiting for the secondary VM.
> >>
> >> -chardev socket,id=compare1,host=127.0.0.1,port=9004,server,wait \
> >>
> >> to
> >>
> >> -chardev socket,id=compare1,host=127.0.0.1,port=9004,server,nowait \
> >>
> >> But then the primary VM's network does not work (it can't get an IP
> >> address) until it connects with the secondary VM.
>
> I think you need to check whether port 9004 is already connected to the
> paired node.
>
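
One way to check this without connecting to the port (which could take the chardev's only client slot) is to read the kernel's socket table. The following is a minimal sketch that assumes a Linux host and an IPv4 listener; the function name and the sample table are hypothetical and not part of QEMU or COLO.

```python
# Sketch only: inspect /proc/net/tcp-style text (Linux-specific) for sockets
# bound to a given local port. Function name and sample data are hypothetical.

TCP_STATES = {"01": "ESTABLISHED", "0A": "LISTEN"}  # subset of kernel state codes


def states_for_local_port(proc_net_tcp_text, port):
    """Return the set of TCP states seen for sockets whose local port matches."""
    found = set()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_port = int(fields[1].rsplit(":", 1)[1], 16)  # port is in hex
        if local_port == port:
            # Unknown state codes are returned as the raw hex string.
            found.add(TCP_STATES.get(fields[3], fields[3]))
    return found


# Hypothetical table: port 9004 (0x232C) listening, plus one connected peer.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:232C 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12345
   1: 0100007F:232C 0100007F:C350 01 00000000:00000000 00:00000000 00000000     0        0 12346
"""

if __name__ == "__main__":
    print(sorted(states_for_local_port(SAMPLE, 9004)))
```

On a real host the text would come from reading /proc/net/tcp. Seeing only LISTEN for port 9004 would suggest that no peer has connected yet; IPv6 listeners appear in /proc/net/tcp6 instead.
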
> > I'm not sure of the answer to this; I've not tried doing it - I'm not
> > sure anyone has!
> > But, the colo components do track the state of tcp connections, so I'm
> > expecting that they have to already exist to have the state of those
> > connections available for when you start the secondary.
>
> Yes, you are right.
>
> In this state we don't need to sync the state of TCP connections, because
> after failover (or with just a solo COLO primary node) all TCP connection
> state in the COLO module has been emptied.
>
> While building a new COLO pair, we sync all the VM state to the secondary
> node and rebuild the connection tracking in the COLO module.
>
>
> >
> >
> >> Also, the primary VM takes very different times to boot with and
> >> without the netfilter / chardev:
> >> Without netfilter / chardev: about 1 min
> >> With netfilter / chardev: about 5 mins
> >> Is this an issue?
> > that sounds like it needs investigating.
> >
> > Dave
>
> Yes. In previous COLO use cases, we needed to boot the primary node and
> the secondary node at the same time.
>
> I didn't expect such a big difference from the netfilter/chardev.
>
> I think you can try booting without the netfilter/chardev and, after the
> VM has booted, set up the netfilter/chardev again with the chardev server
> in nowait mode.
>
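
For adding the chardev after boot, a minimal sketch of the QMP message is shown here. It assumes the `chardev-add` command with a socket backend whose `server` and `wait` fields correspond to the `server,nowait` command-line options; the field layout is an assumption based on the QMP schema of QEMU 4.x and should be checked against qapi/char.json for the version in use. The helper name is hypothetical.

```python
import json


def chardev_add_nowait(chardev_id, host, port):
    """Build a QMP chardev-add command for a listening socket that does not wait."""
    return {
        "execute": "chardev-add",
        "arguments": {
            "id": chardev_id,
            "backend": {
                "type": "socket",
                "data": {
                    # Assumed address layout for this QEMU generation.
                    "addr": {"type": "inet", "data": {"host": host, "port": str(port)}},
                    "server": True,
                    "wait": False,  # counterpart of "nowait" on the command line
                },
            },
        },
    }


if __name__ == "__main__":
    # Values taken from the -chardev line quoted earlier in this thread.
    print(json.dumps(chardev_add_nowait("compare1", "127.0.0.1", 9004)))
```

The printed JSON would be sent over the VM's QMP socket after capability negotiation. The netfilter objects would need separate commands, which this sketch does not cover.
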
>
> Thanks
>
> Zhang Chen
>
> >
> >> Best regards,
> >> Daniel Cho
> >>
> >>
> >> Dr. David Alan Gilbert <dgilbert@redhat.com> wrote on Mon, Dec 2, 2019 at 5:58 PM:
> >>
> >>> * Daniel Cho (danielcho@qnap.com) wrote:
> >>>> Hi Zhang,
> >>>>
> >>>> We use the qemu-4.1.0 release in this case.
> >>>>
> >>>> I think we need to use block mirror to sync the disk to the secondary
> >>>> node first, then stop the primary VM and build the COLO system.
> >>>>
> >>>> At the moment of stopping, you need to add some netfilter and chardev
> >>>> socket nodes for COLO; maybe you need to re-check this part.
> >>>>
> >>>>
> >>>> Our test already followed those steps. Let me describe the test flow
> >>>> and the issues in detail.
> >>>>
> >>>>
> >>>> Step 1:
> >>>>
> >>>> Create the primary VM without any netfilter or chardev for COLO, and
> >>>> ping the primary VM continually from another host.
> >>>>
> >>>>
> >>>> Step 2:
> >>>>
> >>>> Create the secondary VM (with the same devices/drives as the primary
> >>>> VM), and do the block mirror sync ( ping to the primary VM works normally )
> >>>>
> >>>>
> >>>> Step 3:
> >>>>
> >>>> After the block mirror sync finishes, add the netfilter and chardev to
> >>>> the primary VM and secondary VM for COLO ( *Can't* ping the primary VM,
> >>>> but those packets will be received later )
> >>>>
> >>>>
> >>>> Step 4:
> >>>>
> >>>> Start migrating the primary VM to the secondary VM; the primary VM and
> >>>> secondary VM are both running ( ping to the primary VM works, and the
> >>>> packets from the step 3 state are received )
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Between Step 3 and Step 4, it takes 10~20 seconds in our environment.
> >>>>
> >>>> I can imagine this issue (delayed reply packets) is caused by the COLO
> >>>> proxy being set up in a temporary state,
> >>>>
> >>>> but we think 10~20 seconds might be a little long. (If the primary VM is
> >>>> already doing some work, it might lose data.)
> >>>>
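
To put a number on the outage between Step 3 and Step 4, one option is to record when each ping reply arrives and look for the largest gap. This is a minimal sketch; the function name and the timestamps are hypothetical illustration data, not measurements from this thread.

```python
def longest_gap(reply_times):
    """Return (gap_seconds, start_time) for the largest gap between replies."""
    if len(reply_times) < 2:
        return (0.0, None)
    ordered = sorted(reply_times)
    # Pair each reply time with the next one and keep the gap and its start.
    gaps = [(later - earlier, earlier) for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps)


if __name__ == "__main__":
    # Hypothetical data: replies every second, with a pause from t=3 to t=18.
    times = [0.0, 1.0, 2.0, 3.0, 18.0, 19.0, 20.0]
    print(longest_gap(times))
```

Comparing this gap across runs, or across different VMs, would show whether the delay is stable or depends on the guest.
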
> >>>>
> >>>> Could we reduce this time? Or does the delay depend on the VM?
> >>> I think you need to set up the netfilter and chardev on the primary at
> >>> the start; the filter contains the state of the TCP connections working
> >>> with the VM, so adding it later can't gain that state for existing
> >>> connections.
> >>>
> >>> Dave
> >>>
> >>>> Best regards,
> >>>>
> >>>> Daniel Cho.
> >>>>
> >>>>
> >>>>
> >>>> Zhang, Chen <chen.zhang@intel.com> wrote on Sat, Nov 30, 2019 at 2:04 AM:
> >>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> *From:* Daniel Cho <danielcho@qnap.com>
> >>>>> *Sent:* Friday, November 29, 2019 10:43 AM
> >>>>> *To:* Zhang, Chen <chen.zhang@intel.com>
> >>>>> *Cc:* Dr. David Alan Gilbert <dgilbert@redhat.com>; lukasstraub2@web.de;
> >>>>> qemu-devel@nongnu.org
> >>>>> *Subject:* Re: Network connection with COLO VM
> >>>>>
> >>>>>
> >>>>>
> >>>>> Hi David, Zhang,
> >>>>>
> >>>>> Thanks for replying to my question.
> >>>>>
> >>>>> We now know why this issue occurs.
> >>>>>
> >>>>> As you said, the COLO VM's network needs colo-proxy to control packets,
> >>>>> so the filter should be set on the guest's interface to solve the problem.
> >>>>>
> >>>>> But we found another problem: when we enable the fault-tolerance feature
> >>>>> on the guest (primary VM is running, secondary VM is paused), the guest's
> >>>>> network does not respond to any request for a while (about 20~30 secs in
> >>>>> our environment) after the secondary VM starts running.
> >>>>>
> >>>>> Is this a normal situation, or a known issue?
> >>>>>
> >>>>> Our test runs the primary VM for a while, then creates the secondary VM
> >>>>> to enable the COLO feature.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Hi Daniel,
> >>>>>
> >>>>>
> >>>>>
> >>>>> Happy to hear you have solved the ssh disconnection issue.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Do you use Lukas’s patch on this case?
> >>>>>
> >>>>> I think we need to use block mirror to sync the disk to the secondary
> >>>>> node first, then stop the primary VM and build the COLO system.
> >>>>>
> >>>>> At the moment of stopping, you need to add some netfilter and chardev
> >>>>> socket nodes for COLO; maybe you need to re-check this part.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Best regards,
> >>>>>
> >>>>> Daniel Cho
> >>>>>
> >>>>>
> >>>>>
> >>>>> Zhang, Chen <chen.zhang@intel.com> wrote on Thu, Nov 28, 2019 at 9:26 AM:
> >>>>>
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >>>>>> Sent: Wednesday, November 27, 2019 6:51 PM
> >>>>>> To: Daniel Cho <danielcho@qnap.com>; Zhang, Chen
> >>>>>> <chen.zhang@intel.com>; lukasstraub2@web.de
> >>>>>> Cc: qemu-devel@nongnu.org
> >>>>>> Subject: Re: Network connection with COLO VM
> >>>>>>
> >>>>>> * Daniel Cho (danielcho@qnap.com) wrote:
> >>>>>>> Hello everyone,
> >>>>>>>
> >>>>>>> Can we ssh to a COLO VM (meaning the PVM and SVM are both running)?
> >>>>>>>
> >>>>>> Let's cc in Zhang Chen and Lukas Straub.
> >>>>> Thanks Dave.
> >>>>>
> >>>>>>> SSH connects to the COLO VM for a while, but then disconnects with
> >>>>>>> the error
> >>>>>>> *client_loop: send disconnect: Broken pipe*
> >>>>>>>
> >>>>>>> It seems the COLO VM could not keep the network session.
> >>>>>>>
> >>>>>>> Is this a known issue?
> >>>>>> That sounds like the COLO proxy is getting upset; it's supposed to
> >>>>>> compare packets sent by the primary and secondary and only send one to
> >>>>>> the outside - you shouldn't be talking directly to the guest, but
> >>>>>> always via the proxy.  See docs/colo-proxy.txt
> >>>>>>
> >>>>> Hi Daniel,
> >>>>>
> >>>>> I have tried ssh to a COLO guest for 8 hours and did not hit this issue.
> >>>>> Please check your network/qemu configuration.
> >>>>> But I found another problem that may be related to this issue: if there
> >>>>> is no network communication for a period of time (maybe 10 min), the
> >>>>> first message sent to the guest may be delayed (maybe 1-5 sec). I will
> >>>>> try to fix it when I have time.
> >>>>>
> >>>>> Thanks
> >>>>> Zhang Chen
> >>>>>
> >>>>>> Dave
> >>>>>>
> >>>>>>> Best regards,
> >>>>>>> Daniel Cho
> >>>>>> --
> >>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>>>
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>
> >>>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
>
