From: "Li, Liang Z" <liang.z.li@intel.com>
To: Chunguang Li <lichunguang@hust.edu.cn>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Amit Shah <amit.shah@redhat.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"stefanha@redhat.com" <stefanha@redhat.com>,
	"quintela@redhat.com" <quintela@redhat.com>
Subject: Re: [Qemu-devel] Migration dirty bitmap: should only mark pages as dirty after they have been sent
Date: Mon, 7 Nov 2016 14:17:27 +0000	[thread overview]
Message-ID: <F2CBF3009FA73547804AE4C663CAB28E3A10782D@shsmsx102.ccr.corp.intel.com> (raw)
In-Reply-To: <6c320b.9d63.1583f0fe90e.Coremail.lichunguang@hust.edu.cn>

> > > > > > > > > I think this is "very" wasteful. Assume the workload
> > > > > > > > > dirties pages randomly within the guest address space,
> > > > > > > > > and the transfer speed is constant. Intuitively, I think
> > > > > > > > > nearly half of the dirty pages produced in Iteration 1
> > > > > > > > > are not really dirty. This means Iteration 2 takes twice
> > > > > > > > > as long as sending only the really dirty pages would.
> > > > > > > >
> > > > > > > > It makes sense. Can you get some perf numbers to show what
> > > > > > > > kinds of workloads get impacted the most?  That would also
> > > > > > > > help us figure out what kinds of speed improvements we can
> > > > > > > > expect.
> > > > > > > >
> > > > > > > >
> > > > > > > > 		Amit
> > > > > > >
> > > > > > > I picked 6 workloads and got the following statistics for
> > > > > > > every iteration (except the last stop-copy one) during
> > > > > > > precopy.
> > > > > > > These numbers were obtained with basic precopy migration,
> > > > > > > without capabilities like xbzrle or compression, etc.
> > > > > > > The migration network is exclusive, with a separate network
> > > > > > > for the workloads. Both are gigabit Ethernet. I use qemu-2.5.1.
> > > > > > >
> > > > > > > Three of them (booting, idle, web server) converged to the
> > > > > > > stop-copy phase with the given bandwidth and default downtime
> > > > > > > (300ms), while the other three (kernel compilation, zeusmp,
> > > > > > > memcached) did not.
> > > > > > >
> > > > > > > One page is "not-really-dirty" if it is written first and
> > > > > > > sent later (and not written again after that) during one
> > > > > > > iteration. I guess this would not happen as often during the
> > > > > > > other iterations as during the 1st, because all the pages of
> > > > > > > the VM are sent to the dest node during the 1st iteration,
> > > > > > > while during the others only part of the pages are sent.
> > > > > > > So I think the "not-really-dirty" pages are produced mainly
> > > > > > > during the 1st iteration, and maybe very few during the
> > > > > > > other iterations.
> > > > > > >
> > > > > > > If we could avoid resending the "not-really-dirty" pages,
> > > > > > > intuitively, I think the time spent on Iteration 2 would be
> > > > > > > halved. This is a chain reaction, because the dirty pages
> > > > > > > produced during Iteration 2 are halved, which means the time
> > > > > > > spent on Iteration 3 is halved, then Iteration 4, 5...
> > > > > >
> > > > > > Yes; these numbers don't show how many of them are false
> > > > > > dirty, though.
> > > > > >
> > > > > > One problem is thinking about pages that have been redirtied.
> > > > > > If the page is dirtied after the sync but before the network
> > > > > > write, then it's the false-dirty that you're describing.
> > > > > >
> > > > > > However, if the page is being written a few times, and so
> > > > > > would have been written after the network write anyway, then
> > > > > > it isn't a false-dirty.
> > > > > >
> > > > > > You might be able to figure that out with some kernel tracing
> > > > > > of when the dirtying happens, but it might be easier to write
> > > > > > the fix!
> > > > > >
> > > > > > Dave
> > > > >
> > > > > Hi, I have made some new progress now.
> > > > >
> > > > > To tell exactly how many false dirty pages there are in each
> > > > > iteration, I malloc a buffer in memory as big as the whole VM
> > > > > memory. When a page is transferred to the dest node, it is
> > > > > copied to the buffer; during the next iteration, if a page is
> > > > > transferred, it is compared to the old one in the buffer, and
> > > > > the old one is replaced for the next comparison if the page is
> > > > > really dirty. Thus, we are now able to get the exact number of
> > > > > false dirty pages.
> > > > >
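The shadow-buffer accounting described above can be sketched as follows (a Python illustration, not the actual measurement code used here; the class and method names are invented for clarity):

```python
# Sketch (not the actual patch) of the shadow-buffer accounting described
# above: keep a copy of every transferred page, compare on retransmit.
PAGE_SIZE = 4096  # assumed guest page size

class FalseDirtyCounter:
    def __init__(self, num_pages):
        # Shadow buffer as big as the whole VM memory (one slot per page).
        self.shadow = [None] * num_pages
        self.false_dirty = 0
        self.really_dirty = 0

    def on_page_sent(self, page_index, content):
        old = self.shadow[page_index]
        if old is not None and old == content:
            # Bitmap said dirty, but the content is unchanged: false dirty.
            self.false_dirty += 1
        else:
            # Really dirty: replace the old copy for the next comparison.
            self.really_dirty += 1
            self.shadow[page_index] = content
```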
> > > > > This time, I use 15 workloads to get the statistics. They are:
> > > > >
> > > > >   1. 11 benchmarks picked from the cpu2006 benchmark suite. They
> > > > >      are all scientific computing workloads like Quantum
> > > > >      Chromodynamics, Fluid Dynamics, etc. I picked these 11
> > > > >      benchmarks because, compared to the others, they have a
> > > > >      larger memory footprint and a higher memory dirty rate.
> > > > >      Thus most of them could not converge to stop-and-copy using
> > > > >      the default migration speed (32MB/s).
> > > > >   2. kernel compilation
> > > > >   3. idle VM
> > > > >   4. Apache web server which serves static content
> > > > >
> > > > >   (the above workloads all run in a VM with 1 vcpu and 1GB
> > > > >    memory, and the migration speed is the default 32MB/s)
> > > > >
> > > > >   5. Memcached. The VM has 6 cpu cores and 6GB memory, and 4GB
> > > > >      are used as the cache. After filling up the 4GB cache, a
> > > > >      client writes the cache at a constant speed during
> > > > >      migration. This time, the migration speed has no limit and
> > > > >      is up to the capability of 1Gbps Ethernet.
> > > > >
> > > > > To summarize the results first (you can read the precise
> > > > > numbers below):
> > > > >
> > > > >   1. 4 of these 15 workloads have a big proportion (>60%, even
> > > > >      >80% during some iterations) of false dirty pages out of
> > > > >      all the dirty pages since iteration 2 (and the big
> > > > >      proportion lasts through the following iterations). They
> > > > >      are cpu2006.zeusmp, cpu2006.bzip2, cpu2006.mcf, and
> > > > >      memcached.
> > > > >   2. 2 workloads (idle, webserver) spend most of the migration
> > > > >      time on iteration 1; even though the proportion of false
> > > > >      dirty pages is big since iteration 2, the space to optimize
> > > > >      is small.
> > > > >   3. 1 workload (kernel compilation) only has a big proportion
> > > > >      during iteration 2, not in the other iterations.
> > > > >   4. 8 workloads (the other 8 benchmarks of cpu2006) have a
> > > > >      small proportion of false dirty pages since iteration 2, so
> > > > >      the space to optimize for them is small.
> > > > >
> > > > > Now I want to say a little more about why false dirty pages are
> > > > > produced. The first reason is what we discussed before---the
> > > > > mechanism used to track dirty pages. Then I came up with another
> > > > > reason. Here is the situation: a write operation to a memory
> > > > > page happens, but it doesn't change any content of the page. So
> > > > > it's "write but not dirty", yet the kernel still marks the page
> > > > > as dirty. A colleague in our lab has done some experiments to
> > > > > figure out the proportion of "write but not dirty" operations,
> > > > > using the cpu2006 benchmark suite. According to his results,
> > > > > general workloads have a small proportion (<10%) of "write but
> > > > > not dirty" out of all write operations, while a few workloads
> > > > > have a higher proportion (one even as high as 50%). We are not
> > > > > sure yet why "write but not dirty" happens; it just does.
> > > > >
> > > > > So these two reasons contribute to the false dirty pages. To
> > > > > optimize, I compute and store the SHA1 hash before transferring
> > > > > each page. Next time, if a page needs retransmission, its SHA1
> > > > > hash is computed again and compared to the old hash. If the
> > > > > hash is the same, it's a false dirty page, and we just skip
> > > > > this page; otherwise, the page is transferred, and the new hash
> > > > > replaces the old one for the next comparison.
> > > > > The reason to use a SHA1 hash rather than byte-by-byte
> > > > > comparison is memory overhead. One SHA1 hash is 20 bytes, so we
> > > > > need an extra 20/4096 (<1/200) of the whole VM memory, which is
> > > > > relatively small.
> > > > > As far as I know, SHA1 hashes are widely used in deduplication
> > > > > for backup systems. It has been shown there that the probability
> > > > > of a hash collision is far smaller than that of a disk hardware
> > > > > fault, so it is treated as a secure hash; that is, if the hashes
> > > > > of two chunks are the same, the content is assumed to be the
> > > > > same. So I think the SHA1 hash can replace byte-by-byte
> > > > > comparison in the VM memory scenario.
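The skip logic described above could look roughly like this (a Python sketch using the stdlib `hashlib` for illustration; the real change would live in QEMU's C migration code, and the names here are invented):

```python
import hashlib

# Illustrative sketch of the SHA1-based retransmission filter described
# above; not the actual QEMU patch.
PAGE_SIZE = 4096

class HashFilter:
    def __init__(self):
        # One 20-byte SHA1 digest per page: <1/200 of guest RAM overall.
        self.digests = {}

    def should_send(self, page_index, content):
        digest = hashlib.sha1(content).digest()
        if self.digests.get(page_index) == digest:
            return False  # hash unchanged: false dirty page, skip it
        # Really dirty: store the new digest for the next comparison.
        self.digests[page_index] = digest
        return True
```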
> > > > >
> > > > > Then I ran the same migration experiments using the SHA1 hash.
> > > > > For the 4 workloads which have big proportions of false dirty
> > > > > pages, the improvement is remarkable. Without the optimization,
> > > > > they either cannot converge to stop-and-copy, or take a very
> > > > > long time to complete. With the SHA1 hash method, all of them
> > > > > now complete in a relatively short time.
> > > > > For the reason discussed above, the other workloads don't get
> > > > > notable improvements from the optimization. So below, I only
> > > > > show the exact numbers after optimization for the 4 workloads
> > > > > with remarkable improvements.
> > > > >
> > > > > Any comments or suggestions?
> > > >
> > > > Maybe you can compare the performance of your solution with that
> > > > of XBZRLE to see which one is better.
> > > > The merit of using SHA1 is that it can avoid the data copying done
> > > > in XBZRLE, and it needs a smaller buffer.
> > > > How about the overhead of calculating the SHA1? Is it faster than
> > > > copying a page?
> > > >
> > > > Liang
> > > >
> > > >
> > >
> > > Yes, XBZRLE is able to handle the false dirty pages. However, if we
> > > want to avoid transferring all of the false dirty pages using
> > > XBZRLE, we need a buffer as big as the whole VM memory, while SHA1
> > > needs a much smaller buffer. Of course, if we have a buffer as big
> > > as the whole VM memory, XBZRLE could transfer less data over the
> > > network than SHA1, because XBZRLE is able to compress similar
> > > pages. In short, yes, the merit of using SHA1 is that it needs much
> > > less buffer, and leads to a nice improvement when there are many
> > > false dirty pages.
> > >
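To make the buffer trade-off above concrete, a back-of-the-envelope calculation (assuming a 1GB guest, 4096-byte pages, and 20-byte SHA1 digests, as in the discussion):

```python
# Back-of-the-envelope buffer-size comparison for a 1GB guest (assumed
# figures, matching the page and digest sizes mentioned in the thread).
PAGE_SIZE = 4096
guest_ram = 1 << 30                 # 1GB guest RAM
num_pages = guest_ram // PAGE_SIZE  # 262144 pages

full_page_buffer = num_pages * PAGE_SIZE  # XBZRLE cache covering all RAM
sha1_buffer = num_pages * 20              # one 20-byte digest per page

print(full_page_buffer // (1 << 20), "MiB vs", sha1_buffer // (1 << 20), "MiB")
# prints: 1024 MiB vs 5 MiB
```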
> >
> > The current implementation of XBZRLE begins to buffer pages from the
> > second iteration. Maybe it's worth making it start from the first
> > iteration, based on your findings.
> >
> > > In terms of the overhead of calculating the SHA1 compared with
> > > transferring a page, it depends on the CPU and network performance.
> > > In my test environment (Intel Xeon E5620 @2.4GHz, 1Gbps Ethernet),
> > > I didn't observe obvious extra computing overhead from calculating
> > > the SHA1, because the network throughput (reported by "info
> > > migrate") remains almost the same.
> >
> > You can check the CPU usage, or measure the time spent on a local
> > live migration which uses SHA1/XBZRLE.
> >
> > Liang
> >
> >
> 
> I compared SHA1 with XBZRLE. I used XBZRLE in two ways:
> 1. beginning to buffer pages from iteration 1;
> 2. as in the current implementation, beginning to buffer pages from
> iteration 2.
> 
> I post the results of three workloads: cpu2006.zeusmp, cpu2006.mcf,
> memcached.
> I set the cache size to 256MB for zeusmp & mcf (they run in a VM with
> 1GB ram), and to 1GB for memcached (it runs in a VM with 6GB ram, of
> which memcached takes 4GB as cache).
> 
> As you can read from the data below, beginning to buffer pages from
> iteration 1 is better than the current implementation (from iteration
> 2), because the total migration time is shorter.
> 
> SHA1 is better than XBZRLE with the cache sizes I chose, because it
> leads to a shorter migration time and far less memory overhead (<1/200
> of the total VM memory).
> 

Hi Chunguang,

Have you tried using a large XBZRLE cache size equal to the guest's RAM size?
Is SHA1 faster in that case?

Thanks!
Liang

Thread overview: 21+ messages
2016-09-25  8:22 [Qemu-devel] Migration dirty bitmap: should only mark pages as dirty after they have been sent Chunguang Li
2016-09-26 11:23 ` Dr. David Alan Gilbert
2016-09-26 14:55   ` Chunguang Li
2016-09-26 18:52     ` Dr. David Alan Gilbert
2016-09-27 12:28       ` Chunguang Li
2016-09-30  5:46     ` Amit Shah
2016-09-30  8:18       ` Chunguang Li
2016-10-08  7:55       ` Chunguang Li
2016-10-14 11:15         ` Dr. David Alan Gilbert
2016-11-03  8:25           ` Chunguang Li
2016-11-03  9:59             ` Li, Liang Z
2016-11-03 10:13             ` Li, Liang Z
2016-11-04  3:07               ` Chunguang Li
2016-11-04  4:50                 ` Li, Liang Z
2016-11-04  7:03                   ` Chunguang Li
2016-11-07 13:52                   ` Chunguang Li
2016-11-07 14:17                     ` Li, Liang Z [this message]
2016-11-08  5:27                       ` Chunguang Li
2016-11-07 14:44                     ` Li, Liang Z
2016-11-08 11:05             ` Dr. David Alan Gilbert
2016-11-08 13:40               ` Chunguang Li
