From: "Li, Liang Z" <liang.z.li@intel.com>
To: Chunguang Li <lichunguang@hust.edu.cn>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Amit Shah <amit.shah@redhat.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"stefanha@redhat.com" <stefanha@redhat.com>,
	"quintela@redhat.com" <quintela@redhat.com>
Subject: Re: [Qemu-devel] Migration dirty bitmap: should only mark pages as dirty after they have been sent
Date: Fri, 4 Nov 2016 04:50:22 +0000	[thread overview]
Message-ID: <F2CBF3009FA73547804AE4C663CAB28E3A105C5C@shsmsx102.ccr.corp.intel.com> (raw)
In-Reply-To: <484431.8261.1582d4e1433.Coremail.lichunguang@hust.edu.cn>

> > > > > > > I think this is "very" wasteful. Assume the workload writes
> > > > > > > the pages dirty randomly within the guest address space, and
> > > > > > > the transfer speed is constant. Intuitively, I think nearly
> > > > > > > half of the dirty pages produced in Iteration 1 are not really
> > > > > > > dirty. This means the time of Iteration 2 is double that of
> > > > > > > sending only the really dirty pages.
> > > > > >
> > > > > > It makes sense, can you get some perf numbers to show what
> > > > > > kinds of workloads get impacted the most?  That would also
> > > > > > help us to figure out what kinds of speed improvements we can
> > > > > > expect.
> > > > > >
> > > > > >
> > > > > > 		Amit
> > > > >
> > > > > I have picked up 6 workloads and got the following statistics
> > > > > for every iteration (except the last stop-copy one) during
> > > > > precopy. These numbers are obtained with the basic precopy
> > > > > migration, without capabilities like xbzrle or compression, etc.
> > > > > The network for the migration is exclusive, with a separate
> > > > > network for the workloads. They are both gigabit ethernet. I use
> > > > > qemu-2.5.1.
> > > > >
> > > > > Three (booting, idle, web server) of them converged to the
> > > > > stop-copy phase, with the given bandwidth and default downtime
> > > > > (300ms), while the other three (kernel compilation, zeusmp,
> > > > > memcached) did not.
> > > > >
> > > > > One page is "not-really-dirty" if it is written first and sent
> > > > > later (and not written again after that) during one iteration.
> > > > > I guess this would not happen as often during the other
> > > > > iterations as during the 1st iteration, because all the pages of
> > > > > the VM are sent to the dest node during the 1st iteration, while
> > > > > during the others, only part of the pages are sent. So I think
> > > > > the "not-really-dirty" pages should be produced mainly during
> > > > > the 1st iteration, and maybe very few during the other
> > > > > iterations.
> > > > >
> > > > > If we could avoid resending the "not-really-dirty" pages,
> > > > > intuitively, I think the time spent on Iteration 2 would be
> > > > > halved. This is a chain reaction, because the dirty pages
> > > > > produced during Iteration 2 are halved, which means the time
> > > > > spent on Iteration 3 is halved, then Iteration 4, 5...
> > > >
> > > > Yes; these numbers don't show how many of them are false dirty
> > > > though.
> > > >
> > > > One problem is thinking about pages that have been redirtied; if
> > > > the page is dirtied after the sync but before the network write,
> > > > then it's the false-dirty that you're describing.
> > > >
> > > > However, if the page is being written a few times, and so it
> > > > would have been written after the network write, then it isn't a
> > > > false-dirty.
> > > >
> > > > You might be able to figure that out with some kernel tracing of
> > > > when the dirtying happens, but it might be easier to write the
> > > > fix!
> > > >
> > > > Dave
> > >
> > > Hi, I have made some new progress now.
> > >
> > > To tell exactly how many false dirty pages there are in each
> > > iteration, I malloc a buffer in memory as big as the whole VM
> > > memory. When a page is transferred to the dest node, it is copied
> > > to the buffer; during the next iteration, if a page is transferred,
> > > it is compared to the old copy in the buffer, and the old copy is
> > > replaced for the next comparison if the page is really dirty. Thus,
> > > we are now able to get the exact number of false dirty pages.
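[The shadow-buffer bookkeeping described above can be sketched as follows. This is a hypothetical illustration in Python, not QEMU code; the page size and the page-send hook are assumptions made for the sketch.]

```python
PAGE_SIZE = 4096  # assumed guest page size

class FalseDirtyCounter:
    """Keep a shadow copy of every transferred page, to detect pages
    marked dirty by the bitmap whose content did not actually change."""

    def __init__(self):
        self.shadow = {}        # page number -> last transferred content
        self.false_dirty = 0
        self.really_dirty = 0

    def on_page_send(self, pfn, content):
        old = self.shadow.get(pfn)
        if old is not None and old == content:
            # Marked dirty, but identical to what the destination
            # already has: a false dirty page.
            self.false_dirty += 1
        else:
            self.really_dirty += 1
            self.shadow[pfn] = content  # keep for the next comparison

counter = FalseDirtyCounter()
counter.on_page_send(0, b"A" * PAGE_SIZE)  # first send: really dirty
counter.on_page_send(0, b"A" * PAGE_SIZE)  # resent unchanged: false dirty
counter.on_page_send(0, b"B" * PAGE_SIZE)  # content changed: really dirty
```

Note the memory cost of this measurement approach: the shadow dictionary grows to the size of the whole VM memory, which is why it is only a measurement tool, not the optimization itself.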
> > >
> > > This time, I use 15 workloads to get the statistics. They are:
> > >
> > >   1. 11 benchmarks picked from the cpu2006 benchmark suite. They
> > >      are all scientific computing workloads like Quantum
> > >      Chromodynamics, Fluid Dynamics, etc. I pick these 11 benchmarks
> > >      because, compared to the others, they have bigger memory
> > >      occupation and higher memory dirty rates. Thus most of them
> > >      could not converge to stop-and-copy using the default migration
> > >      speed (32MB/s).
> > >   2. kernel compilation
> > >   3. idle VM
> > >   4. Apache web server which serves static content
> > >
> > >   (the above workloads all run in a VM with 1 vcpu and 1GB memory,
> > >    and the migration speed is the default 32MB/s)
> > >
> > >   5. Memcached. The VM has 6 cpu cores and 6GB memory, and 4GB are
> > >      used as the cache. After filling up the 4GB cache, a client
> > >      writes the cache at a constant speed during migration. This
> > >      time, migration speed has no limit, and is up to the capability
> > >      of 1Gbps Ethernet.
> > >
> > > Summarize the results first: (and you can read the precise numbers
> > > below)
> > >
> > >   1. 4 of these 15 workloads have a big proportion (>60%, even >80%
> > >      during some iterations) of false dirty pages out of all the
> > >      dirty pages since iteration 2 (and the big proportion lasts
> > >      during the following iterations). They are cpu2006.zeusmp,
> > >      cpu2006.bzip2, cpu2006.mcf, and memcached.
> > >   2. 2 workloads (idle, webserver) spend most of the migration time
> > >      on iteration 1; even though the proportion of false dirty
> > >      pages is big since iteration 2, the space to optimize is small.
> > >   3. 1 workload (kernel compilation) only has a big proportion
> > >      during iteration 2, not in the other iterations.
> > >   4. 8 workloads (the other 8 benchmarks of cpu2006) have a small
> > >      proportion of false dirty pages since iteration 2, so the
> > >      space to optimize for them is small.
> > >
> > > Now I want to talk a little more about the reasons why false dirty
> > > pages are produced. The first reason is what we have discussed
> > > before---the mechanism to track the dirty pages. Then I came up
> > > with another reason. Here is the situation: a write operation to a
> > > memory page happens, but it doesn't change any content of the page.
> > > So it's "write but not dirty", and the kernel still marks the page
> > > as dirty. One guy in our lab has done some experiments to figure
> > > out the proportion of "write but not dirty" operations, using the
> > > cpu2006 benchmark suite. According to his results, general
> > > workloads have a small proportion (<10%) of "write but not dirty"
> > > out of all the write operations, while a few workloads have a
> > > higher proportion (one even as high as 50%). We are not sure yet
> > > why "write but not dirty" happens, it just does.
> > >
> > > So these two reasons contribute to the false dirty pages. To
> > > optimize, I compute and store the SHA1 hash before transferring
> > > each page. Next time, if a page needs retransmission, its SHA1
> > > hash is computed again and compared to the old hash. If the hash
> > > is the same, it's a false dirty page, and we just skip this page;
> > > otherwise, the page is transferred, and the new hash replaces the
> > > old one for the next comparison.
> > > The reason to use a SHA1 hash rather than byte-by-byte comparison
> > > is the memory overhead. One SHA1 hash is 20 bytes, so we need
> > > extra memory of only 20/4096 (<1/200) of the whole VM memory,
> > > which is relatively small.
> > > As far as I know, SHA1 hashes are widely used for deduplication in
> > > backup systems. It has been shown there that the probability of a
> > > hash collision is far smaller than that of a disk hardware fault,
> > > so it is treated as a secure hash; that is, if the hashes of two
> > > chunks are the same, the content must be the same. So I think the
> > > SHA1 hash could replace byte-by-byte comparison in the VM memory
> > > scenario.
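[The hash-based filter described above can be sketched like this. This is a hypothetical Python illustration; the sender hook and page numbering are assumptions, not QEMU's actual API.]

```python
import hashlib

PAGE_SIZE = 4096  # assumed guest page size

class HashFilter:
    """Skip retransmission of pages whose content hash has not changed
    since the last send: 20 bytes of state per page instead of a full
    4KB shadow copy."""

    def __init__(self):
        self.hashes = {}  # page number -> SHA1 digest of last sent content

    def should_send(self, pfn, content):
        digest = hashlib.sha1(content).digest()  # 20-byte digest
        if self.hashes.get(pfn) == digest:
            return False  # false dirty page: identical content, skip it
        self.hashes[pfn] = digest  # remember for the next comparison
        return True

f = HashFilter()
assert f.should_send(0, b"A" * PAGE_SIZE)      # first send: transferred
assert not f.should_send(0, b"A" * PAGE_SIZE)  # unchanged: skipped
assert f.should_send(0, b"B" * PAGE_SIZE)      # really dirty: transferred
```

At 20 bytes of state per 4KB page, the table stays under 0.5% of guest RAM, which matches the 20/4096 figure above.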
> > >
> > > Then I did the same migration experiments using the SHA1 hash. For
> > > the 4 workloads which have big proportions of false dirty pages,
> > > the improvement is remarkable. Without the optimization, they
> > > either cannot converge to stop-and-copy, or take a very long time
> > > to complete. With the SHA1 hash method, all of them now complete
> > > in a relatively short time. For the reasons discussed above, the
> > > other workloads don't get notable improvements from the
> > > optimization. So below, I only show the exact numbers after
> > > optimization for the 4 workloads with remarkable improvements.
> > >
> > > Any comments or suggestions?
> >
> > Maybe you can compare the performance of your solution with that of
> > XBZRLE to see which one is better.
> > The merit of using SHA1 is that it can avoid the data copy done in
> > XBZRLE, and needs less buffer.
> > How about the overhead of calculating the SHA1? Is it faster than
> > copying a page?
> >
> > Liang
> >
> >
> 
> Yes, XBZRLE is able to handle the false dirty pages. However, if we
> want to avoid transferring all of the false dirty pages using XBZRLE,
> we need a buffer as big as the whole VM memory, while SHA1 needs a
> much smaller buffer. Of course, if we had a buffer as big as the
> whole VM memory, XBZRLE could transfer less data on the network than
> SHA1, because XBZRLE is able to compress similar pages. In short,
> yes, the merit of using SHA1 is that it needs a much smaller buffer,
> and it leads to a nice improvement if there are many false dirty
> pages.
> 

The current implementation of XBZRLE begins to buffer pages from the second iteration.
Maybe it's worth making it work from the first iteration, based on your findings.

> In terms of the overhead of calculating the SHA1 compared with
> transferring a page, it's related to the CPU and network performance.
> In my test environment (Intel Xeon E5620 @ 2.4GHz, 1Gbps Ethernet), I
> didn't observe obvious extra computing overhead caused by calculating
> the SHA1, because the throughput of the network (obtained via "info
> migrate") remains almost the same.

You can check the CPU usage, or measure the time spent on a local live
migration which uses SHA1/XBZRLE.
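[One quick way to bound the per-page cost outside of a full migration run is a standalone microbenchmark that times SHA1 hashing of a 4KB page against a plain memory copy. This is a hypothetical sketch, not part of either patch; absolute numbers depend entirely on the CPU.]

```python
import hashlib
import timeit

PAGE_SIZE = 4096
page = bytes(range(256)) * (PAGE_SIZE // 256)  # one 4KB page of sample data

N = 10000
hash_time = timeit.timeit(lambda: hashlib.sha1(page).digest(), number=N)
copy_time = timeit.timeit(lambda: bytes(page), number=N)

# Per-page costs in microseconds; at 1Gbps a 4KB page takes roughly
# 33us on the wire, so hashing stays hidden as long as its cost is
# well below that.
print(f"sha1: {hash_time / N * 1e6:.2f} us/page")
print(f"copy: {copy_time / N * 1e6:.2f} us/page")
```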

Liang




Thread overview: 21+ messages
2016-09-25  8:22 [Qemu-devel] Migration dirty bitmap: should only mark pages as dirty after they have been sent Chunguang Li
2016-09-26 11:23 ` Dr. David Alan Gilbert
2016-09-26 14:55   ` Chunguang Li
2016-09-26 18:52     ` Dr. David Alan Gilbert
2016-09-27 12:28       ` Chunguang Li
2016-09-30  5:46     ` Amit Shah
2016-09-30  8:18       ` Chunguang Li
2016-10-08  7:55       ` Chunguang Li
2016-10-14 11:15         ` Dr. David Alan Gilbert
2016-11-03  8:25           ` Chunguang Li
2016-11-03  9:59             ` Li, Liang Z
2016-11-03 10:13             ` Li, Liang Z
2016-11-04  3:07               ` Chunguang Li
2016-11-04  4:50                 ` Li, Liang Z [this message]
2016-11-04  7:03                   ` Chunguang Li
2016-11-07 13:52                   ` Chunguang Li
2016-11-07 14:17                     ` Li, Liang Z
2016-11-08  5:27                       ` Chunguang Li
2016-11-07 14:44                     ` Li, Liang Z
2016-11-08 11:05             ` Dr. David Alan Gilbert
2016-11-08 13:40               ` Chunguang Li
