From: "Michael S. Tsirkin" <mst@redhat.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Alexander Duyck <aduyck@mirantis.com>,
	kvm@vger.kernel.org,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	x86@kernel.org,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	qemu-devel@nongnu.org, Lan Tianyu <tianyu.lan@intel.com>,
	Yang Zhang <yang.zhang.wz@gmail.com>,
	konrad.wilk@oracle.com,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Alexander Graf <agraf@suse.de>,
	Alex Williamson <alex.williamson@redhat.com>
Subject: Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest
Date: Mon, 14 Dec 2015 22:52:44 +0200
Message-ID: <20151214221919-mutt-send-email-mst@redhat.com>
In-Reply-To: <CAKgT0Uem+7p9mJDq117XAHktSdeZaWFpyMtbhj87g=XvG0vE-g@mail.gmail.com>

On Mon, Dec 14, 2015 at 09:59:13AM -0800, Alexander Duyck wrote:
> On Mon, Dec 14, 2015 at 9:20 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote:
> >> > This way distro can use a guest agent to disable
> >> > dirtying until before migration starts.
> >>
> >> Right.  For a v2 version I would definitely want to have some way to
> >> limit the scope of this.  My main reason for putting this out here is
> >> to start altering the course of discussions, since it seems like we
> >> weren't getting anywhere with the ixgbevf migration changes that
> >> were being proposed.
> >
> > Absolutely, thanks for working on this.
> >
> >> >> +     unsigned long pg_addr, start;
> >> >> +
> >> >> +     start = (unsigned long)addr;
> >> >> +     pg_addr = PAGE_ALIGN(start + size);
> >> >> +     start &= ~(sizeof(atomic_t) - 1);
> >> >> +
> >> >> +     /* trigger a write fault on each page, excluding first page */
> >> >> +     while ((pg_addr -= PAGE_SIZE) > start)
> >> >> +             atomic_add(0, (atomic_t *)pg_addr);
> >> >> +
> >> >> +     /* trigger a write fault on first word of DMA */
> >> >> +     atomic_add(0, (atomic_t *)start);

Actually, I have second thoughts about using atomic_add here,
especially for _sync.

Many architectures do (this is, e.g., the fallback in
include/asm-generic/atomic.h)

#define ATOMIC_OP_RETURN(op, c_op)                                      \
static inline int atomic_##op##_return(int i, atomic_t *v)              \
{                                                                       \
        unsigned long flags;                                            \
        int ret;                                                        \
                                                                        \
        raw_local_irq_save(flags);                                      \
        ret = (v->counter = v->counter c_op i);                         \
        raw_local_irq_restore(flags);                                   \
                                                                        \
        return ret;                                                     \
}

and this is not safe if the device is still doing DMA to/from
this memory.

Generally, atomic_t is there for SMP effects, not for sync
with devices.

This is why I said you should do
	cmpxchg(pg_addr, 0xdead, 0xdead); 

Yes, we probably never actually want to run m68k within a VM,
but let's not misuse interfaces like this.
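
Something like this, keeping the dma_mark_dirty() shape from the
patch (a sketch only; 0xdead is an arbitrary compare value, and on
x86 at least a locked cmpxchg issues a write cycle whether or not
the comparison matches, so the page gets dirtied either way):

static void dma_mark_dirty(void *addr, size_t size)
{
	unsigned long pg_addr, start;

	start = (unsigned long)addr;
	pg_addr = PAGE_ALIGN(start + size);
	start &= ~(sizeof(unsigned long) - 1);

	/* write-fault each page, excluding the first */
	while ((pg_addr -= PAGE_SIZE) > start)
		cmpxchg((unsigned long *)pg_addr, 0xdeadUL, 0xdeadUL);

	/* write-fault the first word of the DMA */
	cmpxchg((unsigned long *)start, 0xdeadUL, 0xdeadUL);
}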


> >> >
> >> > start might not be aligned correctly for a cast to atomic_t.
> >> > It's harmless to do this for any memory, so I think you should
> >> > just do this for 1st byte of all pages including the first one.
> >>
> >> You may not have noticed it but I actually aligned start in the line
> >> after pg_addr.
> >
> > Yes you did. alignof would make it a bit more noticeable.
> >
> >>  However instead of aligning to the start of the next
> >> atomic_t I just masked off the lower bits so that we start at the
> >> DWORD that contains the first byte of the starting address.  The
> >> assumption here is that I cannot trigger any sort of fault since if I
> >> have access to a given byte within a DWORD I will have access to the
> >> entire DWORD.
> >
> > I'm curious where this comes from.  Isn't it true that access is
> > normally controlled at page granularity, so you can touch the
> > beginning of the page just as well?
> 
> Yeah, I am pretty sure it probably is page granularity.  However my
> thought was to try and stick to the start of the DMA as the last
> access.  That way we don't pull in any more cache lines than we need
> to in order to dirty the pages.  Usually the start of the DMA region
> will contain some sort of headers or something that needs to be
> accessed with the highest priority so I wanted to make certain that we
> were forcing usable data into the L1 cache rather than just the first
> cache line of the page where the DMA started.  If, however, the
> start of the DMA is also the start of the page, there is nothing to
> prevent that.

OK, maybe this helps. You should document all these tricks
in code comments.
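
E.g. a sketch of the same hunk with the reasoning from this thread
inlined (the comments are the only addition):

	start = (unsigned long)addr;
	pg_addr = PAGE_ALIGN(start + size);

	/*
	 * Word-align only: protection is per page anyway, but keeping
	 * the write inside the word holding the first byte of the DMA
	 * dirties a cache line the CPU is about to consume (headers
	 * usually sit at the front of the buffer).
	 */
	start &= ~(sizeof(atomic_t) - 1);

	/*
	 * Walk from the highest page down, so the PAGE_SIZE stride
	 * evicts the tail of the buffer from L1 rather than the
	 * headers at the front.
	 */
	while ((pg_addr -= PAGE_SIZE) > start)
		atomic_add(0, (atomic_t *)pg_addr);

	/* touch the first page last, at the start of the DMA itself */
	atomic_add(0, (atomic_t *)start);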

> >>  I coded this up so that the spots where we touch the
> >> memory should match up with addresses provided by the hardware to
> >> perform the DMA over the PCI bus.
> >
> > Yes but there's no requirement to do it like this from
> > virt POV. You just need to touch each page.
> 
> I know, but at the same time if we match up with the DMA then it is
> more likely that we avoid grabbing unneeded cache lines.  In most
> drivers the header data is at the start of the DMA.  So if we dirty
> the cache line associated with the start of the
> DMA it will be pulled into the L1 cache and there is a greater chance
> that it may already be prefetched as well.
> 
> >> Also I intentionally ran from highest address to lowest since that way
> >> we don't risk pushing the first cache line of the DMA buffer out of
> >> the L1 cache due to the PAGE_SIZE stride.
> >
> > Interesting. How does order of access help with this?
> 
> If you use a PAGE_SIZE stride you will start evicting things from L1
> cache after something like 8 accesses on an x86 processor as most of
> the recent ones have a 32K 8-way associative L1 cache.  So if I go
> from back to front then I evict the stuff that would likely be in the
> data portion of a buffer instead of headers which are usually located
> at the front.

I see, interesting.
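
To make the arithmetic concrete (assuming 64-byte lines and 4K pages):
a 32K 8-way L1 has 32K / 8 = 4K per way, i.e. 64 sets, so addresses
exactly PAGE_SIZE apart all map to the same set, and the 9th access of
the walk starts evicting the earlier ones.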

> > By the way, if you are into these micro-optimizations you might want to
> > limit prefetch, to this end you want to access the last line of the
> > page.  And it's probably worth benchmarking a bit and not doing it all just
> > based on theory, keep code simple in v1 otherwise.
> 
> My main goal for now is functional code over high-performance code.
> That is why I have kept this code fairly simple.  Where I did
> optimize, it was as much about keeping the code simple as about the
> optimization itself.

Well you were trying to avoid putting extra stress on
the cache, and it seems clear to me that prefetch
is not your friend here. So
-             atomic_add(0, (atomic_t *)pg_addr);
+             atomic_add(0, (atomic_t *)(pg_addr + PAGE_SIZE - sizeof(atomic_t)));
(or whatever we change atomic_t to) is probably a win.

>  For example by using the start of the page instead
> of the end I could easily do the comparison against start and avoid
> doing more than one write per page.

That's probably worth fixing, we don't want two atomics
if we can help it.

-     while ((pg_addr -= PAGE_SIZE) > start)
+     while ((pg_addr -= PAGE_SIZE) >= PAGE_ALIGN(start + PAGE_SIZE))

will do it with no fuss.
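
Putting the two changes together, a sketch using the cmpxchg form from
above (note the loop bound here is PAGE_ALIGN(start + 1) rather than
PAGE_ALIGN(start + PAGE_SIZE): with a start that is word- but not
page-aligned, the latter would skip the page right after it):

	start = (unsigned long)addr;
	pg_addr = PAGE_ALIGN(start + size);
	start &= ~(sizeof(unsigned long) - 1);

	/* last word of each page except the first, to limit prefetch */
	while ((pg_addr -= PAGE_SIZE) >= PAGE_ALIGN(start + 1))
		cmpxchg((unsigned long *)
			(pg_addr + PAGE_SIZE - sizeof(unsigned long)),
			0xdeadUL, 0xdeadUL);

	/* the first page is dirtied at the start of the DMA itself */
	cmpxchg((unsigned long *)start, 0xdeadUL, 0xdeadUL);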

> The issue for me doing performance testing is that I don't have
> anything that uses DMA blocks that are actually big enough to make use
> of the PAGE_SIZE stride.  That is why the PAGE_SIZE stride portion is
> mostly just theoretical.  I just have a few NICs and most of them only
> allocate 1 page or so for DMA buffers.  What little benchmarking I
> have done with netperf only showed a ~1% CPU penalty for the page
> dirtying code.  For setups where we did more with the DMA such as
> small packet handling I would expect that value to increase, but I
> still wouldn't expect to see a penalty of much more than ~5% most
> likely since there are still a number of other items that are calling
> atomic operations as well such as the code for releasing pages.
> 
> - Alex

Thread overview: 35+ messages
2015-12-13 21:28 [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking Alexander Duyck
2015-12-13 21:28 ` [RFC PATCH 1/3] swiotlb: Fold static unmap and sync calls into calling functions Alexander Duyck
2015-12-13 21:28 ` [RFC PATCH 2/3] xen/swiotlb: " Alexander Duyck
2015-12-13 21:28 ` [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest Alexander Duyck
2015-12-14 14:00   ` Michael S. Tsirkin
2015-12-14 16:34     ` Alexander Duyck
2015-12-14 17:20       ` Michael S. Tsirkin
2015-12-14 17:59         ` Alexander Duyck
2015-12-14 20:52           ` Michael S. Tsirkin [this message]
2015-12-14 22:32             ` Alexander Duyck
2015-12-14  2:27 ` [RFC PATCH 0/3] x86: Add support for guest DMA dirty page tracking Yang Zhang
2015-12-14  4:54   ` Alexander Duyck
2015-12-14  5:22     ` Yang Zhang
2015-12-14  5:46       ` Alexander Duyck
2015-12-14  7:20         ` Yang Zhang
2015-12-14 14:02           ` Michael S. Tsirkin
     [not found] ` <20160104204104.GB17427@char.us.oracle.com>
2016-01-05  3:11   ` Alexander Duyck
2016-01-05  9:40     ` Michael S. Tsirkin
2016-01-05 10:01       ` Dr. David Alan Gilbert
2016-01-05 10:35         ` Michael S. Tsirkin
2016-01-05 10:45           ` Dr. David Alan Gilbert
2016-01-05 10:59             ` Michael S. Tsirkin
2016-01-05 11:03               ` Dr. David Alan Gilbert
2016-01-05 11:11                 ` Michael S. Tsirkin
2016-01-05 11:06               ` Michael S. Tsirkin
2016-01-05 11:05             ` Michael S. Tsirkin
2016-01-05 12:43               ` Dr. David Alan Gilbert
2016-01-05 13:16                 ` Michael S. Tsirkin
2016-01-05 16:18       ` Alexander Duyck
2016-06-06  9:18         ` Re: [Qemu-devel] " Zhou Jie
2016-06-06 16:04           ` Alex Duyck
2016-06-09 10:14             ` Zhou Jie
2016-06-09 15:39               ` Alexander Duyck
2016-06-12  3:03                 ` Zhou Jie
2016-06-13  1:28                   ` Alexander Duyck
