From: "Michael S. Tsirkin" <mst@redhat.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Alexander Duyck <aduyck@mirantis.com>,
	kvm@vger.kernel.org,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	x86@kernel.org,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	qemu-devel@nongnu.org, Lan Tianyu <tianyu.lan@intel.com>,
	Yang Zhang <yang.zhang.wz@gmail.com>,
	konrad.wilk@oracle.com,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Alexander Graf <agraf@suse.de>,
	Alex Williamson <alex.williamson@redhat.com>
Subject: Re: [RFC PATCH 3/3] x86: Create dma_mark_dirty to dirty pages used for DMA by VM guest
Date: Mon, 14 Dec 2015 22:52:44 +0200	[thread overview]
Message-ID: <20151214221919-mutt-send-email-mst@redhat.com> (raw)
In-Reply-To: <CAKgT0Uem+7p9mJDq117XAHktSdeZaWFpyMtbhj87g=XvG0vE-g@mail.gmail.com>

On Mon, Dec 14, 2015 at 09:59:13AM -0800, Alexander Duyck wrote:
> On Mon, Dec 14, 2015 at 9:20 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Dec 14, 2015 at 08:34:00AM -0800, Alexander Duyck wrote:
> >> > This way a distro can use a guest agent to disable
> >> > dirtying until just before migration starts.
> >>
> >> Right.  For a v2 version I would definitely want to have some way to
> >> limit the scope of this.  My main reason for putting this out here is
> >> to start altering the course of discussions since it seems like we
> >> weren't getting anywhere with the ixgbevf migration changes that were
> >> being proposed.
> >
> > Absolutely, thanks for working on this.
> >
> >> >> +     unsigned long pg_addr, start;
> >> >> +
> >> >> +     start = (unsigned long)addr;
> >> >> +     pg_addr = PAGE_ALIGN(start + size);
> >> >> +     start &= ~(sizeof(atomic_t) - 1);
> >> >> +
> >> >> +     /* trigger a write fault on each page, excluding first page */
> >> >> +     while ((pg_addr -= PAGE_SIZE) > start)
> >> >> +             atomic_add(0, (atomic_t *)pg_addr);
> >> >> +
> >> >> +     /* trigger a write fault on first word of DMA */
> >> >> +     atomic_add(0, (atomic_t *)start);

Actually, I have second thoughts about using atomic_add here,
especially for _sync.

Many architectures do

#define ATOMIC_OP_RETURN(op, c_op)                                      \
static inline int atomic_##op##_return(int i, atomic_t *v)              \
{                                                                       \
        unsigned long flags;                                            \
        int ret;                                                        \
                                                                        \
        raw_local_irq_save(flags);                                      \
        ret = (v->counter = v->counter c_op i);                         \
        raw_local_irq_restore(flags);                                   \
                                                                        \
        return ret;                                                     \
}

and this is not safe if the device is still doing DMA to/from
this memory.

Generally, atomic_t is there for SMP effects, not for sync
with devices.

This is why I said you should do
	cmpxchg(pg_addr, 0xdead, 0xdead); 

Yes, we probably never actually want to run m68k within a VM,
but let's not misuse interfaces like this.
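
To make that concrete, here is a rough sketch of the helper with cmpxchg
substituted for atomic_add (illustrative only, not the posted patch; the
signature is assumed from the subject line).  On x86 a locked cmpxchg issues
a write cycle even when the comparison fails, so the page is dirtied either
way, while the generic irq-save fallback simply skips the store on a failed
compare and so never rewrites memory the device may still own:

static void dma_mark_dirty(void *addr, size_t size)
{
	unsigned long pg_addr, start;

	start = (unsigned long)addr;
	pg_addr = PAGE_ALIGN(start + size);
	start &= ~(sizeof(unsigned long) - 1);

	/* trigger a write fault on each page, excluding first page */
	while ((pg_addr -= PAGE_SIZE) > start)
		cmpxchg((unsigned long *)pg_addr, 0xdeadUL, 0xdeadUL);

	/* trigger a write fault on first word of DMA */
	cmpxchg((unsigned long *)start, 0xdeadUL, 0xdeadUL);
}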


> >> >
> >> > start might not be aligned correctly for a cast to atomic_t.
> >> > It's harmless to do this for any memory, so I think you should
> >> > just do this for 1st byte of all pages including the first one.
> >>
> >> You may not have noticed it but I actually aligned start in the line
> >> after pg_addr.
> >
> > Yes you did. alignof would make it a bit more noticeable.
> >
> >>  However instead of aligning to the start of the next
> >> atomic_t I just masked off the lower bits so that we start at the
> >> DWORD that contains the first byte of the starting address.  The
> >> assumption here is that I cannot trigger any sort of fault since if I
> >> have access to a given byte within a DWORD I will have access to the
> >> entire DWORD.
> >
> > I'm curious where this comes from.  Isn't it true that access is
> > normally controlled at page granularity, so you can touch the
> > beginning of the page just as well?
> 
> Yeah, I am pretty sure it probably is page granularity.  However my
> thought was to try and stick to the start of the DMA as the last
> access.  That way we don't pull in any more cache lines than we need
> to in order to dirty the pages.  Usually the start of the DMA region
> will contain some sort of headers or something that needs to be
> accessed with the highest priority so I wanted to make certain that we
> were forcing usable data into the L1 cache rather than just the first
> cache line of the page where the DMA started.  If, however, the start of
> a DMA is also the start of the page, there is nothing to prevent
> that.

OK, maybe this helps. You should document all these tricks
in code comments.

> >>  I coded this up so that the spots where we touch the
> >> memory should match up with addresses provided by the hardware to
> >> perform the DMA over the PCI bus.
> >
> > Yes but there's no requirement to do it like this from
> > virt POV. You just need to touch each page.
> 
> I know, but at the same time if we match up with the DMA then it is
> more likely that we avoid grabbing unneeded cache lines.  In the case
> of most drivers the headers and first-accessed data are at the start of the
> DMA.  So if we dirty the cache line associated with the start of the
> DMA it will be pulled into the L1 cache and there is a greater chance
> that it may already be prefetched as well.
> 
> >> Also I intentionally ran from highest address to lowest since that way
> >> we don't risk pushing the first cache line of the DMA buffer out of
> >> the L1 cache due to the PAGE_SIZE stride.
> >
> > Interesting. How does order of access help with this?
> 
> If you use a PAGE_SIZE stride you will start evicting things from L1
> cache after something like 8 accesses on an x86 processor as most of
> the recent ones have a 32K 8-way associative L1 cache.  So if I go
> from back to front then I evict the stuff that would likely be in the
> data portion of a buffer instead of headers which are usually located
> at the front.

I see, interesting.
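
For reference, the rough arithmetic behind the "something like 8 accesses"
figure (assuming 64-byte cache lines and 4K pages):

	32K / 64B per line = 512 lines; 512 lines / 8 ways = 64 sets
	set(addr) = (addr / 64) % 64, so the set pattern repeats every
	64 * 64 = 4096 bytes = PAGE_SIZE

Every page-stride access therefore lands in the same set, and the 9th one
starts evicting the earlier lines.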

> > By the way, if you are into these micro-optimizations you might want to
> > limit prefetch; to that end you want to access the last line of the
> > page.  It's probably worth benchmarking a bit rather than doing it all
> > based on theory; otherwise keep the code simple in v1.
> 
> My main goal for now is functional code over high-performance code.
> That is why I have kept this code fairly simple.  I did do some
> optimization, but it was as much about keeping the code simple as
> about the optimization itself.

Well you were trying to avoid putting extra stress on
the cache, and it seems clear to me that prefetch
is not your friend here. So
-             atomic_add(0, (atomic_t *)pg_addr);
+             atomic_add(0, (atomic_t *)(pg_addr + PAGE_SIZE - sizeof(atomic_t)));
(or whatever we change atomic_t to) is probably a win.

>  For example by using the start of the page instead
> of the end I could easily do the comparison against start and avoid
> doing more than one write per page.

That's probably worth fixing; we don't want two atomics
if we can help it.

-     while ((pg_addr -= PAGE_SIZE) > start)
+     while ((pg_addr -= PAGE_SIZE) >= PAGE_ALIGN(start + PAGE_SIZE))

will do it with no fuss.
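
Putting both tweaks together, one possible shape for the loop (again just a
sketch, not the posted patch; this variant compares against the base of the
first page so each page gets a single write even when start is not
page-aligned, and it uses cmpxchg per the earlier point):

	unsigned long pg_addr, start, first_page;

	start = (unsigned long)addr;
	first_page = start & PAGE_MASK;
	pg_addr = PAGE_ALIGN(start + size);

	/* write-fault the last word of every page except the first;
	 * this may land past start + size, but it stays within the
	 * last mapped page, and a failed cmpxchg leaves the data unchanged
	 */
	while ((pg_addr -= PAGE_SIZE) > first_page)
		cmpxchg((unsigned long *)(pg_addr + PAGE_SIZE - sizeof(unsigned long)),
			0xdeadUL, 0xdeadUL);

	/* write-fault the first word of the DMA to cover the first page */
	cmpxchg((unsigned long *)(start & ~(sizeof(unsigned long) - 1)),
		0xdeadUL, 0xdeadUL);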

> The issue for me doing performance testing is that I don't have
> anything that uses DMA blocks that are actually big enough to make use
> of the PAGE_SIZE stride.  That is why the PAGE_SIZE stride portion is
> mostly just theoretical.  I just have a few NICs and most of them only
> allocate 1 page or so for DMA buffers.  What little benchmarking I
> have done with netperf only showed a ~1% CPU penalty for the page
> dirtying code.  For setups that do more with the DMA, such as small
> packet handling, I would expect that value to increase, but I still
> wouldn't expect a penalty of much more than ~5%, since a number of
> other items, such as the code for releasing pages, already perform
> atomic operations as well.
> 
> - Alex
