From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ian Campbell <Ian.Campbell@citrix.com>
Subject: Re: [RFC v3 5/6] xen/arm: Add log_dirty support for ARM
Date: Wed, 14 May 2014 14:18:23 +0100
Message-ID: <1400073503.29366.108.camel@kazak.uk.xensource.com>
References: <1399583908-21755-1-git-send-email-w1.huang@samsung.com>
	<1399583908-21755-6-git-send-email-w1.huang@samsung.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <1399583908-21755-6-git-send-email-w1.huang@samsung.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Wei Huang <w1.huang@samsung.com>
Cc: keir@xen.org, stefano.stabellini@eu.citrix.com, andrew.cooper3@citrix.com, julien.grall@linaro.org, tim@xen.org, jaeyong.yoo@samsung.com, xen-devel@lists.xen.org, jbeulich@suse.com, ian.jackson@eu.citrix.com, yjhyun.yoo@samsung.com
List-Id: xen-devel@lists.xenproject.org

On Thu, 2014-05-08 at 16:18 -0500, Wei Huang wrote:
> This patch implements log_dirty for ARM guest VMs. This feature
> is provided via two basic blocks: dirty_bit_map and VLPT
> (virtual-linear page table)

> 1. VLPT provides fast accessing of 3rd PTE of guest P2M.
> When creating a mapping for VLPT, the page table mapping
> becomes the following:
>    xen's 1st PTE --> xen's 2nd PTE --> guest p2m's 2nd PTE -->
>    guest p2m's 3rd PTE

I think "xen's 2nd PTE" here literally means the xen_second[] page
table? As discussed with Jaeyong this is shared between all PCPUs, which
means that only a single domain can be being migrated at one time.

> 
> With VLPT, xen can immediately locate the 3rd PTE of guest P2M
> and modify PTE attirbute during dirty page tracking. The following

"attribute"

> link shows the performance comparison for handling a dirty-page
> between VLPT and typical page table walking.
> http://lists.xen.org/archives/html/xen-devel/2013-08/msg01503.html

Can you inline the results here please.


> +/* Mark the bitmap for a corresponding page as dirty */
> +static inline void bitmap_mark_dirty(struct domain *d, paddr_t addr)
> +{
> +    paddr_t ram_base = (paddr_t) GUEST_RAM_BASE;
> +    int bit_index = PFN_DOWN(addr - ram_base);
> +    int page_index = bit_index >> (PAGE_SHIFT + 3);
> +    int bit_index_residual = bit_index & ((1ul << (PAGE_SHIFT + 3)) - 1);

I queried this magic +3 on Jaeyong's v5 posting of this functionality
too.

> +
> +    set_bit(bit_index_residual, d->arch.dirty.bitmap[page_index]);
> +}
> +
> +/* Allocate dirty bitmap resource */
> +static int bitmap_init(struct domain *d)
> +{
> +    paddr_t gma_start = 0;
> +    paddr_t gma_end = 0;
> +    int nr_bytes;
> +    int nr_pages;
> +    int i;
> +
> +    domain_get_ram_range(d, &gma_start, &gma_end);
> +
> +    nr_bytes = (PFN_DOWN(gma_end - gma_start) + 7) / 8;
> +    nr_pages = (nr_bytes + PAGE_SIZE - 1) / PAGE_SIZE;

DIV_ROUNDUP?

> +
> +    BUG_ON(nr_pages > MAX_DIRTY_BITMAP_PAGES);
> +
> +    for ( i = 0; i < nr_pages; ++i )
> +    {
> +        struct page_info *page;
> +        page = alloc_domheap_page(NULL, 0);
> +        if ( page == NULL )
> +            goto cleanup_on_failure;
> +
> +        d->arch.dirty.bitmap[i] = map_domain_page_global(__page_to_mfn(page));
> +        clear_page(d->arch.dirty.bitmap[i]);
> +    }
> +
> +    d->arch.dirty.bitmap_pages = nr_pages;
> +    return 0;
> +
> +cleanup_on_failure:

This would more normally be called out or err.

> +    nr_pages = i;
> +    for ( i = 0; i < nr_pages; ++i )

I think people normally do this with a while counting backwards.

> +    {
> +        unmap_domain_page_global(d->arch.dirty.bitmap[i]);
> +    }

Coding Style doesn't need {}'s around single line statement like this.

> +
> +    return -ENOMEM;
> +}
> +
> +/* Cleanup dirty bitmap resource */
> +static void bitmap_cleanup(struct domain *d)
> +{
> +    int i;
> +
> +    for ( i = 0; i < d->arch.dirty.bitmap_pages; ++i )
> +    {
> +        unmap_domain_page_global(d->arch.dirty.bitmap[i]);
> +    }
> +}
> +
> +/* Flush VLPT area */
> +static void vlpt_flush(struct domain *d)
> +{
> +    int flush_size;
> +    flush_size = (d->arch.dirty.second_lvl_end - 
> +                  d->arch.dirty.second_lvl_start) << SECOND_SHIFT;
> +
> +    /* flushing the 3rd level mapping */
> +    flush_xen_data_tlb_range_va(d->arch.dirty.second_lvl_start << SECOND_SHIFT,
> +                                flush_size);
> +}
> +
> +/* Set up a page table for VLPT mapping */
> +static int vlpt_init(struct domain *d)
> +{
> +    uint64_t required, avail = VIRT_LIN_P2M_END - VIRT_LIN_P2M_START;
> +    int xen_second_linear_base;
> +    int gp2m_start_index, gp2m_end_index;
> +    struct p2m_domain *p2m = &d->arch.p2m;
> +    struct page_info *second_lvl_page;
> +    paddr_t gma_start = 0;
> +    paddr_t gma_end = 0;
> +    lpae_t *first[2];
> +    int i;
> +
> +    /* Check if reserved space is enough to cover guest physical address space.
> +     * Note that each LPAE page table entry is 64-bit (8 bytes). So we only
> +     * shift left with LPAE_SHIFT instead of PAGE_SHIFT. */
> +    domain_get_ram_range(d, &gma_start, &gma_end);
> +    required = (gma_end - gma_start) >> LPAE_SHIFT;
> +    if ( required > avail )
> +    {
> +        dprintk(XENLOG_ERR, "Available VLPT is small for domU guest (avail: "

"...is too small...".

What is the size limit here? We can probably accept a reasonable limit
for 32-bit guests, but for 64-bit guests we might need to consider other
options. I have patches which enable up to 1TB guests for 64-bit, and I
would expect that to grow again sooner rather than later.

This will need doing differently for arm64 anyway since there is no
per-pcpu page tables there. But there is plenty of address space so
there can probably be a linear area per pcpu, which depending on the
size of the VLPT might do, otherwise we might need to switch to per-pcpu
page tables.

> +                "%#llx, required: %#llx)\n", (unsigned long long)avail,
> +                (unsigned long long)required);
> +        return -ENOMEM;
> +    }
> +
> +    /* Caulculate the base of 2nd linear table base for VIRT_LIN_P2M_START */
> +    xen_second_linear_base = second_linear_offset(VIRT_LIN_P2M_START);
> +
> +    gp2m_start_index = gma_start >> FIRST_SHIFT;
> +    gp2m_end_index = (gma_end >> FIRST_SHIFT) + 1;
> +
> +    if ( xen_second_linear_base + gp2m_end_index >= LPAE_ENTRIES * 2 )
> +    {
> +        dprintk(XENLOG_ERR, "xen second page is small for VLPT for domU");
> +        return -ENOMEM;
> +    }
> +
> +    /* Two pages are allocated to backup the related PTE content of guest 
> +     * VM's 1st-level table. */

By "backup" do you mean context switch?

> +    second_lvl_page = alloc_domheap_pages(NULL, 1, 0);

> +    if ( second_lvl_page == NULL )
> +        return -ENOMEM;
> +    d->arch.dirty.second_lvl[0] = map_domain_page_global(
> +        page_to_mfn(second_lvl_page) );
> +    d->arch.dirty.second_lvl[1] = map_domain_page_global(
> +        page_to_mfn(second_lvl_page+1) );
> +
> +    /* 1st level P2M of guest VM is 2 consecutive pages */

Note that 4 level P2M with no concatenated level zero is on the cards
for arm64 soon.

> +    first[0] = __map_domain_page(p2m->first_level);
> +    first[1] = __map_domain_page(p2m->first_level+1);
> +
> +    for ( i = gp2m_start_index; i < gp2m_end_index; ++i )
> +    {
> +        int k = i % LPAE_ENTRIES;
> +        int l = i / LPAE_ENTRIES;
> +        int k2 = (xen_second_linear_base + i) % LPAE_ENTRIES;
> +        int l2 = (xen_second_linear_base + i) / LPAE_ENTRIES;
> +
> +        /* Update 2nd-level PTE of Xen linear table. With this, Xen linear 
> +         * page table layout becomes: 1st Xen linear ==> 2nd Xen linear ==> 
> +         * 2nd guest P2M (i.e. 3rd Xen linear) ==> 3rd guest P2M (i.e. Xen 
> +         * linear content) for VIRT_LIN_P2M_START address space. */
> +        write_pte(&xen_second[xen_second_linear_base+i], first[l][k]);

This has two barriers for each write, but you only actually need one
before the loop and one after, i.e. dsb() + copy_page() + dsb()
> +
> +        /* We copy the mapping into domain's structure as a reference
> +         * in case of the context switch (used in vlpt_restore function ) */

This is to avoid having to map the p2m pages on context switch I think.
But in order to do that you have to create a permanent mapping of two
other fresh pages to create a "cache". Why not just map the 2 actual p2m
pages instead?

Having done that I wonder how much of this loop can then be shared with
the context switcher?

> +        d->arch.dirty.second_lvl[l2][k2] = first[l][k];
> +    }
> +    unmap_domain_page(first[0]);
> +    unmap_domain_page(first[1]);
> +
> +    /* storing the start and end index */
> +    d->arch.dirty.second_lvl_start = xen_second_linear_base + gp2m_start_index;
> +    d->arch.dirty.second_lvl_end = xen_second_linear_base + gp2m_end_index;
> +
> +    vlpt_flush(d);
> +
> +    return 0;
> +}
> +
> +static void vlpt_cleanup(struct domain *d)
> +{
> +    /* First level p2m is 2 consecutive pages */
> +    unmap_domain_page_global(d->arch.dirty.second_lvl[0]);
> +    unmap_domain_page_global(d->arch.dirty.second_lvl[1]);
> +}
> +
> +/* Returns zero if addr is not valid or dirty mode is not set */
> +int handle_page_fault(struct domain *d, paddr_t addr)
> +{
> +    lpae_t *vlp2m_pte = 0;
> +    paddr_t gma_start = 0;
> +    paddr_t gma_end = 0;
> +
> +    if ( !d->arch.dirty.mode )
> +        return 0;
> +
> +    domain_get_ram_range(d, &gma_start, &gma_end);

Couldn't this just be d->arch.foo_start/end?

> +
> +    /* Ensure that addr is inside guest's RAM */
> +    if ( addr < gma_start || addr > gma_end )
> +        return 0;
> +
> +    vlp2m_pte = vlpt_get_3lvl_pte(addr);
> +    if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 &&
> +         vlp2m_pte->p2m.type == p2m_ram_logdirty )
> +    {
> +        lpae_t pte = *vlp2m_pte;
> +        pte.p2m.write = 1;
> +        write_pte(vlp2m_pte, pte);

Should you not be changing the type back to p2m_ram_rw?

> +        flush_tlb_local();

What about other CPUs?

> +
> +        /* only necessary to lock between get-dirty bitmap and mark dirty
> +         * bitmap. If get-dirty bitmap happens immediately before this
> +         * lock, the corresponding dirty-page would be marked at the next
> +         * round of get-dirty bitmap */
> +        spin_lock(&d->arch.dirty.lock);

At some point I think I suggested that you might be able to use an
atomic bitop rather than a lock, did that turn out to be impossible?

> +        bitmap_mark_dirty(d, addr);
> +        spin_unlock(&d->arch.dirty.lock);
> +    }
> +
> +    return 1;
> +}
> +
> +/* Restore the xen page table for vlpt mapping for domain */
> +void log_dirty_restore(struct domain *d)
> +{
> +    int i;
> +
> +    /* Nothing to do as log dirty mode is off */
> +    if ( !(d->arch.dirty.mode) )
> +        return;
> +
> +    dsb(sy);
> +
> +    for ( i = d->arch.dirty.second_lvl_start; i < d->arch.dirty.second_lvl_end;
> +          ++i )
> +    {
> +        int k = i % LPAE_ENTRIES;
> +        int l = i / LPAE_ENTRIES;
> +
> +        if ( xen_second[i].bits != d->arch.dirty.second_lvl[l][k].bits )
> +        {
> +            write_pte(&xen_second[i], d->arch.dirty.second_lvl[l][k]);
> +            flush_xen_data_tlb_range_va(i << SECOND_SHIFT, 1 << SECOND_SHIFT);
> +        }
> +    }
> +
> +    dsb(sy);
> +    isb();

You have barriers here, and in write_pte and in
flush_xen_data_tlb_range, which is somewhat overkill. I think you can
just have a single set before and after.

> +/* Initialize log dirty fields */
> +int log_dirty_init(struct domain *d)
> +{
> +    d->arch.dirty.count = 0;

I have a feeling that all of these are zeroed in struct domain when it
is allocated.

> +    d->arch.dirty.mode = 0;
> +    spin_lock_init(&d->arch.dirty.lock);
> +
> +    d->arch.dirty.second_lvl_start = 0;
> +    d->arch.dirty.second_lvl_end = 0;
> +    d->arch.dirty.second_lvl[0] = NULL;
> +    d->arch.dirty.second_lvl[1] = NULL;
> +
> +    memset(d->arch.dirty.bitmap, 0, sizeof(d->arch.dirty.bitmap));
> +    d->arch.dirty.bitmap_pages = 0;
> +
> +    return 0;
> +}
> +
> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> index 603c097..0808cc9 100644
> --- a/xen/arch/arm/p2m.c
> +++ b/xen/arch/arm/p2m.c
> @@ -6,6 +6,8 @@
>  #include <xen/bitops.h>
>  #include <asm/flushtlb.h>
>  #include <asm/gic.h>
> +#include <xen/guest_access.h>
> +#include <xen/pfn.h>
>  #include <asm/event.h>
>  #include <asm/hardirq.h>
>  #include <asm/page.h>
> @@ -208,6 +210,7 @@ static lpae_t mfn_to_p2m_entry(unsigned long mfn, unsigned int mattr,
>          break;
>  
>      case p2m_ram_ro:
> +    case p2m_ram_logdirty:
>          e.p2m.xn = 0;
>          e.p2m.write = 0;
>          break;
> @@ -261,6 +264,10 @@ static int p2m_create_table(struct domain *d,
>  
>      pte = mfn_to_p2m_entry(page_to_mfn(page), MATTR_MEM, p2m_invalid);
>  
> +    /* mark the write bit (page table's case, ro bit) as 0
> +     * so, it is writable in case of vlpt access */

"writeable"

> +    pte.pt.ro = 0;
> +
>      write_pte(entry, pte);
>  
>      return 0;
> @@ -696,6 +703,203 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned long gpfn)
>      return p >> PAGE_SHIFT;
>  }
>  
> +/* Change types across all p2m entries in a domain */
> +void p2m_change_entry_type_global(struct domain *d, enum mg nt)

Please try and use apply_p2m_changes for this.

mg_* are host (i.e. Xen) mapping types. You should be using p2m_type_t
here.

This should all just fall out nicely if you use apply_p2m_changes I
think.

> +                if ( nt == mg_ro )
> +                {
> +                    if ( pte.p2m.write == 1 )
> +                    {
> +                        pte.p2m.write = 0;
> +                        pte.p2m.type = p2m_ram_logdirty;

If the new type is mg_ro then shouldn't this be p2m_ram_ro? There's no
need to do logdirty tracking on ro pages.

> +                    }
> +                    else
> +                    {
> +                        /* reuse avail bit as an indicator of 'actual' 

The avail bit?

> +                         * read-only */
> +                        pte.p2m.type = p2m_ram_rw;

The new type is mg_ro -- so surely here we *do* need logdirty, iff
logdirty is currently enabled.

> +                    }
> +                }
> +                else if ( nt == mg_rw )
> +                {
> +                    if ( pte.p2m.write == 0 && 
> +                         pte.p2m.type == p2m_ram_logdirty )
> +                    {
> +                        pte.p2m.write = p2m_ram_rw;

This also seems wrong to me.

Surely the logic here ought to be something like:

    switch (nt)
    {
    case p2m_ram_ro:
        pte.p2m.write = 0;
        pte.p2m.type = nt;
        break;
    case p2m_ram_rw:
        if ( logdirty_is_enable(d) )
        {
            pte.p2m.write = 0;
            pte.p2m.type = p2m_ram_logdirty;
        }
        else
        {
            pte.p2m.write = 1;
            pte.p2m.type = p2m_ram_rw;
        }
        break;
    /* Other types, perhaps via default... */
    }
?

> +                    }
> +                }
> +                write_pte(&third[i3], pte);
> +            }
> +            unmap_domain_page(third);
> +
> +            third = NULL;
> +            third_index = 0;
> +        }
> +        unmap_domain_page(second);
> +
> +        second = NULL;
> +        second_index = 0;
> +        third_index = 0;
> +    }
> +
> +out:
> +    flush_tlb_all_local();
> +    if ( third ) unmap_domain_page(third);
> +    if ( second ) unmap_domain_page(second);
> +    if ( first ) unmap_domain_page(first);
> +
> +    spin_unlock(&p2m->lock);
> +}
> +
> +/* Read a domain's log-dirty bitmap and stats. If the operation is a CLEAN, 
> + * clear the bitmap and stats. */
> +int log_dirty_op(struct domain *d, xen_domctl_shadow_op_t *sc)
> +{
> +    int peek = 1;
> +    int i;
> +    int bitmap_size;
> +    paddr_t gma_start, gma_end;
> +
> +    /* this hypercall is called from domain 0, and we don't know which guest's
> +     * vlpt is mapped in xen_second, so, to be sure, we restore vlpt here */
> +    log_dirty_restore(d);

You don't seem to clear it again at the end?

> +    domain_get_ram_range(d, &gma_start, &gma_end);
> +    bitmap_size = (gma_end - gma_start) / 8;
> +
> +    if ( guest_handle_is_null(sc->dirty_bitmap) )
> +    {
> +        peek = 0;
> +    }
> +    else
> +    {
> +        spin_lock(&d->arch.dirty.lock);
> +
> +        for ( i = 0; i < d->arch.dirty.bitmap_pages; ++i )
> +        {
> +            int j = 0;
> +            uint8_t *bitmap;
> +
> +            copy_to_guest_offset(sc->dirty_bitmap, i * PAGE_SIZE,
> +                                 d->arch.dirty.bitmap[i],
> +                                 bitmap_size < PAGE_SIZE ? bitmap_size :
> +                                                           PAGE_SIZE);
> +            bitmap_size -= PAGE_SIZE;
> +
> +            /* set p2m page table read-only */
> +            bitmap = d->arch.dirty.bitmap[i];
> +            while ((j = find_next_bit((const long unsigned int *)bitmap,
> +                                      PAGE_SIZE*8, j)) < PAGE_SIZE*8)

What are these magic 8s?

> +            {
> +                lpae_t *vlpt;
> +                paddr_t addr = gma_start + (i << (2*PAGE_SHIFT+3)) +

Magic 2* and+3, a helper might be nice.

Isn't j effectively a pfn here (or a pfn offswet from RAM base). In
which case all the normal helpers for manipulating pfns and addresses
are available to you.

> +                    (j << PAGE_SHIFT);
> +                vlpt = vlpt_get_3lvl_pte(addr);
> +                vlpt->p2m.write = 0;
> +                j++;
> +            }

No barrier here?

> +        }
> +
> +        if ( sc->op == XEN_DOMCTL_SHADOW_OP_CLEAN )
> +        {
> +            for ( i = 0; i < d->arch.dirty.bitmap_pages; ++i )
> +            {
> +                clear_page(d->arch.dirty.bitmap[i]);
> +            }
> +        }
> +
> +        spin_unlock(&d->arch.dirty.lock);
> +        flush_tlb_local();
> +    }
> +
> +    sc->stats.dirty_count = d->arch.dirty.count;
> +
> +    return 0;
> +}
> +
> @@ -1577,6 +1579,13 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
>      if ( rc == -EFAULT )
>          goto bad_data_abort;
>  
> +    /* domU page fault handling for guest live migration. Note that 
> +     * dabt.valid can be 0 here */
> +    if ( page_fault && handle_page_fault(current->domain, info.gpa) )

handle_page_fault only deals with logdirty faults -- please name it
appropriately.

I'm not sure "page_fault" is a very descriptive name -- it's more
specific than that I think?

> diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h
> index aabeb51..ac82643 100644
> --- a/xen/include/asm-arm/domain.h
> +++ b/xen/include/asm-arm/domain.h
> @@ -162,6 +162,25 @@ struct arch_domain
>      } vuart;
>  
>      unsigned int evtchn_irq;
> +
> +    /* dirty page tracing */
> +    struct {
> +        spinlock_t lock;
> +        volatile int mode;               /* 1 if dirty pages tracing enabled */
> +        volatile unsigned int count;     /* dirty pages counter */
> +
> +        /* vlpt context switch */
> +        volatile int second_lvl_start; /* start idx of virt linear space 2nd */
> +        volatile int second_lvl_end;   /* end idx of virt linear space 2nd */
> +        lpae_t *second_lvl[2];         /* copy of guest P2M 1st-lvl content */
> +
> +        /* bitmap to track dirty pages */
> +#define MAX_DIRTY_BITMAP_PAGES 64
> +        /* Because each bit represents a dirty page, the total supported guest 
> +         * memory is (64 entries x 4KB/entry x 8bits/byte x 4KB) = 8GB. */

8GB isn't very much, especially not for a 64-bit guest.

I've previously discussed with Jaeyong the possibility of using some
state in each pte to track dirtiness and walking the vlpt to copy them
out into the toolstacks bitmap. I think pte.p2m.type could be used --
any page which is p2m_ram_rw has been dirtied (otherwise it would still
be p2m_ram_logdirty). Only thing I'm not sure about is other types --
e.g. ballooning a page out in the middle of a migrate -- I suppose there
is some explicit "dirtying" of such a page somewhere in decrease
reservation.

BTW x86 seems to use a 4-level bitmap trie here.

> +        uint8_t *bitmap[MAX_DIRTY_BITMAP_PAGES]; /* dirty bitmap */
> +        int bitmap_pages;                        /* # of dirty bitmap pages */
> +    } dirty;
>  }  __cacheline_aligned;
>  
>  struct arch_vcpu

Ian.