All of lore.kernel.org
 help / color / mirror / Atom feed
* RFC: [0/2] Remove netloop by lazy copying in netback
@ 2007-03-20  4:46 Herbert Xu
  2007-03-20  4:50 ` RFC: [1/2] [XEN] gnttab: Add new op unmap_and_replace Herbert Xu
                   ` (3 more replies)
  0 siblings, 4 replies; 63+ messages in thread
From: Herbert Xu @ 2007-03-20  4:46 UTC (permalink / raw)
  To: Keir Fraser, Xen Development Mailing List

Hi Keir:

These two patches remove the need for netloop by performing the
copying in netback and only if it is necessary.  The rationale
is that most packets will be processed without delay allowing
them to be freed without copying at all.  So instead of copying
every packet destined to dom0 we'll only copy those that linger
longer than a specified amount of time (currently 0.5s).

As it is netloop doesn't take care of all delays anyway.  For
instance packets delayed by qdisc or netfilter can hold up
resources without any limits.  Also if bridging isn't used
then traffic to dom0 does not even go through netloop.

Testing shows that these patches do eliminate the copying for
bulk transfers.  In fact, bulk transfer throughput from domU
to dom0 are increased by around 50%.  Even when the copying
path is taken the performance is roughly equal to that of
netloop despite the unoptimised copying path.

The copying is achieved through a new grant table operation.
I've only implemented it for x86.  However, there is a fallback
path for other platforms so they should continue to work.  It
shouldn't be too hard to implement this for ia64/ppc (for someone
with access to them).

In future I intend to exntend this idea to support lazy
copying for dom0 to domU as well which should give us a
complete zero-copy path from one domU to another.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RFC: [1/2] [XEN] gnttab: Add new op unmap_and_replace
  2007-03-20  4:46 RFC: [0/2] Remove netloop by lazy copying in netback Herbert Xu
@ 2007-03-20  4:50 ` Herbert Xu
  2007-03-20  4:56 ` RFC: [2/2] [NET] back: Add lazy copying Herbert Xu
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 63+ messages in thread
From: Herbert Xu @ 2007-03-20  4:50 UTC (permalink / raw)
  To: Keir Fraser, Xen Development Mailing List

Hi:

[XEN] gnttab: Add new op unmap_and_replace

The operation unmap_and_replace is an extension of unmap_grant_ref.
A new argument in the form of a virtual address (for PV) is given.
Instead of modifying the PTE for the mapped grant table entry to
null, we change it to the PTE for the new address.  In turn we
point the new address to null.

As it stands grant table entries once mapped cannot be
remapped by the guest OS (it can however perform a new
mapping on the same entry but that is within our control).
Therefore it's safe to manipulate the mapped PTE entry to
redirect it to a normal page where we've copied the contents.

It's intended to be used as follows:

1) map_grant_ref to v1
2) ...
3) alloc page at v2
4) copy the page at v1 to v2
5) unmap_and_replace v1 with v2

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff -r 3ac19fda0bc2 linux-2.6-xen-sparse/include/xen/gnttab.h
--- a/linux-2.6-xen-sparse/include/xen/gnttab.h	Fri Mar 02 12:11:52 2007 +0000
+++ b/linux-2.6-xen-sparse/include/xen/gnttab.h	Tue Mar 20 14:08:40 2007 +1100
@@ -135,4 +135,19 @@ gnttab_set_unmap_op(struct gnttab_unmap_
 	unmap->dev_bus_addr = 0;
 }
 
+static inline void
+gnttab_set_replace_op(struct gnttab_unmap_and_replace *unmap, maddr_t addr,
+		      maddr_t new_addr, grant_handle_t handle)
+{
+	if (xen_feature(XENFEAT_auto_translated_physmap)) {
+		unmap->host_addr = __pa(addr);
+		unmap->new_addr = __pa(new_addr);
+	} else {
+		unmap->host_addr = addr;
+		unmap->new_addr = new_addr;
+	}
+
+	unmap->handle = handle;
+}
+
 #endif /* __ASM_GNTTAB_H__ */
diff -r 3ac19fda0bc2 xen/arch/x86/mm.c
--- a/xen/arch/x86/mm.c	Fri Mar 02 12:11:52 2007 +0000
+++ b/xen/arch/x86/mm.c	Tue Mar 20 14:08:40 2007 +1100
@@ -2594,8 +2594,8 @@ static int create_grant_va_mapping(
     return GNTST_okay;
 }
 
-static int destroy_grant_va_mapping(
-    unsigned long addr, unsigned long frame, struct vcpu *v)
+static int replace_grant_va_mapping(
+    unsigned long addr, unsigned long frame, l1_pgentry_t nl1e, struct vcpu *v)
 {
     l1_pgentry_t *pl1e, ol1e;
     unsigned long gl1mfn;
@@ -2619,7 +2619,7 @@ static int destroy_grant_va_mapping(
     }
 
     /* Delete pagetable entry. */
-    if ( unlikely(!UPDATE_ENTRY(l1, pl1e, ol1e, l1e_empty(), gl1mfn, v)) )
+    if ( unlikely(!UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, v)) )
     {
         MEM_LOG("Cannot delete PTE entry at %p", (unsigned long *)pl1e);
         rc = GNTST_general_error;
@@ -2629,6 +2629,12 @@ static int destroy_grant_va_mapping(
  out:
     guest_unmap_l1e(v, pl1e);
     return rc;
+}
+
+static int destroy_grant_va_mapping(
+    unsigned long addr, unsigned long frame, struct vcpu *v)
+{
+    return replace_grant_va_mapping(addr, frame, l1e_empty(), v);
 }
 
 int create_grant_host_mapping(
@@ -2652,6 +2658,44 @@ int destroy_grant_host_mapping(
     if ( flags & GNTMAP_contains_pte )
         return destroy_grant_pte_mapping(addr, frame, current->domain);
     return destroy_grant_va_mapping(addr, frame, current);
+}
+
+int replace_grant_host_mapping(
+    uint64_t addr, unsigned long frame, uint64_t new_addr, unsigned int flags)
+{
+    l1_pgentry_t *pl1e, ol1e;
+    unsigned long gl1mfn;
+    int rc;
+    
+    if ( flags & GNTMAP_contains_pte )
+    {
+        MEM_LOG("Unsupported grant table operation");
+	return GNTST_general_error;
+    }
+
+    pl1e = guest_map_l1e(current, new_addr, &gl1mfn);
+    if ( !pl1e )
+    {
+        MEM_LOG("Could not find L1 PTE for address %lx",
+                (unsigned long)new_addr);
+        return GNTST_general_error;
+    }
+    ol1e = *pl1e;
+
+    if ( unlikely(!UPDATE_ENTRY(l1, pl1e, ol1e, l1e_empty(), gl1mfn, current)) )
+    {
+        MEM_LOG("Cannot delete PTE entry at %p", (unsigned long *)pl1e);
+        guest_unmap_l1e(current, pl1e);
+        return GNTST_general_error;
+    }
+
+    guest_unmap_l1e(current, pl1e);
+
+    rc = replace_grant_va_mapping(addr, frame, ol1e, current);
+    if ( rc && !paging_mode_refcounts(current->domain) )
+        put_page_from_l1e(ol1e, current->domain);
+
+    return rc;
 }
 
 int steal_page(
diff -r 3ac19fda0bc2 xen/common/grant_table.c
--- a/xen/common/grant_table.c	Fri Mar 02 12:11:52 2007 +0000
+++ b/xen/common/grant_table.c	Tue Mar 20 14:08:40 2007 +1100
@@ -540,6 +540,126 @@ fault:
     return -EFAULT;    
 }
 
+static void
+__gnttab_unmap_and_replace(
+    struct gnttab_unmap_and_replace *op)
+{
+    domid_t          dom;
+    grant_ref_t      ref;
+    struct domain   *ld, *rd;
+    struct active_grant_entry *act;
+    grant_entry_t   *sha;
+    struct grant_mapping *map;
+    u16              flags;
+    s16              rc = 0;
+    unsigned long    frame;
+
+    ld = current->domain;
+
+    if ( unlikely(op->handle >= ld->grant_table->maptrack_limit) )
+    {
+        gdprintk(XENLOG_INFO, "Bad handle (%d).\n", op->handle);
+        op->status = GNTST_bad_handle;
+        return;
+    }
+
+    map = &maptrack_entry(ld->grant_table, op->handle);
+
+    if ( unlikely(!map->flags) )
+    {
+        gdprintk(XENLOG_INFO, "Zero flags for handle (%d).\n", op->handle);
+        op->status = GNTST_bad_handle;
+        return;
+    }
+
+    dom   = map->domid;
+    ref   = map->ref;
+    flags = map->flags;
+
+    if ( unlikely((rd = rcu_lock_domain_by_id(dom)) == NULL) )
+    {
+        /* This can happen when a grant is implicitly unmapped. */
+        gdprintk(XENLOG_INFO, "Could not find domain %d\n", dom);
+        domain_crash(ld); /* naughty... */
+        return;
+    }
+
+    TRACE_1D(TRC_MEM_PAGE_GRANT_UNMAP, dom);
+
+    spin_lock(&rd->grant_table->lock);
+
+    act = &active_entry(rd->grant_table, ref);
+    sha = &shared_entry(rd->grant_table, ref);
+
+    frame = act->frame;
+
+    if ( flags & GNTMAP_host_map )
+    {
+        if ( (rc = replace_grant_host_mapping(op->host_addr,
+                                              frame, op->new_addr, flags)) < 0 )
+            goto unmap_out;
+
+        ASSERT(act->pin & (GNTPIN_hstw_mask | GNTPIN_hstr_mask));
+        map->flags &= ~GNTMAP_host_map;
+        if ( flags & GNTMAP_readonly )
+        {
+            act->pin -= GNTPIN_hstr_inc;
+            put_page(mfn_to_page(frame));
+        }
+        else
+        {
+            act->pin -= GNTPIN_hstw_inc;
+            put_page_and_type(mfn_to_page(frame));
+        }
+    }
+
+    if ( (map->flags & (GNTMAP_device_map|GNTMAP_host_map)) == 0 )
+    {
+        map->flags = 0;
+        put_maptrack_handle(ld->grant_table, op->handle);
+    }
+
+    /* If just unmapped a writable mapping, mark as dirtied */
+    if ( !(flags & GNTMAP_readonly) )
+         gnttab_mark_dirty(rd, frame);
+
+    if ( ((act->pin & (GNTPIN_devw_mask|GNTPIN_hstw_mask)) == 0) &&
+         !(flags & GNTMAP_readonly) )
+        gnttab_clear_flag(_GTF_writing, &sha->flags);
+
+    if ( act->pin == 0 )
+        gnttab_clear_flag(_GTF_reading, &sha->flags);
+
+ unmap_out:
+    op->status = rc;
+    spin_unlock(&rd->grant_table->lock);
+    rcu_unlock_domain(rd);
+}
+
+static long
+gnttab_unmap_and_replace(
+    XEN_GUEST_HANDLE(gnttab_unmap_and_replace_t) uop, unsigned int count)
+{
+    int i;
+    struct gnttab_unmap_and_replace op;
+
+    for ( i = 0; i < count; i++ )
+    {
+        if ( unlikely(__copy_from_guest_offset(&op, uop, i, 1)) )
+            goto fault;
+        __gnttab_unmap_and_replace(&op);
+        if ( unlikely(__copy_to_guest_offset(uop, i, &op, 1)) )
+            goto fault;
+    }
+
+    flush_tlb_mask(current->domain->domain_dirty_cpumask);
+    return 0;
+
+fault:
+    flush_tlb_mask(current->domain->domain_dirty_cpumask);
+    return -EFAULT;    
+}
+
 int
 gnttab_grow_table(struct domain *d, unsigned int req_nr_frames)
 {
@@ -1201,6 +1321,21 @@ do_grant_table_op(
         if ( unlikely(!grant_operation_permitted(d)) )
             goto out;
         rc = gnttab_unmap_grant_ref(unmap, count);
+        break;
+    }
+    case GNTTABOP_unmap_and_replace:
+    {
+        XEN_GUEST_HANDLE(gnttab_unmap_and_replace_t) unmap =
+            guest_handle_cast(uop, gnttab_unmap_and_replace_t);
+        if ( unlikely(!guest_handle_okay(unmap, count)) )
+            goto out;
+        rc = -EPERM;
+        if ( unlikely(!grant_operation_permitted(d)) )
+            goto out;
+        rc = -ENOSYS;
+        if ( unlikely(!replace_grant_supported()) )
+            goto out;
+        rc = gnttab_unmap_and_replace(unmap, count);
         break;
     }
     case GNTTABOP_setup_table:
diff -r 3ac19fda0bc2 xen/include/asm-ia64/grant_table.h
--- a/xen/include/asm-ia64/grant_table.h	Fri Mar 02 12:11:52 2007 +0000
+++ b/xen/include/asm-ia64/grant_table.h	Tue Mar 20 14:08:40 2007 +1100
@@ -64,4 +64,14 @@ static inline void gnttab_clear_flag(uns
 	clear_bit(nr, addr);
 }
 
+static inline int replace_grant_host_mapping(unsigned long gpaddr, unsigned long mfn, unsigned long new_gpaddr, unsigned int flags)
+{
+    return 0;
+}
+
+static inline int replace_grant_supported(void)
+{
+    return 0;
+}
+
 #endif /* __ASM_GRANT_TABLE_H__ */
diff -r 3ac19fda0bc2 xen/include/asm-powerpc/grant_table.h
--- a/xen/include/asm-powerpc/grant_table.h	Fri Mar 02 12:11:52 2007 +0000
+++ b/xen/include/asm-powerpc/grant_table.h	Tue Mar 20 14:08:40 2007 +1100
@@ -69,4 +69,14 @@ static inline uint cpu_foreign_map_order
     /* 16 GiB */
     return 34 - PAGE_SHIFT;
 }
+
+static inline int replace_grant_host_mapping(unsigned long addr, unsigned long frame, unsigned long new_addr, unsigned int flags)
+{
+    return 0;
+}
+
+static inline int replace_grant_supported(void)
+{
+    return 0;
+}
 #endif  /* __ASM_PPC_GRANT_TABLE_H__ */
diff -r 3ac19fda0bc2 xen/include/asm-x86/grant_table.h
--- a/xen/include/asm-x86/grant_table.h	Fri Mar 02 12:11:52 2007 +0000
+++ b/xen/include/asm-x86/grant_table.h	Tue Mar 20 14:08:40 2007 +1100
@@ -17,6 +17,8 @@ int create_grant_host_mapping(
     uint64_t addr, unsigned long frame, unsigned int flags);
 int destroy_grant_host_mapping(
     uint64_t addr, unsigned long frame, unsigned int flags);
+int replace_grant_host_mapping(
+    uint64_t addr, unsigned long frame, uint64_t new_addr, unsigned int flags);
 
 #define gnttab_create_shared_page(d, t, i)                               \
     do {                                                                 \
@@ -38,4 +40,9 @@ static inline void gnttab_clear_flag(uns
     clear_bit(nr, addr);
 }
 
+static inline int replace_grant_supported(void)
+{
+    return 1;
+}
+
 #endif /* __ASM_GRANT_TABLE_H__ */
diff -r 3ac19fda0bc2 xen/include/public/grant_table.h
--- a/xen/include/public/grant_table.h	Fri Mar 02 12:11:52 2007 +0000
+++ b/xen/include/public/grant_table.h	Tue Mar 20 14:08:40 2007 +1100
@@ -327,6 +327,29 @@ struct gnttab_query_size {
 };
 typedef struct gnttab_query_size gnttab_query_size_t;
 DEFINE_XEN_GUEST_HANDLE(gnttab_query_size_t);
+
+/*
+ * GNTTABOP_unmap_and_replace: Destroy one or more grant-reference mappings
+ * tracked by <handle> but atomically replace the page table entry with one
+ * pointing to the machine address under <new_addr>.  <new_addr> will be
+ * redirected to the null entry.
+ * NOTES:
+ *  1. The call may fail in an undefined manner if either mapping is not
+ *     tracked by <handle>.
+ *  2. After executing a batch of unmaps, it is guaranteed that no stale
+ *     mappings will remain in the device or host TLBs.
+ */
+#define GNTTABOP_unmap_and_replace    7
+struct gnttab_unmap_and_replace {
+    /* IN parameters. */
+    uint64_t host_addr;
+    uint64_t new_addr;
+    grant_handle_t handle;
+    /* OUT parameters. */
+    int16_t  status;              /* GNTST_* */
+};
+typedef struct gnttab_unmap_and_replace gnttab_unmap_and_replace_t;
+DEFINE_XEN_GUEST_HANDLE(gnttab_unmap_and_replace_t);
 
 
 /*

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RFC: [2/2] [NET] back: Add lazy copying
  2007-03-20  4:46 RFC: [0/2] Remove netloop by lazy copying in netback Herbert Xu
  2007-03-20  4:50 ` RFC: [1/2] [XEN] gnttab: Add new op unmap_and_replace Herbert Xu
@ 2007-03-20  4:56 ` Herbert Xu
  2007-03-20  7:10 ` RFC: [0/2] Remove netloop by lazy copying in netback Keir Fraser
  2007-03-20 10:22 ` Keir Fraser
  3 siblings, 0 replies; 63+ messages in thread
From: Herbert Xu @ 2007-03-20  4:56 UTC (permalink / raw)
  To: Keir Fraser, Xen Development Mailing List

Hi:

[NET] back: Add lazy copying

This patch adds lazy copying using the new unmap_and_replace grant
table operation.

We keep a list of pending entries sorted by arrival order.  We'll
process this list every time net_tx_action is invoked.  We ensure
that net_tx_action is invoked within one second of the arrival of
the first packet in the list.

When we process the list any entry that has been around for more
than half a second is copied.  This allows up to free the grant
table entry and return it to domU.

If the new grant table operation is not available (e.g., old HV
or architectures that don't support it yet) we simply copy each
packet as we receive them using skb_linearize.  We also disable
SG/TSO if this is the case.

By default the new code is disabled.  In order to enable it,
the module needs to be loaded with the argument copy_skb=1.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> 

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff -r 3ac19fda0bc2 linux-2.6-xen-sparse/drivers/xen/netback/common.h
--- a/linux-2.6-xen-sparse/drivers/xen/netback/common.h	Fri Mar 02 12:11:52 2007 +0000
+++ b/linux-2.6-xen-sparse/drivers/xen/netback/common.h	Tue Mar 20 14:08:40 2007 +1100
@@ -102,6 +102,14 @@ typedef struct netif_st {
 	wait_queue_head_t waiting_to_free;
 } netif_t;
 
+enum {
+	NETBK_DONT_COPY_SKB,
+	NETBK_DELAYED_COPY_SKB,
+	NETBK_ALWAYS_COPY_SKB,
+};
+
+extern int netbk_copy_skb_mode;
+
 #define NET_TX_RING_SIZE __RING_SIZE((netif_tx_sring_t *)0, PAGE_SIZE)
 #define NET_RX_RING_SIZE __RING_SIZE((netif_rx_sring_t *)0, PAGE_SIZE)
 
diff -r 3ac19fda0bc2 linux-2.6-xen-sparse/drivers/xen/netback/netback.c
--- a/linux-2.6-xen-sparse/drivers/xen/netback/netback.c	Fri Mar 02 12:11:52 2007 +0000
+++ b/linux-2.6-xen-sparse/drivers/xen/netback/netback.c	Tue Mar 20 14:08:40 2007 +1100
@@ -46,6 +46,11 @@ struct netbk_rx_meta {
 	int copy:1;
 };
 
+struct netbk_tx_pending_inuse {
+	struct list_head list;
+	unsigned long alloc_time;
+};
+
 static void netif_idx_release(u16 pending_idx);
 static void netif_page_release(struct page *page);
 static void make_tx_response(netif_t *netif, 
@@ -65,6 +70,7 @@ static DECLARE_TASKLET(net_rx_tasklet, n
 static DECLARE_TASKLET(net_rx_tasklet, net_rx_action, 0);
 
 static struct timer_list net_timer;
+static struct timer_list netbk_tx_pending_timer;
 
 #define MAX_PENDING_REQS 256
 
@@ -92,6 +98,10 @@ static u16 dealloc_ring[MAX_PENDING_REQS
 static u16 dealloc_ring[MAX_PENDING_REQS];
 static PEND_RING_IDX dealloc_prod, dealloc_cons;
 
+/* Doubly-linked list of in-use pending entries. */
+static struct netbk_tx_pending_inuse pending_inuse[MAX_PENDING_REQS];
+static LIST_HEAD(pending_inuse_head);
+
 static struct sk_buff_head tx_queue;
 
 static grant_handle_t grant_tx_handle[MAX_PENDING_REQS];
@@ -104,6 +114,13 @@ static spinlock_t net_schedule_list_lock
 #define MAX_MFN_ALLOC 64
 static unsigned long mfn_list[MAX_MFN_ALLOC];
 static unsigned int alloc_index = 0;
+
+/* Setting this allows the safe use of this driver without netloop. */
+static int MODPARM_copy_skb;
+module_param_named(copy_skb, MODPARM_copy_skb, bool, 0);
+MODULE_PARM_DESC(copy_skb, "Copy data received from netfront without netloop");
+
+int netbk_copy_skb_mode;
 
 static inline unsigned long alloc_mfn(void)
 {
@@ -710,6 +727,11 @@ static void net_alarm(unsigned long unus
 	tasklet_schedule(&net_rx_tasklet);
 }
 
+static void netbk_tx_pending_timeout(unsigned long unused)
+{
+	tasklet_schedule(&net_tx_tasklet);
+}
+
 struct net_device_stats *netif_be_get_stats(struct net_device *dev)
 {
 	netif_t *netif = netdev_priv(dev);
@@ -803,46 +825,140 @@ static void tx_credit_callback(unsigned 
 	netif_schedule_work(netif);
 }
 
+/* Perform a delayed copy.  This is slow-path only. */
+static int copy_pending_req(PEND_RING_IDX pending_idx)
+{
+	struct gnttab_unmap_and_replace unmap;
+	mmu_update_t mmu;
+	struct page *page;
+	struct page *new_page;
+	void *new_addr;
+	void *addr;
+	unsigned long pfn;
+	unsigned long new_mfn;
+	int err;
+
+	page = mmap_pages[pending_idx];
+	if (!get_page_unless_zero(page))
+		return -ENOENT;
+
+	new_page = alloc_page(GFP_ATOMIC | __GFP_NOWARN);
+	if (!new_page)
+		return -ENOMEM;
+
+	new_addr = page_address(new_page);
+	addr = page_address(page);
+	memcpy(new_addr, addr, PAGE_SIZE);
+
+	pfn = page_to_pfn(page);
+	new_mfn = virt_to_mfn(new_addr);
+
+	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+		set_phys_to_machine(pfn, new_mfn);
+		set_phys_to_machine(page_to_pfn(new_page), INVALID_P2M_ENTRY);
+	}
+
+	gnttab_set_replace_op(&unmap, (unsigned long)addr,
+			      (unsigned long)new_addr,
+			      grant_tx_handle[pending_idx]);
+
+	err = HYPERVISOR_grant_table_op(GNTTABOP_unmap_and_replace,
+					&unmap, 1);
+	BUG_ON(err);
+	BUG_ON(unmap.status);
+
+	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+		mmu.ptr = ((maddr_t)new_mfn << PAGE_SHIFT) |
+			  MMU_MACHPHYS_UPDATE;
+		mmu.val = pfn;
+		err = HYPERVISOR_mmu_update(&mmu, 1, NULL, DOMID_SELF);
+		BUG_ON(err);
+	}
+
+	ClearPageForeign(page);
+	put_page(page);
+
+	SetPageForeign(new_page, netif_page_release);
+	new_page->index = pending_idx;
+	mmap_pages[pending_idx] = new_page;
+
+	return 0;
+}
+
 inline static void net_tx_action_dealloc(void)
 {
+	struct netbk_tx_pending_inuse *inuse, *n;
 	gnttab_unmap_grant_ref_t *gop;
 	u16 pending_idx;
 	PEND_RING_IDX dc, dp;
 	netif_t *netif;
 	int ret;
+	LIST_HEAD(list);
 
 	dc = dealloc_cons;
-	dp = dealloc_prod;
-
-	/* Ensure we see all indexes enqueued by netif_idx_release(). */
-	smp_rmb();
+	gop = tx_unmap_ops;
 
 	/*
 	 * Free up any grants we have finished using
 	 */
-	gop = tx_unmap_ops;
-	while (dc != dp) {
-		pending_idx = dealloc_ring[MASK_PEND_IDX(dc++)];
-		gnttab_set_unmap_op(gop, idx_to_kaddr(pending_idx),
-				    GNTMAP_host_map,
-				    grant_tx_handle[pending_idx]);
-		gop++;
-	}
+	do {
+		dp = dealloc_prod;
+
+		/* Ensure we see all indices enqueued by netif_idx_release(). */
+		smp_rmb();
+
+		while (dc != dp) {
+			pending_idx = dealloc_ring[MASK_PEND_IDX(dc++)];
+			list_move_tail(&pending_inuse[pending_idx].list, &list);
+			gnttab_set_unmap_op(gop, idx_to_kaddr(pending_idx),
+					    GNTMAP_host_map,
+					    grant_tx_handle[pending_idx]);
+			gop++;
+		}
+
+		if (netbk_copy_skb_mode != NETBK_DELAYED_COPY_SKB ||
+		    list_empty(&pending_inuse_head))
+			break;
+
+		/* Copy any entries that have been pending for too long. */
+		list_for_each_entry_safe(inuse, n, &pending_inuse_head, list) {
+			if (time_after(inuse->alloc_time + HZ / 2, jiffies))
+				break;
+
+			switch (copy_pending_req(inuse - pending_inuse)) {
+			case 0:
+				list_move_tail(&inuse->list, &list);
+				/* fall through */
+			case -ENOENT:
+				continue;
+			}
+
+			break;
+		}
+	} while (dp != dealloc_prod);
+
+	dealloc_cons = dc;
+
 	ret = HYPERVISOR_grant_table_op(
 		GNTTABOP_unmap_grant_ref, tx_unmap_ops, gop - tx_unmap_ops);
 	BUG_ON(ret);
 
-	while (dealloc_cons != dp) {
-		pending_idx = dealloc_ring[MASK_PEND_IDX(dealloc_cons++)];
+	list_for_each_entry_safe(inuse, n, &list, list) {
+		pending_idx = inuse - pending_inuse;
 
 		netif = pending_tx_info[pending_idx].netif;
 
 		make_tx_response(netif, &pending_tx_info[pending_idx].req, 
 				 NETIF_RSP_OKAY);
 
+		/* Ready for next use. */
+		init_page_count(mmap_pages[pending_idx]);
+
 		pending_ring[MASK_PEND_IDX(pending_prod++)] = pending_idx;
 
 		netif_put(netif);
+
+		list_del_init(&inuse->list);
 	}
 }
 
@@ -1014,6 +1130,11 @@ static void netbk_fill_frags(struct sk_b
 		unsigned long pending_idx;
 
 		pending_idx = (unsigned long)frag->page;
+
+		pending_inuse[pending_idx].alloc_time = jiffies;
+		list_add_tail(&pending_inuse[pending_idx].list,
+			      &pending_inuse_head);
+
 		txp = &pending_tx_info[pending_idx].req;
 		frag->page = virt_to_page(idx_to_kaddr(pending_idx));
 		frag->size = txp->size;
@@ -1302,8 +1423,24 @@ static void net_tx_action(unsigned long 
 		netif->stats.rx_bytes += skb->len;
 		netif->stats.rx_packets++;
 
+		if (unlikely(netbk_copy_skb_mode == NETBK_ALWAYS_COPY_SKB) &&
+		    unlikely(skb_linearize(skb))) {
+			DPRINTK("Can't linearize skb in net_tx_action.\n");
+			kfree_skb(skb);
+			continue;
+		}
+
 		netif_rx(skb);
 		netif->dev->last_rx = jiffies;
+	}
+
+	if (netbk_copy_skb_mode == NETBK_DELAYED_COPY_SKB &&
+	    !list_empty(&pending_inuse_head)) {
+		struct netbk_tx_pending_inuse *oldest;
+
+		oldest = list_entry(pending_inuse_head.next,
+				    struct netbk_tx_pending_inuse, list);
+		mod_timer(&netbk_tx_pending_timer, oldest->alloc_time + HZ);
 	}
 }
 
@@ -1324,9 +1461,6 @@ static void netif_idx_release(u16 pendin
 
 static void netif_page_release(struct page *page)
 {
-	/* Ready for next use. */
-	init_page_count(page);
-
 	netif_idx_release(page->index);
 }
 
@@ -1448,6 +1582,10 @@ static int __init netback_init(void)
 	net_timer.data = 0;
 	net_timer.function = net_alarm;
 
+	init_timer(&netbk_tx_pending_timer);
+	netbk_tx_pending_timer.data = 0;
+	netbk_tx_pending_timer.function = netbk_tx_pending_timeout;
+
 	mmap_pages = alloc_empty_pages_and_pagevec(MAX_PENDING_REQS);
 	if (mmap_pages == NULL) {
 		printk("%s: out of memory\n", __FUNCTION__);
@@ -1458,6 +1596,7 @@ static int __init netback_init(void)
 		page = mmap_pages[i];
 		SetPageForeign(page, netif_page_release);
 		page->index = i;
+		INIT_LIST_HEAD(&pending_inuse[i].list);
 	}
 
 	pending_cons = 0;
@@ -1467,6 +1606,15 @@ static int __init netback_init(void)
 
 	spin_lock_init(&net_schedule_list_lock);
 	INIT_LIST_HEAD(&net_schedule_list);
+
+	netbk_copy_skb_mode = NETBK_DONT_COPY_SKB;
+	if (MODPARM_copy_skb) {
+		if (HYPERVISOR_grant_table_op(GNTTABOP_unmap_and_replace,
+					      NULL, 0))
+			netbk_copy_skb_mode = NETBK_ALWAYS_COPY_SKB;
+		else
+			netbk_copy_skb_mode = NETBK_DELAYED_COPY_SKB;
+	}
 
 	netif_xenbus_init();
 
diff -r 3ac19fda0bc2 linux-2.6-xen-sparse/drivers/xen/netback/xenbus.c
--- a/linux-2.6-xen-sparse/drivers/xen/netback/xenbus.c	Fri Mar 02 12:11:52 2007 +0000
+++ b/linux-2.6-xen-sparse/drivers/xen/netback/xenbus.c	Tue Mar 20 14:08:40 2007 +1100
@@ -62,6 +62,7 @@ static int netback_probe(struct xenbus_d
 	const char *message;
 	struct xenbus_transaction xbt;
 	int err;
+	int sg;
 	struct backend_info *be = kzalloc(sizeof(struct backend_info),
 					  GFP_KERNEL);
 	if (!be) {
@@ -73,6 +74,10 @@ static int netback_probe(struct xenbus_d
 	be->dev = dev;
 	dev->dev.driver_data = be;
 
+	sg = 1;
+	if (netbk_copy_skb_mode == NETBK_ALWAYS_COPY_SKB)
+		sg = 0;
+
 	do {
 		err = xenbus_transaction_start(&xbt);
 		if (err) {
@@ -80,14 +85,14 @@ static int netback_probe(struct xenbus_d
 			goto fail;
 		}
 
-		err = xenbus_printf(xbt, dev->nodename, "feature-sg", "%d", 1);
+		err = xenbus_printf(xbt, dev->nodename, "feature-sg", "%d", sg);
 		if (err) {
 			message = "writing feature-sg";
 			goto abort_transaction;
 		}
 
 		err = xenbus_printf(xbt, dev->nodename, "feature-gso-tcpv4",
-				    "%d", 1);
+				    "%d", sg);
 		if (err) {
 			message = "writing feature-gso-tcpv4";
 			goto abort_transaction;

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-20  4:46 RFC: [0/2] Remove netloop by lazy copying in netback Herbert Xu
  2007-03-20  4:50 ` RFC: [1/2] [XEN] gnttab: Add new op unmap_and_replace Herbert Xu
  2007-03-20  4:56 ` RFC: [2/2] [NET] back: Add lazy copying Herbert Xu
@ 2007-03-20  7:10 ` Keir Fraser
  2007-03-20 10:11   ` Herbert Xu
  2007-03-20 10:22 ` Keir Fraser
  3 siblings, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-20  7:10 UTC (permalink / raw)
  To: Herbert Xu, Keir Fraser, Xen Development Mailing List

Sounds good. I'm not sure about putting this in for 3.0.5 since that's only
a couple of weeks away. Also I'm loathe to add yet another darned
grant-table operation if we can avoid it. In this case I have my doubts it's
required, if we're prepared to hook off a slow path in the page-fault
handler (unhandled kernel fault). If we could get a callback into netback in
that case for some range of kernel virtual addresses then we could fix up
lazily in the case that an access races replacement of foreign mapping with
local mapping. We would like this anyway so we can do zero-copy-to-the-wire
(this may additionally require us to rev our interface to copy the packet
headers in the rings, or we could see how far just a small grant-copy
operation gets us). The idea would be to point the skb at a bunch of empty
mappings in the netback mapping area: if the areas have to be filled in (due
to firewall rules, filtering, PIO, etc) then we do it on demand from the
page-fault handler.

 -- Keir

On 20/3/07 04:46, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

> These two patches remove the need for netloop by performing the
> copying in netback and only if it is necessary.  The rationale
> is that most packets will be processed without delay allowing
> them to be freed without copying at all.  So instead of copying
> every packet destined to dom0 we'll only copy those that linger
> longer than a specified amount of time (currently 0.5s).
> 
> As it is netloop doesn't take care of all delays anyway.  For
> instance packets delayed by qdisc or netfilter can hold up
> resources without any limits.  Also if bridging isn't used
> then traffic to dom0 does not even go through netloop.
> 
> Testing shows that these patches do eliminate the copying for
> bulk transfers.  In fact, bulk transfer throughput from domU
> to dom0 are increased by around 50%.  Even when the copying
> path is taken the performance is roughly equal to that of
> netloop despite the unoptimised copying path.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-20  7:10 ` RFC: [0/2] Remove netloop by lazy copying in netback Keir Fraser
@ 2007-03-20 10:11   ` Herbert Xu
  2007-03-20 10:18     ` Keir Fraser
  0 siblings, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-20 10:11 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Xen Development Mailing List, Keir Fraser

On Tue, Mar 20, 2007 at 07:10:27AM +0000, Keir Fraser wrote:
> Sounds good. I'm not sure about putting this in for 3.0.5 since that's only
> a couple of weeks away. Also I'm loathe to add yet another darned
> grant-table operation if we can avoid it. In this case I have my doubts it's
> required, if we're prepared to hook off a slow path in the page-fault
> handler (unhandled kernel fault). If we could get a callback into netback in
> that case for some range of kernel virtual addresses then we could fix up

Yes that's a good idea and would avoid the grant table addition.

My only concern would be future acceptance into the mainstream
Linux kernel, especially on the domU side since I'd like to see
this idea used for dom0 => domU as well.

However, I'm certainly happy to explore doing it in the way you
suggested.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-20 10:11   ` Herbert Xu
@ 2007-03-20 10:18     ` Keir Fraser
  2007-03-22 10:50       ` Herbert Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-20 10:18 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Xen Development Mailing List, Keir Fraser

On 20/3/07 10:11, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

> Yes that's a good idea and would avoid the grant table addition.
> 
> My only concern would be future acceptance into the mainstream
> Linux kernel, especially on the domU side since I'd like to see
> this idea used for dom0 => domU as well.
> 
> However, I'm certainly happy to explore doing it in the way you
> suggested.

That's quite a valid concern, but I think that the required addition to the
#PF handler (certainly for i386 and x86/64) will be clean and small, and it
will not affect #PF critical-path latencies. I'd be fairly optimistic about
it getting accepted upstream, perhaps modulo concerns over whether we'd need
to implement it for *every* architecture.

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-20  4:46 RFC: [0/2] Remove netloop by lazy copying in netback Herbert Xu
                   ` (2 preceding siblings ...)
  2007-03-20  7:10 ` RFC: [0/2] Remove netloop by lazy copying in netback Keir Fraser
@ 2007-03-20 10:22 ` Keir Fraser
  2007-03-20 10:27   ` Keir Fraser
  3 siblings, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-20 10:22 UTC (permalink / raw)
  To: Herbert Xu, Keir Fraser, Xen Development Mailing List

On 20/3/07 04:46, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

> These two patches remove the need for netloop by performing the
> copying in netback and only if it is necessary.  The rationale
> is that most packets will be processed without delay allowing
> them to be freed without copying at all.  So instead of copying
> every packet destined to dom0 we'll only copy those that linger
> longer than a specified amount of time (currently 0.5s).

One other thought I had -- does the Linux network stack ever modify date
residing outside the main skbuff data area (i.e., data in fragments)? If it
ever does, wouldn't your delayed copy in netback potentially race such
updates? If the Linux network stack guarantees never to do such a thing,
that would be nice for us. :-)

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-20 10:22 ` Keir Fraser
@ 2007-03-20 10:27   ` Keir Fraser
  2007-03-20 11:50     ` David Edmondson
  0 siblings, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-20 10:27 UTC (permalink / raw)
  To: Keir Fraser, Herbert Xu, Xen Development Mailing List




On 20/3/07 10:22, "Keir Fraser" <keir@xensource.com> wrote:

> One other thought I had -- does the Linux network stack ever modify date
> residing outside the main skbuff data area (i.e., data in fragments)? If it
> ever does, wouldn't your delayed copy in netback potentially race such
> updates? If the Linux network stack guarantees never to do such a thing,
> that would be nice for us. :-)

Scrub that question: we already map the grants read-only, so the network
stack can't be trying to write to fragments.

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-20 10:27   ` Keir Fraser
@ 2007-03-20 11:50     ` David Edmondson
  0 siblings, 0 replies; 63+ messages in thread
From: David Edmondson @ 2007-03-20 11:50 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Xen Development Mailing List, Herbert Xu

On Tue, Mar 20, 2007 at 10:27:48AM +0000, Keir Fraser wrote:
> On 20/3/07 10:22, "Keir Fraser" <keir@xensource.com> wrote:
> > One other thought I had -- does the Linux network stack ever modify date
> > residing outside the main skbuff data area (i.e., data in fragments)? If it
> > ever does, wouldn't your delayed copy in netback potentially race such
> > updates? If the Linux network stack guarantees never to do such a thing,
> > that would be nice for us. :-)
> 
> Scrub that question: we already map the grants read-only, so the network
> stack can't be trying to write to fragments.

The Solaris stack expects to be able to write to received packets and
I have a hack in our front and back end drivers to permit a writable
mapping of front->back buffers in preference to a copy (it's declared
using a feature flag, so a guest that doesn't do it still works).

I've resisted meddling in the fault handler, but I guess that I should
spend some time there in case it's a straightforward addition...

dme.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-20 10:18     ` Keir Fraser
@ 2007-03-22 10:50       ` Herbert Xu
  2007-03-22 15:40         ` Keir Fraser
  0 siblings, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-22 10:50 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Xen Development Mailing List, Keir Fraser

On Tue, Mar 20, 2007 at 03:18:54AM -0700, Keir Fraser wrote:
> 
> That's quite a valid concern, but I think that the required addition to the
> #PF handler (certainly for i386 and x86/64) will be clean and small, and it
> will not affect #PF critical-path latencies. I'd be fairly optimistic about
> it getting accepted upstream, perhaps modulo concerns over whether we'd need
> to implement it for *every* architecture.

OK I've given it a spin and it's pretty straight-forward for
x86 as you said.  However, we'll need a bit more work for
ia64/ppc either on the kernel side or on the hypervisor side
as there is no way currently to swap entries in the P2M table.

Any better suggestions for dealing with those two?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-22 10:50       ` Herbert Xu
@ 2007-03-22 15:40         ` Keir Fraser
  2007-03-23  3:17           ` Herbert Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-22 15:40 UTC (permalink / raw)
  To: Herbert Xu, Keir Fraser; +Cc: Xen Development Mailing List, Keir Fraser


> On Tue, Mar 20, 2007 at 03:18:54AM -0700, Keir Fraser wrote:
>> 
>> That's quite a valid concern, but I think that the required addition to the
>> #PF handler (certainly for i386 and x86/64) will be clean and small, and it
>> will not affect #PF critical-path latencies. I'd be fairly optimistic about
>> it getting accepted upstream, perhaps modulo concerns over whether we'd need
>> to implement it for *every* architecture.
> 
> OK I've given it a spin and it's pretty straight-forward for
> x86 as you said.  However, we'll need a bit more work for
> ia64/ppc either on the kernel side or on the hypervisor side
> as there is no way currently to swap entries in the P2M table.
> 
> Any better suggestions for dealing with those two?

I didn't look super closely at the precise set of steps on x86 either. Do
you take a normal page of memory in the guest's address space and simply
give it an extra mapping in the netback area? Or do you take a page with no
pseudophysical address and assign it a pseudophysical address corresponding
to its place in the netback area?

I certainly assumed (a), and I think that would work fine on ia64 and
powerpc, as the page would already have a pseudophysical address?

Maybe I'm misunderstanding something... :-)

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-22 15:40         ` Keir Fraser
@ 2007-03-23  3:17           ` Herbert Xu
  2007-03-23 10:32             ` Keir Fraser
  0 siblings, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-23  3:17 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Xen Development Mailing List, Keir Fraser, Keir Fraser

On Thu, Mar 22, 2007 at 03:40:59PM +0000, Keir Fraser wrote:
> 
> I didn't look super closely at the precise set of steps on x86 either. Do
> you take a normal page of memory in the guest's address space and simply
> give it an extra mapping in the netback area? Or do you take a page with no
> pseudophysical address and assign it a pseudophysical address corresponding
> to its place in the netback area?

I allocate a new page in dom0, i.e., an mfn/pfn pair, take away the mfn
and give it to the original netback pfn which would have no mfn after
the grant table entry has been unmapped.  The newly mfn-less pfn is
then put back into the netback area in the place of the old one which
is now a normal page.

> I certainly assumed (a), and I think that would work fine on ia64 and
> powerpc, as the page would already have a pseudophysical address?

a) would be good but it doesn't look easy.  The reason is that we have
to fit the result into skb frags which takes a page_struct and kmaps it
on demand.  In other words once the skb is in injected into the stack
the pseudo-physical address has to remain fixed whether we have a grant
table entry mapped there or not.

So changing the guest PTE in the auto-translted case (ia64/ppc and EPT/NPT
in future) isn't possible.  As we either have to change the guest PTE or
the host P2M, this leaves us with only the latter as an option.  We can't
do that currently as the hypervisor doesn't have such an operation.

So it looks like we'll need a new hypercall (although not a grant table
operation) to do it this way on an auto-translated dom0.

PS Check out the To/Cc fields, the number of Keirs is getting out of
control :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-23  3:17           ` Herbert Xu
@ 2007-03-23 10:32             ` Keir Fraser
  2007-03-23 11:42               ` Herbert Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-23 10:32 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Xen Development Mailing List

On 23/3/07 03:17, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

> a) would be good but it doesn't look easy.  The reason is that we have
> to fit the result into skb frags which takes a page_struct and kmaps it
> on demand.  In other words once the skb is in injected into the stack
> the pseudo-physical address has to remain fixed whether we have a grant
> table entry mapped there or not.
> 
> So changing the guest PTE in the auto-translted case (ia64/ppc and EPT/NPT
> in future) isn't possible.

It still sounds like it would work. The fragment's 'struct page *' will map
to a particular kernel virtual addres. That kernel virtual address can be
transformed by arithmetic back to the 'struct page *'. The fact that the pte
that maps that kernel virtual address actually points over at some other
poor unsuspecting pfn (which already has a struct page *, thank you very
much) doesn't actually matter, does it? Does anyone ever go look at the pte
contents and try to work out the 'struct page *' from that? I doubt it -- or
our netback driver would not work right now on x86 (as the mach-to-phys
entry is garbage from the p.o.v. of dom0, so any attempt to translate the
pte contents into something meaningful in pseudophys space would fail).

PS. I've stripped my name from the To/Cc lists! :-)

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-23 10:32             ` Keir Fraser
@ 2007-03-23 11:42               ` Herbert Xu
  2007-03-23 11:47                 ` Keir Fraser
                                   ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Herbert Xu @ 2007-03-23 11:42 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Xen Development Mailing List

On Fri, Mar 23, 2007 at 10:32:37AM +0000, Keir Fraser wrote:
> 
> It still sounds like it would work. The fragment's 'struct page *' will map
> to a particular kernel virtual addres. That kernel virtual address can be
> transformed by arithmetic back to the 'struct page *'. The fact that the pte
> that maps that kernel virtual address actually points over at some other
> poor unsuspecting pfn (which already has a struct page *, thank you very
> much) doesn't actually matter, does it? Does anyone ever go look at the pte
> contents and try to work out the 'struct page *' from that? I doubt it -- or
> our netback driver would not work right now on x86 (as the mach-to-phys
> entry is garbage from the p.o.v. of dom0, so any attempt to translate the
> pte contents into something meaningful in pseudophys space would fail).

You're right, it's not as bad as I thought.

I was worried about this because if anybody calls virt_to_page
or something equivalent on that virtual address they'll get
the old struct page/pseudo-physical address rather than the new
one.  However, this is OK because as long as the access is
within the guest they'll still have to convert it back to that
virtual address.

The only catch is that I wanted to have the same lazy-copy
mechanism for dom0=>domU.  In that case we can't rely on domU
to trigger the copy so dom0 (or the driver domain) will need
to initiate it.  The easiest way to do that from outside domU
is through changing its P2M table.

I suppose doing it this way doesn't necessarily preclude a
different solution for dom0=>domU.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-23 11:42               ` Herbert Xu
@ 2007-03-23 11:47                 ` Keir Fraser
  2007-03-23 13:24                 ` Ian Pratt
  2007-03-25 11:41                 ` Herbert Xu
  2 siblings, 0 replies; 63+ messages in thread
From: Keir Fraser @ 2007-03-23 11:47 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Xen Development Mailing List




On 23/3/07 11:42, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

> The only catch is that I wanted to have the same lazy-copy
> mechanism for dom0=>domU.  In that case we can't rely on domU
> to trigger the copy so dom0 (or the driver domain) will need
> to initiate it.  The easiest way to do that from outside domU
> is through changing its P2M table.

Hmmmmm..... :-)

> I suppose doing it this way doesn't necessarily preclude a
> different solution for dom0=>domU.

We should cross this bridge when we come to it.

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-23 11:42               ` Herbert Xu
  2007-03-23 11:47                 ` Keir Fraser
@ 2007-03-23 13:24                 ` Ian Pratt
  2007-03-23 23:07                   ` Santos, Jose Renato G
  2007-03-25 11:41                 ` Herbert Xu
  2 siblings, 1 reply; 63+ messages in thread
From: Ian Pratt @ 2007-03-23 13:24 UTC (permalink / raw)
  To: Herbert Xu, Keir Fraser
  Cc: Xen Development Mailing List, Santos, Jose Renato G

> The only catch is that I wanted to have the same lazy-copy
> mechanism for dom0=>domU.  In that case we can't rely on domU
> to trigger the copy so dom0 (or the driver domain) will need
> to initiate it.  The easiest way to do that from outside domU
> is through changing its P2M table.
> 
> I suppose doing it this way doesn't necessarily preclude a
> different solution for dom0=>domU.

I think we can make the guest TX path work rather better by putting the
tx packet header off into a separate fragment and placing it within one
of a set of pages that are 'permanently' mapped between the front and
back ends. The backend would not typically grant map the payload
fragment(s), but only do so if it catches a page fault trying to access
them.  

The way I'd do the 'permanent' mappings is to have a bit in the grant
reference passed across the ring that indicates that the backend is free
to leave the grant mapped. The bit should be set whenever the grant
entry is used (the front end can't mix and match). The backend thus just
needs to keep track of which grant refs it has mapped via this permanent
scheme so that it can spot (valid) reuse, and clean up afterwards. Since
the front end will typically use a block of grant refs for this purpose
the backend code can be optimised for a simple range check. The frontend
is free to pack multiple header fragments on to a page to save memory
(rather than using larger payload sized frags).

Jose Renato investigated something similar to this last year, but AFAIK
we didn't take it the whole way of introducing permanent mappings with
no (default) mapping of payloads. 


BTW: when adding fragment support you 'borrowed' the descriptor id field
which wasn't being used at the time.  This means we can no longer return
free buffers to the rx ring out of order. This is rather unfortunate as
its very useful for dealing with some kinds of smart NIC. I expect we'll
have to introduce a net ring v2 format in the not too distant future
unless we can think of a cunning way of stealing them back?

Thanks,
Ian

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-23 13:24                 ` Ian Pratt
@ 2007-03-23 23:07                   ` Santos, Jose Renato G
  2007-03-23 23:29                     ` Ian Pratt
  0 siblings, 1 reply; 63+ messages in thread
From: Santos, Jose Renato G @ 2007-03-23 23:07 UTC (permalink / raw)
  To: Ian Pratt, Herbert Xu, Keir Fraser; +Cc: Xen Development Mailing List

 

> -----Original Message-----
> From: Ian Pratt [mailto:Ian.Pratt@cl.cam.ac.uk] 
> Sent: Friday, March 23, 2007 6:24 AM
> To: Herbert Xu; Keir Fraser
> Cc: Xen Development Mailing List; Santos, Jose Renato G; 
> ian.pratt@cl.cam.ac.uk
> Subject: RE: [Xen-devel] RFC: [0/2] Remove netloop by lazy 
> copying in netback
> 
> > The only catch is that I wanted to have the same lazy-copy 
> mechanism 
> > for dom0=>domU.  In that case we can't rely on domU to trigger the 
> > copy so dom0 (or the driver domain) will need to initiate it.  The 
> > easiest way to do that from outside domU is through 
> changing its P2M 
> > table.
> > 
> > I suppose doing it this way doesn't necessarily preclude a 
> different 
> > solution for dom0=>domU.
> 
> I think we can make the guest TX path work rather better by 
> putting the tx packet header off into a separate fragment and 
> placing it within one of a set of pages that are 
> 'permanently' mapped between the front and back ends. The 
> backend would not typically grant map the payload 
> fragment(s), but only do so if it catches a page fault trying 
> to access them.  
> 
> The way I'd do the 'permanent' mappings is to have a bit in 
> the grant reference passed across the ring that indicates 
> that the backend is free to leave the grant mapped. The bit 
> should be set whenever the grant entry is used (the front end 
> can't mix and match). The backend thus just needs to keep 
> track of which grant refs it has mapped via this permanent 
> scheme so that it can spot (valid) reuse, and clean up 
> afterwards. Since the front end will typically use a block of 
> grant refs for this purpose the backend code can be optimised 
> for a simple range check. The frontend is free to pack 
> multiple header fragments on to a page to save memory (rather 
> than using larger payload sized frags).
> 
> Jose Renato investigated something similar to this last year, 
> but AFAIK we didn't take it the whole way of introducing 
> permanent mappings with no (default) mapping of payloads. 
> 
>

  The way I did it was slightly different.
  The shared pages used to store the packet headers are  
  mapped at driver initialization at the same time the
  I/O ring is initialized (You can think of these pages
  as an extension to the I/O ring and mapped and unmapped at
  the same time). This shared header area
  is divided into fixed size slots large enough to store all
  packet headers (MAC/IP/TCP). Netfront copies the
  header of the packet to a free slot on the shared header
  buffer area and passes the offset of the header on the
  first I/O ring request for the packet (instead of
  passing a grant reference). Payload data are transferred
  as additional requests on I/O ring using grant
  references as usual. Netback assumes the first request
  of all packets point to headers in the shared area and
  copies the header into the main socket buffer data area.
  Payload data (in additional I/O ring requests) are associated
  with skb fragments but not mapped into dom0 memory 
  unless a page fault happens
  (indicating that dom0 is trying to access that data).
  If the packet is sent out on the wire, the payload is never  
  touched by the CPU (assuming no netfilter rules, etc...)
  and no memory mapping is necessary.

  I have put this code on the shelf for a while as I have been
  busy on analyzing/optimizing the receive path.
  I plan to port my patch to the current xen-unstable
  and get some performance data for the Xen summit after
  I finish my experiments on the receive side.
  I think we should include this on future releases, but
  I would like to hold on until I have some other results
  that may influence how we do this.
  This would be a good topic for discussion on the summit

  Regards
  
  Renato

 
> BTW: when adding fragment support you 'borrowed' the 
> descriptor id field which wasn't being used at the time.  
> This means we can no longer return free buffers to the rx 
> ring out of order. This is rather unfortunate as its very 
> useful for dealing with some kinds of smart NIC. I expect 
> we'll have to introduce a net ring v2 format in the not too 
> distant future unless we can think of a cunning way of 
> stealing them back?
> 
> Thanks,
> Ian
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-23 23:07                   ` Santos, Jose Renato G
@ 2007-03-23 23:29                     ` Ian Pratt
  2007-03-26 23:58                       ` Santos, Jose Renato G
  0 siblings, 1 reply; 63+ messages in thread
From: Ian Pratt @ 2007-03-23 23:29 UTC (permalink / raw)
  To: Santos, Jose Renato G, Ian Pratt, Herbert Xu, Keir Fraser
  Cc: Xen Development Mailing List

> > I think we can make the guest TX path work rather better by
> > putting the tx packet header off into a separate fragment and
> > placing it within one of a set of pages that are
> > 'permanently' mapped between the front and back ends. The
> > backend would not typically grant map the payload
> > fragment(s), but only do so if it catches a page fault trying
> > to access them.
> >
> > The way I'd do the 'permanent' mappings is to have a bit in
> > the grant reference passed across the ring that indicates
> > that the backend is free to leave the grant mapped. The bit
> > should be set whenever the grant entry is used (the front end
> > can't mix and match). The backend thus just needs to keep
> > track of which grant refs it has mapped via this permanent
> > scheme so that it can spot (valid) reuse, and clean up
> > afterwards. Since the front end will typically use a block of
> > grant refs for this purpose the backend code can be optimised
> > for a simple range check. The frontend is free to pack
> > multiple header fragments on to a page to save memory (rather
> > than using larger payload sized frags).
>
>   The way I did it was slightly different.
>   The shared pages used to store the packet headers are
>   mapped at driver initialization at the same time the
>   I/O ring is initialized (You can think of these pages
>   as an extension to the I/O ring and mapped and unmapped at
>   the same time). This shared header area
>   is divided into fixed size slots large enough to store all
>   packet headers (MAC/IP/TCP). Netfront copies the
>   header of the packet to a free slot on the shared header
>   buffer area and passes the offset of the header on the
>   first I/O ring request for the packet (instead of
>   passing a grant reference). Payload data are transferred
>   as additional requests on I/O ring using grant
>   references as usual. Netback assumes the first request
>   of all packets point to headers in the shared area and
>   copies the header into the main socket buffer data area.
>   Payload data (in additional I/O ring requests) are associated
>   with skb fragments but not mapped into dom0 memory
>   unless a page fault happens
>   (indicating that dom0 is trying to access that data).
>   If the packet is sent out on the wire, the payload is never
>   touched by the CPU (assuming no netfilter rules, etc...)
>   and no memory mapping is necessary.

Yep, the approach is similar, but I think having the 'permanent' grants
mechanism is probably a bit more flexible for managing the header
fragments.

It would be good if you could post your patch as it has relevance to
what Herbert is currently working on.

BTW: I'm pretty convinced its time to redefine the net ring format to
ensure each fragment has an id, flags, grant, offset and length fields.
Things are just getting messy with the current format.

Best,
Ian

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-23 11:42               ` Herbert Xu
  2007-03-23 11:47                 ` Keir Fraser
  2007-03-23 13:24                 ` Ian Pratt
@ 2007-03-25 11:41                 ` Herbert Xu
  2007-03-25 12:27                   ` Keir Fraser
  2 siblings, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-25 11:41 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Xen Development Mailing List

On Fri, Mar 23, 2007 at 10:42:17PM +1100, Herbert Xu wrote:
> On Fri, Mar 23, 2007 at 10:32:37AM +0000, Keir Fraser wrote:
> > 
> > It still sounds like it would work. The fragment's 'struct page *' will map
> > to a particular kernel virtual addres. That kernel virtual address can be
> > transformed by arithmetic back to the 'struct page *'. The fact that the pte
> > that maps that kernel virtual address actually points over at some other
> > poor unsuspecting pfn (which already has a struct page *, thank you very
> > much) doesn't actually matter, does it? Does anyone ever go look at the pte
> > contents and try to work out the 'struct page *' from that? I doubt it -- or
> > our netback driver would not work right now on x86 (as the mach-to-phys
> > entry is garbage from the p.o.v. of dom0, so any attempt to translate the
> > pte contents into something meaningful in pseudophys space would fail).
> 
> You're right, it's not as bad as I thought.

Actually looking into it further for ia64 there is still a problem.
We use large pages for the identity mapping on ia64.  In order to
modify the PTE we'd have to break the large pages.

Am I missing something obvious?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-25 11:41                 ` Herbert Xu
@ 2007-03-25 12:27                   ` Keir Fraser
  2007-03-26  2:19                     ` Herbert Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-25 12:27 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Xen Development Mailing List

On 25/3/07 12:41, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

> Actually looking into it further for ia64 there is still a problem.
> We use large pages for the identity mapping on ia64.  In order to
> modify the PTE we'd have to break the large pages.
> 
> Am I missing something obvious?

No. 

Actually I changed my mind -- your original scheme (strip the original
pseudophysical frame of its RAM, and add that original pseudophys frame into
the netback set of frames) has the very nice advantage that we don't end up
with queued packets taking space in the netback frame pool. I missed that
your scheme essentially switches the roles of a normal RAM frame and a
netback frame, so that you end up with a freed-up netback frame at the end
of the swap. Sorry about that.

So we're back to the problem of doing this switch when Xen is doing the p2m
translation (as on ia64 for example). On x86 we have a XENMEM_add_to_physmap
hypercall. This could be generalised to other architectures and extended.
For example, we could add a XENMAPSPACE_gpfn -- which would mean take the
'thing' currently mapped at the specified gpfn and map it at the new gpfn
location instead. I'd certainly personally rather see add_to_physmap()
extended than add extra single-purpose crap to the grant-table interfaces.

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-25 12:27                   ` Keir Fraser
@ 2007-03-26  2:19                     ` Herbert Xu
  2007-03-26 18:36                       ` Keir Fraser
  2007-03-27  3:41                       ` Isaku Yamahata
  0 siblings, 2 replies; 63+ messages in thread
From: Herbert Xu @ 2007-03-26  2:19 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Xen Development Mailing List

On Sun, Mar 25, 2007 at 01:27:04PM +0100, Keir Fraser wrote:
> 
> So we're back to the problem of doing this switch when Xen is doing the p2m
> translation (as on ia64 for example). On x86 we have a XENMEM_add_to_physmap
> hypercall. This could be generalised to other architectures and extended.
> For example, we could add a XENMAPSPACE_gpfn -- which would mean take the
> 'thing' currently mapped at the specified gpfn and map it at the new gpfn
> location instead. I'd certainly personally rather see add_to_physmap()
> extended than add extra single-purpose crap to the grant-table interfaces.

I've had a look at this and it seems that

1) We don't have the underlying operations suited for this.

We need something that can replace a p2m entry atomically and more
importantly swap two p2m entries rather than setting one and unmapping
the other.  The former is because we can't easily process p2m page
faults in the guest.  The latter is because we stlil need to unmap the
grant table entry after this operation so we have to keep the entry
around.

This is actually one of the reasons I picked the grant tables interface
originally in that we could unmap it at the same time rather than doing
a full swap followed by an unmap.

So are you OK with adding underlying operations that allows a full swap
of two p2m entries? This would then be used as follows in translated mode:

	a) new_addr = alloc_page
	b) memcpy(new_addr, addr, len)
	c) p2m_swap(__pa(new_addr), __pa(addr))
	d) grant_unmap(__pa(new_addr))

2) I'm unsure what you want me to do for non-translated mode, i.e., x86.

Are you happy with the new grant table operation or do you want to follow
a swap mode as above? The swapping code would look like:

	a) new_addr = alloc_page
	b) memcpy(new_addr, addr, len)
	c) pte_swap(new_addr, addr)
	d) grant_unmap(new_addr)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-26  2:19                     ` Herbert Xu
@ 2007-03-26 18:36                       ` Keir Fraser
  2007-03-26 21:08                         ` Herbert Xu
  2007-03-27  3:41                       ` Isaku Yamahata
  1 sibling, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-26 18:36 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Xen Development Mailing List

On 26/3/07 03:19, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

> We need something that can replace a p2m entry atomically and more
> importantly swap two p2m entries rather than setting one and unmapping
> the other.  The former is because we can't easily process p2m page
> faults in the guest.  The latter is because we stlil need to unmap the
> grant table entry after this operation so we have to keep the entry
> around.

Can't we wrap the 'swap around' critical section in an irq-safe spinlock?
All we'd need to do from the page-fault handler then is a barrier on that
spinlock (i.e, wait for it to be released). Netback can simply copy the page
to new memory frame, unmap the grant, then relocate the new memory frame's
pseudophysical address.

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-26 18:36                       ` Keir Fraser
@ 2007-03-26 21:08                         ` Herbert Xu
  2007-03-27  0:33                           ` Keir Fraser
  0 siblings, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-26 21:08 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Xen Development Mailing List

On Mon, Mar 26, 2007 at 07:36:16PM +0100, Keir Fraser wrote:
> On 26/3/07 03:19, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:
> 
> > We need something that can replace a p2m entry atomically and more
> > importantly swap two p2m entries rather than setting one and unmapping
> > the other.  The former is because we can't easily process p2m page
> > faults in the guest.  The latter is because we stlil need to unmap the
> > grant table entry after this operation so we have to keep the entry
> > around.
> 
> Can't we wrap the 'swap around' critical section in an irq-safe spinlock?
> All we'd need to do from the page-fault handler then is a barrier on that
> spinlock (i.e, wait for it to be released). Netback can simply copy the page
> to new memory frame, unmap the grant, then relocate the new memory frame's
> pseudophysical address.

That works fine for the x86 case.  But when it's auto-translated,
you won't even get a page fault in the guest because the guest PTE
is unchanged and completely valid.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-23 23:29                     ` Ian Pratt
@ 2007-03-26 23:58                       ` Santos, Jose Renato G
  2007-03-27 13:15                         ` Ian Pratt
  0 siblings, 1 reply; 63+ messages in thread
From: Santos, Jose Renato G @ 2007-03-26 23:58 UTC (permalink / raw)
  To: Ian Pratt, Herbert Xu, Keir Fraser; +Cc: Xen Development Mailing List

[-- Attachment #1: Type: text/plain, Size: 2221 bytes --]

 

> 
> Yep, the approach is similar, but I think having the 
> 'permanent' grants mechanism is probably a bit more flexible 
> for managing the header fragments.
> 

  Not sure why it would be more flexible. Do you mean dealing with
headers that are in fragments? If headers are in fragments they will be
linearized when copied into the shared header area by netfront. Also, it
seems to me that having a fixed set of pages at initialization that
never change is easier to manage. I probably did not understand you
correctly... 
 
> It would be good if you could post your patch as it has 
> relevance to what Herbert is currently working on.
> 
  Ok. Here are the patches. A few disclaimers. They were not intended to
be distributed as is, so there may be some cleanups/optimizations
needed, although they were tested on an older version and used to work.
They also do not apply cleanly to the current xen-unstable. I am not
sure they will be very helpful to Herbert, but since you asked, here
they are.

> BTW: I'm pretty convinced its time to redefine the net ring 
> format to ensure each fragment has an id, flags, grant, 
> offset and length fields.
> Things are just getting messy with the current format.

I agree that we will need a new format soon, but I am not sure I
understood your concerns above. Currently all fragments do have an id,
flag, grant, offset and length fields. Maybe you mean that we need a
better way to represent multiple fragments belonging to the same packet.

Any way, the performance analysis that I am doing may indicate that we
might need some architectural changes on the device model. I would like
to discuss this before we settle on a format. I am trying to get some
data in time for the summit, but I am racing the clock...

Regards

Renato

========================
Patches

tx_header_copy: Create and uses set of shared pages between netfront and
netback to copy packet headers on network TX path.
tx_lazy_map: Do not create a host mapping for packet fragments
Tx_handle_page_fault: Handle page fault caused by dom0 trying to access
unmapped skb fragment. Creates the mapping in that case

> 
> Best,
> Ian
> 
> 
> 
> 

[-- Attachment #2: tx_header_copy-11760.patch --]
[-- Type: application/octet-stream, Size: 29431 bytes --]

diff -r 6ed4368b4a9e -r baccbd3a1aa1 linux-2.6-xen-sparse/drivers/xen/netback/common.h
--- a/linux-2.6-xen-sparse/drivers/xen/netback/common.h	Sun Oct 15 09:53:20 2006 +0100
+++ b/linux-2.6-xen-sparse/drivers/xen/netback/common.h	Mon Oct 30 17:01:26 2006 -0800
@@ -55,6 +55,10 @@
 #define WPRINTK(fmt, args...)				\
 	printk(KERN_WARNING "xen_net: " fmt, ##args)
 
+/* Maximum number of shared pages for TX packet headers */ 
+#define TX_HEADER_MAX_ORDER 3
+#define TX_HEADER_MAX_PAGES (1U << TX_HEADER_MAX_ORDER)
+
 typedef struct netif_st {
 	/* Unique identifier for this interface. */
 	domid_t          domid;
@@ -76,6 +80,13 @@ typedef struct netif_st {
 	struct vm_struct *tx_comms_area;
 	struct vm_struct *rx_comms_area;
 
+	/* Shared TX header area */
+	unsigned int     tx_header_pages;
+	grant_ref_t      tx_header_ref[TX_HEADER_MAX_PAGES];
+	grant_handle_t   tx_header_handle[TX_HEADER_MAX_PAGES];
+	struct vm_struct *tx_header_area;
+	char             *tx_header_buf;
+ 
 	/* Set of features that can be turned on in dev->features. */
 	int features;
 
diff -r 6ed4368b4a9e -r baccbd3a1aa1 linux-2.6-xen-sparse/drivers/xen/netback/interface.c
--- a/linux-2.6-xen-sparse/drivers/xen/netback/interface.c	Sun Oct 15 09:53:20 2006 +0100
+++ b/linux-2.6-xen-sparse/drivers/xen/netback/interface.c	Mon Oct 30 17:01:26 2006 -0800
@@ -230,6 +230,38 @@ static int map_frontend_pages(
 	netif->rx_shmem_ref    = rx_ring_ref;
 	netif->rx_shmem_handle = op.handle;
 
+ 	return 0;
+}
+
+static int map_header_pages(netif_t *netif)
+{
+	struct gnttab_map_grant_ref op;
+	int i, ret;
+
+	if (!netif->tx_header_area)
+		return 0;
+
+	lock_vm_area(netif->tx_header_area);
+	for (i = 0; i < netif->tx_header_pages; i++) {
+		gnttab_set_map_op(&op, 
+				  (unsigned long)netif->tx_header_buf +
+				  (i << PAGE_SHIFT),
+				  GNTMAP_host_map, 
+				  netif->tx_header_ref[i], 
+				  netif->domid);
+		ret = HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, 
+						&op, 1);
+		BUG_ON(ret);
+		
+		if (op.status) {
+			DPRINTK("Gnttab failure mapping tx_header_ref %d\n",i);
+			unlock_vm_area(netif->tx_header_area);
+			return op.status;
+		}
+		netif->tx_header_handle[i] = op.handle;
+	}
+	unlock_vm_area(netif->tx_header_area);
+
 	return 0;
 }
 
@@ -253,6 +285,25 @@ static void unmap_frontend_pages(netif_t
 	ret = HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, &op, 1);
 	unlock_vm_area(netif->rx_comms_area);
 	BUG_ON(ret);
+}
+
+static void unmap_header_pages(netif_t *netif)
+{
+	struct gnttab_unmap_grant_ref op;
+	int i, ret;
+
+	lock_vm_area(netif->tx_header_area);
+	for (i = 0; i < netif->tx_header_pages; i++) {
+		gnttab_set_unmap_op(&op, 
+				    (unsigned long)netif->tx_header_buf +
+				    (i << PAGE_SHIFT), 
+				    GNTMAP_host_map, 
+				    netif->tx_header_handle[i]);
+		ret = HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, 
+						&op, 1);
+		BUG_ON(ret);
+	}
+	unlock_vm_area(netif->tx_header_area);
 }
 
 int netif_map(netif_t *netif, unsigned long tx_ring_ref,
@@ -278,6 +329,19 @@ int netif_map(netif_t *netif, unsigned l
 	if (err)
 		goto err_map;
 
+	netif->tx_header_area = NULL;
+	if (netif->tx_header_pages > 0) {
+		netif->tx_header_area = 
+			alloc_vm_area(netif->tx_header_pages << PAGE_SHIFT);
+		if (netif->tx_header_area == NULL)
+			goto err_header;
+		netif->tx_header_buf = netif->tx_header_area->addr;
+
+		err = map_header_pages(netif);
+		if (err)
+			goto err_map_header;
+	}
+
 	bind_interdomain.remote_dom = netif->domid;
 	bind_interdomain.remote_port = evtchn;
 
@@ -311,6 +375,11 @@ int netif_map(netif_t *netif, unsigned l
 
 	return 0;
 err_hypervisor:
+	unmap_header_pages(netif);
+err_map_header:
+	free_vm_area(netif->tx_header_area);
+	netif->tx_header_area = NULL;
+err_header:
 	unmap_frontend_pages(netif);
 err_map:
 	free_vm_area(netif->rx_comms_area);
@@ -333,6 +402,12 @@ static void netif_free(netif_t *netif)
 		unmap_frontend_pages(netif);
 		free_vm_area(netif->tx_comms_area);
 		free_vm_area(netif->rx_comms_area);
+	}
+
+	if (netif->tx_header_area) {
+		unmap_header_pages(netif);
+		free_vm_area(netif->tx_header_area);
+		netif->tx_header_area = NULL;
 	}
 
 	free_netdev(netif->dev);
diff -r 6ed4368b4a9e -r baccbd3a1aa1 linux-2.6-xen-sparse/drivers/xen/netback/netback.c
--- a/linux-2.6-xen-sparse/drivers/xen/netback/netback.c	Sun Oct 15 09:53:20 2006 +0100
+++ b/linux-2.6-xen-sparse/drivers/xen/netback/netback.c	Mon Oct 30 17:01:26 2006 -0800
@@ -40,6 +40,8 @@
 
 /*#define NETBE_DEBUG_INTERRUPT*/
 
+#define GRANT_INVALID_HANDLE	(0xFFFFFFFF)
+
 struct netbk_rx_meta {
 	skb_frag_t frag;
 	int id;
@@ -845,14 +847,21 @@ inline static void net_tx_action_dealloc
 	gop = tx_unmap_ops;
 	while (dc != dp) {
 		pending_idx = dealloc_ring[MASK_PEND_IDX(dc++)];
+		if (grant_tx_handle[pending_idx] == GRANT_INVALID_HANDLE)
+			continue;
 		gnttab_set_unmap_op(gop, idx_to_kaddr(pending_idx),
 				    GNTMAP_host_map,
 				    grant_tx_handle[pending_idx]);
 		gop++;
-	}
-	ret = HYPERVISOR_grant_table_op(
-		GNTTABOP_unmap_grant_ref, tx_unmap_ops, gop - tx_unmap_ops);
-	BUG_ON(ret);
+
+	}
+
+	if (gop - tx_unmap_ops) {
+		ret = HYPERVISOR_grant_table_op(
+			GNTTABOP_unmap_grant_ref, 
+			tx_unmap_ops, gop - tx_unmap_ops);
+		BUG_ON(ret);
+	}
 
 	while (dealloc_cons != dp) {
 		pending_idx = dealloc_ring[MASK_PEND_IDX(dealloc_cons++)];
@@ -933,8 +942,14 @@ static gnttab_map_grant_ref_t *netbk_get
 	unsigned long pending_idx = *((u16 *)skb->data);
 	int i, start;
 
-	/* Skip first skb fragment if it is on same page as header fragment. */
-	start = ((unsigned long)shinfo->frags[0].page == pending_idx);
+	if (netif->tx_header_pages == 0) {
+		/* Skip first skb frag if it is on same page as header frag.*/
+		start = ((unsigned long)shinfo->frags[0].page == pending_idx);
+	} else {
+		/* if using TX shared header pages first request does not
+		   have a grant and it is never on a fragment */ 
+		start = 0;
+	}
 
 	for (i = start; i < shinfo->nr_frags; i++, txp++) {
 		pending_idx = pending_ring[MASK_PEND_IDX(pending_cons++)];
@@ -961,24 +976,33 @@ static int netbk_tx_check_mop(struct sk_
 	netif_tx_request_t *txp;
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	int nr_frags = shinfo->nr_frags;
-	int i, err, start;
-
-	/* Check status of header. */
-	err = mop->status;
-	if (unlikely(err)) {
-		txp = &pending_tx_info[pending_idx].req;
-		make_tx_response(netif, txp, NETIF_RSP_ERROR);
-		pending_ring[MASK_PEND_IDX(pending_prod++)] = pending_idx;
-		netif_put(netif);
+	int i, start;
+	int err = 0;
+
+	/* Check status of header, only if not using TX header shared page. */
+	if (!netif->tx_header_pages) {
+		err = mop->status;
+		if (unlikely(err)) {
+			txp = &pending_tx_info[pending_idx].req;
+			make_tx_response(netif, txp, NETIF_RSP_ERROR);
+			pending_ring[MASK_PEND_IDX(pending_prod++)] = 
+				pending_idx;
+			netif_put(netif);
+		} else {
+			set_phys_to_machine(
+				__pa(idx_to_kaddr(pending_idx)) >> PAGE_SHIFT,
+				FOREIGN_FRAME(mop->dev_bus_addr >> PAGE_SHIFT));
+			grant_tx_handle[pending_idx] = mop->handle;
+		}
+		mop++;
+		/* Skip first skb frag if it is on same page as header frag. */
+		start = ((unsigned long)shinfo->frags[0].page == pending_idx);
 	} else {
-		set_phys_to_machine(
-			__pa(idx_to_kaddr(pending_idx)) >> PAGE_SHIFT,
-			FOREIGN_FRAME(mop->dev_bus_addr >> PAGE_SHIFT));
-		grant_tx_handle[pending_idx] = mop->handle;
-	}
-
-	/* Skip first skb fragment if it is on same page as header fragment. */
-	start = ((unsigned long)shinfo->frags[0].page == pending_idx);
+		/* if using TX shared header pages first request does not 
+		   have a grant. */
+		grant_tx_handle[pending_idx] = GRANT_INVALID_HANDLE;
+		start = 0;
+	}
 
 	for (i = start; i < nr_frags; i++) {
 		int j, newerr;
@@ -986,7 +1010,7 @@ static int netbk_tx_check_mop(struct sk_
 		pending_idx = (unsigned long)shinfo->frags[i].page;
 
 		/* Check error status: if okay then remember grant handle. */
-		newerr = (++mop)->status;
+		newerr = mop->status;
 		if (likely(!newerr)) {
 			set_phys_to_machine(
 				__pa(idx_to_kaddr(pending_idx))>>PAGE_SHIFT,
@@ -995,8 +1019,10 @@ static int netbk_tx_check_mop(struct sk_
 			/* Had a previous error? Invalidate this fragment. */
 			if (unlikely(err))
 				netif_idx_release(pending_idx);
+			mop++;
 			continue;
 		}
+		mop++;
 
 		/* Error on this fragment: respond to client with an error. */
 		txp = &pending_tx_info[pending_idx].req;
@@ -1020,7 +1046,7 @@ static int netbk_tx_check_mop(struct sk_
 		err = newerr;
 	}
 
-	*mopp = mop + 1;
+	*mopp = mop;
 	return err;
 }
 
@@ -1195,18 +1221,36 @@ static void net_tx_action(unsigned long 
 			continue;
 		}
 
-		/* No crossing a page as the payload mustn't fragment. */
-		if (unlikely((txreq.offset + txreq.size) > PAGE_SIZE)) {
-			DPRINTK("txreq.offset: %x, size: %u, end: %lu\n", 
-				txreq.offset, txreq.size, 
-				(txreq.offset &~PAGE_MASK) + txreq.size);
-			netbk_tx_err(netif, &txreq, i);
-			continue;
-		}
+		if (netif->tx_header_pages) {
+			// For shared TX header pages "gref" has the offset
+			uint32_t header = txreq.gref;
+			// Header should not overflow shared header pages
+			if (unlikely((header + txreq.size) > 
+				     netif->tx_header_pages << PAGE_SHIFT)) {
+				DPRINTK("txreq header offset:%x, size:%u, end:"
+					"%lu, max:%lu (header request) \n", 
+					header, txreq.size, 
+					header + txreq.size,
+					netif->tx_header_page << PAGE_SHIFT);
+				netbk_tx_err(netif, &txreq, i);
+				continue;			
+			}
+		} else {
+			// No crossing page as the payload mustn't fragment.
+			if (unlikely((txreq.offset+txreq.size) > PAGE_SIZE)) {
+				DPRINTK("txreq offset:%x, size:%u, end:%lu\n",
+					txreq.offset, txreq.size, 
+					(txreq.offset &~PAGE_MASK)+txreq.size);
+				netbk_tx_err(netif, &txreq, i);
+				continue;
+			}
+		}
+
 
 		pending_idx = pending_ring[MASK_PEND_IDX(pending_cons)];
 
-		data_len = (txreq.size > PKT_PROT_LEN &&
+		data_len = (netif->tx_header_pages == 0 &&
+			    txreq.size > PKT_PROT_LEN &&
 			    ret < MAX_SKB_FRAGS) ?
 			PKT_PROT_LEN : txreq.size;
 
@@ -1217,7 +1261,7 @@ static void net_tx_action(unsigned long 
 			break;
 		}
 
-		/* Packets passed to netif_rx() must have some headroom. */
+	        /* Packets passed to netif_rx() must have some headroom. */
 		skb_reserve(skb, 16);
 
 		if (extras[XEN_NETIF_EXTRA_TYPE_GSO - 1].type) {
@@ -1231,10 +1275,14 @@ static void net_tx_action(unsigned long 
 			}
 		}
 
-		gnttab_set_map_op(mop, idx_to_kaddr(pending_idx),
-				  GNTMAP_host_map | GNTMAP_readonly,
-				  txreq.gref, netif->domid);
-		mop++;
+		/* First request has a valid grant only if not using shared TX
+                 * header pages. Otherwise first req. contains header offset */
+		if (!netif->tx_header_pages) {
+			gnttab_set_map_op(mop, idx_to_kaddr(pending_idx),
+					  GNTMAP_host_map | GNTMAP_readonly,
+					  txreq.gref, netif->domid);
+			mop++;
+		}
 
 		memcpy(&pending_tx_info[pending_idx].req,
 		       &txreq, sizeof(txreq));
@@ -1265,13 +1313,12 @@ static void net_tx_action(unsigned long 
 		if ((mop - tx_map_ops) >= ARRAY_SIZE(tx_map_ops))
 			break;
 	}
-
-	if (mop == tx_map_ops)
-		return;
-
-	ret = HYPERVISOR_grant_table_op(
-		GNTTABOP_map_grant_ref, tx_map_ops, mop - tx_map_ops);
-	BUG_ON(ret);
+	
+	if (mop != tx_map_ops) {
+		ret = HYPERVISOR_grant_table_op(
+			GNTTABOP_map_grant_ref, tx_map_ops, mop - tx_map_ops);
+		BUG_ON(ret);
+	}
 
 	mop = tx_map_ops;
 	while ((skb = __skb_dequeue(&tx_queue)) != NULL) {
@@ -1290,9 +1337,19 @@ static void net_tx_action(unsigned long 
 		}
 
 		data_len = skb->len;
-		memcpy(skb->data,
-		       (void *)(idx_to_kaddr(pending_idx)|txp->offset),
-		       data_len);
+
+		if ( netif->tx_header_pages ) {
+			// For shared header page "gref" has the offset
+			uint32_t header = txp->gref;
+			memcpy(skb->data,
+			       netif->tx_header_buf + header,
+			       data_len);
+		} else {
+			memcpy(skb->data,
+			       (void *)(idx_to_kaddr(pending_idx)|txp->offset),
+			       data_len);
+		}
+
 		if (data_len < txp->size) {
 			/* Append the packet payload as a fragment. */
 			txp->offset += data_len;
diff -r 6ed4368b4a9e -r baccbd3a1aa1 linux-2.6-xen-sparse/drivers/xen/netback/xenbus.c
--- a/linux-2.6-xen-sparse/drivers/xen/netback/xenbus.c	Sun Oct 15 09:53:20 2006 +0100
+++ b/linux-2.6-xen-sparse/drivers/xen/netback/xenbus.c	Mon Oct 30 17:01:26 2006 -0800
@@ -97,6 +97,13 @@ static int netback_probe(struct xenbus_d
 				    "feature-rx-copy", "%d", 1);
 		if (err) {
 			message = "writing feature-copying";
+			goto abort_transaction;
+		}
+
+		err = xenbus_printf(xbt, dev->nodename,
+				    "tx-header-max-pages", "%u", TX_HEADER_MAX_PAGES);
+		if (err) {
+			message = "writing tx-header-max-pages";
 			goto abort_transaction;
 		}
 
@@ -331,12 +338,32 @@ static void connect(struct backend_info 
 		netif_start_queue(be->netif->dev);
 }
 
+static inline int get_tx_header_refs(struct backend_info *be, int npages)
+{
+	struct xenbus_device *dev = be->dev;
+	char str[32];
+	unsigned long ref;
+	int i, err;
+	
+	for (i = 0; i < npages; i++) {
+		sprintf(str,"tx-header-ref-%d",i);
+		err = xenbus_scanf(XBT_NIL, dev->otherend, str, "%lu",
+				   &ref);
+		if (err < 0) {
+			xenbus_dev_fatal(dev, err, "reading %s/%s", 
+					 dev->otherend, str);
+			return err;
+		}
+		be->netif->tx_header_ref[i] = ref;
+	}
+	return 0;
+}
 
 static int connect_rings(struct backend_info *be)
 {
 	struct xenbus_device *dev = be->dev;
 	unsigned long tx_ring_ref, rx_ring_ref;
-	unsigned int evtchn, rx_copy;
+	unsigned int evtchn, rx_copy, tx_header_pages;
 	int err;
 	int val;
 
@@ -365,6 +392,27 @@ static int connect_rings(struct backend_
 		return err;
 	}
 	be->netif->copying_receiver = !!rx_copy;
+
+	err = xenbus_scanf(XBT_NIL, dev->otherend, "tx-header-pages", "%u",
+			   &tx_header_pages);
+	if (err == -ENOENT) {
+		err = 0;
+		tx_header_pages = 0;
+	}
+	if (err < 0) {
+		xenbus_dev_fatal(dev, err, "reading %s/tx-header-pages",
+				 dev->otherend);
+		return err;
+	}
+	if (tx_header_pages > TX_HEADER_MAX_PAGES) {
+		xenbus_dev_fatal(dev, err, "invalid %s/tx-header-pages "
+				 "(%u, max=%u)", dev->otherend, 
+				 tx_header_pages, TX_HEADER_MAX_PAGES);
+		return -EINVAL;
+	}
+	if ( (err = get_tx_header_refs(be, tx_header_pages)) < 0 )
+		return err;
+	be->netif->tx_header_pages = tx_header_pages;
 
 	if (be->netif->dev->tx_queue_len != 0) {
 		if (xenbus_scanf(XBT_NIL, dev->otherend,
diff -r 6ed4368b4a9e -r baccbd3a1aa1 linux-2.6-xen-sparse/drivers/xen/netfront/netfront.c
--- a/linux-2.6-xen-sparse/drivers/xen/netfront/netfront.c	Sun Oct 15 09:53:20 2006 +0100
+++ b/linux-2.6-xen-sparse/drivers/xen/netfront/netfront.c	Mon Oct 30 17:01:26 2006 -0800
@@ -85,6 +85,34 @@ static const int MODPARM_rx_flip = 0;
 
 #define RX_COPY_THRESHOLD 256
 
+#define GRANT_INVALID_REF	0
+
+#define NET_TX_RING_SIZE __RING_SIZE((struct netif_tx_sring *)0, PAGE_SIZE)
+#define NET_RX_RING_SIZE __RING_SIZE((struct netif_rx_sring *)0, PAGE_SIZE)
+
+/* Maximum size of headers in a TCP packet:
+ * ethernet(14) + vlan(4) + ip(20) +tcp(60) */
+#define MAX_HEADER_TCP (14+4+20+60)
+
+
+/* Maximum number of shared pages for TX packet headers */ 
+#define TX_HEADER_MAX_ORDER 3
+#define TX_HEADER_MAX_PAGES (1U << TX_HEADER_MAX_ORDER)
+
+/* Target ratio of TX ring slots to header slots used 
+   to determine header buffer size */
+#define TX_REQ_HEADER_RATIO 4
+
+/* Invalid header index for the shared TX header buffer */
+#define INVALID_TX_HEADER_IDX (0xffff)
+
+/* Macros for converting a TX header buffer index into an address
+   and vice-versa */
+#define TX_HEADER_ADDR(_np, _idx) \
+	((_np)->tx_header_buf + ((_idx) * MAX_HEADER_TCP))
+#define TX_HEADER_IDX(_np, _addr) \
+	(((_addr) - ((_np)->tx_header_buf)) / MAX_HEADER_TCP)
+
 /* If we don't have GSO, fake things up so that we never try to use it. */
 #if defined(NETIF_F_GSO)
 #define HAVE_GSO			1
@@ -123,11 +151,6 @@ static inline int netif_needs_gso(struct
 #define netif_needs_gso(dev, skb)	0
 #define dev_disable_gso_features(dev)	((void)0)
 #endif
-
-#define GRANT_INVALID_REF	0
-
-#define NET_TX_RING_SIZE __RING_SIZE((struct netif_tx_sring *)0, PAGE_SIZE)
-#define NET_RX_RING_SIZE __RING_SIZE((struct netif_rx_sring *)0, PAGE_SIZE)
 
 struct netfront_info {
 	struct list_head list;
@@ -174,6 +197,15 @@ struct netfront_info {
 	unsigned long rx_pfn_array[NET_RX_RING_SIZE];
 	struct multicall_entry rx_mcl[NET_RX_RING_SIZE+1];
 	struct mmu_update rx_mmu[NET_RX_RING_SIZE];
+
+	/* TX shared headers pages */
+	int tx_header_size;                     /* number of headers         */
+	int tx_header_pages;                    /* number of shared pages    */
+	char* tx_header_buf;                    /* Pointer to header buffer  */
+	int tx_header_ref[TX_HEADER_MAX_PAGES]; /* grants for shared pages   */
+	struct list_head free_tx_header;        /* free list of header slots */
+	u16 tx_header_idx[NET_TX_RING_SIZE + 1];/* header in use by pending  */
+                                                /* request                   */
 };
 
 struct netfront_rx_info {
@@ -339,7 +371,8 @@ static int talk_to_backend(struct xenbus
 {
 	const char *message;
 	struct xenbus_transaction xbt;
-	int err;
+	char str[32];
+	int i, err;
 
 	err = xen_net_read_mac(dev, info->mac);
 	if (err) {
@@ -404,6 +437,23 @@ again:
 		goto abort_transaction;
 	}
 #endif
+
+	err = xenbus_printf(xbt, dev->nodename, "tx-header-pages", "%u",
+			    info->tx_header_pages);
+	if (err) {
+		message = "writing tx-header-pages";
+		goto abort_transaction;
+	}
+
+	for (i = 0; i < info->tx_header_pages; i++) {
+		sprintf(str,"tx-header-ref-%d",i);
+		err = xenbus_printf(xbt, dev->nodename, str, "%u", 
+				    info->tx_header_ref[i]);
+		if (err) {
+			message = "writing tx-header-ref";
+			goto abort_transaction;
+		}
+	}
 
 	err = xenbus_transaction_end(xbt, 0);
 	if (err) {
@@ -424,13 +474,26 @@ again:
 	return err;
 }
 
+static inline char *get_pages(unsigned int pages)
+{
+	struct page *page;
+
+	page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 
+			   get_order(pages << PAGE_SHIFT));
+	if (!page)
+		return 0;
+		
+	return page_address(page);
+}
 
 static int setup_device(struct xenbus_device *dev, struct netfront_info *info)
 {
 	struct netif_tx_sring *txs;
 	struct netif_rx_sring *rxs;
+	char *txh;
 	int err;
 	struct net_device *netdev = info->netdev;
+	int i;
 
 	info->tx_ring_ref = GRANT_INVALID_REF;
 	info->rx_ring_ref = GRANT_INVALID_REF;
@@ -438,6 +501,10 @@ static int setup_device(struct xenbus_de
 	info->tx.sring = NULL;
 	info->irq = 0;
 
+	info->tx_header_buf = NULL;
+	for (i = 0; i < TX_HEADER_MAX_PAGES; i++)
+		info->tx_header_ref[i] = GRANT_INVALID_REF;
+
 	txs = (struct netif_tx_sring *)get_zeroed_page(GFP_KERNEL);
 	if (!txs) {
 		err = -ENOMEM;
@@ -469,6 +536,30 @@ static int setup_device(struct xenbus_de
 		goto fail;
 	}
 	info->rx_ring_ref = err;
+
+	if (info->tx_header_pages > 0) {
+		txh = get_pages(info->tx_header_pages);
+		if (!txh) {
+			err = -ENOMEM;
+			xenbus_dev_fatal(dev, err, 
+					 "allocating shared tx header pages");
+			goto fail;
+		}
+		info->tx_header_buf = txh;
+		INIT_LIST_HEAD(&info->free_tx_header);
+		for ( i=0; i < info->tx_header_size; i++ ) {
+			list_add_tail( (struct list_head *)
+				       (txh + i * MAX_HEADER_TCP),
+				       &(info->free_tx_header) );
+		}
+		for ( i=0; i < info->tx_header_pages; i++ ) {
+			err = xenbus_grant_ring(
+				dev, virt_to_mfn(txh + (i << PAGE_SHIFT)));
+			if (err < 0)
+				goto fail;
+			info->tx_header_ref[i] = err;
+		}
+	}
 
 	err = xenbus_alloc_evtchn(dev, &info->evtchn);
 	if (err)
@@ -575,7 +666,25 @@ static int network_open(struct net_devic
 
 static inline int netfront_tx_slot_available(struct netfront_info *np)
 {
-	return RING_FREE_REQUESTS(&np->tx) >= MAX_SKB_FRAGS + 2;
+	return RING_FREE_REQUESTS(&np->tx) >= MAX_SKB_FRAGS + 3;
+}
+
+static inline int netfront_header_slot_available(struct netfront_info *np)
+{
+	return !list_empty(&np->free_tx_header);
+}
+
+static inline char *get_header_slot(struct netfront_info *np)
+{
+	char *ret;
+	ret = (char *)(np->free_tx_header.next);
+	list_del(np->free_tx_header.next);
+	return ret;
+}
+
+static inline void release_header_slot(struct netfront_info *np, char* slot)
+{
+	list_add((struct list_head *) slot, &np->free_tx_header);
 }
 
 static inline void network_maybe_wake_tx(struct net_device *dev)
@@ -584,7 +693,9 @@ static inline void network_maybe_wake_tx
 
 	if (unlikely(netif_queue_stopped(dev)) &&
 	    netfront_tx_slot_available(np) &&
-	    likely(netif_running(dev)))
+	    likely(netif_running(dev)) &&
+	    (!np->tx_header_pages || 
+	     netfront_header_slot_available(np)))
 		netif_wake_queue(dev);
 }
 
@@ -610,18 +721,32 @@ static void network_tx_buf_gc(struct net
 
 			id  = txrsp->id;
 			skb = np->tx_skbs[id];
-			if (unlikely(gnttab_query_foreign_access(
-				np->grant_tx_ref[id]) != 0)) {
-				printk(KERN_ALERT "network_tx_buf_gc: warning "
-				       "-- grant still in use by backend "
-				       "domain.\n");
-				BUG();
+
+			if (np->grant_tx_ref[id] != GRANT_INVALID_REF) {
+				if (unlikely(gnttab_query_foreign_access(
+						     np->grant_tx_ref[id]) 
+					     != 0)) {
+					printk(KERN_ALERT "network_tx_buf_gc:"
+					       " warning -- grant still in use"
+					       " by backend domain.\n");
+					BUG();
+				}
+				gnttab_end_foreign_access_ref(
+					np->grant_tx_ref[id], GNTMAP_readonly);
+			
+				gnttab_release_grant_reference(
+					&np->gref_tx_head, 
+					np->grant_tx_ref[id]);
+				np->grant_tx_ref[id] = GRANT_INVALID_REF;
 			}
-			gnttab_end_foreign_access_ref(
-				np->grant_tx_ref[id], GNTMAP_readonly);
-			gnttab_release_grant_reference(
-				&np->gref_tx_head, np->grant_tx_ref[id]);
-			np->grant_tx_ref[id] = GRANT_INVALID_REF;
+
+			if(np->tx_header_idx[id] != INVALID_TX_HEADER_IDX) {
+				release_header_slot(
+					np, 
+					TX_HEADER_ADDR(np, 
+						       np->tx_header_idx[id]));
+			}
+			
 			add_id_to_freelist(np->tx_skbs, id);
 			dev_kfree_skb_irq(skb);
 		}
@@ -741,9 +866,8 @@ no_skb:
 
 		req = RING_GET_REQUEST(&np->rx, req_prod + i);
 		if (!np->copying_receiver) {
-			gnttab_grant_foreign_transfer_ref(ref,
-							  np->xbdev->otherend_id,
-							  pfn);
+			gnttab_grant_foreign_transfer_ref(
+				ref, np->xbdev->otherend_id, pfn);
 			np->rx_pfn_array[nr_flips] = pfn_to_mfn(pfn);
 			if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 				/* Remove this page before passing
@@ -814,22 +938,18 @@ static void xennet_make_frags(struct sk_
 			      struct netif_tx_request *tx)
 {
 	struct netfront_info *np = netdev_priv(dev);
-	char *data = skb->data;
+	char *data = skb->data + tx->size;
 	unsigned long mfn;
 	RING_IDX prod = np->tx.req_prod_pvt;
 	int frags = skb_shinfo(skb)->nr_frags;
 	unsigned int offset = offset_in_page(data);
-	unsigned int len = skb_headlen(skb);
+	unsigned int len = skb_headlen(skb) - tx->size;
 	unsigned int id;
 	grant_ref_t ref;
 	int i;
 
-	while (len > PAGE_SIZE - offset) {
-		tx->size = PAGE_SIZE - offset;
+	while (len > 0) {
 		tx->flags |= NETTXF_more_data;
-		len -= tx->size;
-		data += tx->size;
-		offset = 0;
 
 		id = get_id_from_freelist(np->tx_skbs);
 		np->tx_skbs[id] = skb_get(skb);
@@ -844,7 +964,12 @@ static void xennet_make_frags(struct sk_
 
 		tx->gref = np->grant_tx_ref[id] = ref;
 		tx->offset = offset;
-		tx->size = len;
+		tx->size = (len < (PAGE_SIZE - offset)) ? 
+			len : PAGE_SIZE - offset;
+		np->tx_header_idx[id] = INVALID_TX_HEADER_IDX;
+		len -= tx->size;
+		data += tx->size;
+		offset = 0;
 		tx->flags = 0;
 	}
 
@@ -864,6 +989,7 @@ static void xennet_make_frags(struct sk_
 		gnttab_grant_foreign_access_ref(ref, np->xbdev->otherend_id,
 						mfn, GNTMAP_readonly);
 
+		np->tx_header_idx[id] = INVALID_TX_HEADER_IDX;
 		tx->gref = np->grant_tx_ref[id] = ref;
 		tx->offset = frag->page_offset;
 		tx->size = frag->size;
@@ -913,14 +1039,31 @@ static int network_start_xmit(struct sk_
 	tx = RING_GET_REQUEST(&np->tx, i);
 
 	tx->id   = id;
-	ref = gnttab_claim_grant_reference(&np->gref_tx_head);
-	BUG_ON((signed short)ref < 0);
-	mfn = virt_to_mfn(data);
-	gnttab_grant_foreign_access_ref(
-		ref, np->xbdev->otherend_id, mfn, GNTMAP_readonly);
-	tx->gref = np->grant_tx_ref[id] = ref;
-	tx->offset = offset;
-	tx->size = len;
+	
+	if (np->tx_header_pages) {
+		char *tx_header_buf = get_header_slot(np);
+		np->tx_header_idx[id] = TX_HEADER_IDX(np, tx_header_buf);
+		np->grant_tx_ref[id] = GRANT_INVALID_REF;
+		/* If using shared header page gref has header offset */
+		tx->gref = tx_header_buf - np->tx_header_buf;
+		tx->offset = 0;
+		tx->size = (len > MAX_HEADER_TCP) ?
+			MAX_HEADER_TCP : len;
+		memcpy(tx_header_buf, skb->data, tx->size);
+	} else {
+		np->tx_header_idx[id] = INVALID_TX_HEADER_IDX;
+		ref = gnttab_claim_grant_reference(&np->gref_tx_head);
+		BUG_ON((signed short)ref < 0);
+		mfn = virt_to_mfn(data);
+		gnttab_grant_foreign_access_ref(
+			ref, np->xbdev->otherend_id, mfn, GNTMAP_readonly);
+		tx->gref = np->grant_tx_ref[id] = ref;
+		tx->offset = offset;
+		if ((offset + len) > PAGE_SIZE)
+			tx->size = PAGE_SIZE - offset;
+		else
+			tx->size = len;
+	}
 
 	tx->flags = 0;
 	extra = NULL;
@@ -964,7 +1107,8 @@ static int network_start_xmit(struct sk_
 
 	network_tx_buf_gc(dev);
 
-	if (!netfront_tx_slot_available(np))
+	if (!netfront_tx_slot_available(np) ||
+	    (np->tx_header_pages && !netfront_header_slot_available(np)))
 		netif_stop_queue(dev);
 
 	spin_unlock_irq(&np->tx_lock);
@@ -1126,11 +1270,9 @@ static int xennet_get_responses(struct n
 				mcl = np->rx_mcl + pages_flipped;
 				mmu = np->rx_mmu + pages_flipped;
 
-				MULTI_update_va_mapping(mcl,
-							(unsigned long)vaddr,
-							pfn_pte_ma(mfn,
-								   PAGE_KERNEL),
-							0);
+				MULTI_update_va_mapping(
+					mcl, (unsigned long)vaddr,
+					pfn_pte_ma(mfn, PAGE_KERNEL), 0);
 				mmu->ptr = ((maddr_t)mfn << PAGE_SHIFT)
 					| MMU_MACHPHYS_UPDATE;
 				mmu->val = pfn;
@@ -1447,11 +1589,15 @@ static void netif_release_tx_bufs(struct
 			continue;
 
 		skb = np->tx_skbs[i];
-		gnttab_end_foreign_access_ref(
-			np->grant_tx_ref[i], GNTMAP_readonly);
-		gnttab_release_grant_reference(
-			&np->gref_tx_head, np->grant_tx_ref[i]);
-		np->grant_tx_ref[i] = GRANT_INVALID_REF;
+		if (np->grant_tx_ref[i] != GRANT_INVALID_REF) {
+			gnttab_end_foreign_access_ref(
+				np->grant_tx_ref[i], 
+				GNTMAP_readonly);
+			gnttab_release_grant_reference(
+				&np->gref_tx_head, 
+				np->grant_tx_ref[i]);
+			np->grant_tx_ref[i] = GRANT_INVALID_REF;
+		}
 		add_id_to_freelist(np->tx_skbs, i);
 		dev_kfree_skb_irq(skb);
 	}
@@ -1625,6 +1771,7 @@ static int network_connect(struct net_de
 	grant_ref_t ref;
 	netif_rx_request_t *req;
 	unsigned int feature_rx_copy, feature_rx_flip;
+	unsigned int tx_header_size, tx_header_pages, tx_header_max_pages;
 
 	err = xenbus_scanf(XBT_NIL, np->xbdev->otherend,
 			   "feature-rx-copy", "%u", &feature_rx_copy);
@@ -1643,11 +1790,35 @@ static int network_connect(struct net_de
 	np->copying_receiver = ((MODPARM_rx_copy && feature_rx_copy) ||
 				(MODPARM_rx_flip && !feature_rx_flip));
 
+	err = xenbus_scanf(XBT_NIL, np->xbdev->otherend,
+			   "tx-header-max-pages", "%u", &tx_header_max_pages);
+	if (err != 1)
+		tx_header_max_pages = 0;
+
+	/* We need less header entries than TX ring entries, since a large
+         * packet will consume multiple TX ring entries but just one header */
+	tx_header_size = NET_TX_RING_SIZE / TX_REQ_HEADER_RATIO;
+	tx_header_pages = 1 << get_order(tx_header_size * MAX_HEADER_TCP); 
+
+	/* Do not exceed maximum pages for TX shared header buffer */
+	if (tx_header_pages > TX_HEADER_MAX_PAGES)
+		tx_header_pages = TX_HEADER_MAX_PAGES;
+
+	/* Do not exceed netback maximum pages for TX shared header buffer */
+	if (tx_header_pages > tx_header_max_pages)
+		tx_header_pages = tx_header_max_pages;
+
+	np->tx_header_size = (tx_header_pages << PAGE_SHIFT) / MAX_HEADER_TCP;
+	np->tx_header_pages = tx_header_pages;
+
 	err = talk_to_backend(np->xbdev, np);
 	if (err)
 		return err;
 
 	xennet_set_features(dev);
+
+	IPRINTK("device %s uses %d shared pages for TX header copying.\n",
+		dev->name, np->tx_header_pages);
 
 	IPRINTK("device %s has %sing receive path.\n",
 		dev->name, np->copying_receiver ? "copy" : "flipp");
@@ -1915,6 +2086,10 @@ static struct net_device * __devinit cre
 		np->grant_rx_ref[i] = GRANT_INVALID_REF;
 	}
 
+	for (i = 0; i <= NET_TX_RING_SIZE; i++) {
+		np->tx_header_idx[i] = INVALID_TX_HEADER_IDX;
+	}
+		
 	/* A grant for every tx ring slot */
 	if (gnttab_alloc_grant_references(TX_MAX_TARGET,
 					  &np->gref_tx_head) < 0) {
@@ -2039,6 +2214,8 @@ static void close_netdev(struct netfront
 
 static void netif_disconnect_backend(struct netfront_info *info)
 {
+	int i;
+
 	/* Stop old i/f to prevent errors whilst we rebuild the state. */
 	spin_lock_irq(&info->tx_lock);
 	spin_lock(&info->rx_lock);
@@ -2056,6 +2233,18 @@ static void netif_disconnect_backend(str
 	info->rx_ring_ref = GRANT_INVALID_REF;
 	info->tx.sring = NULL;
 	info->rx.sring = NULL;
+
+	for (i = 0; i < TX_HEADER_MAX_PAGES; i++) {
+	  end_access(info->tx_header_ref[i], 
+		     info->tx_header_buf + (i << PAGE_SHIFT));
+		info->tx_header_ref[i] = GRANT_INVALID_REF;
+	}
+
+	if (info->tx_header_buf) {
+	  free_pages((unsigned long)info->tx_header_buf, 
+		     get_order(info->tx_header_pages << PAGE_SHIFT));
+		info->tx_header_buf = NULL;
+	}
 }
 
 

[-- Attachment #3: tx_lazy_map-11760.patch --]
[-- Type: application/octet-stream, Size: 2760 bytes --]

# HG changeset patch
# User root@lsnpsvr3.hpl.hp.com
# Node ID 0ea439663b7ed1430df0764214f484ee9c67f02b
# Parent  baccbd3a1aa15b8f900d0bf618db88b1381bd7a8
Do not map guest pages up front. Use device_map option for access grant.

diff -r baccbd3a1aa1 -r 0ea439663b7e linux-2.6-xen-sparse/drivers/xen/Kconfig
--- a/linux-2.6-xen-sparse/drivers/xen/Kconfig	Mon Oct 30 17:01:26 2006 -0800
+++ b/linux-2.6-xen-sparse/drivers/xen/Kconfig	Wed Nov  1 09:01:16 2006 -0800
@@ -88,6 +88,15 @@ config XEN_NETDEV_PIPELINED_TRANSMITTER
 	  are unsure; or if you experience network hangs when this option is
 	  enabled; then you must say N here.
 
+config XEN_NETDEV_TX_LAZY_MAPPING
+	bool "Only map network TX guest page if driver domain access it (EXPERIMENTAL)"
+	depends on XEN_NETDEV_BACKEND
+	default n
+	help
+	  Improves network performance for packets sent from guests to the 
+	  network, or to other guests. It may slightly degrade performance
+	  for packets sent from guests to dom0 (or driver domain).
+
 config XEN_NETDEV_LOOPBACK
 	tristate "Network-device loopback driver"
 	depends on XEN_NETDEV_BACKEND
diff -r baccbd3a1aa1 -r 0ea439663b7e linux-2.6-xen-sparse/drivers/xen/netback/netback.c
--- a/linux-2.6-xen-sparse/drivers/xen/netback/netback.c	Mon Oct 30 17:01:26 2006 -0800
+++ b/linux-2.6-xen-sparse/drivers/xen/netback/netback.c	Wed Nov  1 09:01:16 2006 -0800
@@ -42,6 +42,12 @@
 
 #define GRANT_INVALID_HANDLE	(0xFFFFFFFF)
 
+#ifdef CONFIG_XEN_NETDEV_TX_LAZY_MAPPING
+#define GRANT_MAP_TYPE	GNTMAP_device_map
+#else
+#define GRANT_MAP_TYPE  GNTMAP_host_map
+#endif
+
 struct netbk_rx_meta {
 	skb_frag_t frag;
 	int id;
@@ -849,9 +855,10 @@ inline static void net_tx_action_dealloc
 		pending_idx = dealloc_ring[MASK_PEND_IDX(dc++)];
 		if (grant_tx_handle[pending_idx] == GRANT_INVALID_HANDLE)
 			continue;
-		gnttab_set_unmap_op(gop, idx_to_kaddr(pending_idx),
-				    GNTMAP_host_map,
+		gnttab_set_unmap_op(gop, idx_to_kaddr(pending_idx), 0,
 				    grant_tx_handle[pending_idx]);
+		gop->dev_bus_addr = virt_to_mfn(idx_to_kaddr(pending_idx))
+			<< PAGE_SHIFT;
 		gop++;
 
 	}
@@ -955,7 +962,7 @@ static gnttab_map_grant_ref_t *netbk_get
 		pending_idx = pending_ring[MASK_PEND_IDX(pending_cons++)];
 
 		gnttab_set_map_op(mop++, idx_to_kaddr(pending_idx),
-				  GNTMAP_host_map | GNTMAP_readonly,
+				  GRANT_MAP_TYPE | GNTMAP_readonly,
 				  txp->gref, netif->domid);
 
 		memcpy(&pending_tx_info[pending_idx].req, txp, sizeof(*txp));
@@ -1558,6 +1565,12 @@ static int __init netback_init(void)
 		&netif_be_dbg);
 #endif
 
+#ifdef CONFIG_XEN_NETDEV_TX_LAZY_MAPPING
+	printk("netback: CONFIG_XEN_NETDEV_TX_LAZY_MAPPING is SET\n");
+#else
+	printk("netback: CONFIG_XEN_NETDEV_TX_LAZY_MAPPING is NOT set\n");
+#endif
+
 	return 0;
 }
 

[-- Attachment #4: tx_handle_page_fault.patch --]
[-- Type: application/octet-stream, Size: 5350 bytes --]

diff -r 85b2b5165df5 -r 9e2ffca8e781 linux-2.6-xen-sparse/arch/i386/mm/fault-xen.c
--- a/linux-2.6-xen-sparse/arch/i386/mm/fault-xen.c	Wed Nov 01 09:12:51 2006 -0800
+++ b/linux-2.6-xen-sparse/arch/i386/mm/fault-xen.c	Wed Nov 01 17:10:00 2006 -0800
@@ -29,6 +29,11 @@
 #include <asm/kdebug.h>
 
 extern void die(const char *,struct pt_regs *,long);
+
+#ifdef CONFIG_XEN_NETDEV_TX_LAZY_MAPPING
+extern int netback_map_tx_page(unsigned long address, 
+			       unsigned long error_code);
+#endif
 
 /*
  * Unlock any spinlocks which will prevent us from getting the
@@ -549,6 +554,14 @@ no_context:
  	if (is_prefetch(regs, address, error_code))
  		return;
 
+#ifdef CONFIG_XEN_NETDEV_TX_LAZY_MAPPING
+	/* Check and map if this is a Xen guest page used for network TX */
+	/* In lazy mode netback does not premap the page but instead maps */
+	/* it on demand if a fault occurs */
+	if (netback_map_tx_page(address, error_code))
+		return;
+#endif	
+
 /*
  * Oops. The kernel tried to access some bad page. We'll have to
  * terminate things with extreme prejudice.
diff -r 85b2b5165df5 -r 9e2ffca8e781 linux-2.6-xen-sparse/drivers/xen/netback/netback.c
--- a/linux-2.6-xen-sparse/drivers/xen/netback/netback.c	Wed Nov 01 09:12:51 2006 -0800
+++ b/linux-2.6-xen-sparse/drivers/xen/netback/netback.c	Wed Nov 01 17:10:00 2006 -0800
@@ -40,8 +40,6 @@
 
 /*#define NETBE_DEBUG_INTERRUPT*/
 
-#define GRANT_INVALID_HANDLE	(0xFFFFFFFF)
-
 #ifdef CONFIG_XEN_NETDEV_TX_LAZY_MAPPING
 #define GRANT_MAP_TYPE	GNTMAP_device_map
 #else
@@ -102,9 +100,15 @@ static PEND_RING_IDX dealloc_prod, deall
 
 static struct sk_buff_head tx_queue;
 
+#define _NETIF_GRANT_VALID 0
+#define NETIF_GRANT_VALID (1U << _NETIF_GRANT_VALID)
+#define _NETIF_GRANT_MAPPED 1
+#define NETIF_GRANT_MAPPED (1U << _NETIF_GRANT_MAPPED)
+
 static grant_handle_t grant_tx_handle[MAX_PENDING_REQS];
 static gnttab_unmap_grant_ref_t tx_unmap_ops[MAX_PENDING_REQS];
 static gnttab_map_grant_ref_t tx_map_ops[MAX_PENDING_REQS];
+static u8 grant_tx_status[MAX_PENDING_REQS];
 
 static struct list_head net_schedule_list;
 static spinlock_t net_schedule_list_lock;
@@ -853,7 +857,7 @@ inline static void net_tx_action_dealloc
 	gop = tx_unmap_ops;
 	while (dc != dp) {
 		pending_idx = dealloc_ring[MASK_PEND_IDX(dc++)];
-		if (grant_tx_handle[pending_idx] == GRANT_INVALID_HANDLE)
+		if (!(grant_tx_status[pending_idx] & NETIF_GRANT_VALID))
 			continue;
 		gnttab_set_unmap_op(gop, idx_to_kaddr(pending_idx), 0,
 				    grant_tx_handle[pending_idx]);
@@ -1000,6 +1004,7 @@ static int netbk_tx_check_mop(struct sk_
 				__pa(idx_to_kaddr(pending_idx)) >> PAGE_SHIFT,
 				FOREIGN_FRAME(mop->dev_bus_addr >> PAGE_SHIFT));
 			grant_tx_handle[pending_idx] = mop->handle;
+			grant_tx_status[pending_idx] = NETIF_GRANT_VALID;
 		}
 		mop++;
 		/* Skip first skb frag if it is on same page as header frag. */
@@ -1007,10 +1012,14 @@ static int netbk_tx_check_mop(struct sk_
 	} else {
 		/* if using TX shared header pages first request does not 
 		   have a grant. */
-		grant_tx_handle[pending_idx] = GRANT_INVALID_HANDLE;
+		grant_tx_status[pending_idx] = 0;
 		start = 0;
 	}
 
+#ifndef CONFIG_XEN_NETDEV_TX_LAZY_MAPPING
+	grant_tx_status[pending_idx] |= NETIF_GRANT_MAPPED;
+#endif
+	
 	for (i = start; i < nr_frags; i++) {
 		int j, newerr;
 
@@ -1023,6 +1032,13 @@ static int netbk_tx_check_mop(struct sk_
 				__pa(idx_to_kaddr(pending_idx))>>PAGE_SHIFT,
 				FOREIGN_FRAME(mop->dev_bus_addr>>PAGE_SHIFT));
 			grant_tx_handle[pending_idx] = mop->handle;
+#ifdef CONFIG_XEN_NETDEV_TX_LAZY_MAPPING
+			grant_tx_status[pending_idx] =  
+				NETIF_GRANT_VALID;
+#else
+			grant_tx_status[pending_idx] = 
+				NETIF_GRANT_VALID | NETIF_GRANT_MAPPED;
+#endif
 			/* Had a previous error? Invalidate this fragment. */
 			if (unlikely(err))
 				netif_idx_release(pending_idx);
@@ -1415,6 +1431,56 @@ static void netif_page_release(struct pa
 	netif_idx_release(page->index);
 }
 
+#ifdef CONFIG_XEN_NETDEV_TX_LAZY_MAPPING
+static void map_tx_guest_page(u16 pending_idx)
+{
+	gnttab_unmap_grant_ref_t gunmap;
+	gnttab_map_grant_ref_t gmap;
+	int ret;
+
+	/* First unmap original grant handle */
+	gnttab_set_unmap_op(&gunmap, idx_to_kaddr(pending_idx), 0,
+			    grant_tx_handle[pending_idx]);
+	gunmap.dev_bus_addr = 
+		virt_to_mfn(idx_to_kaddr(pending_idx)) << PAGE_SHIFT;
+	
+	ret = HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, &gunmap, 1);
+	BUG_ON(ret);
+
+	gnttab_set_map_op(&gmap, idx_to_kaddr(pending_idx),
+			  GNTMAP_host_map | GNTMAP_readonly,
+			  pending_tx_info[pending_idx].req.gref,
+			  pending_tx_info[pending_idx].netif->domid);
+
+	ret = HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &gmap, 1);
+	BUG_ON(ret);
+	
+	grant_tx_handle[pending_idx] = gmap.handle;
+}
+
+/* called by page_fault handler */
+int netback_map_tx_page(unsigned long address, unsigned long error_code)
+{
+	unsigned long page_address = address & PAGE_MASK;
+	int i;
+
+	if (error_code != 0)
+		return 0;
+
+	for (i = 0; i < MAX_PENDING_REQS; i++) {
+		if (page_address == idx_to_kaddr(i)) {
+			if (!(grant_tx_status[i] & NETIF_GRANT_VALID) ||
+			    (grant_tx_status[i] & NETIF_GRANT_MAPPED))
+				return 0;
+			map_tx_guest_page(i);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+#endif
+
 irqreturn_t netif_be_int(int irq, void *dev_id, struct pt_regs *regs)
 {
 	netif_t *netif = dev_id;

[-- Attachment #5: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-26 21:08                         ` Herbert Xu
@ 2007-03-27  0:33                           ` Keir Fraser
  2007-03-27  0:35                             ` Herbert Xu
  2007-03-27  0:45                             ` Keir Fraser
  0 siblings, 2 replies; 63+ messages in thread
From: Keir Fraser @ 2007-03-27  0:33 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Xen Development Mailing List




On 26/3/07 22:08, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

>> Can't we wrap the 'swap around' critical section in an irq-safe spinlock?
>> All we'd need to do from the page-fault handler then is a barrier on that
>> spinlock (i.e, wait for it to be released). Netback can simply copy the page
>> to new memory frame, unmap the grant, then relocate the new memory frame's
>> pseudophysical address.
> 
> That works fine for the x86 case.  But when it's auto-translated,
> you won't even get a page fault in the guest because the guest PTE
> is unchanged and completely valid.

How about you invalidate the PTE for the duration of the critical section.
It's a bit skanky, but would work around this issue quite nicely!

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  0:33                           ` Keir Fraser
@ 2007-03-27  0:35                             ` Herbert Xu
  2007-03-27  0:46                               ` Keir Fraser
  2007-03-27  0:45                             ` Keir Fraser
  1 sibling, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-27  0:35 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Xen Development Mailing List

On Tue, Mar 27, 2007 at 01:33:58AM +0100, Keir Fraser wrote:
> 
> >> Can't we wrap the 'swap around' critical section in an irq-safe spinlock?
> >> All we'd need to do from the page-fault handler then is a barrier on that
> >> spinlock (i.e, wait for it to be released). Netback can simply copy the page
> >> to new memory frame, unmap the grant, then relocate the new memory frame's
> >> pseudophysical address.
> > 
> > That works fine for the x86 case.  But when it's auto-translated,
> > you won't even get a page fault in the guest because the guest PTE
> > is unchanged and completely valid.
> 
> How about you invalidate the PTE for the duration of the critical section.
> It's a bit skanky, but would work around this issue quite nicely!

Maybe I am missing something.  Because I thought we've agreed that
we can't (or don't want to) do the PTE trick because of the large
pages (16MB at least) used for the identity mapping.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  0:33                           ` Keir Fraser
  2007-03-27  0:35                             ` Herbert Xu
@ 2007-03-27  0:45                             ` Keir Fraser
  2007-03-27  0:52                               ` Herbert Xu
  1 sibling, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-27  0:45 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Xen Development Mailing List

On 27/3/07 01:33, "Keir Fraser" <Keir.Fraser@cl.cam.ac.uk> wrote:

>> That works fine for the x86 case.  But when it's auto-translated,
>> you won't even get a page fault in the guest because the guest PTE
>> is unchanged and completely valid.
> 
> How about you invalidate the PTE for the duration of the critical section.
> It's a bit skanky, but would work around this issue quite nicely!

As an arguably better alternative, perhaps we could simply make
XENMEM_add_to_physmap an atomic switch operation (from old mapping to new
mapping)? That is probably not a very hard thing to guarantee.

So you would XENMEM_add_to_physmap over the top of the existing grant
mapping, and expect the grant to simply 'fall away' as a result of this
operation (i.e., implicit removal from physmap triggers release of the
grant).

This sounds neat to me.

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  0:35                             ` Herbert Xu
@ 2007-03-27  0:46                               ` Keir Fraser
  0 siblings, 0 replies; 63+ messages in thread
From: Keir Fraser @ 2007-03-27  0:46 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Xen Development Mailing List




On 27/3/07 01:35, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

>> How about you invalidate the PTE for the duration of the critical section.
>> It's a bit skanky, but would work around this issue quite nicely!
> 
> Maybe I am missing something.  Because I thought we've agreed that
> we can't (or don't want to) do the PTE trick because of the large
> pages (16MB at least) used for the identity mapping.

Oh yes, good point. Well, see my follow-up email for another suggestion. :-)

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  0:45                             ` Keir Fraser
@ 2007-03-27  0:52                               ` Herbert Xu
  2007-03-27  1:02                                 ` Keir Fraser
  0 siblings, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-27  0:52 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Xen Development Mailing List

On Tue, Mar 27, 2007 at 01:45:54AM +0100, Keir Fraser wrote:
> 
> As an arguably better alternative, perhaps we could simply make
> XENMEM_add_to_physmap an atomic switch operation (from old mapping to new
> mapping)? That is probably not a very hard thing to guarantee.
> 
> So you would XENMEM_add_to_physmap over the top of the existing grant
> mapping, and expect the grant to simply 'fall away' as a result of this
> operation (i.e., implicit removal from physmap triggers release of the
> grant).
> 
> This sounds neat to me.

It sounds good but looking further there is still an issue.

The problem is that the gpfn itself is not sufficient if we want to
release the grant table entry.  We need to have the grant handle too
in order to free the entry completely.  If we did provide the handle
then this will just be the same operation that I was trying to add
except that now we're adding it under XENMEM_add_to_physmap :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  0:52                               ` Herbert Xu
@ 2007-03-27  1:02                                 ` Keir Fraser
  2007-03-27  1:08                                   ` Herbert Xu
  2007-03-27  3:34                                   ` Isaku Yamahata
  0 siblings, 2 replies; 63+ messages in thread
From: Keir Fraser @ 2007-03-27  1:02 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Xen Development Mailing List

On 27/3/07 01:52, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

> It sounds good but looking further there is still an issue.
> 
> The problem is that the gpfn itself is not sufficient if we want to
> release the grant table entry.  We need to have the grant handle too
> in order to free the entry completely.  If we did provide the handle
> then this will just be the same operation that I was trying to add
> except that now we're adding it under XENMEM_add_to_physmap :)

Well, it could be a bit more work to change, but what does a p2m entry look
like for a grant mapping? The p2m table should be a very convenient place to
have typed entries, with one of those types being grant-mapping, and
including the maptrack handle. The MFN can be derived from the maptrack
handle where that is required. This would then conveniently be enough
information to perform implicit unmaps (the lack of support of which is an
annoying limitation for non-auto-translate guests which it would be really
nice to avoid for auto-translate guests, particularly since it ought to be
not really all that hard afaics).

This is pretty much the avenue (hopefully not a dead end!) that I pushed
Isaku Yamahata down when he queried physmap semantics a week or so ago.
Sadly we've had no need for grant mappings in x86 auto-translate guests yet,
so there's no example x86 implementation to lead the way. :-)

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  1:02                                 ` Keir Fraser
@ 2007-03-27  1:08                                   ` Herbert Xu
  2007-03-27  1:17                                     ` Herbert Xu
  2007-03-27  3:34                                   ` Isaku Yamahata
  1 sibling, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-27  1:08 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Xen Development Mailing List

On Tue, Mar 27, 2007 at 02:02:39AM +0100, Keir Fraser wrote:
> 
> Well, it could be a bit more work to change, but what does a p2m entry look
> like for a grant mapping? The p2m table should be a very convenient place to
> have typed entries, with one of those types being grant-mapping, and
> including the maptrack handle. The MFN can be derived from the maptrack
> handle where that is required. This would then conveniently be enough
> information to perform implicit unmaps (the lack of support of which is an
> annoying limitation for non-auto-translate guests which it would be really
> nice to avoid for auto-translate guests, particularly since it ought to be
> not really all that hard afaics).

OK I'll look into this for the auto-translated case.

BTW, what do you want me to do for the non-translated case? Should
I go back to the original grant table operation or use the guest
page-fault method?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  1:08                                   ` Herbert Xu
@ 2007-03-27  1:17                                     ` Herbert Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Herbert Xu @ 2007-03-27  1:17 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Xen Development Mailing List

On Tue, Mar 27, 2007 at 11:08:10AM +1000, Herbert Xu wrote:
> On Tue, Mar 27, 2007 at 02:02:39AM +0100, Keir Fraser wrote:
> > 
> > Well, it could be a bit more work to change, but what does a p2m entry look
> > like for a grant mapping? The p2m table should be a very convenient place to
> > have typed entries, with one of those types being grant-mapping, and
> > including the maptrack handle. The MFN can be derived from the maptrack
> > handle where that is required. This would then conveniently be enough
> > information to perform implicit unmaps (the lack of support of which is an
> > annoying limitation for non-auto-translate guests which it would be really
> > nice to avoid for auto-translate guests, particularly since it ought to be
> > not really all that hard afaics).
> 
> OK I'll look into this for the auto-translated case.

Hmm where are we going to get space to store the grant handle?
The p2m table itself is just a page table.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  1:02                                 ` Keir Fraser
  2007-03-27  1:08                                   ` Herbert Xu
@ 2007-03-27  3:34                                   ` Isaku Yamahata
  2007-03-27  7:36                                     ` Keir Fraser
  1 sibling, 1 reply; 63+ messages in thread
From: Isaku Yamahata @ 2007-03-27  3:34 UTC (permalink / raw)
  To: Keir Fraser, Herbert Xu; +Cc: Xen Development Mailing List

On Tue, Mar 27, 2007 at 02:02:39AM +0100, Keir Fraser wrote:
> On 27/3/07 01:52, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:
> 
> > It sounds good but looking further there is still an issue.
> > 
> > The problem is that the gpfn itself is not sufficient if we want to
> > release the grant table entry.  We need to have the grant handle too
> > in order to free the entry completely.  If we did provide the handle
> > then this will just be the same operation that I was trying to add
> > except that now we're adding it under XENMEM_add_to_physmap :)
> 
> Well, it could be a bit more work to change, but what does a p2m entry look
> like for a grant mapping? The p2m table should be a very convenient place to
> have typed entries, with one of those types being grant-mapping, and
> including the maptrack handle. The MFN can be derived from the maptrack
> handle where that is required. This would then conveniently be enough
> information to perform implicit unmaps (the lack of support of which is an
> annoying limitation for non-auto-translate guests which it would be really
> nice to avoid for auto-translate guests, particularly since it ought to be
> not really all that hard afaics).

The ia64 p2m entry is 64bit and some bits are already used
so that there is no room for storing grant table related handle.
I don't think storing the maptrack handle instead of MFN is good idea
because it would make the p2m more complicated.

Is it possible to look up grant table reference by MFN effectively?
Probably MFN-keyed hash table whose entry is struct active_grant_table.
I'm not sure whether it is effective compared to guest providing
grant refernece, though.


> This is pretty much the avenue (hopefully not a dead end!) that I pushed
> Isaku Yamahata down when he queried physmap semantics a week or so ago.
> Sadly we've had no need for grant mappings in x86 auto-translate guests yet,
> so there's no example x86 implementation to lead the way. :-)

The ia64 p2m semantic is now fixed in xen-ia64-unstable.hg so that
page_is_removable() arch hook can be removed.
I'll send the patch to remove the hook after the next pull.

-- 
yamahata

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-26  2:19                     ` Herbert Xu
  2007-03-26 18:36                       ` Keir Fraser
@ 2007-03-27  3:41                       ` Isaku Yamahata
  2007-03-27  5:45                         ` Herbert Xu
  1 sibling, 1 reply; 63+ messages in thread
From: Isaku Yamahata @ 2007-03-27  3:41 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Xen Development Mailing List

On Mon, Mar 26, 2007 at 12:19:47PM +1000, Herbert Xu wrote:
> This is actually one of the reasons I picked the grant tables interface
> originally in that we could unmap it at the same time rather than doing
> a full swap followed by an unmap.
> 
> So are you OK with adding underlying operations that allows a full swap
> of two p2m entries? This would then be used as follows in translated mode:
> 
> 	a) new_addr = alloc_page
> 	b) memcpy(new_addr, addr, len)
> 	c) p2m_swap(__pa(new_addr), __pa(addr))
> 	d) grant_unmap(__pa(new_addr))

Swaping the p2m entry on ia64 is easy.
However the following tlb flush cost for the newly allocated page
is extremly high on ia64.
Can alloc_page() at step a) be avoided?
For example batched page allocation and reusing allocated pfns by setting
PG_foreign. And I want arch hook before use/release those pages to
reduce tlb flush cost.
-- 
yamahata

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  3:41                       ` Isaku Yamahata
@ 2007-03-27  5:45                         ` Herbert Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Herbert Xu @ 2007-03-27  5:45 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: Xen Development Mailing List

On Tue, Mar 27, 2007 at 12:41:25PM +0900, Isaku Yamahata wrote:
> 
> Swaping the p2m entry on ia64 is easy.

Great!

> However the following tlb flush cost for the newly allocated page
> is extremly high on ia64.

It doesn't matter here because this is the slow path which we should
never enter unless something is misbehaving or deliberately slow.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  3:34                                   ` Isaku Yamahata
@ 2007-03-27  7:36                                     ` Keir Fraser
  2007-03-27  7:44                                       ` Herbert Xu
                                                         ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Keir Fraser @ 2007-03-27  7:36 UTC (permalink / raw)
  To: Isaku Yamahata, Herbert Xu; +Cc: Xen Development Mailing List

On 27/3/07 04:34, "Isaku Yamahata" <yamahata@valinux.co.jp> wrote:

> The ia64 p2m entry is 64bit and some bits are already used
> so that there is no room for storing grant table related handle.
> I don't think storing the maptrack handle instead of MFN is good idea
> because it would make the p2m more complicated.
> 
> Is it possible to look up grant table reference by MFN effectively?
> Probably MFN-keyed hash table whose entry is struct active_grant_table.
> I'm not sure whether it is effective compared to guest providing
> grant refernece, though.

This is quite a shame, since the swap-grant-mapping-for-memory hypercall
simply is not very nice. What if we wanted to swap a grant mapping for
another grant mapping in future? Or some other kind of mapping? We end up
needing a new, complicated, grant-table hypercall every time. Or in six
months time we decide this wasn't such a good idea, implement another
scheme, but have this special-purpose grant-table hypercall hanging around
unused forever more.

Could we shatter the ia64 guest large mappings easily? They have no benefit
on Xen anyway apart from some inconsequential memory saving.

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  7:36                                     ` Keir Fraser
@ 2007-03-27  7:44                                       ` Herbert Xu
  2007-03-27  8:12                                         ` Keir Fraser
  2007-03-27  8:15                                         ` Isaku Yamahata
  2007-03-27  7:51                                       ` Keir Fraser
  2007-03-27 10:15                                       ` Herbert Xu
  2 siblings, 2 replies; 63+ messages in thread
From: Herbert Xu @ 2007-03-27  7:44 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List

On Tue, Mar 27, 2007 at 08:36:39AM +0100, Keir Fraser wrote:
> 
> This is quite a shame, since the swap-grant-mapping-for-memory hypercall
> simply is not very nice. What if we wanted to swap a grant mapping for
> another grant mapping in future? Or some other kind of mapping? We end up
> needing a new, complicated, grant-table hypercall every time. Or in six
> months time we decide this wasn't such a good idea, implement another
> scheme, but have this special-purpose grant-table hypercall hanging around
> unused forever more.
> 
> Could we shatter the ia64 guest large mappings easily? They have no benefit
> on Xen anyway apart from some inconsequential memory saving.

There is another way.  Instead of expanding add_to_physmap which
doesn't quite fit our semantics, let's add a new operation that
*actually* swaps two p2m entries.  Then it can be used as follows:

        a) new_addr = alloc_page
        b) memcpy(new_addr, addr, len)
        c) p2m_swap(__pa(new_addr), __pa(addr))
        d) grant_unmap(__pa(new_addr))

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  7:36                                     ` Keir Fraser
  2007-03-27  7:44                                       ` Herbert Xu
@ 2007-03-27  7:51                                       ` Keir Fraser
  2007-03-27  7:53                                         ` Herbert Xu
  2007-03-27 10:15                                       ` Herbert Xu
  2 siblings, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-27  7:51 UTC (permalink / raw)
  To: Keir Fraser, Isaku Yamahata, Herbert Xu; +Cc: Xen Development Mailing List

On 27/3/07 08:36, "Keir Fraser" <Keir.Fraser@cl.cam.ac.uk> wrote:

> This is quite a shame, since the swap-grant-mapping-for-memory hypercall
> simply is not very nice. What if we wanted to swap a grant mapping for
> another grant mapping in future? Or some other kind of mapping? We end up
> needing a new, complicated, grant-table hypercall every time. Or in six
> months time we decide this wasn't such a good idea, implement another
> scheme, but have this special-purpose grant-table hypercall hanging around
> unused forever more.
> 
> Could we shatter the ia64 guest large mappings easily? They have no benefit
> on Xen anyway apart from some inconsequential memory saving.

Also bear in mind the plan to avoid the grant mapping at all in the
direct-to-wire transmit case. This also will rely on a #PF (or some other
fault, potentially) if the guest tries to access that pseudophysical page of
memory. If we can work out a way to make this work on ia64 then the
delayed-copy implementation problems are solved too.

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  7:51                                       ` Keir Fraser
@ 2007-03-27  7:53                                         ` Herbert Xu
  2007-03-27  7:59                                           ` Herbert Xu
                                                             ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Herbert Xu @ 2007-03-27  7:53 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List

On Tue, Mar 27, 2007 at 08:51:15AM +0100, Keir Fraser wrote:
> 
> Also bear in mind the plan to avoid the grant mapping at all in the
> direct-to-wire transmit case. This also will rely on a #PF (or some other
> fault, potentially) if the guest tries to access that pseudophysical page of
> memory. If we can work out a way to make this work on ia64 then the
> delayed-copy implementation problems are solved too.

That's to avoid the TLB flush costs, right?

On ia64 (or anything other than x86-32 for that matter) you have
enough virtual address bits to map the entire physical memory
into dom0 (or the driver domain) so that could also be used to
avoid TLB flushes.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  7:53                                         ` Herbert Xu
@ 2007-03-27  7:59                                           ` Herbert Xu
  2007-03-27  8:10                                             ` Keir Fraser
  2007-03-27  8:03                                           ` Keir Fraser
  2007-03-27  8:22                                           ` Isaku Yamahata
  2 siblings, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-27  7:59 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List

On Tue, Mar 27, 2007 at 05:53:18PM +1000, Herbert Xu wrote:
> On Tue, Mar 27, 2007 at 08:51:15AM +0100, Keir Fraser wrote:
> > 
> > Also bear in mind the plan to avoid the grant mapping at all in the
> > direct-to-wire transmit case. This also will rely on a #PF (or some other
> > fault, potentially) if the guest tries to access that pseudophysical page of
> > memory. If we can work out a way to make this work on ia64 then the
> > delayed-copy implementation problems are solved too.

Even with this we could still keep the large pages by using page
faults on the host page table rather than the guest page table.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  7:53                                         ` Herbert Xu
  2007-03-27  7:59                                           ` Herbert Xu
@ 2007-03-27  8:03                                           ` Keir Fraser
  2007-03-27  8:22                                           ` Isaku Yamahata
  2 siblings, 0 replies; 63+ messages in thread
From: Keir Fraser @ 2007-03-27  8:03 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Isaku Yamahata, Xen Development Mailing List

On 27/3/07 08:53, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

>> Also bear in mind the plan to avoid the grant mapping at all in the
>> direct-to-wire transmit case. This also will rely on a #PF (or some other
>> fault, potentially) if the guest tries to access that pseudophysical page of
>> memory. If we can work out a way to make this work on ia64 then the
>> delayed-copy implementation problems are solved too.
> 
> That's to avoid the TLB flush costs, right?
> 
> On ia64 (or anything other than x86-32 for that matter) you have
> enough virtual address bits to map the entire physical memory
> into dom0 (or the driver domain) so that could also be used to
> avoid TLB flushes.

Once we have IOMMU support we want to completely close off driver domain
access to arbitrary memory (which they currently have via DMA). Giving a
mapping of all physical memory would be a step in quite the opposite
direction.

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  7:59                                           ` Herbert Xu
@ 2007-03-27  8:10                                             ` Keir Fraser
  2007-03-27  8:13                                               ` Herbert Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-27  8:10 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Isaku Yamahata, Xen Development Mailing List

On 27/3/07 08:59, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

>>> Also bear in mind the plan to avoid the grant mapping at all in the
>>> direct-to-wire transmit case. This also will rely on a #PF (or some other
>>> fault, potentially) if the guest tries to access that pseudophysical page of
>>> memory. If we can work out a way to make this work on ia64 then the
>>> delayed-copy implementation problems are solved too.
> 
> Even with this we could still keep the large pages by using page
> faults on the host page table rather than the guest page table.

Sure. How would you expose those to the guest? If that can be solved we can
define grant-table unmap to switch the p2m entry into an explicitly invalid
state and we're away.

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  7:44                                       ` Herbert Xu
@ 2007-03-27  8:12                                         ` Keir Fraser
  2007-03-27  8:15                                           ` Herbert Xu
  2007-03-27  8:15                                         ` Isaku Yamahata
  1 sibling, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-27  8:12 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Isaku Yamahata, Xen Development Mailing List

On 27/3/07 08:44, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

> There is another way.  Instead of expanding add_to_physmap which
> doesn't quite fit our semantics, let's add a new operation that
> *actually* swaps two p2m entries.  Then it can be used as follows:
> 
>         a) new_addr = alloc_page
>         b) memcpy(new_addr, addr, len)
>         c) p2m_swap(__pa(new_addr), __pa(addr))
>         d) grant_unmap(__pa(new_addr))

I like this a bit more than the grant-table hypercall, mainly because I
suspect the hypercall implementation is simpler. I'm not sure how easy it is
to make atomic though! Would there be enough locking to ensure that one of
the p2m entries being present in both (or neither) locations at some
intermediate point in time would not matter?

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  8:10                                             ` Keir Fraser
@ 2007-03-27  8:13                                               ` Herbert Xu
  2007-03-27  8:31                                                 ` Keir Fraser
  0 siblings, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-27  8:13 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List

On Tue, Mar 27, 2007 at 09:10:40AM +0100, Keir Fraser wrote:
>
> > Even with this we could still keep the large pages by using page
> > faults on the host page table rather than the guest page table.
> 
> Sure. How would you expose those to the guest? If that can be solved we can
> define grant-table unmap to switch the p2m entry into an explicitly invalid
> state and we're away.

I was actually thinking of Jose's scheme where it's mapped in on
demand.  So you would still do your grant table map operation,
but instead of mapping it in as we do now it would simply notify
the hypervisor of the intention to map this page, and the hypervisor
can then map it in when it actually faults.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  8:12                                         ` Keir Fraser
@ 2007-03-27  8:15                                           ` Herbert Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Herbert Xu @ 2007-03-27  8:15 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List

On Tue, Mar 27, 2007 at 09:12:09AM +0100, Keir Fraser wrote:
> On 27/3/07 08:44, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:
> 
> > There is another way.  Instead of expanding add_to_physmap which
> > doesn't quite fit our semantics, let's add a new operation that
> > *actually* swaps two p2m entries.  Then it can be used as follows:
> > 
> >         a) new_addr = alloc_page
> >         b) memcpy(new_addr, addr, len)
> >         c) p2m_swap(__pa(new_addr), __pa(addr))
> >         d) grant_unmap(__pa(new_addr))
> 
> I like this a bit more than the grant-table hypercall, mainly because I
> suspect the hypercall implementation is simpler. I'm not sure how easy it is
> to make atomic though! Would there be enough locking to ensure that one of
> the p2m entries being present in both (or neither) locations at some
> intermediate point in time would not matter?

Well for my purpose I don't need both entries to be written atomically.
As long as each entry is replaced atomically wrt itself it'll be good
enough for the usage scenario above.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  7:44                                       ` Herbert Xu
  2007-03-27  8:12                                         ` Keir Fraser
@ 2007-03-27  8:15                                         ` Isaku Yamahata
  1 sibling, 0 replies; 63+ messages in thread
From: Isaku Yamahata @ 2007-03-27  8:15 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Xen Development Mailing List

On Tue, Mar 27, 2007 at 05:44:51PM +1000, Herbert Xu wrote:
> On Tue, Mar 27, 2007 at 08:36:39AM +0100, Keir Fraser wrote:
> > 
> > This is quite a shame, since the swap-grant-mapping-for-memory hypercall
> > simply is not very nice. What if we wanted to swap a grant mapping for
> > another grant mapping in future? Or some other kind of mapping? We end up
> > needing a new, complicated, grant-table hypercall every time. Or in six
> > months time we decide this wasn't such a good idea, implement another
> > scheme, but have this special-purpose grant-table hypercall hanging around
> > unused forever more.
> > 
> > Could we shatter the ia64 guest large mappings easily? They have no benefit
> > on Xen anyway apart from some inconsequential memory saving.
> 
> There is another way.  Instead of expanding add_to_physmap which
> doesn't quite fit our semantics, let's add a new operation that
> *actually* swaps two p2m entries.  Then it can be used as follows:
> 
>         a) new_addr = alloc_page
>         b) memcpy(new_addr, addr, len)
>         c) p2m_swap(__pa(new_addr), __pa(addr))
>         d) grant_unmap(__pa(new_addr))

I'm thinking same thing. This would work.

-- 
yamahata

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  7:53                                         ` Herbert Xu
  2007-03-27  7:59                                           ` Herbert Xu
  2007-03-27  8:03                                           ` Keir Fraser
@ 2007-03-27  8:22                                           ` Isaku Yamahata
  2 siblings, 0 replies; 63+ messages in thread
From: Isaku Yamahata @ 2007-03-27  8:22 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Xen Development Mailing List

On Tue, Mar 27, 2007 at 05:53:18PM +1000, Herbert Xu wrote:
> On Tue, Mar 27, 2007 at 08:51:15AM +0100, Keir Fraser wrote:
> > 
> > Also bear in mind the plan to avoid the grant mapping at all in the
> > direct-to-wire transmit case. This also will rely on a #PF (or some other
> > fault, potentially) if the guest tries to access that pseudophysical page of
> > memory. If we can work out a way to make this work on ia64 then the
> > delayed-copy implementation problems are solved too.
> That's to avoid the TLB flush costs, right?

Xen/IA64 has optimization there. Xen/ia64 tracks tlb inserts
on grant table mapped page.
If there was no tlb insert(i.e. direct-to-wire transmit case),
it doesn't issue tlb flush, but only clears the p2m entry
when grant table unmap.

On Xen/IA64 case, something like copy-on-demand-access can be implemented
as copy-on-tlb-insert in xen, I suppose.

-- 
yamahata

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  8:13                                               ` Herbert Xu
@ 2007-03-27  8:31                                                 ` Keir Fraser
  2007-03-27  9:20                                                   ` Herbert Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-27  8:31 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Isaku Yamahata, Xen Development Mailing List

On 27/3/07 09:13, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

> I was actually thinking of Jose's scheme where it's mapped in on
> demand.  So you would still do your grant table map operation,
> but instead of mapping it in as we do now it would simply notify
> the hypervisor of the intention to map this page, and the hypervisor
> can then map it in when it actually faults.

That's not quite how Jose's patch works. The lazy mapping is done by the
guest in his patch.

If we're not going to go as far as supporting guest-visible 'p2m faults'
then I think overall your original patch is probably the best approach so
far. If I view the new hypercall as a special case of the existing unmap
(which is an unmap-to-zero-pte special case) then it's not so bad. At least
there should be very little new code in the hypervisor needed to implement
it. We have a little while longer to think about this since the patches
aren't for 3.0.5.

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  8:31                                                 ` Keir Fraser
@ 2007-03-27  9:20                                                   ` Herbert Xu
  2007-03-27  9:58                                                     ` Keir Fraser
  0 siblings, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-27  9:20 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List

On Tue, Mar 27, 2007 at 09:31:15AM +0100, Keir Fraser wrote:
> 
> > I was actually thinking of Jose's scheme where it's mapped in on
> > demand.  So you would still do your grant table map operation,
> > but instead of mapping it in as we do now it would simply notify
> > the hypervisor of the intention to map this page, and the hypervisor
> > can then map it in when it actually faults.
> 
> That's not quite how Jose's patch works. The lazy mapping is done by the
> guest in his patch.

That would mean the breaking of large pages on ia64 or any other
platform that uses them.  Is there any benefit for not doing it
in the hypervisor?

> If we're not going to go as far as supporting guest-visible 'p2m faults'
> then I think overall your original patch is probably the best approach so
> far. If I view the new hypercall as a special case of the existing unmap
> (which is an unmap-to-zero-pte special case) then it's not so bad. At least
> there should be very little new code in the hypervisor needed to implement
> it. We have a little while longer to think about this since the patches
> aren't for 3.0.5.

OK.  How do you want to proceed from here then?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  9:20                                                   ` Herbert Xu
@ 2007-03-27  9:58                                                     ` Keir Fraser
  2007-03-27 10:06                                                       ` Herbert Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-27  9:58 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Isaku Yamahata, Xen Development Mailing List

On 27/3/07 10:20, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

>> That's not quite how Jose's patch works. The lazy mapping is done by the
>> guest in his patch.
> 
> That would mean the breaking of large pages on ia64 or any other
> platform that uses them.  Is there any benefit for not doing it
> in the hypervisor?

Yet another bloody grant-table hypercall! Also there's the question of where
this lazy state gets squirrelled away until the demand mapping occurs.
Perhaps the p2m is usable in this case, since I expect we have plenty of
spare bits for a not-present entry... But anyway, we don't have the luxury
of a p2m to stick things in the x86 case.

>> If we're not going to go as far as supporting guest-visible 'p2m faults'
>> then I think overall your original patch is probably the best approach so
>> far. If I view the new hypercall as a special case of the existing unmap
>> (which is an unmap-to-zero-pte special case) then it's not so bad. At least
>> there should be very little new code in the hypervisor needed to implement
>> it. We have a little while longer to think about this since the patches
>> aren't for 3.0.5.
> 
> OK.  How do you want to proceed from here then?

I'll take a closer look at your original patch. Having viewed the
alternatives I don't hate it so much now. :-)

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  9:58                                                     ` Keir Fraser
@ 2007-03-27 10:06                                                       ` Herbert Xu
  2007-03-27 10:17                                                         ` Keir Fraser
  0 siblings, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-27 10:06 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List

On Tue, Mar 27, 2007 at 10:58:46AM +0100, Keir Fraser wrote:
> 
> Yet another bloody grant-table hypercall! Also there's the question of where
> this lazy state gets squirrelled away until the demand mapping occurs.
> Perhaps the p2m is usable in this case, since I expect we have plenty of
> spare bits for a not-present entry... But anyway, we don't have the luxury
> of a p2m to stick things in the x86 case.

Sorry I don't get it.  What if we called it a page table operation,
would that make it better :) In other words, what exactly is wrong
with having such an operation?

As for a place to store the info, we can just store the real MFN in
the p2m table but mark it as not present.

> I'll take a closer look at your original patch. Having viewed the
> alternatives I don't hate it so much now. :-)

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27  7:36                                     ` Keir Fraser
  2007-03-27  7:44                                       ` Herbert Xu
  2007-03-27  7:51                                       ` Keir Fraser
@ 2007-03-27 10:15                                       ` Herbert Xu
  2007-03-27 10:19                                         ` Keir Fraser
  2 siblings, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-27 10:15 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List

On Tue, Mar 27, 2007 at 08:36:39AM +0100, Keir Fraser wrote:
> 
> Could we shatter the ia64 guest large mappings easily? They have no benefit
> on Xen anyway apart from some inconsequential memory saving.

I think large pages are needed in general not to reduce memory usage,
but to reduce TLB usage.  I'm not sure about ia64, but for other
architectures it's quite important.

So getting rid of them will add a non-trivial cost.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27 10:06                                                       ` Herbert Xu
@ 2007-03-27 10:17                                                         ` Keir Fraser
  2007-03-27 10:25                                                           ` Herbert Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-27 10:17 UTC (permalink / raw)
  To: Herbert Xu, Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List

On 27/3/07 11:06, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

> Sorry I don't get it.  What if we called it a page table operation,
> would that make it better :) In other words, what exactly is wrong
> with having such an operation?

The grant-table code is pretty complicated and a harbour for bugs. So I
prefer to keep it as small and simple as possible. I'm not sure the
comparison with page-table operations is a good one -- the Xen PV interface
for pagetable updates doesn't include anything as special-case as the things
we are currently considering for grant tables.

> As for a place to store the info, we can just store the real MFN in
> the p2m table but mark it as not present.

We can't do that where there is no p2m. I suppose we could make this
operation specific to auto-translate guests.

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27 10:15                                       ` Herbert Xu
@ 2007-03-27 10:19                                         ` Keir Fraser
  2007-03-27 11:11                                           ` Herbert Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-27 10:19 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Isaku Yamahata, Xen Development Mailing List




On 27/3/07 11:15, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

>> Could we shatter the ia64 guest large mappings easily? They have no benefit
>> on Xen anyway apart from some inconsequential memory saving.
> 
> I think large pages are needed in general not to reduce memory usage,
> but to reduce TLB usage.  I'm not sure about ia64, but for other
> architectures it's quite important.
> 
> So getting rid of them will add a non-trivial cost.

Is this a virtual TLB we're talking about? Xen will already have to
implicitly shatter the large mappings before updating the real TLB, since
guest machine memory map is not contiguous at the scale of large pages (even
if we tried at domain build time, it certainly wouldn't be the case when
foreign grant mappings are scattered all over the place).

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27 10:17                                                         ` Keir Fraser
@ 2007-03-27 10:25                                                           ` Herbert Xu
  2007-03-27 10:40                                                             ` Keir Fraser
  0 siblings, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-27 10:25 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List

On Tue, Mar 27, 2007 at 11:17:40AM +0100, Keir Fraser wrote:
> 
> The grant-table code is pretty complicated and a harbour for bugs. So I
> prefer to keep it as small and simple as possible. I'm not sure the
> comparison with page-table operations is a good one -- the Xen PV interface
> for pagetable updates doesn't include anything as special-case as the things
> we are currently considering for grant tables.

Fair enough.  The only thing I could say in this regard is that if
we moved in this direction we could replace all existing grant table
ops with this and eventually phase them out.

> > As for a place to store the info, we can just store the real MFN in
> > the p2m table but mark it as not present.
> 
> We can't do that where there is no p2m. I suppose we could make this
> operation specific to auto-translate guests.

Actually we can just mark the PTEs as not present in that case.  The
hypervisor can then mark the PTEs as present on the first fault.

Just as grant table operations currently work on virtual addresses for
non-translated guests and physical addresses for translated guests,
such an operation would work on PTEs for non-translated guests and
p2m entries for translated guests.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27 10:25                                                           ` Herbert Xu
@ 2007-03-27 10:40                                                             ` Keir Fraser
  2007-03-27 11:09                                                               ` Herbert Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-27 10:40 UTC (permalink / raw)
  To: Herbert Xu, Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List

On 27/3/07 11:25, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

> Actually we can just mark the PTEs as not present in that case.  The
> hypervisor can then mark the PTEs as present on the first fault.
> 
> Just as grant table operations currently work on virtual addresses for
> non-translated guests and physical addresses for translated guests,
> such an operation would work on PTEs for non-translated guests and
> p2m entries for translated guests.

Yeah, but: how does Xen efficiently work out whether a not-present fault is
due to a lazy grant mapping and, if so, what it is that needs to be demand
mapped? PV guests can store what they like in not-present pagetable entries.
Perhaps we could make use of reserved bits in ptes instead (not that there
are any in non-pae x86, but I'd very much like to obsolete non-pae x86 Xen
after 3.0.5 anyway -- I don't think anyone actually ships a product with
non-pae bits).

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27 10:40                                                             ` Keir Fraser
@ 2007-03-27 11:09                                                               ` Herbert Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Herbert Xu @ 2007-03-27 11:09 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List

On Tue, Mar 27, 2007 at 11:40:00AM +0100, Keir Fraser wrote:
>
> Yeah, but: how does Xen efficiently work out whether a not-present fault is
> due to a lazy grant mapping and, if so, what it is that needs to be demand
> mapped? PV guests can store what they like in not-present pagetable entries.
> Perhaps we could make use of reserved bits in ptes instead (not that there
> are any in non-pae x86, but I'd very much like to obsolete non-pae x86 Xen
> after 3.0.5 anyway -- I don't think anyone actually ships a product with
> non-pae bits).

That's a good point.  Actually, on x86 we have the accessed bit so
we can do this in a different way.

Instead of setting up PTE/P2M entries that cause a page fault on the
first access, we setup the real thing every time and simply check the
accessed bit when we unmap the grant table entry.  That will then allow
us to flush the TLB iff the accessed bit is set.

Now I haven't looked at the other architectures, but I would think
the accessed bit would likely to be present there as well, do you
know whether this is the case for ia64/ppc?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27 10:19                                         ` Keir Fraser
@ 2007-03-27 11:11                                           ` Herbert Xu
  2007-03-27 13:50                                             ` Keir Fraser
  0 siblings, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-27 11:11 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List

On Tue, Mar 27, 2007 at 11:19:10AM +0100, Keir Fraser wrote:
>
> Is this a virtual TLB we're talking about? Xen will already have to
> implicitly shatter the large mappings before updating the real TLB, since
> guest machine memory map is not contiguous at the scale of large pages (even
> if we tried at domain build time, it certainly wouldn't be the case when
> foreign grant mappings are scattered all over the place).

OK, you got me there :)

Of course I'd still like to avoid this if possible since it introduces
another change from Linux.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-26 23:58                       ` Santos, Jose Renato G
@ 2007-03-27 13:15                         ` Ian Pratt
  0 siblings, 0 replies; 63+ messages in thread
From: Ian Pratt @ 2007-03-27 13:15 UTC (permalink / raw)
  To: Santos, Jose Renato G, Ian Pratt, Herbert Xu, Keir Fraser
  Cc: Xen Development Mailing List

> > Yep, the approach is similar, but I think having the
> > 'permanent' grants mechanism is probably a bit more flexible
> > for managing the header fragments.
> >
> 
>   Not sure why it would be more flexible. Do you mean dealing with
> headers that are in fragments? If headers are in fragments they will
be
> linearized when copied into the shared header area by netfront. Also,
it
> seems to me that having a fixed set of pages at initialization that
> never change is easier to manage. I probably did not understand you
> correctly...

Netfront would copy the header out into a separate (small) fragment,
allocating it from a from a frame in the permanently mapped pool. 

Having a bit to indicate 'permanent' grants means that the header pool
is easily expanded at run time if need be, and seems a more generic
mechanism than having a bunch of frames agreed out-of-band at startup.

Ian 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27 11:11                                           ` Herbert Xu
@ 2007-03-27 13:50                                             ` Keir Fraser
  2007-03-27 20:46                                               ` Herbert Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Keir Fraser @ 2007-03-27 13:50 UTC (permalink / raw)
  To: Herbert Xu, Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List




On 27/3/07 12:11, "Herbert Xu" <herbert@gondor.apana.org.au> wrote:

> OK, you got me there :)
> 
> Of course I'd still like to avoid this if possible since it introduces
> another change from Linux.

Yeah, that's fair enough. We've avoided PV-driver dependencies on changes in
core architectural code so far. It would be nice to continue this,
particularly given we now support PV-on-HVM (ie.,
PV-on-otherwise-unmodified-guest). Of course we'd only be introducing a
dependency for a *backend* driver here.

Your thought of using the accessed bit is rather neat. I believe it is
guaranteed that if we test-and-clear a pte, and see A==0, then no TLB can
cache that pte. If so, I think this could be a winner.

 -- Keir

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27 13:50                                             ` Keir Fraser
@ 2007-03-27 20:46                                               ` Herbert Xu
  2007-03-28 13:05                                                 ` Herbert Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-27 20:46 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List

On Tue, Mar 27, 2007 at 02:50:19PM +0100, Keir Fraser wrote:
> 
> PV-on-otherwise-unmodified-guest). Of course we'd only be introducing a
> dependency for a *backend* driver here.

Well I'm still hoping to have a zero-copy dom0 => domU path :)

> Your thought of using the accessed bit is rather neat. I believe it is
> guaranteed that if we test-and-clear a pte, and see A==0, then no TLB can
> cache that pte. If so, I think this could be a winner.

Yes that assumption is used elsewhere too so it should be fine.

I've checked again and the accessed bit is certainly present on ppc
as well as ia64.  The only I don't know is if it's present on the
nested page tables on ia64 but I see no reasons why it wouldn't be.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-27 20:46                                               ` Herbert Xu
@ 2007-03-28 13:05                                                 ` Herbert Xu
  2007-03-29  6:08                                                   ` Isaku Yamahata
  0 siblings, 1 reply; 63+ messages in thread
From: Herbert Xu @ 2007-03-28 13:05 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Isaku Yamahata, Xen Development Mailing List

On Wed, Mar 28, 2007 at 06:46:40AM +1000, Herbert Xu wrote:
> 
> I've checked again and the accessed bit is certainly present on ppc
> as well as ia64.  The only I don't know is if it's present on the
> nested page tables on ia64 but I see no reasons why it wouldn't be.

Here is a completely untested patch that demonstrates how this could
be done on x86.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff -r 3ac19fda0bc2 xen/arch/x86/mm.c
--- a/xen/arch/x86/mm.c	Fri Mar 02 12:11:52 2007 +0000
+++ b/xen/arch/x86/mm.c	Wed Mar 28 22:58:45 2007 +1000
@@ -1262,6 +1262,36 @@ static void free_l4_table(struct page_in
 #endif
 
 
+static inline int cmpxchg_intpte(intpte_t *p,
+				 intpte_t *old,
+				 intpte_t new,
+				 unsigned long mfn,
+				 struct vcpu *v)
+{
+        int rv;
+
+        for ( ; ; )
+        {
+            intpte_t t = *old;
+
+            rv = paging_cmpxchg_guest_entry(v, p, old, new, _mfn(mfn));
+            if ( unlikely(rv == 0) )
+            {
+                MEM_LOG("Failed to update %" PRIpte " -> %" PRIpte
+                        ": saw %" PRIpte, t, new, t);
+                break;
+            }
+
+            if ( t == *old )
+                break;
+
+            /* Allowed to change in Accessed/Dirty flags only. */
+            BUG_ON((t ^ *old) & ~(intpte_t)(_PAGE_ACCESSED|_PAGE_DIRTY));
+        }
+
+        return rv;
+}
+
 /* How to write an entry to the guest pagetables.
  * Returns 0 for failure (pointer not valid), 1 for success. */
 static inline int update_intpte(intpte_t *p, 
@@ -1274,27 +1304,7 @@ static inline int update_intpte(intpte_t
 #ifndef PTE_UPDATE_WITH_CMPXCHG
     rv = paging_write_guest_entry(v, p, new, _mfn(mfn));
 #else
-    {
-        intpte_t t = old;
-        for ( ; ; )
-        {
-            rv = paging_cmpxchg_guest_entry(v, p, &t, new, _mfn(mfn));
-            if ( unlikely(rv == 0) )
-            {
-                MEM_LOG("Failed to update %" PRIpte " -> %" PRIpte
-                        ": saw %" PRIpte, old, new, t);
-                break;
-            }
-
-            if ( t == old )
-                break;
-
-            /* Allowed to change in Accessed/Dirty flags only. */
-            BUG_ON((t ^ old) & ~(intpte_t)(_PAGE_ACCESSED|_PAGE_DIRTY));
-
-            old = t;
-        }
-    }
+    rv = cmpxchg_intpte(p, &old, new, mfn, v);
 #endif
     return rv;
 }
@@ -2598,6 +2608,7 @@ static int destroy_grant_va_mapping(
     unsigned long addr, unsigned long frame, struct vcpu *v)
 {
     l1_pgentry_t *pl1e, ol1e;
+    intpte_t *ptep, opte, npte;
     unsigned long gl1mfn;
     int rc = 0;
     
@@ -2619,12 +2630,19 @@ static int destroy_grant_va_mapping(
     }
 
     /* Delete pagetable entry. */
-    if ( unlikely(!UPDATE_ENTRY(l1, pl1e, ol1e, l1e_empty(), gl1mfn, v)) )
+    ptep = (intpte_t *)pl1e;
+    opte = l1e_get_intpte(ol1e);
+    npte = l1e_get_intpte(l1e_empty());
+
+    if ( unlikely(!cmpxchg_intpte(ptep, opte, npte, mfn, v)) )
     {
         MEM_LOG("Cannot delete PTE entry at %p", (unsigned long *)pl1e);
         rc = GNTST_general_error;
         goto out;
     }
+
+    if (!(opte & (intpte_t)_PAGE_ACCESSED))
+        rc = GNTST_unused;
 
  out:
     guest_unmap_l1e(v, pl1e);
diff -r 3ac19fda0bc2 xen/common/grant_table.c
--- a/xen/common/grant_table.c	Fri Mar 02 12:11:52 2007 +0000
+++ b/xen/common/grant_table.c	Wed Mar 28 22:58:45 2007 +1000
@@ -394,7 +394,7 @@ gnttab_map_grant_ref(
     return 0;
 }
 
-static void
+static int
 __gnttab_unmap_grant_ref(
     struct gnttab_unmap_grant_ref *op)
 {
@@ -416,7 +416,7 @@ __gnttab_unmap_grant_ref(
     {
         gdprintk(XENLOG_INFO, "Bad handle (%d).\n", op->handle);
         op->status = GNTST_bad_handle;
-        return;
+        return 0;
     }
 
     map = &maptrack_entry(ld->grant_table, op->handle);
@@ -425,7 +425,7 @@ __gnttab_unmap_grant_ref(
     {
         gdprintk(XENLOG_INFO, "Zero flags for handle (%d).\n", op->handle);
         op->status = GNTST_bad_handle;
-        return;
+        return 0;
     }
 
     dom   = map->domid;
@@ -437,7 +437,7 @@ __gnttab_unmap_grant_ref(
         /* This can happen when a grant is implicitly unmapped. */
         gdprintk(XENLOG_INFO, "Could not find domain %d\n", dom);
         domain_crash(ld); /* naughty... */
-        return;
+        return 0;
     }
 
     TRACE_1D(TRC_MEM_PAGE_GRANT_UNMAP, dom);
@@ -511,9 +511,17 @@ __gnttab_unmap_grant_ref(
         gnttab_clear_flag(_GTF_reading, &sha->flags);
 
  unmap_out:
-    op->status = rc;
     spin_unlock(&rd->grant_table->lock);
     rcu_unlock_domain(rd);
+
+    op->status = rc;
+
+    if (rc == GNTST_unused) {
+        op->status = GNTST_okay;
+        return 0;
+    }
+
+    return 1;
 }
 
 static long
@@ -521,23 +529,25 @@ gnttab_unmap_grant_ref(
     XEN_GUEST_HANDLE(gnttab_unmap_grant_ref_t) uop, unsigned int count)
 {
     int i;
+    int used = 0;
+    int err = -EFAULT;
     struct gnttab_unmap_grant_ref op;
 
     for ( i = 0; i < count; i++ )
     {
         if ( unlikely(__copy_from_guest_offset(&op, uop, i, 1)) )
             goto fault;
-        __gnttab_unmap_grant_ref(&op);
+        used |= __gnttab_unmap_grant_ref(&op);
         if ( unlikely(__copy_to_guest_offset(uop, i, &op, 1)) )
             goto fault;
     }
 
-    flush_tlb_mask(current->domain->domain_dirty_cpumask);
-    return 0;
+    err = 0;
 
 fault:
-    flush_tlb_mask(current->domain->domain_dirty_cpumask);
-    return -EFAULT;    
+    if (used)
+        flush_tlb_mask(current->domain->domain_dirty_cpumask);
+    return err;
 }
 
 int
diff -r 3ac19fda0bc2 xen/include/public/grant_table.h
--- a/xen/include/public/grant_table.h	Fri Mar 02 12:11:52 2007 +0000
+++ b/xen/include/public/grant_table.h	Wed Mar 28 22:58:45 2007 +1000
@@ -360,6 +360,7 @@ DEFINE_XEN_GUEST_HANDLE(gnttab_query_siz
 /*
  * Values for error status returns. All errors are -ve.
  */
+#define GNTST_unused           (1)  /* Entry is unused (internal status).    */
 #define GNTST_okay             (0)  /* Normal return.                        */
 #define GNTST_general_error    (-1) /* General undefined error.              */
 #define GNTST_bad_domain       (-2) /* Unrecognsed domain id.                */

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: [0/2] Remove netloop by lazy copying in netback
  2007-03-28 13:05                                                 ` Herbert Xu
@ 2007-03-29  6:08                                                   ` Isaku Yamahata
  0 siblings, 0 replies; 63+ messages in thread
From: Isaku Yamahata @ 2007-03-29  6:08 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Xen Development Mailing List, Keir Fraser

On Wed, Mar 28, 2007 at 11:05:16PM +1000, Herbert Xu wrote:
> On Wed, Mar 28, 2007 at 06:46:40AM +1000, Herbert Xu wrote:
> > 
> > I've checked again and the accessed bit is certainly present on ppc
> > as well as ia64.  The only I don't know is if it's present on the
> > nested page tables on ia64 but I see no reasons why it wouldn't be.
> 
> Here is a completely untested patch that demonstrates how this could
> be done on x86.

Xen/IA64 needs more than accessed bit for the optimized tlb flush
so that destroy_grant_host_mapping() flushes tlb and
it doesn't set any bit of domain_dirty_cpumask.
Can we make flush_tlb_mask() arch dependent and 
record something like GNTST_unuse in struct arch_vcpu?
-- 
yamahata

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2007-03-29  6:08 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-20  4:46 RFC: [0/2] Remove netloop by lazy copying in netback Herbert Xu
2007-03-20  4:50 ` RFC: [1/2] [XEN] gnttab: Add new op unmap_and_replace Herbert Xu
2007-03-20  4:56 ` RFC: [2/2] [NET] back: Add lazy copying Herbert Xu
2007-03-20  7:10 ` RFC: [0/2] Remove netloop by lazy copying in netback Keir Fraser
2007-03-20 10:11   ` Herbert Xu
2007-03-20 10:18     ` Keir Fraser
2007-03-22 10:50       ` Herbert Xu
2007-03-22 15:40         ` Keir Fraser
2007-03-23  3:17           ` Herbert Xu
2007-03-23 10:32             ` Keir Fraser
2007-03-23 11:42               ` Herbert Xu
2007-03-23 11:47                 ` Keir Fraser
2007-03-23 13:24                 ` Ian Pratt
2007-03-23 23:07                   ` Santos, Jose Renato G
2007-03-23 23:29                     ` Ian Pratt
2007-03-26 23:58                       ` Santos, Jose Renato G
2007-03-27 13:15                         ` Ian Pratt
2007-03-25 11:41                 ` Herbert Xu
2007-03-25 12:27                   ` Keir Fraser
2007-03-26  2:19                     ` Herbert Xu
2007-03-26 18:36                       ` Keir Fraser
2007-03-26 21:08                         ` Herbert Xu
2007-03-27  0:33                           ` Keir Fraser
2007-03-27  0:35                             ` Herbert Xu
2007-03-27  0:46                               ` Keir Fraser
2007-03-27  0:45                             ` Keir Fraser
2007-03-27  0:52                               ` Herbert Xu
2007-03-27  1:02                                 ` Keir Fraser
2007-03-27  1:08                                   ` Herbert Xu
2007-03-27  1:17                                     ` Herbert Xu
2007-03-27  3:34                                   ` Isaku Yamahata
2007-03-27  7:36                                     ` Keir Fraser
2007-03-27  7:44                                       ` Herbert Xu
2007-03-27  8:12                                         ` Keir Fraser
2007-03-27  8:15                                           ` Herbert Xu
2007-03-27  8:15                                         ` Isaku Yamahata
2007-03-27  7:51                                       ` Keir Fraser
2007-03-27  7:53                                         ` Herbert Xu
2007-03-27  7:59                                           ` Herbert Xu
2007-03-27  8:10                                             ` Keir Fraser
2007-03-27  8:13                                               ` Herbert Xu
2007-03-27  8:31                                                 ` Keir Fraser
2007-03-27  9:20                                                   ` Herbert Xu
2007-03-27  9:58                                                     ` Keir Fraser
2007-03-27 10:06                                                       ` Herbert Xu
2007-03-27 10:17                                                         ` Keir Fraser
2007-03-27 10:25                                                           ` Herbert Xu
2007-03-27 10:40                                                             ` Keir Fraser
2007-03-27 11:09                                                               ` Herbert Xu
2007-03-27  8:03                                           ` Keir Fraser
2007-03-27  8:22                                           ` Isaku Yamahata
2007-03-27 10:15                                       ` Herbert Xu
2007-03-27 10:19                                         ` Keir Fraser
2007-03-27 11:11                                           ` Herbert Xu
2007-03-27 13:50                                             ` Keir Fraser
2007-03-27 20:46                                               ` Herbert Xu
2007-03-28 13:05                                                 ` Herbert Xu
2007-03-29  6:08                                                   ` Isaku Yamahata
2007-03-27  3:41                       ` Isaku Yamahata
2007-03-27  5:45                         ` Herbert Xu
2007-03-20 10:22 ` Keir Fraser
2007-03-20 10:27   ` Keir Fraser
2007-03-20 11:50     ` David Edmondson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.