* [Hackathon minutes] PV network improvements
@ 2013-05-20 14:08 Stefano Stabellini
  2013-05-20 14:49 ` George Dunlap
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Stefano Stabellini @ 2013-05-20 14:08 UTC (permalink / raw)
  To: xen-devel; +Cc: Stefano Stabellini

Hi all,
these are Konrad's and my notes (mostly Konrad's) on possible
improvements of the PV network protocol, taken at the Hackathon.


A) Network bandwidth: multipage rings
The maximum amount of outstanding data it can have is 896KB (64KB of
data uses 18 slots out of 256; 256 / 18 = 14 packets, 14 * 64KB =
896KB).  This can be expanded by using multiple pages for the ring.
This would benefit NFS and bulk data transfer (such as netperf data).
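
As a quick sanity check of those numbers, a small user-space sketch
(the 256-slot ring and 18 slots per 64KB packet are the figures from
above; the rest is illustrative):

    #include <stdio.h>

    /* Illustrative only: assumes 256 slots per ring page and 18 slots
     * per 64KB packet, as in the note above. */
    int main(void)
    {
        const unsigned slots_per_page = 256;
        const unsigned slots_per_pkt  = 18;
        const unsigned pkt_kb         = 64;
        unsigned pages;

        for (pages = 1; pages <= 4; pages++) {
            unsigned pkts = (slots_per_page * pages) / slots_per_pkt;
            printf("%u ring page(s): %u packets, %u KB outstanding\n",
                   pages, pkts, pkts * pkt_kb);
        }
        return 0;
    }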


B) Producer and consumer indexes are on the same cache line
On present hardware that means the reader and writer will compete for
the same cacheline, causing it to ping-pong between sockets.
This can be solved by having a feature-split-indexes (or a better name)
where the req_prod and req_event tuple is kept separate from the
rsp_prod and rsp_event tuple. This would entail using 128 bytes at the
start of the ring - one cacheline for each tuple.
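
A minimal sketch of such a layout (the struct and names are
illustrative, not an agreed interface; 64 bytes is the usual x86
cacheline size):

    #include <stdint.h>

    /* Hypothetical feature-split-indexes header: the request-side tuple
     * lives on one 64-byte cacheline and the response-side tuple on
     * another, instead of all four indexes sharing a single line. */
    struct split_idx_ring_hdr {
        struct {
            uint32_t req_prod;
            uint32_t req_event;
        } req __attribute__((aligned(64)));   /* request-side indexes  */
        struct {
            uint32_t rsp_prod;
            uint32_t rsp_event;
        } rsp __attribute__((aligned(64)));   /* response-side indexes */
        /* 128 bytes used before the request/response slots start. */
    };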


C) Cache alignment of requests
The fix is to make the request structures cache-aligned. For networking
that means making the request 16 bytes; for block, 64 bytes.
Since this does not shrink the structure but only expands it, it could
be called feature-align-slot.
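
For example, a padded TX request slot might look like the following
(field names follow the existing netif_tx_request; the 16-byte padding
is the illustrative part):

    #include <stdint.h>

    /* Hypothetical feature-align-slot variant of the 12-byte TX request,
     * padded to 16 bytes so slots divide a 64-byte cacheline evenly. */
    struct netif_tx_request_aligned {
        uint32_t gref;      /* grant reference for the packet page  */
        uint16_t offset;    /* offset within the granted page       */
        uint16_t flags;     /* NETTXF_* flags                       */
        uint16_t id;        /* echoed back in the response          */
        uint16_t size;      /* packet/fragment size in bytes        */
        uint32_t pad;       /* 12 -> 16 bytes                       */
    };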


E) Multiqueue (request-feature-multiqueue)
It means creating many TX and RX rings for each vif.
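
A rough sketch of the shape this could take (all names are
illustrative, not a proposed interface):

    /* Illustrative only: each vif carries several independent queues,
     * each with its own TX/RX ring and event channel(s), so several
     * vCPUs can push traffic without serialising on one ring. */
    struct vif_queue {
        void *tx_ring_page;      /* grant-shared TX ring for this queue */
        void *rx_ring_page;      /* grant-shared RX ring for this queue */
        unsigned int tx_evtchn;  /* per-queue TX event channel */
        unsigned int rx_evtchn;  /* per-queue RX event channel */
    };

    struct vif {
        struct vif_queue *queues;   /* one entry per negotiated queue */
        unsigned int num_queues;    /* agreed via xenstore at connect time */
    };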


F) Don't gnt_copy all of the requests
Instead, don't touch them and let the Xen IOMMU create the appropriate
entries. This would require the DMA API in dom0 to be aware of whether
the grant mapping has been done, and if not (i.e. the page is FOREIGN,
with no m2p_override), to issue a hypercall telling the hypervisor that
this grant is going to be used by a specific PCI device. This would
create the IOMMU entry in Xen.
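
A pseudo-code sketch of what the dom0 side might look like; the
structure and grant-table sub-op below are hypothetical, they do not
exist today:

    /* Hypothetical - neither this structure nor the GNTTABOP_iommu_map
     * sub-op exist; they are sketched here only to illustrate the idea. */
    struct gnttab_iommu_map {
        grant_ref_t ref;      /* grant handed to us by the frontend  */
        domid_t     domid;    /* domain owning the granted page      */
        uint32_t    sbdf;     /* PCI device that will DMA to/from it */
    };

    static int tell_xen_about_dma(grant_ref_t gref, domid_t granter,
                                  uint32_t pci_sbdf)
    {
        struct gnttab_iommu_map op = {
            .ref = gref, .domid = granter, .sbdf = pci_sbdf,
        };

        /* Ask Xen to install the IOMMU entry for this grant instead of
         * dom0 mapping or copying the page itself. */
        return HYPERVISOR_grant_table_op(GNTTABOP_iommu_map, &op, 1);
    }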


G) On TX side, do persistent grant mapping
This would only be done on the frontend -> backend path.  That means
that we could exhaust the initial domain's memory.
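
A backend-side sketch of such a persistent pool (the linear table and
explicit cap are illustrative assumptions):

    /* Sketch only: map each TX grant once, then reuse the mapping.  The
     * pool must be capped, since every entry pins a page of the initial
     * domain's address space/memory - the exhaustion risk noted above. */
    struct persistent_gnt {
        grant_ref_t    gref;     /* frontend's grant reference */
        grant_handle_t handle;   /* for the eventual unmap     */
        void          *vaddr;    /* backend-side mapping       */
    };

    static void *persistent_lookup(struct persistent_gnt *pool,
                                   unsigned int nr, grant_ref_t gref)
    {
        unsigned int i;

        for (i = 0; i < nr; i++)
            if (pool[i].gref == gref)
                return pool[i].vaddr;   /* already mapped: reuse it */
        return NULL;  /* caller maps it (gnttab_map_refs) and adds it,
                         unless the per-vif cap has been reached */
    }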


H) Affinity of the frontend and backend being on the same NUMA node
This touches upon the discussion about NUMA and having PV guests be
aware of memory layout. It also means that each backend kthread needs to
be on a different NUMA node.


I) Separate request and response rings for TX and RX


J) Map the whole physical memory of the machine in dom0
If mapping/unmapping or copying slows us down, could we just keep the
whole physical memory of the machine mapped in dom0 (with corresponding
IOMMU entries)?
At that point the frontend could just pass mfn numbers to the backend,
and the backend would already have them mapped.
From a security perspective it doesn't change anything when running
the backend in dom0, because dom0 is already capable of mapping random
pages of any guests. QEMU instances do that all the time.
But it would take away one of the benefits of deploying driver domains:
we wouldn't be able to run the backends at a lower privilege level.
However it might still be worth considering as an option? The backend is
still trusted and protected from the frontend, but the frontend wouldn't
be protected from the backend.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-20 14:08 [Hackathon minutes] PV network improvements Stefano Stabellini
@ 2013-05-20 14:49 ` George Dunlap
  2013-05-20 18:33   ` Wei Liu
  2013-05-20 18:31 ` Wei Liu
  2013-05-20 19:36 ` annie li
  2 siblings, 1 reply; 20+ messages in thread
From: George Dunlap @ 2013-05-20 14:49 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel

On Mon, May 20, 2013 at 3:08 PM, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> Hi all,
> these are Konrad's and my notes (mostly Konrad's) on possible
> improvements of the PV network protocol, taken at the Hackathon.
>
>
> A) Network bandwidth: multipage rings
> The max outstanding amount of data the it can have is 898kB (64K of
> data use 18 slot, out of 256. 256 / 18 = 14, 14 * 64KB).  This can be
> expanded by having multi-page to expand the ring. This would benefit NFS
> and bulk data transfer (such as netperf data).
>
>
> B) Producer and consumer index is on the same cache line
> In present hardware that means the reader and writer will compete for
> the same cacheline causing a ping-pong between sockets.
> This can be solved by having a feature-split-indexes (or better name)
> where the req_prod and req_event as a tuple are different from the
> rsp_prod and rsp_prod. This would entail using 128bytes of the ring at
> the start - each cacheline for each tuple.
>
>
> C)  Cache alignment of requests
> The fix is to make the request structures more cache-aligned. For
> networking that means making it 16 bytes and block 64 bytes.
> Since it does not shrink the structure but just expands it, could be
> called feature-align-slot.
>
>
> E) Multiqueue (request-feature-multiqueue)
> It means creating many TX and RX rings for each vif.
>
>
> F) don't gnt_copy all of the requests
> Instead don't touch them and let the Xen IOMMU create appropriate
> entries. This would require the DMA API in dom0 to be aware whether the
> grant has been done and if not (so FOREIGN, aka no m2p_override), then
> do the hypercall to tell the hypervisor that this grant is going to be
> used by a specific PCI device. This would create the IOMMU entry in Xen.
>
>
> G) On TX side, do persistent grant mapping
> This would only be done from frontend -> backend path.  That means that
> we could exhaust initial domains memory.
>
>
> H) Affinity of the frontend and backend being on the same NUMA node
> This touches upon the discussion about NUMA and having PV guests be
> aware of memory layout. It also means that each backend kthread needs to
> be on a different NUMA node.
>
>
> I) separate request and response rings for TX and RX
>
>
> J) Map the whole physical memory of the machine in dom0
> If mapping/unmapping or copying slows us down, could we just keep the
> whole physical memory of the machine mapped in dom0 (with corresponding
> IOMMU entries)?
> At that point the frontend could just pass mfn numbers to the backend,
> and the backend would already have them mapped.
> From a security perspective it doesn't change anything when running
> the backend in dom0, because dom0 is already capable of mapping random
> pages of any guests. QEMU instances do that all the time.
> But it would take away one of the benefits of deploying driver domains:
> we wouldn't be able to run the backends at a lower privilege level.
> However it might still be worth considering as an option? The backend is
> still trusted and protected from the frontend, but the frontend wouldn't
> be protected from the backend.

What's missing from this is my side of the discussion:

I was saying that if TLB flushes from grant-unmap are indeed the
problem, then maybe we could put the *front-end* in charge of
requesting a TLB flush for its pages.  The strict TLB flushing is to
protect a frontend from rogue back-ends reading sensitive data;
if the front-end were willing to just not use the pages for a short
amount of time, and issue a flush say every second or so, that would
reduce the TLB flushes greatly while maintaining the safety advantages
of driver domains.
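
A very rough sketch of what that could look like on the frontend side
(the flush hypercall and the helpers below are hypothetical, and the
one-second period is just the example figure above):

    /* Hypothetical sketch.  Pages the backend has finished with are
     * quarantined rather than reused; a periodic flush makes them all
     * safe again in one batch instead of flushing per unmap. */
    static LIST_HEAD(quarantine);   /* pages recently handed to the backend */

    static void netfront_retire_page(struct page *page)
    {
        list_add_tail(&page->lru, &quarantine);    /* don't reuse it yet */
    }

    /* Run from a timer/workqueue, e.g. once a second. */
    static void netfront_periodic_flush(void)
    {
        xen_flush_stale_grant_mappings();    /* hypothetical hypercall
                                                wrapper: "my old grants are
                                                unmapped everywhere now"  */
        netfront_release_pages(&quarantine); /* hypothetical helper: hand
                                                the quarantined pages back */
    }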

 -George

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-20 14:08 [Hackathon minutes] PV network improvements Stefano Stabellini
  2013-05-20 14:49 ` George Dunlap
@ 2013-05-20 18:31 ` Wei Liu
  2013-05-21  8:31   ` Ian Campbell
  2013-05-21  9:26   ` Tim Deegan
  2013-05-20 19:36 ` annie li
  2 siblings, 2 replies; 20+ messages in thread
From: Wei Liu @ 2013-05-20 18:31 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, wei.liu2

On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> Hi all,
> these are Konrad's and my notes (mostly Konrad's) on possible
> improvements of the PV network protocol, taken at the Hackathon.
> 

Just for completeness, these items are future work items. I'm now
upstreaming my queues to lay a baseline for these items, which include:

1. split event channels support (generally useful)
2. netback global page pool (prerequisite for 1:1 model)
3. kthread + NAPI 1:1 model (prerequisite for multiqueue)

> 
> A) Network bandwidth: multipage rings
> The max outstanding amount of data the it can have is 898kB (64K of
> data use 18 slot, out of 256. 256 / 18 = 14, 14 * 64KB).  This can be
> expanded by having multi-page to expand the ring. This would benefit NFS
> and bulk data transfer (such as netperf data).
> 

This is in my queue as well. It's a generic change in the xenbus
interface which can benefit not only network but also block devices.

> 
[...]
> J) Map the whole physical memory of the machine in dom0
> If mapping/unmapping or copying slows us down, could we just keep the
> whole physical memory of the machine mapped in dom0 (with corresponding
> IOMMU entries)?
> At that point the frontend could just pass mfn numbers to the backend,
> and the backend would already have them mapped.
> From a security perspective it doesn't change anything when running
> the backend in dom0, because dom0 is already capable of mapping random
> pages of any guests. QEMU instances do that all the time.
> But it would take away one of the benefits of deploying driver domains:
> we wouldn't be able to run the backends at a lower privilege level.
> However it might still be worth considering as an option? The backend is
> still trusted and protected from the frontend, but the frontend wouldn't
> be protected from the backend.
> 

I think Dom0 mapping all machine memory is a good starting point. As for
the driver domain, can we not have a driver domain map all of its
target's machine memory? What's the security implication here?


Wei.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-20 14:49 ` George Dunlap
@ 2013-05-20 18:33   ` Wei Liu
  2013-05-21  8:22     ` Ian Campbell
  0 siblings, 1 reply; 20+ messages in thread
From: Wei Liu @ 2013-05-20 18:33 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, wei.liu2, Stefano Stabellini

On Mon, May 20, 2013 at 03:49:32PM +0100, George Dunlap wrote:
[...]
> > J) Map the whole physical memory of the machine in dom0
> > If mapping/unmapping or copying slows us down, could we just keep the
> > whole physical memory of the machine mapped in dom0 (with corresponding
> > IOMMU entries)?
> > At that point the frontend could just pass mfn numbers to the backend,
> > and the backend would already have them mapped.
> > From a security perspective it doesn't change anything when running
> > the backend in dom0, because dom0 is already capable of mapping random
> > pages of any guests. QEMU instances do that all the time.
> > But it would take away one of the benefits of deploying driver domains:
> > we wouldn't be able to run the backends at a lower privilege level.
> > However it might still be worth considering as an option? The backend is
> > still trusted and protected from the frontend, but the frontend wouldn't
> > be protected from the backend.
> 
> What's missing from this was my side of the discussion:
> 
> I was saying that if TLB flushes from grant-unmap is indeed the
> problem, then maybe we could have the *front-end* in charge of
> requesting a TLB flush for its pages.  The strict TLB flushing is to
> protect a frontend from rogue back-ends from reading sensitive data;
> if the front-end were willing to just not use the pages for a short
> amount of time, and issue a flush say every second or so, that would
> reduce the TLB flushes greatly while maintaining the safety advantages
> of driver domains.
> 

I'm not sure I get what you mean here. Are you saying DomU flushes
Dom0's TLB entries?


Wei.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-20 14:08 [Hackathon minutes] PV network improvements Stefano Stabellini
  2013-05-20 14:49 ` George Dunlap
  2013-05-20 18:31 ` Wei Liu
@ 2013-05-20 19:36 ` annie li
  2 siblings, 0 replies; 20+ messages in thread
From: annie li @ 2013-05-20 19:36 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel


On 2013-5-20 10:08, Stefano Stabellini wrote:
> Hi all,
> these are Konrad's and my notes (mostly Konrad's) on possible
> improvements of the PV network protocol, taken at the Hackathon.
>
> [...]
>
> G) On TX side, do persistent grant mapping
> This would only be done from frontend -> backend path.  That means that
> we could exhaust initial domains memory.

I did some persistent grant mapping patches on both TX and RX sides a
while ago, and could keep TX persistent and optimize it.

Thanks
Annie

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-20 18:33   ` Wei Liu
@ 2013-05-21  8:22     ` Ian Campbell
  2013-05-21  8:31       ` George Dunlap
  0 siblings, 1 reply; 20+ messages in thread
From: Ian Campbell @ 2013-05-21  8:22 UTC (permalink / raw)
  To: Wei Liu; +Cc: George Dunlap, xen-devel, Stefano Stabellini

On Mon, 2013-05-20 at 19:33 +0100, Wei Liu wrote:
> On Mon, May 20, 2013 at 03:49:32PM +0100, George Dunlap wrote:
> [...]
> > > J) Map the whole physical memory of the machine in dom0
> > > If mapping/unmapping or copying slows us down, could we just keep the
> > > whole physical memory of the machine mapped in dom0 (with corresponding
> > > IOMMU entries)?
> > > At that point the frontend could just pass mfn numbers to the backend,
> > > and the backend would already have them mapped.
> > > From a security perspective it doesn't change anything when running
> > > the backend in dom0, because dom0 is already capable of mapping random
> > > pages of any guests. QEMU instances do that all the time.
> > > But it would take away one of the benefits of deploying driver domains:
> > > we wouldn't be able to run the backends at a lower privilege level.
> > > However it might still be worth considering as an option? The backend is
> > > still trusted and protected from the frontend, but the frontend wouldn't
> > > be protected from the backend.
> > 
> > What's missing from this was my side of the discussion:
> > 
> > I was saying that if TLB flushes from grant-unmap is indeed the
> > problem, then maybe we could have the *front-end* in charge of
> > requesting a TLB flush for its pages.  The strict TLB flushing is to
> > protect a frontend from rogue back-ends from reading sensitive data;
> > if the front-end were willing to just not use the pages for a short
> > amount of time, and issue a flush say every second or so, that would
> > reduce the TLB flushes greatly while maintaining the safety advantages
> > of driver domains.
> > 
> 
> I'm not sure I get what you mean here. Are you saying DomU flushes
> Dom0's TLB entries?

The gnt_unmap made by dom0 needs to flush the TLB of any physical
processor which may have seen the mapping, which means approximately all
dom0 vcpus.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-21  8:22     ` Ian Campbell
@ 2013-05-21  8:31       ` George Dunlap
  0 siblings, 0 replies; 20+ messages in thread
From: George Dunlap @ 2013-05-21  8:31 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Wei Liu, Stefano Stabellini

On 05/21/2013 09:22 AM, Ian Campbell wrote:
> On Mon, 2013-05-20 at 19:33 +0100, Wei Liu wrote:
>> On Mon, May 20, 2013 at 03:49:32PM +0100, George Dunlap wrote:
>> [...]
>>>> J) Map the whole physical memory of the machine in dom0
>>>> If mapping/unmapping or copying slows us down, could we just keep the
>>>> whole physical memory of the machine mapped in dom0 (with corresponding
>>>> IOMMU entries)?
>>>> At that point the frontend could just pass mfn numbers to the backend,
>>>> and the backend would already have them mapped.
>>>> From a security perspective it doesn't change anything when running
>>>> the backend in dom0, because dom0 is already capable of mapping random
>>>> pages of any guests. QEMU instances do that all the time.
>>>> But it would take away one of the benefits of deploying driver domains:
>>>> we wouldn't be able to run the backends at a lower privilege level.
>>>> However it might still be worth considering as an option? The backend is
>>>> still trusted and protected from the frontend, but the frontend wouldn't
>>>> be protected from the backend.
>>>
>>> What's missing from this was my side of the discussion:
>>>
>>> I was saying that if TLB flushes from grant-unmap is indeed the
>>> problem, then maybe we could have the *front-end* in charge of
>>> requesting a TLB flush for its pages.  The strict TLB flushing is to
>>> protect a frontend from rogue back-ends from reading sensitive data;
>>> if the front-end were willing to just not use the pages for a short
>>> amount of time, and issue a flush say every second or so, that would
>>> reduce the TLB flushes greatly while maintaining the safety advantages
>>> of driver domains.
>>>
>>
>> I'm not sure I get what you mean here. Are you saying DomU flushes
>> Dom0's TLB entries?
>
> The gnt_unmap made by dom0 needs to flush the TLB of any physical
> processor which may have seen the mapping, which means approximately all
> dom0 vcpus.

That's what I was getting at.  It's Xen that does any actual TLB
flushes, and for now the "promise" to the front-end is that the page is
safe from the backend* after the transaction is done.  But it would be
nicer if we could batch these flushes to happen once every few hundred
milliseconds, or even once a second.  If we allowed the front-end to
opt into a new interface, which said "the page is safe from the backend
once you have made this hypercall", then guests could choose the
"window" size based on their own parameters.

* Remember that the point of grant maps isn't to allow *dom0* access to 
the guests; dom0 already has all the access it needs.  It's to allow 
driver domains access to the guests.

  -George

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-20 18:31 ` Wei Liu
@ 2013-05-21  8:31   ` Ian Campbell
  2013-05-21  9:26   ` Tim Deegan
  1 sibling, 0 replies; 20+ messages in thread
From: Ian Campbell @ 2013-05-21  8:31 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, Stefano Stabellini

On Mon, 2013-05-20 at 19:31 +0100, Wei Liu wrote:
> On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> [...]
> > J) Map the whole physical memory of the machine in dom0
> > If mapping/unmapping or copying slows us down, could we just keep the
> > whole physical memory of the machine mapped in dom0 (with corresponding
> > IOMMU entries)?
> > At that point the frontend could just pass mfn numbers to the backend,
> > and the backend would already have them mapped.
> > From a security perspective it doesn't change anything when running
> > the backend in dom0, because dom0 is already capable of mapping random
> > pages of any guests. QEMU instances do that all the time.

Actually there are mechanisms in place to remove this privilege from
dom0, specifically there is an XSM class (terminology?) for
non-migratable domains which effectively equates to exactly this
restriction. Of course you need stub qemu too.

> > But it would take away one of the benefits of deploying driver domains:
> > we wouldn't be able to run the backends at a lower privilege level.
> > However it might still be worth considering as an option? The backend is
> > still trusted and protected from the frontend, but the frontend wouldn't
> > be protected from the backend.
> > 
> 
> I think Dom0 mapping all machine memory is a good starting point. As for
> the driver domain, can we not have a driver domain mapped all of its
> target's machine memory? What's the security implication here?

It gives the driver domain an enormous amount of privilege which it
doesn't require and which it could use to compromise the integrity of
the system (i.e. to snoop any guest's memory and extract "secrets"). It
reduces our security/isolation story to "effectively equivalent to
KVM", and this isolation is one of the big selling points
for Xen. I don't think we should go down this path either for dom0 or
driver domains and I am absolutely positive that there are other
approaches we should be investigating before we even start to consider
it.

George's idea of not flushing at unmap time, with co-operation from the
frontend to not reuse the pages until it has batched up a bigger flush,
seems like an interesting one to look into. By choosing the sizes and
times correctly it may even be that, by the time domU wants to reuse the
page, the TLB has already been flushed for some other reason (context
switch etc) and the hypervisor can elide the expense.

There are probably mechanisms in the guest kernels which allow us to
hold on to memory but still provide a memory pressure hook so we can
flush immediately instead of OOMing.
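
For example, the shrinker interface could provide that hook; a sketch
(the flush/release helpers are hypothetical, and whether a shrinker is
the right fit is an open question):

    /* Sketch (assumption): a shrinker that flushes and releases the
     * quarantined pages as soon as the kernel signals memory pressure,
     * so deferring the TLB flush cannot push the guest into OOM. */
    static int quarantine_shrink(struct shrinker *s,
                                 struct shrink_control *sc)
    {
        if (sc->nr_to_scan)                    /* asked to free memory  */
            netfront_flush_and_release_all();  /* hypothetical helper   */
        return quarantined_page_count();       /* hypothetical counter  */
    }

    static struct shrinker quarantine_shrinker = {
        .shrink = quarantine_shrink,
        .seeks  = DEFAULT_SEEKS,
    };

    /* register_shrinker(&quarantine_shrinker) at connect time. */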

Ian.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-20 18:31 ` Wei Liu
  2013-05-21  8:31   ` Ian Campbell
@ 2013-05-21  9:26   ` Tim Deegan
  2013-05-21  9:39     ` Wei Liu
                       ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: Tim Deegan @ 2013-05-21  9:26 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, Stefano Stabellini

At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > J) Map the whole physical memory of the machine in dom0
> > If mapping/unmapping or copying slows us down, could we just keep the
> > whole physical memory of the machine mapped in dom0 (with corresponding
> > IOMMU entries)?
> > At that point the frontend could just pass mfn numbers to the backend,
> > and the backend would already have them mapped.
> > From a security perspective it doesn't change anything when running
> > the backend in dom0, because dom0 is already capable of mapping random
> > pages of any guests. QEMU instances do that all the time.
> > But it would take away one of the benefits of deploying driver domains:
> > we wouldn't be able to run the backends at a lower privilege level.
> > However it might still be worth considering as an option? The backend is
> > still trusted and protected from the frontend, but the frontend wouldn't
> > be protected from the backend.
> > 
> 
> I think Dom0 mapping all machine memory is a good starting point.

I _strongly_ disagree.  The opportunity for disaggregation and reduction
of privilege in backends is probably Xen's biggest technical advantage
and we should not be taking any backward steps there.

> As for the driver domain, can we not have a driver domain mapped all
> of its target's machine memory? What's the security implication here?

If, say, a network driver domain is compromised it's the difference
between intercepting network traffic and total control of the OS.
It's probably worth reading some of the Xen papers about this stuff,
if you haven't already:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.103.6391
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.229.3708

Tim.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-21  9:26   ` Tim Deegan
@ 2013-05-21  9:39     ` Wei Liu
  2013-05-21 10:11       ` Tim Deegan
  2013-05-21 10:01     ` George Dunlap
  2013-05-21 10:51     ` Stefano Stabellini
  2 siblings, 1 reply; 20+ messages in thread
From: Wei Liu @ 2013-05-21  9:39 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Wei Liu, Stefano Stabellini

On Tue, May 21, 2013 at 10:26:00AM +0100, Tim Deegan wrote:
> At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> > On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > > J) Map the whole physical memory of the machine in dom0
> > > If mapping/unmapping or copying slows us down, could we just keep the
> > > whole physical memory of the machine mapped in dom0 (with corresponding
> > > IOMMU entries)?
> > > At that point the frontend could just pass mfn numbers to the backend,
> > > and the backend would already have them mapped.
> > > From a security perspective it doesn't change anything when running
> > > the backend in dom0, because dom0 is already capable of mapping random
> > > pages of any guests. QEMU instances do that all the time.
> > > But it would take away one of the benefits of deploying driver domains:
> > > we wouldn't be able to run the backends at a lower privilege level.
> > > However it might still be worth considering as an option? The backend is
> > > still trusted and protected from the frontend, but the frontend wouldn't
> > > be protected from the backend.
> > > 
> > 
> > I think Dom0 mapping all machine memory is a good starting point.
> 
> I _strongly_ disagree.  The opportunity for disaggregation and reduction
> of privilege in backends is probably Xen's biggest techical advantage
> and we should not be taking any backward steps there.
> 

I agree with you that disaggregation and reduction of privilege is Xen's
biggest technical advantage.

Just to make clear, this idea was summarized from a discussion among
George, Stefano and me on the way back from the hackathon. We want to
see if things like mapping / unmapping incur a heavy performance
penalty. As it is really hard right now to identify the real
performance bottleneck, we would like to have a quick hack to see how
things work.

> > As for the driver domain, can we not have a driver domain mapped all
> > of its target's machine memory? What's the security implication here?
> 
> If, say, a network driver domain is compromised it's the difference
> between intercepting network traffic and total control of the OS.
> It's probably worth reading some of the Xen papers about this stuff,
> if you haven't already:
> 
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.103.6391
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.229.3708
> 

Thanks Tim. I read them before. :-)

We're just talking about some experimental things here, not something
that is set in stone and must be done in the future.


Wei.

> Tim.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-21  9:26   ` Tim Deegan
  2013-05-21  9:39     ` Wei Liu
@ 2013-05-21 10:01     ` George Dunlap
  2013-05-21 10:06       ` Wei Liu
  2013-05-21 10:51     ` Stefano Stabellini
  2 siblings, 1 reply; 20+ messages in thread
From: George Dunlap @ 2013-05-21 10:01 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Wei Liu, Stefano Stabellini

On Tue, May 21, 2013 at 10:26 AM, Tim Deegan <tim@xen.org> wrote:
> At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
>> On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
>> > J) Map the whole physical memory of the machine in dom0
>> > If mapping/unmapping or copying slows us down, could we just keep the
>> > whole physical memory of the machine mapped in dom0 (with corresponding
>> > IOMMU entries)?
>> > At that point the frontend could just pass mfn numbers to the backend,
>> > and the backend would already have them mapped.
>> > From a security perspective it doesn't change anything when running
>> > the backend in dom0, because dom0 is already capable of mapping random
>> > pages of any guests. QEMU instances do that all the time.
>> > But it would take away one of the benefits of deploying driver domains:
>> > we wouldn't be able to run the backends at a lower privilege level.
>> > However it might still be worth considering as an option? The backend is
>> > still trusted and protected from the frontend, but the frontend wouldn't
>> > be protected from the backend.
>> >
>>
>> I think Dom0 mapping all machine memory is a good starting point.
>
> I _strongly_ disagree.  The opportunity for disaggregation and reduction
> of privilege in backends is probably Xen's biggest techical advantage
> and we should not be taking any backward steps there.

I think Wei meant, "A good point to start the investigation".  If
having all the memory mapped doesn't give any performance advantage,
then a more complicated interface to avoid TLB flushes is most likely a
waste of time.  If it does, then we can try to see if we can find a way
to get performance without giving up security.

 -George

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-21 10:01     ` George Dunlap
@ 2013-05-21 10:06       ` Wei Liu
  0 siblings, 0 replies; 20+ messages in thread
From: Wei Liu @ 2013-05-21 10:06 UTC (permalink / raw)
  To: George Dunlap; +Cc: Wei Liu, xen-devel, Tim Deegan, Stefano Stabellini

On Tue, May 21, 2013 at 11:01:51AM +0100, George Dunlap wrote:
> On Tue, May 21, 2013 at 10:26 AM, Tim Deegan <tim@xen.org> wrote:
> > At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> >> On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> >> > J) Map the whole physical memory of the machine in dom0
> >> > If mapping/unmapping or copying slows us down, could we just keep the
> >> > whole physical memory of the machine mapped in dom0 (with corresponding
> >> > IOMMU entries)?
> >> > At that point the frontend could just pass mfn numbers to the backend,
> >> > and the backend would already have them mapped.
> >> > From a security perspective it doesn't change anything when running
> >> > the backend in dom0, because dom0 is already capable of mapping random
> >> > pages of any guests. QEMU instances do that all the time.
> >> > But it would take away one of the benefits of deploying driver domains:
> >> > we wouldn't be able to run the backends at a lower privilege level.
> >> > However it might still be worth considering as an option? The backend is
> >> > still trusted and protected from the frontend, but the frontend wouldn't
> >> > be protected from the backend.
> >> >
> >>
> >> I think Dom0 mapping all machine memory is a good starting point.
> >
> > I _strongly_ disagree.  The opportunity for disaggregation and reduction
> > of privilege in backends is probably Xen's biggest techical advantage
> > and we should not be taking any backward steps there.
> 
> I think Wei meant, "A good point to start the investigation".  If

Yes that's what I meant. :-)


Wei.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-21  9:39     ` Wei Liu
@ 2013-05-21 10:11       ` Tim Deegan
  0 siblings, 0 replies; 20+ messages in thread
From: Tim Deegan @ 2013-05-21 10:11 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, Stefano Stabellini

At 10:39 +0100 on 21 May (1369132774), Wei Liu wrote:
> On Tue, May 21, 2013 at 10:26:00AM +0100, Tim Deegan wrote:
> > At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> > > On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > > > J) Map the whole physical memory of the machine in dom0
> > > > If mapping/unmapping or copying slows us down, could we just keep the
> > > > whole physical memory of the machine mapped in dom0 (with corresponding
> > > > IOMMU entries)?
> > > > At that point the frontend could just pass mfn numbers to the backend,
> > > > and the backend would already have them mapped.
> > > > From a security perspective it doesn't change anything when running
> > > > the backend in dom0, because dom0 is already capable of mapping random
> > > > pages of any guests. QEMU instances do that all the time.
> > > > But it would take away one of the benefits of deploying driver domains:
> > > > we wouldn't be able to run the backends at a lower privilege level.
> > > > However it might still be worth considering as an option? The backend is
> > > > still trusted and protected from the frontend, but the frontend wouldn't
> > > > be protected from the backend.
> > > > 
> > > 
> > > I think Dom0 mapping all machine memory is a good starting point.
> > 
> > I _strongly_ disagree.  The opportunity for disaggregation and reduction
> > of privilege in backends is probably Xen's biggest techical advantage
> > and we should not be taking any backward steps there.
> > 
> 
> I agree with you that disaggregation and reduction of privilege is Xen's
> biggest technical advantage.
> 
> Just to make clear, this idea was summerized from a discussion among
> George, Stefano and I on the way back from hackathon. We want to see if
> things like mapping / unmapping incur heavy performance penalty.

Ah, I see. :)  As an experiment to measure the overheads it's obviously
a Good Thing.  I thought you were considering it as a _solution_ to the
perf problem!

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-21  9:26   ` Tim Deegan
  2013-05-21  9:39     ` Wei Liu
  2013-05-21 10:01     ` George Dunlap
@ 2013-05-21 10:51     ` Stefano Stabellini
  2013-05-21 12:52       ` Konrad Rzeszutek Wilk
  2013-05-21 13:42       ` Tim Deegan
  2 siblings, 2 replies; 20+ messages in thread
From: Stefano Stabellini @ 2013-05-21 10:51 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Wei Liu, Stefano Stabellini

On Tue, 21 May 2013, Tim Deegan wrote:
> At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> > On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > > J) Map the whole physical memory of the machine in dom0
> > > If mapping/unmapping or copying slows us down, could we just keep the
> > > whole physical memory of the machine mapped in dom0 (with corresponding
> > > IOMMU entries)?
> > > At that point the frontend could just pass mfn numbers to the backend,
> > > and the backend would already have them mapped.
> > > From a security perspective it doesn't change anything when running
> > > the backend in dom0, because dom0 is already capable of mapping random
> > > pages of any guests. QEMU instances do that all the time.
> > > But it would take away one of the benefits of deploying driver domains:
> > > we wouldn't be able to run the backends at a lower privilege level.
> > > However it might still be worth considering as an option? The backend is
> > > still trusted and protected from the frontend, but the frontend wouldn't
> > > be protected from the backend.
> > > 
> > 
> > I think Dom0 mapping all machine memory is a good starting point.
> 
> I _strongly_ disagree.  The opportunity for disaggregation and reduction
> of privilege in backends is probably Xen's biggest techical advantage
> and we should not be taking any backward steps there.

While I agree with you, as a matter of fact the vast majority of Xen
installations today do not use driver domains. That didn't stop them
from enjoying Xen so far. Moreover the frontend/backend interface
remains narrow and difficult to exploit, as it's not a fully emulated
interface (AHCI / virtio). The backend is still protected from the
frontend. Having the backend running non-privileged is a great bonus
and certainly required on a product that allows the user to install
third party driver domains. However if the driver domains are "trusted"
then I think they can also be trusted with a full memory map. After
all, that has been the case for all XenServer, OVM and SLES releases so far
AFAIK.

A hypothetical future Xen release could offer either increased security
(driver domains) or increased IO performance (backends with a full
physical memory map) and give the user a choice between the two. I am
pretty sure that a non-negligible number of people would make the
conscious choice to go for the performance option.
Why should we be the ones to force security down their throats?
After all it's all about what the users want from the project.

Obviously in an ideal world we would be able to offer both at the same
time, and maybe George's proposal is exactly what is going to achieve
that. But I was describing the case that requires us to make a choice.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-21 10:51     ` Stefano Stabellini
@ 2013-05-21 12:52       ` Konrad Rzeszutek Wilk
  2013-05-21 13:32         ` Stefano Stabellini
  2013-05-21 13:42       ` Tim Deegan
  1 sibling, 1 reply; 20+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-05-21 12:52 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Wei Liu, xen-devel, Tim Deegan

On Tue, May 21, 2013 at 11:51:03AM +0100, Stefano Stabellini wrote:
> On Tue, 21 May 2013, Tim Deegan wrote:
> > At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> > > On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > > > J) Map the whole physical memory of the machine in dom0
> > > > If mapping/unmapping or copying slows us down, could we just keep the
> > > > whole physical memory of the machine mapped in dom0 (with corresponding
> > > > IOMMU entries)?
> > > > At that point the frontend could just pass mfn numbers to the backend,
> > > > and the backend would already have them mapped.
> > > > From a security perspective it doesn't change anything when running
> > > > the backend in dom0, because dom0 is already capable of mapping random
> > > > pages of any guests. QEMU instances do that all the time.
> > > > But it would take away one of the benefits of deploying driver domains:
> > > > we wouldn't be able to run the backends at a lower privilege level.
> > > > However it might still be worth considering as an option? The backend is
> > > > still trusted and protected from the frontend, but the frontend wouldn't
> > > > be protected from the backend.
> > > > 
> > > 
> > > I think Dom0 mapping all machine memory is a good starting point.
> > 
> > I _strongly_ disagree.  The opportunity for disaggregation and reduction
> > of privilege in backends is probably Xen's biggest techical advantage
> > and we should not be taking any backward steps there.
> 
> While I agree with you, as a matter of fact the vast majority of Xen
> installations today do not use driver domains. That didn't stop them
> from enjoying Xen so far. Moreover the frontend/backend interface
> remains narrow and difficult to exploit, it's not a fully emulated
> interface (AHCI / virtio). The backend is still protected from the
> frontend. Having the backend running non-privileged is a great bonus
> and certainly required on a product that allows the user to install
> third party driver domains. However if the driver domains are "trusted"
> then I think they can also be trusted with a full memory map. After all
> it has been the case for all XenServer, OVM and SLES releases so far
> AFAIK.
> 
> An hypothetic future Xen release could offer both increased security
> (driver domains) or increased IO performances (backends with a full
> physical memory map) and give the user a choice between the two. I am
> pretty sure that a non-negligible amount of people would make the
> conscious choice to go for the performance option.
> Why should we be the ones to force security down their throats?
> After all it's all about what the users want from the project.
> 
> Obviously in an ideal world we would be able to offer both at the same
> time, and maybe George's proposal is exactly what is going to achieve
> that. But I was describing the case that requires us to make a choice.

CC-ing Mukesh here as driver domains have some relevance to PVH work.
Please also CC Malcolm here (I don't have his email).

I would say that perhaps a better option is to do both - as in retain
the security architecture Xen has _and_ also provide increased IO performance.

Concurrently everybody is also looking at both backend and frontend
having a persistent pool of grants. This means we set up a "window"
from either backend -> frontend or vice-versa that persists. Said
"window" is bolted in place for the lifetime of the guest. For
networking the kernel stack already copies the data from user-space
into the kernel, and copying within the kernel to specific pages mostly
hits the CPU cache. We need to exploit that and also make sure that the
path is not interrupted.
The grant_mapping on the TX side also looks like a nice path - we just
have to make sure that the networking API doesn't try to free the page
once the TX has been done (and this is where Ian's skb destructor work
would be beneficial).

For block it is a bit different as aio's are mapped from kernel to
user-space. But the neat thing there is that there is no need to
inspect the data when giving it to the DMA device (the exception is
DIF/DIX, which needs to calculate checksums). That is unless one needs
to do the xen_biovec_phys_mergeable check (to see if the next page is
contiguous and if so add new bios and copy the data in).

But with PVH and PVHVM driver domains, and also piggybacking on the work
that Malcolm is doing (Xen IOMMU), we can skip that check. (As the PFNs
for the guest would look contiguous.)

In essence we can do a lot:
 1). not copying or mapping grants if we detect that they are going to
     a DMA device.
 2). The 1) above + also use the Xen IOMMU to take care of setting the
     proper EPT entries for the pages that we need. This could be done
     as part of a grant_copy or grant_light_mapping in the hypervisor.
     This covers the case where we MUST copy some data into the other
     domain (say the Ethernet header); see the sketch below. Whether a
     copy is done or a light mapping is used (because the moment the
     device has done the DMA operation on the granted page we might as
     well remove the mapping - hence the "light" or maybe "expiring"
     grant).
 3). The 2) above + Intel QuickData (a DMA engine that uses the same
     L3 cache that PCI devices use) to keep the copied pages in the L3.
     This has the benefit that when the PCI device is instructed to
     fetch the data, it would do it from the L3 cache and be incredibly quick.
     This would be using the grant_copy, but instead of the hypervisor
     doing it, it instructs the Intel QuickData chipset to do it. Would
     require some form of asynchronous grant_copy mechanism.
 4). Variants of the above.
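
To make the header-copy case in 2) a bit more concrete, here is a
sketch using the existing GNTTABOP_copy interface; the backend plumbing
around it and the handling of the payload (light mapping / IOMMU entry)
are hypothetical and not shown:

    /* Sketch: copy just the Ethernet header out of the frontend's
     * granted page into a local buffer, leaving the payload to be
     * handed to the device by other means.  GNTTABOP_copy and
     * struct gnttab_copy are the existing grant-table interfaces. */
    static int copy_eth_header(domid_t frontend, grant_ref_t gref,
                               uint16_t offset, void *local_hdr)
    {
        struct gnttab_copy op = {
            .source.u.ref  = gref,                /* frontend's grant    */
            .source.domid  = frontend,
            .source.offset = offset,
            .dest.u.gmfn   = virt_to_mfn(local_hdr),
            .dest.domid    = DOMID_SELF,
            .dest.offset   = offset_in_page(local_hdr),
            .len           = ETH_HLEN,            /* header only         */
            .flags         = GNTCOPY_source_gref, /* source is a gref    */
        };

        HYPERVISOR_grant_table_op(GNTTABOP_copy, &op, 1);
        return op.status == GNTST_okay ? 0 : -EIO;
    }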

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-21 12:52       ` Konrad Rzeszutek Wilk
@ 2013-05-21 13:32         ` Stefano Stabellini
  0 siblings, 0 replies; 20+ messages in thread
From: Stefano Stabellini @ 2013-05-21 13:32 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Wei Liu, xen-devel, Tim Deegan, Stefano Stabellini

On Tue, 21 May 2013, Konrad Rzeszutek Wilk wrote:
> On Tue, May 21, 2013 at 11:51:03AM +0100, Stefano Stabellini wrote:
> > On Tue, 21 May 2013, Tim Deegan wrote:
> > > At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> > > > On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > > > > J) Map the whole physical memory of the machine in dom0
> > > > > If mapping/unmapping or copying slows us down, could we just keep the
> > > > > whole physical memory of the machine mapped in dom0 (with corresponding
> > > > > IOMMU entries)?
> > > > > At that point the frontend could just pass mfn numbers to the backend,
> > > > > and the backend would already have them mapped.
> > > > > From a security perspective it doesn't change anything when running
> > > > > the backend in dom0, because dom0 is already capable of mapping random
> > > > > pages of any guests. QEMU instances do that all the time.
> > > > > But it would take away one of the benefits of deploying driver domains:
> > > > > we wouldn't be able to run the backends at a lower privilege level.
> > > > > However it might still be worth considering as an option? The backend is
> > > > > still trusted and protected from the frontend, but the frontend wouldn't
> > > > > be protected from the backend.
> > > > > 
> > > > 
> > > > I think Dom0 mapping all machine memory is a good starting point.
> > > 
> > > I _strongly_ disagree.  The opportunity for disaggregation and reduction
> > > of privilege in backends is probably Xen's biggest techical advantage
> > > and we should not be taking any backward steps there.
> > 
> > While I agree with you, as a matter of fact the vast majority of Xen
> > installations today do not use driver domains. That didn't stop them
> > from enjoying Xen so far. Moreover the frontend/backend interface
> > remains narrow and difficult to exploit, it's not a fully emulated
> > interface (AHCI / virtio). The backend is still protected from the
> > frontend. Having the backend running non-privileged is a great bonus
> > and certainly required on a product that allows the user to install
> > third party driver domains. However if the driver domains are "trusted"
> > then I think they can also be trusted with a full memory map. After all
> > it has been the case for all XenServer, OVM and SLES releases so far
> > AFAIK.
> > 
> > An hypothetic future Xen release could offer both increased security
> > (driver domains) or increased IO performances (backends with a full
> > physical memory map) and give the user a choice between the two. I am
> > pretty sure that a non-negligible amount of people would make the
> > conscious choice to go for the performance option.
> > Why should we be the ones to force security down their throats?
> > After all it's all about what the users want from the project.
> > 
> > Obviously in an ideal world we would be able to offer both at the same
> > time, and maybe George's proposal is exactly what is going to achieve
> > that. But I was describing the case that requires us to make a choice.
> 
> CC-ing Mukesh here as driver domains have some relevance to PVH work.
> Please also CC Malcolm here (I don't have his email).
> 
> I would say that perhaps a better option is to do both - as in retain
> the security architecture Xen has _and_ also provide increased IO performance.

Of course that is the best option.

However I think that we should know exactly what the level of
performance would be if we had all the memory mapped in the backend domain all
the time. It would be very useful to understand what we need to
optimize.  It might turn out that the difference is not that much, and
we need to optimize something else. Or it might turn out that the
difference is huge even after all the optimizations you listed below.



> Concurently everybody is also looking at both backend and frontend having a
> persistent pool of grants. This means we do setup an "window" from either
> backend -> frontend or vice-versa that persists. Said "window" is bolted
> for the life-time of the guest. For networking the kernel stack already
> copies the pages from the user-space in the kernel and copying
> in the kernel to specific pages is mostly using the CPU cache. We need to
> exploit that and also make sure that the path is not interrupted.
> The grant_mapping on the TX side also looks a nice path - just have to
> make sure that the networking API don't try to free the page once the TX
> has been done (and this is where Ian's skb deconstructor would be beneficial).
> 
> For block it is a bit different as aio's are mapped from kernel to
> user-space. But the neat thing there is that there is no need to inspect
> the data - when giving it to the DMA device (the exception is DIF/DIX which
> need calculate checksums). That is unless one needs to do the
> xen_biovec_phys_mergeable (to check if the next page is contingous and
> if so add new bio's and copy the data in).
> 
> But with PVH and PVHVM driver domains, and also piggybacking on the work
> that Malcolm is doing (Xen IOMMU), we can skip that check. (As the PFNs
> for the guest would look contingous).
> 
> In essence we can do a lot:
>  1). not copying or mapping grants if we detect that they are going to
>      a DMA device.
>  2). The 1) above + also use the Xen IOMMU to take care of setting the
>      proper EPT entries for the pages that we need. This could be done
>      as part of a grant_copy or grant_light_mapping in the hypervisor. This is
>      a case were we MUST copy those pages in the other domain (say the
>      Ethernet header). Whether a copy is done or a light mapping
>      (b/c the moment the device does the DMA operation on the granted
>      page we might as well remove the mapping. Hence the "light" or
>      maybe "expiring" grant.
>  3). The 2) above + Intel QuickData (a DMA engine that uses the same
>      L3 cache that PCI devices use) to keep the copied pages in the L3.
>      This has the benefit that when the PCI device is instructed to
>      fetch the data, it would do it from the L3 cache and be incredibly quick.
>      This would be using the grant_copy, but instead of the hypervisor
>      doing it, it instructs the Intel QuickData chipset to do it. Would
>      require some form of asynchronous grant_copy mechanism.
>  4). Variants of the above.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-21 10:51     ` Stefano Stabellini
  2013-05-21 12:52       ` Konrad Rzeszutek Wilk
@ 2013-05-21 13:42       ` Tim Deegan
  2013-05-21 16:58         ` Stefano Stabellini
  1 sibling, 1 reply; 20+ messages in thread
From: Tim Deegan @ 2013-05-21 13:42 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, Wei Liu

At 11:51 +0100 on 21 May (1369137063), Stefano Stabellini wrote:
> On Tue, 21 May 2013, Tim Deegan wrote:
> > At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> > > On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > > > However it might still be worth considering as an option? The backend is
> > > > still trusted and protected from the frontend, but the frontend wouldn't
> > > > be protected from the backend.
> > > > 
> > > 
> > > I think Dom0 mapping all machine memory is a good starting point.
> > 
> > I _strongly_ disagree.  The opportunity for disaggregation and reduction
> > of privilege in backends is probably Xen's biggest techical advantage
> > and we should not be taking any backward steps there.
> 
> While I agree with you, as a matter of fact the vast majority of Xen
> installations today do not use driver domains.

Sure, and that's a bad thing, right?

> However if the driver domains are "trusted"
> then I think they can also be trusted with a full memory map. After all
> it has been the case for all XenServer, OVM and SLES releases so far
> AFAIK.

...and that's a bad thing, right? :)

> Obviously in an ideal world we would be able to offer both at the same
> time, and maybe George's proposal is exactly what is going to achieve
> that. But I was describing the case that requires us to make a choice.

Righto.  I don't think we need to worry about that yet.  You're all
smart engineers, and I've heard a bunch of good ideas flying around that
address the costs of mapping and unmapping in backends.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-21 13:42       ` Tim Deegan
@ 2013-05-21 16:58         ` Stefano Stabellini
  2013-05-22  9:52           ` Tim Deegan
  2013-05-22  9:55           ` Ian Campbell
  0 siblings, 2 replies; 20+ messages in thread
From: Stefano Stabellini @ 2013-05-21 16:58 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Wei Liu, Stefano Stabellini

On Tue, 21 May 2013, Tim Deegan wrote:
> At 11:51 +0100 on 21 May (1369137063), Stefano Stabellini wrote:
> > On Tue, 21 May 2013, Tim Deegan wrote:
> > > At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> > > > On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > > > > However it might still be worth considering as an option? The backend is
> > > > > still trusted and protected from the frontend, but the frontend wouldn't
> > > > > be protected from the backend.
> > > > > 
> > > > 
> > > > I think Dom0 mapping all machine memory is a good starting point.
> > > 
> > > I _strongly_ disagree.  The opportunity for disaggregation and reduction
> > > of privilege in backends is probably Xen's biggest techical advantage
> > > and we should not be taking any backward steps there.
> > 
> > While I agree with you, as a matter of fact the vast majority of Xen
> > installations today do not use driver domains.
> 
> Sure, and that's a bad thing, right?
>
> > However if the driver domains are "trusted"
> > then I think they can also be trusted with a full memory map. After all
> > it has been the case for all XenServer, OVM and SLES releases so far
> > AFAIK.
> 
> ...and that's a bad thing, right? :)

It's a good thing: even though it could be better, our users don't seem
to mind. :)


> > Obviously in an ideal world we would be able to offer both at the same
> > time, and maybe George's proposal is exactly what is going to achieve
> > that. But I was describing the case that requires us to make a choice.
> 
> Righto.  I don't think we need to worry about that yet.  You're all
> smart engineers, and I've heard a bunch of good ideas flying around that
> address the costs of mapping and unmapping in backends.

Right. I would consider the performance of "backend with all the memory
mapped" as the limit we should try to achieve even without having all
the memory mapped. But if it turns out that we are very far from it, we
might want to consider allowing it as an option in the meantime.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-21 16:58         ` Stefano Stabellini
@ 2013-05-22  9:52           ` Tim Deegan
  2013-05-22  9:55           ` Ian Campbell
  1 sibling, 0 replies; 20+ messages in thread
From: Tim Deegan @ 2013-05-22  9:52 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, Wei Liu

At 17:58 +0100 on 21 May (1369159125), Stefano Stabellini wrote:
> On Tue, 21 May 2013, Tim Deegan wrote:
> > At 11:51 +0100 on 21 May (1369137063), Stefano Stabellini wrote:
> > > Obviously in an ideal world we would be able to offer both at the same
> > > time, and maybe George's proposal is exactly what is going to achieve
> > > that. But I was describing the case that requires us to make a choice.
> > 
> > Righto.  I don't think we need to worry about that yet.  You're all
> > smart engineers, and I've heard a bunch of good ideas flying around that
> > address the costs of mapping and unmapping in backends.
> 
> Right. I would consider the performance of "backend with all the memory
> mapped" as the limit we should try to achieve even without having all
> the memory mapped.

Yes, absolutely.

> But if it turns out that we are very far from it, we
> might want to consider allowing it as an option in the meantime.

Understood, and I still strongly disagree.

Tim.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Hackathon minutes] PV network improvements
  2013-05-21 16:58         ` Stefano Stabellini
  2013-05-22  9:52           ` Tim Deegan
@ 2013-05-22  9:55           ` Ian Campbell
  1 sibling, 0 replies; 20+ messages in thread
From: Ian Campbell @ 2013-05-22  9:55 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Wei Liu, xen-devel, Tim Deegan

On Tue, 2013-05-21 at 17:58 +0100, Stefano Stabellini wrote:
> On Tue, 21 May 2013, Tim Deegan wrote:
> > At 11:51 +0100 on 21 May (1369137063), Stefano Stabellini wrote:
> > > On Tue, 21 May 2013, Tim Deegan wrote:
> > > > At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> > > > > On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > > > > > However it might still be worth considering as an option? The backend is
> > > > > > still trusted and protected from the frontend, but the frontend wouldn't
> > > > > > be protected from the backend.
> > > > > > 
> > > > > 
> > > > > I think Dom0 mapping all machine memory is a good starting point.
> > > > 
> > > > I _strongly_ disagree.  The opportunity for disaggregation and reduction
> > > > of privilege in backends is probably Xen's biggest techical advantage
> > > > and we should not be taking any backward steps there.
> > > 
> > > While I agree with you, as a matter of fact the vast majority of Xen
> > > installations today do not use driver domains.
> > 
> > Sure, and that's a bad thing, right?
> >
> > > However if the driver domains are "trusted"
> > > then I think they can also be trusted with a full memory map. After all
> > > it has been the case for all XenServer, OVM and SLES releases so far
> > > AFAIK.
> > 
> > ...and that's a bad thing, right? :)
> 
> It's a good thing: even though it could be better our users don't seem
> to mind. :)

At least in the case of XenServer they are, as you know, actively moving
towards disaggregating. Other users such as XenClient, Qubes OS, NSA etc
already do make use of disaggregation to a greater or lesser extent.

For the distros I think the lack of disaggregation is mostly our fault
as upstream for not making it easier to achieve, rather than a lack of
desire on the part of users. 

> > > Obviously in an ideal world we would be able to offer both at the same
> > > time, and maybe George's proposal is exactly what is going to achieve
> > > that. But I was describing the case that requires us to make a choice.
> > 
> > Righto.  I don't think we need to worry about that yet.  You're all
> > smart engineers, and I've heard a bunch of good ideas flying around that
> > address the costs of mapping and unmapping in backends.
> 
> Right. I would consider the performance of "backend with all the memory
> mapped" as the limit we should try to achieve even without having all
> the memory mapped. But if it turns out that we are very far from it, we
> might want to consider allowing it as an option in the meantime.

I think it is incredibly premature to be thinking about even considering
making this an option or anything other than a useful datapoint for
developers.

Ian.

^ permalink raw reply	[flat|nested] 20+ messages in thread
