* Interesting observation with network event notification and batching
@ 2013-06-12 10:14 Wei Liu
  2013-06-14 18:53 ` Konrad Rzeszutek Wilk
  2013-06-28 16:15 ` Wei Liu
  0 siblings, 2 replies; 27+ messages in thread
From: Wei Liu @ 2013-06-12 10:14 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, stefano.stabellini, konrad.wilk,
	annie.li, andrew.bennieston

Hi all

I'm hacking on netback, trying to identify whether TLB flushes cause a
heavy performance penalty on the Tx path. The hack is quite nasty (you
would not want to know, trust me).

Basically what is doesn't is, 1) alter network protocol to pass along
mfns instead of grant references, 2) when the backend sees a new mfn,
map it RO and cache it in its own address space.

With this hack we now have some sort of zero-copy TX path. The backend
doesn't need to issue any grant copy / map operations any more. When it
sees a new packet in the ring, it just picks up the pages in its own
address space, assembles the packet from those pages and passes it on to
the network stack.
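
Conceptually the backend side of the hack boils down to something like
the sketch below. This is illustrative only, not the actual patch; the
names (mfn_cache_entry, mfn_cache_lookup) are made up and error handling
is omitted:

    /* Keep a small mfn -> struct page cache in netback.  On a miss the
     * caller allocates a local page, repoints its PTE at the guest MFN
     * (read-only) and inserts it here. */
    #include <linux/hashtable.h>
    #include <linux/mm.h>

    struct mfn_cache_entry {
            unsigned long mfn;      /* guest MFN as passed on the ring */
            struct page *page;      /* local page now backed by that MFN */
            struct hlist_node node;
    };

    static DEFINE_HASHTABLE(mfn_cache, 10);

    static struct page *mfn_cache_lookup(unsigned long mfn)
    {
            struct mfn_cache_entry *e;

            hash_for_each_possible(mfn_cache, e, node, mfn)
                    if (e->mfn == mfn)
                            return e->page;

            return NULL;            /* miss: map and insert */
    }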

In theory this should boost performance, but in practice it is the other
way around. This hack makes Xen networking more than 50% slower than
before (OMG). Further investigation shows that with this hack the
batching ability is gone. Before the hack, netback batched around 64
slots per interrupt event; after the hack it only batches 3 slots per
interrupt event -- which is no batching at all, because we can expect
one packet to occupy 3 slots.

Time to have some figures (iperf from DomU to Dom0).

Before the hack, doing grant copy, throughput: 7.9 Gb/s, average slots
per batch 64.

After the hack, throughput: 2.5 Gb/s, average slots per batch 3.

After the hack, with 64 HYPERVISOR_xen_version calls (each just a
context switch into the hypervisor and back) added to the Tx path,
throughput: 3.2 Gb/s, average slots per batch 6.

After the hack, with 256 such calls added, throughput: 5.2 Gb/s, average
slots per batch 26.

After the hack, with 512 such calls added, throughput: 7.9 Gb/s, average
slots per batch 26.

After the hack, with 768 such calls added, throughput: 5.6 Gb/s, average
slots per batch 25.

After the hack, with 1024 such calls added, throughput: 4.4 Gb/s, average
slots per batch 25.

Average slots per batch is calculated as follows (sketched in code below):
 1. count total_slots processed from start of day
 2. count tx_count which is the number of tx_action function gets
    invoked
 3. avg_slots_per_tx = total_slots / tx_count
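
In code terms the accounting is just two counters bumped in the tx
processing path, roughly like this (names illustrative, not the actual
patch):

    static unsigned long total_slots; /* slots consumed since start of day */
    static unsigned long tx_count;    /* number of tx_action invocations   */

    /* in the tx_action path, once per invocation: */
    tx_count++;
    total_slots += slots_consumed_this_pass;

    /* avg_slots_per_tx = total_slots / tx_count */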

These counter-intuitive figures imply that there is something wrong with
the current batching mechanism. Probably we need to fine-tune the
batching behavior for network and play with the event pointers in the
ring (I'm actually looking into it now). It would be good to have some
input on this.

Konrad, IIRC you once mentioned you discovered something with event
notification, what's that?

To all, any thoughts?


Wei.


* Re: Interesting observation with network event notification and batching
  2013-06-12 10:14 Interesting observation with network event notification and batching Wei Liu
@ 2013-06-14 18:53 ` Konrad Rzeszutek Wilk
  2013-06-16  9:54   ` Wei Liu
  2013-06-16 12:46   ` Wei Liu
  2013-06-28 16:15 ` Wei Liu
  1 sibling, 2 replies; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-06-14 18:53 UTC (permalink / raw)
  To: Wei Liu
  Cc: annie.li, stefano.stabellini, andrew.bennieston, ian.campbell, xen-devel

On Wed, Jun 12, 2013 at 11:14:51AM +0100, Wei Liu wrote:
> Hi all
> 
> I'm hacking on a netback trying to identify whether TLB flushes causes
> heavy performance penalty on Tx path. The hack is quite nasty (you would
> not want to know, trust me).
> 
> Basically what is doesn't is, 1) alter network protocol to pass along

You probably meant: "what it does" ?

> mfns instead of grant references, 2) when the backend sees a new mfn,
> map it RO and cache it in its own address space.
> 
> With this hack, now we have some sort of zero-copy TX path. Backend
> doesn't need to issue any grant copy / map operation any more. When it
> sees a new packet in the ring, it just needs to pick up the pages
> in its own address space and assemble packets with those pages then pass
> the packet on to network stack.

Uh, so I'm not sure I understand the RO part. If dom0 is mapping it, won't
that trigger a PTE update? And doesn't somebody (either the guest or the
initial domain) do a grant mapping to let the hypervisor know it is
OK to map a grant?

Or is dom0 actually permitted to map the MFN of any guest without using
grants? In which case you are then using _PAGE_IOMAP
somewhere and setting up vmap entries with the MFNs that point to the
foreign domain - I think?

> 
> In theory this should boost performance, but in practice it is the other
> way around. This hack makes Xen network more than 50% slower than before
> (OMG). Further investigation shows that with this hack the batching
> ability is gone. Before this hack, netback batches like 64 slots in one

That is quite interesting.

> interrupt event, however after this hack, it only batches 3 slots in one
> interrupt event -- that's no batching at all because we can expect one
> packet to occupy 3 slots.

Right.
> 
> Time to have some figures (iperf from DomU to Dom0).
> 
> Before the hack, doing grant copy, throughput: 7.9 Gb/s, average slots
> per batch 64.
> 
> After the hack, throughput: 2.5 Gb/s, average slots per batch 3.
> 
> After the hack, adds in 64 HYPERVISOR_xen_version (it just does context
> switch into hypervisor) in Tx path, throughput: 3.2 Gb/s, average slots
> per batch 6.
> 
> After the hack, adds in 256 HYPERVISOR_xen_version (it just does context
> switch into hypervisor) in Tx path, throughput: 5.2 Gb/s, average slots
> per batch 26.
> 
> After the hack, adds in 512 HYPERVISOR_xen_version (it just does context
> switch into hypervisor) in Tx path, throughput: 7.9 Gb/s, average slots
> per batch 26.
> 
> After the hack, adds in 768 HYPERVISOR_xen_version (it just does context
> switch into hypervisor) in Tx path, throughput: 5.6 Gb/s, average slots
> per batch 25.
> 
> After the hack, adds in 1024 HYPERVISOR_xen_version (it just does context
> switch into hypervisor) in Tx path, throughput: 4.4 Gb/s, average slots
> per batch 25.
> 

How do you get it to do more HYPERVISOR_xen_version calls? Did you just add
a (for i = 1024; i > 0; i--) hypervisor_yield();

in netback?
> Average slots per batch is calculate as followed:
>  1. count total_slots processed from start of day
>  2. count tx_count which is the number of tx_action function gets
>     invoked
>  3. avg_slots_per_tx = total_slots / tx_count
> 
> The counter-intuition figures imply that there is something wrong with
> the currently batching mechanism. Probably we need to fine-tune the
> batching behavior for network and play with event pointers in the ring
> (actually I'm looking into it now). It would be good to have some input
> on this.

I am still unsure I understand how your changes would incur more
of the yields.
> 
> Konrad, IIRC you once mentioned you discovered something with event
> notification, what's that?

They were bizarre. I naively expected the number of physical NIC
interrupts to be around the same as the VIF's, or less. And I figured
that the number of interrupts would be constant regardless of the
size of the packets. In other words #packets == #interrupts.

In reality the number of interrupts the VIF had was about the same, while
for the NIC it would fluctuate. (I can't remember the details.)

But it was odd and I didn't go deeper into it to figure out what
was happening. And also to figure out whether for the VIF we could
do something so that #packets != #interrupts.  And hopefully some
mechanism to adjust things so that the number of interrupts per packet
would be lower (hand waving here).


* Re: Interesting observation with network event notification and batching
  2013-06-14 18:53 ` Konrad Rzeszutek Wilk
@ 2013-06-16  9:54   ` Wei Liu
  2013-06-17  9:38     ` Ian Campbell
  2013-06-16 12:46   ` Wei Liu
  1 sibling, 1 reply; 27+ messages in thread
From: Wei Liu @ 2013-06-16  9:54 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Wei Liu, ian.campbell, stefano.stabellini, xen-devel, annie.li,
	andrew.bennieston

On Fri, Jun 14, 2013 at 02:53:03PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 12, 2013 at 11:14:51AM +0100, Wei Liu wrote:
> > Hi all
> > 
> > I'm hacking on a netback trying to identify whether TLB flushes causes
> > heavy performance penalty on Tx path. The hack is quite nasty (you would
> > not want to know, trust me).
> > 
> > Basically what is doesn't is, 1) alter network protocol to pass along
> 
> You probably meant: "what it does" ?
> 

Oh yes! Muscle memory got me!

> > mfns instead of grant references, 2) when the backend sees a new mfn,
> > map it RO and cache it in its own address space.
> > 
> > With this hack, now we have some sort of zero-copy TX path. Backend
> > doesn't need to issue any grant copy / map operation any more. When it
> > sees a new packet in the ring, it just needs to pick up the pages
> > in its own address space and assemble packets with those pages then pass
> > the packet on to network stack.
> 
> Uh, so not sure I understand the RO part. If dom0 is mapping it won't
> that trigger a PTE update? And doesn't somebody (either the guest or
> initial domain) do a grant mapping to let the hypervisor know it is
> OK to map a grant?
> 

It is very easy to issue HYPERVISOR_mmu_update to alter Dom0's mappings,
because Dom0 is privileged.

> Or is dom0 actually permitted to map the MFN of any guest without using
> the grants? In which case you are then using the _PAGE_IOMAP
> somewhere and setting up vmap entries with the MFN's that point to the
> foreign domain - I think?
> 

Sort of, but I didn't use vmap, I used alloc_page to get actual pages.
Then I modified the underlying PTE to point to the MFN from netfront.
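
Roughly like this -- sketch only, not the actual code; locking, error
handling and TLB flushing are omitted, and the exact PTE flags are
glossed over:

    #include <linux/mm.h>
    #include <xen/interface/xen.h>
    #include <asm/xen/hypercall.h>
    #include <asm/xen/page.h>

    static int map_guest_mfn_ro(struct page *page, unsigned long gmfn,
                                domid_t otherend_id)
    {
            unsigned long va = (unsigned long)page_address(page);
            unsigned int level;
            pte_t *ptep = lookup_address(va, &level); /* PTE of the 1:1 map */
            struct mmu_update u;

            u.ptr = virt_to_machine(ptep).maddr | MMU_NORMAL_PT_UPDATE;
            u.val = ((u64)gmfn << PAGE_SHIFT) | pgprot_val(PAGE_KERNEL_RO);

            /* passing the frontend's domid lets Xen validate the foreign
             * mapping for a privileged domain */
            return HYPERVISOR_mmu_update(&u, 1, NULL, otherend_id);
    }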

> > 
> > In theory this should boost performance, but in practice it is the other
> > way around. This hack makes Xen network more than 50% slower than before
> > (OMG). Further investigation shows that with this hack the batching
> > ability is gone. Before this hack, netback batches like 64 slots in one
> 
> That is quite interesting.
> 
> > interrupt event, however after this hack, it only batches 3 slots in one
> > interrupt event -- that's no batching at all because we can expect one
> > packet to occupy 3 slots.
> 
> Right.
> > 
> > Time to have some figures (iperf from DomU to Dom0).
> > 
> > Before the hack, doing grant copy, throughput: 7.9 Gb/s, average slots
> > per batch 64.
> > 
> > After the hack, throughput: 2.5 Gb/s, average slots per batch 3.
> > 
> > After the hack, adds in 64 HYPERVISOR_xen_version (it just does context
> > switch into hypervisor) in Tx path, throughput: 3.2 Gb/s, average slots
> > per batch 6.
> > 
> > After the hack, adds in 256 HYPERVISOR_xen_version (it just does context
> > switch into hypervisor) in Tx path, throughput: 5.2 Gb/s, average slots
> > per batch 26.
> > 
> > After the hack, adds in 512 HYPERVISOR_xen_version (it just does context
> > switch into hypervisor) in Tx path, throughput: 7.9 Gb/s, average slots
> > per batch 26.
> > 
> > After the hack, adds in 768 HYPERVISOR_xen_version (it just does context
> > switch into hypervisor) in Tx path, throughput: 5.6 Gb/s, average slots
> > per batch 25.
> > 
> > After the hack, adds in 1024 HYPERVISOR_xen_version (it just does context
> > switch into hypervisor) in Tx path, throughput: 4.4 Gb/s, average slots
> > per batch 25.
> > 
> 
> How do you get it to do more HYPERVISR_xen_version? Did you just add
> a (for i = 1024; i>0;i--) hypervisor_yield();

 for (i = 0; i < X; i++) (void)HYPERVISOR_xen_version(0, NULL);

> 
> in netback?
> > Average slots per batch is calculate as followed:
> >  1. count total_slots processed from start of day
> >  2. count tx_count which is the number of tx_action function gets
> >     invoked
> >  3. avg_slots_per_tx = total_slots / tx_count
> > 
> > The counter-intuition figures imply that there is something wrong with
> > the currently batching mechanism. Probably we need to fine-tune the
> > batching behavior for network and play with event pointers in the ring
> > (actually I'm looking into it now). It would be good to have some input
> > on this.
> 
> I am still unsure I understand hwo your changes would incur more
> of the yields.

It's not yielding. At least that's not the purpose of that hypercall.
HYPERVISOR_xen_version(0, NULL) only does a guest -> hypervisor -> guest
context switch. The original purpose of HYPERVISOR_xen_version(0,
NULL) is to force the guest to check for pending events.

Since you mentioned yielding, I will also try doing an actual yield and
post the figures.

> > 
> > Konrad, IIRC you once mentioned you discovered something with event
> > notification, what's that?
> 
> They were bizzare. I naively expected some form of # of physical NIC 
> interrupts to be around the same as the VIF or less. And I figured
> that the amount of interrupts would be constant irregardless of the
> size of the packets. In other words #packets == #interrupts.
> 

It could be that the frontend notifies the backend for every packet it
sends. This is not desirable and I don't expect the ring to behave that
way.

> In reality the number of interrupts the VIF had was about the same while
> for the NIC it would fluctuate. (I can't remember the details).
> 

I'm not sure I understand you here. But for the NIC, if you see the
number of interrupts go from high to low, that's expected: when the NIC
has a very high interrupt rate it switches to polling mode.
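
(That's just the usual NAPI shape -- roughly the following, with struct
my_nic and the *_nic_irq / process_rx helpers being placeholders:)

    #include <linux/interrupt.h>
    #include <linux/netdevice.h>

    struct my_nic {
            struct napi_struct napi;
            /* ... */
    };

    static irqreturn_t nic_interrupt(int irq, void *dev_id)
    {
            struct my_nic *nic = dev_id;

            disable_nic_irq(nic);           /* placeholder helper */
            napi_schedule(&nic->napi);      /* switch to polling */
            return IRQ_HANDLED;
    }

    static int nic_poll(struct napi_struct *napi, int budget)
    {
            struct my_nic *nic = container_of(napi, struct my_nic, napi);
            int done = process_rx(nic, budget); /* placeholder helper */

            if (done < budget) {
                    napi_complete(napi);    /* back to interrupt mode */
                    enable_nic_irq(nic);
            }
            return done;
    }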

> But it was odd and I didn't go deeper in it to figure out what
> was happening. And also to figure out if for the VIF we could
> do something of #packets != #interrupts.  And hopefully some
> mechanism to adjust so that the amount of interrupts would
> be lesser per packets (hand waving here).

I'm trying to do this now.


Wei.


* Re: Interesting observation with network event notification and batching
  2013-06-14 18:53 ` Konrad Rzeszutek Wilk
  2013-06-16  9:54   ` Wei Liu
@ 2013-06-16 12:46   ` Wei Liu
  1 sibling, 0 replies; 27+ messages in thread
From: Wei Liu @ 2013-06-16 12:46 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Wei Liu, ian.campbell, stefano.stabellini, xen-devel, annie.li,
	andrew.bennieston

On Fri, Jun 14, 2013 at 02:53:03PM -0400, Konrad Rzeszutek Wilk wrote:
[...]> 
> How do you get it to do more HYPERVISR_xen_version? Did you just add
> a (for i = 1024; i>0;i--) hypervisor_yield();
> 

Here are the figures after replacing HYPERVISOR_xen_version(0, NULL) with
HYPERVISOR_sched_op(SCHEDOP_yield, NULL).
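
i.e. the stuffing loop in the Tx path becomes, roughly:

  for (i = 0; i < X; i++)
          (void)HYPERVISOR_sched_op(SCHEDOP_yield, NULL);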

64 HYPERVISOR_sched_op(SCHEDOP_yield, NULL) calls: throughput 5.15 Gb/s,
average slots per tx 25

128 HYPERVISOR_sched_op(SCHEDOP_yield, NULL) calls: throughput 7.75 Gb/s,
average slots per tx 26

512 HYPERVISOR_sched_op(SCHEDOP_yield, NULL) calls: throughput 1.74 Gb/s,
average slots per tx 18

1024 HYPERVISOR_sched_op(SCHEDOP_yield, NULL) calls: throughput 998 Mb/s,
average slots per tx 18

Please note that Dom0 and DomU run on different PCPUs.

I think this kind of behavior has something to do with the scheduler, but
fundamentally we should really fix the notification mechanism.


Wei.


* Re: Interesting observation with network event notification and batching
  2013-06-16  9:54   ` Wei Liu
@ 2013-06-17  9:38     ` Ian Campbell
  2013-06-17  9:56       ` Andrew Bennieston
                         ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Ian Campbell @ 2013-06-17  9:38 UTC (permalink / raw)
  To: Wei Liu
  Cc: annie.li, xen-devel, andrew.bennieston, stefano.stabellini,
	Konrad Rzeszutek Wilk

On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
> > > Konrad, IIRC you once mentioned you discovered something with event
> > > notification, what's that?
> > 
> > They were bizzare. I naively expected some form of # of physical NIC 
> > interrupts to be around the same as the VIF or less. And I figured
> > that the amount of interrupts would be constant irregardless of the
> > size of the packets. In other words #packets == #interrupts.
> > 
> 
> It could be that the frontend notifies the backend for every packet it
> sends. This is not desirable and I don't expect the ring to behave that
> way.

It is probably worth checking that things are working how we think they
should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
suitable points to maximise batching.

Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
loop right? This would push the req_event pointer to just after the last
request, meaning the next request enqueued by the frontend would cause a
notification -- even though the backend is actually still continuing to
process requests and would have picked up that packet without further
notification. In this case there is a fair bit of work left in the
backend for this iteration, i.e. plenty of opportunity for the frontend
to queue more requests.

The comments in ring.h say:
 *  These macros will set the req_event/rsp_event field to trigger a
 *  notification on the very next message that is enqueued. If you want to
 *  create batches of work (i.e., only receive a notification after several
 *  messages have been enqueued) then you will need to create a customised
 *  version of the FINAL_CHECK macro in your own code, which sets the event
 *  field appropriately.

Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
(and other similar loops) and add a FINAL check at the very end?
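
i.e. something along these lines in xen_netbk_tx_build_gops (untested
sketch; batch_full() stands in for whatever limit the loop already has):

    again:
	/* inside the work loop, just peek -- don't touch req_event */
	while (RING_HAS_UNCONSUMED_REQUESTS(&vif->tx) && !batch_full()) {
		/* ... consume requests, build grant ops, etc ... */
	}

	/* only when we are really about to stop, arm req_event so that the
	 * frontend's next request triggers a notification */
	RING_FINAL_CHECK_FOR_REQUESTS(&vif->tx, work_to_do);
	if (work_to_do)
		goto again;	/* a request raced in while we were finishing */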

> > But it was odd and I didn't go deeper in it to figure out what
> > was happening. And also to figure out if for the VIF we could
> > do something of #packets != #interrupts.  And hopefully some
> > mechanism to adjust so that the amount of interrupts would
> > be lesser per packets (hand waving here).
> 
> I'm trying to do this now.

What scheme do you have in mind?


* Re: Interesting observation with network event notification and batching
  2013-06-17  9:38     ` Ian Campbell
@ 2013-06-17  9:56       ` Andrew Bennieston
  2013-06-17 10:46         ` Wei Liu
  2013-06-17 10:06       ` Jan Beulich
  2013-06-17 10:35       ` Wei Liu
  2 siblings, 1 reply; 27+ messages in thread
From: Andrew Bennieston @ 2013-06-17  9:56 UTC (permalink / raw)
  To: Ian Campbell
  Cc: annie.li, xen-devel, Wei Liu, stefano.stabellini, Konrad Rzeszutek Wilk

On 17/06/13 10:38, Ian Campbell wrote:
> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
>>>> Konrad, IIRC you once mentioned you discovered something with event
>>>> notification, what's that?
>>>
>>> They were bizzare. I naively expected some form of # of physical NIC
>>> interrupts to be around the same as the VIF or less. And I figured
>>> that the amount of interrupts would be constant irregardless of the
>>> size of the packets. In other words #packets == #interrupts.
>>>
>>
>> It could be that the frontend notifies the backend for every packet it
>> sends. This is not desirable and I don't expect the ring to behave that
>> way.

I have observed this kind of behaviour during network performance tests 
in which I periodically checked the ring state during an iperf session. 
It looked to me like the frontend was sending notifications far too 
often, but that the backend was sending them very infrequently, so the 
Tx (from guest) ring was mostly empty and the Rx (to guest) ring was 
mostly full. This has the effect of both front and backend having to 
block occasionally waiting for the other end to clear or fill a ring, 
even though there is more data available.

My initial theory was that this was caused in part by the shared event 
channel, however I expect that Wei is testing on top of a kernel with 
his split event channel features?

>
> It is probably worth checking that things are working how we think they
> should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
> suitable points to maximise batching.
>
> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
> loop right? This would push the req_event pointer to just after the last
> request, meaning the net request enqueued by the frontend would cause a
> notification -- even though the backend is actually still continuing to
> process requests and would have picked up that packet without further
> notification. n this case there is a fair bit of work left in the
> backend for this iteration i.e. plenty of opportunity for the frontend
> to queue more requests.
>
> The comments in ring.h say:
>   *  These macros will set the req_event/rsp_event field to trigger a
>   *  notification on the very next message that is enqueued. If you want to
>   *  create batches of work (i.e., only receive a notification after several
>   *  messages have been enqueued) then you will need to create a customised
>   *  version of the FINAL_CHECK macro in your own code, which sets the event
>   *  field appropriately.
>
> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> (and other similar loops) and add a FINAL check at the very end?
>
>>> But it was odd and I didn't go deeper in it to figure out what
>>> was happening. And also to figure out if for the VIF we could
>>> do something of #packets != #interrupts.  And hopefully some
>>> mechanism to adjust so that the amount of interrupts would
>>> be lesser per packets (hand waving here).
>>
>> I'm trying to do this now.
>
> What scheme do you have in mind?

As I mentioned above, filling a ring completely appears to be almost as 
bad as sending too many notifications. The ideal scheme may involve 
trying to balance the ring at some "half-full" state, depending on the 
capacity for the front- and backends to process requests and responses.

Andrew.


* Re: Interesting observation with network event notification and batching
  2013-06-17  9:38     ` Ian Campbell
  2013-06-17  9:56       ` Andrew Bennieston
@ 2013-06-17 10:06       ` Jan Beulich
  2013-06-17 10:16         ` Ian Campbell
  2013-06-17 10:35       ` Wei Liu
  2 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2013-06-17 10:06 UTC (permalink / raw)
  To: Ian Campbell, Wei Liu
  Cc: annie.li, xen-devel, andrew.bennieston, Konrad Rzeszutek Wilk,
	stefano.stabellini

>>> On 17.06.13 at 11:38, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
>> > > Konrad, IIRC you once mentioned you discovered something with event
>> > > notification, what's that?
>> > 
>> > They were bizzare. I naively expected some form of # of physical NIC 
>> > interrupts to be around the same as the VIF or less. And I figured
>> > that the amount of interrupts would be constant irregardless of the
>> > size of the packets. In other words #packets == #interrupts.
>> > 
>> 
>> It could be that the frontend notifies the backend for every packet it
>> sends. This is not desirable and I don't expect the ring to behave that
>> way.
> 
> It is probably worth checking that things are working how we think they
> should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
> suitable points to maximise batching.
> 
> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
> loop right? This would push the req_event pointer to just after the last
> request, meaning the net request enqueued by the frontend would cause a
> notification -- even though the backend is actually still continuing to
> process requests and would have picked up that packet without further
> notification. n this case there is a fair bit of work left in the
> backend for this iteration i.e. plenty of opportunity for the frontend
> to queue more requests.
> 
> The comments in ring.h say:
>  *  These macros will set the req_event/rsp_event field to trigger a
>  *  notification on the very next message that is enqueued. If you want to
>  *  create batches of work (i.e., only receive a notification after several
>  *  messages have been enqueued) then you will need to create a customised
>  *  version of the FINAL_CHECK macro in your own code, which sets the event
>  *  field appropriately.
> 
> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> (and other similar loops) and add a FINAL check at the very end?

But then again the macro doesn't update req_event when there
are unconsumed requests already upon entry to the macro.
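
For reference, the macro is roughly (paraphrasing ring.h):

  #define RING_FINAL_CHECK_FOR_REQUESTS(_r, _work_to_do) do {      \
      (_work_to_do) = RING_HAS_UNCONSUMED_REQUESTS(_r);            \
      if (_work_to_do) break;   /* req_event left untouched here */\
      (_r)->sring->req_event = (_r)->req_cons + 1;                 \
      mb();                                                        \
      (_work_to_do) = RING_HAS_UNCONSUMED_REQUESTS(_r);            \
  } while (0)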

Jan


* Re: Interesting observation with network event notification and batching
  2013-06-17 10:06       ` Jan Beulich
@ 2013-06-17 10:16         ` Ian Campbell
  0 siblings, 0 replies; 27+ messages in thread
From: Ian Campbell @ 2013-06-17 10:16 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Konrad Rzeszutek Wilk, stefano.stabellini, xen-devel,
	annie.li, andrew.bennieston

On Mon, 2013-06-17 at 11:06 +0100, Jan Beulich wrote:
> >>> On 17.06.13 at 11:38, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
> >> > > Konrad, IIRC you once mentioned you discovered something with event
> >> > > notification, what's that?
> >> > 
> >> > They were bizzare. I naively expected some form of # of physical NIC 
> >> > interrupts to be around the same as the VIF or less. And I figured
> >> > that the amount of interrupts would be constant irregardless of the
> >> > size of the packets. In other words #packets == #interrupts.
> >> > 
> >> 
> >> It could be that the frontend notifies the backend for every packet it
> >> sends. This is not desirable and I don't expect the ring to behave that
> >> way.
> > 
> > It is probably worth checking that things are working how we think they
> > should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
> > netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
> > suitable points to maximise batching.
> > 
> > Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
> > loop right? This would push the req_event pointer to just after the last
> > request, meaning the net request enqueued by the frontend would cause a
> > notification -- even though the backend is actually still continuing to
> > process requests and would have picked up that packet without further
> > notification. n this case there is a fair bit of work left in the
> > backend for this iteration i.e. plenty of opportunity for the frontend
> > to queue more requests.
> > 
> > The comments in ring.h say:
> >  *  These macros will set the req_event/rsp_event field to trigger a
> >  *  notification on the very next message that is enqueued. If you want to
> >  *  create batches of work (i.e., only receive a notification after several
> >  *  messages have been enqueued) then you will need to create a customised
> >  *  version of the FINAL_CHECK macro in your own code, which sets the event
> >  *  field appropriately.
> > 
> > Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> > (and other similar loops) and add a FINAL check at the very end?
> 
> But then again the macro doesn't update req_event when there
> are unconsumed requests already upon entry to the macro.

My concern was that when we process the last request currently on the
ring we immediately move req_event forward, even though netback goes on
to do a bunch more work (including e.g. the grant copies) before looping
back and looking for more work. That's a potentially large window for the
frontend to enqueue, and then needlessly notify, a new packet.

It could potentially lead to the pathological case of every packet being
notified unnecessarily.

Ian.


* Re: Interesting observation with network event notification and batching
  2013-06-17  9:38     ` Ian Campbell
  2013-06-17  9:56       ` Andrew Bennieston
  2013-06-17 10:06       ` Jan Beulich
@ 2013-06-17 10:35       ` Wei Liu
  2013-06-17 11:34         ` annie li
  2 siblings, 1 reply; 27+ messages in thread
From: Wei Liu @ 2013-06-17 10:35 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, stefano.stabellini, Konrad Rzeszutek Wilk, xen-devel,
	annie.li, andrew.bennieston

On Mon, Jun 17, 2013 at 10:38:33AM +0100, Ian Campbell wrote:
> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
> > > > Konrad, IIRC you once mentioned you discovered something with event
> > > > notification, what's that?
> > > 
> > > They were bizzare. I naively expected some form of # of physical NIC 
> > > interrupts to be around the same as the VIF or less. And I figured
> > > that the amount of interrupts would be constant irregardless of the
> > > size of the packets. In other words #packets == #interrupts.
> > > 
> > 
> > It could be that the frontend notifies the backend for every packet it
> > sends. This is not desirable and I don't expect the ring to behave that
> > way.
> 
> It is probably worth checking that things are working how we think they
> should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
> suitable points to maximise batching.
> 
> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
> loop right? This would push the req_event pointer to just after the last
> request, meaning the net request enqueued by the frontend would cause a
> notification -- even though the backend is actually still continuing to
> process requests and would have picked up that packet without further
> notification. n this case there is a fair bit of work left in the
> backend for this iteration i.e. plenty of opportunity for the frontend
> to queue more requests.
> 
> The comments in ring.h say:
>  *  These macros will set the req_event/rsp_event field to trigger a
>  *  notification on the very next message that is enqueued. If you want to
>  *  create batches of work (i.e., only receive a notification after several
>  *  messages have been enqueued) then you will need to create a customised
>  *  version of the FINAL_CHECK macro in your own code, which sets the event
>  *  field appropriately.
> 
> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> (and other similar loops) and add a FINAL check at the very end?
> 
> > > But it was odd and I didn't go deeper in it to figure out what
> > > was happening. And also to figure out if for the VIF we could
> > > do something of #packets != #interrupts.  And hopefully some
> > > mechanism to adjust so that the amount of interrupts would
> > > be lesser per packets (hand waving here).
> > 
> > I'm trying to do this now.
> 
> What scheme do you have in mind?

Basically the one you mentioned above.

Playing with various event pointers now.


Wei.

> 


* Re: Interesting observation with network event notification and batching
  2013-06-17  9:56       ` Andrew Bennieston
@ 2013-06-17 10:46         ` Wei Liu
  2013-06-17 10:56           ` Andrew Bennieston
  0 siblings, 1 reply; 27+ messages in thread
From: Wei Liu @ 2013-06-17 10:46 UTC (permalink / raw)
  To: Andrew Bennieston
  Cc: Wei Liu, Ian Campbell, stefano.stabellini, Konrad Rzeszutek Wilk,
	xen-devel, annie.li

On Mon, Jun 17, 2013 at 10:56:12AM +0100, Andrew Bennieston wrote:
> On 17/06/13 10:38, Ian Campbell wrote:
> >On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
> >>>>Konrad, IIRC you once mentioned you discovered something with event
> >>>>notification, what's that?
> >>>
> >>>They were bizzare. I naively expected some form of # of physical NIC
> >>>interrupts to be around the same as the VIF or less. And I figured
> >>>that the amount of interrupts would be constant irregardless of the
> >>>size of the packets. In other words #packets == #interrupts.
> >>>
> >>
> >>It could be that the frontend notifies the backend for every packet it
> >>sends. This is not desirable and I don't expect the ring to behave that
> >>way.
> 
> I have observed this kind of behaviour during network performance
> tests in which I periodically checked the ring state during an iperf
> session. It looked to me like the frontend was sending notifications
> far too often, but that the backend was sending them very
> infrequently, so the Tx (from guest) ring was mostly empty and the
> Rx (to guest) ring was mostly full. This has the effect of both
> front and backend having to block occasionally waiting for the other
> end to clear or fill a ring, even though there is more data
> available.
> 
> My initial theory was that this was caused in part by the shared
> event channel, however I expect that Wei is testing on top of a
> kernel with his split event channel features?
> 

Yes, with split event channels.

And during the tests the interrupt counts differ hugely: the frontend's TX
interrupt count is a six-figure number while its RX count is only two
figures.

> >
> >It is probably worth checking that things are working how we think they
> >should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
> >netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
> >suitable points to maximise batching.
> >
> >Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
> >loop right? This would push the req_event pointer to just after the last
> >request, meaning the net request enqueued by the frontend would cause a
> >notification -- even though the backend is actually still continuing to
> >process requests and would have picked up that packet without further
> >notification. n this case there is a fair bit of work left in the
> >backend for this iteration i.e. plenty of opportunity for the frontend
> >to queue more requests.
> >
> >The comments in ring.h say:
> >  *  These macros will set the req_event/rsp_event field to trigger a
> >  *  notification on the very next message that is enqueued. If you want to
> >  *  create batches of work (i.e., only receive a notification after several
> >  *  messages have been enqueued) then you will need to create a customised
> >  *  version of the FINAL_CHECK macro in your own code, which sets the event
> >  *  field appropriately.
> >
> >Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> >(and other similar loops) and add a FINAL check at the very end?
> >
> >>>But it was odd and I didn't go deeper in it to figure out what
> >>>was happening. And also to figure out if for the VIF we could
> >>>do something of #packets != #interrupts.  And hopefully some
> >>>mechanism to adjust so that the amount of interrupts would
> >>>be lesser per packets (hand waving here).
> >>
> >>I'm trying to do this now.
> >
> >What scheme do you have in mind?
> 
> As I mentioned above, filling a ring completely appears to be almost
> as bad as sending too many notifications. The ideal scheme may
> involve trying to balance the ring at some "half-full" state,
> depending on the capacity for the front- and backends to process
> requests and responses.
> 

I don't think filling the ring completely causes any problem; conceptually
that's the same as the "half-full" state if you need to throttle the
ring.

The real problem is how to do notifications correctly.


Wei.

> Andrew.


* Re: Interesting observation with network event notification and batching
  2013-06-17 10:46         ` Wei Liu
@ 2013-06-17 10:56           ` Andrew Bennieston
  2013-06-17 11:08             ` Ian Campbell
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Bennieston @ 2013-06-17 10:56 UTC (permalink / raw)
  To: Wei Liu
  Cc: annie.li, xen-devel, Ian Campbell, stefano.stabellini,
	Konrad Rzeszutek Wilk

On 17/06/13 11:46, Wei Liu wrote:
> On Mon, Jun 17, 2013 at 10:56:12AM +0100, Andrew Bennieston wrote:
>> On 17/06/13 10:38, Ian Campbell wrote:
>>> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
>>>>>> Konrad, IIRC you once mentioned you discovered something with event
>>>>>> notification, what's that?
>>>>>
>>>>> They were bizzare. I naively expected some form of # of physical NIC
>>>>> interrupts to be around the same as the VIF or less. And I figured
>>>>> that the amount of interrupts would be constant irregardless of the
>>>>> size of the packets. In other words #packets == #interrupts.
>>>>>
>>>>
>>>> It could be that the frontend notifies the backend for every packet it
>>>> sends. This is not desirable and I don't expect the ring to behave that
>>>> way.
>>
>> I have observed this kind of behaviour during network performance
>> tests in which I periodically checked the ring state during an iperf
>> session. It looked to me like the frontend was sending notifications
>> far too often, but that the backend was sending them very
>> infrequently, so the Tx (from guest) ring was mostly empty and the
>> Rx (to guest) ring was mostly full. This has the effect of both
>> front and backend having to block occasionally waiting for the other
>> end to clear or fill a ring, even though there is more data
>> available.
>>
>> My initial theory was that this was caused in part by the shared
>> event channel, however I expect that Wei is testing on top of a
>> kernel with his split event channel features?
>>
>
> Yes, with split event channels.
>
> And during tests the interrupt counts, frontend TX has 6 figures
> interrupt number while frontend RX has 2 figures number.
>
>>>
>>> It is probably worth checking that things are working how we think they
>>> should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
>>> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
>>> suitable points to maximise batching.
>>>
>>> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
>>> loop right? This would push the req_event pointer to just after the last
>>> request, meaning the net request enqueued by the frontend would cause a
>>> notification -- even though the backend is actually still continuing to
>>> process requests and would have picked up that packet without further
>>> notification. n this case there is a fair bit of work left in the
>>> backend for this iteration i.e. plenty of opportunity for the frontend
>>> to queue more requests.
>>>
>>> The comments in ring.h say:
>>>   *  These macros will set the req_event/rsp_event field to trigger a
>>>   *  notification on the very next message that is enqueued. If you want to
>>>   *  create batches of work (i.e., only receive a notification after several
>>>   *  messages have been enqueued) then you will need to create a customised
>>>   *  version of the FINAL_CHECK macro in your own code, which sets the event
>>>   *  field appropriately.
>>>
>>> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
>>> (and other similar loops) and add a FINAL check at the very end?
>>>
>>>>> But it was odd and I didn't go deeper in it to figure out what
>>>>> was happening. And also to figure out if for the VIF we could
>>>>> do something of #packets != #interrupts.  And hopefully some
>>>>> mechanism to adjust so that the amount of interrupts would
>>>>> be lesser per packets (hand waving here).
>>>>
>>>> I'm trying to do this now.
>>>
>>> What scheme do you have in mind?
>>
>> As I mentioned above, filling a ring completely appears to be almost
>> as bad as sending too many notifications. The ideal scheme may
>> involve trying to balance the ring at some "half-full" state,
>> depending on the capacity for the front- and backends to process
>> requests and responses.
>>
>
> I don't think filling the ring full causes any problem, that's just
> conceptually the same as "half-full" state if you need to throttle the
> ring.
My understanding was that filling the ring will cause the producer to 
sleep until slots become available (i.e. until the consumer notifies it 
that it has removed something from the ring).

I'm just concerned that overly aggressive batching may lead to a 
situation where the consumer is sitting idle, waiting for a notification 
that the producer hasn't yet sent because it can still fill more slots 
on the ring. When the ring is completely full, the producer would have 
to wait for the ring to partially empty. At this point, the consumer 
would hold off notifying because it can still batch more processing, so 
the producer is left waiting. (Repeat as required). It would be better 
to have both producer and consumer running concurrently.

I mention this mainly so that we don't end up with a swing to the polar 
opposite of what we have now, which (to my mind) is just as bad. Clearly 
this is an edge case, but if there's a reason I'm missing that this 
can't happen (e.g. after a period of inactivity) then don't hesitate to 
point it out :)

(Perhaps "half-full" was misleading... the optimal state may be "just 
enough room for one more packet", or something along those lines...)

Andrew


* Re: Interesting observation with network event notification and batching
  2013-06-17 10:56           ` Andrew Bennieston
@ 2013-06-17 11:08             ` Ian Campbell
  2013-06-17 11:55               ` Andrew Bennieston
  0 siblings, 1 reply; 27+ messages in thread
From: Ian Campbell @ 2013-06-17 11:08 UTC (permalink / raw)
  To: Andrew Bennieston
  Cc: annie.li, xen-devel, Wei Liu, stefano.stabellini, Konrad Rzeszutek Wilk

On Mon, 2013-06-17 at 11:56 +0100, Andrew Bennieston wrote:
> On 17/06/13 11:46, Wei Liu wrote:
> > On Mon, Jun 17, 2013 at 10:56:12AM +0100, Andrew Bennieston wrote:
> >> On 17/06/13 10:38, Ian Campbell wrote:
> >>> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
> >>>>>> Konrad, IIRC you once mentioned you discovered something with event
> >>>>>> notification, what's that?
> >>>>>
> >>>>> They were bizzare. I naively expected some form of # of physical NIC
> >>>>> interrupts to be around the same as the VIF or less. And I figured
> >>>>> that the amount of interrupts would be constant irregardless of the
> >>>>> size of the packets. In other words #packets == #interrupts.
> >>>>>
> >>>>
> >>>> It could be that the frontend notifies the backend for every packet it
> >>>> sends. This is not desirable and I don't expect the ring to behave that
> >>>> way.
> >>
> >> I have observed this kind of behaviour during network performance
> >> tests in which I periodically checked the ring state during an iperf
> >> session. It looked to me like the frontend was sending notifications
> >> far too often, but that the backend was sending them very
> >> infrequently, so the Tx (from guest) ring was mostly empty and the
> >> Rx (to guest) ring was mostly full. This has the effect of both
> >> front and backend having to block occasionally waiting for the other
> >> end to clear or fill a ring, even though there is more data
> >> available.
> >>
> >> My initial theory was that this was caused in part by the shared
> >> event channel, however I expect that Wei is testing on top of a
> >> kernel with his split event channel features?
> >>
> >
> > Yes, with split event channels.
> >
> > And during tests the interrupt counts, frontend TX has 6 figures
> > interrupt number while frontend RX has 2 figures number.
> >
> >>>
> >>> It is probably worth checking that things are working how we think they
> >>> should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
> >>> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
> >>> suitable points to maximise batching.
> >>>
> >>> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
> >>> loop right? This would push the req_event pointer to just after the last
> >>> request, meaning the net request enqueued by the frontend would cause a
> >>> notification -- even though the backend is actually still continuing to
> >>> process requests and would have picked up that packet without further
> >>> notification. n this case there is a fair bit of work left in the
> >>> backend for this iteration i.e. plenty of opportunity for the frontend
> >>> to queue more requests.
> >>>
> >>> The comments in ring.h say:
> >>>   *  These macros will set the req_event/rsp_event field to trigger a
> >>>   *  notification on the very next message that is enqueued. If you want to
> >>>   *  create batches of work (i.e., only receive a notification after several
> >>>   *  messages have been enqueued) then you will need to create a customised
> >>>   *  version of the FINAL_CHECK macro in your own code, which sets the event
> >>>   *  field appropriately.
> >>>
> >>> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> >>> (and other similar loops) and add a FINAL check at the very end?
> >>>
> >>>>> But it was odd and I didn't go deeper in it to figure out what
> >>>>> was happening. And also to figure out if for the VIF we could
> >>>>> do something of #packets != #interrupts.  And hopefully some
> >>>>> mechanism to adjust so that the amount of interrupts would
> >>>>> be lesser per packets (hand waving here).
> >>>>
> >>>> I'm trying to do this now.
> >>>
> >>> What scheme do you have in mind?
> >>
> >> As I mentioned above, filling a ring completely appears to be almost
> >> as bad as sending too many notifications. The ideal scheme may
> >> involve trying to balance the ring at some "half-full" state,
> >> depending on the capacity for the front- and backends to process
> >> requests and responses.
> >>
> >
> > I don't think filling the ring full causes any problem, that's just
> > conceptually the same as "half-full" state if you need to throttle the
> > ring.
> My understanding was that filling the ring will cause the producer to 
> sleep until slots become available (i.e. the until the consumer notifies 
> it that it has removed something from the ring).
> 
> I'm just concerned that overly aggressive batching may lead to a 
> situation where the consumer is sitting idle, waiting for a notification 
> that the producer hasn't yet sent because it can still fill more slots 
> on the ring. When the ring is completely full, the producer would have 
> to wait for the ring to partially empty. At this point, the consumer 
> would hold off notifying because it can still batch more processing, so 
> the producer is left waiting. (Repeat as required). It would be better 
> to have both producer and consumer running concurrently.
> 
> I mention this mainly so that we don't end up with a swing to the polar 
> opposite of what we have now, which (to my mind) is just as bad. Clearly 
> this is an edge case, but if there's a reason I'm missing that this 
> can't happen (e.g. after a period of inactivity) then don't hesitate to 
> point it out :)

Doesn't the separation between req_event and rsp_event help here?

So if the producer fills the ring, it will sleep, but it sets rsp_event
appropriately so that when the backend has completed some (but not all)
of the work it will be woken up and can put extra stuff on the ring.

It shouldn't need to wait for the backend to process the whole batch for
this.
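
i.e. roughly this on the frontend/producer side (sketch; 'front' is just
a stand-in for the netfront info structure):

    /* ring full: arm rsp_event and sleep; the backend's very next batch
     * of responses will then notify us, well before the whole ring has
     * been drained */
    RING_FINAL_CHECK_FOR_RESPONSES(&front->tx, more_to_do);
    if (!more_to_do)
            wait_for_backend_notification();  /* placeholder for the sleep */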

> 
> (Perhaps "half-full" was misleading... the optimal state may be "just 
> enough room for one more packet", or something along those lines...)
> 
> Andrew
> 


* Re: Interesting observation with network event notification and batching
  2013-06-17 10:35       ` Wei Liu
@ 2013-06-17 11:34         ` annie li
  0 siblings, 0 replies; 27+ messages in thread
From: annie li @ 2013-06-17 11:34 UTC (permalink / raw)
  To: Wei Liu
  Cc: andrew.bennieston, Konrad Rzeszutek Wilk, xen-devel,
	Ian Campbell, stefano.stabellini


On 2013-6-17 18:35, Wei Liu wrote:
> On Mon, Jun 17, 2013 at 10:38:33AM +0100, Ian Campbell wrote:
>> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
>>>>> Konrad, IIRC you once mentioned you discovered something with event
>>>>> notification, what's that?
>>>> They were bizzare. I naively expected some form of # of physical NIC
>>>> interrupts to be around the same as the VIF or less. And I figured
>>>> that the amount of interrupts would be constant irregardless of the
>>>> size of the packets. In other words #packets == #interrupts.
>>>>
>>> It could be that the frontend notifies the backend for every packet it
>>> sends. This is not desirable and I don't expect the ring to behave that
>>> way.
>> It is probably worth checking that things are working how we think they
>> should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
>> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
>> suitable points to maximise batching.
>>
>> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
>> loop right? This would push the req_event pointer to just after the last
>> request, meaning the net request enqueued by the frontend would cause a
>> notification -- even though the backend is actually still continuing to
>> process requests and would have picked up that packet without further
>> notification. n this case there is a fair bit of work left in the
>> backend for this iteration i.e. plenty of opportunity for the frontend
>> to queue more requests.
>>
>> The comments in ring.h say:
>>   *  These macros will set the req_event/rsp_event field to trigger a
>>   *  notification on the very next message that is enqueued. If you want to
>>   *  create batches of work (i.e., only receive a notification after several
>>   *  messages have been enqueued) then you will need to create a customised
>>   *  version of the FINAL_CHECK macro in your own code, which sets the event
>>   *  field appropriately.
>>
>> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
>> (and other similar loops) and add a FINAL check at the very end?
>>
>>>> But it was odd and I didn't go deeper in it to figure out what
>>>> was happening. And also to figure out if for the VIF we could
>>>> do something of #packets != #interrupts.  And hopefully some
>>>> mechanism to adjust so that the amount of interrupts would
>>>> be lesser per packets (hand waving here).
>>> I'm trying to do this now.
>> What scheme do you have in mind?
> Basically the one you mentioned above.
>
> Playing with various event pointers now.

Did you collect data on how many requests netback processes when
req_event is updated in RING_FINAL_CHECK_FOR_REQUESTS? I assume from your
test results that this value is pretty small. How about not updating
req_event every time RING_FINAL_CHECK_FOR_REQUESTS finds no unconsumed
requests?
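
Something along these lines, purely to illustrate the idea (hypothetical,
and a real version would need a fallback so that a final notification can
never be lost):

    /* arm req_event on only one out of every N "ring empty" checks, so
     * the frontend is not asked to notify for every single packet */
    static unsigned int empty_checks;

    work_to_do = RING_HAS_UNCONSUMED_REQUESTS(&vif->tx);
    if (!work_to_do && (++empty_checks % 8 == 0)) {
            /* same as the tail of RING_FINAL_CHECK_FOR_REQUESTS */
            vif->tx.sring->req_event = vif->tx.req_cons + 1;
            mb();
            work_to_do = RING_HAS_UNCONSUMED_REQUESTS(&vif->tx);
    }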

Thanks
Annie
>
>
> Wei.
>


* Re: Interesting observation with network event notification and batching
  2013-06-17 11:08             ` Ian Campbell
@ 2013-06-17 11:55               ` Andrew Bennieston
  0 siblings, 0 replies; 27+ messages in thread
From: Andrew Bennieston @ 2013-06-17 11:55 UTC (permalink / raw)
  To: Ian Campbell
  Cc: annie.li, xen-devel, Wei Liu, stefano.stabellini, Konrad Rzeszutek Wilk

On 17/06/13 12:08, Ian Campbell wrote:
> On Mon, 2013-06-17 at 11:56 +0100, Andrew Bennieston wrote:
>> On 17/06/13 11:46, Wei Liu wrote:
>>> On Mon, Jun 17, 2013 at 10:56:12AM +0100, Andrew Bennieston wrote:
>>>> On 17/06/13 10:38, Ian Campbell wrote:
>>>>> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
>>>>>>>> Konrad, IIRC you once mentioned you discovered something with event
>>>>>>>> notification, what's that?
>>>>>>>
>>>>>>> They were bizzare. I naively expected some form of # of physical NIC
>>>>>>> interrupts to be around the same as the VIF or less. And I figured
>>>>>>> that the amount of interrupts would be constant irregardless of the
>>>>>>> size of the packets. In other words #packets == #interrupts.
>>>>>>>
>>>>>>
>>>>>> It could be that the frontend notifies the backend for every packet it
>>>>>> sends. This is not desirable and I don't expect the ring to behave that
>>>>>> way.
>>>>
>>>> I have observed this kind of behaviour during network performance
>>>> tests in which I periodically checked the ring state during an iperf
>>>> session. It looked to me like the frontend was sending notifications
>>>> far too often, but that the backend was sending them very
>>>> infrequently, so the Tx (from guest) ring was mostly empty and the
>>>> Rx (to guest) ring was mostly full. This has the effect of both
>>>> front and backend having to block occasionally waiting for the other
>>>> end to clear or fill a ring, even though there is more data
>>>> available.
>>>>
>>>> My initial theory was that this was caused in part by the shared
>>>> event channel, however I expect that Wei is testing on top of a
>>>> kernel with his split event channel features?
>>>>
>>>
>>> Yes, with split event channels.
>>>
>>> And during tests the interrupt counts, frontend TX has 6 figures
>>> interrupt number while frontend RX has 2 figures number.
>>>
>>>>>
>>>>> It is probably worth checking that things are working how we think they
>>>>> should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
>>>>> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
>>>>> suitable points to maximise batching.
>>>>>
>>>>> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
>>>>> loop right? This would push the req_event pointer to just after the last
>>>>> request, meaning the net request enqueued by the frontend would cause a
>>>>> notification -- even though the backend is actually still continuing to
>>>>> process requests and would have picked up that packet without further
>>>>> notification. n this case there is a fair bit of work left in the
>>>>> backend for this iteration i.e. plenty of opportunity for the frontend
>>>>> to queue more requests.
>>>>>
>>>>> The comments in ring.h say:
>>>>>    *  These macros will set the req_event/rsp_event field to trigger a
>>>>>    *  notification on the very next message that is enqueued. If you want to
>>>>>    *  create batches of work (i.e., only receive a notification after several
>>>>>    *  messages have been enqueued) then you will need to create a customised
>>>>>    *  version of the FINAL_CHECK macro in your own code, which sets the event
>>>>>    *  field appropriately.
>>>>>
>>>>> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
>>>>> (and other similar loops) and add a FINAL check at the very end?
>>>>>
>>>>>>> But it was odd and I didn't go deeper in it to figure out what
>>>>>>> was happening. And also to figure out if for the VIF we could
>>>>>>> do something of #packets != #interrupts.  And hopefully some
>>>>>>> mechanism to adjust so that the amount of interrupts would
>>>>>>> be lesser per packets (hand waving here).
>>>>>>
>>>>>> I'm trying to do this now.
>>>>>
>>>>> What scheme do you have in mind?
>>>>
>>>> As I mentioned above, filling a ring completely appears to be almost
>>>> as bad as sending too many notifications. The ideal scheme may
>>>> involve trying to balance the ring at some "half-full" state,
>>>> depending on the capacity for the front- and backends to process
>>>> requests and responses.
>>>>
>>>
>>> I don't think filling the ring full causes any problem, that's just
>>> conceptually the same as "half-full" state if you need to throttle the
>>> ring.
>> My understanding was that filling the ring will cause the producer to
>> sleep until slots become available (i.e. the until the consumer notifies
>> it that it has removed something from the ring).
>>
>> I'm just concerned that overly aggressive batching may lead to a
>> situation where the consumer is sitting idle, waiting for a notification
>> that the producer hasn't yet sent because it can still fill more slots
>> on the ring. When the ring is completely full, the producer would have
>> to wait for the ring to partially empty. At this point, the consumer
>> would hold off notifying because it can still batch more processing, so
>> the producer is left waiting. (Repeat as required). It would be better
>> to have both producer and consumer running concurrently.
>>
>> I mention this mainly so that we don't end up with a swing to the polar
>> opposite of what we have now, which (to my mind) is just as bad. Clearly
>> this is an edge case, but if there's a reason I'm missing that this
>> can't happen (e.g. after a period of inactivity) then don't hesitate to
>> point it out :)
>
> Doesn't the separation between req_event and rsp_event help here?
>
> So if the producer fills the ring, it will sleep, but set rsp_event
> appropriately so that when the backend completes some (but not all) work
> it will be woken up and can put extra stuff on the ring.
>
> It shouldn't need to wait for the backend to process the whole batch for
> this.

Right. As long as this logic doesn't get inadvertently changed in an 
attempt to improve batching of events!
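
For reference, the event-pointer behaviour being relied on here comes
from the generic ring macros. Paraphrased from memory of Xen's public
io/ring.h (the header is authoritative), they look roughly like this:

  /* Consumer side (e.g. netback pulling requests): re-arm req_event
   * only when the ring looks empty, then re-check to close the race. */
  #define RING_FINAL_CHECK_FOR_REQUESTS(_r, _work_to_do) do {     \
      (_work_to_do) = RING_HAS_UNCONSUMED_REQUESTS(_r);           \
      if (_work_to_do) break;                                     \
      (_r)->sring->req_event = (_r)->req_cons + 1;                \
      mb();                                                       \
      (_work_to_do) = RING_HAS_UNCONSUMED_REQUESTS(_r);           \
  } while (0)

  /* Producer side (e.g. netback pushing responses): notify only if
   * the newly published responses crossed the consumer's rsp_event
   * pointer. The frontend sets rsp_event to "next response I haven't
   * seen", which is why it is woken after partial completion. */
  #define RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(_r, _notify) do {  \
      RING_IDX __old = (_r)->sring->rsp_prod;                     \
      RING_IDX __new = (_r)->rsp_prod_pvt;                        \
      wmb(); /* responses visible before new rsp_prod */          \
      (_r)->sring->rsp_prod = __new;                              \
      mb();                                                       \
      (_notify) = ((RING_IDX)(__new - (_r)->sring->rsp_event) <   \
                   (RING_IDX)(__new - __old));                    \
  } while (0)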

>
>>
>> (Perhaps "half-full" was misleading... the optimal state may be "just
>> enough room for one more packet", or something along those lines...)
>>
>> Andrew
>>
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Interesting observation with network event notification and batching
  2013-06-12 10:14 Interesting observation with network event notification and batching Wei Liu
  2013-06-14 18:53 ` Konrad Rzeszutek Wilk
@ 2013-06-28 16:15 ` Wei Liu
  2013-07-01  7:48   ` annie li
  1 sibling, 1 reply; 27+ messages in thread
From: Wei Liu @ 2013-06-28 16:15 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, ian.campbell, stefano.stabellini, annie.li, andrew.bennieston

Hi all,

After collecting more stats and comparing copying / mapping cases, I now
have some more interesting findings, which might contradict what I said
before.

I tuned the runes I used for benchmarking to make sure iperf and netperf
generate large packets (~64K). Here are the runes I use:

  iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
  netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072

                          COPY                    MAP
iperf    Tput:             6.5Gb/s             14Gb/s (was 2.5Gb/s)
         PPI               2.90                  1.07
         SPI               37.75                 13.69
         PPN               2.90                  1.07
         SPN               37.75                 13.69
         tx_count           31808                174769
         nr_napi_schedule   31805                174697
         total_packets      92354                187408
         total_reqs         1200793              2392614

netperf  Tput:            5.8Gb/s             10.5Gb/s
         PPI               2.13                   1.00
         SPI               36.70                  16.73
         PPN               2.13                   1.31
         SPN               36.70                  16.75
         tx_count           57635                205599
         nr_napi_schedule   57633                205311
         total_packets      122800               270254
         total_reqs         2115068              3439751

  PPI: packets processed per interrupt
  SPI: slots processed per interrupt
  PPN: packets processed per napi schedule
  SPN: slots processed per napi schedule
  tx_count: interrupt count
  nr_napi_schedule: number of times the NAPI poll handler was scheduled
  total_packets: total packets processed during test
  total_reqs: total slots used during test
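
(For clarity, the derived figures are just the raw counters divided out;
assuming those are indeed the formulas, which the counter names suggest,
a trivial standalone check against the COPY / iperf column:)

  #include <stdio.h>

  int main(void)
  {
          /* COPY / iperf column from the table above */
          double tx_count         = 31808;    /* interrupts */
          double nr_napi_schedule = 31805;    /* NAPI poll invocations */
          double total_packets    = 92354;
          double total_reqs       = 1200793;  /* slots */

          printf("PPI %.2f SPI %.2f PPN %.2f SPN %.2f\n",
                 total_packets / tx_count,         /* 2.90  */
                 total_reqs / tx_count,            /* 37.75 */
                 total_packets / nr_napi_schedule, /* 2.90  */
                 total_reqs / nr_napi_schedule);   /* 37.75 */
          return 0;
  }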

* Notification and batching

Is notification and batching really a problem? I'm not so sure now.
Before I measured PPI / PPN / SPI / SPN in the copying case, my first
thought was that "in that case netback *must* have better batching",
which turned out not to be very true -- copying mode makes netback
slower, but the batching gained is not huge.

Ideally we still want to batch as much as possible. One possible way is
to play with the 'weight' parameter in NAPI. But as the figures show,
batching seems not to be very important for throughput, at least for
now. If the NAPI framework and netfront / netback are doing their jobs
as designed we might not need to worry about this now.
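
A minimal sketch of where that knob lives (not the actual netfront /
netback code; my_poll(), my_setup() and process_up_to() are
placeholders):

  /* A larger weight lets one poll pass consume more slots before
   * yielding, i.e. bigger batches per napi schedule. */
  static int my_poll(struct napi_struct *napi, int budget)
  {
          /* placeholder: drain at most 'budget' packets from the ring */
          int work_done = process_up_to(budget);

          if (work_done < budget) {
                  napi_complete(napi);
                  /* re-arm the event channel / final check goes here */
          }
          return work_done;
  }

  static void my_setup(struct net_device *dev, struct napi_struct *napi)
  {
          /* 64 is the conventional default weight; raising it is the
           * experiment suggested above. */
          netif_napi_add(dev, napi, my_poll, 64);
  }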

Andrew, do you have any thoughts on this? You found out that NAPI didn't
scale well with multi-threaded iperf in DomU; do you have any idea how
that can happen?

* Thoughts on zero-copy TX

With this hack we are able to achieve 10Gb/s single stream, which is
good. But with the classic XenoLinux kernel, which has zero-copy TX, we
weren't able to achieve this. I also developed another zero-copy netback
prototype one year ago with Ian's out-of-tree skb frag destructor patch
series. That prototype couldn't achieve 10Gb/s either (IIRC the
performance was more or less the same as copying mode, about 6~7Gb/s).

My hack maps all necessary pages permanently and never unmaps them, so
we skip lots of page table manipulation and TLB flushes. My basic
conclusion is therefore that page table manipulation and TLB flushes do
incur a heavy performance penalty.

There is no way this hack can be upstreamed. If we're to re-introduce
zero-copy TX, we would need to implement some sort of lazy flushing
mechanism. I haven't thought this through. Presumably this mechanism
would also benefit blk somehow? I'm not sure yet.

Could persistent mapping (with the to-be-developed reclaim / MRU list
mechanism) be useful here? So that we can unify blk and net drivers?
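
To make the idea concrete, here is a very rough sketch of "map once and
keep it", with the reclaim / MRU part deliberately left out.
lookup_cached() and cache_insert() are hypothetical; gnttab_set_map_op()
and GNTTABOP_map_grant_ref are the real grant-table interfaces (a real
implementation would also want ballooned pages and proper error
handling):

  struct persistent_gnt {
          grant_ref_t gref;
          grant_handle_t handle;
          struct page *page;    /* stays mapped until teardown/reclaim */
  };

  static struct persistent_gnt *get_mapped_page(domid_t otherend,
                                                grant_ref_t gref)
  {
          struct persistent_gnt *pg = lookup_cached(gref); /* hypothetical */
          struct gnttab_map_grant_ref op;

          if (pg)
                  return pg;    /* hit: no hypercall, no TLB flush */

          pg = kzalloc(sizeof(*pg), GFP_KERNEL);
          pg->page = alloc_page(GFP_KERNEL); /* should be a ballooned page */
          pg->gref = gref;

          gnttab_set_map_op(&op, (unsigned long)page_address(pg->page),
                            GNTMAP_host_map | GNTMAP_readonly,
                            gref, otherend);
          if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1) ||
              op.status != GNTST_okay)
                  return NULL;  /* error handling elided */

          pg->handle = op.handle;
          cache_insert(pg);     /* hypothetical; MRU/reclaim hooks here */
          return pg;
  }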

* Changes required to introduce zero-copy TX

1. SKB frag destructor series: to track life cycle of SKB frags. This is
not yet upstreamed.

2. Mechanism to negotiate max slots the frontend can use: mapping
requires backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS (a possible
xenstore handshake is sketched after this list).

3. Lazy flushing mechanism or persistent grants: ???
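
For point 2, a possible (purely illustrative) xenstore handshake --
"feature-max-slots" is an invented key name; xenbus_printf() /
xenbus_scanf() are the real xenbus helpers:

  /* Backend: advertise how many slots per packet it can accept. */
  static int backend_advertise_max_slots(struct xenbus_device *dev)
  {
          return xenbus_printf(XBT_NIL, dev->nodename,
                               "feature-max-slots", "%u",
                               (unsigned int)MAX_SKB_FRAGS);
  }

  /* Frontend: read the backend's limit and never use more slots than
   * that for a single packet. */
  static unsigned int frontend_read_max_slots(struct xenbus_device *dev)
  {
          unsigned int max_slots;

          if (xenbus_scanf(XBT_NIL, dev->otherend,
                           "feature-max-slots", "%u", &max_slots) != 1)
                  max_slots = MAX_SKB_FRAGS;  /* conservative fallback */

          return max_slots;
  }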


Wei.

* Note
In my previous tests I only ran iperf and didn't have the right rune to
generate large packets. Iperf seems to increase the packet size as time
goes by. In the copying case the packet size was eventually increased to
64K, while in the mapping case something odd happened (I believe that
must be due to a bug in my hack :-/) -- the packet size was always the
default size (8K). Adding '-l 131072' to iperf makes sure that the
packet is always 64K.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Interesting observation with network event notification and batching
  2013-06-28 16:15 ` Wei Liu
@ 2013-07-01  7:48   ` annie li
  2013-07-01  8:54     ` Wei Liu
  2013-07-01 14:19     ` Stefano Stabellini
  0 siblings, 2 replies; 27+ messages in thread
From: annie li @ 2013-07-01  7:48 UTC (permalink / raw)
  To: Wei Liu; +Cc: andrew.bennieston, ian.campbell, stefano.stabellini, xen-devel




On 2013-6-29 0:15, Wei Liu wrote:
> Hi all,
>
> After collecting more stats and comparing copying / mapping cases, I now
> have some more interesting finds, which might contradict what I said
> before.
>
> I tuned the runes I used for benchmark to make sure iperf and netperf
> generate large packets (~64K). Here are the runes I use:
>
>    iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
>    netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
>
>                            COPY                    MAP
> iperf    Tput:             6.5Gb/s             14Gb/s (was 2.5Gb/s)

So with default iperf setting, copy is about 7.9G, and map is about 
2.5G? How about the result of netperf without large packets?

>           PPI               2.90                  1.07
>           SPI               37.75                 13.69
>           PPN               2.90                  1.07
>           SPN               37.75                 13.69
>           tx_count           31808                174769

Seems interrupt count does not affect the performance at all with -l 
131072 -w 128k.

>           nr_napi_schedule   31805                174697
>           total_packets      92354                187408
>           total_reqs         1200793              2392614
>
> netperf  Tput:            5.8Gb/s             10.5Gb/s
>           PPI               2.13                   1.00
>           SPI               36.70                  16.73
>           PPN               2.13                   1.31
>           SPN               36.70                  16.75
>           tx_count           57635                205599
>           nr_napi_schedule   57633                205311
>           total_packets      122800               270254
>           total_reqs         2115068              3439751
>
>    PPI: packets processed per interrupt
>    SPI: slots processed per interrupt
>    PPN: packets processed per napi schedule
>    SPN: slots processed per napi schedule
>    tx_count: interrupt count
>    total_reqs: total slots used during test
>
> * Notification and batching
>
> Is notification and batching really a problem? I'm not so sure now. My
> first thought when I didn't measure PPI / PPN / SPI / SPN in copying
> case was that "in that case netback *must* have better batching" which
> turned out not very true -- copying mode makes netback slower, however
> the batching gained is not hugh.
>
> Ideally we still want to batch as much as possible. Possible way
> includes playing with the 'weight' parameter in NAPI. But as the figures
> show batching seems not to be very important for throughput, at least
> for now. If the NAPI framework and netfront / netback are doing their
> jobs as designed we might not need to worry about this now.
>
> Andrew, do you have any thought on this? You found out that NAPI didn't
> scale well with multi-threaded iperf in DomU, do you have any handle how
> that can happen?
>
> * Thoughts on zero-copy TX
>
> With this hack we are able to achieve 10Gb/s single stream, which is
> good. But, with classic XenoLinux kernel which has zero copy TX we
> didn't able to achieve this.  I also developed another zero copy netback
> prototype one year ago with Ian's out-of-tree skb frag destructor patch
> series. That prototype couldn't achieve 10Gb/s either (IIRC the
> performance was more or less the same as copying mode, about 6~7Gb/s).
>
> My hack maps all necessary pages permantently, there is no unmap, we
> skip lots of page table manipulation and TLB flushes. So my basic
> conclusion is that page table manipulation and TLB flushes do incur
> heavy performance penalty.
>
> This hack can be upstreamed in no way. If we're to re-introduce
> zero-copy TX, we would need to implement some sort of lazy flushing
> mechanism. I haven't thought this through. Presumably this mechanism
> would also benefit blk somehow? I'm not sure yet.
>
> Could persistent mapping (with the to-be-developed reclaim / MRU list
> mechanism) be useful here? So that we can unify blk and net drivers?
>
> * Changes required to introduce zero-copy TX
>
> 1. SKB frag destructor series: to track life cycle of SKB frags. This is
> not yet upstreamed.

Are you mentioning this one 
http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?

<http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html> 

>
> 2. Mechanism to negotiate max slots frontend can use: mapping requires
> backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
>
> 3. Lazy flushing mechanism or persistent grants: ???

I did some tests with persistent grants before; they did not show better
performance than grant copy. But I was using the default params of
netperf and did not try large packet sizes. Your results remind me that
maybe persistent grants would get similarly better results with a larger
packet size too.

Thanks
Annie



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Interesting observation with network event notification and batching
  2013-07-01  7:48   ` annie li
@ 2013-07-01  8:54     ` Wei Liu
  2013-07-01 14:29       ` Stefano Stabellini
  2013-07-01 15:59       ` annie li
  2013-07-01 14:19     ` Stefano Stabellini
  1 sibling, 2 replies; 27+ messages in thread
From: Wei Liu @ 2013-07-01  8:54 UTC (permalink / raw)
  To: annie li
  Cc: Wei Liu, ian.campbell, stefano.stabellini, xen-devel, andrew.bennieston

On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
> 
> On 2013-6-29 0:15, Wei Liu wrote:
> >Hi all,
> >
> >After collecting more stats and comparing copying / mapping cases, I now
> >have some more interesting finds, which might contradict what I said
> >before.
> >
> >I tuned the runes I used for benchmark to make sure iperf and netperf
> >generate large packets (~64K). Here are the runes I use:
> >
> >   iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
> >   netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
> >
> >                           COPY                    MAP
> >iperf    Tput:             6.5Gb/s             14Gb/s (was 2.5Gb/s)
> 
> So with default iperf setting, copy is about 7.9G, and map is about
> 2.5G? How about the result of netperf without large packets?
> 

First question, yes.

Second question, 5.8Gb/s. And I believe for the copying scheme the
throughput is more or less the same without large packets.

> >          PPI               2.90                  1.07
> >          SPI               37.75                 13.69
> >          PPN               2.90                  1.07
> >          SPN               37.75                 13.69
> >          tx_count           31808                174769
> 
> Seems interrupt count does not affect the performance at all with -l
> 131072 -w 128k.
> 

Right.

> >          nr_napi_schedule   31805                174697
> >          total_packets      92354                187408
> >          total_reqs         1200793              2392614
> >
> >netperf  Tput:            5.8Gb/s             10.5Gb/s
> >          PPI               2.13                   1.00
> >          SPI               36.70                  16.73
> >          PPN               2.13                   1.31
> >          SPN               36.70                  16.75
> >          tx_count           57635                205599
> >          nr_napi_schedule   57633                205311
> >          total_packets      122800               270254
> >          total_reqs         2115068              3439751
> >
> >   PPI: packets processed per interrupt
> >   SPI: slots processed per interrupt
> >   PPN: packets processed per napi schedule
> >   SPN: slots processed per napi schedule
> >   tx_count: interrupt count
> >   total_reqs: total slots used during test
> >
> >* Notification and batching
> >
> >Is notification and batching really a problem? I'm not so sure now. My
> >first thought when I didn't measure PPI / PPN / SPI / SPN in copying
> >case was that "in that case netback *must* have better batching" which
> >turned out not very true -- copying mode makes netback slower, however
> >the batching gained is not hugh.
> >
> >Ideally we still want to batch as much as possible. Possible way
> >includes playing with the 'weight' parameter in NAPI. But as the figures
> >show batching seems not to be very important for throughput, at least
> >for now. If the NAPI framework and netfront / netback are doing their
> >jobs as designed we might not need to worry about this now.
> >
> >Andrew, do you have any thought on this? You found out that NAPI didn't 
> >scale well with multi-threaded iperf in DomU, do you have any handle how
> >that can happen?
> >
> >* Thoughts on zero-copy TX
> >
> >With this hack we are able to achieve 10Gb/s single stream, which is
> >good. But, with classic XenoLinux kernel which has zero copy TX we
> >didn't able to achieve this.  I also developed another zero copy netback
> >prototype one year ago with Ian's out-of-tree skb frag destructor patch
> >series. That prototype couldn't achieve 10Gb/s either (IIRC the
> >performance was more or less the same as copying mode, about 6~7Gb/s).
> >
> >My hack maps all necessary pages permantently, there is no unmap, we
> >skip lots of page table manipulation and TLB flushes. So my basic
> >conclusion is that page table manipulation and TLB flushes do incur
> >heavy performance penalty.
> >
> >This hack can be upstreamed in no way. If we're to re-introduce
> >zero-copy TX, we would need to implement some sort of lazy flushing
> >mechanism. I haven't thought this through. Presumably this mechanism
> >would also benefit blk somehow? I'm not sure yet.
> >
> >Could persistent mapping (with the to-be-developed reclaim / MRU list
> >mechanism) be useful here? So that we can unify blk and net drivers?
> >
> >* Changes required to introduce zero-copy TX
> >
> >1. SKB frag destructor series: to track life cycle of SKB frags. This is
> >not yet upstreamed.
> 
> Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
> 
> <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html>
> 

Yes. But I believe there have been several versions posted. The link you
have is not the latest version.

> >
> >2. Mechanism to negotiate max slots frontend can use: mapping requires
> >backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
> >
> >3. Lazy flushing mechanism or persistent grants: ???
> 
> I did some test with persistent grants before, it did not show
> better performance than grant copy. But I was using the default
> params of netperf, and not tried large packet size. Your results
> reminds me that maybe persistent grants would get similar results
> with larger packet size too.
> 

"No better performance" -- that's because both mechanisms are copying?
However I presume persistent grant can scale better? From an earlier
email last week, I read that copying is done by the guest so that this
mechanism scales much better than hypervisor copying in blk's case.


Wei.

> Thanks
> Annie
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Interesting observation with network event notification and batching
  2013-07-01  7:48   ` annie li
  2013-07-01  8:54     ` Wei Liu
@ 2013-07-01 14:19     ` Stefano Stabellini
  2013-07-01 15:59       ` annie li
  1 sibling, 1 reply; 27+ messages in thread
From: Stefano Stabellini @ 2013-07-01 14:19 UTC (permalink / raw)
  To: annie li
  Cc: Wei Liu, ian.campbell, stefano.stabellini, xen-devel, andrew.bennieston

Could you please use plain text emails in the future?

On Mon, 1 Jul 2013, annie li wrote:
> On 2013-6-29 0:15, Wei Liu wrote:
> 
> Hi all,
> 
> After collecting more stats and comparing copying / mapping cases, I now
> have some more interesting finds, which might contradict what I said
> before.
> 
> I tuned the runes I used for benchmark to make sure iperf and netperf
> generate large packets (~64K). Here are the runes I use:
> 
>   iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
>   netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
> 
>                           COPY                    MAP
> iperf    Tput:             6.5Gb/s             14Gb/s (was 2.5Gb/s)
> 
> 
> So with default iperf setting, copy is about 7.9G, and map is about 2.5G? How about the result of netperf without large packets?
> 
>          PPI               2.90                  1.07
>          SPI               37.75                 13.69
>          PPN               2.90                  1.07
>          SPN               37.75                 13.69
>          tx_count           31808                174769
> 
> 
> Seems interrupt count does not affect the performance at all with -l 131072 -w 128k.
> 
>          nr_napi_schedule   31805                174697
>          total_packets      92354                187408
>          total_reqs         1200793              2392614
> 
> netperf  Tput:            5.8Gb/s             10.5Gb/s
>          PPI               2.13                   1.00
>          SPI               36.70                  16.73
>          PPN               2.13                   1.31
>          SPN               36.70                  16.75
>          tx_count           57635                205599
>          nr_napi_schedule   57633                205311
>          total_packets      122800               270254
>          total_reqs         2115068              3439751
> 
>   PPI: packets processed per interrupt
>   SPI: slots processed per interrupt
>   PPN: packets processed per napi schedule
>   SPN: slots processed per napi schedule
>   tx_count: interrupt count
>   total_reqs: total slots used during test
> 
> * Notification and batching
> 
> Is notification and batching really a problem? I'm not so sure now. My
> first thought when I didn't measure PPI / PPN / SPI / SPN in copying
> case was that "in that case netback *must* have better batching" which
> turned out not very true -- copying mode makes netback slower, however
> the batching gained is not hugh.
> 
> Ideally we still want to batch as much as possible. Possible way
> includes playing with the 'weight' parameter in NAPI. But as the figures
> show batching seems not to be very important for throughput, at least
> for now. If the NAPI framework and netfront / netback are doing their
> jobs as designed we might not need to worry about this now.
> 
> Andrew, do you have any thought on this? You found out that NAPI didn't
> scale well with multi-threaded iperf in DomU, do you have any handle how
> that can happen?
> 
> * Thoughts on zero-copy TX
> 
> With this hack we are able to achieve 10Gb/s single stream, which is
> good. But, with classic XenoLinux kernel which has zero copy TX we
> didn't able to achieve this.  I also developed another zero copy netback
> prototype one year ago with Ian's out-of-tree skb frag destructor patch
> series. That prototype couldn't achieve 10Gb/s either (IIRC the
> performance was more or less the same as copying mode, about 6~7Gb/s).
> 
> My hack maps all necessary pages permantently, there is no unmap, we
> skip lots of page table manipulation and TLB flushes. So my basic
> conclusion is that page table manipulation and TLB flushes do incur
> heavy performance penalty.
> 
> This hack can be upstreamed in no way. If we're to re-introduce
> zero-copy TX, we would need to implement some sort of lazy flushing
> mechanism. I haven't thought this through. Presumably this mechanism
> would also benefit blk somehow? I'm not sure yet.
> 
> Could persistent mapping (with the to-be-developed reclaim / MRU list
> mechanism) be useful here? So that we can unify blk and net drivers?
> 
> * Changes required to introduce zero-copy TX
> 
> 1. SKB frag destructor series: to track life cycle of SKB frags. This is
> not yet upstreamed.
> 
> 
> Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
> 
> 
> 2. Mechanism to negotiate max slots frontend can use: mapping requires
> backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
> 
> 3. Lazy flushing mechanism or persistent grants: ???
> 
> 
> I did some test with persistent grants before, it did not show better performance than grant copy. But I was using the default
> params of netperf, and not tried large packet size. Your results reminds me that maybe persistent grants would get similar
> results with larger packet size too.
> 
> Thanks
> Annie
> 
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Interesting observation with network event notification and batching
  2013-07-01  8:54     ` Wei Liu
@ 2013-07-01 14:29       ` Stefano Stabellini
  2013-07-01 14:39         ` Wei Liu
  2013-07-01 15:59       ` annie li
  1 sibling, 1 reply; 27+ messages in thread
From: Stefano Stabellini @ 2013-07-01 14:29 UTC (permalink / raw)
  To: Wei Liu
  Cc: ian.campbell, stefano.stabellini, xen-devel, annie li, andrew.bennieston

On Mon, 1 Jul 2013, Wei Liu wrote:
> On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
> > 
> > On 2013-6-29 0:15, Wei Liu wrote:
> > >Hi all,
> > >
> > >After collecting more stats and comparing copying / mapping cases, I now
> > >have some more interesting finds, which might contradict what I said
> > >before.
> > >
> > >I tuned the runes I used for benchmark to make sure iperf and netperf
> > >generate large packets (~64K). Here are the runes I use:
> > >
> > >   iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
> > >   netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
> > >
> > >                           COPY                    MAP
> > >iperf    Tput:             6.5Gb/s             14Gb/s (was 2.5Gb/s)
> > 
> > So with default iperf setting, copy is about 7.9G, and map is about
> > 2.5G? How about the result of netperf without large packets?
> > 
> 
> First question, yes.
> 
> Second question, 5.8Gb/s. And I believe for the copying scheme without
> large packet the throuput is more or less the same.
> 
> > >          PPI               2.90                  1.07
> > >          SPI               37.75                 13.69
> > >          PPN               2.90                  1.07
> > >          SPN               37.75                 13.69
> > >          tx_count           31808                174769
> > 
> > Seems interrupt count does not affect the performance at all with -l
> > 131072 -w 128k.
> > 
> 
> Right.
> 
> > >          nr_napi_schedule   31805                174697
> > >          total_packets      92354                187408
> > >          total_reqs         1200793              2392614
> > >
> > >netperf  Tput:            5.8Gb/s             10.5Gb/s
> > >          PPI               2.13                   1.00
> > >          SPI               36.70                  16.73
> > >          PPN               2.13                   1.31
> > >          SPN               36.70                  16.75
> > >          tx_count           57635                205599
> > >          nr_napi_schedule   57633                205311
> > >          total_packets      122800               270254
> > >          total_reqs         2115068              3439751
> > >
> > >   PPI: packets processed per interrupt
> > >   SPI: slots processed per interrupt
> > >   PPN: packets processed per napi schedule
> > >   SPN: slots processed per napi schedule
> > >   tx_count: interrupt count
> > >   total_reqs: total slots used during test
> > >
> > >* Notification and batching
> > >
> > >Is notification and batching really a problem? I'm not so sure now. My
> > >first thought when I didn't measure PPI / PPN / SPI / SPN in copying
> > >case was that "in that case netback *must* have better batching" which
> > >turned out not very true -- copying mode makes netback slower, however
> > >the batching gained is not hugh.
> > >
> > >Ideally we still want to batch as much as possible. Possible way
> > >includes playing with the 'weight' parameter in NAPI. But as the figures
> > >show batching seems not to be very important for throughput, at least
> > >for now. If the NAPI framework and netfront / netback are doing their
> > >jobs as designed we might not need to worry about this now.
> > >
> > >Andrew, do you have any thought on this? You found out that NAPI didn't 
> > >scale well with multi-threaded iperf in DomU, do you have any handle how
> > >that can happen?
> > >
> > >* Thoughts on zero-copy TX
> > >
> > >With this hack we are able to achieve 10Gb/s single stream, which is
> > >good. But, with classic XenoLinux kernel which has zero copy TX we
> > >didn't able to achieve this.  I also developed another zero copy netback
> > >prototype one year ago with Ian's out-of-tree skb frag destructor patch
> > >series. That prototype couldn't achieve 10Gb/s either (IIRC the
> > >performance was more or less the same as copying mode, about 6~7Gb/s).
> > >
> > >My hack maps all necessary pages permantently, there is no unmap, we
> > >skip lots of page table manipulation and TLB flushes. So my basic
> > >conclusion is that page table manipulation and TLB flushes do incur
> > >heavy performance penalty.
> > >
> > >This hack can be upstreamed in no way. If we're to re-introduce
> > >zero-copy TX, we would need to implement some sort of lazy flushing
> > >mechanism. I haven't thought this through. Presumably this mechanism
> > >would also benefit blk somehow? I'm not sure yet.
> > >
> > >Could persistent mapping (with the to-be-developed reclaim / MRU list
> > >mechanism) be useful here? So that we can unify blk and net drivers?
> > >
> > >* Changes required to introduce zero-copy TX
> > >
> > >1. SKB frag destructor series: to track life cycle of SKB frags. This is
> > >not yet upstreamed.
> > 
> > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
> > 
> > <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html>
> > 
> 
> Yes. But I believe there's been several versions posted. The link you
> have is not the latest version.
> 
> > >
> > >2. Mechanism to negotiate max slots frontend can use: mapping requires
> > >backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
> > >
> > >3. Lazy flushing mechanism or persistent grants: ???
> > 
> > I did some test with persistent grants before, it did not show
> > better performance than grant copy. But I was using the default
> > params of netperf, and not tried large packet size. Your results
> > reminds me that maybe persistent grants would get similar results
> > with larger packet size too.
> > 
> 
> "No better performance" -- that's because both mechanisms are copying?
> However I presume persistent grant can scale better? From an earlier
> email last week, I read that copying is done by the guest so that this
> mechanism scales much better than hypervisor copying in blk's case.

Yes, I always expected persistent grants to be faster than
gnttab_copy, but I was very surprised by the difference in performance:

http://marc.info/?l=xen-devel&m=137234605929944

I think it's worth trying persistent grants on PV network, although it's
very unlikely that they are going to improve the throughput by 5 Gb/s.

Also once we have both PV block and network using persistent grants,
we might hit the grant table limit; see this email:

http://marc.info/?l=xen-devel&m=137183474618974

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Interesting observation with network event notification and batching
  2013-07-01 14:29       ` Stefano Stabellini
@ 2013-07-01 14:39         ` Wei Liu
  2013-07-01 14:54           ` Stefano Stabellini
  0 siblings, 1 reply; 27+ messages in thread
From: Wei Liu @ 2013-07-01 14:39 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Liu, ian.campbell, xen-devel, annie li, andrew.bennieston

On Mon, Jul 01, 2013 at 03:29:45PM +0100, Stefano Stabellini wrote:
> On Mon, 1 Jul 2013, Wei Liu wrote:
> > On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
> > > 
> > > On 2013-6-29 0:15, Wei Liu wrote:
> > > >Hi all,
> > > >
> > > >After collecting more stats and comparing copying / mapping cases, I now
> > > >have some more interesting finds, which might contradict what I said
> > > >before.
> > > >
> > > >I tuned the runes I used for benchmark to make sure iperf and netperf
> > > >generate large packets (~64K). Here are the runes I use:
> > > >
> > > >   iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
> > > >   netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
> > > >
> > > >                           COPY                    MAP
> > > >iperf    Tput:             6.5Gb/s             14Gb/s (was 2.5Gb/s)
> > > 
> > > So with default iperf setting, copy is about 7.9G, and map is about
> > > 2.5G? How about the result of netperf without large packets?
> > > 
> > 
> > First question, yes.
> > 
> > Second question, 5.8Gb/s. And I believe for the copying scheme without
> > large packet the throuput is more or less the same.
> > 
> > > >          PPI               2.90                  1.07
> > > >          SPI               37.75                 13.69
> > > >          PPN               2.90                  1.07
> > > >          SPN               37.75                 13.69
> > > >          tx_count           31808                174769
> > > 
> > > Seems interrupt count does not affect the performance at all with -l
> > > 131072 -w 128k.
> > > 
> > 
> > Right.
> > 
> > > >          nr_napi_schedule   31805                174697
> > > >          total_packets      92354                187408
> > > >          total_reqs         1200793              2392614
> > > >
> > > >netperf  Tput:            5.8Gb/s             10.5Gb/s
> > > >          PPI               2.13                   1.00
> > > >          SPI               36.70                  16.73
> > > >          PPN               2.13                   1.31
> > > >          SPN               36.70                  16.75
> > > >          tx_count           57635                205599
> > > >          nr_napi_schedule   57633                205311
> > > >          total_packets      122800               270254
> > > >          total_reqs         2115068              3439751
> > > >
> > > >   PPI: packets processed per interrupt
> > > >   SPI: slots processed per interrupt
> > > >   PPN: packets processed per napi schedule
> > > >   SPN: slots processed per napi schedule
> > > >   tx_count: interrupt count
> > > >   total_reqs: total slots used during test
> > > >
> > > >* Notification and batching
> > > >
> > > >Is notification and batching really a problem? I'm not so sure now. My
> > > >first thought when I didn't measure PPI / PPN / SPI / SPN in copying
> > > >case was that "in that case netback *must* have better batching" which
> > > >turned out not very true -- copying mode makes netback slower, however
> > > >the batching gained is not hugh.
> > > >
> > > >Ideally we still want to batch as much as possible. Possible way
> > > >includes playing with the 'weight' parameter in NAPI. But as the figures
> > > >show batching seems not to be very important for throughput, at least
> > > >for now. If the NAPI framework and netfront / netback are doing their
> > > >jobs as designed we might not need to worry about this now.
> > > >
> > > >Andrew, do you have any thought on this? You found out that NAPI didn't 
> > > >scale well with multi-threaded iperf in DomU, do you have any handle how
> > > >that can happen?
> > > >
> > > >* Thoughts on zero-copy TX
> > > >
> > > >With this hack we are able to achieve 10Gb/s single stream, which is
> > > >good. But, with classic XenoLinux kernel which has zero copy TX we
> > > >didn't able to achieve this.  I also developed another zero copy netback
> > > >prototype one year ago with Ian's out-of-tree skb frag destructor patch
> > > >series. That prototype couldn't achieve 10Gb/s either (IIRC the
> > > >performance was more or less the same as copying mode, about 6~7Gb/s).
> > > >
> > > >My hack maps all necessary pages permantently, there is no unmap, we
> > > >skip lots of page table manipulation and TLB flushes. So my basic
> > > >conclusion is that page table manipulation and TLB flushes do incur
> > > >heavy performance penalty.
> > > >
> > > >This hack can be upstreamed in no way. If we're to re-introduce
> > > >zero-copy TX, we would need to implement some sort of lazy flushing
> > > >mechanism. I haven't thought this through. Presumably this mechanism
> > > >would also benefit blk somehow? I'm not sure yet.
> > > >
> > > >Could persistent mapping (with the to-be-developed reclaim / MRU list
> > > >mechanism) be useful here? So that we can unify blk and net drivers?
> > > >
> > > >* Changes required to introduce zero-copy TX
> > > >
> > > >1. SKB frag destructor series: to track life cycle of SKB frags. This is
> > > >not yet upstreamed.
> > > 
> > > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
> > > 
> > > <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html>
> > > 
> > 
> > Yes. But I believe there's been several versions posted. The link you
> > have is not the latest version.
> > 
> > > >
> > > >2. Mechanism to negotiate max slots frontend can use: mapping requires
> > > >backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
> > > >
> > > >3. Lazy flushing mechanism or persistent grants: ???
> > > 
> > > I did some test with persistent grants before, it did not show
> > > better performance than grant copy. But I was using the default
> > > params of netperf, and not tried large packet size. Your results
> > > reminds me that maybe persistent grants would get similar results
> > > with larger packet size too.
> > > 
> > 
> > "No better performance" -- that's because both mechanisms are copying?
> > However I presume persistent grant can scale better? From an earlier
> > email last week, I read that copying is done by the guest so that this
> > mechanism scales much better than hypervisor copying in blk's case.
> 
> Yes, I always expected persistent grants to be faster then
> gnttab_copy but I was very surprised by the difference in performances:
> 
> http://marc.info/?l=xen-devel&m=137234605929944
> 
> I think it's worth trying persistent grants on PV network, although it's
> very unlikely that they are going to improve the throughput by 5 Gb/s.
> 

I think it can improve aggregate throughput, however it's not likely to
improve single-stream throughput.

> Also once we have both PV block and network using persistent grants,
> we might incur the grant table limit, see this email:
> 
> http://marc.info/?l=xen-devel&m=137183474618974

Yes, indeed.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Interesting observation with network event notification and batching
  2013-07-01 14:39         ` Wei Liu
@ 2013-07-01 14:54           ` Stefano Stabellini
  0 siblings, 0 replies; 27+ messages in thread
From: Stefano Stabellini @ 2013-07-01 14:54 UTC (permalink / raw)
  To: Wei Liu
  Cc: ian.campbell, Stefano Stabellini, xen-devel, annie li, andrew.bennieston

On Mon, 1 Jul 2013, Wei Liu wrote:
> On Mon, Jul 01, 2013 at 03:29:45PM +0100, Stefano Stabellini wrote:
> > On Mon, 1 Jul 2013, Wei Liu wrote:
> > > On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
> > > > 
> > > > On 2013-6-29 0:15, Wei Liu wrote:
> > > > >Hi all,
> > > > >
> > > > >After collecting more stats and comparing copying / mapping cases, I now
> > > > >have some more interesting finds, which might contradict what I said
> > > > >before.
> > > > >
> > > > >I tuned the runes I used for benchmark to make sure iperf and netperf
> > > > >generate large packets (~64K). Here are the runes I use:
> > > > >
> > > > >   iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
> > > > >   netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
> > > > >
> > > > >                           COPY                    MAP
> > > > >iperf    Tput:             6.5Gb/s             14Gb/s (was 2.5Gb/s)
> > > > 
> > > > So with default iperf setting, copy is about 7.9G, and map is about
> > > > 2.5G? How about the result of netperf without large packets?
> > > > 
> > > 
> > > First question, yes.
> > > 
> > > Second question, 5.8Gb/s. And I believe for the copying scheme without
> > > large packet the throuput is more or less the same.
> > > 
> > > > >          PPI               2.90                  1.07
> > > > >          SPI               37.75                 13.69
> > > > >          PPN               2.90                  1.07
> > > > >          SPN               37.75                 13.69
> > > > >          tx_count           31808                174769
> > > > 
> > > > Seems interrupt count does not affect the performance at all with -l
> > > > 131072 -w 128k.
> > > > 
> > > 
> > > Right.
> > > 
> > > > >          nr_napi_schedule   31805                174697
> > > > >          total_packets      92354                187408
> > > > >          total_reqs         1200793              2392614
> > > > >
> > > > >netperf  Tput:            5.8Gb/s             10.5Gb/s
> > > > >          PPI               2.13                   1.00
> > > > >          SPI               36.70                  16.73
> > > > >          PPN               2.13                   1.31
> > > > >          SPN               36.70                  16.75
> > > > >          tx_count           57635                205599
> > > > >          nr_napi_schedule   57633                205311
> > > > >          total_packets      122800               270254
> > > > >          total_reqs         2115068              3439751
> > > > >
> > > > >   PPI: packets processed per interrupt
> > > > >   SPI: slots processed per interrupt
> > > > >   PPN: packets processed per napi schedule
> > > > >   SPN: slots processed per napi schedule
> > > > >   tx_count: interrupt count
> > > > >   total_reqs: total slots used during test
> > > > >
> > > > >* Notification and batching
> > > > >
> > > > >Is notification and batching really a problem? I'm not so sure now. My
> > > > >first thought when I didn't measure PPI / PPN / SPI / SPN in copying
> > > > >case was that "in that case netback *must* have better batching" which
> > > > >turned out not very true -- copying mode makes netback slower, however
> > > > >the batching gained is not hugh.
> > > > >
> > > > >Ideally we still want to batch as much as possible. Possible way
> > > > >includes playing with the 'weight' parameter in NAPI. But as the figures
> > > > >show batching seems not to be very important for throughput, at least
> > > > >for now. If the NAPI framework and netfront / netback are doing their
> > > > >jobs as designed we might not need to worry about this now.
> > > > >
> > > > >Andrew, do you have any thought on this? You found out that NAPI didn't 
> > > > >scale well with multi-threaded iperf in DomU, do you have any handle how
> > > > >that can happen?
> > > > >
> > > > >* Thoughts on zero-copy TX
> > > > >
> > > > >With this hack we are able to achieve 10Gb/s single stream, which is
> > > > >good. But, with classic XenoLinux kernel which has zero copy TX we
> > > > >didn't able to achieve this.  I also developed another zero copy netback
> > > > >prototype one year ago with Ian's out-of-tree skb frag destructor patch
> > > > >series. That prototype couldn't achieve 10Gb/s either (IIRC the
> > > > >performance was more or less the same as copying mode, about 6~7Gb/s).
> > > > >
> > > > >My hack maps all necessary pages permantently, there is no unmap, we
> > > > >skip lots of page table manipulation and TLB flushes. So my basic
> > > > >conclusion is that page table manipulation and TLB flushes do incur
> > > > >heavy performance penalty.
> > > > >
> > > > >This hack can be upstreamed in no way. If we're to re-introduce
> > > > >zero-copy TX, we would need to implement some sort of lazy flushing
> > > > >mechanism. I haven't thought this through. Presumably this mechanism
> > > > >would also benefit blk somehow? I'm not sure yet.
> > > > >
> > > > >Could persistent mapping (with the to-be-developed reclaim / MRU list
> > > > >mechanism) be useful here? So that we can unify blk and net drivers?
> > > > >
> > > > >* Changes required to introduce zero-copy TX
> > > > >
> > > > >1. SKB frag destructor series: to track life cycle of SKB frags. This is
> > > > >not yet upstreamed.
> > > > 
> > > > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
> > > > 
> > > > <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html>
> > > > 
> > > 
> > > Yes. But I believe there's been several versions posted. The link you
> > > have is not the latest version.
> > > 
> > > > >
> > > > >2. Mechanism to negotiate max slots frontend can use: mapping requires
> > > > >backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
> > > > >
> > > > >3. Lazy flushing mechanism or persistent grants: ???
> > > > 
> > > > I did some test with persistent grants before, it did not show
> > > > better performance than grant copy. But I was using the default
> > > > params of netperf, and not tried large packet size. Your results
> > > > reminds me that maybe persistent grants would get similar results
> > > > with larger packet size too.
> > > > 
> > > 
> > > "No better performance" -- that's because both mechanisms are copying?
> > > However I presume persistent grant can scale better? From an earlier
> > > email last week, I read that copying is done by the guest so that this
> > > mechanism scales much better than hypervisor copying in blk's case.
> > 
> > Yes, I always expected persistent grants to be faster then
> > gnttab_copy but I was very surprised by the difference in performances:
> > 
> > http://marc.info/?l=xen-devel&m=137234605929944
> > 
> > I think it's worth trying persistent grants on PV network, although it's
> > very unlikely that they are going to improve the throughput by 5 Gb/s.
> > 
> 
> I think it can improve aggregated throughput, however its not likely to
> improve single stream throughput.

you are probably right

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Interesting observation with network event notification and batching
  2013-07-01  8:54     ` Wei Liu
  2013-07-01 14:29       ` Stefano Stabellini
@ 2013-07-01 15:59       ` annie li
  2013-07-01 16:06         ` Wei Liu
  1 sibling, 1 reply; 27+ messages in thread
From: annie li @ 2013-07-01 15:59 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, andrew.bennieston, ian.campbell, stefano.stabellini


On 2013-7-1 16:54, Wei Liu wrote:
> On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
>> On 2013-6-29 0:15, Wei Liu wrote:
>>> Hi all,
>>>
>>> After collecting more stats and comparing copying / mapping cases, I now
>>> have some more interesting finds, which might contradict what I said
>>> before.
>>>
>>> I tuned the runes I used for benchmark to make sure iperf and netperf
>>> generate large packets (~64K). Here are the runes I use:
>>>
>>>    iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
>>>    netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
>>>
>>>                            COPY                    MAP
>>> iperf    Tput:             6.5Gb/s             14Gb/s (was 2.5Gb/s)
>> So with default iperf setting, copy is about 7.9G, and map is about
>> 2.5G? How about the result of netperf without large packets?
>>
> First question, yes.
>
> Second question, 5.8Gb/s. And I believe for the copying scheme without
> large packet the throuput is more or less the same.
>
>>>           PPI               2.90                  1.07
>>>           SPI               37.75                 13.69
>>>           PPN               2.90                  1.07
>>>           SPN               37.75                 13.69
>>>           tx_count           31808                174769
>> Seems interrupt count does not affect the performance at all with -l
>> 131072 -w 128k.
>>
> Right.
>
>>>           nr_napi_schedule   31805                174697
>>>           total_packets      92354                187408
>>>           total_reqs         1200793              2392614
>>>
>>> netperf  Tput:            5.8Gb/s             10.5Gb/s
>>>           PPI               2.13                   1.00
>>>           SPI               36.70                  16.73
>>>           PPN               2.13                   1.31
>>>           SPN               36.70                  16.75
>>>           tx_count           57635                205599
>>>           nr_napi_schedule   57633                205311
>>>           total_packets      122800               270254
>>>           total_reqs         2115068              3439751
>>>
>>>    PPI: packets processed per interrupt
>>>    SPI: slots processed per interrupt
>>>    PPN: packets processed per napi schedule
>>>    SPN: slots processed per napi schedule
>>>    tx_count: interrupt count
>>>    total_reqs: total slots used during test
>>>
>>> * Notification and batching
>>>
>>> Is notification and batching really a problem? I'm not so sure now. My
>>> first thought when I didn't measure PPI / PPN / SPI / SPN in copying
>>> case was that "in that case netback *must* have better batching" which
>>> turned out not very true -- copying mode makes netback slower, however
>>> the batching gained is not hugh.
>>>
>>> Ideally we still want to batch as much as possible. Possible way
>>> includes playing with the 'weight' parameter in NAPI. But as the figures
>>> show batching seems not to be very important for throughput, at least
>>> for now. If the NAPI framework and netfront / netback are doing their
>>> jobs as designed we might not need to worry about this now.
>>>
>>> Andrew, do you have any thought on this? You found out that NAPI didn't
>>> scale well with multi-threaded iperf in DomU, do you have any handle how
>>> that can happen?
>>>
>>> * Thoughts on zero-copy TX
>>>
>>> With this hack we are able to achieve 10Gb/s single stream, which is
>>> good. But, with classic XenoLinux kernel which has zero copy TX we
>>> didn't able to achieve this.  I also developed another zero copy netback
>>> prototype one year ago with Ian's out-of-tree skb frag destructor patch
>>> series. That prototype couldn't achieve 10Gb/s either (IIRC the
>>> performance was more or less the same as copying mode, about 6~7Gb/s).
>>>
>>> My hack maps all necessary pages permantently, there is no unmap, we
>>> skip lots of page table manipulation and TLB flushes. So my basic
>>> conclusion is that page table manipulation and TLB flushes do incur
>>> heavy performance penalty.
>>>
>>> This hack can be upstreamed in no way. If we're to re-introduce
>>> zero-copy TX, we would need to implement some sort of lazy flushing
>>> mechanism. I haven't thought this through. Presumably this mechanism
>>> would also benefit blk somehow? I'm not sure yet.
>>>
>>> Could persistent mapping (with the to-be-developed reclaim / MRU list
>>> mechanism) be useful here? So that we can unify blk and net drivers?
>>>
>>> * Changes required to introduce zero-copy TX
>>>
>>> 1. SKB frag destructor series: to track life cycle of SKB frags. This is
>>> not yet upstreamed.
>> Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
>>
>> <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html>
>>
> Yes. But I believe there's been several versions posted. The link you
> have is not the latest version.
>
>>> 2. Mechanism to negotiate max slots frontend can use: mapping requires
>>> backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
>>>
>>> 3. Lazy flushing mechanism or persistent grants: ???
>> I did some test with persistent grants before, it did not show
>> better performance than grant copy. But I was using the default
>> params of netperf, and not tried large packet size. Your results
>> reminds me that maybe persistent grants would get similar results
>> with larger packet size too.
>>
> "No better performance" -- that's because both mechanisms are copying?
> However I presume persistent grant can scale better? From an earlier
> email last week, I read that copying is done by the guest so that this
> mechanism scales much better than hypervisor copying in blk's case.

The original persistent grants patch does a memcpy on both the netback
and netfront sides. I am thinking the performance might become better if
the memcpy is removed from netfront.
Moreover, I also have a feeling that our persistent grants numbers were
taken with the default netperf params, just like Wei's hack which does
not get better performance without large packets. So let me try some
tests with large packets.

Thanks
Annie

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Interesting observation with network event notification and batching
  2013-07-01 14:19     ` Stefano Stabellini
@ 2013-07-01 15:59       ` annie li
  0 siblings, 0 replies; 27+ messages in thread
From: annie li @ 2013-07-01 15:59 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: andrew.bennieston, Wei Liu, ian.campbell, xen-devel


On 2013-7-1 22:19, Stefano Stabellini wrote:
> Could you please use plain text emails in the future?

Sure, sorry about that.

Thanks
Annie
>
> On Mon, 1 Jul 2013, annie li wrote:
>> On 2013-6-29 0:15, Wei Liu wrote:
>>
>> Hi all,
>>
>> After collecting more stats and comparing copying / mapping cases, I now
>> have some more interesting finds, which might contradict what I said
>> before.
>>
>> I tuned the runes I used for benchmark to make sure iperf and netperf
>> generate large packets (~64K). Here are the runes I use:
>>
>>    iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
>>    netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
>>
>>                            COPY                    MAP
>> iperf    Tput:             6.5Gb/s             14Gb/s (was 2.5Gb/s)
>>
>>
>> So with default iperf setting, copy is about 7.9G, and map is about 2.5G? How about the result of netperf without large packets?
>>
>>           PPI               2.90                  1.07
>>           SPI               37.75                 13.69
>>           PPN               2.90                  1.07
>>           SPN               37.75                 13.69
>>           tx_count           31808                174769
>>
>>
>> Seems interrupt count does not affect the performance at all with -l 131072 -w 128k.
>>
>>           nr_napi_schedule   31805                174697
>>           total_packets      92354                187408
>>           total_reqs         1200793              2392614
>>
>> netperf  Tput:            5.8Gb/s             10.5Gb/s
>>           PPI               2.13                   1.00
>>           SPI               36.70                  16.73
>>           PPN               2.13                   1.31
>>           SPN               36.70                  16.75
>>           tx_count           57635                205599
>>           nr_napi_schedule   57633                205311
>>           total_packets      122800               270254
>>           total_reqs         2115068              3439751
>>
>>    PPI: packets processed per interrupt
>>    SPI: slots processed per interrupt
>>    PPN: packets processed per napi schedule
>>    SPN: slots processed per napi schedule
>>    tx_count: interrupt count
>>    total_reqs: total slots used during test
>>
>> * Notification and batching
>>
>> Is notification and batching really a problem? I'm not so sure now. My
>> first thought when I didn't measure PPI / PPN / SPI / SPN in copying
>> case was that "in that case netback *must* have better batching" which
>> turned out not very true -- copying mode makes netback slower, however
>> the batching gained is not hugh.
>>
>> Ideally we still want to batch as much as possible. Possible way
>> includes playing with the 'weight' parameter in NAPI. But as the figures
>> show batching seems not to be very important for throughput, at least
>> for now. If the NAPI framework and netfront / netback are doing their
>> jobs as designed we might not need to worry about this now.
>>
>> Andrew, do you have any thought on this? You found out that NAPI didn't
>> scale well with multi-threaded iperf in DomU, do you have any handle how
>> that can happen?
>>
>> * Thoughts on zero-copy TX
>>
>> With this hack we are able to achieve 10Gb/s single stream, which is
>> good. But, with classic XenoLinux kernel which has zero copy TX we
>> didn't able to achieve this.  I also developed another zero copy netback
>> prototype one year ago with Ian's out-of-tree skb frag destructor patch
>> series. That prototype couldn't achieve 10Gb/s either (IIRC the
>> performance was more or less the same as copying mode, about 6~7Gb/s).
>>
>> My hack maps all necessary pages permantently, there is no unmap, we
>> skip lots of page table manipulation and TLB flushes. So my basic
>> conclusion is that page table manipulation and TLB flushes do incur
>> heavy performance penalty.
>>
>> This hack can be upstreamed in no way. If we're to re-introduce
>> zero-copy TX, we would need to implement some sort of lazy flushing
>> mechanism. I haven't thought this through. Presumably this mechanism
>> would also benefit blk somehow? I'm not sure yet.
>>
>> Could persistent mapping (with the to-be-developed reclaim / MRU list
>> mechanism) be useful here? So that we can unify blk and net drivers?
>>
>> * Changes required to introduce zero-copy TX
>>
>> 1. SKB frag destructor series: to track life cycle of SKB frags. This is
>> not yet upstreamed.
>>
>>
>> Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
>>
>>
>> 2. Mechanism to negotiate max slots frontend can use: mapping requires
>> backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
>>
>> 3. Lazy flushing mechanism or persistent grants: ???
>>
>>
>> I did some tests with persistent grants before; they did not show
>> better performance than grant copy. But I was using the default
>> netperf params and did not try large packet sizes. Your results
>> remind me that maybe persistent grants would get similar results
>> with larger packet sizes too.
>>
>> Thanks
>> Annie
>>
>>
>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Interesting observation with network event notification and batching
  2013-07-01 15:59       ` annie li
@ 2013-07-01 16:06         ` Wei Liu
  2013-07-01 16:53           ` Andrew Bennieston
  0 siblings, 1 reply; 27+ messages in thread
From: Wei Liu @ 2013-07-01 16:06 UTC (permalink / raw)
  To: annie li
  Cc: andrew.bennieston, xen-devel, Wei Liu, ian.campbell, stefano.stabellini

On Mon, Jul 01, 2013 at 11:59:08PM +0800, annie li wrote:
[...]
> >>>1. SKB frag destructor series: to track life cycle of SKB frags. This is
> >>>not yet upstreamed.
> >>Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
> >>
> >><http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html>
> >>
> >Yes. But I believe there's been several versions posted. The link you
> >have is not the latest version.
> >
> >>>2. Mechanism to negotiate max slots frontend can use: mapping requires
> >>>backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
> >>>
> >>>3. Lazy flushing mechanism or persistent grants: ???
> >>I did some test with persistent grants before, it did not show
> >>better performance than grant copy. But I was using the default
> >>params of netperf, and not tried large packet size. Your results
> >>reminds me that maybe persistent grants would get similar results
> >>with larger packet size too.
> >>
> >"No better performance" -- that's because both mechanisms are copying?
> >However I presume persistent grant can scale better? From an earlier
> >email last week, I read that copying is done by the guest so that this
> >mechanism scales much better than hypervisor copying in blk's case.
> 
> The original persistent patch does memcpy in both netback and
> netfront side. I am thinking maybe the performance can become better
> if removing the memcpy from netfront.

I would say that removing the copy on the netback side would scale better.
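
A rough sketch of what that could look like on the backend side; names
are hypothetical, and persistent_map() stands in for a
map-once-and-cache path that does not exist:

#include <linux/hashtable.h>
#include <xen/grant_table.h>

struct persistent_gnt {
	struct hlist_node node;
	grant_ref_t gref;
	grant_handle_t handle;
	struct page *page;
};

static DEFINE_HASHTABLE(pgnt_hash, 8);	/* per-vif in a real design */

/* Hypothetical: map the gref RO and insert it into the hash. */
extern struct page *persistent_map(grant_ref_t gref);

/* Look a gref up; only the first sighting needs a grant map operation,
 * later requests reuse the mapping with no copy and no hypercall. */
static struct page *persistent_lookup(grant_ref_t gref)
{
	struct persistent_gnt *p;

	hash_for_each_possible(pgnt_hash, p, node, gref)
		if (p->gref == gref)
			return p->page;

	return persistent_map(gref);
}
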

> Moreover, I also have a feeling that we got persistent grant
> performance based on default netperf params test, just like wei's
> hack which does not get better performance without large packets. So
> let me try some test with large packets though.
> 

Sadly enough, I found out today that these sorts of tests seem to be
quite inconsistent. On an Intel 10G NIC the throughput is actually
higher without forcing iperf / netperf to generate large packets.


Wei.

> Thanks
> Annie

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Interesting observation with network event notification and batching
  2013-07-01 16:06         ` Wei Liu
@ 2013-07-01 16:53           ` Andrew Bennieston
  2013-07-01 17:55             ` Wei Liu
  2013-07-03 15:18             ` Wei Liu
  0 siblings, 2 replies; 27+ messages in thread
From: Andrew Bennieston @ 2013-07-01 16:53 UTC (permalink / raw)
  To: Wei Liu; +Cc: annie li, xen-devel, ian.campbell, stefano.stabellini

On 01/07/13 17:06, Wei Liu wrote:
> On Mon, Jul 01, 2013 at 11:59:08PM +0800, annie li wrote:
> [...]
>>>>> 1. SKB frag destructor series: to track life cycle of SKB frags. This is
>>>>> not yet upstreamed.
>>>> Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
>>>>
>>>> <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html>
>>>>
>>> Yes. But I believe there's been several versions posted. The link you
>>> have is not the latest version.
>>>
>>>>> 2. Mechanism to negotiate max slots frontend can use: mapping requires
>>>>> backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
>>>>>
>>>>> 3. Lazy flushing mechanism or persistent grants: ???
>>>> I did some test with persistent grants before, it did not show
>>>> better performance than grant copy. But I was using the default
>>>> params of netperf, and not tried large packet size. Your results
>>>> reminds me that maybe persistent grants would get similar results
>>>> with larger packet size too.
>>>>
>>> "No better performance" -- that's because both mechanisms are copying?
>>> However I presume persistent grant can scale better? From an earlier
>>> email last week, I read that copying is done by the guest so that this
>>> mechanism scales much better than hypervisor copying in blk's case.
>>
>> The original persistent patch does memcpy in both netback and
>> netfront side. I am thinking maybe the performance can become better
>> if removing the memcpy from netfront.
>
> I would say that removing copy in netback can scale better.
>
>> Moreover, I also have a feeling that we got persistent grant
>> performance based on default netperf params test, just like wei's
>> hack which does not get better performance without large packets. So
>> let me try some test with large packets though.
>>
>
> Sadly enough, I found out today these sort of test seems to be quite
> inconsistent. On a Intel 10G Nic the throughput is actually higher
> without enforcing iperf / netperf to generate large packets.

When I made performance measurements using iperf, I found that for 
a given point in the parameter space (e.g. for a fixed number of guests, 
interfaces, fixed parameters to iperf, fixed test run duration, etc.) 
the variation was typically _smaller than_ +/- 1 Gbit/s on a 10G NIC.

I notice that your results don't include any error bars or indication of 
standard deviation...

With this sort of data (or, really, any data), measuring at least 5 
times helps to get an idea of the fluctuations present (i.e. of the 
statistical uncertainty), which can then be quoted as a mean +/- 
standard deviation. 
Having the standard deviation (or other estimator for the uncertainty in 
the results) allows us to better determine how significant this 
difference in results really is.

For example, is the high throughput you quoted (~ 14 Gbit/s) an upward 
fluctuation, and the low value (~6) a downward fluctuation? Having a 
mean and standard deviation would allow us to determine just how 
(in)compatible these values are.

Assuming a Gaussian distribution (and, when sampled sufficiently many 
times, "everything" tends to a Gaussian) you have an almost 5% chance that a 
result lies more than 2 standard deviations from the mean (and a 0.3% 
chance that it lies more than 3 s.d. from the mean!). Results that 
appear "high" or "low" may, therefore, not be entirely unexpected. 
Having a measure of the standard deviation provides some basis against 
which to determine how likely it is that a measured value is just 
statistical fluctuation, or whether it is a significant result.
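
For what it's worth, the arithmetic is cheap to script; a throwaway
sketch along these lines is enough (sample standard deviation, i.e.
dividing by n - 1; the throughput figures below are placeholders):

#include <math.h>
#include <stdio.h>

static void mean_sd(const double *x, int n, double *mean, double *sd)
{
	double sum = 0.0, sq = 0.0;
	int i;

	for (i = 0; i < n; i++)
		sum += x[i];
	*mean = sum / n;

	for (i = 0; i < n; i++)
		sq += (x[i] - *mean) * (x[i] - *mean);
	*sd = sqrt(sq / (n - 1));	/* sample standard deviation */
}

int main(void)
{
	double runs[] = { 9.4, 9.6, 9.1, 9.5, 9.3 };	/* five runs, Gbit/s */
	double m, s;

	mean_sd(runs, sizeof(runs) / sizeof(runs[0]), &m, &s);
	printf("%.2f +/- %.3f Gbit/s\n", m, s);
	return 0;
}
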

Another thing I noticed is that you're running the iperf test for only 5 
seconds. I have found in the past that iperf (or, more likely, TCP) 
takes a while to "ramp up" (even with all parameters fixed e.g. "-l 
<size> -w <size>") and that tests run for 2 minutes or more (e.g. "-t 
120") give much more stable results.

Andrew.

>
>
> Wei.
>
>> Thanks
>> Annie

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Interesting observation with network event notification and batching
  2013-07-01 16:53           ` Andrew Bennieston
@ 2013-07-01 17:55             ` Wei Liu
  2013-07-03 15:18             ` Wei Liu
  1 sibling, 0 replies; 27+ messages in thread
From: Wei Liu @ 2013-07-01 17:55 UTC (permalink / raw)
  To: Andrew Bennieston
  Cc: annie li, xen-devel, Wei Liu, ian.campbell, stefano.stabellini

On Mon, Jul 01, 2013 at 05:53:27PM +0100, Andrew Bennieston wrote:
[...]
> >
> >Sadly enough, I found out today these sort of test seems to be quite
> >inconsistent. On a Intel 10G Nic the throughput is actually higher
> >without enforcing iperf / netperf to generate large packets.
> 
> When I have made performance measurements using iperf, I found that
> for a given point in the parameter space (e.g. for a fixed number of
> guests, interfaces, fixed parameters to iperf, fixed test run
> duration, etc.) the variation was typically _smaller than_ +/- 1
> Gbit/s on a 10G NIC.
> 

I was talking about virtual interfaces vs. real hardware. The parameters
that maximize throughput in one case don't seem to work for the other.
The deviation for a specific interface is rather small.

> I notice that your results don't include any error bars or
> indication of standard deviation...
> 
> With this sort of data (or, really, any data) measuring at least 5
> times will help to get an idea of the fluctuations present (i.e. a
> measure of statistical uncertainty) by quoting a mean +/- standard
> deviation. Having the standard deviation (or other estimator for the
> uncertainty in the results) allows us to better determine how
> significant this difference in results really is.
> 
> For example, is the high throughput you quoted (~ 14 Gbit/s) an
> upward fluctuation, and the low value (~6) a downward fluctuation?
> Having a mean and standard deviation would allow us to determine
> just how (in)compatible these values are.
> 

I ran those tests several times and picked the number that appeared
most often. Anyway, I will try to come up with graphs that visualize
the results better.

> Assuming a Gaussian distribution (and when sampled sufficient times,
> "everything" tends to a Gaussian) you have an almost 5% chance that
> a result lies more than 2 standard deviations from the mean (and a
> 0.3% chance that it lies more than 3 s.d. from the mean!). Results
> that appear "high" or "low" may, therefore, not be entirely
> unexpected. Having a measure of the standard deviation provides some
> basis against which to determine how likely it is that a measured
> value is just statistical fluctuation, or whether it is a
> significant result.
> 
> Another thing I noticed is that you're running the iperf test for
> only 5 seconds. I have found in the past that iperf (or, more
> likely, TCP) takes a while to "ramp up" (even with all parameters
> fixed e.g. "-l <size> -w <size>") and that tests run for 2 minutes
> or more (e.g. "-t 120") give much more stable results.
> 

Hmm... for me the length of the test doesn't make much difference,
which is why I chose such a short time. But since you mention it, I
intend to run the tests a bit longer.

> Andrew.
> 
> >
> >
> >Wei.
> >
> >>Thanks
> >>Annie

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Interesting observation with network event notification and batching
  2013-07-01 16:53           ` Andrew Bennieston
  2013-07-01 17:55             ` Wei Liu
@ 2013-07-03 15:18             ` Wei Liu
  1 sibling, 0 replies; 27+ messages in thread
From: Wei Liu @ 2013-07-03 15:18 UTC (permalink / raw)
  To: Andrew Bennieston
  Cc: annie li, xen-devel, Wei Liu, ian.campbell, stefano.stabellini

On Mon, Jul 01, 2013 at 05:53:27PM +0100, Andrew Bennieston wrote:
[...]
> >I would say that removing copy in netback can scale better.
> >
> >>Moreover, I also have a feeling that we got persistent grant
> >>performance based on default netperf params test, just like wei's
> >>hack which does not get better performance without large packets. So
> >>let me try some test with large packets though.
> >>
> >
> >Sadly enough, I found out today these sort of test seems to be quite
> >inconsistent. On a Intel 10G Nic the throughput is actually higher
> >without enforcing iperf / netperf to generate large packets.
> 
> When I have made performance measurements using iperf, I found that
> for a given point in the parameter space (e.g. for a fixed number of
> guests, interfaces, fixed parameters to iperf, fixed test run
> duration, etc.) the variation was typically _smaller than_ +/- 1
> Gbit/s on a 10G NIC.
> 
> I notice that your results don't include any error bars or
> indication of standard deviation...
> 
> With this sort of data (or, really, any data) measuring at least 5
> times will help to get an idea of the fluctuations present (i.e. a
> measure of statistical uncertainty) by quoting a mean +/- standard
> deviation. Having the standard deviation (or other estimator for the
> uncertainty in the results) allows us to better determine how
> significant this difference in results really is.
> 
> For example, is the high throughput you quoted (~ 14 Gbit/s) an
> upward fluctuation, and the low value (~6) a downward fluctuation?
> Having a mean and standard deviation would allow us to determine
> just how (in)compatible these values are.
> 
> Assuming a Gaussian distribution (and when sampled sufficient times,
> "everything" tends to a Gaussian) you have an almost 5% chance that
> a result lies more than 2 standard deviations from the mean (and a
> 0.3% chance that it lies more than 3 s.d. from the mean!). Results
> that appear "high" or "low" may, therefore, not be entirely
> unexpected. Having a measure of the standard deviation provides some
> basis against which to determine how likely it is that a measured
> value is just statistical fluctuation, or whether it is a
> significant result.
> 
> Another thing I noticed is that you're running the iperf test for
> only 5 seconds. I have found in the past that iperf (or, more
> likely, TCP) takes a while to "ramp up" (even with all parameters
> fixed e.g. "-l <size> -w <size>") and that tests run for 2 minutes
> or more (e.g. "-t 120") give much more stable results.
> 
> Andrew.
> 

Here you go, results for the newly conducted benchmarks. I was going to
graph them, but it didn't seem worth it since it's only a single stream.

For iperf tests unit is Gb/s, for netperf tests unit is Mb/s.

COPY SCHEME
iperf -c  10.80.237.127 -t 120
6.19 6.23 6.26 6.25 6.27
mean 6.24    s.d. 0.031622776601759

iperf -c 10.80.237.127 -t 120  -l 131072
6.07 6.07 6.03 6.06 6.06
mean 6.058   s.d. 0.016431676725514

netperf -H 10.80.237.127 -l120 -f m
5662.55 5636.6 5641.52 5631.39 5630.98
mean 5640.608   s.d. 13.0001642297036

netperf -H 10.80.237.127 -l120 -f m -- -s 131072 -S 131072
5831.19 5833.03 5829.54 5838.89 5830.5
mean 5832.63  s.d. 3.72512415992628


PERMANENT MAP SCHEME
"iperf -c  10.80.237.127 -t 120
2.42 2.41 2.41 2.42 2.43
mean 2.418   s.d. 0.00836660026531

iperf -c 10.80.237.127 -t 120  -l 131072
14.3 14.2 14.2 14.4 14.3
mean 14.28  s.d. 0.083666002653234

netperf -H 10.80.237.127 -l120 -f m
4632.27 4630.08 4633.18 4641.25 4632.23
mean 4633.802   s.d. 4.31656924013371

netperf -H 10.80.237.127 -l120 -f m -- -s 131072 -S 131072
10556.04  10532.89 10541.83 10552.77 10546.77
mean 10546.06  s.d. 9.17156475133789


A short run of iperf / netperf was conducted before each test run so
that the system was warmed up.

The results show that the single stream performance is quite stable.
Also there's not much difference between running tests for 5s or 120s.


Wei.
> >
> >
> >Wei.
> >
> >>Thanks
> >>Annie

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2013-07-03 15:18 UTC | newest]

Thread overview: 27+ messages
2013-06-12 10:14 Interesting observation with network event notification and batching Wei Liu
2013-06-14 18:53 ` Konrad Rzeszutek Wilk
2013-06-16  9:54   ` Wei Liu
2013-06-17  9:38     ` Ian Campbell
2013-06-17  9:56       ` Andrew Bennieston
2013-06-17 10:46         ` Wei Liu
2013-06-17 10:56           ` Andrew Bennieston
2013-06-17 11:08             ` Ian Campbell
2013-06-17 11:55               ` Andrew Bennieston
2013-06-17 10:06       ` Jan Beulich
2013-06-17 10:16         ` Ian Campbell
2013-06-17 10:35       ` Wei Liu
2013-06-17 11:34         ` annie li
2013-06-16 12:46   ` Wei Liu
2013-06-28 16:15 ` Wei Liu
2013-07-01  7:48   ` annie li
2013-07-01  8:54     ` Wei Liu
2013-07-01 14:29       ` Stefano Stabellini
2013-07-01 14:39         ` Wei Liu
2013-07-01 14:54           ` Stefano Stabellini
2013-07-01 15:59       ` annie li
2013-07-01 16:06         ` Wei Liu
2013-07-01 16:53           ` Andrew Bennieston
2013-07-01 17:55             ` Wei Liu
2013-07-03 15:18             ` Wei Liu
2013-07-01 14:19     ` Stefano Stabellini
2013-07-01 15:59       ` annie li
