* Interesting observation with network event notification and batching
@ 2013-06-12 10:14 Wei Liu
2013-06-14 18:53 ` Konrad Rzeszutek Wilk
2013-06-28 16:15 ` Wei Liu
0 siblings, 2 replies; 27+ messages in thread
From: Wei Liu @ 2013-06-12 10:14 UTC (permalink / raw)
To: xen-devel
Cc: wei.liu2, ian.campbell, stefano.stabellini, konrad.wilk,
annie.li, andrew.bennieston
Hi all
I'm hacking on netback, trying to identify whether TLB flushes cause a
heavy performance penalty on the Tx path. The hack is quite nasty (you
would not want to know, trust me).
Basically what is doesn't is, 1) alter network protocol to pass along
mfns instead of grant references, 2) when the backend sees a new mfn,
map it RO and cache it in its own address space.
With this hack, we now have some sort of zero-copy Tx path. The backend
doesn't need to issue any grant copy / map operations any more. When it
sees a new packet in the ring, it just needs to pick up the pages
in its own address space, assemble packets with those pages, then pass
the packet on to the network stack.
In theory this should boost performance, but in practice it is the other
way around. This hack makes Xen networking more than 50% slower than
before (OMG). Further investigation shows that with this hack the
batching ability is gone. Before this hack, netback batches around 64
slots in one interrupt event; after this hack, it only batches 3 slots
in one interrupt event -- that's no batching at all because we can
expect one packet to occupy 3 slots.
Time to have some figures (iperf from DomU to Dom0). In the rows below,
"+N xen_version" means N extra HYPERVISOR_xen_version calls (each just
does a context switch into the hypervisor) added in the Tx path.

Configuration                        Throughput  Avg slots per batch
Before the hack (grant copy)         7.9 Gb/s    64
After the hack                       2.5 Gb/s    3
After the hack, +64 xen_version      3.2 Gb/s    6
After the hack, +256 xen_version     5.2 Gb/s    26
After the hack, +512 xen_version     7.9 Gb/s    26
After the hack, +768 xen_version     5.6 Gb/s    25
After the hack, +1024 xen_version    4.4 Gb/s    25
Average slots per batch is calculated as follows:
1. count total_slots processed from start of day
2. count tx_count, which is the number of times the tx_action function
gets invoked
3. avg_slots_per_tx = total_slots / tx_count
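The three steps above can be sketched as a pair of counters. This is only an illustration of the arithmetic, not the instrumentation actually used in the hack: struct tx_stats and both helper functions are hypothetical names, and the real counters would live wherever tx_action runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical statistics block; field names follow the steps above. */
struct tx_stats {
    uint64_t total_slots; /* slots processed since start of day */
    uint64_t tx_count;    /* number of times tx_action was invoked */
};

/* Call once per tx_action invocation with the slots consumed in that run. */
void tx_stats_account(struct tx_stats *s, unsigned int slots)
{
    assert(s != NULL);
    s->total_slots += slots;
    s->tx_count++;
}

/* avg_slots_per_tx = total_slots / tx_count; returns 0 before the first run. */
uint64_t tx_stats_avg_slots(const struct tx_stats *s)
{
    assert(s != NULL);
    return s->tx_count ? s->total_slots / s->tx_count : 0;
}
```

Note that this is integer division over the whole run, so a few very large batches can hide many small ones; a histogram of per-invocation slot counts would show the distribution more directly.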
The counter-intuitive figures imply that there is something wrong with
the current batching mechanism. Probably we need to fine-tune the
batching behavior for network and play with event pointers in the ring
(actually I'm looking into it now). It would be good to have some input
on this.
Konrad, IIRC you once mentioned you discovered something with event
notification, what's that?
To all, any thoughts?
Wei.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Interesting observation with network event notification and batching
2013-06-12 10:14 Interesting observation with network event notification and batching Wei Liu
@ 2013-06-14 18:53 ` Konrad Rzeszutek Wilk
2013-06-16 9:54 ` Wei Liu
2013-06-16 12:46 ` Wei Liu
2013-06-28 16:15 ` Wei Liu
1 sibling, 2 replies; 27+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-06-14 18:53 UTC (permalink / raw)
To: Wei Liu
Cc: annie.li, stefano.stabellini, andrew.bennieston, ian.campbell, xen-devel
On Wed, Jun 12, 2013 at 11:14:51AM +0100, Wei Liu wrote:
> Hi all
>
> I'm hacking on a netback trying to identify whether TLB flushes causes
> heavy performance penalty on Tx path. The hack is quite nasty (you would
> not want to know, trust me).
>
> Basically what is doesn't is, 1) alter network protocol to pass along
You probably meant: "what it does" ?
> mfns instead of grant references, 2) when the backend sees a new mfn,
> map it RO and cache it in its own address space.
>
> With this hack, now we have some sort of zero-copy TX path. Backend
> doesn't need to issue any grant copy / map operation any more. When it
> sees a new packet in the ring, it just needs to pick up the pages
> in its own address space and assemble packets with those pages then pass
> the packet on to network stack.
Uh, so I'm not sure I understand the RO part. If dom0 is mapping it,
won't that trigger a PTE update? And doesn't somebody (either the guest
or the initial domain) do a grant mapping to let the hypervisor know it
is OK to map a grant?
Or is dom0 actually permitted to map the MFN of any guest without using
the grants? In which case you are then using the _PAGE_IOMAP
somewhere and setting up vmap entries with the MFNs that point to the
foreign domain - I think?
>
> In theory this should boost performance, but in practice it is the other
> way around. This hack makes Xen network more than 50% slower than before
> (OMG). Further investigation shows that with this hack the batching
> ability is gone. Before this hack, netback batches like 64 slots in one
That is quite interesting.
> interrupt event, however after this hack, it only batches 3 slots in one
> interrupt event -- that's no batching at all because we can expect one
> packet to occupy 3 slots.
Right.
>
> Time to have some figures (iperf from DomU to Dom0).
>
> Before the hack, doing grant copy, throughput: 7.9 Gb/s, average slots
> per batch 64.
>
> After the hack, throughput: 2.5 Gb/s, average slots per batch 3.
>
> After the hack, adds in 64 HYPERVISOR_xen_version (it just does context
> switch into hypervisor) in Tx path, throughput: 3.2 Gb/s, average slots
> per batch 6.
>
> After the hack, adds in 256 HYPERVISOR_xen_version (it just does context
> switch into hypervisor) in Tx path, throughput: 5.2 Gb/s, average slots
> per batch 26.
>
> After the hack, adds in 512 HYPERVISOR_xen_version (it just does context
> switch into hypervisor) in Tx path, throughput: 7.9 Gb/s, average slots
> per batch 26.
>
> After the hack, adds in 768 HYPERVISOR_xen_version (it just does context
> switch into hypervisor) in Tx path, throughput: 5.6 Gb/s, average slots
> per batch 25.
>
> After the hack, adds in 1024 HYPERVISOR_xen_version (it just does context
> switch into hypervisor) in Tx path, throughput: 4.4 Gb/s, average slots
> per batch 25.
>
How do you get it to do more HYPERVISOR_xen_version? Did you just add
a for (i = 1024; i > 0; i--) hypervisor_yield();
in netback?
> Average slots per batch is calculate as followed:
> 1. count total_slots processed from start of day
> 2. count tx_count which is the number of tx_action function gets
> invoked
> 3. avg_slots_per_tx = total_slots / tx_count
>
> The counter-intuition figures imply that there is something wrong with
> the currently batching mechanism. Probably we need to fine-tune the
> batching behavior for network and play with event pointers in the ring
> (actually I'm looking into it now). It would be good to have some input
> on this.
I am still unsure I understand how your changes would incur more
of the yields.
>
> Konrad, IIRC you once mentioned you discovered something with event
> notification, what's that?
They were bizarre. I naively expected the number of physical NIC
interrupts to be around the same as for the VIF, or less. And I figured
that the number of interrupts would be constant regardless of the
size of the packets. In other words #packets == #interrupts.
In reality the number of interrupts the VIF had was about the same while
for the NIC it would fluctuate. (I can't remember the details).
But it was odd and I didn't go deeper into it to figure out what
was happening. And also to figure out if for the VIF we could
do something of #packets != #interrupts. And hopefully some
mechanism to adjust so that there would be fewer interrupts
per packet (hand waving here).
* Re: Interesting observation with network event notification and batching
2013-06-14 18:53 ` Konrad Rzeszutek Wilk
@ 2013-06-16 9:54 ` Wei Liu
2013-06-17 9:38 ` Ian Campbell
2013-06-16 12:46 ` Wei Liu
1 sibling, 1 reply; 27+ messages in thread
From: Wei Liu @ 2013-06-16 9:54 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: Wei Liu, ian.campbell, stefano.stabellini, xen-devel, annie.li,
andrew.bennieston
On Fri, Jun 14, 2013 at 02:53:03PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 12, 2013 at 11:14:51AM +0100, Wei Liu wrote:
> > Hi all
> >
> > I'm hacking on a netback trying to identify whether TLB flushes causes
> > heavy performance penalty on Tx path. The hack is quite nasty (you would
> > not want to know, trust me).
> >
> > Basically what is doesn't is, 1) alter network protocol to pass along
>
> You probably meant: "what it does" ?
>
Oh yes! Muscle memory got me!
> > mfns instead of grant references, 2) when the backend sees a new mfn,
> > map it RO and cache it in its own address space.
> >
> > With this hack, now we have some sort of zero-copy TX path. Backend
> > doesn't need to issue any grant copy / map operation any more. When it
> > sees a new packet in the ring, it just needs to pick up the pages
> > in its own address space and assemble packets with those pages then pass
> > the packet on to network stack.
>
> Uh, so not sure I understand the RO part. If dom0 is mapping it won't
> that trigger a PTE update? And doesn't somebody (either the guest or
> initial domain) do a grant mapping to let the hypervisor know it is
> OK to map a grant?
>
It is very easy to issue HYPERVISOR_mmu_update to alter Dom0's mapping,
because Dom0 is privileged.
> Or is dom0 actually permitted to map the MFN of any guest without using
> the grants? In which case you are then using the _PAGE_IOMAP
> somewhere and setting up vmap entries with the MFN's that point to the
> foreign domain - I think?
>
Sort of, but I didn't use vmap, I used alloc_page to get actual pages.
Then I modified the underlying PTE to point to the MFN from netfront.
> >
> > In theory this should boost performance, but in practice it is the other
> > way around. This hack makes Xen network more than 50% slower than before
> > (OMG). Further investigation shows that with this hack the batching
> > ability is gone. Before this hack, netback batches like 64 slots in one
>
> That is quite interesting.
>
> > interrupt event, however after this hack, it only batches 3 slots in one
> > interrupt event -- that's no batching at all because we can expect one
> > packet to occupy 3 slots.
>
> Right.
> >
> > Time to have some figures (iperf from DomU to Dom0).
> >
> > Before the hack, doing grant copy, throughput: 7.9 Gb/s, average slots
> > per batch 64.
> >
> > After the hack, throughput: 2.5 Gb/s, average slots per batch 3.
> >
> > After the hack, adds in 64 HYPERVISOR_xen_version (it just does context
> > switch into hypervisor) in Tx path, throughput: 3.2 Gb/s, average slots
> > per batch 6.
> >
> > After the hack, adds in 256 HYPERVISOR_xen_version (it just does context
> > switch into hypervisor) in Tx path, throughput: 5.2 Gb/s, average slots
> > per batch 26.
> >
> > After the hack, adds in 512 HYPERVISOR_xen_version (it just does context
> > switch into hypervisor) in Tx path, throughput: 7.9 Gb/s, average slots
> > per batch 26.
> >
> > After the hack, adds in 768 HYPERVISOR_xen_version (it just does context
> > switch into hypervisor) in Tx path, throughput: 5.6 Gb/s, average slots
> > per batch 25.
> >
> > After the hack, adds in 1024 HYPERVISOR_xen_version (it just does context
> > switch into hypervisor) in Tx path, throughput: 4.4 Gb/s, average slots
> > per batch 25.
> >
>
> How do you get it to do more HYPERVISR_xen_version? Did you just add
> a (for i = 1024; i>0;i--) hypervisor_yield();
for (i = 0; i < X; i++) (void)HYPERVISOR_xen_version(0, NULL);
>
> in netback?
> > Average slots per batch is calculate as followed:
> > 1. count total_slots processed from start of day
> > 2. count tx_count which is the number of tx_action function gets
> > invoked
> > 3. avg_slots_per_tx = total_slots / tx_count
> >
> > The counter-intuition figures imply that there is something wrong with
> > the currently batching mechanism. Probably we need to fine-tune the
> > batching behavior for network and play with event pointers in the ring
> > (actually I'm looking into it now). It would be good to have some input
> > on this.
>
> I am still unsure I understand hwo your changes would incur more
> of the yields.
It's not yielding. At least that's not the purpose of that hypercall.
HYPERVISOR_xen_version(0, NULL) only does guest -> hypervisor -> guest
context switching. The original purpose of HYPERVISOR_xen_version(0,
NULL) is to force the guest to check pending events.
Since you mentioned yielding, I will also try yielding and post
figures.
> >
> > Konrad, IIRC you once mentioned you discovered something with event
> > notification, what's that?
>
> They were bizzare. I naively expected some form of # of physical NIC
> interrupts to be around the same as the VIF or less. And I figured
> that the amount of interrupts would be constant irregardless of the
> size of the packets. In other words #packets == #interrupts.
>
It could be that the frontend notifies the backend for every packet it
sends. This is not desirable and I don't expect the ring to behave that
way.
> In reality the number of interrupts the VIF had was about the same while
> for the NIC it would fluctuate. (I can't remember the details).
>
I'm not sure I understand you here. But for the NIC, if you see the
number of interrupts go from high to low, that's expected. When the NIC
has a very high interrupt rate it turns to polling mode.
> But it was odd and I didn't go deeper in it to figure out what
> was happening. And also to figure out if for the VIF we could
> do something of #packets != #interrupts. And hopefully some
> mechanism to adjust so that the amount of interrupts would
> be lesser per packets (hand waving here).
I'm trying to do this now.
Wei.
* Re: Interesting observation with network event notification and batching
2013-06-14 18:53 ` Konrad Rzeszutek Wilk
2013-06-16 9:54 ` Wei Liu
@ 2013-06-16 12:46 ` Wei Liu
1 sibling, 0 replies; 27+ messages in thread
From: Wei Liu @ 2013-06-16 12:46 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: Wei Liu, ian.campbell, stefano.stabellini, xen-devel, annie.li,
andrew.bennieston
On Fri, Jun 14, 2013 at 02:53:03PM -0400, Konrad Rzeszutek Wilk wrote:
[...]>
> How do you get it to do more HYPERVISR_xen_version? Did you just add
> a (for i = 1024; i>0;i--) hypervisor_yield();
>
Here are the figures after replacing HYPERVISOR_xen_version(0, NULL) with
HYPERVISOR_sched_op(SCHEDOP_yield, NULL).
64 HYPERVISOR_sched_op(SCHEDOP_yield, NULL), throughput 5.15G/s,
average slots per tx 25
128 HYPERVISOR_sched_op(SCHEDOP_yield, NULL), throughput 7.75G/s,
average slots per tx 26
512 HYPERVISOR_sched_op(SCHEDOP_yield, NULL), throughput 1.74G/s,
average slots per tx 18
1024 HYPERVISOR_sched_op(SCHEDOP_yield, NULL), throughput 998M/s,
average slots per tx 18
Please note that Dom0 and DomU run on different PCPUs.
I think this kind of behavior has something to do with the scheduler. But
fundamentally we should really fix the notification mechanism.
Wei.
* Re: Interesting observation with network event notification and batching
2013-06-16 9:54 ` Wei Liu
@ 2013-06-17 9:38 ` Ian Campbell
2013-06-17 9:56 ` Andrew Bennieston
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Ian Campbell @ 2013-06-17 9:38 UTC (permalink / raw)
To: Wei Liu
Cc: annie.li, xen-devel, andrew.bennieston, stefano.stabellini,
Konrad Rzeszutek Wilk
On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
> > > Konrad, IIRC you once mentioned you discovered something with event
> > > notification, what's that?
> >
> > They were bizzare. I naively expected some form of # of physical NIC
> > interrupts to be around the same as the VIF or less. And I figured
> > that the amount of interrupts would be constant irregardless of the
> > size of the packets. In other words #packets == #interrupts.
> >
>
> It could be that the frontend notifies the backend for every packet it
> sends. This is not desirable and I don't expect the ring to behave that
> way.
It is probably worth checking that things are working how we think they
should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
suitable points to maximise batching.
Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
loop right? This would push the req_event pointer to just after the last
request, meaning the next request enqueued by the frontend would cause a
notification -- even though the backend is actually still continuing to
process requests and would have picked up that packet without further
notification. In this case there is a fair bit of work left in the
backend for this iteration, i.e. plenty of opportunity for the frontend
to queue more requests.
The comments in ring.h say:
* These macros will set the req_event/rsp_event field to trigger a
* notification on the very next message that is enqueued. If you want to
* create batches of work (i.e., only receive a notification after several
* messages have been enqueued) then you will need to create a customised
* version of the FINAL_CHECK macro in your own code, which sets the event
* field appropriately.
Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
(and other similar loops) and add a FINAL check at the very end?
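To make the req_event discussion concrete, here is a minimal user-space model of the request side of a shared ring. It is a sketch under assumptions: struct ring_model and the three functions are hypothetical names, memory barriers are omitted because the model is single-threaded, and has_unconsumed ignores the response-space limit that the real RING_HAS_UNCONSUMED_REQUESTS also applies. The notify test and the final-check sequence are intended to follow the logic of RING_PUSH_REQUESTS_AND_CHECK_NOTIFY and RING_FINAL_CHECK_FOR_REQUESTS in ring.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ring_idx;

/* Simplified model of the request half of a shared ring (hypothetical). */
struct ring_model {
    ring_idx req_prod;  /* shared: requests published by the frontend */
    ring_idx req_event; /* shared: notify when req_prod reaches this index */
    ring_idx req_cons;  /* backend-private consumer index */
};

/* Frontend side: publish n requests and report whether a notification is
 * due, i.e. whether req_event lies within the newly published range. */
bool push_and_check_notify(struct ring_model *r, ring_idx n)
{
    assert(r != NULL);
    ring_idx old = r->req_prod;
    ring_idx cur = old + n;
    r->req_prod = cur;
    return (ring_idx)(cur - r->req_event) < (ring_idx)(cur - old);
}

/* Simplified: the real macro also caps by free response slots. */
bool has_unconsumed(const struct ring_model *r)
{
    return r->req_prod != r->req_cons;
}

/* Backend side: if work is pending, return without touching req_event.
 * Otherwise ask for a notification on the very next request and re-check
 * (the re-check closes the race with a concurrent producer in real code). */
bool final_check(struct ring_model *r)
{
    assert(r != NULL);
    if (has_unconsumed(r))
        return true;
    r->req_event = r->req_cons + 1;
    return has_unconsumed(r);
}
```

In this model, a final check issued while requests are still pending leaves req_event alone, whereas a final check on an empty ring arms a notification for the very next request. Where that call sits relative to the rest of the backend's work is therefore what decides how many notifications the frontend sends.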
> > But it was odd and I didn't go deeper in it to figure out what
> > was happening. And also to figure out if for the VIF we could
> > do something of #packets != #interrupts. And hopefully some
> > mechanism to adjust so that the amount of interrupts would
> > be lesser per packets (hand waving here).
>
> I'm trying to do this now.
What scheme do you have in mind?
* Re: Interesting observation with network event notification and batching
2013-06-17 9:38 ` Ian Campbell
@ 2013-06-17 9:56 ` Andrew Bennieston
2013-06-17 10:46 ` Wei Liu
2013-06-17 10:06 ` Jan Beulich
2013-06-17 10:35 ` Wei Liu
2 siblings, 1 reply; 27+ messages in thread
From: Andrew Bennieston @ 2013-06-17 9:56 UTC (permalink / raw)
To: Ian Campbell
Cc: annie.li, xen-devel, Wei Liu, stefano.stabellini, Konrad Rzeszutek Wilk
On 17/06/13 10:38, Ian Campbell wrote:
> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
>>>> Konrad, IIRC you once mentioned you discovered something with event
>>>> notification, what's that?
>>>
>>> They were bizzare. I naively expected some form of # of physical NIC
>>> interrupts to be around the same as the VIF or less. And I figured
>>> that the amount of interrupts would be constant irregardless of the
>>> size of the packets. In other words #packets == #interrupts.
>>>
>>
>> It could be that the frontend notifies the backend for every packet it
>> sends. This is not desirable and I don't expect the ring to behave that
>> way.
I have observed this kind of behaviour during network performance tests
in which I periodically checked the ring state during an iperf session.
It looked to me like the frontend was sending notifications far too
often, but that the backend was sending them very infrequently, so the
Tx (from guest) ring was mostly empty and the Rx (to guest) ring was
mostly full. This has the effect of both front and backend having to
block occasionally waiting for the other end to clear or fill a ring,
even though there is more data available.
My initial theory was that this was caused in part by the shared event
channel, however I expect that Wei is testing on top of a kernel with
his split event channel features?
>
> It is probably worth checking that things are working how we think they
> should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
> suitable points to maximise batching.
>
> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
> loop right? This would push the req_event pointer to just after the last
> request, meaning the net request enqueued by the frontend would cause a
> notification -- even though the backend is actually still continuing to
> process requests and would have picked up that packet without further
> notification. n this case there is a fair bit of work left in the
> backend for this iteration i.e. plenty of opportunity for the frontend
> to queue more requests.
>
> The comments in ring.h say:
> * These macros will set the req_event/rsp_event field to trigger a
> * notification on the very next message that is enqueued. If you want to
> * create batches of work (i.e., only receive a notification after several
> * messages have been enqueued) then you will need to create a customised
> * version of the FINAL_CHECK macro in your own code, which sets the event
> * field appropriately.
>
> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> (and other similar loops) and add a FINAL check at the very end?
>
>>> But it was odd and I didn't go deeper in it to figure out what
>>> was happening. And also to figure out if for the VIF we could
>>> do something of #packets != #interrupts. And hopefully some
>>> mechanism to adjust so that the amount of interrupts would
>>> be lesser per packets (hand waving here).
>>
>> I'm trying to do this now.
>
> What scheme do you have in mind?
As I mentioned above, filling a ring completely appears to be almost as
bad as sending too many notifications. The ideal scheme may involve
trying to balance the ring at some "half-full" state, depending on the
capacity for the front- and backends to process requests and responses.
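One possible reading of the "half-full" idea, following the ring.h comment about a customised FINAL_CHECK, is to arm req_event further ahead than the very next request. The sketch below is an assumption, not a proposal made in this thread: struct batch_ring, batch_push_one and final_check_batched are hypothetical names, barriers are omitted, and the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified request-side ring state. */
struct batch_ring {
    uint32_t req_prod;  /* published by the frontend */
    uint32_t req_event; /* notify when req_prod reaches this index */
    uint32_t req_cons;  /* consumed by the backend */
};

/* Frontend: publish one request, return true if a notification is due. */
bool batch_push_one(struct batch_ring *r)
{
    assert(r != NULL);
    uint32_t old = r->req_prod;
    uint32_t cur = old + 1;
    r->req_prod = cur;
    return (uint32_t)(cur - r->req_event) < (uint32_t)(cur - old);
}

/* Backend: like the stock final check, but only ask to be notified once
 * `batch` further requests have been queued (batch must be >= 1). */
bool final_check_batched(struct batch_ring *r, uint32_t batch)
{
    assert(r != NULL && batch >= 1);
    if (r->req_prod != r->req_cons)
        return true;                 /* work pending, leave req_event alone */
    r->req_event = r->req_cons + batch;
    return r->req_prod != r->req_cons;
}
```

With batch = 4 the frontend stays silent for three requests and notifies on the fourth. The obvious cost is latency: if fewer than batch requests ever arrive, nothing wakes the backend, so a scheme like this would presumably need a timer or some other fallback, and a threshold tuned to how fast each end drains the ring.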
Andrew.
* Re: Interesting observation with network event notification and batching
2013-06-17 9:38 ` Ian Campbell
2013-06-17 9:56 ` Andrew Bennieston
@ 2013-06-17 10:06 ` Jan Beulich
2013-06-17 10:16 ` Ian Campbell
2013-06-17 10:35 ` Wei Liu
2 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2013-06-17 10:06 UTC (permalink / raw)
To: Ian Campbell, Wei Liu
Cc: annie.li, xen-devel, andrew.bennieston, Konrad Rzeszutek Wilk,
stefano.stabellini
>>> On 17.06.13 at 11:38, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
>> > > Konrad, IIRC you once mentioned you discovered something with event
>> > > notification, what's that?
>> >
>> > They were bizzare. I naively expected some form of # of physical NIC
>> > interrupts to be around the same as the VIF or less. And I figured
>> > that the amount of interrupts would be constant irregardless of the
>> > size of the packets. In other words #packets == #interrupts.
>> >
>>
>> It could be that the frontend notifies the backend for every packet it
>> sends. This is not desirable and I don't expect the ring to behave that
>> way.
>
> It is probably worth checking that things are working how we think they
> should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
> suitable points to maximise batching.
>
> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
> loop right? This would push the req_event pointer to just after the last
> request, meaning the net request enqueued by the frontend would cause a
> notification -- even though the backend is actually still continuing to
> process requests and would have picked up that packet without further
> notification. n this case there is a fair bit of work left in the
> backend for this iteration i.e. plenty of opportunity for the frontend
> to queue more requests.
>
> The comments in ring.h say:
> * These macros will set the req_event/rsp_event field to trigger a
> * notification on the very next message that is enqueued. If you want to
> * create batches of work (i.e., only receive a notification after several
> * messages have been enqueued) then you will need to create a customised
> * version of the FINAL_CHECK macro in your own code, which sets the event
> * field appropriately.
>
> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> (and other similar loops) and add a FINAL check at the very end?
But then again the macro doesn't update req_event when there
are unconsumed requests already upon entry to the macro.
Jan
* Re: Interesting observation with network event notification and batching
2013-06-17 10:06 ` Jan Beulich
@ 2013-06-17 10:16 ` Ian Campbell
0 siblings, 0 replies; 27+ messages in thread
From: Ian Campbell @ 2013-06-17 10:16 UTC (permalink / raw)
To: Jan Beulich
Cc: Wei Liu, Konrad Rzeszutek Wilk, stefano.stabellini, xen-devel,
annie.li, andrew.bennieston
On Mon, 2013-06-17 at 11:06 +0100, Jan Beulich wrote:
> >>> On 17.06.13 at 11:38, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
> >> > > Konrad, IIRC you once mentioned you discovered something with event
> >> > > notification, what's that?
> >> >
> >> > They were bizzare. I naively expected some form of # of physical NIC
> >> > interrupts to be around the same as the VIF or less. And I figured
> >> > that the amount of interrupts would be constant irregardless of the
> >> > size of the packets. In other words #packets == #interrupts.
> >> >
> >>
> >> It could be that the frontend notifies the backend for every packet it
> >> sends. This is not desirable and I don't expect the ring to behave that
> >> way.
> >
> > It is probably worth checking that things are working how we think they
> > should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
> > netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
> > suitable points to maximise batching.
> >
> > Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
> > loop right? This would push the req_event pointer to just after the last
> > request, meaning the net request enqueued by the frontend would cause a
> > notification -- even though the backend is actually still continuing to
> > process requests and would have picked up that packet without further
> > notification. n this case there is a fair bit of work left in the
> > backend for this iteration i.e. plenty of opportunity for the frontend
> > to queue more requests.
> >
> > The comments in ring.h say:
> > * These macros will set the req_event/rsp_event field to trigger a
> > * notification on the very next message that is enqueued. If you want to
> > * create batches of work (i.e., only receive a notification after several
> > * messages have been enqueued) then you will need to create a customised
> > * version of the FINAL_CHECK macro in your own code, which sets the event
> > * field appropriately.
> >
> > Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> > (and other similar loops) and add a FINAL check at the very end?
>
> But then again the macro doesn't update req_event when there
> are unconsumed requests already upon entry to the macro.
My concern was that when we process the last request currently on the
ring we immediately move req_event forward, even though netback goes on
to do a bunch more work (including e.g. the grant copies) before looping
back and looking for more work. That's a potentially large window for
the frontend to enqueue and then needlessly notify a new packet.
It could potentially lead to a pathological case of notifying every
packet unnecessarily.
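The window described here can be illustrated with a toy, single-threaded simulation. Everything in it is an assumption made for illustration: struct sim_ring and the function names are hypothetical, each packet is treated as one request, and the frontend is assumed to enqueue a fixed number of requests while the backend is busy in each iteration. Real behaviour depends on scheduling and timing that this model does not capture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sim_ring {
    uint32_t req_prod, req_event, req_cons;
};

/* Frontend: publish one request, return true if it must notify. */
bool sim_push_one(struct sim_ring *r)
{
    assert(r != NULL);
    uint32_t old = r->req_prod;
    uint32_t cur = old + 1;
    r->req_prod = cur;
    return (uint32_t)(cur - r->req_event) < (uint32_t)(cur - old);
}

/* Backend: arm req_event only if the ring is empty (final-check logic). */
bool sim_final_check(struct sim_ring *r)
{
    assert(r != NULL);
    if (r->req_prod != r->req_cons)
        return true;
    r->req_event = r->req_cons + 1;
    return r->req_prod != r->req_cons;
}

/* Count frontend notifications over `iterations` backend passes.
 * early_check = true:  final check right after consuming, before the
 *                      remaining per-iteration work (the window).
 * early_check = false: final check only after that work is done. */
unsigned int sim_notifications(bool early_check, unsigned int iterations,
                               unsigned int per_window)
{
    struct sim_ring r = { 0, 1, 0 };
    unsigned int notified = 0;

    if (sim_push_one(&r))          /* first request kicks the backend */
        notified++;

    for (unsigned int i = 0; i < iterations; i++) {
        r.req_cons = r.req_prod;   /* consume everything currently visible */
        if (early_check)
            (void)sim_final_check(&r);
        /* Backend is busy; frontend enqueues during this window. */
        for (unsigned int j = 0; j < per_window; j++)
            if (sim_push_one(&r))
                notified++;
        if (!early_check)
            (void)sim_final_check(&r);
    }
    return notified;
}
```

In this model, with one request arriving per window, the early check yields one notification per iteration, while deferring the check yields only the initial one, because the deferred check always finds pending work and leaves req_event untouched.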
Ian.
* Re: Interesting observation with network event notification and batching
2013-06-17 9:38 ` Ian Campbell
2013-06-17 9:56 ` Andrew Bennieston
2013-06-17 10:06 ` Jan Beulich
@ 2013-06-17 10:35 ` Wei Liu
2013-06-17 11:34 ` annie li
2 siblings, 1 reply; 27+ messages in thread
From: Wei Liu @ 2013-06-17 10:35 UTC (permalink / raw)
To: Ian Campbell
Cc: Wei Liu, stefano.stabellini, Konrad Rzeszutek Wilk, xen-devel,
annie.li, andrew.bennieston
On Mon, Jun 17, 2013 at 10:38:33AM +0100, Ian Campbell wrote:
> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
> > > > Konrad, IIRC you once mentioned you discovered something with event
> > > > notification, what's that?
> > >
> > > They were bizzare. I naively expected some form of # of physical NIC
> > > interrupts to be around the same as the VIF or less. And I figured
> > > that the amount of interrupts would be constant irregardless of the
> > > size of the packets. In other words #packets == #interrupts.
> > >
> >
> > It could be that the frontend notifies the backend for every packet it
> > sends. This is not desirable and I don't expect the ring to behave that
> > way.
>
> It is probably worth checking that things are working how we think they
> should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
> suitable points to maximise batching.
>
> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
> loop right? This would push the req_event pointer to just after the last
> request, meaning the net request enqueued by the frontend would cause a
> notification -- even though the backend is actually still continuing to
> process requests and would have picked up that packet without further
> notification. n this case there is a fair bit of work left in the
> backend for this iteration i.e. plenty of opportunity for the frontend
> to queue more requests.
>
> The comments in ring.h say:
> * These macros will set the req_event/rsp_event field to trigger a
> * notification on the very next message that is enqueued. If you want to
> * create batches of work (i.e., only receive a notification after several
> * messages have been enqueued) then you will need to create a customised
> * version of the FINAL_CHECK macro in your own code, which sets the event
> * field appropriately.
>
> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> (and other similar loops) and add a FINAL check at the very end?
>
> > > But it was odd and I didn't go deeper in it to figure out what
> > > was happening. And also to figure out if for the VIF we could
> > > do something of #packets != #interrupts. And hopefully some
> > > mechanism to adjust so that the amount of interrupts would
> > > be lesser per packets (hand waving here).
> >
> > I'm trying to do this now.
>
> What scheme do you have in mind?
Basically the one you mentioned above.
Playing with various event pointers now.
Wei.
>
* Re: Interesting observation with network event notification and batching
2013-06-17 9:56 ` Andrew Bennieston
@ 2013-06-17 10:46 ` Wei Liu
2013-06-17 10:56 ` Andrew Bennieston
0 siblings, 1 reply; 27+ messages in thread
From: Wei Liu @ 2013-06-17 10:46 UTC (permalink / raw)
To: Andrew Bennieston
Cc: Wei Liu, Ian Campbell, stefano.stabellini, Konrad Rzeszutek Wilk,
xen-devel, annie.li
On Mon, Jun 17, 2013 at 10:56:12AM +0100, Andrew Bennieston wrote:
> On 17/06/13 10:38, Ian Campbell wrote:
> >On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
> >>>>Konrad, IIRC you once mentioned you discovered something with event
> >>>>notification, what's that?
> >>>
> >>>They were bizzare. I naively expected some form of # of physical NIC
> >>>interrupts to be around the same as the VIF or less. And I figured
> >>>that the amount of interrupts would be constant irregardless of the
> >>>size of the packets. In other words #packets == #interrupts.
> >>>
> >>
> >>It could be that the frontend notifies the backend for every packet it
> >>sends. This is not desirable and I don't expect the ring to behave that
> >>way.
>
> I have observed this kind of behaviour during network performance
> tests in which I periodically checked the ring state during an iperf
> session. It looked to me like the frontend was sending notifications
> far too often, but that the backend was sending them very
> infrequently, so the Tx (from guest) ring was mostly empty and the
> Rx (to guest) ring was mostly full. This has the effect of both
> front and backend having to block occasionally waiting for the other
> end to clear or fill a ring, even though there is more data
> available.
>
> My initial theory was that this was caused in part by the shared
> event channel, however I expect that Wei is testing on top of a
> kernel with his split event channel features?
>
Yes, with split event channels.
During the tests, the frontend TX interrupt count ran to six figures
while the frontend RX interrupt count stayed at two figures.
> >
> >It is probably worth checking that things are working how we think they
> >should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
> >netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
> >suitable points to maximise batching.
> >
> >Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
> >loop right? This would push the req_event pointer to just after the last
> >request, meaning the net request enqueued by the frontend would cause a
> >notification -- even though the backend is actually still continuing to
> >process requests and would have picked up that packet without further
> >notification. n this case there is a fair bit of work left in the
> >backend for this iteration i.e. plenty of opportunity for the frontend
> >to queue more requests.
> >
> >The comments in ring.h say:
> > * These macros will set the req_event/rsp_event field to trigger a
> > * notification on the very next message that is enqueued. If you want to
> > * create batches of work (i.e., only receive a notification after several
> > * messages have been enqueued) then you will need to create a customised
> > * version of the FINAL_CHECK macro in your own code, which sets the event
> > * field appropriately.
> >
> >Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> >(and other similar loops) and add a FINAL check at the very end?
> >
> >>>But it was odd and I didn't go deeper in it to figure out what
> >>>was happening. And also to figure out if for the VIF we could
> >>>do something of #packets != #interrupts. And hopefully some
> >>>mechanism to adjust so that the amount of interrupts would
> >>>be lesser per packets (hand waving here).
> >>
> >>I'm trying to do this now.
> >
> >What scheme do you have in mind?
>
> As I mentioned above, filling a ring completely appears to be almost
> as bad as sending too many notifications. The ideal scheme may
> involve trying to balance the ring at some "half-full" state,
> depending on the capacity for the front- and backends to process
> requests and responses.
>
I don't think filling the ring completely causes any problem;
conceptually that's the same as a "half-full" state if you need to
throttle the ring.
The real problem is how to do notifications correctly.
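For reference, a minimal sketch of the producer-side decision, assuming the usual rule that a notification is sent only when the newly published requests cross req_event. need_notify() and count_notifications() are hypothetical helper names for illustration, not functions from ring.h. The second function models a consumer that re-arms req_event right behind every push, which is one way to end up with #packets == #interrupts.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ring_idx;

/*
 * Nonzero if publishing requests [old_prod, new_prod) crosses req_event.
 * Unsigned subtraction keeps the comparison valid across index wrap.
 */
static int need_notify(ring_idx old_prod, ring_idx new_prod, ring_idx req_event)
{
    return (ring_idx)(new_prod - req_event) < (ring_idx)(new_prod - old_prod);
}

/*
 * Count notifications for a burst of packets pushed one at a time.
 * If rearm_each_time is set, the consumer is assumed to have caught up
 * and re-armed req_event after every push.
 */
static unsigned count_notifications(unsigned packets,
                                    ring_idx slots_per_packet,
                                    int rearm_each_time)
{
    ring_idx prod = 0, event = 1;
    unsigned sent = 0;

    assert(slots_per_packet > 0);
    for (unsigned i = 0; i < packets; i++) {
        ring_idx new_prod = prod + slots_per_packet;

        if (need_notify(prod, new_prod, event))
            sent++;
        prod = new_prod;
        if (rearm_each_time)
            event = prod + 1;
    }
    return sent;
}
```

With 3 slots per packet, re-arming after every push yields one notification per packet, while arming once yields a single notification for the whole burst.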
Wei.
> Andrew.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Interesting observation with network event notification and batching
2013-06-17 10:46 ` Wei Liu
@ 2013-06-17 10:56 ` Andrew Bennieston
2013-06-17 11:08 ` Ian Campbell
0 siblings, 1 reply; 27+ messages in thread
From: Andrew Bennieston @ 2013-06-17 10:56 UTC (permalink / raw)
To: Wei Liu
Cc: annie.li, xen-devel, Ian Campbell, stefano.stabellini,
Konrad Rzeszutek Wilk
On 17/06/13 11:46, Wei Liu wrote:
> On Mon, Jun 17, 2013 at 10:56:12AM +0100, Andrew Bennieston wrote:
>> On 17/06/13 10:38, Ian Campbell wrote:
>>> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
>>>>>> Konrad, IIRC you once mentioned you discovered something with event
>>>>>> notification, what's that?
>>>>>
>>>>> They were bizzare. I naively expected some form of # of physical NIC
>>>>> interrupts to be around the same as the VIF or less. And I figured
>>>>> that the amount of interrupts would be constant irregardless of the
>>>>> size of the packets. In other words #packets == #interrupts.
>>>>>
>>>>
>>>> It could be that the frontend notifies the backend for every packet it
>>>> sends. This is not desirable and I don't expect the ring to behave that
>>>> way.
>>
>> I have observed this kind of behaviour during network performance
>> tests in which I periodically checked the ring state during an iperf
>> session. It looked to me like the frontend was sending notifications
>> far too often, but that the backend was sending them very
>> infrequently, so the Tx (from guest) ring was mostly empty and the
>> Rx (to guest) ring was mostly full. This has the effect of both
>> front and backend having to block occasionally waiting for the other
>> end to clear or fill a ring, even though there is more data
>> available.
>>
>> My initial theory was that this was caused in part by the shared
>> event channel, however I expect that Wei is testing on top of a
>> kernel with his split event channel features?
>>
>
> Yes, with split event channels.
>
> And during tests the interrupt counts, frontend TX has 6 figures
> interrupt number while frontend RX has 2 figures number.
>
>>>
>>> It is probably worth checking that things are working how we think they
>>> should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
>>> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
>>> suitable points to maximise batching.
>>>
>>> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
>>> loop right? This would push the req_event pointer to just after the last
>>> request, meaning the net request enqueued by the frontend would cause a
>>> notification -- even though the backend is actually still continuing to
>>> process requests and would have picked up that packet without further
>>> notification. n this case there is a fair bit of work left in the
>>> backend for this iteration i.e. plenty of opportunity for the frontend
>>> to queue more requests.
>>>
>>> The comments in ring.h say:
>>> * These macros will set the req_event/rsp_event field to trigger a
>>> * notification on the very next message that is enqueued. If you want to
>>> * create batches of work (i.e., only receive a notification after several
>>> * messages have been enqueued) then you will need to create a customised
>>> * version of the FINAL_CHECK macro in your own code, which sets the event
>>> * field appropriately.
>>>
>>> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
>>> (and other similar loops) and add a FINAL check at the very end?
>>>
>>>>> But it was odd and I didn't go deeper in it to figure out what
>>>>> was happening. And also to figure out if for the VIF we could
>>>>> do something of #packets != #interrupts. And hopefully some
>>>>> mechanism to adjust so that the amount of interrupts would
>>>>> be lesser per packets (hand waving here).
>>>>
>>>> I'm trying to do this now.
>>>
>>> What scheme do you have in mind?
>>
>> As I mentioned above, filling a ring completely appears to be almost
>> as bad as sending too many notifications. The ideal scheme may
>> involve trying to balance the ring at some "half-full" state,
>> depending on the capacity for the front- and backends to process
>> requests and responses.
>>
>
> I don't think filling the ring full causes any problem, that's just
> conceptually the same as "half-full" state if you need to throttle the
> ring.
My understanding was that filling the ring will cause the producer to
sleep until slots become available (i.e. until the consumer notifies
it that it has removed something from the ring).
I'm just concerned that overly aggressive batching may lead to a
situation where the consumer is sitting idle, waiting for a notification
that the producer hasn't yet sent because it can still fill more slots
on the ring. When the ring is completely full, the producer would have
to wait for the ring to partially empty. At this point, the consumer
would hold off notifying because it can still batch more processing, so
the producer is left waiting. (Repeat as required). It would be better
to have both producer and consumer running concurrently.
I mention this mainly so that we don't end up with a swing to the polar
opposite of what we have now, which (to my mind) is just as bad. Clearly
this is an edge case, but if there's a reason I'm missing that this
can't happen (e.g. after a period of inactivity) then don't hesitate to
point it out :)
(Perhaps "half-full" was misleading... the optimal state may be "just
enough room for one more packet", or something along those lines...)
Andrew
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Interesting observation with network event notification and batching
2013-06-17 10:56 ` Andrew Bennieston
@ 2013-06-17 11:08 ` Ian Campbell
2013-06-17 11:55 ` Andrew Bennieston
0 siblings, 1 reply; 27+ messages in thread
From: Ian Campbell @ 2013-06-17 11:08 UTC (permalink / raw)
To: Andrew Bennieston
Cc: annie.li, xen-devel, Wei Liu, stefano.stabellini, Konrad Rzeszutek Wilk
On Mon, 2013-06-17 at 11:56 +0100, Andrew Bennieston wrote:
> On 17/06/13 11:46, Wei Liu wrote:
> > On Mon, Jun 17, 2013 at 10:56:12AM +0100, Andrew Bennieston wrote:
> >> On 17/06/13 10:38, Ian Campbell wrote:
> >>> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
> >>>>>> Konrad, IIRC you once mentioned you discovered something with event
> >>>>>> notification, what's that?
> >>>>>
> >>>>> They were bizzare. I naively expected some form of # of physical NIC
> >>>>> interrupts to be around the same as the VIF or less. And I figured
> >>>>> that the amount of interrupts would be constant irregardless of the
> >>>>> size of the packets. In other words #packets == #interrupts.
> >>>>>
> >>>>
> >>>> It could be that the frontend notifies the backend for every packet it
> >>>> sends. This is not desirable and I don't expect the ring to behave that
> >>>> way.
> >>
> >> I have observed this kind of behaviour during network performance
> >> tests in which I periodically checked the ring state during an iperf
> >> session. It looked to me like the frontend was sending notifications
> >> far too often, but that the backend was sending them very
> >> infrequently, so the Tx (from guest) ring was mostly empty and the
> >> Rx (to guest) ring was mostly full. This has the effect of both
> >> front and backend having to block occasionally waiting for the other
> >> end to clear or fill a ring, even though there is more data
> >> available.
> >>
> >> My initial theory was that this was caused in part by the shared
> >> event channel, however I expect that Wei is testing on top of a
> >> kernel with his split event channel features?
> >>
> >
> > Yes, with split event channels.
> >
> > And during tests the interrupt counts, frontend TX has 6 figures
> > interrupt number while frontend RX has 2 figures number.
> >
> >>>
> >>> It is probably worth checking that things are working how we think they
> >>> should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
> >>> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
> >>> suitable points to maximise batching.
> >>>
> >>> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
> >>> loop right? This would push the req_event pointer to just after the last
> >>> request, meaning the net request enqueued by the frontend would cause a
> >>> notification -- even though the backend is actually still continuing to
> >>> process requests and would have picked up that packet without further
> >>> notification. n this case there is a fair bit of work left in the
> >>> backend for this iteration i.e. plenty of opportunity for the frontend
> >>> to queue more requests.
> >>>
> >>> The comments in ring.h say:
> >>> * These macros will set the req_event/rsp_event field to trigger a
> >>> * notification on the very next message that is enqueued. If you want to
> >>> * create batches of work (i.e., only receive a notification after several
> >>> * messages have been enqueued) then you will need to create a customised
> >>> * version of the FINAL_CHECK macro in your own code, which sets the event
> >>> * field appropriately.
> >>>
> >>> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
> >>> (and other similar loops) and add a FINAL check at the very end?
> >>>
> >>>>> But it was odd and I didn't go deeper in it to figure out what
> >>>>> was happening. And also to figure out if for the VIF we could
> >>>>> do something of #packets != #interrupts. And hopefully some
> >>>>> mechanism to adjust so that the amount of interrupts would
> >>>>> be lesser per packets (hand waving here).
> >>>>
> >>>> I'm trying to do this now.
> >>>
> >>> What scheme do you have in mind?
> >>
> >> As I mentioned above, filling a ring completely appears to be almost
> >> as bad as sending too many notifications. The ideal scheme may
> >> involve trying to balance the ring at some "half-full" state,
> >> depending on the capacity for the front- and backends to process
> >> requests and responses.
> >>
> >
> > I don't think filling the ring full causes any problem, that's just
> > conceptually the same as "half-full" state if you need to throttle the
> > ring.
> My understanding was that filling the ring will cause the producer to
> sleep until slots become available (i.e. the until the consumer notifies
> it that it has removed something from the ring).
>
> I'm just concerned that overly aggressive batching may lead to a
> situation where the consumer is sitting idle, waiting for a notification
> that the producer hasn't yet sent because it can still fill more slots
> on the ring. When the ring is completely full, the producer would have
> to wait for the ring to partially empty. At this point, the consumer
> would hold off notifying because it can still batch more processing, so
> the producer is left waiting. (Repeat as required). It would be better
> to have both producer and consumer running concurrently.
>
> I mention this mainly so that we don't end up with a swing to the polar
> opposite of what we have now, which (to my mind) is just as bad. Clearly
> this is an edge case, but if there's a reason I'm missing that this
> can't happen (e.g. after a period of inactivity) then don't hesitate to
> point it out :)
Doesn't the separation between req_event and rsp_event help here?
So if the producer fills the ring, it will sleep, but set rsp_event
appropriately so that when the backend completes some (but not all) work
it will be woken up and can put extra stuff on the ring.
It shouldn't need to wait for the backend to process the whole batch for
this.
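As a rough sketch of what "set rsp_event appropriately" could look like: a stalled producer asks to be woken once about half of its outstanding requests have been answered. The half-way point and the names wake_threshold() and TOY_RING_SIZE are assumptions for illustration only, not a statement of what netfront actually does.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ring_idx;

#define TOY_RING_SIZE 256u  /* hypothetical ring size for the sanity check */

/*
 * Pick the response index at which a stalled producer wants to be woken:
 * after roughly half of the requests still outstanding have completed.
 * rsp_prod is the last response index seen, req_prod the last request
 * index published.
 */
static ring_idx wake_threshold(ring_idx rsp_prod, ring_idx req_prod)
{
    ring_idx outstanding = req_prod - rsp_prod;

    assert(outstanding <= TOY_RING_SIZE);
    return rsp_prod + (outstanding >> 1) + 1;
}
```

With a full 256-slot ring and nothing completed yet, the producer would be woken after 129 responses instead of waiting for the whole batch.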
>
> (Perhaps "half-full" was misleading... the optimal state may be "just
> enough room for one more packet", or something along those lines...)
>
> Andrew
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Interesting observation with network event notification and batching
2013-06-17 10:35 ` Wei Liu
@ 2013-06-17 11:34 ` annie li
0 siblings, 0 replies; 27+ messages in thread
From: annie li @ 2013-06-17 11:34 UTC (permalink / raw)
To: Wei Liu
Cc: andrew.bennieston, Konrad Rzeszutek Wilk, xen-devel,
Ian Campbell, stefano.stabellini
On 2013-6-17 18:35, Wei Liu wrote:
> On Mon, Jun 17, 2013 at 10:38:33AM +0100, Ian Campbell wrote:
>> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
>>>>> Konrad, IIRC you once mentioned you discovered something with event
>>>>> notification, what's that?
>>>> They were bizzare. I naively expected some form of # of physical NIC
>>>> interrupts to be around the same as the VIF or less. And I figured
>>>> that the amount of interrupts would be constant irregardless of the
>>>> size of the packets. In other words #packets == #interrupts.
>>>>
>>> It could be that the frontend notifies the backend for every packet it
>>> sends. This is not desirable and I don't expect the ring to behave that
>>> way.
>> It is probably worth checking that things are working how we think they
>> should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
>> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
>> suitable points to maximise batching.
>>
>> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
>> loop right? This would push the req_event pointer to just after the last
>> request, meaning the net request enqueued by the frontend would cause a
>> notification -- even though the backend is actually still continuing to
>> process requests and would have picked up that packet without further
>> notification. n this case there is a fair bit of work left in the
>> backend for this iteration i.e. plenty of opportunity for the frontend
>> to queue more requests.
>>
>> The comments in ring.h say:
>> * These macros will set the req_event/rsp_event field to trigger a
>> * notification on the very next message that is enqueued. If you want to
>> * create batches of work (i.e., only receive a notification after several
>> * messages have been enqueued) then you will need to create a customised
>> * version of the FINAL_CHECK macro in your own code, which sets the event
>> * field appropriately.
>>
>> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
>> (and other similar loops) and add a FINAL check at the very end?
>>
>>>> But it was odd and I didn't go deeper in it to figure out what
>>>> was happening. And also to figure out if for the VIF we could
>>>> do something of #packets != #interrupts. And hopefully some
>>>> mechanism to adjust so that the amount of interrupts would
>>>> be lesser per packets (hand waving here).
>>> I'm trying to do this now.
>> What scheme do you have in mind?
> Basically the one you mentioned above.
>
> Playing with various event pointers now.
Did you collect data on how many requests netback processes each time
req_event is updated in RING_FINAL_CHECK_FOR_REQUESTS? From your test
results I assume this value is pretty small. How about not updating
req_event every time there is no unconsumed request in
RING_FINAL_CHECK_FOR_REQUESTS?
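One way to picture this is a customised final check that arms req_event several slots ahead instead of at the very next request. This is only a sketch: struct toy_ring and final_check_batched() are hypothetical names, the barrier is only marked by a comment, and a real driver would presumably also need a timer or similar fallback so that fewer than 'batch' queued requests do not sit in the ring forever.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ring_idx;

/* Toy model of the indices involved; not the real shared-ring layout. */
struct toy_ring {
    ring_idx req_prod;   /* advanced by the frontend */
    ring_idx req_event;  /* frontend notifies when req_prod passes this */
    ring_idx req_cons;   /* backend-private consumer index */
};

/*
 * Returns nonzero if there is still work to do. Otherwise asks for a
 * notification only once 'batch' further requests have been queued.
 * batch == 1 gives the stock "notify on the very next request" behaviour.
 */
static int final_check_batched(struct toy_ring *r, ring_idx batch)
{
    assert(batch >= 1);

    if (r->req_prod != r->req_cons)
        return 1;
    r->req_event = r->req_cons + batch;
    /* a real implementation needs a full memory barrier here */
    return r->req_prod != r->req_cons;
}
```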
Thanks
Annie
>
>
> Wei.
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Interesting observation with network event notification and batching
2013-06-17 11:08 ` Ian Campbell
@ 2013-06-17 11:55 ` Andrew Bennieston
0 siblings, 0 replies; 27+ messages in thread
From: Andrew Bennieston @ 2013-06-17 11:55 UTC (permalink / raw)
To: Ian Campbell
Cc: annie.li, xen-devel, Wei Liu, stefano.stabellini, Konrad Rzeszutek Wilk
On 17/06/13 12:08, Ian Campbell wrote:
> On Mon, 2013-06-17 at 11:56 +0100, Andrew Bennieston wrote:
>> On 17/06/13 11:46, Wei Liu wrote:
>>> On Mon, Jun 17, 2013 at 10:56:12AM +0100, Andrew Bennieston wrote:
>>>> On 17/06/13 10:38, Ian Campbell wrote:
>>>>> On Sun, 2013-06-16 at 10:54 +0100, Wei Liu wrote:
>>>>>>>> Konrad, IIRC you once mentioned you discovered something with event
>>>>>>>> notification, what's that?
>>>>>>>
>>>>>>> They were bizzare. I naively expected some form of # of physical NIC
>>>>>>> interrupts to be around the same as the VIF or less. And I figured
>>>>>>> that the amount of interrupts would be constant irregardless of the
>>>>>>> size of the packets. In other words #packets == #interrupts.
>>>>>>>
>>>>>>
>>>>>> It could be that the frontend notifies the backend for every packet it
>>>>>> sends. This is not desirable and I don't expect the ring to behave that
>>>>>> way.
>>>>
>>>> I have observed this kind of behaviour during network performance
>>>> tests in which I periodically checked the ring state during an iperf
>>>> session. It looked to me like the frontend was sending notifications
>>>> far too often, but that the backend was sending them very
>>>> infrequently, so the Tx (from guest) ring was mostly empty and the
>>>> Rx (to guest) ring was mostly full. This has the effect of both
>>>> front and backend having to block occasionally waiting for the other
>>>> end to clear or fill a ring, even though there is more data
>>>> available.
>>>>
>>>> My initial theory was that this was caused in part by the shared
>>>> event channel, however I expect that Wei is testing on top of a
>>>> kernel with his split event channel features?
>>>>
>>>
>>> Yes, with split event channels.
>>>
>>> And during tests the interrupt counts, frontend TX has 6 figures
>>> interrupt number while frontend RX has 2 figures number.
>>>
>>>>>
>>>>> It is probably worth checking that things are working how we think they
>>>>> should. i.e. that netback's calls to RING_FINAL_CHECK_FOR_.. and
>>>>> netfront's calls to RING_PUSH_..._AND_CHECK_NOTIFY are placed at
>>>>> suitable points to maximise batching.
>>>>>
>>>>> Is the RING_FINAL_CHECK_FOR_REQUESTS inside the xen_netbk_tx_build_gops
>>>>> loop right? This would push the req_event pointer to just after the last
>>>>> request, meaning the net request enqueued by the frontend would cause a
>>>>> notification -- even though the backend is actually still continuing to
>>>>> process requests and would have picked up that packet without further
>>>>> notification. n this case there is a fair bit of work left in the
>>>>> backend for this iteration i.e. plenty of opportunity for the frontend
>>>>> to queue more requests.
>>>>>
>>>>> The comments in ring.h say:
>>>>> * These macros will set the req_event/rsp_event field to trigger a
>>>>> * notification on the very next message that is enqueued. If you want to
>>>>> * create batches of work (i.e., only receive a notification after several
>>>>> * messages have been enqueued) then you will need to create a customised
>>>>> * version of the FINAL_CHECK macro in your own code, which sets the event
>>>>> * field appropriately.
>>>>>
>>>>> Perhaps we want to just use RING_HAS_UNCONSUMED_REQUESTS in that loop
>>>>> (and other similar loops) and add a FINAL check at the very end?
>>>>>
>>>>>>> But it was odd and I didn't go deeper in it to figure out what
>>>>>>> was happening. And also to figure out if for the VIF we could
>>>>>>> do something of #packets != #interrupts. And hopefully some
>>>>>>> mechanism to adjust so that the amount of interrupts would
>>>>>>> be lesser per packets (hand waving here).
>>>>>>
>>>>>> I'm trying to do this now.
>>>>>
>>>>> What scheme do you have in mind?
>>>>
>>>> As I mentioned above, filling a ring completely appears to be almost
>>>> as bad as sending too many notifications. The ideal scheme may
>>>> involve trying to balance the ring at some "half-full" state,
>>>> depending on the capacity for the front- and backends to process
>>>> requests and responses.
>>>>
>>>
>>> I don't think filling the ring full causes any problem, that's just
>>> conceptually the same as "half-full" state if you need to throttle the
>>> ring.
>> My understanding was that filling the ring will cause the producer to
>> sleep until slots become available (i.e. the until the consumer notifies
>> it that it has removed something from the ring).
>>
>> I'm just concerned that overly aggressive batching may lead to a
>> situation where the consumer is sitting idle, waiting for a notification
>> that the producer hasn't yet sent because it can still fill more slots
>> on the ring. When the ring is completely full, the producer would have
>> to wait for the ring to partially empty. At this point, the consumer
>> would hold off notifying because it can still batch more processing, so
>> the producer is left waiting. (Repeat as required). It would be better
>> to have both producer and consumer running concurrently.
>>
>> I mention this mainly so that we don't end up with a swing to the polar
>> opposite of what we have now, which (to my mind) is just as bad. Clearly
>> this is an edge case, but if there's a reason I'm missing that this
>> can't happen (e.g. after a period of inactivity) then don't hesitate to
>> point it out :)
>
> Doesn't the separation between req_event and rsp_event help here?
>
> So if the producer fills the ring, it will sleep, but set rsp_event
> appropriately that when the backend completes some (but not all) work it
> will be woken up so that it can put extra stuff on the ring.
>
> It shouldn't need to wait for the backend to process the whole batch for
> this.
Right. As long as this logic doesn't get inadvertently changed in an
attempt to improve batching of events!
>
>>
>> (Perhaps "half-full" was misleading... the optimal state may be "just
>> enough room for one more packet", or something along those lines...)
>>
>> Andrew
>>
>
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Interesting observation with network event notification and batching
2013-06-12 10:14 Interesting observation with network event notification and batching Wei Liu
2013-06-14 18:53 ` Konrad Rzeszutek Wilk
@ 2013-06-28 16:15 ` Wei Liu
2013-07-01 7:48 ` annie li
1 sibling, 1 reply; 27+ messages in thread
From: Wei Liu @ 2013-06-28 16:15 UTC (permalink / raw)
To: xen-devel
Cc: wei.liu2, ian.campbell, stefano.stabellini, annie.li, andrew.bennieston
Hi all,
After collecting more stats and comparing the copying / mapping cases, I
now have some more interesting findings, which might contradict what I
said before.
I tuned the runes I used for benchmark to make sure iperf and netperf
generate large packets (~64K). Here are the runes I use:
iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
                           COPY       MAP
iperf    Tput:             6.5Gb/s    14Gb/s (was 2.5Gb/s)
         PPI               2.90       1.07
         SPI               37.75      13.69
         PPN               2.90       1.07
         SPN               37.75      13.69
         tx_count          31808      174769
         nr_napi_schedule  31805      174697
         total_packets     92354      187408
         total_reqs        1200793    2392614

netperf  Tput:             5.8Gb/s    10.5Gb/s
         PPI               2.13       1.00
         SPI               36.70      16.73
         PPN               2.13       1.31
         SPN               36.70      16.75
         tx_count          57635      205599
         nr_napi_schedule  57633      205311
         total_packets     122800     270254
         total_reqs        2115068    3439751
PPI: packets processed per interrupt
SPI: slots processed per interrupt
PPN: packets processed per napi schedule
SPN: slots processed per napi schedule
tx_count: interrupt count
total_reqs: total slots used during test
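The per-interrupt and per-NAPI figures appear to be plain ratios of the raw counters (for example PPI = total_packets / tx_count), which is at least consistent with the iperf rows. A minimal sketch of that arithmetic follows; struct tx_stats and the helper names are hypothetical, chosen only to mirror the labels in the table.

```c
#include <assert.h>

/* Raw counters as listed in the table. */
struct tx_stats {
    unsigned long tx_count;          /* interrupt count */
    unsigned long nr_napi_schedule;  /* NAPI schedule count */
    unsigned long total_packets;     /* packets seen during the test */
    unsigned long total_reqs;        /* slots used during the test */
};

static double per_event(unsigned long total, unsigned long events)
{
    assert(events != 0);
    return (double)total / (double)events;
}

/* Packets / slots processed per interrupt. */
static double ppi(const struct tx_stats *s) { return per_event(s->total_packets, s->tx_count); }
static double spi(const struct tx_stats *s) { return per_event(s->total_reqs, s->tx_count); }

/* Packets / slots processed per NAPI schedule. */
static double ppn(const struct tx_stats *s) { return per_event(s->total_packets, s->nr_napi_schedule); }
static double spn(const struct tx_stats *s) { return per_event(s->total_reqs, s->nr_napi_schedule); }
```

Feeding in the iperf COPY counters gives about 2.90 packets and 37.75 slots per interrupt. By the same arithmetic the netperf MAP counters give 270254 / 205599, roughly 1.31 packets per interrupt, so the 1.00 in that cell may deserve a second look.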
* Notification and batching
Are notification and batching really a problem? I'm not so sure now.
Before I measured PPI / PPN / SPI / SPN in the copying case, my first
thought was that "in that case netback *must* have better batching",
which turned out not to be very true: copying mode makes netback
slower, but the batching gained is not huge.
Ideally we still want to batch as much as possible. One possible way is
to play with the 'weight' parameter in NAPI. But as the figures show,
batching does not seem to be very important for throughput, at least
for now. If the NAPI framework and netfront / netback are doing their
jobs as designed we might not need to worry about this now.
Andrew, do you have any thoughts on this? You found out that NAPI didn't
scale well with multi-threaded iperf in DomU; do you have any idea how
that can happen?
* Thoughts on zero-copy TX
With this hack we are able to achieve 10Gb/s on a single stream, which
is good. But with the classic XenoLinux kernel, which has zero-copy TX,
we weren't able to achieve this. I also developed another zero copy netback
prototype one year ago with Ian's out-of-tree skb frag destructor patch
series. That prototype couldn't achieve 10Gb/s either (IIRC the
performance was more or less the same as copying mode, about 6~7Gb/s).
My hack maps all necessary pages permanently; there is no unmap, so we
skip lots of page table manipulation and TLB flushes. So my basic
conclusion is that page table manipulation and TLB flushes do incur
a heavy performance penalty.
There is no way this hack can be upstreamed. If we're to re-introduce
zero-copy TX, we would need to implement some sort of lazy flushing
mechanism. I haven't thought this through. Presumably this mechanism
would also benefit blk somehow? I'm not sure yet.
Could persistent mapping (with the to-be-developed reclaim / MRU list
mechanism) be useful here? So that we can unify blk and net drivers?
* Changes required to introduce zero-copy TX
1. SKB frag destructor series: to track life cycle of SKB frags. This is
not yet upstreamed.
2. Mechanism to negotiate max slots frontend can use: mapping requires
backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
3. Lazy flushing mechanism or persistent grants: ???
Wei.
* Note
In my previous tests I only ran iperf and didn't have the right rune to
generate large packets. Iperf seems to increase the packet size as time
goes by. In the copying case the packet size eventually increased to
64K, while in the mapping case an odd thing happened (I believe that
must be due to a bug in my hack :-/): the packet size always stayed at
the default size (8K). Adding '-l 131072' to iperf makes sure that the
packet is always 64K.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Interesting observation with network event notification and batching
2013-06-28 16:15 ` Wei Liu
@ 2013-07-01 7:48 ` annie li
2013-07-01 8:54 ` Wei Liu
2013-07-01 14:19 ` Stefano Stabellini
0 siblings, 2 replies; 27+ messages in thread
From: annie li @ 2013-07-01 7:48 UTC (permalink / raw)
To: Wei Liu; +Cc: andrew.bennieston, ian.campbell, stefano.stabellini, xen-devel
On 2013-6-29 0:15, Wei Liu wrote:
> Hi all,
>
> After collecting more stats and comparing copying / mapping cases, I now
> have some more interesting finds, which might contradict what I said
> before.
>
> I tuned the runes I used for benchmark to make sure iperf and netperf
> generate large packets (~64K). Here are the runes I use:
>
> iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
> netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
>
> COPY MAP
> iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s)
So with the default iperf settings, copy is about 7.9 Gb/s and map is
about 2.5 Gb/s? What about the netperf results without large packets?
> PPI 2.90 1.07
> SPI 37.75 13.69
> PPN 2.90 1.07
> SPN 37.75 13.69
> tx_count 31808 174769
It seems the interrupt count does not affect performance at all with
-l 131072 -w 128k.
> nr_napi_schedule 31805 174697
> total_packets 92354 187408
> total_reqs 1200793 2392614
>
> netperf Tput: 5.8Gb/s 10.5Gb/s
> PPI 2.13 1.00
> SPI 36.70 16.73
> PPN 2.13 1.31
> SPN 36.70 16.75
> tx_count 57635 205599
> nr_napi_schedule 57633 205311
> total_packets 122800 270254
> total_reqs 2115068 3439751
>
> PPI: packets processed per interrupt
> SPI: slots processed per interrupt
> PPN: packets processed per napi schedule
> SPN: slots processed per napi schedule
> tx_count: interrupt count
> total_reqs: total slots used during test
>
> * Notification and batching
>
> Is notification and batching really a problem? I'm not so sure now. My
> first thought when I didn't measure PPI / PPN / SPI / SPN in copying
> case was that "in that case netback *must* have better batching" which
> turned out not very true -- copying mode makes netback slower, however
> the batching gained is not hugh.
>
> Ideally we still want to batch as much as possible. Possible way
> includes playing with the 'weight' parameter in NAPI. But as the figures
> show batching seems not to be very important for throughput, at least
> for now. If the NAPI framework and netfront / netback are doing their
> jobs as designed we might not need to worry about this now.
>
> Andrew, do you have any thought on this? You found out that NAPI didn't
> scale well with multi-threaded iperf in DomU, do you have any handle how
> that can happen?
>
> * Thoughts on zero-copy TX
>
> With this hack we are able to achieve 10Gb/s single stream, which is
> good. But, with classic XenoLinux kernel which has zero copy TX we
> didn't able to achieve this. I also developed another zero copy netback
> prototype one year ago with Ian's out-of-tree skb frag destructor patch
> series. That prototype couldn't achieve 10Gb/s either (IIRC the
> performance was more or less the same as copying mode, about 6~7Gb/s).
>
> My hack maps all necessary pages permanently, there is no unmap, we
> skip lots of page table manipulation and TLB flushes. So my basic
> conclusion is that page table manipulation and TLB flushes do incur
> heavy performance penalty.
>
> This hack can be upstreamed in no way. If we're to re-introduce
> zero-copy TX, we would need to implement some sort of lazy flushing
> mechanism. I haven't thought this through. Presumably this mechanism
> would also benefit blk somehow? I'm not sure yet.
>
> Could persistent mapping (with the to-be-developed reclaim / MRU list
> mechanism) be useful here? So that we can unify blk and net drivers?
>
> * Changes required to introduce zero-copy TX
>
> 1. SKB frag destructor series: to track life cycle of SKB frags. This is
> not yet upstreamed.
Are you referring to this one?
http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html
>
> 2. Mechanism to negotiate max slots frontend can use: mapping requires
> backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
>
> 3. Lazy flushing mechanism or persistent grants: ???
I did some tests with persistent grants before, and they did not show
better performance than grant copy. But I was using the default netperf
parameters and did not try large packet sizes. Your results remind me
that persistent grants might get similar results with larger packet
sizes too.
Thanks
Annie
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Interesting observation with network event notification and batching
2013-07-01 7:48 ` annie li
@ 2013-07-01 8:54 ` Wei Liu
2013-07-01 14:29 ` Stefano Stabellini
2013-07-01 15:59 ` annie li
2013-07-01 14:19 ` Stefano Stabellini
1 sibling, 2 replies; 27+ messages in thread
From: Wei Liu @ 2013-07-01 8:54 UTC (permalink / raw)
To: annie li
Cc: Wei Liu, ian.campbell, stefano.stabellini, xen-devel, andrew.bennieston
On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
>
> On 2013-6-29 0:15, Wei Liu wrote:
> >Hi all,
> >
> >After collecting more stats and comparing copying / mapping cases, I now
> >have some more interesting finds, which might contradict what I said
> >before.
> >
> >I tuned the runes I used for benchmark to make sure iperf and netperf
> >generate large packets (~64K). Here are the runes I use:
> >
> > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
> > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
> >
> > COPY MAP
> >iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s)
>
> So with default iperf setting, copy is about 7.9G, and map is about
> 2.5G? How about the result of netperf without large packets?
>
First question, yes.
Second question: 5.8Gb/s. And I believe that for the copying scheme without
large packets the throughput is more or less the same.
> > PPI 2.90 1.07
> > SPI 37.75 13.69
> > PPN 2.90 1.07
> > SPN 37.75 13.69
> > tx_count 31808 174769
>
> Seems interrupt count does not affect the performance at all with -l
> 131072 -w 128k.
>
Right.
> > nr_napi_schedule 31805 174697
> > total_packets 92354 187408
> > total_reqs 1200793 2392614
> >
> >netperf Tput: 5.8Gb/s 10.5Gb/s
> > PPI 2.13 1.00
> > SPI 36.70 16.73
> > PPN 2.13 1.31
> > SPN 36.70 16.75
> > tx_count 57635 205599
> > nr_napi_schedule 57633 205311
> > total_packets 122800 270254
> > total_reqs 2115068 3439751
> >
> > PPI: packets processed per interrupt
> > SPI: slots processed per interrupt
> > PPN: packets processed per napi schedule
> > SPN: slots processed per napi schedule
> > tx_count: interrupt count
> > total_reqs: total slots used during test
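As a sanity check on the table: the per-interrupt and per-NAPI figures look like plain quotients of the raw counters, e.g. PPI = total_packets / tx_count. That formula is an assumption, not something stated in the thread, and the dictionary layout and the `ratios` helper below are purely illustrative. A minimal Python sketch that recomputes the ratios from the quoted counters:

```python
# Raw counters copied from the quoted COPY / MAP tables.
counters = {
    ("iperf", "copy"): dict(tx_count=31808, nr_napi_schedule=31805,
                            total_packets=92354, total_reqs=1200793),
    ("iperf", "map"): dict(tx_count=174769, nr_napi_schedule=174697,
                           total_packets=187408, total_reqs=2392614),
    ("netperf", "copy"): dict(tx_count=57635, nr_napi_schedule=57633,
                              total_packets=122800, total_reqs=2115068),
    ("netperf", "map"): dict(tx_count=205599, nr_napi_schedule=205311,
                             total_packets=270254, total_reqs=3439751),
}


def ratios(c):
    """Assumed definitions: each metric is a counter divided by an event count."""
    return {
        "PPI": c["total_packets"] / c["tx_count"],          # packets per interrupt
        "SPI": c["total_reqs"] / c["tx_count"],             # slots per interrupt
        "PPN": c["total_packets"] / c["nr_napi_schedule"],  # packets per napi schedule
        "SPN": c["total_reqs"] / c["nr_napi_schedule"],     # slots per napi schedule
    }


derived = {key: ratios(c) for key, c in counters.items()}

for (tool, mode), r in derived.items():
    line = "  ".join(f"{name}={value:.2f}" for name, value in r.items())
    print(f"{tool:8s} {mode:5s} {line}")
```

Under that assumption the quotients land within about 0.01 of the table, except the MAP PPI for netperf, which comes out around 1.31 rather than 1.00, so either that entry or the assumed formula is off.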
> >
> >* Notification and batching
> >
> >Is notification and batching really a problem? I'm not so sure now. My
> >first thought when I didn't measure PPI / PPN / SPI / SPN in copying
> >case was that "in that case netback *must* have better batching" which
> >turned out not very true -- copying mode makes netback slower, however
> >the batching gained is not huge.
> >
> >Ideally we still want to batch as much as possible. Possible way
> >includes playing with the 'weight' parameter in NAPI. But as the figures
> >show batching seems not to be very important for throughput, at least
> >for now. If the NAPI framework and netfront / netback are doing their
> >jobs as designed we might not need to worry about this now.
> >
> >Andrew, do you have any thought on this? You found out that NAPI didn't
> >scale well with multi-threaded iperf in DomU, do you have any handle how
> >that can happen?
> >
> >* Thoughts on zero-copy TX
> >
> >With this hack we are able to achieve 10Gb/s single stream, which is
> >good. But, with classic XenoLinux kernel which has zero copy TX we
> >weren't able to achieve this. I also developed another zero copy netback
> >prototype one year ago with Ian's out-of-tree skb frag destructor patch
> >series. That prototype couldn't achieve 10Gb/s either (IIRC the
> >performance was more or less the same as copying mode, about 6~7Gb/s).
> >
> >My hack maps all necessary pages permanently, there is no unmap, we
> >skip lots of page table manipulation and TLB flushes. So my basic
> >conclusion is that page table manipulation and TLB flushes do incur
> >heavy performance penalty.
> >
> >This hack can be upstreamed in no way. If we're to re-introduce
> >zero-copy TX, we would need to implement some sort of lazy flushing
> >mechanism. I haven't thought this through. Presumably this mechanism
> >would also benefit blk somehow? I'm not sure yet.
> >
> >Could persistent mapping (with the to-be-developed reclaim / MRU list
> >mechanism) be useful here? So that we can unify blk and net drivers?
> >
> >* Changes required to introduce zero-copy TX
> >
> >1. SKB frag destructor series: to track life cycle of SKB frags. This is
> >not yet upstreamed.
>
> Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
>
> <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html>
>
Yes. But I believe there have been several versions posted. The link you
have is not the latest version.
> >
> >2. Mechanism to negotiate max slots frontend can use: mapping requires
> >backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
> >
> >3. Lazy flushing mechanism or persistent grants: ???
>
> I did some test with persistent grants before, it did not show
> better performance than grant copy. But I was using the default
> params of netperf, and not tried large packet size. Your results
> reminds me that maybe persistent grants would get similar results
> with larger packet size too.
>
"No better performance" -- that's because both mechanisms are copying?
However, I presume persistent grants can scale better? From an email
last week, I read that the copying is done by the guest, so this
mechanism scales much better than hypervisor copying in blk's case.
Wei.
> Thanks
> Annie
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Interesting observation with network event notification and batching
2013-07-01 7:48 ` annie li
2013-07-01 8:54 ` Wei Liu
@ 2013-07-01 14:19 ` Stefano Stabellini
2013-07-01 15:59 ` annie li
1 sibling, 1 reply; 27+ messages in thread
From: Stefano Stabellini @ 2013-07-01 14:19 UTC (permalink / raw)
To: annie li
Cc: Wei Liu, ian.campbell, stefano.stabellini, xen-devel, andrew.bennieston
Could you please use plain text emails in the future?
On Mon, 1 Jul 2013, annie li wrote:
> On 2013-6-29 0:15, Wei Liu wrote:
>
> Hi all,
>
> After collecting more stats and comparing copying / mapping cases, I now
> have some more interesting finds, which might contradict what I said
> before.
>
> I tuned the runes I used for benchmark to make sure iperf and netperf
> generate large packets (~64K). Here are the runes I use:
>
> iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
> netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
>
> COPY MAP
> iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s)
>
>
> So with default iperf setting, copy is about 7.9G, and map is about 2.5G? How about the result of netperf without large packets?
>
> PPI 2.90 1.07
> SPI 37.75 13.69
> PPN 2.90 1.07
> SPN 37.75 13.69
> tx_count 31808 174769
>
>
> Seems interrupt count does not affect the performance at all with -l 131072 -w 128k.
>
> nr_napi_schedule 31805 174697
> total_packets 92354 187408
> total_reqs 1200793 2392614
>
> netperf Tput: 5.8Gb/s 10.5Gb/s
> PPI 2.13 1.00
> SPI 36.70 16.73
> PPN 2.13 1.31
> SPN 36.70 16.75
> tx_count 57635 205599
> nr_napi_schedule 57633 205311
> total_packets 122800 270254
> total_reqs 2115068 3439751
>
> PPI: packets processed per interrupt
> SPI: slots processed per interrupt
> PPN: packets processed per napi schedule
> SPN: slots processed per napi schedule
> tx_count: interrupt count
> total_reqs: total slots used during test
>
> * Notification and batching
>
> Is notification and batching really a problem? I'm not so sure now. My
> first thought when I didn't measure PPI / PPN / SPI / SPN in copying
> case was that "in that case netback *must* have better batching" which
> turned out not very true -- copying mode makes netback slower, however
> the batching gained is not huge.
>
> Ideally we still want to batch as much as possible. Possible way
> includes playing with the 'weight' parameter in NAPI. But as the figures
> show batching seems not to be very important for throughput, at least
> for now. If the NAPI framework and netfront / netback are doing their
> jobs as designed we might not need to worry about this now.
>
> Andrew, do you have any thought on this? You found out that NAPI didn't
> scale well with multi-threaded iperf in DomU, do you have any handle how
> that can happen?
>
> * Thoughts on zero-copy TX
>
> With this hack we are able to achieve 10Gb/s single stream, which is
> good. But, with classic XenoLinux kernel which has zero copy TX we
> weren't able to achieve this. I also developed another zero copy netback
> prototype one year ago with Ian's out-of-tree skb frag destructor patch
> series. That prototype couldn't achieve 10Gb/s either (IIRC the
> performance was more or less the same as copying mode, about 6~7Gb/s).
>
> My hack maps all necessary pages permanently, there is no unmap, we
> skip lots of page table manipulation and TLB flushes. So my basic
> conclusion is that page table manipulation and TLB flushes do incur
> heavy performance penalty.
>
> This hack can be upstreamed in no way. If we're to re-introduce
> zero-copy TX, we would need to implement some sort of lazy flushing
> mechanism. I haven't thought this through. Presumably this mechanism
> would also benefit blk somehow? I'm not sure yet.
>
> Could persistent mapping (with the to-be-developed reclaim / MRU list
> mechanism) be useful here? So that we can unify blk and net drivers?
>
> * Changes required to introduce zero-copy TX
>
> 1. SKB frag destructor series: to track life cycle of SKB frags. This is
> not yet upstreamed.
>
>
> Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
>
>
> 2. Mechanism to negotiate max slots frontend can use: mapping requires
> backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
>
> 3. Lazy flushing mechanism or persistent grants: ???
>
>
> I did some test with persistent grants before, it did not show better performance than grant copy. But I was using the default
> params of netperf, and not tried large packet size. Your results reminds me that maybe persistent grants would get similar
> results with larger packet size too.
>
> Thanks
> Annie
>
>
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Interesting observation with network event notification and batching
2013-07-01 8:54 ` Wei Liu
@ 2013-07-01 14:29 ` Stefano Stabellini
2013-07-01 14:39 ` Wei Liu
2013-07-01 15:59 ` annie li
1 sibling, 1 reply; 27+ messages in thread
From: Stefano Stabellini @ 2013-07-01 14:29 UTC (permalink / raw)
To: Wei Liu
Cc: ian.campbell, stefano.stabellini, xen-devel, annie li, andrew.bennieston
On Mon, 1 Jul 2013, Wei Liu wrote:
> On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
> >
> > On 2013-6-29 0:15, Wei Liu wrote:
> > >Hi all,
> > >
> > >After collecting more stats and comparing copying / mapping cases, I now
> > >have some more interesting finds, which might contradict what I said
> > >before.
> > >
> > >I tuned the runes I used for benchmark to make sure iperf and netperf
> > >generate large packets (~64K). Here are the runes I use:
> > >
> > > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
> > > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
> > >
> > > COPY MAP
> > >iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s)
> >
> > So with default iperf setting, copy is about 7.9G, and map is about
> > 2.5G? How about the result of netperf without large packets?
> >
>
> First question, yes.
>
> Second question, 5.8Gb/s. And I believe for the copying scheme without
> large packet the throughput is more or less the same.
>
> > > PPI 2.90 1.07
> > > SPI 37.75 13.69
> > > PPN 2.90 1.07
> > > SPN 37.75 13.69
> > > tx_count 31808 174769
> >
> > Seems interrupt count does not affect the performance at all with -l
> > 131072 -w 128k.
> >
>
> Right.
>
> > > nr_napi_schedule 31805 174697
> > > total_packets 92354 187408
> > > total_reqs 1200793 2392614
> > >
> > >netperf Tput: 5.8Gb/s 10.5Gb/s
> > > PPI 2.13 1.00
> > > SPI 36.70 16.73
> > > PPN 2.13 1.31
> > > SPN 36.70 16.75
> > > tx_count 57635 205599
> > > nr_napi_schedule 57633 205311
> > > total_packets 122800 270254
> > > total_reqs 2115068 3439751
> > >
> > > PPI: packets processed per interrupt
> > > SPI: slots processed per interrupt
> > > PPN: packets processed per napi schedule
> > > SPN: slots processed per napi schedule
> > > tx_count: interrupt count
> > > total_reqs: total slots used during test
> > >
> > >* Notification and batching
> > >
> > >Is notification and batching really a problem? I'm not so sure now. My
> > >first thought when I didn't measure PPI / PPN / SPI / SPN in copying
> > >case was that "in that case netback *must* have better batching" which
> > >turned out not very true -- copying mode makes netback slower, however
> > >the batching gained is not huge.
> > >
> > >Ideally we still want to batch as much as possible. Possible way
> > >includes playing with the 'weight' parameter in NAPI. But as the figures
> > >show batching seems not to be very important for throughput, at least
> > >for now. If the NAPI framework and netfront / netback are doing their
> > >jobs as designed we might not need to worry about this now.
> > >
> > >Andrew, do you have any thought on this? You found out that NAPI didn't
> > >scale well with multi-threaded iperf in DomU, do you have any handle how
> > >that can happen?
> > >
> > >* Thoughts on zero-copy TX
> > >
> > >With this hack we are able to achieve 10Gb/s single stream, which is
> > >good. But, with classic XenoLinux kernel which has zero copy TX we
> > >weren't able to achieve this. I also developed another zero copy netback
> > >prototype one year ago with Ian's out-of-tree skb frag destructor patch
> > >series. That prototype couldn't achieve 10Gb/s either (IIRC the
> > >performance was more or less the same as copying mode, about 6~7Gb/s).
> > >
> > >My hack maps all necessary pages permanently, there is no unmap, we
> > >skip lots of page table manipulation and TLB flushes. So my basic
> > >conclusion is that page table manipulation and TLB flushes do incur
> > >heavy performance penalty.
> > >
> > >This hack can be upstreamed in no way. If we're to re-introduce
> > >zero-copy TX, we would need to implement some sort of lazy flushing
> > >mechanism. I haven't thought this through. Presumably this mechanism
> > >would also benefit blk somehow? I'm not sure yet.
> > >
> > >Could persistent mapping (with the to-be-developed reclaim / MRU list
> > >mechanism) be useful here? So that we can unify blk and net drivers?
> > >
> > >* Changes required to introduce zero-copy TX
> > >
> > >1. SKB frag destructor series: to track life cycle of SKB frags. This is
> > >not yet upstreamed.
> >
> > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
> >
> > <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html>
> >
>
> Yes. But I believe there's been several versions posted. The link you
> have is not the latest version.
>
> > >
> > >2. Mechanism to negotiate max slots frontend can use: mapping requires
> > >backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
> > >
> > >3. Lazy flushing mechanism or persistent grants: ???
> >
> > I did some test with persistent grants before, it did not show
> > better performance than grant copy. But I was using the default
> > params of netperf, and not tried large packet size. Your results
> > reminds me that maybe persistent grants would get similar results
> > with larger packet size too.
> >
>
> "No better performance" -- that's because both mechanisms are copying?
> However I presume persistent grant can scale better? From an earlier
> email last week, I read that copying is done by the guest so that this
> mechanism scales much better than hypervisor copying in blk's case.
Yes, I always expected persistent grants to be faster than
gnttab_copy, but I was very surprised by the difference in performance:
http://marc.info/?l=xen-devel&m=137234605929944
I think it's worth trying persistent grants on PV network, although it's
very unlikely that they are going to improve the throughput by 5 Gb/s.
Also once we have both PV block and network using persistent grants,
we might hit the grant table limit; see this email:
http://marc.info/?l=xen-devel&m=137183474618974
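To illustrate why a cap on mapped grants matters for a persistent scheme, here is a toy Python model. Everything in it is hypothetical: the class name, the `max_mapped` cap standing in for the grant table limit, and the LRU reclaim policy are assumptions made for illustration, not how blkback or netback actually behave. It only counts how many map and unmap operations a cache of persistent mappings would issue.

```python
from collections import OrderedDict


class PersistentGrantCache:
    """Toy model: grant ref -> 'mapped page', bounded by max_mapped."""

    def __init__(self, max_mapped):
        self.max_mapped = max_mapped   # stand-in for the grant table limit
        self.mapped = OrderedDict()    # ordered from least to most recently used
        self.map_ops = 0
        self.unmap_ops = 0

    def get(self, gref):
        if gref in self.mapped:
            # Hit: reuse the existing mapping, no map/unmap needed.
            self.mapped.move_to_end(gref)
            return self.mapped[gref]
        if len(self.mapped) >= self.max_mapped:
            # Cap reached: reclaim the least recently used mapping.
            self.mapped.popitem(last=False)
            self.unmap_ops += 1
        self.map_ops += 1
        page = bytearray(4096)         # placeholder for a mapped page
        self.mapped[gref] = page
        return page


def run(cache, grefs, rounds):
    """Touch every grant ref in order, `rounds` times over."""
    for _ in range(rounds):
        for gref in grefs:
            cache.get(gref)
    return cache.map_ops, cache.unmap_ops


fits = run(PersistentGrantCache(max_mapped=256), range(256), rounds=10)
thrash = run(PersistentGrantCache(max_mapped=128), range(256), rounds=10)
print(fits)    # (256, 0)
print(thrash)  # (2560, 2432)
```

When the working set fits under the cap, each grant is mapped once and never unmapped. Once it exceeds the cap, a cyclic access pattern misses every time and the scheme degrades to one map and one unmap per request.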
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Interesting observation with network event notification and batching
2013-07-01 14:29 ` Stefano Stabellini
@ 2013-07-01 14:39 ` Wei Liu
2013-07-01 14:54 ` Stefano Stabellini
0 siblings, 1 reply; 27+ messages in thread
From: Wei Liu @ 2013-07-01 14:39 UTC (permalink / raw)
To: Stefano Stabellini
Cc: Wei Liu, ian.campbell, xen-devel, annie li, andrew.bennieston
On Mon, Jul 01, 2013 at 03:29:45PM +0100, Stefano Stabellini wrote:
> On Mon, 1 Jul 2013, Wei Liu wrote:
> > On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
> > >
> > > On 2013-6-29 0:15, Wei Liu wrote:
> > > >Hi all,
> > > >
> > > >After collecting more stats and comparing copying / mapping cases, I now
> > > >have some more interesting finds, which might contradict what I said
> > > >before.
> > > >
> > > >I tuned the runes I used for benchmark to make sure iperf and netperf
> > > >generate large packets (~64K). Here are the runes I use:
> > > >
> > > > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
> > > > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
> > > >
> > > > COPY MAP
> > > >iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s)
> > >
> > > So with default iperf setting, copy is about 7.9G, and map is about
> > > 2.5G? How about the result of netperf without large packets?
> > >
> >
> > First question, yes.
> >
> > Second question, 5.8Gb/s. And I believe for the copying scheme without
> > large packet the throughput is more or less the same.
> >
> > > > PPI 2.90 1.07
> > > > SPI 37.75 13.69
> > > > PPN 2.90 1.07
> > > > SPN 37.75 13.69
> > > > tx_count 31808 174769
> > >
> > > Seems interrupt count does not affect the performance at all with -l
> > > 131072 -w 128k.
> > >
> >
> > Right.
> >
> > > > nr_napi_schedule 31805 174697
> > > > total_packets 92354 187408
> > > > total_reqs 1200793 2392614
> > > >
> > > >netperf Tput: 5.8Gb/s 10.5Gb/s
> > > > PPI 2.13 1.00
> > > > SPI 36.70 16.73
> > > > PPN 2.13 1.31
> > > > SPN 36.70 16.75
> > > > tx_count 57635 205599
> > > > nr_napi_schedule 57633 205311
> > > > total_packets 122800 270254
> > > > total_reqs 2115068 3439751
> > > >
> > > > PPI: packets processed per interrupt
> > > > SPI: slots processed per interrupt
> > > > PPN: packets processed per napi schedule
> > > > SPN: slots processed per napi schedule
> > > > tx_count: interrupt count
> > > > total_reqs: total slots used during test
> > > >
> > > >* Notification and batching
> > > >
> > > >Is notification and batching really a problem? I'm not so sure now. My
> > > >first thought when I didn't measure PPI / PPN / SPI / SPN in copying
> > > >case was that "in that case netback *must* have better batching" which
> > > >turned out not very true -- copying mode makes netback slower, however
> > > >the batching gained is not huge.
> > > >
> > > >Ideally we still want to batch as much as possible. Possible way
> > > >includes playing with the 'weight' parameter in NAPI. But as the figures
> > > >show batching seems not to be very important for throughput, at least
> > > >for now. If the NAPI framework and netfront / netback are doing their
> > > >jobs as designed we might not need to worry about this now.
> > > >
> > > >Andrew, do you have any thought on this? You found out that NAPI didn't
> > > >scale well with multi-threaded iperf in DomU, do you have any handle how
> > > >that can happen?
> > > >
> > > >* Thoughts on zero-copy TX
> > > >
> > > >With this hack we are able to achieve 10Gb/s single stream, which is
> > > >good. But, with classic XenoLinux kernel which has zero copy TX we
> > > >weren't able to achieve this. I also developed another zero copy netback
> > > >prototype one year ago with Ian's out-of-tree skb frag destructor patch
> > > >series. That prototype couldn't achieve 10Gb/s either (IIRC the
> > > >performance was more or less the same as copying mode, about 6~7Gb/s).
> > > >
> > > >My hack maps all necessary pages permanently, there is no unmap, we
> > > >skip lots of page table manipulation and TLB flushes. So my basic
> > > >conclusion is that page table manipulation and TLB flushes do incur
> > > >heavy performance penalty.
> > > >
> > > >This hack can be upstreamed in no way. If we're to re-introduce
> > > >zero-copy TX, we would need to implement some sort of lazy flushing
> > > >mechanism. I haven't thought this through. Presumably this mechanism
> > > >would also benefit blk somehow? I'm not sure yet.
> > > >
> > > >Could persistent mapping (with the to-be-developed reclaim / MRU list
> > > >mechanism) be useful here? So that we can unify blk and net drivers?
> > > >
> > > >* Changes required to introduce zero-copy TX
> > > >
> > > >1. SKB frag destructor series: to track life cycle of SKB frags. This is
> > > >not yet upstreamed.
> > >
> > > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
> > >
> > > <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html>
> > >
> >
> > Yes. But I believe there's been several versions posted. The link you
> > have is not the latest version.
> >
> > > >
> > > >2. Mechanism to negotiate max slots frontend can use: mapping requires
> > > >backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
> > > >
> > > >3. Lazy flushing mechanism or persistent grants: ???
> > >
> > > I did some test with persistent grants before, it did not show
> > > better performance than grant copy. But I was using the default
> > > params of netperf, and not tried large packet size. Your results
> > > reminds me that maybe persistent grants would get similar results
> > > with larger packet size too.
> > >
> >
> > "No better performance" -- that's because both mechanisms are copying?
> > However I presume persistent grant can scale better? From an earlier
> > email last week, I read that copying is done by the guest so that this
> > mechanism scales much better than hypervisor copying in blk's case.
>
> Yes, I always expected persistent grants to be faster than
> gnttab_copy but I was very surprised by the difference in performances:
>
> http://marc.info/?l=xen-devel&m=137234605929944
>
> I think it's worth trying persistent grants on PV network, although it's
> very unlikely that they are going to improve the throughput by 5 Gb/s.
>
I think it can improve aggregate throughput; however, it's not likely to
improve single-stream throughput.
> Also once we have both PV block and network using persistent grants,
> we might incur the grant table limit, see this email:
>
> http://marc.info/?l=xen-devel&m=137183474618974
Yes, indeed.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Interesting observation with network event notification and batching
2013-07-01 14:39 ` Wei Liu
@ 2013-07-01 14:54 ` Stefano Stabellini
0 siblings, 0 replies; 27+ messages in thread
From: Stefano Stabellini @ 2013-07-01 14:54 UTC (permalink / raw)
To: Wei Liu
Cc: ian.campbell, Stefano Stabellini, xen-devel, annie li, andrew.bennieston
On Mon, 1 Jul 2013, Wei Liu wrote:
> On Mon, Jul 01, 2013 at 03:29:45PM +0100, Stefano Stabellini wrote:
> > On Mon, 1 Jul 2013, Wei Liu wrote:
> > > On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
> > > >
> > > > On 2013-6-29 0:15, Wei Liu wrote:
> > > > >Hi all,
> > > > >
> > > > >After collecting more stats and comparing copying / mapping cases, I now
> > > > >have some more interesting finds, which might contradict what I said
> > > > >before.
> > > > >
> > > > >I tuned the runes I used for benchmark to make sure iperf and netperf
> > > > >generate large packets (~64K). Here are the runes I use:
> > > > >
> > > > > iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
> > > > > netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
> > > > >
> > > > > COPY MAP
> > > > >iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s)
> > > >
> > > > So with default iperf setting, copy is about 7.9G, and map is about
> > > > 2.5G? How about the result of netperf without large packets?
> > > >
> > >
> > > First question, yes.
> > >
> > > Second question, 5.8Gb/s. And I believe for the copying scheme without
> > > large packet the throughput is more or less the same.
> > >
> > > > > PPI 2.90 1.07
> > > > > SPI 37.75 13.69
> > > > > PPN 2.90 1.07
> > > > > SPN 37.75 13.69
> > > > > tx_count 31808 174769
> > > >
> > > > Seems interrupt count does not affect the performance at all with -l
> > > > 131072 -w 128k.
> > > >
> > >
> > > Right.
> > >
> > > > > nr_napi_schedule 31805 174697
> > > > > total_packets 92354 187408
> > > > > total_reqs 1200793 2392614
> > > > >
> > > > >netperf Tput: 5.8Gb/s 10.5Gb/s
> > > > > PPI 2.13 1.00
> > > > > SPI 36.70 16.73
> > > > > PPN 2.13 1.31
> > > > > SPN 36.70 16.75
> > > > > tx_count 57635 205599
> > > > > nr_napi_schedule 57633 205311
> > > > > total_packets 122800 270254
> > > > > total_reqs 2115068 3439751
> > > > >
> > > > > PPI: packets processed per interrupt
> > > > > SPI: slots processed per interrupt
> > > > > PPN: packets processed per napi schedule
> > > > > SPN: slots processed per napi schedule
> > > > > tx_count: interrupt count
> > > > > total_reqs: total slots used during test
> > > > >
> > > > >* Notification and batching
> > > > >
> > > > >Is notification and batching really a problem? I'm not so sure now. My
> > > > >first thought when I didn't measure PPI / PPN / SPI / SPN in copying
> > > > >case was that "in that case netback *must* have better batching" which
> > > > >turned out not very true -- copying mode makes netback slower, however
> > > > >the batching gained is not huge.
> > > > >
> > > > >Ideally we still want to batch as much as possible. Possible way
> > > > >includes playing with the 'weight' parameter in NAPI. But as the figures
> > > > >show batching seems not to be very important for throughput, at least
> > > > >for now. If the NAPI framework and netfront / netback are doing their
> > > > >jobs as designed we might not need to worry about this now.
> > > > >
> > > > >Andrew, do you have any thought on this? You found out that NAPI didn't
> > > > >scale well with multi-threaded iperf in DomU, do you have any handle how
> > > > >that can happen?
> > > > >
> > > > >* Thoughts on zero-copy TX
> > > > >
> > > > >With this hack we are able to achieve 10Gb/s single stream, which is
> > > > >good. But, with classic XenoLinux kernel which has zero copy TX we
> > > > >weren't able to achieve this. I also developed another zero copy netback
> > > > >prototype one year ago with Ian's out-of-tree skb frag destructor patch
> > > > >series. That prototype couldn't achieve 10Gb/s either (IIRC the
> > > > >performance was more or less the same as copying mode, about 6~7Gb/s).
> > > > >
> > > > >My hack maps all necessary pages permanently, there is no unmap, we
> > > > >skip lots of page table manipulation and TLB flushes. So my basic
> > > > >conclusion is that page table manipulation and TLB flushes do incur
> > > > >heavy performance penalty.
> > > > >
> > > > >This hack can be upstreamed in no way. If we're to re-introduce
> > > > >zero-copy TX, we would need to implement some sort of lazy flushing
> > > > >mechanism. I haven't thought this through. Presumably this mechanism
> > > > >would also benefit blk somehow? I'm not sure yet.
> > > > >
> > > > >Could persistent mapping (with the to-be-developed reclaim / MRU list
> > > > >mechanism) be useful here? So that we can unify blk and net drivers?
> > > > >
> > > > >* Changes required to introduce zero-copy TX
> > > > >
> > > > >1. SKB frag destructor series: to track life cycle of SKB frags. This is
> > > > >not yet upstreamed.
> > > >
> > > > Are you mentioning this one http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html?
> > > >
> > > > <http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html>
> > > >
> > >
> > > Yes. But I believe there's been several versions posted. The link you
> > > have is not the latest version.
> > >
> > > > >
> > > > >2. Mechanism to negotiate max slots frontend can use: mapping requires
> > > > >backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
> > > > >
> > > > >3. Lazy flushing mechanism or persistent grants: ???
> > > >
> > > > I did some test with persistent grants before, it did not show
> > > > better performance than grant copy. But I was using the default
> > > > params of netperf, and not tried large packet size. Your results
> > > > reminds me that maybe persistent grants would get similar results
> > > > with larger packet size too.
> > > >
> > >
> > > "No better performance" -- that's because both mechanisms are copying?
> > > However I presume persistent grant can scale better? From an earlier
> > > email last week, I read that copying is done by the guest so that this
> > > mechanism scales much better than hypervisor copying in blk's case.
> >
> > Yes, I always expected persistent grants to be faster than
> > gnttab_copy but I was very surprised by the difference in performances:
> >
> > http://marc.info/?l=xen-devel&m=137234605929944
> >
> > I think it's worth trying persistent grants on PV network, although it's
> > very unlikely that they are going to improve the throughput by 5 Gb/s.
> >
>
> I think it can improve aggregated throughput, however it's not likely to
> improve single stream throughput.
You are probably right.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Interesting observation with network event notification and batching
2013-07-01 8:54 ` Wei Liu
2013-07-01 14:29 ` Stefano Stabellini
@ 2013-07-01 15:59 ` annie li
2013-07-01 16:06 ` Wei Liu
1 sibling, 1 reply; 27+ messages in thread
From: annie li @ 2013-07-01 15:59 UTC (permalink / raw)
To: Wei Liu; +Cc: xen-devel, andrew.bennieston, ian.campbell, stefano.stabellini
On 2013-7-1 16:54, Wei Liu wrote:
> On Mon, Jul 01, 2013 at 03:48:38PM +0800, annie li wrote:
>> On 2013-6-29 0:15, Wei Liu wrote:
>>> Hi all,
>>>
>>> After collecting more stats and comparing copying / mapping cases, I now
>>> have some more interesting finds, which might contradict what I said
>>> before.
>>>
>>> I tuned the runes I used for benchmark to make sure iperf and netperf
>>> generate large packets (~64K). Here are the runes I use:
>>>
>>> iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
>>> netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
>>>
>>> COPY MAP
>>> iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s)
>> So with default iperf setting, copy is about 7.9G, and map is about
>> 2.5G? How about the result of netperf without large packets?
>>
> First question, yes.
>
> Second question, 5.8Gb/s. And I believe for the copying scheme without
> large packets the throughput is more or less the same.
>
>>> PPI 2.90 1.07
>>> SPI 37.75 13.69
>>> PPN 2.90 1.07
>>> SPN 37.75 13.69
>>> tx_count 31808 174769
>> Seems interrupt count does not affect the performance at all with -l
>> 131072 -w 128k.
>>
> Right.
>
>>> nr_napi_schedule 31805 174697
>>> total_packets 92354 187408
>>> total_reqs 1200793 2392614
>>>
>>> netperf Tput: 5.8Gb/s 10.5Gb/s
>>> PPI 2.13 1.00
>>> SPI 36.70 16.73
>>> PPN 2.13 1.31
>>> SPN 36.70 16.75
>>> tx_count 57635 205599
>>> nr_napi_schedule 57633 205311
>>> total_packets 122800 270254
>>> total_reqs 2115068 3439751
>>>
>>> PPI: packets processed per interrupt
>>> SPI: slots processed per interrupt
>>> PPN: packets processed per napi schedule
>>> SPN: slots processed per napi schedule
>>> tx_count: interrupt count
>>> total_reqs: total slots used during test
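The per-interrupt and per-NAPI figures in the table look like plain ratios of the raw counters. Below is a minimal Python sketch, assuming PPI = total_packets / tx_count and SPI = total_reqs / tx_count, with nr_napi_schedule as the divisor for PPN and SPN. These formulas are inferred from the legend rather than stated in the thread, and the function name `batching_stats` is made up for illustration.

```python
def batching_stats(tx_count, nr_napi_schedule, total_packets, total_reqs):
    """Derive per-event batching ratios from raw netback counters.

    Assumed definitions (inferred from the legend, not confirmed):
    PPI/SPI divide by the interrupt count, PPN/SPN by the NAPI schedule count.
    """
    return {
        "PPI": total_packets / tx_count,
        "SPI": total_reqs / tx_count,
        "PPN": total_packets / nr_napi_schedule,
        "SPN": total_reqs / nr_napi_schedule,
    }

# Raw counters copied from the iperf rows of the table.
iperf_copy = batching_stats(31808, 31805, 92354, 1200793)
iperf_map = batching_stats(174769, 174697, 187408, 2392614)

for name, stats in (("copy", iperf_copy), ("map", iperf_map)):
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Under that assumption the iperf rows agree with the table to within 0.01. The netperf MAP counters, however, give a PPI of about 1.31 (270254 / 205599) rather than the listed 1.00, so either that entry or the assumed formula may be off.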
>>>
>>> * Notification and batching
>>>
>>> Is notification and batching really a problem? I'm not so sure now. My
>>> first thought when I didn't measure PPI / PPN / SPI / SPN in copying
>>> case was that "in that case netback *must* have better batching" which
>>> turned out not very true -- copying mode makes netback slower, however
>>> the batching gained is not huge.
>>>
>>> Ideally we still want to batch as much as possible. Possible ways
>>> include playing with the 'weight' parameter in NAPI. But as the figures
>>> show batching seems not to be very important for throughput, at least
>>> for now. If the NAPI framework and netfront / netback are doing their
>>> jobs as designed we might not need to worry about this now.
>>>
>>> Andrew, do you have any thoughts on this? You found out that NAPI didn't
>>> scale well with multi-threaded iperf in DomU; do you have any handle on how
>>> that can happen?
>>>
>>> * Thoughts on zero-copy TX
>>>
>>> With this hack we are able to achieve 10Gb/s single stream, which is
>>> good. But with the classic XenoLinux kernel, which has zero-copy TX, we
>>> weren't able to achieve this. I also developed another zero-copy netback
>>> prototype one year ago with Ian's out-of-tree skb frag destructor patch
>>> series. That prototype couldn't achieve 10Gb/s either (IIRC the
>>> performance was more or less the same as copying mode, about 6~7Gb/s).
>>>
>>> My hack maps all necessary pages permanently, there is no unmap, we
>>> skip lots of page table manipulation and TLB flushes. So my basic
>>> conclusion is that page table manipulation and TLB flushes do incur
>>> heavy performance penalty.
>>>
>>> There is no way this hack can be upstreamed. If we're to re-introduce
>>> zero-copy TX, we would need to implement some sort of lazy flushing
>>> mechanism. I haven't thought this through. Presumably this mechanism
>>> would also benefit blk somehow? I'm not sure yet.
>>>
>>> Could persistent mapping (with the to-be-developed reclaim / MRU list
>>> mechanism) be useful here? So that we can unify blk and net drivers?
>>>
>>> * Changes required to introduce zero-copy TX
>>>
>>> 1. SKB frag destructor series: to track life cycle of SKB frags. This is
>>> not yet upstreamed.
>> Are you referring to this one? http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html
>>
> Yes. But I believe there's been several versions posted. The link you
> have is not the latest version.
>
>>> 2. Mechanism to negotiate max slots frontend can use: mapping requires
>>> backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
>>>
>>> 3. Lazy flushing mechanism or persistent grants: ???
>> I did some tests with persistent grants before; they did not show
>> better performance than grant copy. But I was using the default
>> netperf parameters and did not try large packet sizes. Your results
>> remind me that persistent grants might get similar results
>> with larger packet sizes too.
>>
> "No better performance" -- that's because both mechanisms are copying?
> However I presume persistent grant can scale better? From an earlier
> email last week, I read that copying is done by the guest so that this
> mechanism scales much better than hypervisor copying in blk's case.
The original persistent grant patch does a memcpy on both the netback and
netfront sides. I think performance might improve if the memcpy were
removed from netfront.
Moreover, I suspect that our persistent grant numbers reflect the default
netperf parameters, just like Wei's hack, which does not perform better
without large packets. So let me try some tests with large packets.
Thanks
Annie
* Re: Interesting observation with network event notification and batching
2013-07-01 14:19 ` Stefano Stabellini
@ 2013-07-01 15:59 ` annie li
0 siblings, 0 replies; 27+ messages in thread
From: annie li @ 2013-07-01 15:59 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: andrew.bennieston, Wei Liu, ian.campbell, xen-devel
On 2013-7-1 22:19, Stefano Stabellini wrote:
> Could you please use plain text emails in the future?
Sure, sorry about that.
Thanks
Annie
>
> On Mon, 1 Jul 2013, annie li wrote:
>> On 2013-6-29 0:15, Wei Liu wrote:
>>
>> Hi all,
>>
>> After collecting more stats and comparing copying / mapping cases, I now
>> have some more interesting finds, which might contradict what I said
>> before.
>>
>> I tuned the runes I used for benchmark to make sure iperf and netperf
>> generate large packets (~64K). Here are the runes I use:
>>
>> iperf -c 10.80.237.127 -t 5 -l 131072 -w 128k (see note)
>> netperf -H 10.80.237.127 -l10 -f m -- -s 131072 -S 131072
>>
>> COPY MAP
>> iperf Tput: 6.5Gb/s 14Gb/s (was 2.5Gb/s)
>>
>>
>> So with default iperf setting, copy is about 7.9G, and map is about 2.5G? How about the result of netperf without large packets?
>>
>> PPI 2.90 1.07
>> SPI 37.75 13.69
>> PPN 2.90 1.07
>> SPN 37.75 13.69
>> tx_count 31808 174769
>>
>>
>> Seems interrupt count does not affect the performance at all with -l 131072 -w 128k.
>>
>> nr_napi_schedule 31805 174697
>> total_packets 92354 187408
>> total_reqs 1200793 2392614
>>
>> netperf Tput: 5.8Gb/s 10.5Gb/s
>> PPI 2.13 1.00
>> SPI 36.70 16.73
>> PPN 2.13 1.31
>> SPN 36.70 16.75
>> tx_count 57635 205599
>> nr_napi_schedule 57633 205311
>> total_packets 122800 270254
>> total_reqs 2115068 3439751
>>
>> PPI: packets processed per interrupt
>> SPI: slots processed per interrupt
>> PPN: packets processed per napi schedule
>> SPN: slots processed per napi schedule
>> tx_count: interrupt count
>> total_reqs: total slots used during test
>>
>> * Notification and batching
>>
>> Is notification and batching really a problem? I'm not so sure now. My
>> first thought when I didn't measure PPI / PPN / SPI / SPN in copying
>> case was that "in that case netback *must* have better batching" which
>> turned out not very true -- copying mode makes netback slower, however
>> the batching gained is not huge.
>>
>> Ideally we still want to batch as much as possible. Possible ways
>> include playing with the 'weight' parameter in NAPI. But as the figures
>> show batching seems not to be very important for throughput, at least
>> for now. If the NAPI framework and netfront / netback are doing their
>> jobs as designed we might not need to worry about this now.
>>
>> Andrew, do you have any thoughts on this? You found out that NAPI didn't
>> scale well with multi-threaded iperf in DomU; do you have any handle on how
>> that can happen?
>>
>> * Thoughts on zero-copy TX
>>
>> With this hack we are able to achieve 10Gb/s single stream, which is
>> good. But with the classic XenoLinux kernel, which has zero-copy TX, we
>> weren't able to achieve this. I also developed another zero-copy netback
>> prototype one year ago with Ian's out-of-tree skb frag destructor patch
>> series. That prototype couldn't achieve 10Gb/s either (IIRC the
>> performance was more or less the same as copying mode, about 6~7Gb/s).
>>
>> My hack maps all necessary pages permanently, there is no unmap, we
>> skip lots of page table manipulation and TLB flushes. So my basic
>> conclusion is that page table manipulation and TLB flushes do incur
>> heavy performance penalty.
>>
>> There is no way this hack can be upstreamed. If we're to re-introduce
>> zero-copy TX, we would need to implement some sort of lazy flushing
>> mechanism. I haven't thought this through. Presumably this mechanism
>> would also benefit blk somehow? I'm not sure yet.
>>
>> Could persistent mapping (with the to-be-developed reclaim / MRU list
>> mechanism) be useful here? So that we can unify blk and net drivers?
>>
>> * Changes required to introduce zero-copy TX
>>
>> 1. SKB frag destructor series: to track life cycle of SKB frags. This is
>> not yet upstreamed.
>>
>>
>> Are you referring to this one? http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html
>>
>>
>> 2. Mechanism to negotiate max slots frontend can use: mapping requires
>> backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
>>
>> 3. Lazy flushing mechanism or persistent grants: ???
>>
>>
>> I did some tests with persistent grants before; they did not show better performance than grant copy. But I was using the default
>> netperf parameters and did not try large packet sizes. Your results remind me that persistent grants might get similar
>> results with larger packet sizes too.
>>
>> Thanks
>> Annie
>>
>>
>>
* Re: Interesting observation with network event notification and batching
2013-07-01 15:59 ` annie li
@ 2013-07-01 16:06 ` Wei Liu
2013-07-01 16:53 ` Andrew Bennieston
0 siblings, 1 reply; 27+ messages in thread
From: Wei Liu @ 2013-07-01 16:06 UTC (permalink / raw)
To: annie li
Cc: andrew.bennieston, xen-devel, Wei Liu, ian.campbell, stefano.stabellini
On Mon, Jul 01, 2013 at 11:59:08PM +0800, annie li wrote:
[...]
> >>>1. SKB frag destructor series: to track life cycle of SKB frags. This is
> >>>not yet upstreamed.
> >>Are you referring to this one? http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html
> >>
> >Yes. But I believe there's been several versions posted. The link you
> >have is not the latest version.
> >
> >>>2. Mechanism to negotiate max slots frontend can use: mapping requires
> >>>backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
> >>>
> >>>3. Lazy flushing mechanism or persistent grants: ???
> >>I did some test with persistent grants before, it did not show
> >>better performance than grant copy. But I was using the default
> >>params of netperf, and not tried large packet size. Your results
> >>reminds me that maybe persistent grants would get similar results
> >>with larger packet size too.
> >>
> >"No better performance" -- that's because both mechanisms are copying?
> >However I presume persistent grant can scale better? From an earlier
> >email last week, I read that copying is done by the guest so that this
> >mechanism scales much better than hypervisor copying in blk's case.
>
> The original persistent patch does memcpy in both netback and
> netfront side. I am thinking maybe the performance can become better
> if removing the memcpy from netfront.
I would say that removing copy in netback can scale better.
> Moreover, I also have a feeling that we got persistent grant
> performance based on default netperf params test, just like wei's
> hack which does not get better performance without large packets. So
> let me try some test with large packets though.
>
Sadly enough, I found out today that these sorts of tests seem to be quite
inconsistent. On an Intel 10G NIC the throughput is actually higher
without forcing iperf / netperf to generate large packets.
Wei.
> Thanks
> Annie
* Re: Interesting observation with network event notification and batching
2013-07-01 16:06 ` Wei Liu
@ 2013-07-01 16:53 ` Andrew Bennieston
2013-07-01 17:55 ` Wei Liu
2013-07-03 15:18 ` Wei Liu
0 siblings, 2 replies; 27+ messages in thread
From: Andrew Bennieston @ 2013-07-01 16:53 UTC (permalink / raw)
To: Wei Liu; +Cc: annie li, xen-devel, ian.campbell, stefano.stabellini
On 01/07/13 17:06, Wei Liu wrote:
> On Mon, Jul 01, 2013 at 11:59:08PM +0800, annie li wrote:
> [...]
>>>>> 1. SKB frag destructor series: to track life cycle of SKB frags. This is
>>>>> not yet upstreamed.
>>>> Are you referring to this one? http://old-list-archives.xen.org/archives/html/xen-devel/2011-06/msg01711.html
>>>>
>>> Yes. But I believe there's been several versions posted. The link you
>>> have is not the latest version.
>>>
>>>>> 2. Mechanism to negotiate max slots frontend can use: mapping requires
>>>>> backend's MAX_SKB_FRAGS >= frontend's MAX_SKB_FRAGS.
>>>>>
>>>>> 3. Lazy flushing mechanism or persistent grants: ???
>>>> I did some test with persistent grants before, it did not show
>>>> better performance than grant copy. But I was using the default
>>>> params of netperf, and not tried large packet size. Your results
>>>> reminds me that maybe persistent grants would get similar results
>>>> with larger packet size too.
>>>>
>>> "No better performance" -- that's because both mechanisms are copying?
>>> However I presume persistent grant can scale better? From an earlier
>>> email last week, I read that copying is done by the guest so that this
>>> mechanism scales much better than hypervisor copying in blk's case.
>>
>> The original persistent patch does memcpy in both netback and
>> netfront side. I am thinking maybe the performance can become better
>> if removing the memcpy from netfront.
>
> I would say that removing copy in netback can scale better.
>
>> Moreover, I also have a feeling that we got persistent grant
>> performance based on default netperf params test, just like wei's
>> hack which does not get better performance without large packets. So
>> let me try some test with large packets though.
>>
>
> Sadly enough, I found out today that these sorts of tests seem to be quite
> inconsistent. On an Intel 10G NIC the throughput is actually higher
> without forcing iperf / netperf to generate large packets.
When I have made performance measurements using iperf, I found that for
a given point in the parameter space (e.g. for a fixed number of guests,
interfaces, fixed parameters to iperf, fixed test run duration, etc.)
the variation was typically _smaller than_ +/- 1 Gbit/s on a 10G NIC.

I notice that your results don't include any error bars or indication of
standard deviation...

With this sort of data (or, really, any data) measuring at least 5 times
will help to get an idea of the fluctuations present (i.e. a measure of
statistical uncertainty) by quoting a mean +/- standard deviation.
Having the standard deviation (or other estimator for the uncertainty in
the results) allows us to better determine how significant this
difference in results really is.

For example, is the high throughput you quoted (~ 14 Gbit/s) an upward
fluctuation, and the low value (~6) a downward fluctuation? Having a
mean and standard deviation would allow us to determine just how
(in)compatible these values are.

Assuming a Gaussian distribution (and when sampled sufficient times,
"everything" tends to a Gaussian) you have an almost 5% chance that a
result lies more than 2 standard deviations from the mean (and a 0.3%
chance that it lies more than 3 s.d. from the mean!). Results that
appear "high" or "low" may, therefore, not be entirely unexpected.
Having a measure of the standard deviation provides some basis against
which to determine how likely it is that a measured value is just
statistical fluctuation, or whether it is a significant result.
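
The tail probabilities quoted above can be checked against the standard normal distribution. Below is a minimal Python sketch, assuming the measurements really are Gaussian; the helper name `two_sided_tail` is illustrative only.

```python
import math

def two_sided_tail(k):
    """P(|X - mean| > k * s.d.) for a Gaussian-distributed X."""
    # For a standard normal Z, P(|Z| > k) = erfc(k / sqrt(2)).
    return math.erfc(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"beyond {k} s.d.: {100 * two_sided_tail(k):.2f}%")
```

This gives roughly 4.6% beyond 2 s.d. and 0.27% beyond 3 s.d., consistent with the "almost 5%" and "0.3%" figures.
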
Another thing I noticed is that you're running the iperf test for only 5
seconds. I have found in the past that iperf (or, more likely, TCP)
takes a while to "ramp up" (even with all parameters fixed e.g. "-l
<size> -w <size>") and that tests run for 2 minutes or more (e.g. "-t
120") give much more stable results.
Andrew.
>
>
> Wei.
>
>> Thanks
>> Annie
* Re: Interesting observation with network event notification and batching
2013-07-01 16:53 ` Andrew Bennieston
@ 2013-07-01 17:55 ` Wei Liu
2013-07-03 15:18 ` Wei Liu
1 sibling, 0 replies; 27+ messages in thread
From: Wei Liu @ 2013-07-01 17:55 UTC (permalink / raw)
To: Andrew Bennieston
Cc: annie li, xen-devel, Wei Liu, ian.campbell, stefano.stabellini
On Mon, Jul 01, 2013 at 05:53:27PM +0100, Andrew Bennieston wrote:
[...]
> >
> >Sadly enough, I found out today that these sorts of tests seem to be quite
> >inconsistent. On an Intel 10G NIC the throughput is actually higher
> >without forcing iperf / netperf to generate large packets.
>
> When I have made performance measurements using iperf, I found that
> for a given point in the parameter space (e.g. for a fixed number of
> guests, interfaces, fixed parameters to iperf, fixed test run
> duration, etc.) the variation was typically _smaller than_ +/- 1
> Gbit/s on a 10G NIC.
>
I was talking about a virtual interface vs. real hardware. The parameters
that maximize throughput in one case don't seem to work in the
other case. The deviation for a specific interface is rather small.
> I notice that your results don't include any error bars or
> indication of standard deviation...
>
> With this sort of data (or, really, any data) measuring at least 5
> times will help to get an idea of the fluctuations present (i.e. a
> measure of statistical uncertainty) by quoting a mean +/- standard
> deviation. Having the standard deviation (or other estimator for the
> uncertainty in the results) allows us to better determine how
> significant this difference in results really is.
>
> For example, is the high throughput you quoted (~ 14 Gbit/s) an
> upward fluctuation, and the low value (~6) a downward fluctuation?
> Having a mean and standard deviation would allow us to determine
> just how (in)compatible these values are.
>
I ran those tests several times and picked the number that appeared
most often. Anyway, I will try to come up with better-visualized graphs.
> Assuming a Gaussian distribution (and when sampled sufficient times,
> "everything" tends to a Gaussian) you have an almost 5% chance that
> a result lies more than 2 standard deviations from the mean (and a
> 0.3% chance that it lies more than 3 s.d. from the mean!). Results
> that appear "high" or "low" may, therefore, not be entirely
> unexpected. Having a measure of the standard deviation provides some
> basis against which to determine how likely it is that a measured
> value is just statistical fluctuation, or whether it is a
> significant result.
>
> Another thing I noticed is that you're running the iperf test for
> only 5 seconds. I have found in the past that iperf (or, more
> likely, TCP) takes a while to "ramp up" (even with all parameters
> fixed e.g. "-l <size> -w <size>") and that tests run for 2 minutes
> or more (e.g. "-t 120") give much more stable results.
>
Hmm... for me the length of the test doesn't make much difference,
which is why I chose such a short time. Since you mention it, I intend
to run the tests a bit longer.
> Andrew.
>
> >
> >
> >Wei.
> >
> >>Thanks
> >>Annie
* Re: Interesting observation with network event notification and batching
2013-07-01 16:53 ` Andrew Bennieston
2013-07-01 17:55 ` Wei Liu
@ 2013-07-03 15:18 ` Wei Liu
1 sibling, 0 replies; 27+ messages in thread
From: Wei Liu @ 2013-07-03 15:18 UTC (permalink / raw)
To: Andrew Bennieston
Cc: annie li, xen-devel, Wei Liu, ian.campbell, stefano.stabellini
On Mon, Jul 01, 2013 at 05:53:27PM +0100, Andrew Bennieston wrote:
[...]
> >I would say that removing copy in netback can scale better.
> >
> >>Moreover, I also have a feeling that we got persistent grant
> >>performance based on default netperf params test, just like wei's
> >>hack which does not get better performance without large packets. So
> >>let me try some test with large packets though.
> >>
> >
> >Sadly enough, I found out today that these sorts of tests seem to be quite
> >inconsistent. On an Intel 10G NIC the throughput is actually higher
> >without forcing iperf / netperf to generate large packets.
>
> When I have made performance measurements using iperf, I found that
> for a given point in the parameter space (e.g. for a fixed number of
> guests, interfaces, fixed parameters to iperf, fixed test run
> duration, etc.) the variation was typically _smaller than_ +/- 1
> Gbit/s on a 10G NIC.
>
> I notice that your results don't include any error bars or
> indication of standard deviation...
>
> With this sort of data (or, really, any data) measuring at least 5
> times will help to get an idea of the fluctuations present (i.e. a
> measure of statistical uncertainty) by quoting a mean +/- standard
> deviation. Having the standard deviation (or other estimator for the
> uncertainty in the results) allows us to better determine how
> significant this difference in results really is.
>
> For example, is the high throughput you quoted (~ 14 Gbit/s) an
> upward fluctuation, and the low value (~6) a downward fluctuation?
> Having a mean and standard deviation would allow us to determine
> just how (in)compatible these values are.
>
> Assuming a Gaussian distribution (and when sampled sufficient times,
> "everything" tends to a Gaussian) you have an almost 5% chance that
> a result lies more than 2 standard deviations from the mean (and a
> 0.3% chance that it lies more than 3 s.d. from the mean!). Results
> that appear "high" or "low" may, therefore, not be entirely
> unexpected. Having a measure of the standard deviation provides some
> basis against which to determine how likely it is that a measured
> value is just statistical fluctuation, or whether it is a
> significant result.
>
> Another thing I noticed is that you're running the iperf test for
> only 5 seconds. I have found in the past that iperf (or, more
> likely, TCP) takes a while to "ramp up" (even with all parameters
> fixed e.g. "-l <size> -w <size>") and that tests run for 2 minutes
> or more (e.g. "-t 120") give much more stable results.
>
> Andrew.
>
Here you go: results for the newly conducted benchmarks. I was about to
draw graphs but found it not really worth it because it's only a single stream.
For iperf tests the unit is Gb/s; for netperf tests the unit is Mb/s.
COPY SCHEME

iperf -c 10.80.237.127 -t 120
  6.19 6.23 6.26 6.25 6.27
  mean 6.24 s.d. 0.031622776601759

iperf -c 10.80.237.127 -t 120 -l 131072
  6.07 6.07 6.03 6.06 6.06
  mean 6.058 s.d. 0.016431676725514

netperf -H 10.80.237.127 -l120 -f m
  5662.55 5636.6 5641.52 5631.39 5630.98
  mean 5640.608 s.d. 13.0001642297036

netperf -H 10.80.237.127 -l120 -f m -- -s 131072 -S 131072
  5831.19 5833.03 5829.54 5838.89 5830.5
  mean 5832.63 s.d. 3.72512415992628

PERMANENT MAP SCHEME

iperf -c 10.80.237.127 -t 120
  2.42 2.41 2.41 2.42 2.43
  mean 2.418 s.d. 0.00836660026531

iperf -c 10.80.237.127 -t 120 -l 131072
  14.3 14.2 14.2 14.4 14.3
  mean 14.28 s.d. 0.083666002653234

netperf -H 10.80.237.127 -l120 -f m
  4632.27 4630.08 4633.18 4641.25 4632.23
  mean 4633.802 s.d. 4.31656924013371

netperf -H 10.80.237.127 -l120 -f m -- -s 131072 -S 131072
  10556.04 10532.89 10541.83 10552.77 10546.77
  mean 10546.06 s.d. 9.17156475133789

A short run of iperf / netperf was conducted before each test run so that
the system was "warmed up".
The results show that the single-stream performance is quite stable.
Also, there's not much difference between running tests for 5s or 120s.
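
The means and standard deviations above can be reproduced from the five listed runs. Below is a minimal Python sketch, assuming the s.d. is the sample standard deviation (n - 1 divisor), which is what the listed values match. The `separation` helper is an illustrative measure (difference of means in units of the combined s.d.), not something used in the thread.

```python
import math
import statistics

# iperf runs with "-t 120 -l 131072", in Gb/s, copied from the results above.
copy_runs = [6.07, 6.07, 6.03, 6.06, 6.06]
map_runs = [14.3, 14.2, 14.2, 14.4, 14.3]

def summarize(runs):
    """Return (mean, sample standard deviation) of a list of runs."""
    return statistics.mean(runs), statistics.stdev(runs)

def separation(a, b):
    """Difference of means divided by the quadrature sum of the two s.d."""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    return abs(mean_a - mean_b) / math.hypot(sd_a, sd_b)

for name, runs in (("copy", copy_runs), ("map", map_runs)):
    mean, sd = summarize(runs)
    print(f"{name}: {mean:.3f} +/- {sd:.3f} Gb/s")
print(f"separation: {separation(copy_runs, map_runs):.0f} combined s.d.")
```

With these numbers the two schemes sit far more than 3 s.d. apart, so the gap between them is very unlikely to be a statistical fluctuation.
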
Wei.
> >
> >
> >Wei.
> >
> >>Thanks
> >>Annie
end of thread, other threads:[~2013-07-03 15:18 UTC | newest]
Thread overview: 27+ messages
2013-06-12 10:14 Interesting observation with network event notification and batching Wei Liu
2013-06-14 18:53 ` Konrad Rzeszutek Wilk
2013-06-16 9:54 ` Wei Liu
2013-06-17 9:38 ` Ian Campbell
2013-06-17 9:56 ` Andrew Bennieston
2013-06-17 10:46 ` Wei Liu
2013-06-17 10:56 ` Andrew Bennieston
2013-06-17 11:08 ` Ian Campbell
2013-06-17 11:55 ` Andrew Bennieston
2013-06-17 10:06 ` Jan Beulich
2013-06-17 10:16 ` Ian Campbell
2013-06-17 10:35 ` Wei Liu
2013-06-17 11:34 ` annie li
2013-06-16 12:46 ` Wei Liu
2013-06-28 16:15 ` Wei Liu
2013-07-01 7:48 ` annie li
2013-07-01 8:54 ` Wei Liu
2013-07-01 14:29 ` Stefano Stabellini
2013-07-01 14:39 ` Wei Liu
2013-07-01 14:54 ` Stefano Stabellini
2013-07-01 15:59 ` annie li
2013-07-01 16:06 ` Wei Liu
2013-07-01 16:53 ` Andrew Bennieston
2013-07-01 17:55 ` Wei Liu
2013-07-03 15:18 ` Wei Liu
2013-07-01 14:19 ` Stefano Stabellini
2013-07-01 15:59 ` annie li