* Question about high CPU load during iperf ethernet testing
@ 2014-09-22 13:01 Iurii Konovalenko
  2014-09-23 16:41 ` Stefano Stabellini
  2014-09-23 16:48 ` Ian Campbell
  0 siblings, 2 replies; 10+ messages in thread
From: Iurii Konovalenko @ 2014-09-22 13:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Julien Grall, Ian Campbell, Stefano Stabellini


Hello, all!

I am running iperf ethernet tests on DRA7XX_EVM board (OMAP5).
Xen version is 4.4.
I run only Linux (kernel 3.8) as Dom0, with no other active domains (for
clean test results I decided not to start a DomU).
The iperf server is started on the host; the iperf client is started on the
board with the command line "iperf -c 192.168.2.10 -w 256k -m -f M -d -t 60".

During the test I studied the CPU load with the top tool in Dom0 and saw
that one VCPU is fully loaded, spending about 50% of its time in software
IRQs and 50% in system.
Running the same test on plain Linux without Xen, I saw a CPU load of only
about 2-4%.

I decided to debug a bit, so I used "({register uint64_t _r; asm
volatile("mrrc " "p15, 0, %0, %H0, c14" ";" : "=r" (_r)); _r; })" to read
the timer counter before and after the operations I wanted to test.
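
Roughly, the measurement pattern looks like this (a simplified sketch of
what I did; the helper name is only for illustration):

static inline uint64_t read_counter(void)
{
    register uint64_t _r;
    /* CNTPCT: 64-bit generic timer counter, read via mrrc p15, 0, ..., c14 */
    asm volatile("mrrc p15, 0, %0, %H0, c14" : "=r" (_r));
    return _r;
}

uint64_t t1, t2;
t1 = read_counter();
/* operation under test, e.g. spin_unlock_irqrestore(&lock, flags) */
t2 = read_counter();
/* t2 - t1 is the elapsed time in timer ticks */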

In this way I found that most of the CPU time is spent in the functions
enable_irq/disable_irq_nosync and
spin_lock_irqsave/spin_unlock_irqrestore (mostly in "mrs    %0, cpsr
@ arch_local_irq_save" / "msr    cpsr_c, %0    @ local_irq_restore"). When
running without Xen this does not take so much time.

So, could anyone help me with a couple of questions:
1. Is this normal behaviour?
2. Does the hypervisor trap the cpsr register? I suppose that the
hypervisor traps access to the cpsr register, which leads to additional
overhead, but I can't find the place in the sources where that happens.

Thank you in advance.
Best regards.

Iurii Konovalenko | Senior Software Engineer
GlobalLogic
P +3.8044.492.9695 M +38.099.932.2909
S yufuntik
www.globallogic.com
http://www.globallogic.com/email_disclaimer.txt


* Re: Question about high CPU load during iperf ethernet testing
  2014-09-22 13:01 Question about high CPU load during iperf ethernet testing Iurii Konovalenko
@ 2014-09-23 16:41 ` Stefano Stabellini
  2014-09-24 16:37   ` Iurii Konovalenko
  2014-09-23 16:48 ` Ian Campbell
  1 sibling, 1 reply; 10+ messages in thread
From: Stefano Stabellini @ 2014-09-23 16:41 UTC (permalink / raw)
  To: Iurii Konovalenko
  Cc: Stefano Stabellini, Ian Campbell, Julien Grall, xen-devel

On Mon, 22 Sep 2014, Iurii Konovalenko wrote:
> Hello, all!
> 
> I am running iperf ethernet tests on DRA7XX_EVM board (OMAP5).
> Xen version is 4.4.
> I run only Linux (kernel 3.8) as Dom0, no other active domains (For clear tests results I decided not to start DomU).
> iperf server is started on host, iperf client is started on board with command line "iperf -c 192.168.2.10 -w 256k -m
> -f M -d -t 60".

Just to double check: you are running the iperf test in Dom0, correct?


> During test I studied CPU load with top tool on Dom0, and saw, that one VCPU is totally loaded, spending about 50% in
> software IRQs, and 50% in system.
> Running the same test on clear Linux without Xen, I saw that CPU load is about 2-4%.
> 
> I decided to debug a bit, so I used "({register uint64_t _r; asm volatile("mrrc " "p15, 0, %0, %H0, c14" ";" : "=r"
> (_r)); _r; })" command to read timer counter before and after operations I want to test.
> 
> In such way I've found, that most time of CPU is spent in functions enable_irq/disable_irq_nosync and
> spin_lock_irqsave/spin_unlock_irqrestore (mostly in "mrs    %0, cpsr    @ arch_local_irq_save"/"msr    cpsr_c, %0    @
> local_irq_restore"). When running without Xen it should not take so much time.

There is nothing Xen specific in the Linux ARM implementation of
spin_lock_irqsave/spin_unlock_irqrestore and
enable_irq/disable_irq_nosync.


> So, could anyone explain me some questions:
> 1. Is it normal behaviour?

No, it is not normal.
Assuming that you assign all the memory to Dom0 and as many vcpus as
physical cpus on your platform, then you should get the same numbers as
native.
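(For reference, Dom0 memory and vcpus are controlled from the Xen command
line; something like "dom0_mem=2G dom0_max_vcpus=2" in the hypervisor boot
arguments, with the values adjusted to the RAM size and physical CPU count
of the board.)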


> 2. Does hypervisor trap cpsr register? I suppose, that hypervisor trap access to cpsr register, that leads to
> additional overhead, but I can't find place in sources where it happens.

We don't trap cpsr.


> Thank you in advance.
> Best regards.
> 
> Iurii Konovalenko | Senior Software Engineer
> GlobalLogic
> P +3.8044.492.9695 M +38.099.932.2909  
> S yufuntik
> www.globallogic.com
> http://www.globallogic.com/email_disclaimer.txt
> 
> 


* Re: Question about high CPU load during iperf ethernet testing
  2014-09-22 13:01 Question about high CPU load during iperf ethernet testing Iurii Konovalenko
  2014-09-23 16:41 ` Stefano Stabellini
@ 2014-09-23 16:48 ` Ian Campbell
  2014-09-24 17:01   ` Iurii Konovalenko
  1 sibling, 1 reply; 10+ messages in thread
From: Ian Campbell @ 2014-09-23 16:48 UTC (permalink / raw)
  To: Iurii Konovalenko; +Cc: Stefano Stabellini, Julien Grall, xen-devel

On Mon, 2014-09-22 at 16:01 +0300, Iurii Konovalenko wrote:

> 
> I decided to debug a bit, so I used "({register uint64_t _r; asm
> volatile("mrrc " "p15, 0, %0, %H0, c14" ";" : "=r" (_r)); _r; })"
> command to read timer counter before and after operations I want to
> test.

I think that is CNTPCT aka the physical timer. This is trapped under
Xen. If you want an untrapped source of time you should use CNTVCT which
is p15,0,c14.

I expect the results are unreliable due to this.

Also watch out that the frequency of those two timers is usually far below
the CPU clock frequency, so one tick of them can potentially represent
quite a few clock cycles/instructions, IIRC.

Ian.


* Re: Question about high CPU load during iperf ethernet testing
  2014-09-23 16:41 ` Stefano Stabellini
@ 2014-09-24 16:37   ` Iurii Konovalenko
  2014-09-24 17:57     ` Stefano Stabellini
  0 siblings, 1 reply; 10+ messages in thread
From: Iurii Konovalenko @ 2014-09-24 16:37 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Julien Grall, Ian Campbell, xen-devel

Hi, Stefano!
Thank you for your reply!

On Tue, Sep 23, 2014 at 7:41 PM, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> On Mon, 22 Sep 2014, Iurii Konovalenko wrote:
>> Hello, all!
>>
>> I am running iperf ethernet tests on DRA7XX_EVM board (OMAP5).
>> Xen version is 4.4.
>> I run only Linux (kernel 3.8) as Dom0, no other active domains (For clear tests results I decided not to start DomU).
>> iperf server is started on host, iperf client is started on board with command line "iperf -c 192.168.2.10 -w 256k -m
>> -f M -d -t 60".
>
> Just to double check: you are running the iperf test in Dom0, correct?

Yes, iperf is running in Dom0.

>> During test I studied CPU load with top tool on Dom0, and saw, that one VCPU is totally loaded, spending about 50% in
>> software IRQs, and 50% in system.
>> Running the same test on clear Linux without Xen, I saw that CPU load is about 2-4%.
>>
>> I decided to debug a bit, so I used "({register uint64_t _r; asm volatile("mrrc " "p15, 0, %0, %H0, c14" ";" : "=r"
>> (_r)); _r; })" command to read timer counter before and after operations I want to test.
>>
>> In such way I've found, that most time of CPU is spent in functions enable_irq/disable_irq_nosync and
>> spin_lock_irqsave/spin_unlock_irqrestore (mostly in "mrs    %0, cpsr    @ arch_local_irq_save"/"msr    cpsr_c, %0    @
>> local_irq_restore"). When running without Xen it should not take so much time.
>
> There is nothing Xen specific in the Linux ARM implementation of
> spin_lock_irqsave/spin_unlock_irqrestore and
> enable_irq/disable_irq_nosync.
>

That is strange, because my measurements show that a lot of time is spent
there; for example about 20% in spin_unlock_irqrestore (mostly in the mrs
instruction) when running in Dom0.

>
>> So, could anyone explain me some questions:
>> 1. Is it normal behaviour?
>
> No, it is not normal.
> Assuming that you assign all the memory to Dom0 and as many vcpu as
> physical cpus on your platform then you should get the same numbers as
> native.

OK, so I might be doing something wrong.

>> 2. Does hypervisor trap cpsr register? I suppose, that hypervisor trap access to cpsr register, that leads to
>> additional overhead, but I can't find place in sources where it happens.
>
> We don't trap cpsr.

That is strange, because this was my only assumption about where the time
could be spent.
So could you please advise where to look to understand the reason for such
a high VCPU load?

Best regards.

Iurii Konovalenko | Senior Software Engineer
GlobalLogic


* Re: Question about high CPU load during iperf ethernet testing
  2014-09-23 16:48 ` Ian Campbell
@ 2014-09-24 17:01   ` Iurii Konovalenko
  2014-09-24 17:51     ` Stefano Stabellini
  0 siblings, 1 reply; 10+ messages in thread
From: Iurii Konovalenko @ 2014-09-24 17:01 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Stefano Stabellini, Julien Grall, xen-devel

Hi, Ian!

On Tue, Sep 23, 2014 at 7:48 PM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Mon, 2014-09-22 at 16:01 +0300, Iurii Konovalenko wrote:
>
>>
>> I decided to debug a bit, so I used "({register uint64_t _r; asm
>> volatile("mrrc " "p15, 0, %0, %H0, c14" ";" : "=r" (_r)); _r; })"
>> command to read timer counter before and after operations I want to
>> test.
>
> I think that is CNTPCT aka the physical timer. This is trapped under
> Xen. If you want an untrapped source of time you should use CNTVCT which
> is p15,0,c14.
>
> I expect the results are unreliable due to this.

Thanks a lot for the advice.
The ARM docs say CNTPCT is p15,0,c14 and CNTVCT is p15,1,c14. I have now
tried both, but the results are almost equal.
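
For reference, the CNTVCT variant I tried is the same expression as before
with only the opc1 field changed from 0 to 1:

({ register uint64_t _r;
   asm volatile("mrrc p15, 1, %0, %H0, c14" : "=r" (_r));  /* CNTVCT */
   _r; })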

> Also watch out that the granularity of those two timers is usually far
> below the CPU clock frequency, so one tick of them can potentially
> represent quite a few clock cycles/instructions IIRC.

I do not need a very accurate value, just to understand where the time is
spent.
This part of code is called many, many times and I accumulate all the
individual measurements, so the approximation seems good enough for me.

I made some experiments, like:
- measuring the time of the whole ethernet poll function
- breaking this function into several parts, measuring the time of those
parts, then summing them.
The totals were almost the same, so I decided that in this way I can
understand which parts spend more time and which spend less.

Please, could you advise a method to accurately measure time in Xen?

> Ian.
>

Best regards.

Iurii Konovalenko | Senior Software Engineer
GlobalLogic


* Re: Question about high CPU load during iperf ethernet testing
  2014-09-24 17:01   ` Iurii Konovalenko
@ 2014-09-24 17:51     ` Stefano Stabellini
  0 siblings, 0 replies; 10+ messages in thread
From: Stefano Stabellini @ 2014-09-24 17:51 UTC (permalink / raw)
  To: Iurii Konovalenko
  Cc: Stefano Stabellini, Ian Campbell, Julien Grall, xen-devel

On Wed, 24 Sep 2014, Iurii Konovalenko wrote:
> Hi, Ian!
> 
> On Tue, Sep 23, 2014 at 7:48 PM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Mon, 2014-09-22 at 16:01 +0300, Iurii Konovalenko wrote:
> >
> >>
> >> I decided to debug a bit, so I used "({register uint64_t _r; asm
> >> volatile("mrrc " "p15, 0, %0, %H0, c14" ";" : "=r" (_r)); _r; })"
> >> command to read timer counter before and after operations I want to
> >> test.
> >
> > I think that is CNTPCT aka the physical timer. This is trapped under
> > Xen. If you want an untrapped source of time you should use CNTVCT which
> > is p15,0,c14.
> >
> > I expect the results are unreliable due to this.
> 
> Thanks a lot for advice.
> Arm docs say CNTPCT is p15,0,c14, CNTVCT is p15,1,c14. Now I tried both,
> but results are almost equal.
> 
> > Also watch out that the granularity of those two timers is usually far
> > below the CPU clock frequency, so one tick of them can potentially
> > represent quite a few clock cycles/instructions IIRC.
> 
> I do not need to get very accurate  value, just to understand place,
> where time is spent.
> These part of code is called many-many times, I accumulate all single
> times, so approximately it seams good for me.
> 
> I made some experiments, like:
> - evaluated time of whole ethernet poll function
> - broke this function into several parts and measured time of that
> parts, then summed.
> Values were almost the same value, so I decided, that in such way I
> can understand which part spends more time, and which spends less.
> 
> Please, could you advice method to accurate measure time in Xen?

In the Xen hypervisor? You can simply call get_s_time().
In the guest kernel, using CNTVCT is the best option as it is not trapped.
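
For example, inside Xen a measurement can be as simple as (a sketch; NOW()
is a wrapper around get_s_time() and returns nanoseconds since boot):

s_time_t t1, t2;
t1 = NOW();
/* code under test */
t2 = NOW();
printk("elapsed: %ld ns\n", (long)(t2 - t1));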


* Re: Question about high CPU load during iperf ethernet testing
  2014-09-24 16:37   ` Iurii Konovalenko
@ 2014-09-24 17:57     ` Stefano Stabellini
  2014-09-25 14:12       ` Iurii Konovalenko
  0 siblings, 1 reply; 10+ messages in thread
From: Stefano Stabellini @ 2014-09-24 17:57 UTC (permalink / raw)
  To: Iurii Konovalenko
  Cc: xen-devel, Ian Campbell, Julien Grall, Stefano Stabellini

On Wed, 24 Sep 2014, Iurii Konovalenko wrote:
> Hi, Stefano!
> Thank you for your reply!
> 
> On Tue, Sep 23, 2014 at 7:41 PM, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
> > On Mon, 22 Sep 2014, Iurii Konovalenko wrote:
> >> Hello, all!
> >>
> >> I am running iperf ethernet tests on DRA7XX_EVM board (OMAP5).
> >> Xen version is 4.4.
> >> I run only Linux (kernel 3.8) as Dom0, no other active domains (For clear tests results I decided not to start DomU).
> >> iperf server is started on host, iperf client is started on board with command line "iperf -c 192.168.2.10 -w 256k -m
> >> -f M -d -t 60".
> >
> > Just to double check: you are running the iperf test in Dom0, correct?
> 
> Yes, iperf is running in Dom0.
> 
> >> During test I studied CPU load with top tool on Dom0, and saw, that one VCPU is totally loaded, spending about 50% in
> >> software IRQs, and 50% in system.
> >> Running the same test on clear Linux without Xen, I saw that CPU load is about 2-4%.
> >>
> >> I decided to debug a bit, so I used "({register uint64_t _r; asm volatile("mrrc " "p15, 0, %0, %H0, c14" ";" : "=r"
> >> (_r)); _r; })" command to read timer counter before and after operations I want to test.
> >>
> >> In such way I've found, that most time of CPU is spent in functions enable_irq/disable_irq_nosync and
> >> spin_lock_irqsave/spin_unlock_irqrestore (mostly in "mrs    %0, cpsr    @ arch_local_irq_save"/"msr    cpsr_c, %0    @
> >> local_irq_restore"). When running without Xen it should not take so much time.
> >
> > There is nothing Xen specific in the Linux ARM implementation of
> > spin_lock_irqsave/spin_unlock_irqrestore and
> > enable_irq/disable_irq_nosync.
> >
> 
> That is strange, because my explorations show a lot of time is spent
> there, for example in spin_unlock_irqrestore (mostly in  mrs
> instuction) about 20%, when running in Dom0.

Unless you are doing something wrong in your measurements: if you have
really narrowed it down to one instruction, then I would try the same test
on a SoC from a different vendor to see whether it is actually a hardware
issue.


> >> So, could anyone explain me some questions:
> >> 1. Is it normal behaviour?
> >
> > No, it is not normal.
> > Assuming that you assign all the memory to Dom0 and as many vcpu as
> > physical cpus on your platform then you should get the same numbers as
> > native.
> 
> OK, so I might do something wrong.
> 
> >> 2. Does hypervisor trap cpsr register? I suppose, that hypervisor trap access to cpsr register, that leads to
> >> additional overhead, but I can't find place in sources where it happens.
> >
> > We don't trap cpsr.
> 
> It is strange, because it was only one my assumption, where time can be spent.
> So could you please advise where to go to understand the reason of
> such high VCPU load?

I don't know. When you say that arch_local_irq_save is the one taking
all the time, do you actually have something like:

time1 = read CNTVCT;
arch_local_irq_save();
time2 = read CNTVCT;
printk(time2-time1);

in your code?



> Best regards.
> 
> Iurii Konovalenko | Senior Software Engineer
> GlobalLogic
> 


* Re: Question about high CPU load during iperf ethernet testing
  2014-09-24 17:57     ` Stefano Stabellini
@ 2014-09-25 14:12       ` Iurii Konovalenko
  2014-09-25 15:10         ` Ian Campbell
  0 siblings, 1 reply; 10+ messages in thread
From: Iurii Konovalenko @ 2014-09-25 14:12 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Julien Grall, Ian Campbell, xen-devel

On Wed, Sep 24, 2014 at 8:57 PM, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> On Wed, 24 Sep 2014, Iurii Konovalenko wrote:
>> Hi, Stefano!
>> Thank you for your reply!
>>
>> On Tue, Sep 23, 2014 at 7:41 PM, Stefano Stabellini
>> <stefano.stabellini@eu.citrix.com> wrote:
>> > On Mon, 22 Sep 2014, Iurii Konovalenko wrote:
>> >> Hello, all!
>> >>
>> >> I am running iperf ethernet tests on DRA7XX_EVM board (OMAP5).
>> >> Xen version is 4.4.
>> >> I run only Linux (kernel 3.8) as Dom0, no other active domains (For clear tests results I decided not to start DomU).
>> >> iperf server is started on host, iperf client is started on board with command line "iperf -c 192.168.2.10 -w 256k -m
>> >> -f M -d -t 60".
>> >
>> > Just to double check: you are running the iperf test in Dom0, correct?
>>
>> Yes, iperf is running in Dom0.
>>
>> >> During test I studied CPU load with top tool on Dom0, and saw, that one VCPU is totally loaded, spending about 50% in
>> >> software IRQs, and 50% in system.
>> >> Running the same test on clear Linux without Xen, I saw that CPU load is about 2-4%.
>> >>
>> >> I decided to debug a bit, so I used "({register uint64_t _r; asm volatile("mrrc " "p15, 0, %0, %H0, c14" ";" : "=r"
>> >> (_r)); _r; })" command to read timer counter before and after operations I want to test.
>> >>
>> >> In such way I've found, that most time of CPU is spent in functions enable_irq/disable_irq_nosync and
>> >> spin_lock_irqsave/spin_unlock_irqrestore (mostly in "mrs    %0, cpsr    @ arch_local_irq_save"/"msr    cpsr_c, %0    @
>> >> local_irq_restore"). When running without Xen it should not take so much time.
>> >
>> > There is nothing Xen specific in the Linux ARM implementation of
>> > spin_lock_irqsave/spin_unlock_irqrestore and
>> > enable_irq/disable_irq_nosync.
>> >
>>
>> That is strange, because my explorations show a lot of time is spent
>> there, for example in spin_unlock_irqrestore (mostly in  mrs
>> instuction) about 20%, when running in Dom0.
>
> Unless you are doing something wrong in your measurements, if you really
> narrowed it down to one instruction then I would try to do the same on a
> different SoC of another vendor to see if it is actually an hardware issue.
>
>
>> >> So, could anyone explain me some questions:
>> >> 1. Is it normal behaviour?
>> >
>> > No, it is not normal.
>> > Assuming that you assign all the memory to Dom0 and as many vcpu as
>> > physical cpus on your platform then you should get the same numbers as
>> > native.
>>
>> OK, so I might do something wrong.
>>
>> >> 2. Does hypervisor trap cpsr register? I suppose, that hypervisor trap access to cpsr register, that leads to
>> >> additional overhead, but I can't find place in sources where it happens.
>> >
>> > We don't trap cpsr.
>>
>> It is strange, because it was only one my assumption, where time can be spent.
>> So could you please advise where to go to understand the reason of
>> such high VCPU load?
>
> I don't know. When you say that arch_local_irq_save is the one taking
> all the time, do you actually have something like:
>
> time1 = read CNTVCT;
> arch_local_irq_save();
> time2 = read CNTVCT;
> printk(time2-time1);
>
> in your code?

Almost like this, with the only difference that I accumulate the time
differences in a variable and, when its value is greater than 1s, print a
message, like this:

volatile u64 total;
#define SECONDS(_s)     ((s64)((_s) * 1000000000ULL))
static inline s64 ticks_to_ns(uint64_t ticks)
{
    /* counter frequency is 1000 * 6144 = 6.144 MHz on this board */
    return muldiv64(ticks, SECONDS(1), 1000 * 6144);
}

time1 = read CNTVCT;
arch_local_irq_save();
time2 = read CNTVCT;
total += time2 - time1;
if (ticks_to_ns(total) > 1000000000)
{
    printk("1 second spent here");
    total = 0;
}

Best regards.

Iurii Konovalenko | Senior Software Engineer
GlobalLogic


* Re: Question about high CPU load during iperf ethernet testing
  2014-09-25 14:12       ` Iurii Konovalenko
@ 2014-09-25 15:10         ` Ian Campbell
  2014-09-27 11:20           ` Iurii Konovalenko
  0 siblings, 1 reply; 10+ messages in thread
From: Ian Campbell @ 2014-09-25 15:10 UTC (permalink / raw)
  To: Iurii Konovalenko; +Cc: Julien Grall, xen-devel, Stefano Stabellini

On Thu, 2014-09-25 at 17:12 +0300, Iurii Konovalenko wrote:
> On Wed, Sep 24, 2014 at 8:57 PM, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
> > On Wed, 24 Sep 2014, Iurii Konovalenko wrote:
> >> Hi, Stefano!
> >> Thank you for your reply!
> >>
> >> On Tue, Sep 23, 2014 at 7:41 PM, Stefano Stabellini
> >> <stefano.stabellini@eu.citrix.com> wrote:
> >> > On Mon, 22 Sep 2014, Iurii Konovalenko wrote:
> >> >> Hello, all!
> >> >>
> >> >> I am running iperf ethernet tests on DRA7XX_EVM board (OMAP5).
> >> >> Xen version is 4.4.
> >> >> I run only Linux (kernel 3.8) as Dom0, no other active domains (For clear tests results I decided not to start DomU).
> >> >> iperf server is started on host, iperf client is started on board with command line "iperf -c 192.168.2.10 -w 256k -m
> >> >> -f M -d -t 60".
> >> >
> >> > Just to double check: you are running the iperf test in Dom0, correct?
> >>
> >> Yes, iperf is running in Dom0.
> >>
> >> >> During test I studied CPU load with top tool on Dom0, and saw, that one VCPU is totally loaded, spending about 50% in
> >> >> software IRQs, and 50% in system.
> >> >> Running the same test on clear Linux without Xen, I saw that CPU load is about 2-4%.
> >> >>
> >> >> I decided to debug a bit, so I used "({register uint64_t _r; asm volatile("mrrc " "p15, 0, %0, %H0, c14" ";" : "=r"
> >> >> (_r)); _r; })" command to read timer counter before and after operations I want to test.
> >> >>
> >> >> In such way I've found, that most time of CPU is spent in functions enable_irq/disable_irq_nosync and
> >> >> spin_lock_irqsave/spin_unlock_irqrestore (mostly in "mrs    %0, cpsr    @ arch_local_irq_save"/"msr    cpsr_c, %0    @
> >> >> local_irq_restore"). When running without Xen it should not take so much time.
> >> >
> >> > There is nothing Xen specific in the Linux ARM implementation of
> >> > spin_lock_irqsave/spin_unlock_irqrestore and
> >> > enable_irq/disable_irq_nosync.
> >> >
> >>
> >> That is strange, because my explorations show a lot of time is spent
> >> there, for example in spin_unlock_irqrestore (mostly in  mrs
> >> instuction) about 20%, when running in Dom0.
> >
> > Unless you are doing something wrong in your measurements, if you really
> > narrowed it down to one instruction then I would try to do the same on a
> > different SoC of another vendor to see if it is actually an hardware issue.
> >
> >
> >> >> So, could anyone explain me some questions:
> >> >> 1. Is it normal behaviour?
> >> >
> >> > No, it is not normal.
> >> > Assuming that you assign all the memory to Dom0 and as many vcpu as
> >> > physical cpus on your platform then you should get the same numbers as
> >> > native.
> >>
> >> OK, so I might do something wrong.
> >>
> >> >> 2. Does hypervisor trap cpsr register? I suppose, that hypervisor trap access to cpsr register, that leads to
> >> >> additional overhead, but I can't find place in sources where it happens.
> >> >
> >> > We don't trap cpsr.
> >>
> >> It is strange, because it was only one my assumption, where time can be spent.
> >> So could you please advise where to go to understand the reason of
> >> such high VCPU load?
> >
> > I don't know. When you say that arch_local_irq_save is the one taking
> > all the time, do you actually have something like:
> >
> > time1 = read CNTVCT;
> > arch_local_irq_save();
> > time2 = read CNTVCT;
> > printk(time2-time1);
> >
> > in your code?
> 
> Almost like this, with only difference that I accumulate difference of
> time in variable, and when it's value is grater then 1s, make print,
> like this:
> 
> volatile u64 total;
> #define SECONDS(_s)     ((s64)((_s)  * 1000000000ULL))
> static inline s64 ticks_to_ns(uint64_t ticks)
> {
>     return muldiv64(ticks, SECONDS(1), 1000 * 6144 );
> }
> 
> time1 = read CNTVCT;
> arch_local_irq_save();
> time2 = read CNTVCT;
> total += time2-time1;
> if(ticks_to_ns(total) > 1000000000)
> {
>     printk("1 second spent here");
>     total = 0;
> }

Doesn't that just tell you that this code is called frequently and hence
clocks up the total faster, as opposed to telling you that the operation
itself is slow?

I'd expect IRQs to be turned on and off quite a lot under normal operation:
certain spinlock variants disable IRQs, for example, and might be used
frequently.
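
One way to tell the two apart (a variation on your snippet, untested) would
be to also count the calls and look at the average cost per call, e.g.:

static u64 total_ticks, calls;

time1 = read CNTVCT;
arch_local_irq_save();
time2 = read CNTVCT;
total_ticks += time2 - time1;
calls++;
if (ticks_to_ns(total_ticks) > 1000000000)
{
    printk("1s accumulated over %llu calls, avg %llu ticks/call\n",
           calls, total_ticks / calls);
    total_ticks = 0;
    calls = 0;
}

A high call count with a low average would point at the path simply being
hot; a low count with a high average would point at the operation itself
being expensive.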

Ian.

> 
> Best regards.
> 
> Iurii Konovalenko | Senior Software Engineer
> GlobalLogic
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


* Re: Question about high CPU load during iperf ethernet testing
  2014-09-25 15:10         ` Ian Campbell
@ 2014-09-27 11:20           ` Iurii Konovalenko
  0 siblings, 0 replies; 10+ messages in thread
From: Iurii Konovalenko @ 2014-09-27 11:20 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Julien Grall, xen-devel, Stefano Stabellini

On Thu, Sep 25, 2014 at 6:10 PM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Thu, 2014-09-25 at 17:12 +0300, Iurii Konovalenko wrote:
>> On Wed, Sep 24, 2014 at 8:57 PM, Stefano Stabellini
>> <stefano.stabellini@eu.citrix.com> wrote:
>> > On Wed, 24 Sep 2014, Iurii Konovalenko wrote:
>> >> Hi, Stefano!
>> >> Thank you for your reply!
>> >>
>> >> On Tue, Sep 23, 2014 at 7:41 PM, Stefano Stabellini
>> >> <stefano.stabellini@eu.citrix.com> wrote:
>> >> > On Mon, 22 Sep 2014, Iurii Konovalenko wrote:
>> >> >> Hello, all!
>> >> >>
>> >> >> I am running iperf ethernet tests on DRA7XX_EVM board (OMAP5).
>> >> >> Xen version is 4.4.
>> >> >> I run only Linux (kernel 3.8) as Dom0, no other active domains (For clear tests results I decided not to start DomU).
>> >> >> iperf server is started on host, iperf client is started on board with command line "iperf -c 192.168.2.10 -w 256k -m
>> >> >> -f M -d -t 60".
>> >> >
>> >> > Just to double check: you are running the iperf test in Dom0, correct?
>> >>
>> >> Yes, iperf is running in Dom0.
>> >>
>> >> >> During test I studied CPU load with top tool on Dom0, and saw, that one VCPU is totally loaded, spending about 50% in
>> >> >> software IRQs, and 50% in system.
>> >> >> Running the same test on clear Linux without Xen, I saw that CPU load is about 2-4%.
>> >> >>
>> >> >> I decided to debug a bit, so I used "({register uint64_t _r; asm volatile("mrrc " "p15, 0, %0, %H0, c14" ";" : "=r"
>> >> >> (_r)); _r; })" command to read timer counter before and after operations I want to test.
>> >> >>
>> >> >> In such way I've found, that most time of CPU is spent in functions enable_irq/disable_irq_nosync and
>> >> >> spin_lock_irqsave/spin_unlock_irqrestore (mostly in "mrs    %0, cpsr    @ arch_local_irq_save"/"msr    cpsr_c, %0    @
>> >> >> local_irq_restore"). When running without Xen it should not take so much time.
>> >> >
>> >> > There is nothing Xen specific in the Linux ARM implementation of
>> >> > spin_lock_irqsave/spin_unlock_irqrestore and
>> >> > enable_irq/disable_irq_nosync.
>> >> >
>> >>
>> >> That is strange, because my explorations show a lot of time is spent
>> >> there, for example in spin_unlock_irqrestore (mostly in  mrs
>> >> instuction) about 20%, when running in Dom0.
>> >
>> > Unless you are doing something wrong in your measurements, if you really
>> > narrowed it down to one instruction then I would try to do the same on a
>> > different SoC of another vendor to see if it is actually an hardware issue.
>> >
>> >
>> >> >> So, could anyone explain me some questions:
>> >> >> 1. Is it normal behaviour?
>> >> >
>> >> > No, it is not normal.
>> >> > Assuming that you assign all the memory to Dom0 and as many vcpu as
>> >> > physical cpus on your platform then you should get the same numbers as
>> >> > native.
>> >>
>> >> OK, so I might do something wrong.
>> >>
>> >> >> 2. Does hypervisor trap cpsr register? I suppose, that hypervisor trap access to cpsr register, that leads to
>> >> >> additional overhead, but I can't find place in sources where it happens.
>> >> >
>> >> > We don't trap cpsr.
>> >>
>> >> It is strange, because it was only one my assumption, where time can be spent.
>> >> So could you please advise where to go to understand the reason of
>> >> such high VCPU load?
>> >
>> > I don't know. When you say that arch_local_irq_save is the one taking
>> > all the time, do you actually have something like:
>> >
>> > time1 = read CNTVCT;
>> > arch_local_irq_save();
>> > time2 = read CNTVCT;
>> > printk(time2-time1);
>> >
>> > in your code?
>>
>> Almost like this, with only difference that I accumulate difference of
>> time in variable, and when it's value is grater then 1s, make print,
>> like this:
>>
>> volatile u64 total;
>> #define SECONDS(_s)     ((s64)((_s)  * 1000000000ULL))
>> static inline s64 ticks_to_ns(uint64_t ticks)
>> {
>>     return muldiv64(ticks, SECONDS(1), 1000 * 6144 );
>> }
>>
>> time1 = read CNTVCT;
>> arch_local_irq_save();
>> time2 = read CNTVCT;
>> total += time2-time1;
>> if(ticks_to_ns(total) > 1000000000)
>> {
>>     printk("1 second spent here");
>>     total = 0;
>> }
>
> Doesn't that just tell you that this code is called frequently and hence
> clocks up the total faster? As opposed to telling you that this
> operation is slow.
>
> I'd expect IRQs to go on and off quite a lot under normal operation e.g.
> certain spinlock variants will disable irqs for example and might be
> frequently used.
>
> Ian.

Of course it could be because of frequent calls, but why such a difference
with and without Xen?
I thought the number of spinlock calls would be more or less the same with
or without Xen.

Best regards.

Iurii Konovalenko | Senior Software Engineer
GlobalLogic

