From mboxrd@z Thu Jan 1 00:00:00 1970 From: Iurii Konovalenko Subject: Re: Question about high CPU load during iperf ethernet testing Date: Thu, 25 Sep 2014 17:12:04 +0300 Message-ID: References: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Stefano Stabellini Cc: Julien Grall , Ian Campbell , xen-devel@lists.xen.org List-Id: xen-devel@lists.xenproject.org On Wed, Sep 24, 2014 at 8:57 PM, Stefano Stabellini wrote: > On Wed, 24 Sep 2014, Iurii Konovalenko wrote: >> Hi, Stefano! >> Thank you for your reply! >> >> On Tue, Sep 23, 2014 at 7:41 PM, Stefano Stabellini >> wrote: >> > On Mon, 22 Sep 2014, Iurii Konovalenko wrote: >> >> Hello, all! >> >> >> >> I am running iperf ethernet tests on DRA7XX_EVM board (OMAP5). >> >> Xen version is 4.4. >> >> I run only Linux (kernel 3.8) as Dom0, no other active domains (For clear tests results I decided not to start DomU). >> >> iperf server is started on host, iperf client is started on board with command line "iperf -c 192.168.2.10 -w 256k -m >> >> -f M -d -t 60". >> > >> > Just to double check: you are running the iperf test in Dom0, correct? >> >> Yes, iperf is running in Dom0. >> >> >> During test I studied CPU load with top tool on Dom0, and saw, that one VCPU is totally loaded, spending about 50% in >> >> software IRQs, and 50% in system. >> >> Running the same test on clear Linux without Xen, I saw that CPU load is about 2-4%. >> >> >> >> I decided to debug a bit, so I used "({register uint64_t _r; asm volatile("mrrc " "p15, 0, %0, %H0, c14" ";" : "=r" >> >> (_r)); _r; })" command to read timer counter before and after operations I want to test. >> >> >> >> In such way I've found, that most time of CPU is spent in functions enable_irq/disable_irq_nosync and >> >> spin_lock_irqsave/spin_unlock_irqrestore (mostly in "mrs %0, cpsr @ arch_local_irq_save"/"msr cpsr_c, %0 @ >> >> local_irq_restore"). When running without Xen it should not take so much time. >> > >> > There is nothing Xen specific in the Linux ARM implementation of >> > spin_lock_irqsave/spin_unlock_irqrestore and >> > enable_irq/disable_irq_nosync. >> > >> >> That is strange, because my explorations show a lot of time is spent >> there, for example in spin_unlock_irqrestore (mostly in mrs >> instuction) about 20%, when running in Dom0. > > Unless you are doing something wrong in your measurements, if you really > narrowed it down to one instruction then I would try to do the same on a > different SoC of another vendor to see if it is actually an hardware issue. > > >> >> So, could anyone explain me some questions: >> >> 1. Is it normal behaviour? >> > >> > No, it is not normal. >> > Assuming that you assign all the memory to Dom0 and as many vcpu as >> > physical cpus on your platform then you should get the same numbers as >> > native. >> >> OK, so I might do something wrong. >> >> >> 2. Does hypervisor trap cpsr register? I suppose, that hypervisor trap access to cpsr register, that leads to >> >> additional overhead, but I can't find place in sources where it happens. >> > >> > We don't trap cpsr. >> >> It is strange, because it was only one my assumption, where time can be spent. >> So could you please advise where to go to understand the reason of >> such high VCPU load? > > I don't know. When you say that arch_local_irq_save is the one taking > all the time, do you actually have something like: > > time1 = read CNTVCT; > arch_local_irq_save(); > time2 = read CNTVCT; > printk(time2-time1); > > in your code? Almost like this, with only difference that I accumulate difference of time in variable, and when it's value is grater then 1s, make print, like this: volatile u64 total; #define SECONDS(_s) ((s64)((_s) * 1000000000ULL)) static inline s64 ticks_to_ns(uint64_t ticks) { return muldiv64(ticks, SECONDS(1), 1000 * 6144 ); } time1 = read CNTVCT; arch_local_irq_save(); time2 = read CNTVCT; total += time2-time1; if(ticks_to_ns(total) > 1000000000) { printk("1 second spent here"); total = 0; } Best regards. Iurii Konovalenko | Senior Software Engineer GlobalLogic