All of lore.kernel.org
 help / color / mirror / Atom feed
* Module vs Kernel main performance
@ 2012-05-29 23:50 Abu Rasheda
  2012-05-30  4:18 ` Mulyadi Santosa
  2012-06-07 23:36 ` Peter Senna Tschudin
  0 siblings, 2 replies; 16+ messages in thread
From: Abu Rasheda @ 2012-05-29 23:50 UTC (permalink / raw)
  To: kernelnewbies

Hi,

I am working on the x86_64 arch. I profiled (with oprofile) a Linux kernel
module and noticed that a whole lot of cycles are spent in the
copy_from_user call. I compared the same flow in kernel proper and noticed
that, for more data throughput, far fewer cycles are spent in
copy_from_user. Kernel proper uses 1/8 of the cycles compared to the
module. (There is a user process which keeps sending data, like iperf.)

Used the perf tool to gather some statistics. For the call from kernel proper:

185,719,857,837 cpu-cycles            #    3.318 GHz                    [90.01%]
 99,886,030,243 instructions          #    0.54  insns per cycle        [95.00%]
  1,696,072,702 cache-references      #   30.297 M/sec                  [94.99%]
    786,929,244 cache-misses          #   46.397 % of all cache refs    [95.00%]
 16,867,747,688 branch-instructions   #  301.307 M/sec                  [95.03%]
     86,752,646 branch-misses         #    0.51% of all branches        [95.00%]
  5,482,768,332 bus-cycles            #   97.938 M/sec                  [20.08%]
   55967.269801 cpu-clock
   55981.842225 task-clock            #    0.933 CPUs utilized

and for the call from the kernel module:

 9,388,787,678 cpu-cycles             #    1.527 GHz                    [89.77%]
 1,706,203,221 instructions           #    0.18  insns per cycle        [94.59%]
   551,010,961 cache-references       #   89.588 M/sec                  [94.73%]
   369,632,492 cache-misses           #   67.083 % of all cache refs    [95.18%]
   291,358,658 branch-instructions    #   47.372 M/sec                  [94.68%]
    10,291,678 branch-misses          #    3.53% of all branches        [95.01%]
   582,651,999 bus-cycles             #   94.733 M/sec                  [20.55%]
   6112.471585 cpu-clock
   6150.490210 task-clock             #    0.102 CPUs utilized
           367 page-faults            #    0.000 M/sec
           367 minor-faults           #    0.000 M/sec
             0 major-faults           #    0.000 M/sec
        25,770 context-switches       #    0.004 M/sec
            23 cpu-migrations         #    0.000 M/sec


So obviously, the CPU is stalling while it is copying data, and there are
more cache misses. My question is: is there a difference between calling
copy_from_user from kernel proper and calling it from an LKM?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Module vs Kernel main performance
  2012-05-29 23:50 Module vs Kernel main performance Abu Rasheda
@ 2012-05-30  4:18 ` Mulyadi Santosa
  2012-05-30  4:51   ` Abu Rasheda
  2012-06-07 23:36 ` Peter Senna Tschudin
  1 sibling, 1 reply; 16+ messages in thread
From: Mulyadi Santosa @ 2012-05-30  4:18 UTC (permalink / raw)
  To: kernelnewbies

Hi...

On Wed, May 30, 2012 at 6:50 AM, Abu Rasheda <rcpilot2010@gmail.com> wrote:
> So obviously, the CPU is stalling while it is copying data, and there are
> more cache misses. My question is: is there a difference between calling
> copy_from_user from kernel proper and calling it from an LKM?

Theoretically, it should be the same. However, one thing that might
interest you is the fact that Linux kernel module memory is
allocated through vmalloc(), so there is a chance it is not
physically contiguous... whereas the main kernel image uses
page_alloc() IIRC and is thus physically contiguous.

What I mean is, there must be a difference in speed when you copy
into something contiguous vs. non-contiguous. IIRC it will at least
waste some portion of the L1/L2 cache.

Just my 2 cents, maybe I am wrong somewhere...


-- 
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Module vs Kernel main performance
  2012-05-30  4:18 ` Mulyadi Santosa
@ 2012-05-30  4:51   ` Abu Rasheda
  2012-05-30 16:45     ` Mulyadi Santosa
  0 siblings, 1 reply; 16+ messages in thread
From: Abu Rasheda @ 2012-05-30  4:51 UTC (permalink / raw)
  To: kernelnewbies

> What I mean is, there must be a difference in speed when you copy
> into something contiguous vs. non-contiguous. IIRC it will at least
> waste some portion of the L1/L2 cache.

When you say the LKM area is prepared with vmalloc, is it the code /
executable you are referring to? If so, will it matter for data copies?

Point #2: Someone was saying that, at least on MIPS, it takes more
cycles to call a kernel main function from a module because of the long jump.
Does that apply to x86_64 too?

To test the above two, should I make my module part of the static kernel?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Module vs Kernel main performance
  2012-05-30  4:51   ` Abu Rasheda
@ 2012-05-30 16:45     ` Mulyadi Santosa
  2012-05-30 21:44       ` Abu Rasheda
  0 siblings, 1 reply; 16+ messages in thread
From: Mulyadi Santosa @ 2012-05-30 16:45 UTC (permalink / raw)
  To: kernelnewbies

Hi...

On Wed, May 30, 2012 at 11:51 AM, Abu Rasheda <rcpilot2010@gmail.com> wrote:
> When you say the LKM area is prepared with vmalloc, is it the code /
> executable you are referring to?

Yes, AFAIK the memory for code and static data in a Linux kernel module is
allocated via vmalloc().

> If so, will it matter for data copies?

see my previous reply :)

>
> Point #2: Someone was saying that, at least on MIPS, it takes more
> cycles to call a kernel main function from a module because of the long jump.
> Does that apply to x86_64 too?

IIRC a long jump means jumping more than 64 KB... but that's in real mode
on 32-bit... so I am not sure whether it still applies in protected
mode.

> To test the above two, should I make my module part of the static kernel?

good idea....i think you can try that... :)

-- 
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Module vs Kernel main performance
  2012-05-30 16:45     ` Mulyadi Santosa
@ 2012-05-30 21:44       ` Abu Rasheda
  2012-05-31  0:17         ` Abu Rasheda
  2012-05-31  5:35         ` Mulyadi Santosa
  0 siblings, 2 replies; 16+ messages in thread
From: Abu Rasheda @ 2012-05-30 21:44 UTC (permalink / raw)
  To: kernelnewbies

I did another experiment.

I wrote a standalone module and a user program which does an ioctl and
passes a buffer to the kernel module.

The user program passes a buffer through ioctl; the kernel module does a
kmalloc, calls copy_from_user, does a kfree, and returns. The test program
sends 120 gigabytes of data to the module.
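The module's ioctl path can be sketched as a userspace analogue, with malloc/memcpy/free standing in for kmalloc/copy_from_user/kfree (the function name is illustrative, not taken from the attached m.c):

```c
#include <stdlib.h>
#include <string.h>

/* Userspace sketch of the experiment's ioctl handler: every call
 * allocates a fresh "kernel-side" buffer, copies the user buffer
 * into it, and frees it again.  malloc/memcpy/free stand in for
 * kmalloc/copy_from_user/kfree. */
int handle_one_ioctl(const char *user_buf, size_t len)
{
    char *kbuf = malloc(len);        /* kmalloc(len, GFP_KERNEL) */
    if (!kbuf)
        return -1;
    memcpy(kbuf, user_buf, len);     /* copy_from_user(kbuf, ubuf, len) */
    /* a real module would process or DMA kbuf here; the test drops it */
    free(kbuf);                      /* kfree(kbuf) */
    return 0;
}
```

The test program drives this in a loop until 120 GB have been pushed through, varying `len` from 1k up to 100k.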

If I pass a 1k buffer per call, I get

115,396,349,819 instructions          #    0.90  insns per cycle        [95.00%]

as I increase the size of the buffer, insns per cycle keeps decreasing. Here is the data:

    1k 0.90  insns per cycle
    8k 0.43  insns per cycle
  43k 0.18  insns per cycle
100k 0.08  insns per cycle

Showing that copy_from_user is more efficient when the copied data is small.
Why is that?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Module vs Kernel main performance
  2012-05-30 21:44       ` Abu Rasheda
@ 2012-05-31  0:17         ` Abu Rasheda
  2012-05-31  5:35         ` Mulyadi Santosa
  1 sibling, 0 replies; 16+ messages in thread
From: Abu Rasheda @ 2012-05-31  0:17 UTC (permalink / raw)
  To: kernelnewbies

On Wed, May 30, 2012 at 2:44 PM, Abu Rasheda <rcpilot2010@gmail.com> wrote:
> I did another experiment.
>
> Wrote a stand alone module and user program which does ioctl and pass
> buffer to kernel module.
>
> User program passes a buffer through ioctl and kernel module does
> kmalloc on it and calls copy_from_user, kfree and return. Test program
> send 120 gigabyte data to module.
>
> If I pass 1k buffer per call, I get
>
> 115,396,349,819 instructions          #    0.90  insns per cycle       [95.00%]
>
> as I increase size of buffer, insns per cycle keep decreasing. Here is the data:
>
>    1k 0.90  insns per cycle
>    8k 0.43  insns per cycle
>   43k 0.18  insns per cycle
> 100k 0.08  insns per cycle
>
> Showing that copy_from_user is more efficient when copy data is small,
> why it is so ?

Did another experiment:

User program sending 43k, the module allocating 43k after entering the ioctl,
and copy_from_user copying a smaller portion in each call:
----------------------------------------------------------------
copy_from_user  0.25k at a time 0.56  insns per cycle
copy_from_user  0.50k at a time 0.42  insns per cycle
copy_from_user  1.00k at a time 0.36  insns per cycle
copy_from_user  2.00k at a time 0.29  insns per cycle
copy_from_user  3.00k at a time 0.26  insns per cycle
copy_from_user  4.00k at a time 0.23  insns per cycle
copy_from_user  8.00k at a time 0.21  insns per cycle
copy_from_user 16.00k at a time 0.19  insns per cycle


User program sending 43k, allocating a smaller chunk, and passing that
chunk to each call to copy_from_user:
----------------------------------------------------------------
Allocated 0.25k and copy_from_user  0.25k at a time 1.04 insns per cycle
Allocated 0.50k and copy_from_user  0.50k at a time 0.90 insns per cycle
Allocated 1.00k and copy_from_user  1.00k at a time 0.79 insns per cycle
Allocated 2.00k and copy_from_user  2.00k at a time 0.67 insns per cycle
Allocated 4.00k and copy_from_user  4.00k at a time 0.53 insns per cycle
Allocated 8.00k and copy_from_user  8.00k at a time 0.42 insns per cycle
Allocated 16.00k and copy_from_user 16.00k at a time 0.33 insns per cycle
Allocated 32.00k and copy_from_user 32.00k at a time 0.22 insns per cycle
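The two setups above differ only in whether the destination buffer is reused across chunks or allocated per chunk; a minimal userspace sketch of both (illustrative names, memcpy standing in for copy_from_user):

```c
#include <stdlib.h>
#include <string.h>

/* Variant 1: one large destination buffer, filled chunk by chunk
 * (the "allocate 43k once" case above). */
void copy_into_large(char *dst, const char *src, size_t total, size_t chunk)
{
    size_t off, n;

    for (off = 0; off < total; off += chunk) {
        n = (total - off < chunk) ? total - off : chunk;
        memcpy(dst + off, src + off, n);  /* one copy_from_user() per chunk */
    }
}

/* Variant 2: a chunk-sized destination allocated and freed per chunk
 * (the "allocate smaller chunk" case above). */
int copy_into_chunks(const char *src, size_t total, size_t chunk)
{
    size_t off, n;

    for (off = 0; off < total; off += chunk) {
        n = (total - off < chunk) ? total - off : chunk;
        char *dst = malloc(n);            /* kmalloc per chunk */
        if (!dst)
            return -1;
        memcpy(dst, src + off, n);
        free(dst);
    }
    return 0;
}
```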

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Module vs Kernel main performance
  2012-05-30 21:44       ` Abu Rasheda
  2012-05-31  0:17         ` Abu Rasheda
@ 2012-05-31  5:35         ` Mulyadi Santosa
  2012-05-31 13:35           ` Abu Rasheda
  1 sibling, 1 reply; 16+ messages in thread
From: Mulyadi Santosa @ 2012-05-31  5:35 UTC (permalink / raw)
  To: kernelnewbies

Hi...

On Thu, May 31, 2012 at 4:44 AM, Abu Rasheda <rcpilot2010@gmail.com> wrote:
> as I increase size of buffer, insns per cycle keep decreasing. Here is the data:
>
>    1k 0.90  insns per cycle
>    8k 0.43  insns per cycle
>   43k 0.18  insns per cycle
> 100k 0.08  insns per cycle
>
> Showing that copy_from_user is more efficient when copy data is small,
> why it is so ?

You mean the bigger the buffer, the fewer instructions per cycle, right?

Not sure why, but I am sure it will reach some peak point.

Anyway, you did kmalloc() and then kfree()? I think that's why... a bigger
buffer will grab a large chunk from the slab... and again it's likely
physically contiguous. Also, it will be placed in the same cache lines.

Whereas the smaller ones... will hit the allocate/free cycle more often...
thus flushing the L1/L2 cache even more.

CMIIW people...

-- 
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Module vs Kernel main performance
  2012-05-31  5:35         ` Mulyadi Santosa
@ 2012-05-31 13:35           ` Abu Rasheda
  2012-06-01  0:27             ` Chetan Nanda
  0 siblings, 1 reply; 16+ messages in thread
From: Abu Rasheda @ 2012-05-31 13:35 UTC (permalink / raw)
  To: kernelnewbies

On Wed, May 30, 2012 at 10:35 PM, Mulyadi Santosa
<mulyadi.santosa@gmail.com> wrote:
> Hi...
>
> On Thu, May 31, 2012 at 4:44 AM, Abu Rasheda <rcpilot2010@gmail.com> wrote:
>> as I increase size of buffer, insns per cycle keep decreasing. Here is the data:
>>
>>    1k 0.90  insns per cycle
>>    8k 0.43  insns per cycle
>>   43k 0.18  insns per cycle
>> 100k 0.08  insns per cycle
>>
>> Showing that copy_from_user is more efficient when copy data is small,
>> why it is so ?
>
> you meant, the bigger the buffer, the fewer the instructions, right?

yes

>
> Not sure why, but I am sure it will reach some peak point.
>
> Anyway, you did kmalloc and then kfree()? I think that's why...bigger
> buffer will grab large chunk from slab...and again likely it's
> physically contigous. Also, it will be placed in the same cache line.
>
> Whereas the smaller one....will hit allocate/free cycle more...thus
> flushing the L1/L2 cache even more.

It seems to be doing the opposite: the bigger the allocation / copy, the
longer the stall is.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Module vs Kernel main performance
  2012-05-31 13:35           ` Abu Rasheda
@ 2012-06-01  0:27             ` Chetan Nanda
  2012-06-01 18:52               ` Abu Rasheda
  0 siblings, 1 reply; 16+ messages in thread
From: Chetan Nanda @ 2012-06-01  0:27 UTC (permalink / raw)
  To: kernelnewbies

On May 31, 2012 9:37 PM, "Abu Rasheda" <rcpilot2010@gmail.com> wrote:
>
> On Wed, May 30, 2012 at 10:35 PM, Mulyadi Santosa
> <mulyadi.santosa@gmail.com> wrote:
> > Hi...
> >
> > On Thu, May 31, 2012 at 4:44 AM, Abu Rasheda <rcpilot2010@gmail.com>
wrote:
> >> as I increase size of buffer, insns per cycle keep decreasing. Here is
the data:
> >>
> >>    1k 0.90  insns per cycle
> >>    8k 0.43  insns per cycle
> >>  43k 0.18  insns per cycle
> >> 100k 0.08  insns per cycle
> >>
> >> Showing that copy_from_user is more efficient when copy data is small,
> >> why it is so ?
> >
> > you meant, the bigger the buffer, the fewer the instructions, right?
>
> yes
>
If the buffer on the user side is more than a page, then it may be that the
complete user-space buffer is not present in memory and the kernel spends
time processing page faults.
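One way to test the page-fault theory is to prefault the user buffer before the timed ioctl loop, for example by touching one byte per page (or by calling mlock()). A sketch, assuming a 4096-byte page size:

```c
#include <stddef.h>

/* Touch one byte per page so the whole buffer is faulted in before
 * measurement starts.  4096 is an assumed page size; real code should
 * use sysconf(_SC_PAGESIZE). */
void prefault(char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i += 4096)
        buf[i] = 0;
    if (len)
        buf[len - 1] = 0;   /* make sure the final page is touched too */
}
```

If the insns-per-cycle numbers are unchanged with a prefaulted buffer, page faults can be ruled out as the cause.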
> >
> > Not sure why, but I am sure it will reach some peak point.
> >
> > Anyway, you did kmalloc and then kfree()? I think that's why...bigger
> > buffer will grab large chunk from slab...and again likely it's
> > physically contigous. Also, it will be placed in the same cache line.
> >
> > Whereas the smaller one....will hit allocate/free cycle more...thus
> > flushing the L1/L2 cache even more.
>
> It seems to be doing the opposite: the bigger the allocation / copy, the
> longer the stall is.
>
> _______________________________________________
> Kernelnewbies mailing list
> Kernelnewbies at kernelnewbies.org
> http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.kernelnewbies.org/pipermail/kernelnewbies/attachments/20120601/14efcdb3/attachment.html 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Module vs Kernel main performance
  2012-06-01  0:27             ` Chetan Nanda
@ 2012-06-01 18:52               ` Abu Rasheda
  2012-06-07 13:11                 ` Peter Senna Tschudin
  0 siblings, 1 reply; 16+ messages in thread
From: Abu Rasheda @ 2012-06-01 18:52 UTC (permalink / raw)
  To: kernelnewbies

>
> If the buffer on the user side is more than a page, then it may be that the
> complete user-space buffer is not present in memory and the kernel spends
> time processing page faults
>

I have attached the code for the module and the user program. If anyone is
bored over the weekend, they are welcome to try it and explain the behavior.

Abu Rasheda
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.kernelnewbies.org/pipermail/kernelnewbies/attachments/20120601/8a7dc407/attachment.html 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: m.tgz
Type: application/x-gzip
Size: 18825 bytes
Desc: not available
Url : http://lists.kernelnewbies.org/pipermail/kernelnewbies/attachments/20120601/8a7dc407/attachment.tgz 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Module vs Kernel main performance
  2012-06-01 18:52               ` Abu Rasheda
@ 2012-06-07 13:11                 ` Peter Senna Tschudin
  2012-06-07 17:47                   ` Abu Rasheda
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Senna Tschudin @ 2012-06-07 13:11 UTC (permalink / raw)
  To: kernelnewbies

Hello Abu,

I had to include <linux/module.h> or an error was issued about "THIS_MODULE".

What Kernel version are you using? I'm trying to compile it and I'm
getting the error:

[peter at ace m]$ make
make -C /lib/modules/3.3.7-1.fc17.x86_64/build SUBDIRS=`pwd` modules
make[1]: Entering directory `/usr/src/kernels/3.3.7-1.fc17.x86_64'
  CC [M]  /tmp/m/m.o
/tmp/m/m.c:36:2: error: unknown field 'ioctl' specified in initializer
/tmp/m/m.c:36:2: warning: initialization from incompatible pointer type [enabled by default]
/tmp/m/m.c:36:2: warning: (near initialization for 'm_fops.llseek') [enabled by default]
make[2]: *** [/tmp/m/m.o] Error 1
make[1]: *** [_module_/tmp/m] Error 2
make[1]: Leaving directory `/usr/src/kernels/3.3.7-1.fc17.x86_64'
make: *** [module] Error 2

According to:
http://lxr.linux.no/linux+v3.4.1/include/linux/fs.h#L1609

There is no .ioctl at struct file_operations...
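For reference, the `.ioctl` member was removed in 2.6.36; on current kernels the handler is registered through `.unlocked_ioctl`, whose prototype also drops the `struct inode *` argument. A sketch of the change, assuming the module's handler is named `m_ioctl` (this fragment only illustrates the fops wiring, it is not the full m.c):

```c
/* 2.6.32-era form (what the posted m.c appears to use):
 *   static int m_ioctl(struct inode *inode, struct file *filp,
 *                      unsigned int cmd, unsigned long arg);
 *   static struct file_operations m_fops = { .ioctl = m_ioctl };
 *
 * 2.6.36 and later: */
static long m_ioctl(struct file *filp, unsigned int cmd, unsigned long arg);

static const struct file_operations m_fops = {
    .owner          = THIS_MODULE,
    .unlocked_ioctl = m_ioctl,
};
```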

Can you share how you've used perf/oprofile on your module/Kernel code?

[]'s

Peter


On Fri, Jun 1, 2012 at 3:52 PM, Abu Rasheda <rcpilot2010@gmail.com> wrote:
>> If the buffer on the user side is more than a page, then it may be that the
>> complete user-space buffer is not present in memory and the kernel spends
>> time processing page faults
>
>
> I have attached code for module and user program. If anyone is bored over
> the weekend they are welcome to try and explain the behavior.
>
> Abu Rasheda
>
> _______________________________________________
> Kernelnewbies mailing list
> Kernelnewbies at kernelnewbies.org
> http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
>



-- 
Peter Senna Tschudin
peter.senna at gmail.com
gpg id: 48274C36

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Module vs Kernel main performance
  2012-06-07 13:11                 ` Peter Senna Tschudin
@ 2012-06-07 17:47                   ` Abu Rasheda
  2012-06-07 18:10                     ` Peter Senna Tschudin
  0 siblings, 1 reply; 16+ messages in thread
From: Abu Rasheda @ 2012-06-07 17:47 UTC (permalink / raw)
  To: kernelnewbies

>
> Hello Abu,
>
> I had to include <linux/module.h> or an error was issued about
> "THIS_MODULE".
>

I am running this tool on Scientific Linux 6.0, which has a 2.6.32 kernel. I
know this is old, but this is what I have for my product.


> What Kernel version are you using? I'm trying to compile it and I'm
> getting the error:
>
> [peter at ace m]$ make
> make -C /lib/modules/3.3.7-1.fc17.x86_64/build SUBDIRS=`pwd` modules
> make[1]: Entering directory `/usr/src/kernels/3.3.7-1.fc17.x86_64'
>  CC [M]  /tmp/m/m.o
> /tmp/m/m.c:36:2: error: unknown field 'ioctl' specified in initializer
> /tmp/m/m.c:36:2: warning: initialization from incompatible pointer
> type [enabled by default]
> /tmp/m/m.c:36:2: warning: (near initialization for 'm_fops.llseek')
> [enabled by default]
> [enabled by default]
> make[2]: *** [/tmp/m/m.o] Error 1
> make[1]: *** [_module_/tmp/m] Error 2
> make[1]: Leaving directory `/usr/src/kernels/3.3.7-1.fc17.x86_64'
> make: *** [module] Error 2
>
> According to:
> http://lxr.linux.no/linux+v3.4.1/include/linux/fs.h#L1609
>
> There is no .ioctl at struct file_operations...
>
> Can you share how you've used perf/oprofile on your module/Kernel code?
>
> []'s
>
> Peter


for perf:

perf stat -e
cpu-cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,cache-references,cache-misses,branch-instructions,branch-misses,bus-cycles,cpu-clock,task-clock,page-faults,minor-faults,major-faults,context-switches,cpu-migrations,alignment-faults,emulation-faults,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,L1-icache-loads,L1-icache-load-misses,L1-icache-prefetches,L1-icache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetches,LLC-prefetch-misses,dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,dTLB-prefetches,dTLB-prefetch-misses,iTLB-loads,iTLB-load-misses,branch-loads,branch-load-misses,syscalls:sys_enter_sendmsg,syscalls:sys_exit_sendmsg,sched:sched_wakeup,sched:sched_stat_sleep
./prog

for oprofile:

# opcontrol --reset
# opcontrol --vmlinux=/boot/vmlinux.64
# opcontrol --start
# ./a.out
# opcontrol --shutdown
# opreport -l -p
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.kernelnewbies.org/pipermail/kernelnewbies/attachments/20120607/ab773484/attachment.html 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Module vs Kernel main performance
  2012-06-07 17:47                   ` Abu Rasheda
@ 2012-06-07 18:10                     ` Peter Senna Tschudin
  2012-06-09  1:52                       ` Abu Rasheda
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Senna Tschudin @ 2012-06-07 18:10 UTC (permalink / raw)
  To: kernelnewbies

Hello Abu,

On Thu, Jun 7, 2012 at 2:47 PM, Abu Rasheda <rcpilot2010@gmail.com> wrote:
>> Hello Abu,
>>
>> I had to include <linux/module.h> or an error was issued about
>> "THIS_MODULE".
>
>
> I am running this tool on Scientific Linux 6.0, which is 2.6.32 kernel. I
> know this is old but this is what I have for my product.
>
>
>>
>> What Kernel version are you using? I'm trying to compile it and I'm
>> getting the error:
>>
>> [peter at ace m]$ make
>> make -C /lib/modules/3.3.7-1.fc17.x86_64/build SUBDIRS=`pwd` modules
>> make[1]: Entering directory `/usr/src/kernels/3.3.7-1.fc17.x86_64'
>>  CC [M]  /tmp/m/m.o
>> /tmp/m/m.c:36:2: error: unknown field 'ioctl' specified in initializer
>> /tmp/m/m.c:36:2: warning: initialization from incompatible pointer
>> type [enabled by default]
>> /tmp/m/m.c:36:2: warning: (near initialization for 'm_fops.llseek')
>> [enabled by default]
>> make[2]: *** [/tmp/m/m.o] Error 1
>> make[1]: *** [_module_/tmp/m] Error 2
>> make[1]: Leaving directory `/usr/src/kernels/3.3.7-1.fc17.x86_64'
>> make: *** [module] Error 2
>>
>> According to:
>> http://lxr.linux.no/linux+v3.4.1/include/linux/fs.h#L1609
>>
>> There is no .ioctl at struct file_operations...
>>
>> Can you share how you've used perf/oprofile on your module/Kernel code?
>>
>> []'s
>>
>> Peter
>
>
> for perf:
>
> perf stat -e
> cpu-cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,cache-references,cache-misses,branch-instructions,branch-misses,bus-cycles,cpu-clock,task-clock,page-faults,minor-faults,major-faults,context-switches,cpu-migrations,alignment-faults,emulation-faults,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,L1-icache-loads,L1-icache-load-misses,L1-icache-prefetches,L1-icache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetches,LLC-prefetch-misses,dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,dTLB-prefetches,dTLB-prefetch-misses,iTLB-loads,iTLB-load-misses,branch-loads,branch-load-misses,syscalls:sys_enter_sendmsg,syscalls:sys_exit_sendmsg,sched:sched_wakeup,sched:sched_stat_sleep
> ./prog
>
> for oprofile:
>
> # opcontrol --reset
> # opcontrol --vmlinux=/boot/vmlinux.64
> # opcontrol --start
> # ./a.out
> # opcontrol --shutdown
> # opreport -l -p

Thanks! I'll try it now.

I've made changes to your code, so it "probably" will:
 - Run on a 3.4 kernel
 - Partially meet the kernel coding style (try running scripts/checkpatch.pl -f m.c)
 - Stop working due to the lack of locking in m_ioctl(). I'm working on this now... :-)

See it at: http://pastebin.com/sibPrQJL

[]'s

Peter


-- 
Peter Senna Tschudin
peter.senna at gmail.com
gpg id: 48274C36

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Module vs Kernel main performance
  2012-05-29 23:50 Module vs Kernel main performance Abu Rasheda
  2012-05-30  4:18 ` Mulyadi Santosa
@ 2012-06-07 23:36 ` Peter Senna Tschudin
  2012-06-07 23:41   ` Abu Rasheda
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Senna Tschudin @ 2012-06-07 23:36 UTC (permalink / raw)
  To: kernelnewbies

Hi again!

On Tue, May 29, 2012 at 8:50 PM, Abu Rasheda <rcpilot2010@gmail.com> wrote:
> Hi,
>
> I am working on the x86_64 arch. I profiled (with oprofile) a Linux kernel
> module and noticed that a whole lot of cycles are spent in the
> copy_from_user call. I compared the same flow in kernel proper and noticed
> that, for more data throughput, far fewer cycles are spent in
> copy_from_user. Kernel proper uses 1/8 of the cycles compared to the
> module. (There is a user process which keeps sending data, like iperf.)
>
> Used perf tool to gather some statistics and found that call from kernel proper
>
> 185,719,857,837 cpu-cycles            #    3.318 GHz                   [90.01%]
>  99,886,030,243 instructions          #    0.54  insns per cycle       [95.00%]
>   1,696,072,702 cache-references      #   30.297 M/sec                 [94.99%]
>     786,929,244 cache-misses          #   46.397 % of all cache refs   [95.00%]
>  16,867,747,688 branch-instructions   #  301.307 M/sec                 [95.03%]
>      86,752,646 branch-misses         #    0.51% of all branches       [95.00%]
>   5,482,768,332 bus-cycles            #   97.938 M/sec                 [20.08%]
>    55967.269801 cpu-clock
>    55981.842225 task-clock            #    0.933 CPUs utilized
>
> and call from kernel module
>
>  9,388,787,678 cpu-cycles             #    1.527 GHz                   [89.77%]
>  1,706,203,221 instructions           #    0.18  insns per cycle       [94.59%]
>    551,010,961 cache-references       #   89.588 M/sec                 [94.73%]
>    369,632,492 cache-misses           #   67.083 % of all cache refs   [95.18%]
>    291,358,658 branch-instructions    #   47.372 M/sec                 [94.68%]
>     10,291,678 branch-misses          #    3.53% of all branches       [95.01%]
>    582,651,999 bus-cycles             #   94.733 M/sec                 [20.55%]
>    6112.471585 cpu-clock
>    6150.490210 task-clock             #    0.102 CPUs utilized
>            367 page-faults            #    0.000 M/sec
>            367 minor-faults           #    0.000 M/sec
>              0 major-faults           #    0.000 M/sec
>         25,770 context-switches       #    0.004 M/sec
>             23 cpu-migrations         #    0.000 M/sec

How did you call from Kernel module?

>
>
> So obviously, the CPU is stalling while it is copying data, and there are
> more cache misses. My question is: is there a difference between calling
> copy_from_user from kernel proper and calling it from an LKM?
>
> _______________________________________________
> Kernelnewbies mailing list
> Kernelnewbies at kernelnewbies.org
> http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies

[]'s

-- 
Peter Senna Tschudin
peter.senna at gmail.com
gpg id: 48274C36

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Module vs Kernel main performance
  2012-06-07 23:36 ` Peter Senna Tschudin
@ 2012-06-07 23:41   ` Abu Rasheda
  0 siblings, 0 replies; 16+ messages in thread
From: Abu Rasheda @ 2012-06-07 23:41 UTC (permalink / raw)
  To: kernelnewbies

<peter.senna@gmail.com> wrote:

> Hi again!
>

Hi


> How did you call from Kernel module?


In the original code the copied data is DMAed; in the experimental code the
data is dropped.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.kernelnewbies.org/pipermail/kernelnewbies/attachments/20120607/adfa954b/attachment.html 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Module vs Kernel main performance
  2012-06-07 18:10                     ` Peter Senna Tschudin
@ 2012-06-09  1:52                       ` Abu Rasheda
  0 siblings, 0 replies; 16+ messages in thread
From: Abu Rasheda @ 2012-06-09  1:52 UTC (permalink / raw)
  To: kernelnewbies

I modified my module (m.c). I am still sending a buffer from user space using
ioctl, but instead of copying data from the buffer provided by the user, I
have allocated (kmalloc) a buffer and I copy from this buffer to another
kernel buffer which is allocated each time the module's ioctl is invoked.

copy_from_user is now replaced with memcpy. I still see the processor stall.
This means the buffer allocated per call is the cause.

Abu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.kernelnewbies.org/pipermail/kernelnewbies/attachments/20120608/43f027cc/attachment.html 

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2012-06-09  1:52 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-29 23:50 Module vs Kernel main performance Abu Rasheda
2012-05-30  4:18 ` Mulyadi Santosa
2012-05-30  4:51   ` Abu Rasheda
2012-05-30 16:45     ` Mulyadi Santosa
2012-05-30 21:44       ` Abu Rasheda
2012-05-31  0:17         ` Abu Rasheda
2012-05-31  5:35         ` Mulyadi Santosa
2012-05-31 13:35           ` Abu Rasheda
2012-06-01  0:27             ` Chetan Nanda
2012-06-01 18:52               ` Abu Rasheda
2012-06-07 13:11                 ` Peter Senna Tschudin
2012-06-07 17:47                   ` Abu Rasheda
2012-06-07 18:10                     ` Peter Senna Tschudin
2012-06-09  1:52                       ` Abu Rasheda
2012-06-07 23:36 ` Peter Senna Tschudin
2012-06-07 23:41   ` Abu Rasheda

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.