On 05/19/2015 01:06 AM, Deucher, Alexander wrote:
>> -----Original Message-----
>> From: Denys Vlasenko [mailto:vda.linux@googlemail.com]
>> Sent: Monday, May 18, 2015 6:50 PM
>> To: Koenig, Christian
>> Cc: Denys Vlasenko; Deucher, Alexander; Linux Kernel Mailing List
>> Subject: Re: [PATCH v2] radeon: Deinline indirect register accessor functions
>>
>> On Mon, May 18, 2015 at 9:09 PM, Christian König
>> wrote:
>>>> r600_uvd_ctx_rreg: 111 bytes, 4 callsites
>>>> r600_uvd_ctx_wreg: 113 bytes, 5 callsites
>>>> eg_pif_phy0_rreg: 106 bytes, 13 callsites
>>>> eg_pif_phy0_wreg: 108 bytes, 13 callsites
>>>> eg_pif_phy1_rreg: 107 bytes, 13 callsites
>>>> eg_pif_phy1_wreg: 108 bytes, 13 callsites
>>>> rv370_pcie_rreg: 111 bytes, 21 callsites
>>>> rv370_pcie_wreg: 113 bytes, 24 callsites
>>>> r600_rcu_rreg: 111 bytes, 16 callsites
>>>> r600_rcu_wreg: 113 bytes, 25 callsites
>>>> cik_didt_rreg: 106 bytes, 10 callsites
>>>> cik_didt_wreg: 107 bytes, 10 callsites
>>>> tn_smc_rreg: 106 bytes, 126 callsites
>>>> tn_smc_wreg: 107 bytes, 116 callsites
>>>> eg_cg_rreg: 107 bytes, 20 callsites
>>>> eg_cg_wreg: 108 bytes, 52 callsites
>>
>>> Sorry, haven't noticed that before:
>>>
>>> radeon_device.c is most likely not the right place for the non-inlined
>>> functions. Please move them into the appropriate files for each
>>> generation.
>>
>> Will do (probably tomorrow, not today).
>
> Is this whole exercise really worthwhile?
> This will be the 3rd or 4th time these have been inlined/uninlined.

When code grows by 65000 bytes, there ought to be a good reason to inline.
I don't see one.

Let's take a look at what these functions actually do. cik_didt_wreg() is:

    spin_lock_irqsave(&rdev->didt_idx_lock, flags);
    WREG32(CIK_DIDT_IND_INDEX, (reg));
    WREG32(CIK_DIDT_IND_DATA, (v));
    spin_unlock_irqrestore(&rdev->didt_idx_lock, flags);

This compiles to (on defconfig + radeon enabled):

    55                      push   %rbp
    48 89 e5                mov    %rsp,%rbp
    48 83 ec 20             sub    $0x20,%rsp
    4c 89 65 e8             mov    %r12,-0x18(%rbp)
    4c 8d a7 cc 01 00 00    lea    0x1cc(%rdi),%r12
    48 89 5d e0             mov    %rbx,-0x20(%rbp)
    48 89 fb                mov    %rdi,%rbx
    4c 89 6d f0             mov    %r13,-0x10(%rbp)
    4c 89 75 f8             mov    %r14,-0x8(%rbp)
    4c 89 e7                mov    %r12,%rdi
    41 89 d6                mov    %edx,%r14d
    41 89 f5                mov    %esi,%r13d
    e8 20 6b 4d 00          callq  <_raw_spin_lock_irqsave>      //spin_lock_irqsave
    48 8b 93 d0 01 00 00    mov    0x1d0(%rbx),%rdx
    44 89 aa 00 ca 00 00    mov    %r13d,0xca00(%rdx)            //WREG32
    48 8b 93 d0 01 00 00    mov    0x1d0(%rbx),%rdx
    44 89 b2 04 ca 00 00    mov    %r14d,0xca04(%rdx)            //WREG32
    4c 89 e7                mov    %r12,%rdi
    48 89 c6                mov    %rax,%rsi
    e8 b9 69 4d 00          callq  <_raw_spin_unlock_irqrestore> //spin_unlock_irqrestore
    48 8b 5d e0             mov    -0x20(%rbp),%rbx
    4c 8b 65 e8             mov    -0x18(%rbp),%r12
    4c 8b 6d f0             mov    -0x10(%rbp),%r13
    4c 8b 75 f8             mov    -0x8(%rbp),%r14
    c9                      leaveq
    c3                      retq

    <_raw_spin_lock_irqsave>:
    55                      push   %rbp
    48 89 e5                mov    %rsp,%rbp
    9c                      pushfq
    58                      pop    %rax
    fa                      cli
    ba 00 01 00 00          mov    $0x100,%edx
    f0 66 0f c1 17          lock xadd %dx,(%rdi)                 // expensive
    0f b6 ce                movzbl %dh,%ecx
    38 d1                   cmp    %dl,%cl
    75 04                   jne    <_raw_spin_lock_irqsave+0x1c>
    5d                      pop    %rbp
    c3                      retq
    f3 90                   pause
    0f b6 17                movzbl (%rdi),%edx
    38 ca                   cmp    %cl,%dl
    75 f7                   jne    <_raw_spin_lock_irqsave+0x1a>
    5d                      pop    %rbp
    c3                      retq

    <_raw_spin_unlock_irqrestore>:
    55                      push   %rbp
    48 89 e5                mov    %rsp,%rbp
    80 07 01                addb   $0x1,(%rdi)
    56                      push   %rsi
    9d                      popfq                                //expensive
    5d                      pop    %rbp
    c3                      retq

Now, using the attached test program, I measure how long a call+ret pair takes:

    # ./timing_test64 callret
    400000000 loops in 0.71467s = 1.79 nsec/loop for callret

Unlocked read-modify-write memory operation:

    # ./timing_test64 or
    400000000 loops in 0.86119s = 2.15 nsec/loop for or

Locked read-modify-write memory operations:

    # ./timing_test64 lock_or
    100000000 loops in 0.68902s = 6.89 nsec/loop for lock_or
    # ./timing_test64 lock_xadd
    100000000 loops in 0.68582s = 6.86 nsec/loop for lock_xadd

And POPF:

    # ./timing_test64 popf
    100000000 loops in 0.68861s = 6.89 nsec/loop for popf

This is on a Sandy Bridge CPU with a cycle time of about 0.30 ns:

    # ./timing_test64 nothing
    2000000000 loops in 0.59716s = 0.30 nsec/loop for nothing

So, what do we see? call+ret takes 5 cycles. This is cheaper than one
unlocked RMW memory operation, which is 7 cycles. Locked RMW is 21 cycles
in the ideal case (this is what spin_lock_irqsave does). POPF is also
21 cycles (spin_unlock_irqrestore does this).

Add to this the two mmio accesses (easily 50s of cycles) and all the other
operations visible in the assembly code: 5 memory stores, 7 memory loads,
and two call+ret pairs.
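To put a very rough number on that, here is the back-of-the-envelope
arithmetic (the mmio and plain load/store costs are assumptions on my
part; the other figures are the measurements above):

    /* Back-of-the-envelope cost of one cik_didt_wreg() call, in ns.
     * Measured above: lock xadd ~6.86, popf ~6.89, one call+ret ~1.79.
     * ASSUMED: ~30 ns for the two WREG32 mmio writes and ~5 ns for the
     * dozen ordinary loads/stores. */
    #include <stdio.h>

    int main(void)
    {
            double lock_xadd = 6.86;   /* the lock xadd in spin_lock_irqsave */
            double popf      = 6.89;   /* the popf in spin_unlock_irqrestore */
            double callret   = 1.79;   /* one call+ret pair                  */
            double mmio      = 30.0;   /* assumed: two WREG32 accesses       */
            double rest      = 5.0;    /* assumed: plain loads/stores        */

            /* the two spinlock helpers are themselves called, so two more
             * call+ret pairs are part of the function body */
            double body = lock_xadd + popf + 2 * callret + mmio + rest;

            printf("extra call+ret: %.1f%% of ~%.0f ns per access\n",
                   100.0 * callret / body, body);
            return 0;
    }

With these numbers the extra call+ret comes out to roughly 3% of one
register access.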
I expect the overhead of the call+ret added by deinlining to be in the
1-4% range, even if you run a microbenchmark which does nothing but one
of these ops.
--
vda
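P.S. The attached timing_test64 is not reproduced here. For reference, a
minimal sketch of the kind of timing loop I mean could look like this
(the names and the clock_gettime-based timing are illustrative only, not
the actual attached program):

    /* Illustrative sketch only -- NOT the attached timing_test64.
     * Times N iterations of an empty loop and N iterations calling a
     * trivial non-inlined function, then reports the per-iteration
     * difference, i.e. the approximate cost of one call+ret pair. */
    #include <stdio.h>
    #include <time.h>

    __attribute__((noinline)) static void empty_func(void)
    {
            asm volatile("");              /* prevent the call from being elided */
    }

    static double elapsed_ns(long n, int do_call)
    {
            struct timespec a, b;
            long i;

            clock_gettime(CLOCK_MONOTONIC, &a);
            for (i = 0; i < n; i++) {
                    if (do_call)
                            empty_func();
                    else
                            asm volatile("");   /* keep the empty loop alive */
            }
            clock_gettime(CLOCK_MONOTONIC, &b);
            return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    }

    int main(void)
    {
            long n = 400000000;
            double base = elapsed_ns(n, 0);
            double call = elapsed_ns(n, 1);

            printf("%.2f nsec/loop for callret (empty loop: %.2f nsec/loop)\n",
                   (call - base) / n, base / n);
            return 0;
    }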