On 05/19/2015 01:06 AM, Deucher, Alexander wrote:
>> -----Original Message-----
>> From: Denys Vlasenko [mailto:vda.linux@googlemail.com]
>> Sent: Monday, May 18, 2015 6:50 PM
>> To: Koenig, Christian
>> Cc: Denys Vlasenko; Deucher, Alexander; Linux Kernel Mailing List
>> Subject: Re: [PATCH v2] radeon: Deinline indirect register accessor functions
>>
>> On Mon, May 18, 2015 at 9:09 PM, Christian König
>> wrote:
>>>> r600_uvd_ctx_rreg: 111 bytes, 4 callsites
>>>> r600_uvd_ctx_wreg: 113 bytes, 5 callsites
>>>> eg_pif_phy0_rreg: 106 bytes, 13 callsites
>>>> eg_pif_phy0_wreg: 108 bytes, 13 callsites
>>>> eg_pif_phy1_rreg: 107 bytes, 13 callsites
>>>> eg_pif_phy1_wreg: 108 bytes, 13 callsites
>>>> rv370_pcie_rreg: 111 bytes, 21 callsites
>>>> rv370_pcie_wreg: 113 bytes, 24 callsites
>>>> r600_rcu_rreg: 111 bytes, 16 callsites
>>>> r600_rcu_wreg: 113 bytes, 25 callsites
>>>> cik_didt_rreg: 106 bytes, 10 callsites
>>>> cik_didt_wreg: 107 bytes, 10 callsites
>>>> tn_smc_rreg: 106 bytes, 126 callsites
>>>> tn_smc_wreg: 107 bytes, 116 callsites
>>>> eg_cg_rreg: 107 bytes, 20 callsites
>>>> eg_cg_wreg: 108 bytes, 52 callsites
>>
>>> Sorry, haven't noticed that before:
>>>
>>> radeon_device.c is most likely not the right place for the non-inlined
>>> functions. Please move them into the appropriate files for each
>>> generation.
>>
>> Will do (probably tomorrow, not today).
>
> Is this whole exercise really worthwhile?
> This will be the 3rd or 4th time these have been inlined/uninlined.

When code grows by 65000 bytes, there ought to be a good reason to inline.
I don't see one.

Let's take a look at what these functions actually do. cik_didt_wreg() is:

    spin_lock_irqsave(&rdev->didt_idx_lock, flags);
    WREG32(CIK_DIDT_IND_INDEX, (reg));
    WREG32(CIK_DIDT_IND_DATA, (v));
    spin_unlock_irqrestore(&rdev->didt_idx_lock, flags);

This compiles to (on defconfig + radeon enabled):

    55                      push   %rbp
    48 89 e5                mov    %rsp,%rbp
    48 83 ec 20             sub    $0x20,%rsp
    4c 89 65 e8             mov    %r12,-0x18(%rbp)
    4c 8d a7 cc 01 00 00    lea    0x1cc(%rdi),%r12
    48 89 5d e0             mov    %rbx,-0x20(%rbp)
    48 89 fb                mov    %rdi,%rbx
    4c 89 6d f0             mov    %r13,-0x10(%rbp)
    4c 89 75 f8             mov    %r14,-0x8(%rbp)
    4c 89 e7                mov    %r12,%rdi
    41 89 d6                mov    %edx,%r14d
    41 89 f5                mov    %esi,%r13d
    e8 20 6b 4d 00          callq  <_raw_spin_lock_irqsave>      //spin_lock_irqsave
    48 8b 93 d0 01 00 00    mov    0x1d0(%rbx),%rdx
    44 89 aa 00 ca 00 00    mov    %r13d,0xca00(%rdx)            //WREG32
    48 8b 93 d0 01 00 00    mov    0x1d0(%rbx),%rdx
    44 89 b2 04 ca 00 00    mov    %r14d,0xca04(%rdx)            //WREG32
    4c 89 e7                mov    %r12,%rdi
    48 89 c6                mov    %rax,%rsi
    e8 b9 69 4d 00          callq  <_raw_spin_unlock_irqrestore> //spin_unlock_irqrestore
    48 8b 5d e0             mov    -0x20(%rbp),%rbx
    4c 8b 65 e8             mov    -0x18(%rbp),%r12
    4c 8b 6d f0             mov    -0x10(%rbp),%r13
    4c 8b 75 f8             mov    -0x8(%rbp),%r14
    c9                      leaveq
    c3                      retq

    <_raw_spin_lock_irqsave>:
    55                      push   %rbp
    48 89 e5                mov    %rsp,%rbp
    9c                      pushfq
    58                      pop    %rax
    fa                      cli
    ba 00 01 00 00          mov    $0x100,%edx
    f0 66 0f c1 17          lock xadd %dx,(%rdi)                 // expensive
    0f b6 ce                movzbl %dh,%ecx
    38 d1                   cmp    %dl,%cl
    75 04                   jne    <_raw_spin_lock_irqsave+0x1c>
    5d                      pop    %rbp
    c3                      retq
    f3 90                   pause
    0f b6 17                movzbl (%rdi),%edx
    38 ca                   cmp    %cl,%dl
    75 f7                   jne    <_raw_spin_lock_irqsave+0x1a>
    5d                      pop    %rbp
    c3                      retq

    <_raw_spin_unlock_irqrestore>:
    55                      push   %rbp
    48 89 e5                mov    %rsp,%rbp
    80 07 01                addb   $0x1,(%rdi)
    56                      push   %rsi
    9d                      popfq                                //expensive
    5d                      pop    %rbp
    c3                      retq

Now, using the attached test program, I measure how long a call+ret pair takes:

    # ./timing_test64 callret
    400000000 loops in 0.71467s = 1.79 nsec/loop for callret

Unlocked read-modify-write memory operation:

    # ./timing_test64 or
    400000000 loops in 0.86119s = 2.15 nsec/loop for or

Locked read-modify-write memory operations:

    # ./timing_test64 lock_or
    100000000 loops in 0.68902s = 6.89 nsec/loop for lock_or
    # ./timing_test64 lock_xadd
    100000000 loops in 0.68582s = 6.86 nsec/loop for lock_xadd

And POPF:

    # ./timing_test64 popf
    100000000 loops in 0.68861s = 6.89 nsec/loop for popf

This is on a Sandy Bridge CPU with a cycle time of about 0.30 ns:

    # ./timing_test64 nothing
    2000000000 loops in 0.59716s = 0.30 nsec/loop for nothing

So, what do we see? call+ret takes 5 cycles. This is cheaper than one
unlocked RMW memory operation, which is 7 cycles. Locked RMW is 21 cycles
in the ideal case (this is what spin_lock_irqsave does). POPF is also
21 cycles (spin_unlock_irqrestore does this).

Add to this the two mmio accesses (easily 50s of cycles) and all the other
operations visible in the assembly code: 5 memory stores, 7 memory loads,
and two call+ret pairs.
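To put a very rough number on that, here is the back-of-the-envelope
arithmetic (the mmio and plain load/store costs are assumptions on my
part; the other figures are the measurements above):

    /* Back-of-the-envelope cost of one cik_didt_wreg() call, in ns.
     * Measured above: lock xadd ~6.86, popf ~6.89, one call+ret ~1.79.
     * ASSUMED: ~30 ns for the two WREG32 mmio writes and ~5 ns for the
     * dozen ordinary loads/stores. */
    #include <stdio.h>

    int main(void)
    {
            double lock_xadd = 6.86;   /* the lock xadd in spin_lock_irqsave */
            double popf      = 6.89;   /* the popf in spin_unlock_irqrestore */
            double callret   = 1.79;   /* one call+ret pair                  */
            double mmio      = 30.0;   /* assumed: two WREG32 accesses       */
            double rest      = 5.0;    /* assumed: plain loads/stores        */

            /* the two spinlock helpers are themselves called, so two more
             * call+ret pairs are part of the function body */
            double body = lock_xadd + popf + 2 * callret + mmio + rest;

            printf("extra call+ret: %.1f%% of ~%.0f ns per access\n",
                   100.0 * callret / body, body);
            return 0;
    }

With these numbers the extra call+ret comes out to roughly 3% of one
register access.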
I expect the overhead of the call+ret added by deinlining to be in the
1-4% range, even if you run a microbenchmark which does nothing but one
of these ops.
--
vda
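P.S. The attached timing_test64 is not reproduced here. For reference, a
minimal sketch of the kind of timing loop I mean could look like this
(the names and the clock_gettime-based timing are illustrative only, not
the actual attached program):

    /* Illustrative sketch only -- NOT the attached timing_test64.
     * Times N iterations of an empty loop and N iterations calling a
     * trivial non-inlined function, then reports the per-iteration
     * difference, i.e. the approximate cost of one call+ret pair. */
    #include <stdio.h>
    #include <time.h>

    __attribute__((noinline)) static void empty_func(void)
    {
            asm volatile("");              /* prevent the call from being elided */
    }

    static double elapsed_ns(long n, int do_call)
    {
            struct timespec a, b;
            long i;

            clock_gettime(CLOCK_MONOTONIC, &a);
            for (i = 0; i < n; i++) {
                    if (do_call)
                            empty_func();
                    else
                            asm volatile("");   /* keep the empty loop alive */
            }
            clock_gettime(CLOCK_MONOTONIC, &b);
            return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    }

    int main(void)
    {
            long n = 400000000;
            double base = elapsed_ns(n, 0);
            double call = elapsed_ns(n, 1);

            printf("%.2f nsec/loop for callret (empty loop: %.2f nsec/loop)\n",
                   (call - base) / n, base / n);
            return 0;
    }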