On 8/7/2019 3:56 PM, Xing Zhengjun wrote: > > > On 7/24/2019 1:17 PM, Xing Zhengjun wrote: >> >> >> On 7/12/2019 2:42 PM, Xing Zhengjun wrote: >>> Hi Trond, >>> >>>      I attached perf-profile part big changes, hope it is useful for >>> analyzing the issue. >> >> Ping... > > ping... > ping... >> >>> >>> >>> In testcase: fsmark >>> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz >>> with 384G memory >>> with following parameters: >>> >>>          iterations: 20x >>>          nr_threads: 64t >>>          disk: 1BRD_48G >>>          fs: xfs >>>          fs2: nfsv4 >>>          filesize: 4M >>>          test_size: 80G >>>          sync_method: fsyncBeforeClose >>>          cpufreq_governor: performance >>> >>> test-description: The fsmark is a file system benchmark to test >>> synchronous write workloads, for example, mail servers workload. >>> test-url: https://sourceforge.net/projects/fsmark/ >>> >>> commit: >>>    e791f8e938 ("SUNRPC: Convert xs_send_kvec() to use iov_iter_kvec()") >>>    0472e47660 ("SUNRPC: Convert socket page send code to use >>> iov_iter()") >>> >>> e791f8e9380d945e 0472e476604998c127f3c80d291 >>> ---------------- --------------------------- >>>           %stddev     %change         %stddev >>>               \          |                \ >>>      527.29           -22.6%     407.96        fsmark.files_per_sec >>>        1.97 ± 11%      +0.9        2.88 ±  4% >>> perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.cpuidle_enter_state.do_idle.cpu_startup_entry >>> >>>        0.00            +0.9        0.93 ±  4% >>> perf-profile.calltrace.cycles-pp.tcp_write_xmit.tcp_sendmsg_locked.tcp_sendmsg.sock_sendmsg.xs_sendpages >>> >>>        2.11 ± 10%      +0.9        3.05 ±  4% >>> perf-profile.calltrace.cycles-pp.apic_timer_interrupt.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary >>> >>>        5.29 ±  2%      +1.2        6.46 ±  7% >>> perf-profile.calltrace.cycles-pp.svc_recv.nfsd.kthread.ret_from_fork >>>        9.61 ±  5%      +3.1       12.70 ±  2% >>> perf-profile.calltrace.cycles-pp.worker_thread.kthread.ret_from_fork >>>        9.27 ±  5%      +3.1       12.40 ±  2% >>> perf-profile.calltrace.cycles-pp.process_one_work.worker_thread.kthread.ret_from_fork >>> >>>       34.52 ±  4%      +3.3       37.78 ±  2% >>> perf-profile.calltrace.cycles-pp.ret_from_fork >>>       34.52 ±  4%      +3.3       37.78 ±  2% >>> perf-profile.calltrace.cycles-pp.kthread.ret_from_fork >>>        0.00            +3.4        3.41 ±  4% >>> perf-profile.calltrace.cycles-pp.memcpy_erms.memcpy_from_page._copy_from_iter_full.tcp_sendmsg_locked.tcp_sendmsg >>> >>>        0.00            +3.4        3.44 ±  4% >>> perf-profile.calltrace.cycles-pp.memcpy_from_page._copy_from_iter_full.tcp_sendmsg_locked.tcp_sendmsg.sock_sendmsg >>> >>>        0.00            +3.5        3.54 ±  4% >>> perf-profile.calltrace.cycles-pp._copy_from_iter_full.tcp_sendmsg_locked.tcp_sendmsg.sock_sendmsg.xs_sendpages >>> >>>        2.30 ±  5%      +3.7        6.02 ±  3% >>> perf-profile.calltrace.cycles-pp.__rpc_execute.rpc_async_schedule.process_one_work.worker_thread.kthread >>> >>>        2.30 ±  5%      +3.7        6.02 ±  3% >>> perf-profile.calltrace.cycles-pp.rpc_async_schedule.process_one_work.worker_thread.kthread.ret_from_fork >>> >>>        1.81 ±  4%      +3.8        5.59 ±  4% >>> perf-profile.calltrace.cycles-pp.call_transmit.__rpc_execute.rpc_async_schedule.process_one_work.worker_thread >>> >>>        1.80 ±  3%      +3.8        5.59 ±  3% >>> perf-profile.calltrace.cycles-pp.xprt_transmit.call_transmit.__rpc_execute.rpc_async_schedule.process_one_work >>> >>>        1.73 ±  4%      +3.8        5.54 ±  4% >>> perf-profile.calltrace.cycles-pp.xs_tcp_send_request.xprt_transmit.call_transmit.__rpc_execute.rpc_async_schedule >>> >>>        1.72 ±  4%      +3.8        5.54 ±  4% >>> perf-profile.calltrace.cycles-pp.xs_sendpages.xs_tcp_send_request.xprt_transmit.call_transmit.__rpc_execute >>> >>>        0.00            +5.4        5.42 ±  4% >>> perf-profile.calltrace.cycles-pp.tcp_sendmsg_locked.tcp_sendmsg.sock_sendmsg.xs_sendpages.xs_tcp_send_request >>> >>>        0.00            +5.5        5.52 ±  4% >>> perf-profile.calltrace.cycles-pp.tcp_sendmsg.sock_sendmsg.xs_sendpages.xs_tcp_send_request.xprt_transmit >>> >>>        0.00            +5.5        5.53 ±  4% >>> perf-profile.calltrace.cycles-pp.sock_sendmsg.xs_sendpages.xs_tcp_send_request.xprt_transmit.call_transmit >>> >>>        9.61 ±  5%      +3.1       12.70 ±  2% >>> perf-profile.children.cycles-pp.worker_thread >>>        9.27 ±  5%      +3.1       12.40 ±  2% >>> perf-profile.children.cycles-pp.process_one_work >>>        6.19            +3.2        9.40 ±  4% >>> perf-profile.children.cycles-pp.memcpy_erms >>>       34.53 ±  4%      +3.3       37.78 ±  2% >>> perf-profile.children.cycles-pp.ret_from_fork >>>       34.52 ±  4%      +3.3       37.78 ±  2% >>> perf-profile.children.cycles-pp.kthread >>>        0.00            +3.5        3.46 ±  4% >>> perf-profile.children.cycles-pp.memcpy_from_page >>>        0.00            +3.6        3.56 ±  4% >>> perf-profile.children.cycles-pp._copy_from_iter_full >>>        2.47 ±  4%      +3.7        6.18 ±  3% >>> perf-profile.children.cycles-pp.__rpc_execute >>>        2.30 ±  5%      +3.7        6.02 ±  3% >>> perf-profile.children.cycles-pp.rpc_async_schedule >>>        1.90 ±  4%      +3.8        5.67 ±  3% >>> perf-profile.children.cycles-pp.call_transmit >>>        1.89 ±  3%      +3.8        5.66 ±  3% >>> perf-profile.children.cycles-pp.xprt_transmit >>>        1.82 ±  4%      +3.8        5.62 ±  3% >>> perf-profile.children.cycles-pp.xs_tcp_send_request >>>        1.81 ±  4%      +3.8        5.62 ±  3% >>> perf-profile.children.cycles-pp.xs_sendpages >>>        0.21 ± 17%      +5.3        5.48 ±  4% >>> perf-profile.children.cycles-pp.tcp_sendmsg_locked >>>        0.25 ± 18%      +5.3        5.59 ±  3% >>> perf-profile.children.cycles-pp.tcp_sendmsg >>>        0.26 ± 16%      +5.3        5.60 ±  3% >>> perf-profile.children.cycles-pp.sock_sendmsg >>>        1.19 ±  5%      +0.5        1.68 ±  3% >>> perf-profile.self.cycles-pp.get_page_from_freelist >>>        6.10            +3.2        9.27 ±  4% >>> perf-profile.self.cycles-pp.memcpy_erms >>> >>> >>> On 7/9/2019 10:39 AM, Xing Zhengjun wrote: >>>> Hi Trond, >>>> >>>> On 7/8/2019 7:44 PM, Trond Myklebust wrote: >>>>> I've asked several times now about how to interpret your results. >>>>> As far as I can tell from your numbers, the overhead appears to be >>>>> entirely contained in the NUMA section of your results. >>>>> IOW: it would appear to be a scheduling overhead due to NUMA. I've >>>>> been asking whether or not that is a correct interpretation of the >>>>> numbers you published. >>>> Thanks for your feedback. I used the same hardware and the same test >>>> parameters to test the two commits: >>>>     e791f8e938 ("SUNRPC: Convert xs_send_kvec() to use >>>> iov_iter_kvec()") >>>>     0472e47660 ("SUNRPC: Convert socket page send code to use >>>> iov_iter()") >>>> >>>> If it is caused by NUMA, why only commit 0472e47660 throughput is >>>> decreased? The filesystem we test is NFS, commit 0472e47660 is >>>> related with the network, could you help to check if have any other >>>> clues for the regression. Thanks. >>>> >>> >> > -- Zhengjun Xing