* tuning linux for high network performance?
@ 2002-10-23 10:18 Roy Sigurd Karlsbakk
  2002-10-23 11:06 ` [RESEND] " Roy Sigurd Karlsbakk
  0 siblings, 1 reply; 39+ messages in thread
From: Roy Sigurd Karlsbakk @ 2002-10-23 10:18 UTC (permalink / raw)
  To: netdev; +Cc: Kernel mailing list

hi

I've got this video server serving video for VoD. The problem is that the 1.8 GHz 
P4 seems to be maxed out by a few system calls. The output below is for ~50 clients, 
each streaming at ~4.5 Mbps. If I try to increase this to ~70 clients, the CPU maxes out.

Does anyone have an idea?

bash-2.05# readprofile | sort -rn +2 | head -30
154203 default_idle                             2409.4219
212723 csum_partial_copy_generic                916.9095
100164 handle_IRQ_event                         695.5833
 24979 system_call                              390.2969
 37300 e1000_intr                               388.5417
119699 ide_intr                                 340.0540
 30598 skb_release_data                         273.1964
 40740 do_softirq                               195.8654
131818 do_wp_page                               164.7725
  9935 fget                                     155.2344
 24747 kfree                                    154.6687
 10911 del_timer                                113.6562
 11683 ip_conntrack_find_get                     91.2734
  4120 sock_poll                                 85.8333
  9357 ip_ct_find_proto                          83.5446
  5194 sock_wfree                                81.1562
  4929 add_wait_queue                            77.0156
  8361 flush_tlb_page                            74.6518
  4571 remove_wait_queue                         71.4219
  2191 __brelse                                  68.4688
 29477 skb_clone                                 68.2338
  8562 do_gettimeofday                           59.4583
  5673 process_timeout                           59.0938
 11097 tcp_v4_send_check                         57.7969
  6124 kfree_skbmem                              54.6786
 17115 tcp_poll                                  53.4844
 21130 nf_hook_slow                              52.8250
  8299 ip_ct_refresh                             51.8687
 15429 __kfree_skb                               50.7533
  1059 lru_cache_del                             46.0435


roy
-- 
Roy Sigurd Karlsbakk, Datavaktmester
ProntoTV AS - http://www.pronto.tv/
Tel: +47 9801 3356

Computers are like air conditioners.
They stop working when you open Windows.



* [RESEND] tuning linux for high network performance?
  2002-10-23 10:18 tuning linux for high network performance? Roy Sigurd Karlsbakk
@ 2002-10-23 11:06 ` Roy Sigurd Karlsbakk
  2002-10-23 13:01   ` bert hubert
  2002-10-23 18:01   ` [RESEND] tuning linux for high network performance? Denis Vlasenko
  0 siblings, 2 replies; 39+ messages in thread
From: Roy Sigurd Karlsbakk @ 2002-10-23 11:06 UTC (permalink / raw)
  To: netdev; +Cc: Kernel mailing list

> I've got this video server serving video for VoD. problem is the P4 1.8
> seems to be maxed out by a few system calls. The below output is for ~50
> clients streaming at ~4.5Mbps. if trying to increase this to ~70, the CPU
> maxes out.
>
> Does anyone have an idea?

...adding the whole profile output - sorted by the first column this time...

905182 total                                      0.4741
121426 csum_partial_copy_generic                474.3203
 93633 default_idle                             1800.6346
 74665 do_wp_page                               111.1086
 65857 ide_intr                                 184.9916
 53636 handle_IRQ_event                         432.5484
 21973 do_softirq                               107.7108
 20498 e1000_intr                               244.0238
 19800 do_page_fault                             16.8081
 19395 skb_clone                                 45.7429
 14564 system_call                              260.0714
 13592 kfree                                     89.4211
 13557 skb_release_data                         116.8707
 13025 ide_do_request                            17.6970
 12988 do_rw_disk                                 8.4557
 11841 tcp_sendmsg                                2.6814
 11720 nf_hook_slow                              29.0099
 11712 tcp_poll                                  34.0465
 10688 schedule                                   7.8588
 10386 __kfree_skb                               34.1645
 10052 ipt_do_table                              10.1741
  8286 fget                                     115.0833
  7436 tcp_v4_send_check                         44.2619
  7191 e1000_clean_tx_irq                        16.6458
  7031 kmalloc                                   18.1211
  6610 tcp_write_xmit                             9.3892
  6241 tcp_clean_rtx_queue                        8.0425
  6232 ip_conntrack_find_get                     51.9333
  6140 ide_dmaproc                                8.4341
  6125 tcp_packet                                14.0482
  5858 qdisc_restart                             15.4158
  5734 e1000_xmit_frame                           5.6660
  5709 tcp_v4_rcv                                 3.7363
  5703 sys_rt_sigprocmask                        11.4060
  5445 tcp_transmit_skb                           3.7500
  5273 alloc_skb                                 11.8761
  4961 ide_wait_stat                             18.7917
  4790 ip_ct_find_proto                          44.3519
  4782 add_timer                                 18.3923
  4760 ip_ct_refresh                             29.7500
  4729 do_anonymous_page                         17.6455
  4616 e1000_clean_rx_irq                         4.9106
  4464 do_gettimeofday                           37.2000
  4359 flush_tlb_page                            38.9196
  4209 ip_finish_output2                         16.4414
  3731 get_hash_table                            23.3188
  3714 eth_type_trans                            21.1023
  3712 __make_request                             2.3375
  3680 __ip_conntrack_find                       12.7778
  3480 ip_route_input                             9.1579
  3363 kfree_skbmem                              32.3365
  3295 __switch_to                               15.2546
  3205 fput                                      13.1352
  3143 rmqueue                                    5.3452
  3137 ip_conntrack_in                            5.0272
  3008 sync_timers                              250.6667
  2861 sock_wfree                                47.6833
  2580 ip_queue_xmit                              2.0347
  2578 process_timeout                           26.8542
  2577 netif_rx                                   6.1357
  2555 get_user_pages                             6.2623
  2504 sock_poll                                 62.6000
  2346 ide_build_sglist                           5.6394
  2316 brw_kiovec                                 2.5619
  2256 csum_partial                               7.8333
  2251 ip_queue_xmit2                             4.2958
  2198 start_request                              4.0404
  2186 dev_queue_xmit                             2.7462
  2167 timer_bh                                   2.2203
  2162 __free_pages_ok                            3.1608
  2157 zap_page_range                             2.4400
  1942 mark_dirty_kiobuf                         21.1087
  1733 process_backlog                            5.9349
  1719 tcp_rcv_established                        0.8493
  1689 add_wait_queue                            32.4808
  1650 mod_timer                                  6.1567
  1603 wait_kio                                  17.4239
  1575 net_rx_action                              4.8611
  1554 get_pid                                    4.0052
  1434 lru_cache_add                             12.8036
  1429 handle_mm_fault                            7.7663
  1397 ip_local_deliver_finish                    4.5357
  1357 nf_iterate                                10.2803
  1350 e1000_alloc_rx_buffers                     5.2734
  1298 do_select                                  2.5155
  1268 unlock_page                               12.1923
  1209 submit_bh                                 10.7946
  1184 add_entropy_words                          5.9200
  1175 __brelse                                  36.7188
  1125 __pollwait                                 7.8125
  1108 shrink_list                                5.7708
  1099 generic_make_request                       3.6151
  1080 __free_pages                              33.7500
  1052 tcp_ack                                    1.2524
  1020 ip_rcv                                     1.0851
   986 raid0_make_request                         2.9345
   898 ext3_direct_io_get_block                   4.7766
   883 pfifo_fast_dequeue                        11.6184
   863 sys_gettimeofday                           5.5321
   828 tcp_ack_update_window                      3.9808
   813 ipt_local_out_hook                         7.8173
   761 __lru_cache_del                            6.5603
   756 sys_write                                  2.9531
   742 __rdtsc_delay                             26.5000
   730 uhci_interrupt                             3.3182
   718 net_tx_action                              2.5643
   710 batch_entropy_store                        3.9444
   701 add_timer_randomness                       3.3066
   666 tasklet_hi_action                          4.1625
   662 sys_nanosleep                              1.7796
   635 set_page_dirty                             5.4741
   627 __tcp_data_snd_check                       3.1350
   611 netif_receive_skb                          1.9094
   601 pfifo_fast_enqueue                         5.3661
   590 del_timer_sync                             4.3382
   587 lru_cache_del                             26.6818
   574 get_unmapped_area                          2.1103
   561 wait_for_tcp_memory                        0.7835
   557 ip_refrag                                  5.8021
   546 ip_conntrack_local                         6.2045
   515 sys_select                                 0.4486
   507 __tcp_select_window                        2.2634
   481 ext3_get_branch                            2.2689
   433 ip_output                                  1.2301
   389 ip_confirm                                 9.7250
   384 find_vma                                   4.5714
   379 set_bh_page                                9.4750
   376 tcp_v4_do_rcv                              1.0108
   370 tcp_ack_no_tstamp                          1.4453
   368 batch_entropy_process                      2.0000
   365 ide_build_dmatable                         1.1551
   364 ip_rcv_finish                              0.7696
   363 kmem_cache_free                            2.8359
   362 __wake_up                                  1.8854
   338 ext3_get_block_handle                      0.5152
   336 inet_sendmsg                               5.2500
   318 bh_action                                  2.3382
   298 tcp_data_queue                             0.1064
   272 md_make_request                            2.6154
   272 ext3_block_to_path                         0.9714
   248 sock_sendmsg                               1.8235
   244 __alloc_pages                              0.6932
   238 kmem_cache_alloc                           0.7532
   226 __free_pte                                 3.1389
   225 tcp_ack_probe                              1.3720
   224 __run_task_queue                           2.0741
   213 ide_get_queue                              5.3250
   209 __ip_ct_find_proto                         4.3542
   202 get_sample_stats                           1.7414
   195 tcp_write_space                            1.6810
   185 schedule_timeout                           1.1859
   184 do_signal                                  0.2992
   182 ipt_hook                                   4.5500
   171 generic_direct_IO                          0.5552
   168 can_share_swap_page                        1.8261
   157 ip_local_deliver                           0.3965
   155 get_conntrack_index                        2.7679
   145 tcp_push_one                               0.5664
   145 ide_set_handler                            1.2500
   144 pipe_poll                                  1.4400
   143 max_select_fd                              0.8938
   139 pte_alloc                                  0.5792
   138 del_timer                                  1.6429
   136 sock_write                                 0.7234
   131 poll_freewait                              1.9265
   131 getblk                                     1.7237
   130 send_sig_info                              0.8553
   128 __release_sock                             1.4545
   125 ret_from_sys_call                          7.3529
   120 ext3_direct_IO                             0.1807
   118 tcp_pkt_to_tuple                           3.6875
   117 find_vma_prev                              0.6648
   114 do_no_page                                 0.2298
   112 tqueue_bh                                  4.0000
   112 follow_page                                1.0769
   110 bread                                      1.1000
   108 e1000_rx_checksum                          1.2273
   107 generic_file_direct_IO                     0.1938
   102 add_interrupt_randomness                   2.5500
    97 remove_wait_queue                          1.7321
    96 mark_page_accessed                         2.0000
    91 kill_something_info                        0.2645
    85 invert_tuple                               1.9318
    81 exit_notify                                0.1151
    81 cpu_idle                                   0.9643
    80 tcp_new_space                              0.6061
    79 nf_register_queue_handler                  0.5197
    75 uhci_remove_pending_qhs                    0.3906
    69 pdc202xx_dmaproc                           0.1250
    68 sys_read                                   0.2656
    68 nf_reinject                                0.1491
    66 map_user_kiobuf                            0.2619
    65 find_vma_prepare                           0.6500
    64 generic_file_read                          0.2353
    61 check_pgt_cache                            2.5417
    60 free_pages                                 1.8750
    58 error_code                                 0.9667
    57 vm_enough_memory                           0.5481
    56 __delay                                    1.4000
    55 __const_udelay                             1.0577
    53 tcp_ioctl                                  0.0908
    53 journal_commit_transaction                 0.0132
    53 do_munmap                                  0.0901
    52 _alloc_pages                               2.1667
    51 uhci_finish_completion                     0.4554
    51 credit_entropy_store                       1.1591
    50 rh_report_status                           0.1953
    50 free_page_and_swap_cache                   0.8929
    49 sys_rt_sigsuspend                          0.1750
    49 nr_free_pages                              0.6125
    49 do_mmap_pgoff                              0.0394
    48 e1000_update_stats                         0.0307
    48 do_get_write_access                        0.0366
    48 __journal_file_buffer                      0.0916
    48 __get_free_pages                           2.0000
    48 .text.lock.e1000_main                      1.7143
    47 expand_kiobuf                              0.3092
    46 uhci_free_pending_qhs                      0.4600
    46 tcp_parse_options                          0.0833
    46 kmem_cache_size                            5.7500
    45 rb_erase                                   0.2083
    44 unmap_kiobuf                               0.6111
    41 tcp_cwnd_application_limited               0.3106
    41 rh_int_timer_do                            0.1165
    41 init_or_cleanup                            0.1424
    40 sync_unlocked_inodes                       0.0901
    40 init_buffer                                1.4286
    39 .text.lock.ip_input                        1.0000
    38 vma_merge                                  0.1301
    38 pfifo_fast_requeue                         0.6786
    38 ip_conntrack_get                           0.9500
    38 dev_watchdog                               0.2209
    37 .text.lock.ip_output                       0.2803
    36 do_check_pgt_cache                         0.1731
    35 tcp_retrans_try_collapse                   0.0576
    35 journal_add_journal_head                   0.1306
    34 ext3_get_inode_loc                         0.0914
    33 journal_write_revoke_records               0.1964
    32 fsync_buffers_list                         0.0860
    31 filemap_fdatasync                          0.1615
    31 __pmd_alloc                                1.5500
    30 sys_wait4                                  0.0305
    30 restore_sigcontext                         0.0949
    29 sys_sigreturn                              0.1169
    28 tcp_fastretrans_alert                      0.0224
    28 do_settimeofday                            0.1628
    28 do_ide_request                             1.4000
    27 unmap_fixup                                0.0785
    27 find_extend_vma                            0.1350
    27 eth_header_parse                           0.8438
    27 current_capacity                           0.6750
    26 save_i387                                  0.0478
    26 __journal_clean_checkpoint_list            0.2407
    25 update_atime                               0.3125
    25 tcp_v4_destroy_sock                        0.0718
    25 link_path_walk                             0.0102
    25 buffer_insert_inode_queue                  0.2841
    25 __journal_unfile_buffer                    0.0665
    24 sys_mmap2                                  0.1622
    24 rh_send_irq                                0.0896
    24 rb_insert_color                            0.1224
    24 ext3_do_update_inode                       0.0261
    24 balance_dirty_state                        0.3158
    24 add_wait_queue_exclusive                   0.4615
    24 __try_to_free_cp_buf                       0.4000
    23 free_kiobuf_bhs                            0.2396
    22 tcp_rcv_synsent_state_process              0.0169
    22 sys_munmap                                 0.2619
    22 start_this_handle                          0.0598
    22 sock_rfree                                 1.3750
    22 setup_sigcontext                           0.0743
    22 flush_tlb_mm                               0.1964
    22 do_exit                                    0.0301
    22 alloc_kiobuf_bhs                           0.1170
    22 __rb_erase_color                           0.0567
    21 tcp_mem_schedule                           0.0477
    21 setup_frame                                0.0482
    21 __generic_copy_to_user                     0.3500
    20 unlock_buffer                              0.3125
    20 journal_write_metadata_buffer              0.0240
    20 d_lookup                                   0.0704
    20 copy_skb_header                            0.0980
    19 sync_old_buffers                           0.1218
    19 sock_mmap                                  0.4750
    19 skb_split                                  0.0344
    19 select_bits_alloc                          0.7917
    19 get_info_ptr                               0.2065
    17 tcp_write_wakeup                           0.0363
    17 ret_from_exception                         0.6800
    17 kiobuf_wait_for_io                         0.1062
    17 journal_unlock_journal_head                0.1518
    17 bad_signal                                 0.1250
    16 tcp_probe_timer                            0.0952
    16 tcp_close                                  0.0083
    16 ip_route_output_slow                       0.0099
    16 __mark_inode_dirty                         0.0952
    16 .text.lock.timer                           0.1250
    16 .text.lock.tcp                             0.0152
    15 journal_cancel_revoke                      0.0765
    15 ext3_bmap                                  0.1500
    15 do_fork                                    0.0074
    15 blk_grow_request_list                      0.0833
    14 tcp_v4_conn_request                        0.0145
    14 sync_supers                                0.0507
    14 log_start_commit                           0.0946
    14 lock_vma_mappings                          0.3500
    14 journal_dirty_metadata                     0.0354
    14 file_read_actor                            0.0625
    14 __insert_vm_struct                         0.1400
    13 tcp_time_to_recover                        0.0290
    13 sys_ioctl                                  0.0259
    13 lookup_swap_cache                          0.1625
    13 ip_build_xmit_slow                         0.0099
    13 invalidate_inode_pages                     0.0739
    13 ext3_dirty_inode                           0.0478
    13 bmap                                       0.2955
    12 tcp_collapse                               0.0143
    12 sys_socketcall                             0.0234
    12 put_filp                                   0.1364
    12 make_pages_present                         0.0968
    12 journal_get_write_access                   0.1304
    12 generic_file_write                         0.0061
    12 e1000_ioctl                                0.3333
    11 uhci_transfer_result                       0.0316
    11 tcp_try_to_open                            0.0348
    11 tcp_recvmsg                                0.0045
    11 tcp_create_openreq_child                   0.0092
    11 sys_kill                                   0.1250
    11 schedule_tail                              0.0786
    11 osync_buffers_list                         0.0859
    11 journal_stop                               0.0255
    11 do_sigpending                              0.0887
    10 tcp_unhash                                 0.0397
    10 tcp_send_probe0                            0.0424
    10 tcp_rcv_state_process                      0.0040
    10 sys_poll                                   0.0138
    10 inet_shutdown                              0.0208
    10 execute_drive_cmd                          0.0221
    10 __put_unused_buffer_head                   0.1136
     9 tcp_write_timer                            0.0395
     9 tcp_send_skb                               0.0191
     9 tcp_make_synack                            0.0082
     9 set_buffer_flushtime                       0.4500
     9 raid0_status                               0.2045
     9 copy_page_range                            0.0205
     8 kupdate                                    0.0274
     8 journal_get_descriptor_buffer              0.0741
     8 get_empty_filp                             0.0253
     8 ext3_write_super                           0.0741
     8 count_active_tasks                         0.1111
     8 atomic_dec_and_lock                        0.1111
     8 __lock_page                                0.0400
     8 __journal_remove_journal_head              0.0250
     8 __ip_conntrack_confirm                     0.0115
     8 __block_prepare_write                      0.0105
     7 tcp_invert_tuple                           0.2188
     7 ports_active                               0.1346
     7 pipe_write                                 0.0112
     7 kjournald                                  0.0130
     7 handle_signal                              0.0273
     7 grow_buffers                               0.0254
     7 ext3_get_block                             0.0700
     7 balance_classzone                          0.0151
     7 __jbd_kmalloc                              0.0625
     7 .text.lock.swap                            0.1296
     6 vsnprintf                                  0.0057
     6 tcp_v4_send_reset                          0.0176
     6 tcp_accept                                 0.0105
     6 sleep_on                                   0.0500
     6 select_bits_free                           0.3750
     6 pipe_read                                  0.0118
     6 number                                     0.0055
     6 ip_route_output_key                        0.0165
     6 inet_accept                                0.0136
     6 get_unused_buffer_head                     0.0375
     6 dput                                       0.0176
     6 cleanup_rbuf                               0.0273
     6 __journal_remove_checkpoint                0.0556
     6 __journal_drop_transaction                 0.0087
     6 __find_get_page                            0.0938
     6 .text.lock.netfilter                       0.0260
     5 vmtruncate_list                            0.0625
     5 vfs_permission                             0.0208
     5 tcp_v4_hnd_req                             0.0147
     5 tcp_init_cwnd                              0.0500
     5 tcp_check_urg                              0.0158
     5 tcp_check_sack_reneging                    0.0240
     5 sys_fork                                   0.1786
     5 sock_setsockopt                            0.0034
     5 sock_init_data                             0.0161
     5 sock_def_readable                          0.0521
     5 release_x86_irqs                           0.0595
     5 release_task                               0.0109
     5 refile_buffer                              0.1389
     5 pipe_release                               0.0368
     5 path_init                                  0.0129
     5 nr_free_buffer_pages                       0.0625
     5 mprotect_fixup                             0.0043
     5 log_space_left                             0.1562
     5 ll_rw_block                                0.0119
     5 journal_start                              0.0272
     5 init_bh                                    0.2083
     5 get_zeroed_page                            0.1389
     5 ext3_commit_write                          0.0078
     5 e1000_tx_timeout                           0.2500
     5 do_poll                                    0.0227
     5 bdfind                                     0.1389
     5 add_keyboard_randomness                    0.1250
     5 __wait_on_buffer                           0.0338
     5 __vma_link                                 0.0284
     5 __tcp_mem_reclaim                          0.0595
     5 __rb_rotate_left                           0.0781
     4 write_profile                              0.0244
     4 tcp_v4_syn_recv_sock                       0.0064
     4 tcp_v4_search_req                          0.0278
     4 tcp_v4_route_req                           0.0192
     4 tcp_v4_init_sock                           0.0169
     4 tcp_cwnd_restart                           0.0263
     4 tcp_check_req                              0.0043
     4 tcp_check_reno_reordering                  0.0500
     4 sys_mprotect                               0.0078
     4 strncpy_from_user                          0.0500
     4 sock_def_wakeup                            0.0625
     4 sock_alloc                                 0.0208
     4 skb_copy_datagram_iovec                    0.0071
     4 lookup_mnt                                 0.0476
     4 locks_remove_posix                         0.0096
     4 invalidate_inode_buffers                   0.0370
     4 init_conntrack                             0.0043
     4 halfMD4Transform                           0.0068
     4 find_or_create_page                        0.0164
     4 filp_close                                 0.0238
     4 ext3_reserve_inode_write                   0.0233
     4 ext3_find_goal                             0.0213
     4 do_fcntl                                   0.0059
     4 dnotify_flush                              0.0345
     4 d_alloc                                    0.0105
     4 add_blkdev_randomness                      0.0526
     4 _stext                                     0.0500
     4 __journal_insert_checkpoint                0.0167
     4 __find_lock_page_helper                    0.0323
     4 .text.lock.inode                           0.0086
     3 wait_for_tcp_connect                       0.0054
     3 tcp_v4_get_port                            0.0045
     3 tcp_put_port                               0.0150
     3 tcp_init_xmit_timers                       0.0221
     3 tcp_clear_xmit_timers                      0.0234
     3 tcp_add_reno_sack                          0.0357
     3 sys_sched_getscheduler                     0.0288
     3 sys_fcntl64                                0.0221
     3 sys_accept                                 0.0119
     3 sock_ioctl                                 0.0268
     3 sock_fasync                                0.0038
     3 sock_def_error_report                      0.0312
     3 rt_check_expire__thr                       0.0077
     3 rh_init_int_timer                          0.0278
     3 reset_hc                                   0.0167
     3 register_gifconf                           0.0938
     3 read_chan                                  0.0016
     3 put_unused_buffer_head                     0.0833
     3 pipe_ioctl                                 0.0375
     3 permission                                 0.0227
     3 open_namei                                 0.0024
     3 mm_release                                 0.0833
     3 locks_remove_flock                         0.0163
     3 ksoftirqd                                  0.0153
     3 journal_file_buffer                        0.0682
     3 iput                                       0.0060
     3 ip_build_and_send_pkt                      0.0067
     3 interruptible_sleep_on                     0.0250
     3 inet_sock_destruct                         0.0080
     3 inet_ioctl                                 0.0079
     3 inet_create                                0.0048
     3 immediate_bh                               0.1071
     3 get_unused_fd                              0.0077
     3 get_empty_inode                            0.0179
     3 flush_tlb_all_ipi                          0.0395
     3 filemap_fdatawait                          0.0214
     3 fd_install                                 0.0441
     3 ext3_prepare_write                         0.0056
     3 ext3_mark_iloc_dirty                       0.0357
     3 e1000_watchdog                             0.0064
     3 e1000_read_phy_reg                         0.0179
     3 d_invalidate                               0.0214
     3 create_buffers                             0.0125
     3 cp_new_stat64                              0.0095
     3 copy_mm                                    0.0040
     3 copy_files                                 0.0043
     3 bdget                                      0.0078
     3 __insert_into_lru_list                     0.0300
     3 __global_restore_flags                     0.0417
     3 __get_user_4                               0.1250
     2 write_ldt                                  0.0037
     2 walk_page_buffers                          0.0161
     2 tcp_try_undo_partial                       0.0093
     2 tcp_try_undo_dsack                         0.0294
     2 tcp_send_ack                               0.0100
     2 tcp_retransmit_skb                         0.0034
     2 tcp_new                                    0.0333
     2 tcp_init_metrics                           0.0063
     2 tcp_fragment                               0.0029
     2 tcp_fixup_sndbuf                           0.0455
     2 tcp_enter_loss                             0.0051
     2 tcp_destroy_sock                           0.0043
     2 tcp_close_state                            0.0104
     2 tcp_child_process                          0.0134
     2 tcp_bucket_create                          0.0263
     2 tasklet_init                               0.0500
     2 sys_close                                  0.0179
     2 sock_recvmsg                               0.0116
     2 sock_map_fd                                0.0052
     2 sk_free                                    0.0172
     2 sk_alloc                                   0.0208
     2 sem_exit                                   0.0038
     2 reschedule                                 0.1667
     2 put_files_struct                           0.0109
     2 path_release                               0.0417
     2 path_lookup                                0.0556
     2 mmput                                      0.0172
     2 kiobuf_init                                0.0238
     2 journal_unfile_buffer                      0.0556
     2 journal_get_undo_access                    0.0070
     2 journal_dirty_data                         0.0047
     2 ip_mc_drop_socket                          0.0156
     2 idedisk_open                               0.0156
     2 grow_dev_page                              0.0122
     2 getname                                    0.0128
     2 generic_unplug_device                      0.0333
     2 generic_file_llseek                        0.0135
     2 free_kiovec                                0.0200
     2 flush_signal_handlers                      0.0333
     2 filemap_nopage                             0.0040
     2 ext3_writepage_trans_blocks                0.0152
     2 ext3_getblk                                0.0030
     2 do_generic_file_read                       0.0017
     2 destroy_inode                              0.0455
     2 deliver_to_old_ones                        0.0114
     2 copy_namespace                             0.0023
     2 clear_inode                                0.0122
     2 clean_inode                                0.0109
     2 block_prepare_write                        0.0179
     2 alloc_kiovec                               0.0161
     2 add_page_to_hash_queue                     0.0455
     2 activate_page                              0.0139
     2 __tcp_v4_lookup_listener                   0.0208
     2 __journal_refile_buffer                    0.0088
     2 __generic_copy_from_user                   0.0227
     2 __find_lock_page                           0.0500
     2 __down_trylock                             0.0263
     2 __down_failed_trylock                      0.1667
     2 __block_commit_write                       0.0098
     2 .text.lock.sched                           0.0042
     1 vt_console_device                          0.0250
     1 vgacon_save_screen                         0.0114
     1 udp_sendmsg                                0.0010
     1 tty_write                                  0.0015
     1 tty_ioctl                                  0.0011
     1 tcp_xmit_retransmit_queue                  0.0010
     1 tcp_xmit_probe_skb                         0.0086
     1 tcp_v4_synq_add                            0.0063
     1 tcp_v4_rebuild_header                      0.0028
     1 tcp_timewait_kill                          0.0045
     1 tcp_sync_mss                               0.0081
     1 tcp_reset_keepalive_timer                  0.0250
     1 tcp_reset                                  0.0039
     1 tcp_recv_urg                               0.0044
     1 tcp_incr_quickack                          0.0167
     1 tcp_error                                  0.0139
     1 sys_time                                   0.0119
     1 sys_stat64                                 0.0086
     1 sys_modify_ldt                             0.0106
     1 sys_lstat64                                0.0089
     1 sys_llseek                                 0.0034
     1 sys_getppid                                0.0250
     1 sys_getpeername                            0.0081
     1 sys_fstat64                                0.0104
     1 sys_clone                                  0.0250
     1 sys_brk                                    0.0042
     1 sys_access                                 0.0034
     1 svc_udp_recvfrom                           0.0014
     1 sock_wmalloc                               0.0125
     1 sock_release                               0.0104
     1 sock_read                                  0.0064
     1 sock_create                                0.0036
     1 skb_recv_datagram                          0.0042
     1 show_mem                                   0.0033
     1 setup_rt_frame                             0.0015
     1 setscheduler                               0.0024
     1 secure_tcp_sequence_number                 0.0051
     1 restart_request                            0.0132
     1 remove_inode_page                          0.0192
     1 remove_expectations                        0.0208
     1 proc_pid_lookup                            0.0020
     1 proc_lookup                                0.0068
     1 pdc202xx_reset                             0.0074
     1 path_walk                                  0.0357
     1 opost                                      0.0023
     1 old_mmap                                   0.0033
     1 normal_poll                                0.0035
     1 nfs3svc_encode_attrstat                    0.0020
     1 n_tty_receive_buf                          0.0002
     1 move_addr_to_user                          0.0119
     1 mm_init                                    0.0051
     1 memory_open                                0.0050
     1 kmem_cache_grow                            0.0018
     1 kill_fasync                                0.0172
     1 journal_free_journal_head                  0.0500
     1 journal_bmap                               0.0089
     1 journal_blocks_per_page                    0.0312
     1 journal_alloc_journal_head                 0.0096
     1 is_read_only                               0.0147
     1 ip_ct_gather_frags                         0.0031
     1 init_private_file                          0.0093
     1 init_once                                  0.0038
     1 init_buffer_head                           0.0182
     1 inet_release                               0.0125
     1 inet_getname                               0.0083
     1 inet_autobind                              0.0023
     1 get_pipe_inode                             0.0057
     1 free_pgtables                              0.0071
     1 fn_hash_lookup                             0.0045
     1 find_inlist_lock                           0.0035
     1 file_move                                  0.0139
     1 fcntl_dirnotify                            0.0032
     1 ext3_write_inode                           0.0192
     1 ext3_test_allocatable                      0.0156
     1 ext3_release_file                          0.0357
     1 ext3_read_inode                            0.0014
     1 ext3_open_file                             0.0250
     1 ext3_group_sparse                          0.0104
     1 ext3_file_write                            0.0053
     1 exit_sighand                               0.0100
     1 e1000_tbi_adjust_stats                     0.0021
     1 e1000_check_for_link                       0.0020
     1 do_timer                                   0.0125
     1 do_tcp_sendpages                           0.0004
     1 do_sys_settimeofday                        0.0064
     1 do_readv_writev                            0.0016
     1 do_pollfd                                  0.0074
     1 death_by_timeout                           0.0068
     1 d_instantiate                              0.0139
     1 cpu_raise_softirq                          0.0154
     1 copy_thread                                0.0071
     1 clear_page_tables                          0.0046
     1 clean_from_lists                           0.0139
     1 check_unthrottle                           0.0208
     1 change_protection                          0.0027
     1 cached_lookup                              0.0119
     1 add_to_page_cache_locked                   0.0081
     1 __user_walk                                0.0156
     1 __remove_inode_page                        0.0104
     1 __remove_from_lru_list                     0.0119
     1 __refile_buffer                            0.0109
     1 __rb_rotate_right                          0.0156
     1 __loop_delay                               0.0250
     1 .text.lock.super                           0.0071

-- 
Roy Sigurd Karlsbakk, Datavaktmester
ProntoTV AS - http://www.pronto.tv/
Tel: +47 9801 3356

Computers are like air conditioners.
They stop working when you open Windows.



* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 11:06 ` [RESEND] " Roy Sigurd Karlsbakk
@ 2002-10-23 13:01   ` bert hubert
  2002-10-23 13:21     ` David S. Miller
                       ` (2 more replies)
  2002-10-23 18:01   ` [RESEND] tuning linux for high network performance? Denis Vlasenko
  1 sibling, 3 replies; 39+ messages in thread
From: bert hubert @ 2002-10-23 13:01 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: netdev, Kernel mailing list

On Wed, Oct 23, 2002 at 01:06:18PM +0200, Roy Sigurd Karlsbakk wrote:
> > I've got this video server serving video for VoD. problem is the P4 1.8
> > seems to be maxed out by a few system calls. The below output is for ~50
> > clients streaming at ~4.5Mbps. if trying to increase this to ~70, the CPU
> > maxes out.

'50 clients *each* streaming at ~4.4MBps', better make that clear, otherwise
something is *very* broken. Also mention that you have an e1000 card which
does not do outgoing checksumming.

You'd think that a kernel would be able to do 250 megabits of TCP checksums,
though.

> ...adding the whole profile output - sorted by the first column this time...
> 
> 905182 total                                      0.4741
> 121426 csum_partial_copy_generic                474.3203
>  93633 default_idle                             1800.6346
>  74665 do_wp_page                               111.1086

Perhaps the 'copy' also entails grabbing the page from disk, leading to
inflated csum_partial_copy_generic stats?

Where are you serving from?

Regards,

bert

-- 
http://www.PowerDNS.com          Versatile DNS Software & Services
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO


* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 13:01   ` bert hubert
@ 2002-10-23 13:21     ` David S. Miller
  2002-10-23 13:42       ` Roy Sigurd Karlsbakk
  2002-10-23 13:41     ` [RESEND] tuning linux for high network performance? Roy Sigurd Karlsbakk
  2002-10-23 14:59     ` Nivedita Singhvi
  2 siblings, 1 reply; 39+ messages in thread
From: David S. Miller @ 2002-10-23 13:21 UTC (permalink / raw)
  To: bert hubert; +Cc: Roy Sigurd Karlsbakk, netdev, Kernel mailing list

On Wed, 2002-10-23 at 06:01, bert hubert wrote:
> Also mention that you have an e1000 card which
> does not do outgoing checksumming.

The e1000 can very well do hardware checksumming on transmit.

The missing piece of the puzzle is that his application is not
using sendfile(), without which no transmit checksum offload
can take place.
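
As an illustration only (the descriptor names and chunking are assumptions,
not the actual server code), the sendfile() path being described is roughly:

/* Minimal sketch: stream an open file to a connected TCP socket with
 * sendfile(), so the data never passes through user space and the e1000
 * can insert the TCP checksum on transmit. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static int stream_file(int sock, const char *path)
{
        struct stat st;
        off_t offset = 0;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0) {
                close(fd);
                return -1;
        }
        while (offset < st.st_size) {
                ssize_t n = sendfile(sock, fd, &offset, st.st_size - offset);
                if (n <= 0)
                        break;          /* error or receiver gone */
        }
        close(fd);
        return offset == st.st_size ? 0 : -1;
}

(For multi-gigabyte files the offset would also need large-file support,
i.e. a 64-bit off_t.)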



* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 18:01   ` [RESEND] tuning linux for high network performance? Denis Vlasenko
@ 2002-10-23 13:36     ` Roy Sigurd Karlsbakk
  2002-10-24 16:22       ` Denis Vlasenko
  2002-10-23 14:52     ` [RESEND] tuning linux for high network performance? Nivedita Singhvi
  1 sibling, 1 reply; 39+ messages in thread
From: Roy Sigurd Karlsbakk @ 2002-10-23 13:36 UTC (permalink / raw)
  To: vda, netdev; +Cc: Kernel mailing list

> >
> > 905182 total                                      0.4741
> > 121426 csum_partial_copy_generic                474.3203
>
> Well, maybe take a look at this func and try to optimize it?

I don't know assembly that well - sorry.

> >  93633 default_idle                             1800.6346
> >  74665 do_wp_page                               111.1086
>
> What's this?

do_wp_page is defined as a function in mm/memory.c.

comments from the file:

/*
 * This routine handles present pages, when users try to write
 * to a shared page. It is done by copying the page to a new address
 * and decrementing the shared-page counter for the old page.
 *
 * Goto-purists beware: the only reason for goto's here is that it results
 * in better assembly code.. The "default" path will see no jumps at all.
 *
 * Note that this routine assumes that the protection checks have been
 * done by the caller (the low-level page fault routine in most cases).
 * Thus we can safely just mark it writable once we've done any necessary
 * COW.
 *
 * We also mark the page dirty at this point even though the page will
 * change only once the write actually happens. This avoids a few races,
 * and potentially makes it more efficient.
 *
 * We hold the mm semaphore and the page_table_lock on entry and exit
 * with the page_table_lock released.
 */

>
> >  65857 ide_intr                                 184.9916
>
> You have 1 ide_intr per 2 csum_partial_copy_generic... hmmm...
> how large is your readahead? I assume you'd like to fetch
> more sectors from ide per interrupt. (I hope you do DMA ;)

doing DMA - RAID-0 with 1MB chunk size on 4 disks.

> >  53636 handle_IRQ_event                         432.5484
> >  21973 do_softirq                               107.7108
> >  20498 e1000_intr                               244.0238
>
> I know zero about networking, but why 120 000 csum_partial_copy_generic
> and only 20 000 NIC interrupts? That may be abnormal.

Sorry - I don't know.

-- 
Roy Sigurd Karlsbakk, Datavaktmester
ProntoTV AS - http://www.pronto.tv/
Tel: +47 9801 3356

Computers are like air conditioners.
They stop working when you open Windows.



* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 13:01   ` bert hubert
  2002-10-23 13:21     ` David S. Miller
@ 2002-10-23 13:41     ` Roy Sigurd Karlsbakk
  2002-10-23 14:59     ` Nivedita Singhvi
  2 siblings, 0 replies; 39+ messages in thread
From: Roy Sigurd Karlsbakk @ 2002-10-23 13:41 UTC (permalink / raw)
  To: bert hubert; +Cc: netdev, Kernel mailing list

> '50 clients *each* streaming at ~4.4MBps', better make that clear,
> otherwise something is *very* broken. Also mention that you have an e1000
> card which does not do outgoing checksumming.

Just to clarify:

s/MBps/Mbps/
s/bps/bits per second/

> You'd think that a kernel would be able to do 250megabits of TCP checksums
> though.
>
> > ...adding the whole profile output - sorted by the first column this
> > time...
> >
> > 905182 total                                      0.4741
> > 121426 csum_partial_copy_generic                474.3203
> >  93633 default_idle                             1800.6346
> >  74665 do_wp_page                               111.1086
>
> Perhaps the 'copy' also entails grabbing the page from disk, leading to
> inflated csum_partial_copy_generic stats?

I really don't know. Just to clarify a little more: the server app uses 
O_DIRECT to read the data before tossing it to the socket.
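
Roughly, that path looks like the sketch below (chunk size, alignment and
names are assumptions, not the real code); each write() then copies the
buffer from user space into socket buffers, which is the combined
copy-and-checksum that shows up as csum_partial_copy_generic:

/* Rough sketch of the O_DIRECT read -> write() send path.  The write()
 * is a copying send, so the kernel checksums the data while copying it
 * from user space. */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (256 * 1024)      /* assumed chunk size, kept block-aligned */

static int stream_odirect(int sock, const char *path)
{
        void *buf;
        ssize_t n;
        int fd = open(path, O_RDONLY | O_DIRECT);

        if (fd < 0 || posix_memalign(&buf, 4096, CHUNK) != 0)
                return -1;

        while ((n = read(fd, buf, CHUNK)) > 0) {
                ssize_t sent = 0;
                while (sent < n) {
                        ssize_t w = write(sock, (char *)buf + sent, n - sent);
                        if (w <= 0)
                                goto out;
                        sent += w;
                }
        }
out:
        free(buf);
        close(fd);
        return n == 0 ? 0 : -1;
}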

> Where are you serving from?

What do you mean?

roy
-- 
Roy Sigurd Karlsbakk, Datavaktmester
ProntoTV AS - http://www.pronto.tv/
Tel: +47 9801 3356

Computers are like air conditioners.
They stop working when you open Windows.



* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 13:21     ` David S. Miller
@ 2002-10-23 13:42       ` Roy Sigurd Karlsbakk
  2002-10-23 17:01         ` bert hubert
  2002-10-24  4:11         ` David S. Miller
  0 siblings, 2 replies; 39+ messages in thread
From: Roy Sigurd Karlsbakk @ 2002-10-23 13:42 UTC (permalink / raw)
  To: David S. Miller, bert hubert; +Cc: netdev, Kernel mailing list

> The e1000 can very well do hardware checksumming on transmit.
>
> The missing piece of the puzzle is that his application is not
> using sendfile(), without which no transmit checksum offload
> can take place.

As far as I've understood, sendfile() won't do much good with large files. Is 
this right?

We're talking of 3-6GB files here ...

roy
-- 
Roy Sigurd Karlsbakk, Datavaktmester
ProntoTV AS - http://www.pronto.tv/
Tel: +47 9801 3356

Computers are like air conditioners.
They stop working when you open Windows.



* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 18:01   ` [RESEND] tuning linux for high network performance? Denis Vlasenko
  2002-10-23 13:36     ` Roy Sigurd Karlsbakk
@ 2002-10-23 14:52     ` Nivedita Singhvi
  1 sibling, 0 replies; 39+ messages in thread
From: Nivedita Singhvi @ 2002-10-23 14:52 UTC (permalink / raw)
  To: vda; +Cc: Roy Sigurd Karlsbakk, netdev

Denis Vlasenko wrote:

> I know zero about networking, but why 120 000 csum_partial_copy_generic
> and only 20 000 NIC interrupts? That may be abnormal.
> --
> vda

Because firstly, we pick up several packets per interrupt,
and additionally, the function is also called on the send side.

thanks,
Nivedita


* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 13:01   ` bert hubert
  2002-10-23 13:21     ` David S. Miller
  2002-10-23 13:41     ` [RESEND] tuning linux for high network performance? Roy Sigurd Karlsbakk
@ 2002-10-23 14:59     ` Nivedita Singhvi
  2002-10-23 15:26       ` O_DIRECT sockets? (was [RESEND] tuning linux for high network performance?) Roy Sigurd Karlsbakk
  2 siblings, 1 reply; 39+ messages in thread
From: Nivedita Singhvi @ 2002-10-23 14:59 UTC (permalink / raw)
  To: bert hubert; +Cc: Roy Sigurd Karlsbakk, netdev, Kernel mailing list

bert hubert wrote:

> > ...adding the whole profile output - sorted by the first column this time...
> >
> > 905182 total                                      0.4741
> > 121426 csum_partial_copy_generic                474.3203
> >  93633 default_idle                             1800.6346
> >  74665 do_wp_page                               111.1086
> 
> Perhaps the 'copy' also entails grabbing the page from disk, leading to
> inflated csum_partial_copy_generic stats?

I think this is strictly a copy from user space->kernel and vice versa.
This shouldn't include the disk access, etc.

thanks,
Nivedita


* O_DIRECT sockets? (was [RESEND] tuning linux for high network performance?)
  2002-10-23 14:59     ` Nivedita Singhvi
@ 2002-10-23 15:26       ` Roy Sigurd Karlsbakk
  2002-10-23 16:34           ` Nivedita Singhvi
  0 siblings, 1 reply; 39+ messages in thread
From: Roy Sigurd Karlsbakk @ 2002-10-23 15:26 UTC (permalink / raw)
  To: Nivedita Singhvi, bert hubert; +Cc: netdev, Kernel mailing list

On Wednesday 23 October 2002 16:59, Nivedita Singhvi wrote:
> bert hubert wrote:
> > > ...adding the whole profile output - sorted by the first column this
> > > time...
> > >
> > > 905182 total                                      0.4741
> > > 121426 csum_partial_copy_generic                474.3203
> > >  93633 default_idle                             1800.6346
> > >  74665 do_wp_page                               111.1086
> >
> > Perhaps the 'copy' also entails grabbing the page from disk, leading to
> > inflated csum_partial_copy_generic stats?
>
> I think this is strictly a copy from user space->kernel and vice versa.
> This shouldn't include the disk access, etc.

Hmm.

I'm doing O_DIRECT reads (from disk), so the copy must be user -> kernel on 
the send side, then.

Any chance of using O_DIRECT to the socket?

-- 
Roy Sigurd Karlsbakk, Datavaktmester
ProntoTV AS - http://www.pronto.tv/
Tel: +47 9801 3356

Computers are like air conditioners.
They stop working when you open Windows.



* Re: O_DIRECT sockets? (was [RESEND] tuning linux for high network  performance?)
  2002-10-23 15:26       ` O_DIRECT sockets? (was [RESEND] tuning linux for high network performance?) Roy Sigurd Karlsbakk
@ 2002-10-23 16:34           ` Nivedita Singhvi
  0 siblings, 0 replies; 39+ messages in thread
From: Nivedita Singhvi @ 2002-10-23 16:34 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: bert hubert, netdev, Kernel mailing list

Roy Sigurd Karlsbakk wrote:

> I'm doing O_DIRECT read (from disk), so it needs to be user -> kernel, then.
> 
> any chance of using O_DIRECT to the socket?

Hmm, I'm still not clear on why you cannot use sendfile().
I was not aware of any upper limit on the file size for
sendfile() to be used.  From what little I know, this is
exactly the kind of situation that sendfile() was intended
to benefit.

thanks,
Nivedita


* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 13:42       ` Roy Sigurd Karlsbakk
@ 2002-10-23 17:01         ` bert hubert
  2002-10-23 17:10           ` Ben Greear
                             ` (2 more replies)
  2002-10-24  4:11         ` David S. Miller
  1 sibling, 3 replies; 39+ messages in thread
From: bert hubert @ 2002-10-23 17:01 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: David S. Miller, netdev, Kernel mailing list

On Wed, Oct 23, 2002 at 03:42:48PM +0200, Roy Sigurd Karlsbakk wrote:
> > The e1000 can very well do hardware checksumming on transmit.
> >
> > The missing piece of the puzzle is that his application is not
> > using sendfile(), without which no transmit checksum offload
> > can take place.
> 
> As far as I've understood, sendfile() won't do much good with large files. Is 
> this right?

I still refuse to believe that a 1.8 GHz Pentium 4 can only checksum
250 megabits/second. MD RAID5 does better, and it probably doesn't use a
checksum as braindead as the one used by TCP.
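
For reference, the checksum itself is only a 16-bit one's-complement sum
(RFC 1071) - roughly the plain-C sketch below; the kernel's
csum_partial_copy_generic is hand-written assembly that folds this sum
into the data copy:

/* The Internet checksum (RFC 1071) in its plainest form: a 16-bit
 * one's-complement sum over the data.  Illustrative C only. */
#include <stddef.h>
#include <stdint.h>

uint16_t inet_csum(const void *data, size_t len)
{
        const uint16_t *p = data;
        uint32_t sum = 0;

        while (len > 1) {
                sum += *p++;
                len -= 2;
        }
        if (len)                        /* trailing odd byte */
                sum += *(const uint8_t *)p;
        while (sum >> 16)               /* fold the carries */
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
}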

If the checksumming is not the problem, the copying is, which would be a
weakness of your hardware. The function profiled does both the copying and
the checksumming.

But 250 megabits/second also seems low.

Dave? 

Regards,

bert

-- 
http://www.PowerDNS.com          Versatile DNS Software & Services
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO


* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 17:01         ` bert hubert
@ 2002-10-23 17:10           ` Ben Greear
  2002-10-23 17:11           ` Richard B. Johnson
  2002-10-23 17:12           ` Nivedita Singhvi
  2 siblings, 0 replies; 39+ messages in thread
From: Ben Greear @ 2002-10-23 17:10 UTC (permalink / raw)
  To: bert hubert
  Cc: Roy Sigurd Karlsbakk, David S. Miller, netdev, Kernel mailing list

bert hubert wrote:

> I still refuse to believe that a 1.8GHz Pentium4 can only checksum
> 250megabits/second. MD Raid5 does better and they probably don't use a
> checksum as braindead as that used by TCP.

For what it's worth, I have been able to send and receive 400+ Mbps
of traffic, bidirectionally, on the same machine (i.e. about 1600 Mbps
of payload across the PCI bus).

So it's probably not the e1000 or the networking code that is slowing you down.
(This was on a 64/66 PCI, dual-AMD 2 GHz machine though -
are you running only 32/33 PCI?  If not, where did you find this motherboard!)

Have you tried just reading the information from disk and doing everything
except the final send/write/sendto?  That would help determine whether it is
your file reads that are killing you.
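
A rough sketch of that test (chunk size and the plain read path are
assumptions - match whatever the server really does):

/* Read the file the same way the server would, but never send it.
 * Compare the elapsed time and CPU use against the full streaming path. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        struct timeval t0, t1;
        long long total = 0;
        char *buf = malloc(256 * 1024);
        ssize_t n;
        int fd;

        if (argc < 2 || !buf || (fd = open(argv[1], O_RDONLY)) < 0)
                return 1;

        gettimeofday(&t0, NULL);
        while ((n = read(fd, buf, 256 * 1024)) > 0)
                total += n;             /* read only -- no send */
        gettimeofday(&t1, NULL);

        printf("%lld bytes in %.2f s\n", total,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
        return 0;
}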

Ben

-- 
Ben Greear <greearb@candelatech.com>       <Ben_Greear AT excite.com>
President of Candela Technologies Inc      http://www.candelatech.com
ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 17:01         ` bert hubert
  2002-10-23 17:10           ` Ben Greear
@ 2002-10-23 17:11           ` Richard B. Johnson
  2002-10-23 17:12           ` Nivedita Singhvi
  2 siblings, 0 replies; 39+ messages in thread
From: Richard B. Johnson @ 2002-10-23 17:11 UTC (permalink / raw)
  To: bert hubert
  Cc: Roy Sigurd Karlsbakk, David S. Miller, netdev, Kernel mailing list

On Wed, 23 Oct 2002, bert hubert wrote:

> On Wed, Oct 23, 2002 at 03:42:48PM +0200, Roy Sigurd Karlsbakk wrote:
> > > The e1000 can very well do hardware checksumming on transmit.
> > >
> > > The missing piece of the puzzle is that his application is not
> > > using sendfile(), without which no transmit checksum offload
> > > can take place.
> > 
> > As far as I've understood, sendfile() won't do much good with large files. Is 
> > this right?
> 
> I still refuse to believe that a 1.8GHz Pentium4 can only checksum
> 250megabits/second. MD Raid5 does better and they probably don't use a
> checksum as braindead as that used by TCP.
> 
> If the checksumming is not the problem, the copying is, which would be a
> weakness of your hardware. The function profiled does both the copying and
> the checksumming.
> 
> But 250megabits/second also seems low.
> 
> Dave? 
> 

Ordinary DUAL Pentium 400 MHz machine does this...


Calculating CPU speed...done
Testing checksum speed...done
Testing RAM copy...done
Testing I/O port speed...done

                     CPU Clock = 400  MHz
                checksum speed = 685  Mb/s
                      RAM copy = 1549 Mb/s
                I/O port speed = 654  kb/s


This is 685 megaBYTES per second.

                checksum speed = 685  Mb/s



Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
   Bush : The Fourth Reich of America



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 17:01         ` bert hubert
  2002-10-23 17:10           ` Ben Greear
  2002-10-23 17:11           ` Richard B. Johnson
@ 2002-10-23 17:12           ` Nivedita Singhvi
  2002-10-23 17:56             ` Richard B. Johnson
  2 siblings, 1 reply; 39+ messages in thread
From: Nivedita Singhvi @ 2002-10-23 17:12 UTC (permalink / raw)
  To: bert hubert
  Cc: Roy Sigurd Karlsbakk, David S. Miller, netdev, Kernel mailing list

bert hubert wrote:

> I still refuse to believe that a 1.8GHz Pentium4 can only checksum
> 250megabits/second. MD Raid5 does better and they probably don't use a
> checksum as braindead as that used by TCP.
> 
> If the checksumming is not the problem, the copying is, which would be a
> weakness of your hardware. The function profiled does both the copying and
> the checksumming.

Yep, it's not so much the checksumming as the fact that this is
done over each byte of data and copied.

thanks,
Nivedita

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 17:12           ` Nivedita Singhvi
@ 2002-10-23 17:56             ` Richard B. Johnson
  2002-10-23 18:07               ` Nivedita Singhvi
  0 siblings, 1 reply; 39+ messages in thread
From: Richard B. Johnson @ 2002-10-23 17:56 UTC (permalink / raw)
  To: Nivedita Singhvi
  Cc: bert hubert, Roy Sigurd Karlsbakk, David S. Miller, netdev,
	Kernel mailing list

On Wed, 23 Oct 2002, Nivedita Singhvi wrote:

> bert hubert wrote:
> 
> > I still refuse to believe that a 1.8GHz Pentium4 can only checksum
> > 250megabits/second. MD Raid5 does better and they probably don't use a
> > checksum as braindead as that used by TCP.
> > 
> > If the checksumming is not the problem, the copying is, which would be a
> > weakness of your hardware. The function profiled does both the copying and
> > the checksumming.
> 
> Yep, its not so much the checksumming as the fact that this is
> done over each byte of data and copied.
> 
> thanks,
> Nivedita

No. It's done over each word (short int) and the actual summation
takes place during the address calculation of the next word. This
gets you a checksum that is practically free.
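
In plain C (leaving out the kernel's unrolled asm and the copy half), that
word-at-a-time ones'-complement sum looks roughly like this, for an
even-length, 16-bit-aligned buffer:

unsigned short inet_csum(const unsigned short *p, int len)
{
	unsigned long sum = 0;

	while (len > 1) {	/* add 16-bit words; carries pile up above bit 15 */
		sum += *p++;
		len -= 2;
	}
	sum = (sum & 0xffff) + (sum >> 16);	/* fold the carries back in */
	sum += sum >> 16;			/* one more fold catches the last carry */
	return (unsigned short)~sum;
}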

A 400 MHz ix86 CPU will checksum/copy at 685 megabytes per second.
It will copy at 1,549 megabytes per second. Those are megaBYTES!

If you have slow network performance it has nothing to do with
either copy or checksum. Data transmission acts like a low-pass
filter. The dominant pole of that transfer function determines
the speed, that's why it's called dominant. If you measure
a data-rate of 10 megabytes/second, nothing you do with copy
or checksum will affect it to any significant extent.

If you have a data-rate of 100 megabytes per second, then any
tinkering with copy will have an effective improvement ratio
of 100/1,549 ~= 0.064. If you have a data rate of 100 megabytes
per second and you tinker with checksum, you get an improvement
ratio of 100/685 ~= 0.14. These are just not the things that are
affecting your performance.

If you were to double the checksumming speed, you increase the
throughput by 2 * 0.14 = 0.28 with the parameters shown.

The TCP/IP checksum is quite nice. It may have been discovered
by accident, but it's still nice. It works regardless of whether
you have a little endian or big endian machine. It also doesn't
wrap so you don't (usually) show a good checksum when the data
is bad. It does have the characteristic that if all the bits are
inverted, it will checksum good. However, there are not too many
real-world scenarios that would result in this inversion. So it's
not "brain-dead" as you state. A hardware checksum is really
quick because it's really easy.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
   Bush : The Fourth Reich of America



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 11:06 ` [RESEND] " Roy Sigurd Karlsbakk
  2002-10-23 13:01   ` bert hubert
@ 2002-10-23 18:01   ` Denis Vlasenko
  2002-10-23 13:36     ` Roy Sigurd Karlsbakk
  2002-10-23 14:52     ` [RESEND] tuning linux for high network performance? Nivedita Singhvi
  1 sibling, 2 replies; 39+ messages in thread
From: Denis Vlasenko @ 2002-10-23 18:01 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk, netdev

On 23 October 2002 09:06, Roy Sigurd Karlsbakk wrote:
> > I've got this video server serving video for VoD. problem is the P4
> > 1.8 seems to be maxed out by a few system calls. The below output
> > is for ~50 clients streaming at ~4.5Mbps. if trying to increase
> > this to ~70, the CPU maxes out.
> >
> > Does anyone have an idea?
>
> ...adding the whole profile output - sorted by the first column this
> time...
>
> 905182 total                                      0.4741
> 121426 csum_partial_copy_generic                474.3203

Well, maybe take a look at this func and try to optimize it?

>  93633 default_idle                             1800.6346
>  74665 do_wp_page                               111.1086

What's this?

>  65857 ide_intr                                 184.9916

You have 1 ide_intr per 2 csum_partial_copy_generic... hmmm...
how large is your readahead? I assume you'd like to fetch
more sectors from ide per interrupt. (I hope you do DMA ;)

>  53636 handle_IRQ_event                         432.5484
>  21973 do_softirq                               107.7108
>  20498 e1000_intr                               244.0238

I know zero about networking, but why 120 000 csum_partial_copy_generic
and only 20 000 NIC interrupts? That may be abnormal.
--
vda

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 17:56             ` Richard B. Johnson
@ 2002-10-23 18:07               ` Nivedita Singhvi
  2002-10-23 18:30                 ` Richard B. Johnson
  0 siblings, 1 reply; 39+ messages in thread
From: Nivedita Singhvi @ 2002-10-23 18:07 UTC (permalink / raw)
  To: root
  Cc: bert hubert, Roy Sigurd Karlsbakk, David S. Miller, netdev,
	Kernel mailing list

"Richard B. Johnson" wrote:

> No. It's done over each word (short int) and the actual summation
> takes place during the address calculation of the next word. This
> gets you a checksum that is practically free.

Yep, sorry, word, not byte. My bad. The cost is in the fact 
that this whole process involves loading each word of the data
stream into a register. Which is why I also used to consider
the checksum cost as negligible. 

> A 400 MHz ix86 CPU will checksum/copy at 685 megabytes per second.
> It will copy at 1,549 megabytes per second. Those are megaBYTES!

But then why the difference in the checksum/copy and copy?
Are you saying the checksum is not costing you 864 megabytes
a second??

thanks,
Nivedita

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 18:07               ` Nivedita Singhvi
@ 2002-10-23 18:30                 ` Richard B. Johnson
  0 siblings, 0 replies; 39+ messages in thread
From: Richard B. Johnson @ 2002-10-23 18:30 UTC (permalink / raw)
  To: Nivedita Singhvi
  Cc: bert hubert, Roy Sigurd Karlsbakk, David S. Miller, netdev,
	Kernel mailing list

On Wed, 23 Oct 2002, Nivedita Singhvi wrote:

> "Richard B. Johnson" wrote:
> 
> > No. It's done over each word (short int) and the actual summation
> > takes place during the address calculation of the next word. This
> > gets you a checksum that is practically free.
> 
> Yep, sorry, word, not byte. My bad. The cost is in the fact 
> that this whole process involves loading each word of the data
> stream into a register. Which is why I also used to consider
> the checksum cost as negligible. 
> 
> > A 400 MHz ix86 CPU will checksum/copy at 685 megabytes per second.
> > It will copy at 1,549 megabytes per second. Those are megaBYTES!
> 
> But then why the difference in the checksum/copy and copy?
> Are you saying the checksum is not costing you 864 megabytes
> a second??

Costing you 864 megabytes per second?
Let's say the checksum was free. You would then be able to do INF bytes/sec.
So is it costing you INF bytes/sec?  No, it's costing you nothing.
If we were not dealing with INF, then 'Cost' is approximately 1/N, not
N. Cost is work_done_without_checksum - work_done_with_checksum. Because
of the low-pass filter pole, these numbers are practically the same.
But, you can get a measurable difference between any two large numbers.
This makes the 'cost' seem high. You need to make it relative to make
any sense, so a 'goodness' can be expressed as a ratio of the cost and
the work having been done.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
   Bush : The Fourth Reich of America



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 13:42       ` Roy Sigurd Karlsbakk
  2002-10-23 17:01         ` bert hubert
@ 2002-10-24  4:11         ` David S. Miller
  2002-10-24  9:37           ` Karen Shaeffer
  2002-10-24 10:30           ` sendfile64() anyone? (was [RESEND] tuning linux for high network performance?) Roy Sigurd Karlsbakk
  1 sibling, 2 replies; 39+ messages in thread
From: David S. Miller @ 2002-10-24  4:11 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: bert hubert, netdev, Kernel mailing list

On Wed, 2002-10-23 at 06:42, Roy Sigurd Karlsbakk wrote:
> As far as I've understood, sendfile() won't do much good with large files. Is 
> this right?

There is always a benefit to using sendfile(). When you use
sendfile() the CPU doesn't touch one byte of the data if
the network card supports TX checksumming.  The disk DMAs
to RAM, then the net card DMAs from RAM.  Simple as that.
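
A minimal sketch of that path from user space (a hypothetical helper, using
the plain 32-bit off_t sendfile() under discussion here):

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>

/* push the whole (already open) file out the (already connected) socket;
 * the kernel hands page-cache pages to the NIC, user space never sees
 * the data */
static int send_whole_file(int sock, int fd)
{
	struct stat st;
	off_t off = 0;

	if (fstat(fd, &st) < 0)
		return -1;
	while (off < st.st_size) {
		ssize_t n = sendfile(sock, fd, &off, st.st_size - off);
		if (n <= 0)
			return -1;	/* error; off says how far we got */
	}
	return 0;
}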


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND] tuning linux for high network performance?
  2002-10-24  4:11         ` David S. Miller
@ 2002-10-24  9:37           ` Karen Shaeffer
  2002-10-24 10:30           ` sendfile64() anyone? (was [RESEND] tuning linux for high network performance?) Roy Sigurd Karlsbakk
  1 sibling, 0 replies; 39+ messages in thread
From: Karen Shaeffer @ 2002-10-24  9:37 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

On Wed, Oct 23, 2002 at 09:11:09PM -0700, David S. Miller wrote:
> On Wed, 2002-10-23 at 06:42, Roy Sigurd Karlsbakk wrote:
> > As far as I've understood, sendfile() won't do much good with large files. Is 
> > this right?
> 
> There is always a benefit to using sendfile(), when you use
> sendfile() the cpu doesn't touch one byte of the data if
> the network card support TX checksumming.  The disk DMAs
> to ram, then the net card DMAs from ram.  Simple as that.

Referring to:

$ rpm -qf /usr/include/sys/sendfile.h
glibc-devel-2.2.5-40

quoting "sendfile.h"

#ifdef __USE_FILE_OFFSET64
# error "<sys/sendfile.h> cannot be used with _FILE_OFFSET_BITS=64"
#endif

So, how does one use sendfile() for large files that are greater than 2
GBytes? Am I missing something?

Thanks,
Karen
-- 
 Karen Shaeffer
 Neuralscape; Santa Cruz, Ca. 95060
 shaeffer@neuralscape.com  http://www.neuralscape.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: O_DIRECT sockets? (was [RESEND] tuning linux for high network  performance?)
  2002-10-23 16:34           ` Nivedita Singhvi
  (?)
@ 2002-10-24 10:14           ` Roy Sigurd Karlsbakk
  2002-10-24 10:46               ` David S. Miller
  -1 siblings, 1 reply; 39+ messages in thread
From: Roy Sigurd Karlsbakk @ 2002-10-24 10:14 UTC (permalink / raw)
  To: Nivedita Singhvi; +Cc: bert hubert, netdev, Kernel mailing list

> Hmm, I'm still not clear on why you cannot use sendfile()?
> I was not aware of any upper limit to the file size in order
> for sendfile() to be used?  From what little I know, this
> is exactly the kind of situation that sendfile was intended
> to benefit.

I can't use sendfile(). I'm working with files > 4GB, and from man 2 sendfile:

ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

int main() {
	ssize_t s1;
	off_t offset;
	size_t count;

	printf("sizeof ssize_t: %d\n", sizeof s1);
	printf("sizeof size_t: %d\n", sizeof count);
	printf("sizeof off_t: %d\n", sizeof offset);
	return 0;
}

running it

$ ./sendfile_test
sizeof ssize_t: 4
sizeof size_t: 4
sizeof off_t: 4
$ 

As far as I'm concerned, this will not allow me to address files past the 2 GB
limit of a signed 32-bit off_t.

roy

-- 
Roy Sigurd Karlsbakk, Datavaktmester
ProntoTV AS - http://www.pronto.tv/
Tel: +47 9801 3356

Computers are like air conditioners.
They stop working when you open Windows.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* sendfile64() anyone? (was [RESEND] tuning linux for high network performance?)
  2002-10-24  4:11         ` David S. Miller
  2002-10-24  9:37           ` Karen Shaeffer
@ 2002-10-24 10:30           ` Roy Sigurd Karlsbakk
  2002-10-24 10:47             ` David S. Miller
  1 sibling, 1 reply; 39+ messages in thread
From: Roy Sigurd Karlsbakk @ 2002-10-24 10:30 UTC (permalink / raw)
  To: David S. Miller; +Cc: bert hubert, netdev, Kernel mailing list

On Thursday 24 October 2002 06:11, David S. Miller wrote:
> On Wed, 2002-10-23 at 06:42, Roy Sigurd Karlsbakk wrote:
> > As far as I've understood, sendfile() won't do much good with large
> > files. Is this right?
>
> There is always a benefit to using sendfile(), when you use
> sendfile() the cpu doesn't touch one byte of the data if
> the network card support TX checksumming.  The disk DMAs
> to ram, then the net card DMAs from ram.  Simple as that.

Are there any plans to implement sendfile64(), or sendfile() support for
-D_FILE_OFFSET_BITS=64?

(from man 2 sendfile)
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

int main() {
        ssize_t s1;
        size_t count;
        off_t offset;

        printf("sizeof ssize_t: %d\n", sizeof s1);
        printf("sizeof size_t: %d\n", sizeof count);
        printf("sizeof off_t: %d\n", sizeof offset);
        return 0;
}
$ make
...
$ ./sendfile_test
sizeof ssize_t: 4
sizeof size_t: 4
sizeof off_t: 4
$ 

and - when attempting to build this with -D_FILE_OFFSET_BITS=64 

[roy@roy-sin micro_httpd-O_DIRECT]$ make sendfile_test
gcc -D_DEBUG -Wall -W -D_GNU_SOURCE -D_NO_DIR_ACCESS -D_FILE_OFFSET_BITS=64 
-D_LARGEFILE_SOURCE -DUSE_O_DIRECT -DINETD -Wno-unused -O0 -ggdb -c 
sendfile_test.c
In file included from sendfile_test.c:1:
/usr/include/sys/sendfile.h:26: #error "<sys/sendfile.h> cannot be used with 
_FILE_OFFSET_BITS=64"
make: *** [sendfile_test.o] Error 1

-- 
Roy Sigurd Karlsbakk, Datavaktmester
ProntoTV AS - http://www.pronto.tv/
Tel: +47 9801 3356

Computers are like air conditioners.
They stop working when you open Windows.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: O_DIRECT sockets? (was [RESEND] tuning linux for high network  performance?)
  2002-10-24 10:14           ` Roy Sigurd Karlsbakk
@ 2002-10-24 10:46               ` David S. Miller
  0 siblings, 0 replies; 39+ messages in thread
From: David S. Miller @ 2002-10-24 10:46 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk
  Cc: Nivedita Singhvi, bert hubert, netdev, Kernel mailing list

On Thu, 2002-10-24 at 03:14, Roy Sigurd Karlsbakk wrote:
> I can't use sendfile(). I'm working with files > 4GB, and from man 2 sendfile:

That's what sendfile64() is for.  In fact every vendor I am aware
of is shipping the sys_sendfile64() patch in their kernels and
an appropriately fixed up glibc.
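
A hedged sketch of a userspace wrapper until glibc catches up, assuming
kernel headers that define __NR_sendfile64 (i.e. a kernel carrying that
patch); the name my_sendfile64() is made up:

#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static ssize_t my_sendfile64(int out_fd, int in_fd, long long *offset, size_t count)
{
#ifdef __NR_sendfile64
	return syscall(__NR_sendfile64, out_fd, in_fd, offset, count);
#else
	errno = ENOSYS;		/* headers/kernel without the sendfile64 patch */
	return -1;
#endif
}

It is used exactly like sendfile(), just with a 64-bit offset.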


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: sendfile64() anyone? (was [RESEND] tuning linux for high network performance?)
  2002-10-24 10:30           ` sendfile64() anyone? (was [RESEND] tuning linux for high network performance?) Roy Sigurd Karlsbakk
@ 2002-10-24 10:47             ` David S. Miller
  2002-10-24 11:07               ` Roy Sigurd Karlsbakk
  0 siblings, 1 reply; 39+ messages in thread
From: David S. Miller @ 2002-10-24 10:47 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: bert hubert, netdev, Kernel mailing list

On Thu, 2002-10-24 at 03:30, Roy Sigurd Karlsbakk wrote:
> Are there any plans of implementing sendfile64() or sendfile() support for 
> -D_FILE_OFFSET_BITS=64?

This is old hat, and appears in every current vendor kernel I am
aware of and is in 2.5.x as well.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: sendfile64() anyone? (was [RESEND] tuning linux for high network performance?)
  2002-10-24 10:47             ` David S. Miller
@ 2002-10-24 11:07               ` Roy Sigurd Karlsbakk
  0 siblings, 0 replies; 39+ messages in thread
From: Roy Sigurd Karlsbakk @ 2002-10-24 11:07 UTC (permalink / raw)
  To: David S. Miller; +Cc: bert hubert, netdev, Kernel mailing list

On Thursday 24 October 2002 12:47, David S. Miller wrote:
> On Thu, 2002-10-24 at 03:30, Roy Sigurd Karlsbakk wrote:
> > Are there any plans of implementing sendfile64() or sendfile() support
> > for -D_FILE_OFFSET_BITS=64?
>
> This is old hat, and appears in every current vendor kernel I am
> aware of and is in 2.5.x as well.

then where can I find these patches? I cannot use 2.5, and I usually try to 
stick with an official kernel.

and - if this patch has been around all this time...

	why isn't it in the official kernel yet?

-- 
Roy Sigurd Karlsbakk, Datavaktmester
ProntoTV AS - http://www.pronto.tv/
Tel: +47 9801 3356

Computers are like air conditioners.
They stop working when you open Windows.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND] tuning linux for high network performance?
  2002-10-24 16:22       ` Denis Vlasenko
@ 2002-10-24 11:50         ` Russell King
  2002-10-24 12:42           ` bert hubert
  2002-10-24 17:41           ` Denis Vlasenko
  0 siblings, 2 replies; 39+ messages in thread
From: Russell King @ 2002-10-24 11:50 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: Roy Sigurd Karlsbakk, netdev, Kernel mailing list

On Thu, Oct 24, 2002 at 02:22:25PM -0200, Denis Vlasenko wrote:
> Please delete memory.o, rerun make bzImage, capture gcc
> command used for compiling memory.c, modify it:
> 
> gcc ... -o memory.o  ->  gcc ... -S -o memory.s ...

Have you tried make mm/memory.s ?

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND] tuning linux for high network performance?
  2002-10-24 11:50         ` Russell King
@ 2002-10-24 12:42           ` bert hubert
  2002-10-24 17:41           ` Denis Vlasenko
  1 sibling, 0 replies; 39+ messages in thread
From: bert hubert @ 2002-10-24 12:42 UTC (permalink / raw)
  To: Denis Vlasenko, Roy Sigurd Karlsbakk, netdev

On Thu, Oct 24, 2002 at 12:50:31PM +0100, Russell King wrote:
> > gcc ... -o memory.o  ->  gcc ... -S -o memory.s ...
> 
> Have you tried make mm/memory.s ?

or even make mm/memory.lst

-- 
http://www.PowerDNS.com          Versatile DNS Software & Services
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND] tuning linux for high network performance?
  2002-10-23 13:36     ` Roy Sigurd Karlsbakk
@ 2002-10-24 16:22       ` Denis Vlasenko
  2002-10-24 11:50         ` Russell King
  0 siblings, 1 reply; 39+ messages in thread
From: Denis Vlasenko @ 2002-10-24 16:22 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk, netdev; +Cc: Kernel mailing list

On 23 October 2002 11:36, Roy Sigurd Karlsbakk wrote:
> > > 905182 total                                      0.4741
> > > 121426 csum_partial_copy_generic                474.3203
> >
> > Well, maybe take a look at this func and try to optimize it?
>
> I don't know assembly that good - sorry.

Well, I like it. Maybe I can look into it. Feel free
to bug me :-)

> > >  93633 default_idle                             1800.6346
> > >  74665 do_wp_page                               111.1086
> >
> > What's this?
>
> do_wp_page is Defined as a function in: mm/memory.c
>
> comments from the file:
> [snip]

Please delete memory.o, rerun make bzImage, capture gcc
command used for compiling memory.c, modify it:

gcc ... -o memory.o  ->  gcc ... -S -o memory.s ...

and examine the assembler code. Maybe something will stick out
(or use objdump to disassemble memory.o; I recall a nice
option to produce assembler output with the C code intermixed
as comments!) (send the disassembled listing to me offlist).

> > >  65857 ide_intr                                 184.9916
> >
> > You have 1 ide_intr per 2 csum_partial_copy_generic... hmmm...
> > how large is your readahead? I assume you'd like to fetch
> > more sectors from ide per interrupt. (I hope you do DMA ;)
>
> doing DMA - RAID-0 with 1MB chunk size on 4 disks.

You should aim at maxing out IDE performance.
Please find out how many sectors you read in one go.
Maybe:

# cat /proc/interrupts
# dd bs=1M count=1 if=/dev/hda of=/dev/null
# cat /proc/interrupts

and calculate how many IDE interrupts happened. (1 MB = 2048 sectors)
--
vda

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND] tuning linux for high network performance?
  2002-10-24 11:50         ` Russell King
  2002-10-24 12:42           ` bert hubert
@ 2002-10-24 17:41           ` Denis Vlasenko
  2002-10-25 11:36             ` Csum and csum copyroutines benchmark Denis Vlasenko
  1 sibling, 1 reply; 39+ messages in thread
From: Denis Vlasenko @ 2002-10-24 17:41 UTC (permalink / raw)
  To: Russell King; +Cc: Roy Sigurd Karlsbakk, netdev, Kernel mailing list

On 24 October 2002 09:50, Russell King wrote:
> On Thu, Oct 24, 2002 at 02:22:25PM -0200, Denis Vlasenko wrote:
> > Please delete memory.o, rerun make bzImage, capture gcc
> > command used for compiling memory.c, modify it:
> >
> > gcc ... -o memory.o  ->  gcc ... -S -o memory.s ...
>
> Have you tried make mm/memory.s ?

No ;) but I have a feeling it will produce that file ;)))

I'm experimenting with different csum_ routines in userspace now.
--
vda

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Csum and csum copyroutines benchmark
  2002-10-25 11:36             ` Csum and csum copyroutines benchmark Denis Vlasenko
@ 2002-10-25  7:48               ` Momchil Velikov
  2002-10-25 13:59                 ` Denis Vlasenko
  2002-10-25 14:26               ` Daniel Egger
  1 sibling, 1 reply; 39+ messages in thread
From: Momchil Velikov @ 2002-10-25  7:48 UTC (permalink / raw)
  To: vda
  Cc: Russell King, Roy Sigurd Karlsbakk, netdev, Kernel mailing list,
	libc-alpha

>>>>> "Denis" == Denis Vlasenko <vda@port.imtp.ilyichevsk.odessa.ua> writes:

Denis> /me said:
>> I'm experimenting with different csum_ routines in userspace now.

Denis> Short conclusion: 
Denis> 1. It is possible to speed up csum routines for AMD processors by 30%.
Denis> 2. It is possible to speed up csum_copy routines for both AMD and Intel
Denis>    three times or more. Roy, do you like that? ;)

Additional data point:

Short summary:
1. Checksum - kernelpii_csum is ~19% faster
2. Copy - kernelpii_copy is ~6% faster

Dual Pentium III, 1266Mhz, 512K cache, 2G SDRAM (133Mhz, ECC)

The only changes I made were to decrease the buffer size to 1K (as I
think this is more representative of a network packet size, correct me
if I'm wrong) and increase the runs to 1024. Max values are worthless
indeed.


Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
                     kernel_csum - took   941 max,  740 min cycles per kb. sum=0x44000077
                     kernel_csum - took   748 max,  742 min cycles per kb. sum=0x44000077
                     kernel_csum - took 60559 max,  742 min cycles per kb. sum=0x44000077
                  kernelpii_csum - took 52804 max,  601 min cycles per kb. sum=0x44000077
                kernelpiipf_csum - took 12930 max,  601 min cycles per kb. sum=0x44000077
                        pfm_csum - took 10161 max, 1402 min cycles per kb. sum=0x44000077
                       pfm2_csum - took   864 max,  838 min cycles per kb. sum=0x44000077
copy tests:
                     kernel_copy - took   339 max,  239 min cycles per kb. sum=0x44000077
                     kernel_copy - took   239 max,  239 min cycles per kb. sum=0x44000077
                     kernel_copy - took   239 max,  239 min cycles per kb. sum=0x44000077
                  kernelpii_copy - took   244 max,  225 min cycles per kb. sum=0x44000077
                      ntqpf_copy - took 10867 max,  512 min cycles per kb. sum=0x44000077
                     ntqpfm_copy - took   710 max,  403 min cycles per kb. sum=0x44000077
                        ntq_copy - took  4535 max,  443 min cycles per kb. sum=0x44000077
                     ntqpf2_copy - took   563 max,  555 min cycles per kb. sum=0x44000077
Done


HOWEVER ...

sometimes (say 1/30) I get the following output:

Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
                     kernel_csum - took   958 max,  740 min cycles per kb. sum=0x44000077
                     kernel_csum - took   748 max,  740 min cycles per kb. sum=0x44000077
                     kernel_csum - took   752 max,  740 min cycles per kb. sum=0x44000077
                  kernelpii_csum - took   624 max,  600 min cycles per kb. sum=0x44000077
                kernelpiipf_csum - took 877211 max,  601 min cycles per kb. sum=0x44000077
Bad sum
Aborted

which is to say that pfm_csum and pfm2_csum results are not to be
trusted (at least on PIII (or my kernel CONFIG_MPENTIUMIII=y
config?)).

~velco

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Csum and csum copyroutines benchmark
  2002-10-25 13:59                 ` Denis Vlasenko
@ 2002-10-25  9:47                   ` Momchil Velikov
  2002-10-25 10:19                   ` Alan Cox
  1 sibling, 0 replies; 39+ messages in thread
From: Momchil Velikov @ 2002-10-25  9:47 UTC (permalink / raw)
  To: vda; +Cc: Russell King, Roy Sigurd Karlsbakk, netdev, Kernel mailing list

[-- Attachment #1: Type: text/plain, Size: 2863 bytes --]

>>>>> "Denis" == Denis Vlasenko <vda@port.imtp.ilyichevsk.odessa.ua> writes:

Denis> [please drop libc from CC:]
Denis> On 25 October 2002 05:48, Momchil Velikov wrote:
>>> Short conclusion:
>>> 1. It is possible to speed up csum routines for AMD processors
>>> by 30%.
>>> 2. It is possible to speed up csum_copy routines for both AMD
>>> andd Intel three times or more.

>> Additional data point:
>> 
>> Short summary:
>> 1. Checksum - kernelpii_csum is ~19% faster
>> 2. Copy - lernelpii_csum is ~6% faster
>> 
>> Dual Pentium III, 1266Mhz, 512K cache, 2G SDRAM (133Mhz, ECC)
>> 
>> The only changes I made were to decrease the buffer size to 1K (as I
>> think this is more representative to a network packet size, correct
>> me if I'm wrong) and increase the runs to 1024. Max values are
>> worthless indeed.

Denis> Well, that makes it run entirely in L0 cache. This is unrealistic
Denis> for actual use. movntq is x3 faster when you hit RAM instead of L0.

Oops ...

Denis> You need to be more clever than that - generate pseudo-random
Denis> offsets in large buffer and run on ~1K pieces of that buffer.

Here it is:

Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
                     kernel_csum - took  8678 max,  808 min cycles per kb. sum=0x400270e8
                     kernel_csum - took   941 max,  808 min cycles per kb. sum=0x400270e8
                     kernel_csum - took 11604 max,  808 min cycles per kb. sum=0x400270e8
                  kernelpii_csum - took 28839 max,  664 min cycles per kb. sum=0x400270e8
                kernelpiipf_csum - took  9163 max,  665 min cycles per kb. sum=0x400270e8
                        pfm_csum - took  2788 max, 1470 min cycles per kb. sum=0x400270e8
                       pfm2_csum - took  1179 max,  915 min cycles per kb. sum=0x400270e8
copy tests:
                     kernel_copy - took   688 max,  263 min cycles per kb. sum=0x400270e8
                     kernel_copy - took   456 max,  263 min cycles per kb. sum=0x400270e8
                     kernel_copy - took 11241 max,  263 min cycles per kb. sum=0x400270e8
                  kernelpii_copy - took  7635 max,  246 min cycles per kb. sum=0x400270e8
                      ntqpf_copy - took  5349 max,  536 min cycles per kb. sum=0x400270e8
                     ntqpfm_copy - took   769 max,  425 min cycles per kb. sum=0x400270e8
                        ntq_copy - took   672 max,  469 min cycles per kb. sum=0x400270e8
                     ntqpf2_copy - took  8000 max,  579 min cycles per kb. sum=0x400270e8
Done

Ran on a 512K (my cache size) buffer, choosing each time a 1K
piece. (making the buffer larger (2M, 4M) does not make any
difference).

And the modified 0main.c is attached.

~velco

[-- Attachment #2: 0main.c --]
[-- Type: text/x-csrc, Size: 3996 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>	/* for the memset()/memcmp() used by the POISON checks */

#define NAME(a) \
unsigned int a##csum(const unsigned char * buff, int len, \
			unsigned int sum); \
unsigned int a##copy(const char *src, char *dst, \
                        int len, int sum, int *src_err_ptr, int *dst_err_ptr)
			
/* This makes adding/removing test functions easier */
/* asm ones... */
NAME(kernel_);
NAME(kernelpii_);
NAME(kernelpiipf_);
/* and C */
#include "pfm_csum.c"
#include "pfm2_csum.c"
#include "ntq_copy.c"
#include "ntqpf_copy.c"
#include "ntqpf2_copy.c"
#include "ntqpfm_copy.c"

const int TRY_TIMES = 1024;
const int NBUFS = 512;
const int BUFSIZE = 1024;
const int POISON = 0; // want to check correctness?

typedef unsigned int csum_func(const unsigned char * buff, int len,
		unsigned int sum);
typedef unsigned int copy_func(const char *src, char *dst,
		int len, int sum, int *src_err_ptr, int *dst_err_ptr);

static inline long long rdtsc()
{
	unsigned int low,high;
	__asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high));
	return low + (((long long)high)<<32);
}

int die(const char *msg) {
	puts(msg);
	abort();
	return 1;
}

unsigned test_one_csum(csum_func *func, char *name, char *buffer)
{
	int i;
	unsigned long long before,after,min,max;
	unsigned sum;
	
	// pick fastest run
	min = ~0ULL;
	max = 0;
	for (i=0;i<TRY_TIMES;i++) {
		before = rdtsc();
		unsigned sum2 = func(buffer + (rand () % NBUFS) * BUFSIZE,
				     BUFSIZE, 0);
		after = rdtsc();
		if (before>after) die("timer overflow");
		else {
			after-=before;
			if(min>after) min=after;
			if(max<after) max=after;
		}		
	}
	printf("%32s - took %5lli max,%5lli min cycles per kb. sum=0x%08x\n",
		name,
		max / (BUFSIZE/1024),
		min / (BUFSIZE/1024),
		sum
		);
}
     
unsigned test_one_copy(copy_func *func, char *name, char *buffer)
{
	int i;
	unsigned long long before,after,min,max;
	unsigned sum;
	int err;

	// pick fastest run
	min = ~0ULL;
	max = 0;
	for (i=0; i<TRY_TIMES; i++) {
		if(POISON) memset(buffer,          0x55,BUFSIZE/2);
		if(POISON) memset(buffer+BUFSIZE/2,0xaa,BUFSIZE/2);
		buffer[0] = 0x77;
		buffer[BUFSIZE/2-1] = 0x44;
		before = rdtsc();
		char *buf = buffer + rand () % (NBUFS - 1);
		unsigned sum2 = func(buf,buf+BUFSIZE/2,BUFSIZE/2,0,&err,&err);
		after = rdtsc();
		if(POISON) if(memcmp(buffer,buffer+BUFSIZE/2,BUFSIZE/2)!=0) die("BAD copy!");
		if (before>after) die("timer overflow");
		else {
			after-=before;
			if(min>after) min=after;
			if(max<after) max=after;
		}		
	}
	printf("%32s - took %5lli max,%5lli min cycles per kb. sum=0x%08x\n",
		name,
		max / (BUFSIZE/1024) / 2,
		min / (BUFSIZE/1024) / 2,
		sum
	);
	return sum;
}
     
     
void test_csum(char *buffer)
{
	unsigned sum;
	puts("csum tests:");

#define	TEST_CSUM(a) test_one_csum(a,#a,buffer)
	TEST_CSUM(kernel_csum	);
	TEST_CSUM(kernel_csum	);
	TEST_CSUM(kernel_csum	);
	TEST_CSUM(kernelpii_csum	);
	TEST_CSUM(kernelpiipf_csum);
	TEST_CSUM(pfm_csum	);
	TEST_CSUM(pfm2_csum	);
#undef TEST_CSUM
}   

void test_copy(char *buffer)
{
	unsigned sum;
	puts("copy tests:");

#define	TEST_COPY(a) test_one_copy(a,#a,buffer)
	sum =  TEST_COPY(kernel_copy	);
	sum == TEST_COPY(kernel_copy	) || die("Bad sum");
	sum == TEST_COPY(kernel_copy	) || die("Bad sum");
	sum == TEST_COPY(kernelpii_copy	) || die("Bad sum");
	sum == TEST_COPY(ntqpf_copy	) || die("Bad sum");
	sum == TEST_COPY(ntqpfm_copy	) || die("Bad sum");
	sum == TEST_COPY(ntq_copy	) || die("Bad sum");
	sum == TEST_COPY(ntqpf2_copy	) || die("Bad sum");
#undef TEST_COPY
}

int main()
{
	char *buffer_raw,*buffer;
	printf("Csum benchmark program\n"
		"buffer size: %i K\n"
		"Each test tried %i times, max and min CPU cycles are reported.\n"
		"Please disregard max values. They are due to system interference only.\n",
		BUFSIZE/1024,
		TRY_TIMES
	);
	
	buffer_raw = malloc(NBUFS * BUFSIZE+16);
	if(!buffer_raw) die("Malloc failed");
		
	buffer = (char*) ((((int)buffer_raw)+15) & (~0xF));
	
	test_csum(buffer);
	test_copy(buffer);

	puts("Done");
	free(buffer_raw);
	return 0;
}

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Csum and csum copyroutines benchmark
  2002-10-25 13:59                 ` Denis Vlasenko
  2002-10-25  9:47                   ` Momchil Velikov
@ 2002-10-25 10:19                   ` Alan Cox
  2002-10-25 16:00                     ` Denis Vlasenko
  1 sibling, 1 reply; 39+ messages in thread
From: Alan Cox @ 2002-10-25 10:19 UTC (permalink / raw)
  To: vda
  Cc: Momchil Velikov, Russell King, Roy Sigurd Karlsbakk, netdev,
	Linux Kernel Mailing List

On Fri, 2002-10-25 at 14:59, Denis Vlasenko wrote:
> Well, that makes it run entirely in L0 cache. This is unrealistic
> for actual use. movntq is x3 faster when you hit RAM instead of L0.
> 
> You need to be more clever than that - generate pseudo-random
> offsets in large buffer and run on ~1K pieces of that buffer.

In a lot of cases it's extremely realistic to assume the network buffers
are in cache. The copy/csum path is often touching just generated data,
or data we just accessed via read(). The csum RX path from a card with
DMA is probably somewhat different.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Csum and csum copyroutines benchmark
  2002-10-24 17:41           ` Denis Vlasenko
@ 2002-10-25 11:36             ` Denis Vlasenko
  2002-10-25  7:48               ` Momchil Velikov
  2002-10-25 14:26               ` Daniel Egger
  0 siblings, 2 replies; 39+ messages in thread
From: Denis Vlasenko @ 2002-10-25 11:36 UTC (permalink / raw)
  To: Russell King, Roy Sigurd Karlsbakk, netdev, Kernel mailing list
  Cc: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 5579 bytes --]

/me said:
> I'm experimenting with different csum_ routines in userspace now.

Short conclusion: 
1. It is possible to speed up csum routines for AMD processors by 30%.
2. It is possible to speed up csum_copy routines for both AMD and Intel
   three times or more. Roy, do you like that? ;)

Tests: they checksum a 4MB block and csum_copy 2MB into a second 2MB.
POISON=0/1 controls whether to perform correctness tests or not.
That slows down the test very noticeably. What does glibc use for
memset/memcmp? A for() loop?!!

With POISON=1 ntqpf2_copy bugs out, see its source. I left it in
to save others from repeating my work. BTW, I do NOT understand why
it does not work. ;) Anyone with a cluebat?

IMHO the only way to make it optimal for all CPUs is to make these
functions race at kernel init and pick the best one.
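
A rough userspace sketch of that idea, reusing the candidate names from the
attached benchmark (the kernel already does something similar when it picks
its RAID xor routine at boot):

typedef unsigned int csum_func(const unsigned char *buff, int len, unsigned int sum);

/* candidates, same signature as in the benchmark */
extern csum_func kernel_csum, kernelpii_csum, pfm2_csum;

static csum_func *best_csum;	/* everybody calls through this afterwards */

static inline unsigned long long rdtsc64(void)
{
	unsigned int lo, hi;
	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return lo | ((unsigned long long)hi << 32);
}

static void pick_best_csum(const unsigned char *buf, int len)
{
	csum_func *cand[] = { kernel_csum, kernelpii_csum, pfm2_csum };
	unsigned long long t0, t1, best = ~0ULL;
	unsigned int i;

	for (i = 0; i < sizeof(cand) / sizeof(cand[0]); i++) {
		t0 = rdtsc64();
		cand[i](buf, len, 0);		/* time one run of each candidate */
		t1 = rdtsc64();
		if (t1 - t0 < best) {
			best = t1 - t0;
			best_csum = cand[i];
		}
	}
}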

tests on Celeron 1200 (100 MHz, x12 core)
=========================================
Csum benchmark program
buffer size: 4 Mb
Each test tried 16 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
                     kernel_csum - took   717 max,  704 min cycles per kb. sum=0x44000077
                     kernel_csum - took  4760 max,  704 min cycles per kb. sum=0x44000077
                     kernel_csum - took   722 max,  704 min cycles per kb. sum=0x44000077
                  kernelpii_csum - took   539 max,  528 min cycles per kb. sum=0x44000077
                kernelpiipf_csum - took   573 max,  529 min cycles per kb. sum=0x44000077
                        pfm_csum - took  1411 max, 1306 min cycles per kb. sum=0x44000077
                       pfm2_csum - took   875 max,  762 min cycles per kb. sum=0x44000077
copy tests:
                     kernel_copy - took  5738 max, 3423 min cycles per kb. sum=0x99aaaacc
                     kernel_copy - took  3517 max, 3431 min cycles per kb. sum=0x99aaaacc
                     kernel_copy - took  4385 max, 3432 min cycles per kb. sum=0x99aaaacc
                  kernelpii_copy - took  2912 max, 2752 min cycles per kb. sum=0x99aaaacc
                      ntqpf_copy - took  2010 max, 1700 min cycles per kb. sum=0x99aaaacc
                     ntqpfm_copy - took  1749 max, 1701 min cycles per kb. sum=0x99aaaacc
                        ntq_copy - took  2218 max, 2141 min cycles per kb. sum=0x99aaaacc
BAD copy! <-- ntqpf2_copy is buggy :) see its source
'copy tests' above are with POISON=1
These are with POISON=0:
                     kernel_copy - took  2009 max, 1935 min cycles per kb. sum=0x44000077
                     kernel_copy - took  2240 max, 1959 min cycles per kb. sum=0x44000077
                     kernel_copy - took  2197 max, 1936 min cycles per kb. sum=0x44000077
                  kernelpii_copy - took  2121 max, 1939 min cycles per kb. sum=0x44000077
                      ntqpf_copy - took   667 max,  548 min cycles per kb. sum=0x44000077
                     ntqpfm_copy - took   651 max,  546 min cycles per kb. sum=0x44000077
                        ntq_copy - took   660 max,  545 min cycles per kb. sum=0x44000077
                     ntqpf2_copy - took   644 max,  548 min cycles per kb. sum=0x44000077
Done

Tests on Duron 650 (100 MHz, x6,5 core)
=======================================
Csum benchmark program
buffer size: 4 Mb
Each test tried 16 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
                     kernel_csum - took  1090 max, 1051 min cycles per kb. sum=0x44000077
                     kernel_csum - took  1080 max, 1052 min cycles per kb. sum=0x44000077
                     kernel_csum - took  1178 max, 1058 min cycles per kb. sum=0x44000077
                  kernelpii_csum - took  1614 max, 1052 min cycles per kb. sum=0x44000077
                kernelpiipf_csum - took   976 max,  962 min cycles per kb. sum=0x44000077
                        pfm_csum - took   755 max,  746 min cycles per kb. sum=0x44000077
                       pfm2_csum - took   749 max,  745 min cycles per kb. sum=0x44000077
copy tests:
                     kernel_copy - took  1251 max, 1072 min cycles per kb. sum=0x99aaaacc
                     kernel_copy - took  1363 max, 1072 min cycles per kb. sum=0x99aaaacc
                     kernel_copy - took  1352 max, 1072 min cycles per kb. sum=0x99aaaacc
                  kernelpii_copy - took  1132 max, 1014 min cycles per kb. sum=0x99aaaacc
                      ntqpf_copy - took   514 max,  480 min cycles per kb. sum=0x99aaaacc
                     ntqpfm_copy - took   495 max,  482 min cycles per kb. sum=0x99aaaacc
                        ntq_copy - took  1153 max,  948 min cycles per kb. sum=0x99aaaacc
BAD copy! <-- ntqpf2_copy is buggy :) see its source
'copy tests' above are with POISON=1
These are with POISON=0:
                     kernel_copy - took  1145 max,  871 min cycles per kb. sum=0x44000077
                     kernel_copy - took   879 max,  871 min cycles per kb. sum=0x44000077
                     kernel_copy - took   876 max,  871 min cycles per kb. sum=0x44000077
                  kernelpii_copy - took  1019 max,  845 min cycles per kb. sum=0x44000077
                      ntqpf_copy - took  2972 max,  229 min cycles per kb. sum=0x44000077
                     ntqpfm_copy - took   248 max,  245 min cycles per kb. sum=0x44000077
                        ntq_copy - took   460 max,  452 min cycles per kb. sum=0x44000077
                     ntqpf2_copy - took   390 max,  340 min cycles per kb. sum=0x44000077
Done
--
vda

[-- Attachment #2: timing_csum_copy.tar.bz2 --]
[-- Type: application/x-bzip2, Size: 6589 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Csum and csum copyroutines benchmark
  2002-10-25  7:48               ` Momchil Velikov
@ 2002-10-25 13:59                 ` Denis Vlasenko
  2002-10-25  9:47                   ` Momchil Velikov
  2002-10-25 10:19                   ` Alan Cox
  0 siblings, 2 replies; 39+ messages in thread
From: Denis Vlasenko @ 2002-10-25 13:59 UTC (permalink / raw)
  To: Momchil Velikov
  Cc: Russell King, Roy Sigurd Karlsbakk, netdev, Kernel mailing list

[please drop libc from CC:]

On 25 October 2002 05:48, Momchil Velikov wrote:
>> Short conclusion:
>> 1. It is possible to speed up csum routines for AMD processors
>>    by 30%.
>> 2. It is possible to speed up csum_copy routines for both AMD
>>    andd Intel three times or more.

> Additional data point:
>
> Short summary:
> 1. Checksum - kernelpii_csum is ~19% faster
> 2. Copy - lernelpii_csum is ~6% faster
>
> Dual Pentium III, 1266Mhz, 512K cache, 2G SDRAM (133Mhz, ECC)
>
> The only changes I made were to decrease the buffer size to 1K (as I
> think this is more representative to a network packet size, correct
> me if I'm wrong) and increase the runs to 1024. Max values are
> worthless indeed.

Well, that makes it run entirely in L0 cache. This is unrealistic
for actual use. movntq is x3 faster when you hit RAM instead of L0.

You need to be more clever than that - generate pseudo-random
offsets in large buffer and run on ~1K pieces of that buffer.

> HOWEVER ...
>
> sometimes (say 1/30) I get the following output:

Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
                     kernel_csum - took   958 max,  740 min cycles per kb. sum=0x44000077
                     kernel_csum - took   748 max,  740 min cycles per kb. sum=0x44000077
                     kernel_csum - took   752 max,  740 min cycles per kb. sum=0x44000077
                  kernelpii_csum - took   624 max,  600 min cycles per kb. sum=0x44000077
                kernelpiipf_csum - took 877211 max,  601 min cycles per kb. sum=0x44000077
Bad sum
Aborted

> which is to say that pfm_csum and pfm2_csum results are not to be
> trusted (at least on PIII (or my kernel CONFIG_MPENTIUMIII=y
> config?)).

No, it's my fault. Those routines are fast-hacked together; they
can actually csum too little. I didn't get to handle arbitrary
buffer lengths, assuming it is a large power of two. See the source.
--
vda

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Csum and csum copyroutines benchmark
  2002-10-25 11:36             ` Csum and csum copyroutines benchmark Denis Vlasenko
  2002-10-25  7:48               ` Momchil Velikov
@ 2002-10-25 14:26               ` Daniel Egger
  1 sibling, 0 replies; 39+ messages in thread
From: Daniel Egger @ 2002-10-25 14:26 UTC (permalink / raw)
  To: vda; +Cc: Kernel mailing list

[-- Attachment #1: Type: text/plain, Size: 1100 bytes --]

On Fri, 2002-10-25 at 13:36, Denis Vlasenko wrote:

On a Via Ezra 667 I get this:
Csum benchmark program
buffer size: 4 Mb
Each test tried 16 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
                     kernel_csum - took  2739 max, 2727 min cycles per kb. sum=0x44000077
                     kernel_csum - took  2733 max, 2727 min cycles per kb. sum=0x44000077
                     kernel_csum - took  2733 max, 2727 min cycles per kb. sum=0x44000077
                  kernelpii_csum - took  2691 max, 2686 min cycles per kb. sum=0x44000077
copy tests:
                     kernel_copy - took  2044 max, 2014 min cycles per kb. sum=0x44000077
                     kernel_copy - took  2026 max, 2016 min cycles per kb. sum=0x44000077
                     kernel_copy - took  2061 max, 2016 min cycles per kb. sum=0x44000077
                  kernelpii_copy - took  1526 max, 1523 min cycles per kb. sum=0x44000077
Done

The nt* functions do not work on this CPU.

-- 
Servus,
       Daniel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Csum and csum copyroutines benchmark
  2002-10-25 10:19                   ` Alan Cox
@ 2002-10-25 16:00                     ` Denis Vlasenko
  0 siblings, 0 replies; 39+ messages in thread
From: Denis Vlasenko @ 2002-10-25 16:00 UTC (permalink / raw)
  To: Alan Cox
  Cc: Momchil Velikov, Russell King, Roy Sigurd Karlsbakk, netdev,
	Linux Kernel Mailing List

On 25 October 2002 08:19, Alan Cox wrote:
> On Fri, 2002-10-25 at 14:59, Denis Vlasenko wrote:
> > Well, that makes it run entirely in L0 cache. This is unrealistic
> > for actual use. movntq is x3 faster when you hit RAM instead of L0.
> >
> > You need to be more clever than that - generate pseudo-random
> > offsets in large buffer and run on ~1K pieces of that buffer.
>
> In a lot of cases its extremely realistic to assume the network
> buffers are in cache. The copy/csum path is often touching just
> generated data, or data we just accessed via read(). The csum RX path
> from a card with DMA is probably somewhat different.

'Touching' is not interesting since it will pump data
into the cache, no matter how you 'touch' it.

Running benchmarks against a 1K static buffer makes the cache red hot
and causes _all writes_ to hit it. It may lead to wrong conclusions.

Is _dst_ buffer of csum_copy going to be used by processor soon?
If yes, we shouldn't use movntq, we want to cache dst.
If no, we should by all means use movntq.
If sometimes, then optimal strategy does not exist. :(
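
For reference, a sketch of the movntq-style copy being weighed here, written
with SSE intrinsics instead of hand asm (assumes a CPU with SSE and something
like gcc -march=pentium3; dst/src 8-byte aligned and len a multiple of 8):

#include <mmintrin.h>	/* __m64, _mm_empty */
#include <xmmintrin.h>	/* _mm_stream_pi, _mm_sfence */
#include <stddef.h>

/* copy without polluting the cache: each 8-byte store goes through the
 * write-combining buffers straight to RAM */
static void copy_nocache(void *dst, const void *src, size_t len)
{
	const __m64 *s = (const __m64 *)src;
	__m64 *d = (__m64 *)dst;
	size_t i, n = len / 8;

	for (i = 0; i < n; i++)
		_mm_stream_pi(&d[i], s[i]);	/* movntq */

	_mm_sfence();	/* order the streaming stores before later writes */
	_mm_empty();	/* emms: hand the registers back to the FPU */
}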
--
vda

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2002-10-25 14:29 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-10-23 10:18 tuning linux for high network performance? Roy Sigurd Karlsbakk
2002-10-23 11:06 ` [RESEND] " Roy Sigurd Karlsbakk
2002-10-23 13:01   ` bert hubert
2002-10-23 13:21     ` David S. Miller
2002-10-23 13:42       ` Roy Sigurd Karlsbakk
2002-10-23 17:01         ` bert hubert
2002-10-23 17:10           ` Ben Greear
2002-10-23 17:11           ` Richard B. Johnson
2002-10-23 17:12           ` Nivedita Singhvi
2002-10-23 17:56             ` Richard B. Johnson
2002-10-23 18:07               ` Nivedita Singhvi
2002-10-23 18:30                 ` Richard B. Johnson
2002-10-24  4:11         ` David S. Miller
2002-10-24  9:37           ` Karen Shaeffer
2002-10-24 10:30           ` sendfile64() anyone? (was [RESEND] tuning linux for high network performance?) Roy Sigurd Karlsbakk
2002-10-24 10:47             ` David S. Miller
2002-10-24 11:07               ` Roy Sigurd Karlsbakk
2002-10-23 13:41     ` [RESEND] tuning linux for high network performance? Roy Sigurd Karlsbakk
2002-10-23 14:59     ` Nivedita Singhvi
2002-10-23 15:26       ` O_DIRECT sockets? (was [RESEND] tuning linux for high network performance?) Roy Sigurd Karlsbakk
2002-10-23 16:34         ` Nivedita Singhvi
2002-10-23 16:34           ` Nivedita Singhvi
2002-10-24 10:14           ` Roy Sigurd Karlsbakk
2002-10-24 10:46             ` David S. Miller
2002-10-24 10:46               ` David S. Miller
2002-10-23 18:01   ` [RESEND] tuning linux for high network performance? Denis Vlasenko
2002-10-23 13:36     ` Roy Sigurd Karlsbakk
2002-10-24 16:22       ` Denis Vlasenko
2002-10-24 11:50         ` Russell King
2002-10-24 12:42           ` bert hubert
2002-10-24 17:41           ` Denis Vlasenko
2002-10-25 11:36             ` Csum and csum copyroutines benchmark Denis Vlasenko
2002-10-25  7:48               ` Momchil Velikov
2002-10-25 13:59                 ` Denis Vlasenko
2002-10-25  9:47                   ` Momchil Velikov
2002-10-25 10:19                   ` Alan Cox
2002-10-25 16:00                     ` Denis Vlasenko
2002-10-25 14:26               ` Daniel Egger
2002-10-23 14:52     ` [RESEND] tuning linux for high network performance? Nivedita Singhvi
