linux-kernel.vger.kernel.org archive mirror
* Possible bug from kernel 2.6.22 and above
@ 2007-11-21 20:34 Jie Chen
  2007-11-21 22:14 ` Eric Dumazet
  2007-12-05 20:36 ` Possible bug from kernel 2.6.22 and above Peter Zijlstra
  0 siblings, 2 replies; 35+ messages in thread
From: Jie Chen @ 2007-11-21 20:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jie Chen

[-- Attachment #1: Type: text/plain, Size: 3835 bytes --]

Hi, there:

     We have a simple pthread program that measures the synchronization
overheads of various synchronization mechanisms such as spin locks and
barriers (the barrier is implemented using a queue-based barrier
algorithm). We have dual quad-core AMD Opteron (Barcelona) clusters
currently running the 2.6.23.8 kernel with the Fedora Core 7
distribution. Before we moved to this kernel, we ran kernel 2.6.21.
The two kernels are configured identically and compiled with the same
gcc 4.1.2 compiler. Under the old kernel, we observed that these
overheads increase as the number of threads increases from 2 to 8. The
following are the total times and overheads for all threads acquiring a
pthread spin lock and for all threads executing a barrier
synchronization call.

Kernel 2.6.21
Number of Threads              2          4           6         8
SpinLock (Time micro second)   10.5618    10.58538    10.5915   10.643
                   (Overhead)   0.073      0.05746     0.102805  0.154563
Barrier (Time micro second)    11.020410  11.678125   11.9889   12.38002
                  (Overhead)    0.531660   1.1502      1.500112 1.891617

Each thread is bound to a particular core using pthread_setaffinity_np.
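(For reference, a minimal sketch of how such per-core binding can be done
with pthread_setaffinity_np; the helper name and the way it is called are
illustrative, not taken from our actual test code.)

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Illustrative helper (not the actual test code): pin the calling
 * thread to a single core. */
static int bind_to_core(int core)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(core, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}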

Kernel 2.6.23.8
Number of Threads              2          4           6         8
SpinLock (Time micro second)   14.849915  17.117603   14.4496   10.5990
                  (Overhead)    4.345417   6.617207    3.949435  0.110985
Barrier (Time micro second)    19.462255  20.285117   16.19395  12.37662
                  (Overhead)    8.957755   9.784722    5.699590  1.869518

It is clear that the synchronization overhead increases as the number
of threads increases on kernel 2.6.21. But the synchronization
overhead actually decreases as the number of threads increases on
kernel 2.6.23.8 (we observed the same behavior on kernel 2.6.22 as
well). This is certainly not correct behavior. The kernels are
configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
CONFIG_PREEMPT_NONE and CONFIG_DISCONTIGMEM set. The complete kernel
configuration file is attached to this e-mail.

From what we have read, a new scheduler (CFS) appeared in 2.6.22. We
are not sure whether the above behavior is caused by the new
scheduler.

Finally, our machine's CPU information is listed below:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 2
model name      : Quad-Core AMD Opteron(tm) Processor 2347
stepping        : 10
cpu MHz         : 1909.801
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp
  lm 3dnowext 3dnow constant_tsc rep_good pni cx16 popcnt lahf_lm 
cmp_legacy svm
extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips        : 3822.95
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

In addition, we have schedstat and sched_debug files in the /proc 
directory.

Thank you for all your help in solving this puzzle. If you need more
information, please let us know.


P.S. I would like to be cc'ed on discussions related to this problem.


###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
chen@jlab.org
###############################################


[-- Attachment #2: kernel-2.6.23.8-config --]
[-- Type: text/plain, Size: 19970 bytes --]

CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_IKCONFIG=m
CONFIG_IKCONFIG_PROC=y
CONFIG_CPUSETS=y
CONFIG_SYSFS_DEPRECATED=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_SYSCTL=y
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
CONFIG_RT_MUTEXES=y
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_CFQ=y
CONFIG_X86_PC=y
CONFIG_MK8=y
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_MICROCODE=m
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
CONFIG_SMP=y
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_ARCH_DISCONTIGMEM_ENABLE=y
CONFIG_ARCH_DISCONTIGMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_DISCONTIGMEM_MANUAL=y
CONFIG_DISCONTIGMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_MIGRATION=y
CONFIG_RESOURCES_64BIT=y
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_IOMMU=y
CONFIG_SWIOTLB=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_AMD=y
CONFIG_KEXEC=y
CONFIG_HZ_100=y
CONFIG_K8_NB=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_ISA_DMA_API=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_PM=y
CONFIG_SUSPEND_SMP_POSSIBLE=y
CONFIG_HIBERNATION_SMP_POSSIBLE=y
CONFIG_ACPI=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=m
CONFIG_ACPI_BATTERY=m
CONFIG_ACPI_BUTTON=m
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=m
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
CONFIG_ACPI_ASUS=m
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=m
CONFIG_PCIEAER=y
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
CONFIG_HT_IRQ=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_FAKE=m
CONFIG_HOTPLUG_PCI_ACPI=m
CONFIG_HOTPLUG_PCI_ACPI_IBM=m
CONFIG_HOTPLUG_PCI_SHPC=m
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=y
CONFIG_IA32_EMULATION=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_NET=y
CONFIG_PACKET=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=y
CONFIG_NET_KEY=m
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_FIB_HASH=y
CONFIG_INET_XFRM_MODE_BEET=y
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
CONFIG_TCP_CONG_CUBIC=y
CONFIG_NETWORK_SECMARK=y
CONFIG_BRIDGE=m
CONFIG_VLAN_8021Q=m
CONFIG_LLC=m
CONFIG_NET_PKTGEN=m
CONFIG_IRDA=m
CONFIG_IRLAN=m
CONFIG_IRCOMM=m
CONFIG_IRDA_CACHE_LAST_LSAP=y
CONFIG_IRDA_FAST_RR=y
CONFIG_IRTTY_SIR=m
CONFIG_DONGLE=y
CONFIG_ESI_DONGLE=m
CONFIG_ACTISYS_DONGLE=m
CONFIG_TEKRAM_DONGLE=m
CONFIG_TOIM3232_DONGLE=m
CONFIG_LITELINK_DONGLE=m
CONFIG_MA600_DONGLE=m
CONFIG_GIRBIL_DONGLE=m
CONFIG_MCP2120_DONGLE=m
CONFIG_OLD_BELKIN_DONGLE=m
CONFIG_ACT200L_DONGLE=m
CONFIG_USB_IRDA=m
CONFIG_SIGMATEL_FIR=m
CONFIG_NSC_FIR=m
CONFIG_WINBOND_FIR=m
CONFIG_SMC_IRCC_FIR=m
CONFIG_ALI_FIR=m
CONFIG_VLSI_FIR=m
CONFIG_VIA_FIR=m
CONFIG_MCS_FIR=m
CONFIG_BT=m
CONFIG_BT_L2CAP=m
CONFIG_BT_SCO=m
CONFIG_BT_RFCOMM=m
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=m
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
CONFIG_BT_HIDP=m
CONFIG_BT_HCIUSB=m
CONFIG_BT_HCIUSB_SCO=y
CONFIG_BT_HCIUART=m
CONFIG_BT_HCIUART_H4=y
CONFIG_BT_HCIUART_BCSP=y
CONFIG_BT_HCIBCM203X=m
CONFIG_BT_HCIBPA10X=m
CONFIG_BT_HCIBFUSB=m
CONFIG_BT_HCIVHCI=m
CONFIG_WIRELESS_EXT=y
CONFIG_IEEE80211=m
CONFIG_IEEE80211_CRYPT_WEP=m
CONFIG_IEEE80211_CRYPT_CCMP=m
CONFIG_IEEE80211_SOFTMAC=m
CONFIG_IEEE80211_SOFTMAC_DEBUG=y
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_SERIAL=m
CONFIG_PARPORT_NOT_PC=y
CONFIG_PNP=y
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_FD=m
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_CRYPTOLOOP=m
CONFIG_BLK_DEV_NBD=m
CONFIG_BLK_DEV_SX8=m
CONFIG_BLK_DEV_UB=m
CONFIG_BLK_DEV_RAM=y
CONFIG_MISC_DEVICES=y
CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
CONFIG_BLK_DEV_IDECD=m
CONFIG_BLK_DEV_IDEFLOPPY=y
CONFIG_BLK_DEV_IDESCSI=m
CONFIG_IDE_TASK_IOCTL=y
CONFIG_IDE_PROC_FS=y
CONFIG_IDE_GENERIC=y
CONFIG_BLK_DEV_IDEPNP=y
CONFIG_BLK_DEV_IDEPCI=y
CONFIG_IDEPCI_SHARE_IRQ=y
CONFIG_IDEPCI_PCIBUS_ORDER=y
CONFIG_BLK_DEV_GENERIC=y
CONFIG_BLK_DEV_IDEDMA_PCI=y
CONFIG_BLK_DEV_AEC62XX=y
CONFIG_BLK_DEV_ALI15X3=y
CONFIG_BLK_DEV_AMD74XX=y
CONFIG_BLK_DEV_ATIIXP=y
CONFIG_BLK_DEV_CMD64X=y
CONFIG_BLK_DEV_HPT34X=y
CONFIG_BLK_DEV_HPT366=y
CONFIG_BLK_DEV_PIIX=y
CONFIG_BLK_DEV_IT821X=y
CONFIG_BLK_DEV_PDC202XX_OLD=y
CONFIG_BLK_DEV_PDC202XX_NEW=y
CONFIG_BLK_DEV_SVWKS=y
CONFIG_BLK_DEV_SIIMAGE=y
CONFIG_BLK_DEV_SIS5513=y
CONFIG_BLK_DEV_VIA82CXXX=y
CONFIG_BLK_DEV_IDEDMA=y
CONFIG_RAID_ATTRS=m
CONFIG_SCSI=m
CONFIG_SCSI_DMA=y
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y
CONFIG_BLK_DEV_SD=m
CONFIG_CHR_DEV_ST=m
CONFIG_CHR_DEV_OSST=m
CONFIG_BLK_DEV_SR=m
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=m
CONFIG_CHR_DEV_SCH=m
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_WAIT_SCAN=m
CONFIG_SCSI_SPI_ATTRS=m
CONFIG_SCSI_FC_ATTRS=m
CONFIG_SCSI_ISCSI_ATTRS=m
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m
CONFIG_BLK_DEV_3W_XXXX_RAID=m
CONFIG_SCSI_3W_9XXX=m
CONFIG_SCSI_ACARD=m
CONFIG_SCSI_AACRAID=m
CONFIG_SCSI_AIC7XXX=m
CONFIG_SCSI_AIC7XXX_OLD=m
CONFIG_SCSI_AIC79XX=m
CONFIG_MEGARAID_NEWGEN=y
CONFIG_MEGARAID_MM=m
CONFIG_MEGARAID_MAILBOX=m
CONFIG_MEGARAID_LEGACY=m
CONFIG_MEGARAID_SAS=m
CONFIG_SCSI_HPTIOP=m
CONFIG_SCSI_BUSLOGIC=m
CONFIG_SCSI_GDTH=m
CONFIG_SCSI_IPS=m
CONFIG_SCSI_INITIO=m
CONFIG_SCSI_INIA100=m
CONFIG_SCSI_PPA=m
CONFIG_SCSI_IMM=m
CONFIG_SCSI_SYM53C8XX_2=m
CONFIG_SCSI_SYM53C8XX_MMIO=y
CONFIG_SCSI_QLOGIC_1280=m
CONFIG_SCSI_QLA_FC=m
CONFIG_SCSI_LPFC=m
CONFIG_SCSI_DC395x=m
CONFIG_SCSI_DC390T=m
CONFIG_ATA=m
CONFIG_ATA_ACPI=y
CONFIG_SATA_SVW=m
CONFIG_ATA_PIIX=m
CONFIG_SATA_NV=m
CONFIG_PDC_ADMA=m
CONFIG_SATA_QSTOR=m
CONFIG_SATA_PROMISE=m
CONFIG_SATA_SX4=m
CONFIG_SATA_SIL=m
CONFIG_SATA_SIL24=m
CONFIG_SATA_SIS=m
CONFIG_SATA_ULI=m
CONFIG_SATA_VIA=m
CONFIG_SATA_VITESSE=m
CONFIG_PATA_AMD=m
CONFIG_PATA_ATIIXP=m
CONFIG_PATA_CS5520=m
CONFIG_PATA_EFAR=m
CONFIG_ATA_GENERIC=m
CONFIG_PATA_IT821X=m
CONFIG_PATA_JMICRON=m
CONFIG_PATA_TRIFLEX=m
CONFIG_PATA_MPIIX=m
CONFIG_PATA_OLDPIIX=m
CONFIG_PATA_NETCELL=m
CONFIG_PATA_RZ1000=m
CONFIG_PATA_SERVERWORKS=m
CONFIG_PATA_PDC2027X=m
CONFIG_PATA_SIL680=m
CONFIG_PATA_SIS=m
CONFIG_PATA_VIA=m
CONFIG_PATA_WINBOND=m
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
CONFIG_MD_RAID10=m
CONFIG_MD_RAID456=m
CONFIG_MD_RAID5_RESHAPE=y
CONFIG_MD_MULTIPATH=m
CONFIG_MD_FAULTY=m
CONFIG_BLK_DEV_DM=m
CONFIG_DM_CRYPT=m
CONFIG_DM_SNAPSHOT=m
CONFIG_DM_MIRROR=m
CONFIG_DM_ZERO=m
CONFIG_DM_MULTIPATH=m
CONFIG_DM_MULTIPATH_EMC=m
CONFIG_FUSION=y
CONFIG_FUSION_SPI=m
CONFIG_FUSION_FC=m
CONFIG_FUSION_SAS=m
CONFIG_FUSION_CTL=m
CONFIG_NETDEVICES=y
CONFIG_DUMMY=m
CONFIG_BONDING=m
CONFIG_EQUALIZER=m
CONFIG_TUN=m
CONFIG_PHYLIB=m
CONFIG_MARVELL_PHY=m
CONFIG_DAVICOM_PHY=m
CONFIG_QSEMI_PHY=m
CONFIG_LXT_PHY=m
CONFIG_CICADA_PHY=m
CONFIG_VITESSE_PHY=m
CONFIG_SMSC_PHY=m
CONFIG_FIXED_PHY=m
CONFIG_FIXED_MII_10_FDX=y
CONFIG_FIXED_MII_100_FDX=y
CONFIG_NET_ETHERNET=y
CONFIG_MII=m
CONFIG_NET_VENDOR_3COM=y
CONFIG_VORTEX=m
CONFIG_TYPHOON=m
CONFIG_NET_PCI=y
CONFIG_PCNET32=m
CONFIG_AMD8111_ETH=m
CONFIG_AMD8111E_NAPI=y
CONFIG_B44=m
CONFIG_FORCEDETH=m
CONFIG_E100=m
CONFIG_EPIC100=m
CONFIG_SUNDANCE=m
CONFIG_VIA_RHINE=m
CONFIG_VIA_RHINE_MMIO=y
CONFIG_VIA_RHINE_NAPI=y
CONFIG_NETDEV_1000=y
CONFIG_ACENIC=m
CONFIG_DL2K=m
CONFIG_E1000=m
CONFIG_E1000_NAPI=y
CONFIG_NS83820=m
CONFIG_HAMACHI=m
CONFIG_YELLOWFIN=m
CONFIG_R8169=m
CONFIG_R8169_NAPI=y
CONFIG_R8169_VLAN=y
CONFIG_SIS190=m
CONFIG_SKGE=m
CONFIG_SKY2=m
CONFIG_VIA_VELOCITY=m
CONFIG_TIGON3=m
CONFIG_BNX2=m
CONFIG_NETDEV_10000=y
CONFIG_CHELSIO_T1=m
CONFIG_CHELSIO_T1_NAPI=y
CONFIG_IXGB=m
CONFIG_IXGB_NAPI=y
CONFIG_S2IO=m
CONFIG_S2IO_NAPI=y
CONFIG_MYRI10GE=m
CONFIG_MLX4_CORE=m
CONFIG_MLX4_DEBUG=y
CONFIG_NETCONSOLE=m
CONFIG_NETPOLL=y
CONFIG_NETPOLL_TRAP=y
CONFIG_NET_POLL_CONTROLLER=y
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_EVDEV=y
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
CONFIG_MOUSE_SERIAL=m
CONFIG_MOUSE_VSXXXAA=m
CONFIG_INPUT_MISC=y
CONFIG_INPUT_PCSPKR=m
CONFIG_INPUT_UINPUT=m
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_SERIAL_NONSTANDARD=y
CONFIG_CYCLADES=m
CONFIG_SYNCLINK=m
CONFIG_SYNCLINKMP=m
CONFIG_SYNCLINK_GT=m
CONFIG_N_HDLC=m
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_SERIAL_JSM=m
CONFIG_UNIX98_PTYS=y
CONFIG_PRINTER=m
CONFIG_LP_CONSOLE=y
CONFIG_PPDEV=m
CONFIG_TIPAR=m
CONFIG_WATCHDOG=y
CONFIG_SOFT_WATCHDOG=m
CONFIG_SC520_WDT=m
CONFIG_I6300ESB_WDT=m
CONFIG_W83627HF_WDT=m
CONFIG_W83877F_WDT=m
CONFIG_W83977F_WDT=m
CONFIG_MACHZ_WDT=m
CONFIG_PCIPCWATCHDOG=m
CONFIG_WDTPCI=m
CONFIG_WDT_501_PCI=y
CONFIG_USBPCWATCHDOG=m
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_INTEL=m
CONFIG_HW_RANDOM_AMD=m
CONFIG_NVRAM=y
CONFIG_RTC=y
CONFIG_R3964=m
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
CONFIG_AGP_VIA=y
CONFIG_MWAVE=m
CONFIG_PC8736x_GPIO=m
CONFIG_NSC_GPIO=m
CONFIG_HPET=y
CONFIG_HANGCHECK_TIMER=m
CONFIG_DEVPORT=y
CONFIG_I2C=m
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_CHARDEV=m
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCF=m
CONFIG_I2C_ALGOPCA=m
CONFIG_I2C_ALI1535=m
CONFIG_I2C_ALI1563=m
CONFIG_I2C_ALI15X3=m
CONFIG_I2C_AMD756=m
CONFIG_I2C_AMD756_S4882=m
CONFIG_I2C_AMD8111=m
CONFIG_I2C_I801=m
CONFIG_I2C_I810=m
CONFIG_I2C_PIIX4=m
CONFIG_I2C_NFORCE2=m
CONFIG_I2C_OCORES=m
CONFIG_I2C_PARPORT=m
CONFIG_I2C_PARPORT_LIGHT=m
CONFIG_I2C_PROSAVAGE=m
CONFIG_I2C_SAVAGE4=m
CONFIG_I2C_SIS5595=m
CONFIG_I2C_SIS630=m
CONFIG_I2C_SIS96X=m
CONFIG_I2C_STUB=m
CONFIG_I2C_VIA=m
CONFIG_I2C_VIAPRO=m
CONFIG_I2C_VOODOO3=m
CONFIG_SENSORS_DS1337=m
CONFIG_SENSORS_DS1374=m
CONFIG_SENSORS_EEPROM=m
CONFIG_SENSORS_PCF8574=m
CONFIG_SENSORS_PCA9539=m
CONFIG_SENSORS_PCF8591=m
CONFIG_SENSORS_MAX6875=m
CONFIG_HWMON=m
CONFIG_HWMON_VID=m
CONFIG_SENSORS_ABITUGURU=m
CONFIG_SENSORS_ADM1021=m
CONFIG_SENSORS_ADM1025=m
CONFIG_SENSORS_ADM1026=m
CONFIG_SENSORS_ADM1031=m
CONFIG_SENSORS_ADM9240=m
CONFIG_SENSORS_ASB100=m
CONFIG_SENSORS_ATXP1=m
CONFIG_SENSORS_DS1621=m
CONFIG_SENSORS_F71805F=m
CONFIG_SENSORS_FSCHER=m
CONFIG_SENSORS_FSCPOS=m
CONFIG_SENSORS_GL518SM=m
CONFIG_SENSORS_GL520SM=m
CONFIG_SENSORS_IT87=m
CONFIG_SENSORS_LM63=m
CONFIG_SENSORS_LM75=m
CONFIG_SENSORS_LM77=m
CONFIG_SENSORS_LM78=m
CONFIG_SENSORS_LM80=m
CONFIG_SENSORS_LM83=m
CONFIG_SENSORS_LM85=m
CONFIG_SENSORS_LM87=m
CONFIG_SENSORS_LM90=m
CONFIG_SENSORS_LM92=m
CONFIG_SENSORS_MAX1619=m
CONFIG_SENSORS_PC87360=m
CONFIG_SENSORS_SIS5595=m
CONFIG_SENSORS_SMSC47M1=m
CONFIG_SENSORS_SMSC47M192=m
CONFIG_SENSORS_SMSC47B397=m
CONFIG_SENSORS_VIA686A=m
CONFIG_SENSORS_VT8231=m
CONFIG_SENSORS_W83781D=m
CONFIG_SENSORS_W83791D=m
CONFIG_SENSORS_W83792D=m
CONFIG_SENSORS_W83L785TS=m
CONFIG_SENSORS_W83627HF=m
CONFIG_SENSORS_W83627EHF=m
CONFIG_SENSORS_HDAPS=m
CONFIG_VIDEO_DEV=m
CONFIG_VIDEO_V4L1=y
CONFIG_VIDEO_V4L1_COMPAT=y
CONFIG_VIDEO_V4L2=y
CONFIG_VIDEO_CAPTURE_DRIVERS=y
CONFIG_VIDEO_HELPER_CHIPS_AUTO=y
CONFIG_V4L_USB_DRIVERS=y
CONFIG_RADIO_ADAPTERS=y
CONFIG_DAB=y
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_BACKLIGHT_CLASS_DEVICE=m
CONFIG_VIDEO_OUTPUT_CONTROL=m
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VIDEO_SELECT=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_FONT_8x16=y
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
CONFIG_USB_HID=y
CONFIG_HID_FF=y
CONFIG_HID_PID=y
CONFIG_LOGITECH_FF=y
CONFIG_THRUSTMASTER_FF=y
CONFIG_USB_HIDDEV=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
CONFIG_USB_DEVICEFS=y
CONFIG_USB_DEVICE_CLASS=y
CONFIG_USB_EHCI_HCD=m
CONFIG_USB_EHCI_SPLIT_ISO=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_USB_ISP116X_HCD=m
CONFIG_USB_OHCI_HCD=m
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=m
CONFIG_USB_SL811_HCD=m
CONFIG_USB_ACM=m
CONFIG_USB_PRINTER=m
CONFIG_USB_STORAGE=m
CONFIG_USB_STORAGE_DATAFAB=y
CONFIG_USB_STORAGE_FREECOM=y
CONFIG_USB_STORAGE_ISD200=y
CONFIG_USB_STORAGE_DPCM=y
CONFIG_USB_STORAGE_USBAT=y
CONFIG_USB_STORAGE_SDDR09=y
CONFIG_USB_STORAGE_SDDR55=y
CONFIG_USB_STORAGE_JUMPSHOT=y
CONFIG_USB_STORAGE_ALAUDA=y
CONFIG_USB_LIBUSUAL=y
CONFIG_USB_MDC800=m
CONFIG_USB_MICROTEK=m
CONFIG_USB_MON=y
CONFIG_USB_USS720=m
CONFIG_USB_SERIAL=m
CONFIG_USB_SERIAL_GENERIC=y
CONFIG_USB_SERIAL_AIRPRIME=m
CONFIG_USB_SERIAL_ARK3116=m
CONFIG_USB_SERIAL_BELKIN=m
CONFIG_USB_SERIAL_WHITEHEAT=m
CONFIG_USB_SERIAL_DIGI_ACCELEPORT=m
CONFIG_USB_SERIAL_CP2101=m
CONFIG_USB_SERIAL_CYPRESS_M8=m
CONFIG_USB_SERIAL_EMPEG=m
CONFIG_USB_SERIAL_FTDI_SIO=m
CONFIG_USB_SERIAL_FUNSOFT=m
CONFIG_USB_SERIAL_VISOR=m
CONFIG_USB_SERIAL_IPAQ=m
CONFIG_USB_SERIAL_IR=m
CONFIG_USB_SERIAL_EDGEPORT=m
CONFIG_USB_SERIAL_EDGEPORT_TI=m
CONFIG_USB_SERIAL_GARMIN=m
CONFIG_USB_SERIAL_IPW=m
CONFIG_USB_SERIAL_KEYSPAN_PDA=m
CONFIG_USB_SERIAL_KEYSPAN=m
CONFIG_USB_SERIAL_KEYSPAN_MPR=y
CONFIG_USB_SERIAL_KEYSPAN_USA28=y
CONFIG_USB_SERIAL_KEYSPAN_USA28X=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XA=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XB=y
CONFIG_USB_SERIAL_KEYSPAN_USA19=y
CONFIG_USB_SERIAL_KEYSPAN_USA18X=y
CONFIG_USB_SERIAL_KEYSPAN_USA19W=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y
CONFIG_USB_SERIAL_KEYSPAN_USA49W=y
CONFIG_USB_SERIAL_KEYSPAN_USA49WLC=y
CONFIG_USB_SERIAL_KLSI=m
CONFIG_USB_SERIAL_KOBIL_SCT=m
CONFIG_USB_SERIAL_MCT_U232=m
CONFIG_USB_SERIAL_NAVMAN=m
CONFIG_USB_SERIAL_PL2303=m
CONFIG_USB_SERIAL_HP4X=m
CONFIG_USB_SERIAL_SAFE=m
CONFIG_USB_SERIAL_SAFE_PADDED=y
CONFIG_USB_SERIAL_SIERRAWIRELESS=m
CONFIG_USB_SERIAL_TI=m
CONFIG_USB_SERIAL_CYBERJACK=m
CONFIG_USB_SERIAL_XIRCOM=m
CONFIG_USB_SERIAL_OPTION=m
CONFIG_USB_SERIAL_OMNINET=m
CONFIG_USB_EZUSB=y
CONFIG_USB_EMI62=m
CONFIG_USB_EMI26=m
CONFIG_USB_AUERSWALD=m
CONFIG_USB_RIO500=m
CONFIG_USB_LEGOTOWER=m
CONFIG_USB_LCD=m
CONFIG_USB_LED=m
CONFIG_USB_IDMOUSE=m
CONFIG_USB_APPLEDISPLAY=m
CONFIG_USB_SISUSBVGA=m
CONFIG_USB_SISUSBVGA_CON=y
CONFIG_USB_LD=m
CONFIG_USB_TEST=m
CONFIG_INFINIBAND=m
CONFIG_INFINIBAND_USER_MAD=m
CONFIG_INFINIBAND_USER_ACCESS=m
CONFIG_INFINIBAND_USER_MEM=y
CONFIG_INFINIBAND_ADDR_TRANS=y
CONFIG_INFINIBAND_MTHCA=m
CONFIG_INFINIBAND_MTHCA_DEBUG=y
CONFIG_INFINIBAND_IPATH=m
CONFIG_INFINIBAND_AMSO1100=m
CONFIG_MLX4_INFINIBAND=m
CONFIG_INFINIBAND_IPOIB=m
CONFIG_INFINIBAND_IPOIB_DEBUG=y
CONFIG_INFINIBAND_IPOIB_DEBUG_DATA=y
CONFIG_INFINIBAND_SRP=m
CONFIG_INFINIBAND_ISER=m
CONFIG_DMA_ENGINE=y
CONFIG_NET_DMA=y
CONFIG_INTEL_IOATDMA=m
CONFIG_VIRTUALIZATION=y
CONFIG_EDD=m
CONFIG_DELL_RBU=m
CONFIG_DCDBAS=m
CONFIG_DMIID=y
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
CONFIG_EXT2_FS_XIP=y
CONFIG_FS_XIP=y
CONFIG_EXT3_FS=m
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_JBD=m
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=m
CONFIG_REISERFS_PROC_INFO=y
CONFIG_REISERFS_FS_XATTR=y
CONFIG_REISERFS_FS_POSIX_ACL=y
CONFIG_REISERFS_FS_SECURITY=y
CONFIG_JFS_FS=m
CONFIG_JFS_POSIX_ACL=y
CONFIG_JFS_SECURITY=y
CONFIG_FS_POSIX_ACL=y
CONFIG_XFS_FS=m
CONFIG_XFS_QUOTA=y
CONFIG_XFS_SECURITY=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_OCFS2_FS=m
CONFIG_ROMFS_FS=m
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_QUOTA=y
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_DNOTIFY=y
CONFIG_AUTOFS_FS=m
CONFIG_AUTOFS4_FS=m
CONFIG_FUSE_FS=m
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_NTFS_FS=m
CONFIG_NTFS_RW=y
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_RAMFS=y
CONFIG_CONFIGFS_FS=m
CONFIG_UFS_FS=m
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
CONFIG_NFS_DIRECTIO=y
CONFIG_NFSD=m
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_NFSD_TCP=y
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=m
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_RPCSEC_GSS_KRB5=m
CONFIG_CIFS=m
CONFIG_CIFS_WEAK_PW_HASH=y
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
CONFIG_CODA_FS=m
CONFIG_PARTITION_ADVANCED=y
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
CONFIG_SGI_PARTITION=y
CONFIG_SUN_PARTITION=y
CONFIG_NLS=y
CONFIG_NLS_CODEPAGE_437=y
CONFIG_NLS_CODEPAGE_936=m
CONFIG_NLS_CODEPAGE_950=m
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=m
CONFIG_NLS_UTF8=m
CONFIG_PROFILING=y
CONFIG_OPROFILE=m
CONFIG_KPROBES=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DETECT_SOFTLOCKUP=y
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_RODATA=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_KEYS=y
CONFIG_KEYS_DEBUG_PROC_KEYS=y
CONFIG_SECURITY=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_CAPABILITIES=y
CONFIG_XOR_BLOCKS=m
CONFIG_ASYNC_CORE=m
CONFIG_ASYNC_MEMCPY=m
CONFIG_ASYNC_XOR=m
CONFIG_CRYPTO=y
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_BLKCIPHER=m
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_NULL=m
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=m
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_WP512=m
CONFIG_CRYPTO_TGR192=m
CONFIG_CRYPTO_ECB=m
CONFIG_CRYPTO_CBC=m
CONFIG_CRYPTO_PCBC=m
CONFIG_CRYPTO_DES=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_TWOFISH_COMMON=m
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_AES_X86_64=m
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_KHAZAD=m
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_MICHAEL_MIC=m
CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_HW=y
CONFIG_BITREVERSE=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
CONFIG_CRC32=y
CONFIG_LIBCRC32C=y
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=m
CONFIG_PLIST=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above
  2007-11-21 20:34 Possible bug from kernel 2.6.22 and above Jie Chen
@ 2007-11-21 22:14 ` Eric Dumazet
  2007-11-22  1:52   ` Jie Chen
  2007-12-05 20:36 ` Possible bug from kernel 2.6.22 and above Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Eric Dumazet @ 2007-11-21 22:14 UTC (permalink / raw)
  To: Jie Chen; +Cc: linux-kernel

Jie Chen wrote:
> Hi, there:
> 
>     We have a simple pthread program that measures the synchronization 
> overheads for various synchronization mechanisms such as spin locks, 
> barriers (the barrier is implemented using queue-based barrier 
> algorithm) and so on. We have dual quad-core AMD opterons (barcelona) 
> clusters running 2.6.23.8 kernel at this moment using Fedora Core 7 
> distribution. Before we moved to this kernel, we had kernel 2.6.21. 
> These two kernels are configured identical and compiled with the same 
> gcc 4.1.2 compiler. Under the old kernel, we observed that the 
> performance of these overheads increases as the number of threads 
> increases from 2 to 8. The following are the values of total time and 
> overhead for all threads acquiring a pthread spin lock and all threads 
> executing a barrier synchronization call.

Could you post the source of your test program?

Spinlocks are ... spinning, and should not call the Linux scheduler, so I have
no idea why a kernel change could modify your results.
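(To illustrate the point: a user-space spin lock boils down to something
like the sketch below - it busy-waits entirely in user space, so
acquiring it never makes a system call or involves the scheduler. This
is a rough illustration, not the glibc pthread_spin_lock implementation,
and the names are made up.)

typedef volatile int my_spinlock_t;

static void my_spin_lock(my_spinlock_t *l)
{
	/* atomically set the lock word; keep spinning while it was held */
	while (__sync_lock_test_and_set(l, 1))
		;	/* burn cycles in user space, no scheduler involved */
}

static void my_spin_unlock(my_spinlock_t *l)
{
	__sync_lock_release(l);
}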

Also I suspect you'll have better results with Fedora Core 8 (since glibc was 
updated to use private futexes in v 2.7), at least for the barrier ops.


> 
> Kernel 2.6.21
> Number of Threads              2          4           6         8
> SpinLock (Time micro second)   10.5618    10.58538    10.5915   10.643
>                   (Overhead)   0.073      0.05746     0.102805 0.154563
> Barrier (Time micro second)    11.020410  11.678125   11.9889   12.38002
>                  (Overhead)    0.531660   1.1502      1.500112 1.891617
> 
> Each thread is bound to a particular core using pthread_setaffinity_np.
> 
> Kernel 2.6.23.8
> Number of Threads              2          4           6         8
> SpinLock (Time micro second)   14.849915  17.117603   14.4496   10.5990
>                  (Overhead)    4.345417   6.617207    3.949435  0.110985
> Barrier (Time micro second)    19.462255  20.285117   16.19395  12.37662
>                  (Overhead)    8.957755   9.784722    5.699590  1.869518
> 
> It is clearly that the synchronization overhead increases as the number 
> of threads increases in the kernel 2.6.21. But the synchronization 
> overhead actually decreases as the number of threads increases in the 
> kernel 2.6.23.8 (We observed the same behavior on kernel 2.6.22 as 
> well). This certainly is not a correct behavior. The kernels are 
> configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC, 
> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel 
> configuration file is in the attachment of this e-mail.
> 
>  From what we have read, there was a new scheduler (CFS) appeared from 
> 2.6.22. We are not sure whether the above behavior is caused by the new 
> scheduler.
> 
> Finally, our machine cpu information is listed in the following:
> 
> processor       : 0
> vendor_id       : AuthenticAMD
> cpu family      : 16
> model           : 2
> model name      : Quad-Core AMD Opteron(tm) Processor 2347
> stepping        : 10
> cpu MHz         : 1909.801
> cache size      : 512 KB
> physical id     : 0
> siblings        : 4
> core id         : 0
> cpu cores       : 4
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 5
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
> mca cmov
> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
> pdpe1gb rdtscp
>  lm 3dnowext 3dnow constant_tsc rep_good pni cx16 popcnt lahf_lm 
> cmp_legacy svm
> extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
> bogomips        : 3822.95
> TLB size        : 1024 4K pages
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate
> 
> In addition, we have schedstat and sched_debug files in the /proc 
> directory.
> 
> Thank you for all your help to solve this puzzle. If you need more 
> information, please let us know.
> 
> 
> P.S. I like to be cc'ed on the discussions related to this problem.
> 
> 
> ###############################################
> Jie Chen
> Scientific Computing Group
> Thomas Jefferson National Accelerator Facility
> 12000, Jefferson Ave.
> Newport News, VA 23606
> 
> (757)269-5046 (office) (757)269-6248 (fax)
> chen@jlab.org
> ###############################################
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above
  2007-11-21 22:14 ` Eric Dumazet
@ 2007-11-22  1:52   ` Jie Chen
  2007-11-22  2:32     ` Simon Holm Thøgersen
  0 siblings, 1 reply; 35+ messages in thread
From: Jie Chen @ 2007-11-22  1:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel

Eric Dumazet wrote:
> Jie Chen a écrit :
>> Hi, there:
>>
>>     We have a simple pthread program that measures the synchronization 
>> overheads for various synchronization mechanisms such as spin locks, 
>> barriers (the barrier is implemented using queue-based barrier 
>> algorithm) and so on. We have dual quad-core AMD opterons (barcelona) 
>> clusters running 2.6.23.8 kernel at this moment using Fedora Core 7 
>> distribution. Before we moved to this kernel, we had kernel 2.6.21. 
>> These two kernels are configured identical and compiled with the same 
>> gcc 4.1.2 compiler. Under the old kernel, we observed that the 
>> performance of these overheads increases as the number of threads 
>> increases from 2 to 8. The following are the values of total time and 
>> overhead for all threads acquiring a pthread spin lock and all threads 
>> executing a barrier synchronization call.
> 
> Could you post the source of your test program ?
> 


Hi, Eric:

Thank you for the quick response. You can get the source code containing
the test code from ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz . This is a
data-parallel threading package for physics calculations. The test code
is pthread_sync in the src directory once you unpack the tarball.
Configuring and building the package is very simple: configure and make.
The test program is built by make check. The number of threads is
controlled by QMT_NUM_THREADS. The package uses pthread spin locks,
but the barrier is implemented with a queue-based barrier algorithm
proposed by J. B. Carter of the University of Utah (2005).
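(For readers without the tarball: the sketch below is a generic
centralized sense-reversing barrier, shown only to illustrate the kind
of user-space busy-wait synchronization being timed. It is not the
queue-based algorithm that qmt actually implements, and all names in it
are made up for the illustration.)

typedef struct {
	volatile int count;	/* threads still to arrive */
	volatile int sense;	/* flips on every barrier episode */
	int nthreads;
} toy_barrier_t;

static void toy_barrier_wait(toy_barrier_t *b, int *local_sense)
{
	*local_sense = !*local_sense;
	if (__sync_sub_and_fetch(&b->count, 1) == 0) {
		b->count = b->nthreads;	 /* last arrival resets ... */
		b->sense = *local_sense; /* ... and releases the others */
	} else {
		while (b->sense != *local_sense)
			;	/* spin in user space until released */
	}
}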





> spinlock are ... spining and should not call linux scheduler, so I have 
> no idea why a kernel change could modify your results.
> 
> Also I suspect you'll have better results with Fedora Core 8 (since 
> glibc was updated to use private futexes in v 2.7), at least for the 
> barrier ops.
> 
> 

I am not sure what the biggest change between kernel 2.6.21 and 2.6.22
(or .23) is. Is the scheduler the biggest change between these versions?
Can the kernel scheduler somehow affect this performance? I know the
scheduler tries to do load balancing and so on. Can the scheduler move
threads to different cores according to its load-balancing algorithm,
even though the threads are bound to cores using pthread_setaffinity_np,
when the number of threads is fewer than the number of cores? I am
wondering about this because the performance of our test code is roughly
the same for both kernels when the number of threads equals the number
of cores.
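(One way to answer the "do the threads really stay on their cores?"
question empirically is to sample the current CPU from inside each bound
thread; a minimal sketch, assuming a glibc that provides sched_getcpu(),
with an illustrative helper name:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Call periodically from a bound thread to see whether the kernel ever
 * ran it somewhere other than the expected core. */
static void check_core(int expected)
{
	int cpu = sched_getcpu();

	if (cpu != expected)
		fprintf(stderr, "expected core %d but running on core %d\n",
			expected, cpu);
}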

>>
>> Kernel 2.6.21
>> Number of Threads              2          4           6         8
>> SpinLock (Time micro second)   10.5618    10.58538    10.5915   10.643
>>                   (Overhead)   0.073      0.05746     0.102805 0.154563
>> Barrier (Time micro second)    11.020410  11.678125   11.9889   12.38002
>>                  (Overhead)    0.531660   1.1502      1.500112 1.891617
>>
>> Each thread is bound to a particular core using pthread_setaffinity_np.
>>
>> Kernel 2.6.23.8
>> Number of Threads              2          4           6         8
>> SpinLock (Time micro second)   14.849915  17.117603   14.4496   10.5990
>>                  (Overhead)    4.345417   6.617207    3.949435  0.110985
>> Barrier (Time micro second)    19.462255  20.285117   16.19395  12.37662
>>                  (Overhead)    8.957755   9.784722    5.699590  1.869518
>>
>> It is clearly that the synchronization overhead increases as the 
>> number of threads increases in the kernel 2.6.21. But the 
>> synchronization overhead actually decreases as the number of threads 
>> increases in the kernel 2.6.23.8 (We observed the same behavior on 
>> kernel 2.6.22 as well). This certainly is not a correct behavior. The 
>> kernels are configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC, 
>> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel 
>> configuration file is in the attachment of this e-mail.
>>
>>  From what we have read, there was a new scheduler (CFS) appeared from 
>> 2.6.22. We are not sure whether the above behavior is caused by the 
>> new scheduler.
>>
>> Finally, our machine cpu information is listed in the following:
>>
>> processor       : 0
>> vendor_id       : AuthenticAMD
>> cpu family      : 16
>> model           : 2
>> model name      : Quad-Core AMD Opteron(tm) Processor 2347
>> stepping        : 10
>> cpu MHz         : 1909.801
>> cache size      : 512 KB
>> physical id     : 0
>> siblings        : 4
>> core id         : 0
>> cpu cores       : 4
>> fpu             : yes
>> fpu_exception   : yes
>> cpuid level     : 5
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
>> mca cmov
>> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
>> pdpe1gb rdtscp
>>  lm 3dnowext 3dnow constant_tsc rep_good pni cx16 popcnt lahf_lm 
>> cmp_legacy svm
>> extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
>> bogomips        : 3822.95
>> TLB size        : 1024 4K pages
>> clflush size    : 64
>> cache_alignment : 64
>> address sizes   : 48 bits physical, 48 bits virtual
>> power management: ts ttp tm stc 100mhzsteps hwpstate
>>
>> In addition, we have schedstat and sched_debug files in the /proc 
>> directory.
>>
>> Thank you for all your help to solve this puzzle. If you need more 
>> information, please let us know.
>>
>>
>> P.S. I like to be cc'ed on the discussions related to this problem.
>>

Thank you for your help, and happy Thanksgiving!

-- 
#############################################################################
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# chen@jlab.org
# (757)269-5046 (office)
# (757)269-6248 (fax)
#############################################################################

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above
  2007-11-22  1:52   ` Jie Chen
@ 2007-11-22  2:32     ` Simon Holm Thøgersen
  2007-11-22  2:58       ` Jie Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Simon Holm Thøgersen @ 2007-11-22  2:32 UTC (permalink / raw)
  To: Jie Chen; +Cc: Eric Dumazet, linux-kernel


Wed, 21 11 2007 at 20:52 -0500, Jie Chen wrote:
> Eric Dumazet wrote:
> > Jie Chen a écrit :
> >> Hi, there:
> >>
> >>     We have a simple pthread program that measures the synchronization 
> >> overheads for various synchronization mechanisms such as spin locks, 
> >> barriers (the barrier is implemented using queue-based barrier 
> >> algorithm) and so on. We have dual quad-core AMD opterons (barcelona) 
> >> clusters running 2.6.23.8 kernel at this moment using Fedora Core 7 
> >> distribution. Before we moved to this kernel, we had kernel 2.6.21. 
> >> These two kernels are configured identical and compiled with the same 
> >> gcc 4.1.2 compiler. Under the old kernel, we observed that the 
> >> performance of these overheads increases as the number of threads 
> >> increases from 2 to 8. The following are the values of total time and 
> >> overhead for all threads acquiring a pthread spin lock and all threads 
> >> executing a barrier synchronization call.
> > 
> > Could you post the source of your test program ?
> > 
> 
> 
> Hi, Eric:
> 
> Thank you for the quick response. You can get the source code containing 
> the test code from ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz . This is a 
> data parallel threading package for physics calculation. The test code 
> is pthread_sync in the src directory once you unpack the gz file. To 
> configure and build this package is very simple: configure and make. The 
>   test program is built by make check. The number of threads is 
> controlled by QMT_NUM_THREADS. The package is using pthread spin lock, 
> but the barrier is implemented using a queue-based barrier algorithm 
> proposed by  J. B. Carter of University of Utah (2005).
> 
> 
> 
> 
> 
> > spinlock are ... spining and should not call linux scheduler, so I have 
> > no idea why a kernel change could modify your results.
> > 
> > Also I suspect you'll have better results with Fedora Core 8 (since 
> > glibc was updated to use private futexes in v 2.7), at least for the 
> > barrier ops.
> > 
> > 
> 
> I am not sure what the biggest change between kernel 2.6.21 and 2.6.22 
> (23) is? Is the scheduler the biggest change between these versions? Can 
> the scheduler of kernel somehow effect the performance? I know the 
> scheduler is trying to do load balance and so on. Can the scheduler move 
> threads to different cores according to the load balance algorithm even 
> though the threads are bound to cores using pthread_setaffinity_np call 
> when the number of threads is fewer than the number of cores? I am 
> thinking about this because the performance of our test code is roughly 
> the same for both kernels when the number of threads equals to the 
> number of cores.
> 
There is a backport of the CFS scheduler to 2.6.21, see
http://lkml.org/lkml/2007/11/19/127

> >>
> >> Kernel 2.6.21
> >> Number of Threads              2          4           6         8
> >> SpinLock (Time micro second)   10.5618    10.58538    10.5915   10.643
> >>                   (Overhead)   0.073      0.05746     0.102805 0.154563
> >> Barrier (Time micro second)    11.020410  11.678125   11.9889   12.38002
> >>                  (Overhead)    0.531660   1.1502      1.500112 1.891617
> >>
> >> Each thread is bound to a particular core using pthread_setaffinity_np.
> >>
> >> Kernel 2.6.23.8
> >> Number of Threads              2          4           6         8
> >> SpinLock (Time micro second)   14.849915  17.117603   14.4496   10.5990
> >>                  (Overhead)    4.345417   6.617207    3.949435  0.110985
> >> Barrier (Time micro second)    19.462255  20.285117   16.19395  12.37662
> >>                  (Overhead)    8.957755   9.784722    5.699590  1.869518
> >>
> >> It is clearly that the synchronization overhead increases as the 
> >> number of threads increases in the kernel 2.6.21. But the 
> >> synchronization overhead actually decreases as the number of threads 
> >> increases in the kernel 2.6.23.8 (We observed the same behavior on 
> >> kernel 2.6.22 as well). This certainly is not a correct behavior. The 
> >> kernels are configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC, 
> >> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel 
> >> configuration file is in the attachment of this e-mail.
> >>
> >>  From what we have read, there was a new scheduler (CFS) appeared from 
> >> 2.6.22. We are not sure whether the above behavior is caused by the 
> >> new scheduler.
> >>
> >> Finally, our machine cpu information is listed in the following:
> >>
> >> processor       : 0
> >> vendor_id       : AuthenticAMD
> >> cpu family      : 16
> >> model           : 2
> >> model name      : Quad-Core AMD Opteron(tm) Processor 2347
> >> stepping        : 10
> >> cpu MHz         : 1909.801
> >> cache size      : 512 KB
> >> physical id     : 0
> >> siblings        : 4
> >> core id         : 0
> >> cpu cores       : 4
> >> fpu             : yes
> >> fpu_exception   : yes
> >> cpuid level     : 5
> >> wp              : yes
> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
> >> mca cmov
> >> pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
> >> pdpe1gb rdtscp
> >>  lm 3dnowext 3dnow constant_tsc rep_good pni cx16 popcnt lahf_lm 
> >> cmp_legacy svm
> >> extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
> >> bogomips        : 3822.95
> >> TLB size        : 1024 4K pages
> >> clflush size    : 64
> >> cache_alignment : 64
> >> address sizes   : 48 bits physical, 48 bits virtual
> >> power management: ts ttp tm stc 100mhzsteps hwpstate
> >>
> >> In addition, we have schedstat and sched_debug files in the /proc 
> >> directory.
> >>
> >> Thank you for all your help to solve this puzzle. If you need more 
> >> information, please let us know.
> >>
> >>
> >> P.S. I like to be cc'ed on the discussions related to this problem.
> >>
> 
> Thank you for your help and happy thanksgiving !
> 


Simon Holm Thøgersen


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above
  2007-11-22  2:32     ` Simon Holm Thøgersen
@ 2007-11-22  2:58       ` Jie Chen
  2007-11-22 20:19         ` Matt Mackall
  2007-12-04 13:17         ` Possible bug from kernel 2.6.22 and above, 2.6.24-rc4 Ingo Molnar
  0 siblings, 2 replies; 35+ messages in thread
From: Jie Chen @ 2007-11-22  2:58 UTC (permalink / raw)
  To: Simon Holm Thøgersen; +Cc: Eric Dumazet, linux-kernel

Simon Holm Thøgersen wrote:
> ons, 21 11 2007 kl. 20:52 -0500, skrev Jie Chen:

> There is a backport of the CFS scheduler to 2.6.21, see
> http://lkml.org/lkml/2007/11/19/127
> 
Hi, Simon:

I will try that after the Thanksgiving holiday to find out whether the
odd behavior shows up using 2.6.21 with the backported CFS.

>>>> Kernel 2.6.21
>>>> Number of Threads              2          4           6         8
>>>> SpinLock (Time micro second)   10.5618    10.58538    10.5915   10.643
>>>>                   (Overhead)   0.073      0.05746     0.102805 0.154563
>>>> Barrier (Time micro second)    11.020410  11.678125   11.9889   12.38002
>>>>                  (Overhead)    0.531660   1.1502      1.500112 1.891617
>>>>
>>>> Each thread is bound to a particular core using pthread_setaffinity_np.
>>>>
>>>> Kernel 2.6.23.8
>>>> Number of Threads              2          4           6         8
>>>> SpinLock (Time micro second)   14.849915  17.117603   14.4496   10.5990
>>>>                  (Overhead)    4.345417   6.617207    3.949435  0.110985
>>>> Barrier (Time micro second)    19.462255  20.285117   16.19395  12.37662
>>>>                  (Overhead)    8.957755   9.784722    5.699590  1.869518
>>>>

> 
> 
> Simon Holm Thøgersen
> 
> 
I just ran a simple test that suggests the problem may be related to the
scheduler's load balancing. I first started 6 processes using
"taskset -c 2 donothing&; taskset -c 3 donothing&; ..., taskset -c 7
donothing". These 6 processes run on cores 2 to 7. Then I started my
test program with two threads bound to cores 0 and 1. Here is the result:

Two threads on Kernel 2.6.23.8:
SpinLock (Time micro second)             10.558255
          (Overhead)                      0.068965
Barrier  (Time micro second)             10.865520
          (Overhead)                      0.376230

Similarly, I started 4 donothing processes on cores 4, 5, 6 and 7, and
ran the test program. Here is the result:

Four threads on Kernel 2.6.23.8:
SpinLock (Time micro second)             10.579413
          (Overhead)                      0.090023
Barrier  (Time micro second)             11.363193
          (Overhead)                      0.873803

Finally, here is the result for 6 threads with two donothing processes
running on cores 6 and 7:

Six threads on Kernel 2.6.23.8:
SpinLock (Time micro second)             10.590030
          (Overhead)                      0.100940
Barrier  (Time micro second)             11.977548
          (Overhead)                      1.488458

The above results are very similar to the results obtained on kernel
2.6.21. I hope this helps you in some way. Thank you.
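(The "donothing" program itself was not posted; presumably it is just a
CPU hog that keeps its core busy. A minimal stand-in, assuming that is
all it does, would be:)

/* donothing.c - spin forever so the core it is pinned to via taskset
 * never looks idle to the load balancer (assumed stand-in, not the
 * original program). */
int main(void)
{
	volatile unsigned long x = 0;

	for (;;)
		x++;
	return 0;
}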

-- 
#############################################################################
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# chen@jlab.org
# (757)269-5046 (office)
# (757)269-6248 (fax)
#############################################################################

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above
  2007-11-22  2:58       ` Jie Chen
@ 2007-11-22 20:19         ` Matt Mackall
  2007-12-04 13:17         ` Possible bug from kernel 2.6.22 and above, 2.6.24-rc4 Ingo Molnar
  1 sibling, 0 replies; 35+ messages in thread
From: Matt Mackall @ 2007-11-22 20:19 UTC (permalink / raw)
  To: Jie Chen; +Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Ingo Molnar

On Wed, Nov 21, 2007 at 09:58:10PM -0500, Jie Chen wrote:
> Simon Holm Thøgersen wrote:
> >ons, 21 11 2007 kl. 20:52 -0500, skrev Jie Chen:
> 
> >There is a backport of the CFS scheduler to 2.6.21, see
> >http://lkml.org/lkml/2007/11/19/127
> >
> Hi, Simon:
> 
> I will try that after the thanksgiving holiday to find out whether the 
> odd behavior will show up using 2.6.21 with back ported CFS.
> 
> >>>>Kernel 2.6.21
> >>>>Number of Threads              2          4           6         8
> >>>>SpinLock (Time micro second)   10.5618    10.58538    10.5915   10.643
> >>>>                  (Overhead)   0.073      0.05746     0.102805 0.154563
> >>>>Barrier (Time micro second)    11.020410  11.678125   11.9889   12.38002
> >>>>                 (Overhead)    0.531660   1.1502      1.500112 1.891617
> >>>>
> >>>>Each thread is bound to a particular core using pthread_setaffinity_np.
> >>>>
> >>>>Kernel 2.6.23.8
> >>>>Number of Threads              2          4           6         8
> >>>>SpinLock (Time micro second)   14.849915  17.117603   14.4496   10.5990
> >>>>                 (Overhead)    4.345417   6.617207    3.949435  0.110985
> >>>>Barrier (Time micro second)    19.462255  20.285117   16.19395  12.37662
> >>>>                 (Overhead)    8.957755   9.784722    5.699590  1.869518
> >>>>
> 
> >
> >
> >Simon Holm Th??gersen
> >
> >
> I just ran a simple test to prove that the problem may be related to 
> load balance of the scheduler. I first started 6 processes using 
> "taskset -c 2 donothing&; taskset -c 3 donothing&; ..., taskset -c 7 
> donothing". These 6 processes will run on core 2 to 7. Then I started my 
> test program using two threads bound to core 0 and 1. Here is the result:
> 
> Two threads on Kernel 2.6.23.8:
> SpinLock (Time micro second)             10.558255
>          (Overhead)                      0.068965
> Barrier  (Time micro second)             10.865520
>          (Overhead)                      0.376230
> 
> Similarly, I started 4 donothing processes on core 4, 5, 6 and 7, and 
> ran the test program. I have the following result:
> 
> Four threads on Kernel 2.6.23.8:
> SpinLock (Time micro second)             10.579413
>          (Overhead)                      0.090023
> Barrier  (Time micro second)             11.363193
>          (Overhead)                      0.873803
> 
> Finally, here is the result for 6 threads with two donothing processes 
> running on core 6 and 7:
> 
> Six threads on Kernel 2.6.23.8:
> SpinLock (Time micro second)             10.590030
>          (Overhead)                      0.100940
> Barrier  (Time micro second)             11.977548
>          (Overhead)                      1.488458
> 
> Now the above results are very much similar to the results obtained for 
> the kernel 2.6.21. I hope this helps you guys in some ways. Thank you.

Yes, this really does look like a scheduling regression. I've added
Ingo to the cc: list. Next time you should pick a more descriptive
subject line - we've got lots of email about possible bugs.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-11-22  2:58       ` Jie Chen
  2007-11-22 20:19         ` Matt Mackall
@ 2007-12-04 13:17         ` Ingo Molnar
  2007-12-04 15:41           ` Jie Chen
  2007-12-05 15:29           ` Jie Chen
  1 sibling, 2 replies; 35+ messages in thread
From: Ingo Molnar @ 2007-12-04 13:17 UTC (permalink / raw)
  To: Jie Chen; +Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel


* Jie Chen <chen@jlab.org> wrote:

> Simon Holm Thøgersen wrote:
>> ons, 21 11 2007 kl. 20:52 -0500, skrev Jie Chen:
>
>> There is a backport of the CFS scheduler to 2.6.21, see
>> http://lkml.org/lkml/2007/11/19/127
>>
> Hi, Simon:
>
> I will try that after the thanksgiving holiday to find out whether the 
> odd behavior will show up using 2.6.21 with back ported CFS.

would also be nice to test this with 2.6.24-rc4.

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-04 13:17         ` Possible bug from kernel 2.6.22 and above, 2.6.24-rc4 Ingo Molnar
@ 2007-12-04 15:41           ` Jie Chen
  2007-12-05 15:29           ` Jie Chen
  1 sibling, 0 replies; 35+ messages in thread
From: Jie Chen @ 2007-12-04 15:41 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
> 
>> Simon Holm Thøgersen wrote:
>>> ons, 21 11 2007 kl. 20:52 -0500, skrev Jie Chen:
>>> There is a backport of the CFS scheduler to 2.6.21, see
>>> http://lkml.org/lkml/2007/11/19/127
>>>
>> Hi, Simon:
>>
>> I will try that after the thanksgiving holiday to find out whether the 
>> odd behavior will show up using 2.6.21 with back ported CFS.
> 
> would be also nice to test this with 2.6.24-rc4.
> 
> 	Ingo
Hi, Ingo:

I will test 2.6.24-rc4 this week and let you know the result. Thanks.

-- 
#############################################################################
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# chen@jlab.org
# (757)269-5046 (office)
# (757)269-6248 (fax)
#############################################################################

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-04 13:17         ` Possible bug from kernel 2.6.22 and above, 2.6.24-rc4 Ingo Molnar
  2007-12-04 15:41           ` Jie Chen
@ 2007-12-05 15:29           ` Jie Chen
  2007-12-05 15:40             ` Ingo Molnar
  1 sibling, 1 reply; 35+ messages in thread
From: Jie Chen @ 2007-12-05 15:29 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
> 
>> Simon Holm Thøgersen wrote:
>>> ons, 21 11 2007 kl. 20:52 -0500, skrev Jie Chen:
>>> There is a backport of the CFS scheduler to 2.6.21, see
>>> http://lkml.org/lkml/2007/11/19/127
>>>
>> Hi, Simon:
>>
>> I will try that after the thanksgiving holiday to find out whether the 
>> odd behavior will show up using 2.6.21 with back ported CFS.
> 
> would be also nice to test this with 2.6.24-rc4.
> 
> 	Ingo
Hi, Ingo:

I just ran the same test on two 2.6.24-rc4 kernels: one with
CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED
off. The odd behavior I described in my previous e-mails was still
there for both kernels. Let me know if I can be of any more help. Thank you.

-- 
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
chen@jlab.org
###############################################


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 15:29           ` Jie Chen
@ 2007-12-05 15:40             ` Ingo Molnar
  2007-12-05 16:16               ` Eric Dumazet
  2007-12-05 16:22               ` Jie Chen
  0 siblings, 2 replies; 35+ messages in thread
From: Ingo Molnar @ 2007-12-05 15:40 UTC (permalink / raw)
  To: Jie Chen
  Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra


* Jie Chen <chen@jlab.org> wrote:

> I just ran the same test on two 2.6.24-rc4 kernels: one with 
> CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED 
> off. The odd behavior I described in my previous e-mails were still 
> there for both kernels. Let me know If I can be any more help. Thank 
> you.

ok, i had a look at your data, and i think this is the result of the 
scheduler balancing out to idle CPUs more aggressively than before. Doing 
that is almost always a good idea though - but indeed it can result in 
"bad" numbers if all you do is measure the ping-pong "performance" 
between two threads (with no real work done by either of them).

the moment you saturate the system a bit more, the numbers should 
improve even with such a ping-pong test.
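(A hedged sketch of what "saturating the system a bit more" could mean
for such a micro-benchmark: give each thread a real unit of work between
synchronization points instead of ping-ponging back to back. The function
name and iteration count below are made up for illustration.)

/* Keep the core busy between barrier/lock operations so the CPUs are
 * not idle from the scheduler's point of view (illustrative only). */
static double dummy_work(int iters)
{
	double t = 0.0;
	int i;

	for (i = 0; i < iters; i++)
		t += 0.3456 * i;
	return t;
}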

do you have testcode (or a modification of your testcase sourcecode) 
that simulates a real-life situation where 2.6.24-rc4 does not perform 
as well as you'd like? (or if qmt.tar.gz already contains that, then 
please point me towards that portion of the test and how i should 
run it - thanks!)

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 15:40             ` Ingo Molnar
@ 2007-12-05 16:16               ` Eric Dumazet
  2007-12-05 16:25                 ` Ingo Molnar
  2007-12-05 16:22               ` Jie Chen
  1 sibling, 1 reply; 35+ messages in thread
From: Eric Dumazet @ 2007-12-05 16:16 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Jie Chen, Simon Holm Thøgersen, linux-kernel, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 1873 bytes --]

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
> 
>> I just ran the same test on two 2.6.24-rc4 kernels: one with 
>> CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED 
>> off. The odd behavior I described in my previous e-mails were still 
>> there for both kernels. Let me know If I can be any more help. Thank 
>> you.
> 
> ok, i had a look at your data, and i think this is the result of the 
> scheduler balancing out to idle CPUs more agressively than before. Doing 
> that is almost always a good idea though - but indeed it can result in 
> "bad" numbers if all you do is to measure the ping-pong "performance" 
> between two threads. (with no real work done by any of them).
> 
> the moment you saturate the system a bit more, the numbers should 
> improve even with such a ping-pong test.
> 
> do you have testcode (or a modification of your testcase sourcecode) 
> that simulates a real-life situation where 2.6.24-rc4 performs not as 
> well as you'd like it to see? (or if qmt.tar.gz already contains that 
> then please point me towards that portion of the test and how i should 
> run it - thanks!)
> 
> 	Ingo

I cooked up a program shorter than Jie's, to try to understand what is going
on. It is a pure CPU-burner program, with no thread synchronization (apart
from the pthread_join at the very end).

As each thread is bound to a given CPU, I am not sure the scheduler is allowed
to balance it to an idle CPU.

Unfortunately I don't have an idle 4-way SMP machine available to
test it.

$ gcc -O2 -o burner burner.c
$ ./burner
Time to perform the unit of work on one thread is 0.040328 s
Time to perform the unit of work on 2 threads is 0.040221 s

I tried it on a 64-way machine (thanks David :) ) and noticed some strange
results that may be related to the Niagara hardware (the time for 64 threads
was nearly double that for one thread).


[-- Attachment #2: burner.c --]
[-- Type: text/plain, Size: 2301 bytes --]

#define _GNU_SOURCE		/* for sched_setaffinity() */
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>		/* memcmp() */
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

int blockthemall=1;
static void inline cpupause()
{
#if defined(i386)
 asm volatile("rep;nop":::"memory");
#else
 asm volatile("":::"memory");
#endif
}
/*
 * Determines the number of cpus.
 * Can be overridden by the NR_CPUS environment variable.
 */
int number_of_cpus()
{
	char line[1024], *p;
	int cnt = 0;
	FILE *F;

	p = getenv("NR_CPUS");
	if (p)
		return atoi(p);
	F = fopen("/proc/cpuinfo", "r");
	if (F == NULL) {
		perror("/proc/cpuinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), F) != NULL) {
		if (memcmp(line, "processor", 9) == 0)
			cnt++;
	}
	fclose(F);
	return cnt;
}

/* delta = (time now) - t0 */
void compute_elapsed(struct timeval *delta, const struct timeval *t0)
{
	struct timeval t1;

	gettimeofday(&t1, NULL);
	delta->tv_sec = t1.tv_sec - t0->tv_sec;
	delta->tv_usec = t1.tv_usec - t0->tv_usec;
	if (delta->tv_usec < 0) {
		delta->tv_usec += 1000000;
		delta->tv_sec--;
	}
}

int nr_loops = 20*1000000;
double incr = 0.3456;

/*
 * The unit of work: a plain floating point accumulation loop.
 * The final test only exists so the compiler cannot optimize the loop away.
 */
void perform_work()
{
	int i;
	double t = 0.0;

	for (i = 0; i < nr_loops; i++)
		t += incr;
	if (t < 0.0)
		printf("well... should not happen\n");
}

/* Bind the calling thread to the given cpu. */
void set_affinity(int cpu)
{
	cpu_set_t cpu_mask;
	int res;

	CPU_ZERO(&cpu_mask);
	CPU_SET(cpu, &cpu_mask);
	res = sched_setaffinity(0, sizeof(cpu_mask), &cpu_mask);
	if (res)
		perror("sched_setaffinity");
}

void *thread_work(void *arg)
{
	int cpu = (int)(long)arg;

	set_affinity(cpu);
	/* Spin until the main thread opens the gate, so all threads start together. */
	while (blockthemall)
		cpupause();
	perform_work();
	return (void *)0;
}

int main(int argc, char *argv[])
{
	struct timeval t0, delta;
	int nr_cpus, i;
	pthread_t *tids;

	/* Single-threaded reference run. */
	gettimeofday(&t0, NULL);
	perform_work();
	compute_elapsed(&delta, &t0);
	printf("Time to perform the unit of work on one thread is %ld.%06ld s\n",
	       (long)delta.tv_sec, (long)delta.tv_usec);

	nr_cpus = number_of_cpus();
	if (nr_cpus <= 1)
		return 0;
	tids = malloc(nr_cpus * sizeof(pthread_t));
	for (i = 1; i < nr_cpus; i++)
		pthread_create(tids + i, NULL, thread_work, (void *)(long)i);

	set_affinity(0);
	gettimeofday(&t0, NULL);
	blockthemall = 0;	/* release all worker threads at once */
	perform_work();
	for (i = 1; i < nr_cpus; i++)
		pthread_join(tids[i], NULL);
	compute_elapsed(&delta, &t0);
	printf("Time to perform the unit of work on %d threads is %ld.%06ld s\n",
	       nr_cpus, (long)delta.tv_sec, (long)delta.tv_usec);
	return 0;
}

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 15:40             ` Ingo Molnar
  2007-12-05 16:16               ` Eric Dumazet
@ 2007-12-05 16:22               ` Jie Chen
  2007-12-05 16:47                 ` Ingo Molnar
  1 sibling, 1 reply; 35+ messages in thread
From: Jie Chen @ 2007-12-05 16:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
> 
>> I just ran the same test on two 2.6.24-rc4 kernels: one with 
>> CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED 
>> off. The odd behavior I described in my previous e-mails were still 
>> there for both kernels. Let me know If I can be any more help. Thank 
>> you.
> 
> ok, i had a look at your data, and i think this is the result of the 
> scheduler balancing out to idle CPUs more agressively than before. Doing 
> that is almost always a good idea though - but indeed it can result in 
> "bad" numbers if all you do is to measure the ping-pong "performance" 
> between two threads. (with no real work done by any of them).
> 

My test code is not doing much work; it measures the overhead of various 
synchronization mechanisms such as barriers and locks. I am trying to see 
the scalability of different implementations/algorithms on multi-core 
machines.

> the moment you saturate the system a bit more, the numbers should 
> improve even with such a ping-pong test.
> 
You are right. If I manually balance the load (bind unrelated processes 
to the other cores), my test code performs as well as it did under 
kernel 2.6.21.
> do you have testcode (or a modification of your testcase sourcecode) 
> that simulates a real-life situation where 2.6.24-rc4 performs not as 
> well as you'd like it to see? (or if qmt.tar.gz already contains that 
> then please point me towards that portion of the test and how i should 
> run it - thanks!)

The qmt.tar.gz code contains a simple test program called pthread_sync 
under the src directory. You can change the number of threads by setting 
the QMT_NUM_THREADS environment variable. You can build qmt by doing 
configure --enable-public-release. I do not have Intel quad-core 
machines, so I am not sure whether the behavior will show up on the Intel 
platform. Our cluster is dual quad-core Opteron, which has its own 
hardware problem :-).
http://hardware.slashdot.org/article.pl?sid=07/12/04/237248&from=rss

> 
> 	Ingo

Hi, Ingo:

My test code qmt can be found at ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz. 
There is a minor performance issue in qmt, pointed out by Eric, whose fix I 
have not put into the tarball yet. If I can be of any help, please let me 
know. Thank you very much.



-- 
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
chen@jlab.org
###############################################


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 16:16               ` Eric Dumazet
@ 2007-12-05 16:25                 ` Ingo Molnar
  2007-12-05 16:29                   ` Eric Dumazet
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2007-12-05 16:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jie Chen, Simon Holm Thøgersen, linux-kernel, Peter Zijlstra


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> $ gcc -O2 -o burner burner.c
> $ ./burner
> Time to perform the unit of work on one thread is 0.040328 s
> Time to perform the unit of work on 2 threads is 0.040221 s

ok, but this actually suggests that scheduling is fine for this, 
correct?

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 16:25                 ` Ingo Molnar
@ 2007-12-05 16:29                   ` Eric Dumazet
  0 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2007-12-05 16:29 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Jie Chen, Simon Holm Thøgersen, linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
>> $ gcc -O2 -o burner burner.c
>> $ ./burner
>> Time to perform the unit of work on one thread is 0.040328 s
>> Time to perform the unit of work on 2 threads is 0.040221 s
> 
> ok, but this actually suggests that scheduling is fine for this, 
> correct?
> 
> 	Ingo
> 
> 

Yes, but this machine runs an old kernel. I was just showing you how to run it :)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 16:22               ` Jie Chen
@ 2007-12-05 16:47                 ` Ingo Molnar
  2007-12-05 17:47                   ` Jie Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2007-12-05 16:47 UTC (permalink / raw)
  To: Jie Chen
  Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra


* Jie Chen <chen@jlab.org> wrote:

>> the moment you saturate the system a bit more, the numbers should 
>> improve even with such a ping-pong test.
>
> You are right. If I manually do load balance (bind unrelated processes 
> on the other cores), my test code perform as well as it did in the 
> kernel 2.6.21.

so right now the results don't seem too bad to me - the higher 
overhead comes from two threads running on two different cores and 
incurring the overhead of cross-core communication. In a true 
spread-out workload that synchronizes occasionally you'd get the same 
kind of overhead, so in fact this behavior is more informative of the 
real overhead i guess. In 2.6.21 the two threads would stick to the same 
core and produce artificially low latency - which would only be true in 
a real spread-out workload if all tasks ran on the same core. (which is 
hardly the thing you want on openmp)

In any case, if i misinterpreted your numbers or if you just disagree, 
or if you have a workload/test that shows worse performance than it 
could/should, let me know.

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 16:47                 ` Ingo Molnar
@ 2007-12-05 17:47                   ` Jie Chen
  2007-12-05 20:03                     ` Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: Jie Chen @ 2007-12-05 17:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
> 
>>> the moment you saturate the system a bit more, the numbers should 
>>> improve even with such a ping-pong test.
>> You are right. If I manually do load balance (bind unrelated processes 
>> on the other cores), my test code perform as well as it did in the 
>> kernel 2.6.21.
> 
> so right now the results dont seem to be too bad to me - the higher 
> overhead comes from two threads running on two different cores and 
> incurring the overhead of cross-core communications. In a true 
> spread-out workloads that synchronize occasionally you'd get the same 
> kind of overhead so in fact this behavior is more informative of the 
> real overhead i guess. In 2.6.21 the two threads would stick on the same 
> core and produce artificially low latency - which would only be true in 
> a real spread-out workload if all tasks ran on the same core. (which is 
> hardly the thing you want on openmp)
> 

I use the pthread_setaffinity_np call to bind each thread to one core. 
Unless kernel 2.6.21 does not honor the affinity, I do not see how 
running two threads on two cores differs between the new kernel and 
the old kernel. My test code does not do any numerical calculation, but 
it does spin-wait on shared/non-shared flags. The reason I am using 
affinity is to test synchronization overheads among different cores. 
In both the new and the old kernel I see 200% CPU usage when I run 
my test code with two threads. Does this mean two threads are running on 
two cores? I also verify that a thread is indeed bound to a core by using 
pthread_getaffinity_np.
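
(For reference, a minimal sketch of this kind of binding and verification -
illustrative only, not code taken from qmt; the helper name and the choice of
core 0 are made up:)

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one core, then read the mask back to verify. */
static int bind_and_verify(int core)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(core, &set);
	if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
		return -1;

	CPU_ZERO(&set);
	if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
		return -1;
	return CPU_ISSET(core, &set) ? 0 : -1;
}

int main(void)
{
	if (bind_and_verify(0) != 0)
		fprintf(stderr, "failed to bind/verify affinity\n");
	else
		printf("thread bound to core 0\n");
	return 0;
}

$ gcc -O2 -o bindcheck bindcheck.c -lpthread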

> In any case, if i misinterpreted your numbers or if you just disagree, 
> or if have a workload/test that shows worse performance that it 
> could/should, let me know.
> 
> 	Ingo

Hi, Ingo:

Since I am using the affinity flag to bind each thread to a different core, 
the synchronization overhead should increase as the number of 
cores/threads increases. But what we observed in the new kernel is the 
opposite. The barrier overhead for two threads is 8.93 microseconds vs 
1.86 microseconds for 8 threads (in the old kernel it is 0.49 vs 1.86). This 
will confuse most people who study synchronization/communication 
scalability. I know my test code is not a real-world computation, which 
would usually use up all cores. I hope I have explained myself clearly. Thank 
you very much.

-- 
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
chen@jlab.org
###############################################


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 17:47                   ` Jie Chen
@ 2007-12-05 20:03                     ` Ingo Molnar
  2007-12-05 20:23                       ` Jie Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2007-12-05 20:03 UTC (permalink / raw)
  To: Jie Chen
  Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra


* Jie Chen <chen@jlab.org> wrote:

> Since I am using affinity flag to bind each thread to a different 
> core, the synchronization overhead should increases as the number of 
> cores/threads increases. But what we observed in the new kernel is the 
> opposite. The barrier overhead of two threads is 8.93 micro seconds vs 
> 1.86 microseconds for 8 threads (the old kernel is 0.49 vs 1.86). This 
> will confuse most of people who study the 
> synchronization/communication scalability. I know my test code is not 
> real-world computation which usually use up all cores. I hope I have 
> explained myself clearly. Thank you very much.

btw., could you try not using the affinity mask and let the scheduler 
manage the spreading of tasks? It generally has better knowledge of 
how tasks interrelate.

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 20:03                     ` Ingo Molnar
@ 2007-12-05 20:23                       ` Jie Chen
  2007-12-05 20:46                         ` Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: Jie Chen @ 2007-12-05 20:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
> 
>> Since I am using affinity flag to bind each thread to a different 
>> core, the synchronization overhead should increases as the number of 
>> cores/threads increases. But what we observed in the new kernel is the 
>> opposite. The barrier overhead of two threads is 8.93 micro seconds vs 
>> 1.86 microseconds for 8 threads (the old kernel is 0.49 vs 1.86). This 
>> will confuse most of people who study the 
>> synchronization/communication scalability. I know my test code is not 
>> real-world computation which usually use up all cores. I hope I have 
>> explained myself clearly. Thank you very much.
> 
> btw., could you try to not use the affinity mask and let the scheduler 
> manage the spreading of tasks? It generally has a better knowledge about 
> how tasks interrelate.
> 
> 	Ingo
Hi, Ingo:

I just disabled the affinity mask and reran the test. There was no 
significant change for two threads (the barrier overhead is around 9 
microseconds). As for 8 threads, the barrier overhead actually drops a 
little, which is good. Let me know whether I can be of any more help. Thank you 
very much.

-- 
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
chen@jlab.org
###############################################


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above
  2007-11-21 20:34 Possible bug from kernel 2.6.22 and above Jie Chen
  2007-11-21 22:14 ` Eric Dumazet
@ 2007-12-05 20:36 ` Peter Zijlstra
  2007-12-05 20:53   ` Jie Chen
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2007-12-05 20:36 UTC (permalink / raw)
  To: Jie Chen; +Cc: linux-kernel, Eric Dumazet, Ingo Molnar, Simon Holm Thøgersen


On Wed, 2007-11-21 at 15:34 -0500, Jie Chen wrote:

> It is clearly that the synchronization overhead increases as the number 
> of threads increases in the kernel 2.6.21. But the synchronization 
> overhead actually decreases as the number of threads increases in the 
> kernel 2.6.23.8 (We observed the same behavior on kernel 2.6.22 as 
> well). This certainly is not a correct behavior. The kernels are 
> configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC, 
> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel 
> configuration file is in the attachment of this e-mail.
> 
>  From what we have read, there was a new scheduler (CFS) appeared from 
> 2.6.22. We are not sure whether the above behavior is caused by the new 
> scheduler.

If I read this correctly, you are saying that .22 is the first bad one, right?

The new scheduler (CFS) was introduced in .23, so it seems another
change would be responsible for this.




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 20:23                       ` Jie Chen
@ 2007-12-05 20:46                         ` Ingo Molnar
  2007-12-05 20:52                           ` Jie Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2007-12-05 20:46 UTC (permalink / raw)
  To: Jie Chen
  Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra


* Jie Chen <chen@jlab.org> wrote:

> I just disabled the affinity mask and reran the test. There were no 
> significant changes for two threads (barrier overhead is around 9 
> microseconds). As for 8 threads, the barrier overhead actually drops a 
> little, which is good. Let me know whether I can be any help. Thank 
> you very much.

sorry to be dense, but could you give me instructions on how i could remove 
the affinity mask and test the "barrier overhead" myself? I have built 
"pthread_sync" and it outputs numbers for me - which one would be the 
barrier overhead: Reference_time_1?

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 20:46                         ` Ingo Molnar
@ 2007-12-05 20:52                           ` Jie Chen
  2007-12-05 21:02                             ` Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: Jie Chen @ 2007-12-05 20:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
> 
>> I just disabled the affinity mask and reran the test. There were no 
>> significant changes for two threads (barrier overhead is around 9 
>> microseconds). As for 8 threads, the barrier overhead actually drops a 
>> little, which is good. Let me know whether I can be any help. Thank 
>> you very much.
> 
> sorry to be dense, but could you give me instructions how i could remove 
> the affinity mask and test the "barrier overhead" myself? I have built 
> "pthread_sync" and it outputs numbers for me - which one would be the 
> barrier overhead: Reference_time_1 ?
> 
> 	Ingo
Hi, Ingo:

To disable affinity, do configure --enable-public-release 
--disable-thread_affinity. You should then see barrier overhead output like 
the following:
Computing BARRIER time

Sample_size  Average     Min    Max          S.D.        Outliers
  20      19.486162   19.482250   19.491400    0.002740      0

BARRIER time =        19.486162 microseconds +/- 0.005371
BARRIER overhead =    8.996257 microseconds +/- 0.006575

The Reference_time_1 is the elapsed time for a single thread doing a simple 
loop without any synchronization. Thank you.

-- 
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
chen@jlab.org
###############################################


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above
  2007-12-05 20:36 ` Possible bug from kernel 2.6.22 and above Peter Zijlstra
@ 2007-12-05 20:53   ` Jie Chen
  0 siblings, 0 replies; 35+ messages in thread
From: Jie Chen @ 2007-12-05 20:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Eric Dumazet, Ingo Molnar, Simon Holm Thøgersen

Peter Zijlstra wrote:
> On Wed, 2007-11-21 at 15:34 -0500, Jie Chen wrote:
> 
>> It is clearly that the synchronization overhead increases as the number 
>> of threads increases in the kernel 2.6.21. But the synchronization 
>> overhead actually decreases as the number of threads increases in the 
>> kernel 2.6.23.8 (We observed the same behavior on kernel 2.6.22 as 
>> well). This certainly is not a correct behavior. The kernels are 
>> configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC, 
>> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel 
>> configuration file is in the attachment of this e-mail.
>>
>>  From what we have read, there was a new scheduler (CFS) appeared from 
>> 2.6.22. We are not sure whether the above behavior is caused by the new 
>> scheduler.
> 
> If I read this correctly, you say that: .22 is the first bad one right?
> 
> The new scheduler (CFS) was introduced in .23, so it seems another
> change would be responsible for this.
> 
> 
> 
Hi, Peter:

Yes. We did observe this in 2.6.22. Thank you.

-- 
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
chen@jlab.org
###############################################


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 20:52                           ` Jie Chen
@ 2007-12-05 21:02                             ` Ingo Molnar
  2007-12-05 22:16                               ` Jie Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2007-12-05 21:02 UTC (permalink / raw)
  To: Jie Chen
  Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra


* Jie Chen <chen@jlab.org> wrote:

>> sorry to be dense, but could you give me instructions how i could 
>> remove the affinity mask and test the "barrier overhead" myself? I 
>> have built "pthread_sync" and it outputs numbers for me - which one 
>> would be the barrier overhead: Reference_time_1 ?
>
> To disable affinity, do configure --enable-public-release 
> --disable-thread_affinity. You should see barrier overhead like the 
> following: Computing BARRIER time
>
> Sample_size  Average     Min    Max          S.D.        Outliers
>  20      19.486162   19.482250   19.491400    0.002740      0
>
> BARRIER time =        19.486162 microseconds +/- 0.005371
> BARRIER overhead =    8.996257 microseconds +/- 0.006575

ok, i did that and rebuilt. I also did "make check" and got 
src/pthread_sync, which i can run. The only thing i'm missing: if i run 
src/pthread_sync, it outputs "PARALLEL time":

 PARALLEL time =                           22.486103 microseconds +/- 3.944821
 PARALLEL overhead =                       10.638658 microseconds +/- 10.854154

not "BARRIER time". I've re-read the discussion and found no hint about 
how to build and run a barrier test. Either i missed it or it's so 
obvious to you that you didn't mention it :-)

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 21:02                             ` Ingo Molnar
@ 2007-12-05 22:16                               ` Jie Chen
  2007-12-06 10:43                                 ` Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: Jie Chen @ 2007-12-05 22:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
> 
>>> sorry to be dense, but could you give me instructions how i could 
>>> remove the affinity mask and test the "barrier overhead" myself? I 
>>> have built "pthread_sync" and it outputs numbers for me - which one 
>>> would be the barrier overhead: Reference_time_1 ?
>> To disable affinity, do configure --enable-public-release 
>> --disable-thread_affinity. You should see barrier overhead like the 
>> following: Computing BARRIER time
>>
>> Sample_size  Average     Min    Max          S.D.        Outliers
>>  20      19.486162   19.482250   19.491400    0.002740      0
>>
>> BARRIER time =        19.486162 microseconds +/- 0.005371
>> BARRIER overhead =    8.996257 microseconds +/- 0.006575
> 
> ok, i did that and rebuilt. I also did "make check" and got 
> src/pthread_sync which i can run. The only thing i'm missing, if i run 
> src/pthread_sync, it outputs "PARALLEL time":
> 
>  PARALLEL time =                           22.486103 microseconds +/- 3.944821
>  PARALLEL overhead =                       10.638658 microseconds +/- 10.854154
> 
> not "BARRIER time". I've re-read the discussion and found no hint about 
> how to build and run a barrier test. Either i missed it or it's so 
> obvious to you that you didnt mention it :-)
> 
> 	Ingo

Hi, Ingo:

Did you do configure --enable-public-release? My qmt is for QCD 
calculations (a type of physics code). Without the above flag one can 
only test the PARALLEL overhead. Actually, the PARALLEL benchmark shows the 
same behavior as the BARRIER one. Thanks.


###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
chen@jlab.org
###############################################

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 22:16                               ` Jie Chen
@ 2007-12-06 10:43                                 ` Ingo Molnar
  2007-12-06 16:29                                   ` Jie Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2007-12-06 10:43 UTC (permalink / raw)
  To: Jie Chen
  Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra


* Jie Chen <chen@jlab.org> wrote:

>> not "BARRIER time". I've re-read the discussion and found no hint 
>> about how to build and run a barrier test. Either i missed it or it's 
>> so obvious to you that you didnt mention it :-)
>>
>> 	Ingo
>
> Hi, Ingo:
>
> Did you do configure --enable-public-release? My qmt is for qcd 
> calculation (one type of physics code) [...]

yes, i did exactly as instructed.

> [...]. Without the above flag one can only test PARALLEL overhead. 
> Actually the PARALLEL benchmark has the same behavior as the BARRIER. 
> Thanks.

hm, but PARALLEL does not seem to do that much context switching. So 
basically you create the threads and do a few short runs to establish the 
overhead? Threads do not get fork-balanced at the moment - but turning 
it on would be easy. Could you try the patch below - how does it impact 
your results? (and please keep the affinity setting off)

	Ingo

----------->
Subject: sched: reactivate fork balancing
From: Ingo Molnar <mingo@elte.hu>

reactivate fork balancing.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/topology.h |    3 +++
 1 file changed, 3 insertions(+)

Index: linux/include/linux/topology.h
===================================================================
--- linux.orig/include/linux/topology.h
+++ linux/include/linux/topology.h
@@ -103,6 +103,7 @@
 	.forkexec_idx		= 0,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_IDLE		\
@@ -134,6 +135,7 @@
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_IDLE		\
@@ -165,6 +167,7 @@
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| BALANCE_FOR_PKG_POWER,\

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-06 10:43                                 ` Ingo Molnar
@ 2007-12-06 16:29                                   ` Jie Chen
  2007-12-10 10:59                                     ` Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: Jie Chen @ 2007-12-06 16:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
> 
>>> not "BARRIER time". I've re-read the discussion and found no hint 
>>> about how to build and run a barrier test. Either i missed it or it's 
>>> so obvious to you that you didnt mention it :-)
>>>
>>> 	Ingo
>> Hi, Ingo:
>>
>> Did you do configure --enable-public-release? My qmt is for qcd 
>> calculation (one type of physics code) [...]
> 
> yes, i did exactly as instructed.
> 
>> [...]. Without the above flag one can only test PARALLEL overhead. 
>> Actually the PARALLEL benchmark has the same behavior as the BARRIER. 
>> Thanks.
> 
> hm, but PARALLEL does not seem to do that much context switching. So 
> basically you create the threads and do a few short runs to establish 
> overhead? Threads do not get fork-balanced at the moment - but turning 
> it on would be easy. Could you try the patch below - how does it impact 
> your results? (and please keep affinity setting off)
> 
> 	Ingo
> 
> ----------->
> Subject: sched: reactivate fork balancing
> From: Ingo Molnar <mingo@elte.hu>
> 
> reactivate fork balancing.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  include/linux/topology.h |    3 +++
>  1 file changed, 3 insertions(+)
> 
> Index: linux/include/linux/topology.h
> ===================================================================
> --- linux.orig/include/linux/topology.h
> +++ linux/include/linux/topology.h
> @@ -103,6 +103,7 @@
>  	.forkexec_idx		= 0,			\
>  	.flags			= SD_LOAD_BALANCE	\
>  				| SD_BALANCE_NEWIDLE	\
> +				| SD_BALANCE_FORK	\
>  				| SD_BALANCE_EXEC	\
>  				| SD_WAKE_AFFINE	\
>  				| SD_WAKE_IDLE		\
> @@ -134,6 +135,7 @@
>  	.forkexec_idx		= 1,			\
>  	.flags			= SD_LOAD_BALANCE	\
>  				| SD_BALANCE_NEWIDLE	\
> +				| SD_BALANCE_FORK	\
>  				| SD_BALANCE_EXEC	\
>  				| SD_WAKE_AFFINE	\
>  				| SD_WAKE_IDLE		\
> @@ -165,6 +167,7 @@
>  	.forkexec_idx		= 1,			\
>  	.flags			= SD_LOAD_BALANCE	\
>  				| SD_BALANCE_NEWIDLE	\
> +				| SD_BALANCE_FORK	\
>  				| SD_BALANCE_EXEC	\
>  				| SD_WAKE_AFFINE	\
>  				| BALANCE_FOR_PKG_POWER,\
Hi, Ingo:

I patched the header file and recompiled the kernel. I observed no 
difference (the two-thread overhead stays high). Thank you.

-- 
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
chen@jlab.org
###############################################


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-06 16:29                                   ` Jie Chen
@ 2007-12-10 10:59                                     ` Ingo Molnar
  2007-12-10 20:04                                       ` Jie Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2007-12-10 10:59 UTC (permalink / raw)
  To: Jie Chen; +Cc: linux-kernel, Peter Zijlstra


* Jie Chen <chen@jlab.org> wrote:

> I did patch the header file and recompiled the kernel. I observed no 
> difference (two threads overhead stays too high). Thank you.

ok, i think i found it. You do this in your qmt/pthread_sync.c 
test-code:

 double get_time_of_day_()
 {
 ...
   err = gettimeofday(&ts, NULL);
 ...
 }

and then you use this in the measurement loop:

   for (k=0; k<=OUTERREPS; k++){
     start  = getclock();
     for (j=0; j<innerreps; j++){
 #ifdef _QMT_PUBLIC
       delay((void *)0, 0);
 #else
       delay(0, 0, 0, (void *)0);
 #endif
     }
     times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
   }

the problem is, this does not take the overhead of gettimeofday into 
account - an overhead which can easily reach 10 usecs (the size of the 
observed regression). Could you try to eliminate the gettimeofday 
overhead from your measurement?

gettimeofday overhead is something that might have changed from .21 to 
.22 on your box.
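
(As a minimal sketch of estimating that per-call cost directly, so it can be
compared against the reported overheads - illustrative only, not part of qmt:)

#include <stdio.h>
#include <sys/time.h>

/* Estimate the average cost of one gettimeofday() call by timing a large
 * number of back-to-back calls. */
int main(void)
{
	enum { CALLS = 1000000 };
	struct timeval t0, t1;
	double usecs;
	int i;

	gettimeofday(&t0, NULL);
	for (i = 0; i < CALLS; i++)
		gettimeofday(&t1, NULL);
	usecs = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
	printf("average gettimeofday() cost: %.3f usec/call\n", usecs / CALLS);
	return 0;
}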

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-10 10:59                                     ` Ingo Molnar
@ 2007-12-10 20:04                                       ` Jie Chen
  2007-12-11 10:51                                         ` Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: Jie Chen @ 2007-12-10 20:04 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
> 
>> I did patch the header file and recompiled the kernel. I observed no 
>> difference (two threads overhead stays too high). Thank you.
> 
> ok, i think i found it. You do this in your qmt/pthread_sync.c 
> test-code:
> 
>  double get_time_of_day_()
>  {
>  ...
>    err = gettimeofday(&ts, NULL);
>  ...
>  }
> 
> and then you use this in the measurement loop:
> 
>    for (k=0; k<=OUTERREPS; k++){
>      start  = getclock();
>      for (j=0; j<innerreps; j++){
>  #ifdef _QMT_PUBLIC
>        delay((void *)0, 0);
>  #else
>        delay(0, 0, 0, (void *)0);
>  #endif
>      }
>      times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
>    }
> 
> the problem is, this does not take the overhead of gettimeofday into 
> account - which overhead can easily reach 10 usecs (the observed 
> regression). Could you try to eliminate the gettimeofday overhead from 
> your measurement?
> 
> gettimeofday overhead is something that might have changed from .21 to 
> .22 on your box.
> 
> 	Ingo

Hi, Ingo:

In my pthread_sync code, I first call the refer() subroutine, which 
establishes the elapsed time (reference time) for the non-synchronized 
delay() using gettimeofday. Each synchronization overhead value is then 
obtained by subtracting the reference time from the elapsed time measured 
with synchronization introduced. The effect of gettimeofday() should be 
minimal if the time difference (the overhead value) is what matters here, 
unless gettimeofday behaves differently when running 8 threads vs. 
running 2 threads.

I will try to replace gettimeofday with a lightweight timer call in my 
test code. Thank you very much.
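
(A minimal sketch of one possible lightweight timer - reading the x86
timestamp counter directly. This is illustrative only and not code from qmt;
it assumes a constant-rate TSC and that both readings happen on the same core,
which holds when the threads are pinned. The cycles_per_usec value below is a
placeholder that would need to be calibrated once, e.g. against gettimeofday:)

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	/* RDTSC returns the 64-bit timestamp counter in EDX:EAX. */
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	/* Placeholder: cycles per microsecond for this CPU, measured once. */
	const double cycles_per_usec = 2000.0;
	uint64_t start, end;
	volatile double t = 0.0;
	int i;

	start = rdtsc();
	for (i = 0; i < 1000000; i++)
		t += 0.3456;
	end = rdtsc();
	printf("loop took about %.3f usec\n", (double)(end - start) / cycles_per_usec);
	return 0;
}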

-- 
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
chen@jlab.org
###############################################


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-10 20:04                                       ` Jie Chen
@ 2007-12-11 10:51                                         ` Ingo Molnar
  2007-12-11 15:28                                           ` Jie Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2007-12-11 10:51 UTC (permalink / raw)
  To: Jie Chen; +Cc: linux-kernel, Peter Zijlstra


* Jie Chen <chen@jlab.org> wrote:

>> and then you use this in the measurement loop:
>>
>>    for (k=0; k<=OUTERREPS; k++){
>>      start  = getclock();
>>      for (j=0; j<innerreps; j++){
>>  #ifdef _QMT_PUBLIC
>>        delay((void *)0, 0);
>>  #else
>>        delay(0, 0, 0, (void *)0);
>>  #endif
>>      }
>>      times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
>>    }
>>
>> the problem is, this does not take the overhead of gettimeofday into 
>> account - which overhead can easily reach 10 usecs (the observed 
>> regression). Could you try to eliminate the gettimeofday overhead from 
>> your measurement?
>>
>> gettimeofday overhead is something that might have changed from .21 to .22 
>> on your box.
>>
>> 	Ingo
>
> Hi, Ingo:
>
> In my pthread_sync code, I first call refer () subroutine which 
> actually establishes the elapsed time (reference time) for 
> non-synchronized delay() using the gettimeofday. Then each 
> synchronization overhead value is obtained by subtracting the 
> reference time from the elapsed time with introduced synchronization. 
> The effect of gettimeofday() should be minimal if the time difference 
> (overhead value) is the interest here. Unless the gettimeofday behaves 
> differently in the case of running 8 threads .vs. running 2 threads.
>
> I will try to replace gettimeofday with a lightweight timer call in my 
> test code. Thank you very much.

gettimeofday overhead is around 10 usecs here:

 2740  1197359374.873214 gettimeofday({1197359374, 873225}, NULL) = 0 <0.000010>
 2740  1197359374.970592 gettimeofday({1197359374, 970608}, NULL) = 0 <0.000010>

and that's the only thing that is going on when computing the reference 
time - and i see a similar syscall pattern in the PARALLEL and BARRIER 
calculations as well (with no real scheduling going on).

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-11 10:51                                         ` Ingo Molnar
@ 2007-12-11 15:28                                           ` Jie Chen
  2007-12-11 15:52                                             ` Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: Jie Chen @ 2007-12-11 15:28 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
> 
>>> and then you use this in the measurement loop:
>>>
>>>    for (k=0; k<=OUTERREPS; k++){
>>>      start  = getclock();
>>>      for (j=0; j<innerreps; j++){
>>>  #ifdef _QMT_PUBLIC
>>>        delay((void *)0, 0);
>>>  #else
>>>        delay(0, 0, 0, (void *)0);
>>>  #endif
>>>      }
>>>      times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
>>>    }
>>>
>>> the problem is, this does not take the overhead of gettimeofday into 
>>> account - which overhead can easily reach 10 usecs (the observed 
>>> regression). Could you try to eliminate the gettimeofday overhead from 
>>> your measurement?
>>>
>>> gettimeofday overhead is something that might have changed from .21 to .22 
>>> on your box.
>>>
>>> 	Ingo
>> Hi, Ingo:
>>
>> In my pthread_sync code, I first call refer () subroutine which 
>> actually establishes the elapsed time (reference time) for 
>> non-synchronized delay() using the gettimeofday. Then each 
>> synchronization overhead value is obtained by subtracting the 
>> reference time from the elapsed time with introduced synchronization. 
>> The effect of gettimeofday() should be minimal if the time difference 
>> (overhead value) is the interest here. Unless the gettimeofday behaves 
>> differently in the case of running 8 threads .vs. running 2 threads.
>>
>> I will try to replace gettimeofday with a lightweight timer call in my 
>> test code. Thank you very much.
> 
> gettimeofday overhead is around 10 usecs here:
> 
>  2740  1197359374.873214 gettimeofday({1197359374, 873225}, NULL) = 0 <0.000010>
>  2740  1197359374.970592 gettimeofday({1197359374, 970608}, NULL) = 0 <0.000010>
> 
> and that's the only thing that is going on when computing the reference 
> time - and i see a similar syscall pattern in the PARALLEL and BARRIER 
> calculations as well (with no real scheduling going on).
> 
> 	Ingo

Hi, Ingo:

I guess this is good news. I patched the 2.6.21.7 kernel with your cfs 
patch. The results of pthread_sync are the same as on the non-patched 2.6.21 
kernel. This means the performance difference is not related to the scheduler. 
As for the overhead of gettimeofday, there is no difference between 
2.6.21 and 2.6.24-rc4; the reference time is around 10.5 us for both 
kernels.

So what changed between 2.6.21 and 2.6.22? Any hints :-)? Thank you 
very much for all your help.

-- 
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
chen@jlab.org
###############################################


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-11 15:28                                           ` Jie Chen
@ 2007-12-11 15:52                                             ` Ingo Molnar
  2007-12-11 16:39                                               ` Jie Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2007-12-11 15:52 UTC (permalink / raw)
  To: Jie Chen; +Cc: linux-kernel, Peter Zijlstra


* Jie Chen <chen@jlab.org> wrote:

> Hi, Ingo:
>
> I guess it is a good news. I did patch 2.6.21.7 kernel using your cfs 
> patch. The results of pthread_sync is the same as the non-patched 
> 2.6.21 kernel. This means the performance of is not related to the 
> scheduler. As for overhead of the gettimeofday, there is no difference 
> between 2.6.21 and 2.6.24-rc4. The reference time is around 10.5 us 
> for both kernel.

could you please paste again the relevant portion of the output you get 
on a "good" .21 kernel versus the output you get on a "bad" .24 kernel?

> So what is changed between 2.6.21 and 2.6.22? Any hints :-). Thank you 
> very much for all your help.

we'll figure it out i'm sure :)

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-11 15:52                                             ` Ingo Molnar
@ 2007-12-11 16:39                                               ` Jie Chen
  2007-12-11 21:23                                                 ` Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: Jie Chen @ 2007-12-11 16:39 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
> 
>> Hi, Ingo:
>>
>> I guess it is a good news. I did patch 2.6.21.7 kernel using your cfs 
>> patch. The results of pthread_sync is the same as the non-patched 
>> 2.6.21 kernel. This means the performance of is not related to the 
>> scheduler. As for overhead of the gettimeofday, there is no difference 
>> between 2.6.21 and 2.6.24-rc4. The reference time is around 10.5 us 
>> for both kernel.
> 
> could you please paste again the relevant portion of the output you get 
> on a "good" .21 kernel versus the output you get on a "bad" .24 kernel?
> 
>> So what is changed between 2.6.21 and 2.6.22? Any hints :-). Thank you 
>> very much for all your help.
> 
> we'll figure it out i'm sure :)
> 
> 	Ingo

Hi, Ingo:

The following is pthread_sync output for 2.6.21.7-cfs-v24 #1 SMP kernel.

2 threads:

Computing reference time 1

Sample_size       Average     Min     Max    S.D.       Outliers
  20     10.489085   10.488800   10.491100    0.000539      1

Reference_time_1 =  10.489085 microseconds +/- 0.001057

Computing PARALLEL time

Sample_size       Average     Min     Max  S.D.          Outliers
  20     11.106580   11.105650   11.109700    0.001255      0

PARALLEL time =       11.106580 microseconds +/- 0.002460
PARALLEL overhead =    0.617590 microseconds +/- 0.003409

8 threads:
Computing reference time 1

Sample_size       Average     Min    Max     S.D.          Outliers
  20        10.488735   10.488500   10.490700    0.000484      1

Reference_time_1 =     10.488735 microseconds +/- 0.000948

Computing PARALLEL time

Sample_size       Average     Min    Max     S.D.          Outliers
  20       13.000647   12.991050   13.052700    0.012592      1

PARALLEL time =        13.000647 microseconds +/- 0.024680
PARALLEL overhead =     2.511907 microseconds +/- 0.025594


Output for Kernel 2.6.24-rc4 #1 SMP

2 threads:
Computing reference time 1

Sample_size       Average     Min     Max    S.D.          Outliers
  20          10.510535   10.508600   10.518200    0.002237      1

Reference_time_1 =           10.510535 microseconds +/- 0.004384

Computing PARALLEL time

Sample_size       Average     Min    Max     S.D.          Outliers
  20         19.668450   19.650200   19.679650    0.008052      0

PARALLEL time =              19.668450 microseconds +/- 0.015782
PARALLEL overhead =           9.157945 microseconds +/- 0.018217

8 threads:
Computing reference time 1

Sample_size       Average     Min    Max    S.D.          Outliers
  20        10.491285   10.490100   10.494900    0.001085      1

Reference_time_1 =           10.491285 microseconds +/- 0.002127

Computing PARALLEL time

Sample_size       Average     Min    Max    S.D.          Outliers
  20        13.090080   13.079150   13.131450    0.010995      1

PARALLEL time =              13.090080 microseconds +/- 0.021550
PARALLEL overhead =          2.598590 microseconds +/- 0.024534

For 8 threads, both kernels show similar performance numbers. But for 
2 threads, 2.6.21 is much better than 2.6.24-rc4. Thank you.


-- 
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
chen@jlab.org
###############################################


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-11 16:39                                               ` Jie Chen
@ 2007-12-11 21:23                                                 ` Ingo Molnar
  2007-12-11 22:11                                                   ` Jie Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2007-12-11 21:23 UTC (permalink / raw)
  To: Jie Chen; +Cc: linux-kernel, Peter Zijlstra


* Jie Chen <chen@jlab.org> wrote:

> The following is pthread_sync output for 2.6.21.7-cfs-v24 #1 SMP 
> kernel.

> 2 threads:

> PARALLEL time =       11.106580 microseconds +/- 0.002460
> PARALLEL overhead =    0.617590 microseconds +/- 0.003409

> Output for Kernel 2.6.24-rc4 #1 SMP

> PARALLEL time =              19.668450 microseconds +/- 0.015782
> PARALLEL overhead =           9.157945 microseconds +/- 0.018217

ok, so the problem is that this PARALLEL time has an additional +9 usecs 
of overhead, right? I don't see this myself on a Core2 CPU:

PARALLEL time =                           10.446933 microseconds +/- 0.078849
PARALLEL overhead =                       0.751732 microseconds +/- 0.177446

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-11 21:23                                                 ` Ingo Molnar
@ 2007-12-11 22:11                                                   ` Jie Chen
  2007-12-12 12:49                                                     ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Jie Chen @ 2007-12-11 22:11 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
> 
>> The following is pthread_sync output for 2.6.21.7-cfs-v24 #1 SMP 
>> kernel.
> 
>> 2 threads:
> 
>> PARALLEL time =       11.106580 microseconds +/- 0.002460
>> PARALLEL overhead =    0.617590 microseconds +/- 0.003409
> 
>> Output for Kernel 2.6.24-rc4 #1 SMP
> 
>> PARALLEL time =              19.668450 microseconds +/- 0.015782
>> PARALLEL overhead =           9.157945 microseconds +/- 0.018217
> 
> ok, so the problem is that this PARALLEL time has an additional +9 usecs 
> overhead, right? I dont see this myself on a Core2 CPU:
> 
> PARALLEL time =                           10.446933 microseconds +/- 0.078849
> PARALLEL overhead =                       0.751732 microseconds +/- 0.177446
> 
> 	Ingo
Hi, Ingo:

Yes, the extra 9 usecs of overhead shows up when running two threads on the 
2.6.24 kernel on a machine with a total of 8 cores (2 quad-core Opterons). 
How many cores in total do you have? I do not have dual quad-core Xeon 
machines here for a direct comparison. Thank you.

-- 
#############################################################################
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# chen@jlab.org
# (757)269-5046 (office)
# (757)269-6248 (fax)
#############################################################################

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-11 22:11                                                   ` Jie Chen
@ 2007-12-12 12:49                                                     ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2007-12-12 12:49 UTC (permalink / raw)
  To: Jie Chen; +Cc: Ingo Molnar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1277 bytes --]


On Tue, 2007-12-11 at 17:11 -0500, Jie Chen wrote:
> Ingo Molnar wrote:
> > * Jie Chen <chen@jlab.org> wrote:
> > 
> >> The following is pthread_sync output for 2.6.21.7-cfs-v24 #1 SMP 
> >> kernel.
> > 
> >> 2 threads:
> > 
> >> PARALLEL time =       11.106580 microseconds +/- 0.002460
> >> PARALLEL overhead =    0.617590 microseconds +/- 0.003409
> > 
> >> Output for Kernel 2.6.24-rc4 #1 SMP
> > 
> >> PARALLEL time =              19.668450 microseconds +/- 0.015782
> >> PARALLEL overhead =           9.157945 microseconds +/- 0.018217
> > 
> > ok, so the problem is that this PARALLEL time has an additional +9 usecs 
> > overhead, right? I dont see this myself on a Core2 CPU:
> > 
> > PARALLEL time =                           10.446933 microseconds +/- 0.078849
> > PARALLEL overhead =                       0.751732 microseconds +/- 0.177446
> > 

On my dual socket AMD Athlon MP

2.6.20-13-generic

PARALLEL time =                           22.751875 microseconds +/- 21.370942
PARALLEL overhead =                       7.046595 microseconds +/- 24.370040

2.6.24-rc5

PARALLEL time =                           17.365543 microseconds +/- 3.295133
PARALLEL overhead =                       2.213722 microseconds +/- 4.797886



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2007-12-12 12:49 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-11-21 20:34 Possible bug from kernel 2.6.22 and above Jie Chen
2007-11-21 22:14 ` Eric Dumazet
2007-11-22  1:52   ` Jie Chen
2007-11-22  2:32     ` Simon Holm Thøgersen
2007-11-22  2:58       ` Jie Chen
2007-11-22 20:19         ` Matt Mackall
2007-12-04 13:17         ` Possible bug from kernel 2.6.22 and above, 2.6.24-rc4 Ingo Molnar
2007-12-04 15:41           ` Jie Chen
2007-12-05 15:29           ` Jie Chen
2007-12-05 15:40             ` Ingo Molnar
2007-12-05 16:16               ` Eric Dumazet
2007-12-05 16:25                 ` Ingo Molnar
2007-12-05 16:29                   ` Eric Dumazet
2007-12-05 16:22               ` Jie Chen
2007-12-05 16:47                 ` Ingo Molnar
2007-12-05 17:47                   ` Jie Chen
2007-12-05 20:03                     ` Ingo Molnar
2007-12-05 20:23                       ` Jie Chen
2007-12-05 20:46                         ` Ingo Molnar
2007-12-05 20:52                           ` Jie Chen
2007-12-05 21:02                             ` Ingo Molnar
2007-12-05 22:16                               ` Jie Chen
2007-12-06 10:43                                 ` Ingo Molnar
2007-12-06 16:29                                   ` Jie Chen
2007-12-10 10:59                                     ` Ingo Molnar
2007-12-10 20:04                                       ` Jie Chen
2007-12-11 10:51                                         ` Ingo Molnar
2007-12-11 15:28                                           ` Jie Chen
2007-12-11 15:52                                             ` Ingo Molnar
2007-12-11 16:39                                               ` Jie Chen
2007-12-11 21:23                                                 ` Ingo Molnar
2007-12-11 22:11                                                   ` Jie Chen
2007-12-12 12:49                                                     ` Peter Zijlstra
2007-12-05 20:36 ` Possible bug from kernel 2.6.22 and above Peter Zijlstra
2007-12-05 20:53   ` Jie Chen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).