* Possible bug from kernel 2.6.22 and above
@ 2007-11-21 20:34 Jie Chen
  2007-11-21 22:14 ` Eric Dumazet
  2007-12-05 20:36 ` Possible bug from kernel 2.6.22 and above Peter Zijlstra
  0 siblings, 2 replies; 35+ messages in thread

From: Jie Chen @ 2007-11-21 20:34 UTC (permalink / raw)
To: linux-kernel; +Cc: Jie Chen

[-- Attachment #1: Type: text/plain, Size: 3835 bytes --]

Hi, there:

We have a simple pthread program that measures the synchronization
overheads of various synchronization mechanisms such as spin locks,
barriers (the barrier is implemented using a queue-based barrier
algorithm) and so on. We have dual quad-core AMD Opteron (Barcelona)
clusters running the 2.6.23.8 kernel at this moment, using the Fedora
Core 7 distribution. Before we moved to this kernel, we ran kernel
2.6.21. The two kernels are configured identically and compiled with
the same gcc 4.1.2 compiler. Under the old kernel, we observed that the
synchronization overhead increases as the number of threads increases
from 2 to 8. The following are the values of total time and overhead
for all threads acquiring a pthread spin lock and for all threads
executing a barrier synchronization call.

Kernel 2.6.21
Number of Threads                   2          4          6          8
SpinLock (Time micro second)   10.5618   10.58538    10.5915     10.643
         (Overhead)              0.073    0.05746   0.102805   0.154563
Barrier  (Time micro second) 11.020410  11.678125    11.9889   12.38002
         (Overhead)           0.531660     1.1502   1.500112   1.891617

Each thread is bound to a particular core using pthread_setaffinity_np.

Kernel 2.6.23.8
Number of Threads                   2          4          6          8
SpinLock (Time micro second) 14.849915  17.117603    14.4496    10.5990
         (Overhead)           4.345417   6.617207   3.949435   0.110985
Barrier  (Time micro second) 19.462255  20.285117   16.19395   12.37662
         (Overhead)           8.957755   9.784722   5.699590   1.869518

It is clear that the synchronization overhead increases as the number
of threads increases in kernel 2.6.21. But the synchronization overhead
actually decreases as the number of threads increases in kernel
2.6.23.8 (we observed the same behavior on kernel 2.6.22 as well). This
certainly is not correct behavior. The kernels are configured with
CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC, CONFIG_PREEMPT_NONE and
CONFIG_DISCONTIGMEM set. The complete kernel configuration file is in
the attachment of this e-mail.

From what we have read, a new scheduler (CFS) appeared as of 2.6.22. We
are not sure whether the above behavior is caused by the new scheduler.

Finally, our machine's cpu information is listed in the following:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 2
model name      : Quad-Core AMD Opteron(tm) Processor 2347
stepping        : 10
cpu MHz         : 1909.801
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp
lm 3dnowext 3dnow constant_tsc rep_good pni cx16 popcnt lahf_lm cmp_legacy svm
extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips        : 3822.95
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

In addition, we have the schedstat and sched_debug files in the /proc
directory.

Thank you for all your help in solving this puzzle. If you need more
information, please let us know.

P.S. I'd like to be cc'ed on the discussions related to this problem.
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
chen@jlab.org
###############################################

[-- Attachment #2: kernel-2.6.23.8-config --]
[-- Type: text/plain, Size: 19970 bytes --]

CONFIG_X86_64=y CONFIG_64BIT=y CONFIG_X86=y CONFIG_GENERIC_TIME=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_ZONE_DMA32=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_RWSEM_GENERIC_SPINLOCK=y CONFIG_GENERIC_HWEIGHT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_CMPXCHG=y CONFIG_EARLY_PRINTK=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_DMI=y CONFIG_AUDIT_ARCH=y CONFIG_GENERIC_BUG=y CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y CONFIG_AUDIT=y CONFIG_AUDITSYSCALL=y CONFIG_IKCONFIG=m CONFIG_IKCONFIG_PROC=y CONFIG_CPUSETS=y CONFIG_SYSFS_DEPRECATED=y CONFIG_RELAY=y CONFIG_BLK_DEV_INITRD=y CONFIG_SYSCTL=y CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y CONFIG_KALLSYMS_EXTRA_PASS=y CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_ANON_INODES=y CONFIG_EPOLL=y CONFIG_SIGNALFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_SLAB=y CONFIG_RT_MUTEXES=y CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODVERSIONS=y CONFIG_MODULE_SRCVERSION_ALL=y CONFIG_KMOD=y CONFIG_STOP_MACHINE=y CONFIG_BLOCK=y CONFIG_BLK_DEV_IO_TRACE=y CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y CONFIG_DEFAULT_CFQ=y CONFIG_X86_PC=y CONFIG_MK8=y CONFIG_X86_TSC=y CONFIG_X86_GOOD_APIC=y CONFIG_MICROCODE=m CONFIG_MICROCODE_OLD_INTERFACE=y CONFIG_X86_MSR=y CONFIG_X86_CPUID=y CONFIG_X86_IO_APIC=y CONFIG_X86_LOCAL_APIC=y CONFIG_MTRR=y CONFIG_SMP=y CONFIG_SCHED_MC=y CONFIG_PREEMPT_NONE=y CONFIG_NUMA=y CONFIG_K8_NUMA=y CONFIG_X86_64_ACPI_NUMA=y CONFIG_ARCH_DISCONTIGMEM_ENABLE=y CONFIG_ARCH_DISCONTIGMEM_DEFAULT=y CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_SELECT_MEMORY_MODEL=y CONFIG_DISCONTIGMEM_MANUAL=y CONFIG_DISCONTIGMEM=y CONFIG_FLAT_NODE_MEM_MAP=y CONFIG_NEED_MULTIPLE_NODES=y CONFIG_MIGRATION=y CONFIG_RESOURCES_64BIT=y CONFIG_BOUNCE=y CONFIG_VIRT_TO_BUS=y CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y CONFIG_HPET_TIMER=y CONFIG_HPET_EMULATE_RTC=y CONFIG_IOMMU=y CONFIG_SWIOTLB=y CONFIG_X86_MCE=y CONFIG_X86_MCE_AMD=y CONFIG_KEXEC=y CONFIG_HZ_100=y CONFIG_K8_NB=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_ISA_DMA_API=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_PM=y CONFIG_SUSPEND_SMP_POSSIBLE=y CONFIG_HIBERNATION_SMP_POSSIBLE=y CONFIG_ACPI=y CONFIG_ACPI_PROCFS=y CONFIG_ACPI_PROC_EVENT=y CONFIG_ACPI_AC=m CONFIG_ACPI_BATTERY=m CONFIG_ACPI_BUTTON=m CONFIG_ACPI_VIDEO=m CONFIG_ACPI_FAN=y CONFIG_ACPI_DOCK=m CONFIG_ACPI_PROCESSOR=y CONFIG_ACPI_THERMAL=y CONFIG_ACPI_NUMA=y CONFIG_ACPI_ASUS=m CONFIG_ACPI_EC=y CONFIG_ACPI_POWER=y CONFIG_ACPI_SYSTEM=y CONFIG_X86_PM_TIMER=y CONFIG_ACPI_CONTAINER=y CONFIG_PCI=y CONFIG_PCI_DIRECT=y CONFIG_PCI_MMCONFIG=y CONFIG_PCIEPORTBUS=y CONFIG_HOTPLUG_PCI_PCIE=m CONFIG_PCIEAER=y CONFIG_ARCH_SUPPORTS_MSI=y CONFIG_PCI_MSI=y CONFIG_HT_IRQ=y CONFIG_HOTPLUG_PCI=y CONFIG_HOTPLUG_PCI_FAKE=m
CONFIG_HOTPLUG_PCI_ACPI=m CONFIG_HOTPLUG_PCI_ACPI_IBM=m CONFIG_HOTPLUG_PCI_SHPC=m CONFIG_BINFMT_ELF=y CONFIG_BINFMT_MISC=y CONFIG_IA32_EMULATION=y CONFIG_COMPAT=y CONFIG_COMPAT_FOR_U64_ALIGNMENT=y CONFIG_SYSVIPC_COMPAT=y CONFIG_NET=y CONFIG_PACKET=y CONFIG_UNIX=y CONFIG_XFRM=y CONFIG_XFRM_USER=y CONFIG_NET_KEY=m CONFIG_INET=y CONFIG_IP_MULTICAST=y CONFIG_IP_FIB_HASH=y CONFIG_INET_XFRM_MODE_BEET=y CONFIG_INET_DIAG=m CONFIG_INET_TCP_DIAG=m CONFIG_TCP_CONG_CUBIC=y CONFIG_NETWORK_SECMARK=y CONFIG_BRIDGE=m CONFIG_VLAN_8021Q=m CONFIG_LLC=m CONFIG_NET_PKTGEN=m CONFIG_IRDA=m CONFIG_IRLAN=m CONFIG_IRCOMM=m CONFIG_IRDA_CACHE_LAST_LSAP=y CONFIG_IRDA_FAST_RR=y CONFIG_IRTTY_SIR=m CONFIG_DONGLE=y CONFIG_ESI_DONGLE=m CONFIG_ACTISYS_DONGLE=m CONFIG_TEKRAM_DONGLE=m CONFIG_TOIM3232_DONGLE=m CONFIG_LITELINK_DONGLE=m CONFIG_MA600_DONGLE=m CONFIG_GIRBIL_DONGLE=m CONFIG_MCP2120_DONGLE=m CONFIG_OLD_BELKIN_DONGLE=m CONFIG_ACT200L_DONGLE=m CONFIG_USB_IRDA=m CONFIG_SIGMATEL_FIR=m CONFIG_NSC_FIR=m CONFIG_WINBOND_FIR=m CONFIG_SMC_IRCC_FIR=m CONFIG_ALI_FIR=m CONFIG_VLSI_FIR=m CONFIG_VIA_FIR=m CONFIG_MCS_FIR=m CONFIG_BT=m CONFIG_BT_L2CAP=m CONFIG_BT_SCO=m CONFIG_BT_RFCOMM=m CONFIG_BT_RFCOMM_TTY=y CONFIG_BT_BNEP=m CONFIG_BT_BNEP_MC_FILTER=y CONFIG_BT_BNEP_PROTO_FILTER=y CONFIG_BT_HIDP=m CONFIG_BT_HCIUSB=m CONFIG_BT_HCIUSB_SCO=y CONFIG_BT_HCIUART=m CONFIG_BT_HCIUART_H4=y CONFIG_BT_HCIUART_BCSP=y CONFIG_BT_HCIBCM203X=m CONFIG_BT_HCIBPA10X=m CONFIG_BT_HCIBFUSB=m CONFIG_BT_HCIVHCI=m CONFIG_WIRELESS_EXT=y CONFIG_IEEE80211=m CONFIG_IEEE80211_CRYPT_WEP=m CONFIG_IEEE80211_CRYPT_CCMP=m CONFIG_IEEE80211_SOFTMAC=m CONFIG_IEEE80211_SOFTMAC_DEBUG=y CONFIG_STANDALONE=y CONFIG_PREVENT_FIRMWARE_BUILD=y CONFIG_FW_LOADER=y CONFIG_CONNECTOR=y CONFIG_PROC_EVENTS=y CONFIG_PARPORT=m CONFIG_PARPORT_PC=m CONFIG_PARPORT_SERIAL=m CONFIG_PARPORT_NOT_PC=y CONFIG_PNP=y CONFIG_PNPACPI=y CONFIG_BLK_DEV=y CONFIG_BLK_DEV_FD=m CONFIG_BLK_DEV_LOOP=m CONFIG_BLK_DEV_CRYPTOLOOP=m CONFIG_BLK_DEV_NBD=m CONFIG_BLK_DEV_SX8=m CONFIG_BLK_DEV_UB=m CONFIG_BLK_DEV_RAM=y CONFIG_MISC_DEVICES=y CONFIG_IDE=y CONFIG_BLK_DEV_IDE=y CONFIG_BLK_DEV_IDEDISK=y CONFIG_IDEDISK_MULTI_MODE=y CONFIG_BLK_DEV_IDECD=m CONFIG_BLK_DEV_IDEFLOPPY=y CONFIG_BLK_DEV_IDESCSI=m CONFIG_IDE_TASK_IOCTL=y CONFIG_IDE_PROC_FS=y CONFIG_IDE_GENERIC=y CONFIG_BLK_DEV_IDEPNP=y CONFIG_BLK_DEV_IDEPCI=y CONFIG_IDEPCI_SHARE_IRQ=y CONFIG_IDEPCI_PCIBUS_ORDER=y CONFIG_BLK_DEV_GENERIC=y CONFIG_BLK_DEV_IDEDMA_PCI=y CONFIG_BLK_DEV_AEC62XX=y CONFIG_BLK_DEV_ALI15X3=y CONFIG_BLK_DEV_AMD74XX=y CONFIG_BLK_DEV_ATIIXP=y CONFIG_BLK_DEV_CMD64X=y CONFIG_BLK_DEV_HPT34X=y CONFIG_BLK_DEV_HPT366=y CONFIG_BLK_DEV_PIIX=y CONFIG_BLK_DEV_IT821X=y CONFIG_BLK_DEV_PDC202XX_OLD=y CONFIG_BLK_DEV_PDC202XX_NEW=y CONFIG_BLK_DEV_SVWKS=y CONFIG_BLK_DEV_SIIMAGE=y CONFIG_BLK_DEV_SIS5513=y CONFIG_BLK_DEV_VIA82CXXX=y CONFIG_BLK_DEV_IDEDMA=y CONFIG_RAID_ATTRS=m CONFIG_SCSI=m CONFIG_SCSI_DMA=y CONFIG_SCSI_NETLINK=y CONFIG_SCSI_PROC_FS=y CONFIG_BLK_DEV_SD=m CONFIG_CHR_DEV_ST=m CONFIG_CHR_DEV_OSST=m CONFIG_BLK_DEV_SR=m CONFIG_BLK_DEV_SR_VENDOR=y CONFIG_CHR_DEV_SG=m CONFIG_CHR_DEV_SCH=m CONFIG_SCSI_MULTI_LUN=y CONFIG_SCSI_CONSTANTS=y CONFIG_SCSI_LOGGING=y CONFIG_SCSI_WAIT_SCAN=m CONFIG_SCSI_SPI_ATTRS=m CONFIG_SCSI_FC_ATTRS=m CONFIG_SCSI_ISCSI_ATTRS=m CONFIG_SCSI_SAS_ATTRS=m CONFIG_SCSI_LOWLEVEL=y CONFIG_ISCSI_TCP=m CONFIG_BLK_DEV_3W_XXXX_RAID=m CONFIG_SCSI_3W_9XXX=m CONFIG_SCSI_ACARD=m CONFIG_SCSI_AACRAID=m CONFIG_SCSI_AIC7XXX=m CONFIG_SCSI_AIC7XXX_OLD=m CONFIG_SCSI_AIC79XX=m CONFIG_MEGARAID_NEWGEN=y CONFIG_MEGARAID_MM=m CONFIG_MEGARAID_MAILBOX=m 
CONFIG_MEGARAID_LEGACY=m CONFIG_MEGARAID_SAS=m CONFIG_SCSI_HPTIOP=m CONFIG_SCSI_BUSLOGIC=m CONFIG_SCSI_GDTH=m CONFIG_SCSI_IPS=m CONFIG_SCSI_INITIO=m CONFIG_SCSI_INIA100=m CONFIG_SCSI_PPA=m CONFIG_SCSI_IMM=m CONFIG_SCSI_SYM53C8XX_2=m CONFIG_SCSI_SYM53C8XX_MMIO=y CONFIG_SCSI_QLOGIC_1280=m CONFIG_SCSI_QLA_FC=m CONFIG_SCSI_LPFC=m CONFIG_SCSI_DC395x=m CONFIG_SCSI_DC390T=m CONFIG_ATA=m CONFIG_ATA_ACPI=y CONFIG_SATA_SVW=m CONFIG_ATA_PIIX=m CONFIG_SATA_NV=m CONFIG_PDC_ADMA=m CONFIG_SATA_QSTOR=m CONFIG_SATA_PROMISE=m CONFIG_SATA_SX4=m CONFIG_SATA_SIL=m CONFIG_SATA_SIL24=m CONFIG_SATA_SIS=m CONFIG_SATA_ULI=m CONFIG_SATA_VIA=m CONFIG_SATA_VITESSE=m CONFIG_PATA_AMD=m CONFIG_PATA_ATIIXP=m CONFIG_PATA_CS5520=m CONFIG_PATA_EFAR=m CONFIG_ATA_GENERIC=m CONFIG_PATA_IT821X=m CONFIG_PATA_JMICRON=m CONFIG_PATA_TRIFLEX=m CONFIG_PATA_MPIIX=m CONFIG_PATA_OLDPIIX=m CONFIG_PATA_NETCELL=m CONFIG_PATA_RZ1000=m CONFIG_PATA_SERVERWORKS=m CONFIG_PATA_PDC2027X=m CONFIG_PATA_SIL680=m CONFIG_PATA_SIS=m CONFIG_PATA_VIA=m CONFIG_PATA_WINBOND=m CONFIG_MD=y CONFIG_BLK_DEV_MD=y CONFIG_MD_LINEAR=m CONFIG_MD_RAID0=m CONFIG_MD_RAID1=m CONFIG_MD_RAID10=m CONFIG_MD_RAID456=m CONFIG_MD_RAID5_RESHAPE=y CONFIG_MD_MULTIPATH=m CONFIG_MD_FAULTY=m CONFIG_BLK_DEV_DM=m CONFIG_DM_CRYPT=m CONFIG_DM_SNAPSHOT=m CONFIG_DM_MIRROR=m CONFIG_DM_ZERO=m CONFIG_DM_MULTIPATH=m CONFIG_DM_MULTIPATH_EMC=m CONFIG_FUSION=y CONFIG_FUSION_SPI=m CONFIG_FUSION_FC=m CONFIG_FUSION_SAS=m CONFIG_FUSION_CTL=m CONFIG_NETDEVICES=y CONFIG_DUMMY=m CONFIG_BONDING=m CONFIG_EQUALIZER=m CONFIG_TUN=m CONFIG_PHYLIB=m CONFIG_MARVELL_PHY=m CONFIG_DAVICOM_PHY=m CONFIG_QSEMI_PHY=m CONFIG_LXT_PHY=m CONFIG_CICADA_PHY=m CONFIG_VITESSE_PHY=m CONFIG_SMSC_PHY=m CONFIG_FIXED_PHY=m CONFIG_FIXED_MII_10_FDX=y CONFIG_FIXED_MII_100_FDX=y CONFIG_NET_ETHERNET=y CONFIG_MII=m CONFIG_NET_VENDOR_3COM=y CONFIG_VORTEX=m CONFIG_TYPHOON=m CONFIG_NET_PCI=y CONFIG_PCNET32=m CONFIG_AMD8111_ETH=m CONFIG_AMD8111E_NAPI=y CONFIG_B44=m CONFIG_FORCEDETH=m CONFIG_E100=m CONFIG_EPIC100=m CONFIG_SUNDANCE=m CONFIG_VIA_RHINE=m CONFIG_VIA_RHINE_MMIO=y CONFIG_VIA_RHINE_NAPI=y CONFIG_NETDEV_1000=y CONFIG_ACENIC=m CONFIG_DL2K=m CONFIG_E1000=m CONFIG_E1000_NAPI=y CONFIG_NS83820=m CONFIG_HAMACHI=m CONFIG_YELLOWFIN=m CONFIG_R8169=m CONFIG_R8169_NAPI=y CONFIG_R8169_VLAN=y CONFIG_SIS190=m CONFIG_SKGE=m CONFIG_SKY2=m CONFIG_VIA_VELOCITY=m CONFIG_TIGON3=m CONFIG_BNX2=m CONFIG_NETDEV_10000=y CONFIG_CHELSIO_T1=m CONFIG_CHELSIO_T1_NAPI=y CONFIG_IXGB=m CONFIG_IXGB_NAPI=y CONFIG_S2IO=m CONFIG_S2IO_NAPI=y CONFIG_MYRI10GE=m CONFIG_MLX4_CORE=m CONFIG_MLX4_DEBUG=y CONFIG_NETCONSOLE=m CONFIG_NETPOLL=y CONFIG_NETPOLL_TRAP=y CONFIG_NET_POLL_CONTROLLER=y CONFIG_INPUT=y CONFIG_INPUT_FF_MEMLESS=y CONFIG_INPUT_MOUSEDEV=y CONFIG_INPUT_EVDEV=y CONFIG_INPUT_KEYBOARD=y CONFIG_KEYBOARD_ATKBD=y CONFIG_INPUT_MOUSE=y CONFIG_MOUSE_PS2=y CONFIG_MOUSE_PS2_ALPS=y CONFIG_MOUSE_PS2_LOGIPS2PP=y CONFIG_MOUSE_PS2_SYNAPTICS=y CONFIG_MOUSE_PS2_LIFEBOOK=y CONFIG_MOUSE_PS2_TRACKPOINT=y CONFIG_MOUSE_SERIAL=m CONFIG_MOUSE_VSXXXAA=m CONFIG_INPUT_MISC=y CONFIG_INPUT_PCSPKR=m CONFIG_INPUT_UINPUT=m CONFIG_SERIO=y CONFIG_SERIO_I8042=y CONFIG_SERIO_SERPORT=y CONFIG_SERIO_LIBPS2=y CONFIG_SERIO_RAW=m CONFIG_VT=y CONFIG_VT_CONSOLE=y CONFIG_HW_CONSOLE=y CONFIG_VT_HW_CONSOLE_BINDING=y CONFIG_SERIAL_NONSTANDARD=y CONFIG_CYCLADES=m CONFIG_SYNCLINK=m CONFIG_SYNCLINKMP=m CONFIG_SYNCLINK_GT=m CONFIG_N_HDLC=m CONFIG_SERIAL_8250=y CONFIG_SERIAL_8250_CONSOLE=y CONFIG_FIX_EARLYCON_MEM=y CONFIG_SERIAL_8250_PCI=y CONFIG_SERIAL_8250_PNP=y CONFIG_SERIAL_8250_EXTENDED=y 
CONFIG_SERIAL_8250_MANY_PORTS=y CONFIG_SERIAL_8250_SHARE_IRQ=y CONFIG_SERIAL_8250_DETECT_IRQ=y CONFIG_SERIAL_8250_RSA=y CONFIG_SERIAL_CORE=y CONFIG_SERIAL_CORE_CONSOLE=y CONFIG_SERIAL_JSM=m CONFIG_UNIX98_PTYS=y CONFIG_PRINTER=m CONFIG_LP_CONSOLE=y CONFIG_PPDEV=m CONFIG_TIPAR=m CONFIG_WATCHDOG=y CONFIG_SOFT_WATCHDOG=m CONFIG_SC520_WDT=m CONFIG_I6300ESB_WDT=m CONFIG_W83627HF_WDT=m CONFIG_W83877F_WDT=m CONFIG_W83977F_WDT=m CONFIG_MACHZ_WDT=m CONFIG_PCIPCWATCHDOG=m CONFIG_WDTPCI=m CONFIG_WDT_501_PCI=y CONFIG_USBPCWATCHDOG=m CONFIG_HW_RANDOM=y CONFIG_HW_RANDOM_INTEL=m CONFIG_HW_RANDOM_AMD=m CONFIG_NVRAM=y CONFIG_RTC=y CONFIG_R3964=m CONFIG_AGP=y CONFIG_AGP_AMD64=y CONFIG_AGP_INTEL=y CONFIG_AGP_SIS=y CONFIG_AGP_VIA=y CONFIG_MWAVE=m CONFIG_PC8736x_GPIO=m CONFIG_NSC_GPIO=m CONFIG_HPET=y CONFIG_HANGCHECK_TIMER=m CONFIG_DEVPORT=y CONFIG_I2C=m CONFIG_I2C_BOARDINFO=y CONFIG_I2C_CHARDEV=m CONFIG_I2C_ALGOBIT=m CONFIG_I2C_ALGOPCF=m CONFIG_I2C_ALGOPCA=m CONFIG_I2C_ALI1535=m CONFIG_I2C_ALI1563=m CONFIG_I2C_ALI15X3=m CONFIG_I2C_AMD756=m CONFIG_I2C_AMD756_S4882=m CONFIG_I2C_AMD8111=m CONFIG_I2C_I801=m CONFIG_I2C_I810=m CONFIG_I2C_PIIX4=m CONFIG_I2C_NFORCE2=m CONFIG_I2C_OCORES=m CONFIG_I2C_PARPORT=m CONFIG_I2C_PARPORT_LIGHT=m CONFIG_I2C_PROSAVAGE=m CONFIG_I2C_SAVAGE4=m CONFIG_I2C_SIS5595=m CONFIG_I2C_SIS630=m CONFIG_I2C_SIS96X=m CONFIG_I2C_STUB=m CONFIG_I2C_VIA=m CONFIG_I2C_VIAPRO=m CONFIG_I2C_VOODOO3=m CONFIG_SENSORS_DS1337=m CONFIG_SENSORS_DS1374=m CONFIG_SENSORS_EEPROM=m CONFIG_SENSORS_PCF8574=m CONFIG_SENSORS_PCA9539=m CONFIG_SENSORS_PCF8591=m CONFIG_SENSORS_MAX6875=m CONFIG_HWMON=m CONFIG_HWMON_VID=m CONFIG_SENSORS_ABITUGURU=m CONFIG_SENSORS_ADM1021=m CONFIG_SENSORS_ADM1025=m CONFIG_SENSORS_ADM1026=m CONFIG_SENSORS_ADM1031=m CONFIG_SENSORS_ADM9240=m CONFIG_SENSORS_ASB100=m CONFIG_SENSORS_ATXP1=m CONFIG_SENSORS_DS1621=m CONFIG_SENSORS_F71805F=m CONFIG_SENSORS_FSCHER=m CONFIG_SENSORS_FSCPOS=m CONFIG_SENSORS_GL518SM=m CONFIG_SENSORS_GL520SM=m CONFIG_SENSORS_IT87=m CONFIG_SENSORS_LM63=m CONFIG_SENSORS_LM75=m CONFIG_SENSORS_LM77=m CONFIG_SENSORS_LM78=m CONFIG_SENSORS_LM80=m CONFIG_SENSORS_LM83=m CONFIG_SENSORS_LM85=m CONFIG_SENSORS_LM87=m CONFIG_SENSORS_LM90=m CONFIG_SENSORS_LM92=m CONFIG_SENSORS_MAX1619=m CONFIG_SENSORS_PC87360=m CONFIG_SENSORS_SIS5595=m CONFIG_SENSORS_SMSC47M1=m CONFIG_SENSORS_SMSC47M192=m CONFIG_SENSORS_SMSC47B397=m CONFIG_SENSORS_VIA686A=m CONFIG_SENSORS_VT8231=m CONFIG_SENSORS_W83781D=m CONFIG_SENSORS_W83791D=m CONFIG_SENSORS_W83792D=m CONFIG_SENSORS_W83L785TS=m CONFIG_SENSORS_W83627HF=m CONFIG_SENSORS_W83627EHF=m CONFIG_SENSORS_HDAPS=m CONFIG_VIDEO_DEV=m CONFIG_VIDEO_V4L1=y CONFIG_VIDEO_V4L1_COMPAT=y CONFIG_VIDEO_V4L2=y CONFIG_VIDEO_CAPTURE_DRIVERS=y CONFIG_VIDEO_HELPER_CHIPS_AUTO=y CONFIG_V4L_USB_DRIVERS=y CONFIG_RADIO_ADAPTERS=y CONFIG_DAB=y CONFIG_BACKLIGHT_LCD_SUPPORT=y CONFIG_BACKLIGHT_CLASS_DEVICE=m CONFIG_VIDEO_OUTPUT_CONTROL=m CONFIG_VGA_CONSOLE=y CONFIG_VGACON_SOFT_SCROLLBACK=y CONFIG_VIDEO_SELECT=y CONFIG_DUMMY_CONSOLE=y CONFIG_FONT_8x16=y CONFIG_HID_SUPPORT=y CONFIG_HID=y CONFIG_USB_HID=y CONFIG_HID_FF=y CONFIG_HID_PID=y CONFIG_LOGITECH_FF=y CONFIG_THRUSTMASTER_FF=y CONFIG_USB_HIDDEV=y CONFIG_USB_SUPPORT=y CONFIG_USB_ARCH_HAS_HCD=y CONFIG_USB_ARCH_HAS_OHCI=y CONFIG_USB_ARCH_HAS_EHCI=y CONFIG_USB=y CONFIG_USB_DEVICEFS=y CONFIG_USB_DEVICE_CLASS=y CONFIG_USB_EHCI_HCD=m CONFIG_USB_EHCI_SPLIT_ISO=y CONFIG_USB_EHCI_ROOT_HUB_TT=y CONFIG_USB_EHCI_TT_NEWSCHED=y CONFIG_USB_ISP116X_HCD=m CONFIG_USB_OHCI_HCD=m CONFIG_USB_OHCI_LITTLE_ENDIAN=y CONFIG_USB_UHCI_HCD=m CONFIG_USB_SL811_HCD=m 
CONFIG_USB_ACM=m CONFIG_USB_PRINTER=m CONFIG_USB_STORAGE=m CONFIG_USB_STORAGE_DATAFAB=y CONFIG_USB_STORAGE_FREECOM=y CONFIG_USB_STORAGE_ISD200=y CONFIG_USB_STORAGE_DPCM=y CONFIG_USB_STORAGE_USBAT=y CONFIG_USB_STORAGE_SDDR09=y CONFIG_USB_STORAGE_SDDR55=y CONFIG_USB_STORAGE_JUMPSHOT=y CONFIG_USB_STORAGE_ALAUDA=y CONFIG_USB_LIBUSUAL=y CONFIG_USB_MDC800=m CONFIG_USB_MICROTEK=m CONFIG_USB_MON=y CONFIG_USB_USS720=m CONFIG_USB_SERIAL=m CONFIG_USB_SERIAL_GENERIC=y CONFIG_USB_SERIAL_AIRPRIME=m CONFIG_USB_SERIAL_ARK3116=m CONFIG_USB_SERIAL_BELKIN=m CONFIG_USB_SERIAL_WHITEHEAT=m CONFIG_USB_SERIAL_DIGI_ACCELEPORT=m CONFIG_USB_SERIAL_CP2101=m CONFIG_USB_SERIAL_CYPRESS_M8=m CONFIG_USB_SERIAL_EMPEG=m CONFIG_USB_SERIAL_FTDI_SIO=m CONFIG_USB_SERIAL_FUNSOFT=m CONFIG_USB_SERIAL_VISOR=m CONFIG_USB_SERIAL_IPAQ=m CONFIG_USB_SERIAL_IR=m CONFIG_USB_SERIAL_EDGEPORT=m CONFIG_USB_SERIAL_EDGEPORT_TI=m CONFIG_USB_SERIAL_GARMIN=m CONFIG_USB_SERIAL_IPW=m CONFIG_USB_SERIAL_KEYSPAN_PDA=m CONFIG_USB_SERIAL_KEYSPAN=m CONFIG_USB_SERIAL_KEYSPAN_MPR=y CONFIG_USB_SERIAL_KEYSPAN_USA28=y CONFIG_USB_SERIAL_KEYSPAN_USA28X=y CONFIG_USB_SERIAL_KEYSPAN_USA28XA=y CONFIG_USB_SERIAL_KEYSPAN_USA28XB=y CONFIG_USB_SERIAL_KEYSPAN_USA19=y CONFIG_USB_SERIAL_KEYSPAN_USA18X=y CONFIG_USB_SERIAL_KEYSPAN_USA19W=y CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y CONFIG_USB_SERIAL_KEYSPAN_USA49W=y CONFIG_USB_SERIAL_KEYSPAN_USA49WLC=y CONFIG_USB_SERIAL_KLSI=m CONFIG_USB_SERIAL_KOBIL_SCT=m CONFIG_USB_SERIAL_MCT_U232=m CONFIG_USB_SERIAL_NAVMAN=m CONFIG_USB_SERIAL_PL2303=m CONFIG_USB_SERIAL_HP4X=m CONFIG_USB_SERIAL_SAFE=m CONFIG_USB_SERIAL_SAFE_PADDED=y CONFIG_USB_SERIAL_SIERRAWIRELESS=m CONFIG_USB_SERIAL_TI=m CONFIG_USB_SERIAL_CYBERJACK=m CONFIG_USB_SERIAL_XIRCOM=m CONFIG_USB_SERIAL_OPTION=m CONFIG_USB_SERIAL_OMNINET=m CONFIG_USB_EZUSB=y CONFIG_USB_EMI62=m CONFIG_USB_EMI26=m CONFIG_USB_AUERSWALD=m CONFIG_USB_RIO500=m CONFIG_USB_LEGOTOWER=m CONFIG_USB_LCD=m CONFIG_USB_LED=m CONFIG_USB_IDMOUSE=m CONFIG_USB_APPLEDISPLAY=m CONFIG_USB_SISUSBVGA=m CONFIG_USB_SISUSBVGA_CON=y CONFIG_USB_LD=m CONFIG_USB_TEST=m CONFIG_INFINIBAND=m CONFIG_INFINIBAND_USER_MAD=m CONFIG_INFINIBAND_USER_ACCESS=m CONFIG_INFINIBAND_USER_MEM=y CONFIG_INFINIBAND_ADDR_TRANS=y CONFIG_INFINIBAND_MTHCA=m CONFIG_INFINIBAND_MTHCA_DEBUG=y CONFIG_INFINIBAND_IPATH=m CONFIG_INFINIBAND_AMSO1100=m CONFIG_MLX4_INFINIBAND=m CONFIG_INFINIBAND_IPOIB=m CONFIG_INFINIBAND_IPOIB_DEBUG=y CONFIG_INFINIBAND_IPOIB_DEBUG_DATA=y CONFIG_INFINIBAND_SRP=m CONFIG_INFINIBAND_ISER=m CONFIG_DMA_ENGINE=y CONFIG_NET_DMA=y CONFIG_INTEL_IOATDMA=m CONFIG_VIRTUALIZATION=y CONFIG_EDD=m CONFIG_DELL_RBU=m CONFIG_DCDBAS=m CONFIG_DMIID=y CONFIG_EXT2_FS=y CONFIG_EXT2_FS_XATTR=y CONFIG_EXT2_FS_POSIX_ACL=y CONFIG_EXT2_FS_SECURITY=y CONFIG_EXT2_FS_XIP=y CONFIG_FS_XIP=y CONFIG_EXT3_FS=m CONFIG_EXT3_FS_XATTR=y CONFIG_EXT3_FS_POSIX_ACL=y CONFIG_EXT3_FS_SECURITY=y CONFIG_JBD=m CONFIG_FS_MBCACHE=y CONFIG_REISERFS_FS=m CONFIG_REISERFS_PROC_INFO=y CONFIG_REISERFS_FS_XATTR=y CONFIG_REISERFS_FS_POSIX_ACL=y CONFIG_REISERFS_FS_SECURITY=y CONFIG_JFS_FS=m CONFIG_JFS_POSIX_ACL=y CONFIG_JFS_SECURITY=y CONFIG_FS_POSIX_ACL=y CONFIG_XFS_FS=m CONFIG_XFS_QUOTA=y CONFIG_XFS_SECURITY=y CONFIG_XFS_POSIX_ACL=y CONFIG_OCFS2_FS=m CONFIG_ROMFS_FS=m CONFIG_INOTIFY=y CONFIG_INOTIFY_USER=y CONFIG_QUOTA=y CONFIG_QFMT_V2=y CONFIG_QUOTACTL=y CONFIG_DNOTIFY=y CONFIG_AUTOFS_FS=m CONFIG_AUTOFS4_FS=m CONFIG_FUSE_FS=m CONFIG_ISO9660_FS=y CONFIG_JOLIET=y CONFIG_ZISOFS=y CONFIG_UDF_FS=m CONFIG_UDF_NLS=y CONFIG_FAT_FS=m CONFIG_MSDOS_FS=m 
CONFIG_VFAT_FS=m CONFIG_NTFS_FS=m CONFIG_NTFS_RW=y CONFIG_PROC_FS=y CONFIG_PROC_KCORE=y CONFIG_PROC_SYSCTL=y CONFIG_SYSFS=y CONFIG_TMPFS=y CONFIG_HUGETLBFS=y CONFIG_HUGETLB_PAGE=y CONFIG_RAMFS=y CONFIG_CONFIGFS_FS=m CONFIG_UFS_FS=m CONFIG_NFS_FS=m CONFIG_NFS_V3=y CONFIG_NFS_V3_ACL=y CONFIG_NFS_V4=y CONFIG_NFS_DIRECTIO=y CONFIG_NFSD=m CONFIG_NFSD_V2_ACL=y CONFIG_NFSD_V3=y CONFIG_NFSD_V3_ACL=y CONFIG_NFSD_V4=y CONFIG_NFSD_TCP=y CONFIG_LOCKD=m CONFIG_LOCKD_V4=y CONFIG_EXPORTFS=m CONFIG_NFS_ACL_SUPPORT=m CONFIG_NFS_COMMON=y CONFIG_SUNRPC=m CONFIG_SUNRPC_GSS=m CONFIG_RPCSEC_GSS_KRB5=m CONFIG_CIFS=m CONFIG_CIFS_WEAK_PW_HASH=y CONFIG_CIFS_XATTR=y CONFIG_CIFS_POSIX=y CONFIG_CODA_FS=m CONFIG_PARTITION_ADVANCED=y CONFIG_MAC_PARTITION=y CONFIG_MSDOS_PARTITION=y CONFIG_BSD_DISKLABEL=y CONFIG_MINIX_SUBPARTITION=y CONFIG_SOLARIS_X86_PARTITION=y CONFIG_UNIXWARE_DISKLABEL=y CONFIG_SGI_PARTITION=y CONFIG_SUN_PARTITION=y CONFIG_NLS=y CONFIG_NLS_CODEPAGE_437=y CONFIG_NLS_CODEPAGE_936=m CONFIG_NLS_CODEPAGE_950=m CONFIG_NLS_ASCII=y CONFIG_NLS_ISO8859_1=m CONFIG_NLS_UTF8=m CONFIG_PROFILING=y CONFIG_OPROFILE=m CONFIG_KPROBES=y CONFIG_TRACE_IRQFLAGS_SUPPORT=y CONFIG_ENABLE_MUST_CHECK=y CONFIG_MAGIC_SYSRQ=y CONFIG_DEBUG_FS=y CONFIG_DEBUG_KERNEL=y CONFIG_DETECT_SOFTLOCKUP=y CONFIG_SCHED_DEBUG=y CONFIG_SCHEDSTATS=y CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_SPINLOCK_SLEEP=y CONFIG_DEBUG_BUGVERBOSE=y CONFIG_DEBUG_INFO=y CONFIG_DEBUG_RODATA=y CONFIG_DEBUG_STACKOVERFLOW=y CONFIG_KEYS=y CONFIG_KEYS_DEBUG_PROC_KEYS=y CONFIG_SECURITY=y CONFIG_SECURITY_NETWORK=y CONFIG_SECURITY_CAPABILITIES=y CONFIG_XOR_BLOCKS=m CONFIG_ASYNC_CORE=m CONFIG_ASYNC_MEMCPY=m CONFIG_ASYNC_XOR=m CONFIG_CRYPTO=y CONFIG_CRYPTO_ALGAPI=y CONFIG_CRYPTO_BLKCIPHER=m CONFIG_CRYPTO_HASH=y CONFIG_CRYPTO_MANAGER=y CONFIG_CRYPTO_HMAC=y CONFIG_CRYPTO_NULL=m CONFIG_CRYPTO_MD4=m CONFIG_CRYPTO_MD5=m CONFIG_CRYPTO_SHA1=y CONFIG_CRYPTO_SHA256=m CONFIG_CRYPTO_SHA512=m CONFIG_CRYPTO_WP512=m CONFIG_CRYPTO_TGR192=m CONFIG_CRYPTO_ECB=m CONFIG_CRYPTO_CBC=m CONFIG_CRYPTO_PCBC=m CONFIG_CRYPTO_DES=m CONFIG_CRYPTO_BLOWFISH=m CONFIG_CRYPTO_TWOFISH=m CONFIG_CRYPTO_TWOFISH_COMMON=m CONFIG_CRYPTO_SERPENT=m CONFIG_CRYPTO_AES=m CONFIG_CRYPTO_AES_X86_64=m CONFIG_CRYPTO_CAST5=m CONFIG_CRYPTO_CAST6=m CONFIG_CRYPTO_TEA=m CONFIG_CRYPTO_ARC4=m CONFIG_CRYPTO_KHAZAD=m CONFIG_CRYPTO_ANUBIS=m CONFIG_CRYPTO_DEFLATE=m CONFIG_CRYPTO_MICHAEL_MIC=m CONFIG_CRYPTO_CRC32C=y CONFIG_CRYPTO_HW=y CONFIG_BITREVERSE=y CONFIG_CRC_CCITT=m CONFIG_CRC16=m CONFIG_CRC32=y CONFIG_LIBCRC32C=y CONFIG_ZLIB_INFLATE=y CONFIG_ZLIB_DEFLATE=m CONFIG_PLIST=y CONFIG_HAS_IOMEM=y CONFIG_HAS_IOPORT=y CONFIG_HAS_DMA=y
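(The actual benchmark is the qmt package linked later in this thread. Purely as an illustration of the kind of measurement being described - every thread pinned to its own core with pthread_setaffinity_np, all threads exercising one shared pthread spin lock - a minimal sketch could look like the program below. The thread count, loop count and per-pair metric are arbitrary choices for the sketch, not qmt's actual parameters, and this is not how qmt computes its "overhead" numbers.)

/* spin_overhead.c -- NOT the qmt benchmark; a minimal sketch of the
 * kind of measurement described above: N threads, each pinned to its
 * own core, timing repeated acquire/release of one shared pthread
 * spin lock.
 * Build: gcc -O2 -pthread spin_overhead.c -o spin_overhead
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define MAXTHREADS 64
#define NLOOPS     100000

static pthread_spinlock_t lock;

static double now_usec(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1e6 + tv.tv_usec;
}

static void *worker(void *arg)
{
	long core = (long)arg;
	cpu_set_t set;
	double t0, t1;
	int i;

	/* Pin this thread to one core, as the report does with
	 * pthread_setaffinity_np. */
	CPU_ZERO(&set);
	CPU_SET(core, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	t0 = now_usec();
	for (i = 0; i < NLOOPS; i++) {
		pthread_spin_lock(&lock);
		pthread_spin_unlock(&lock);
	}
	t1 = now_usec();
	printf("core %ld: %.4f usec per lock/unlock pair\n",
	       core, (t1 - t0) / NLOOPS);
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 2;
	pthread_t tid[MAXTHREADS];
	long i;

	if (nthreads < 1 || nthreads > MAXTHREADS)
		nthreads = 2;
	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	for (i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	return 0;
}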
* Re: Possible bug from kernel 2.6.22 and above
  2007-11-21 20:34 Possible bug from kernel 2.6.22 and above Jie Chen
@ 2007-11-21 22:14 ` Eric Dumazet
  2007-11-22  1:52   ` Jie Chen
  2007-12-05 20:36 ` Possible bug from kernel 2.6.22 and above Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread

From: Eric Dumazet @ 2007-11-21 22:14 UTC (permalink / raw)
To: Jie Chen; +Cc: linux-kernel

Jie Chen wrote:
> Hi, there:
>
> We have a simple pthread program that measures the synchronization
> overheads of various synchronization mechanisms such as spin locks,
> barriers (the barrier is implemented using a queue-based barrier
> algorithm) and so on. We have dual quad-core AMD Opteron (Barcelona)
> clusters running the 2.6.23.8 kernel at this moment, using the Fedora
> Core 7 distribution. Before we moved to this kernel, we ran kernel
> 2.6.21. The two kernels are configured identically and compiled with
> the same gcc 4.1.2 compiler. Under the old kernel, we observed that
> the synchronization overhead increases as the number of threads
> increases from 2 to 8. The following are the values of total time and
> overhead for all threads acquiring a pthread spin lock and for all
> threads executing a barrier synchronization call.

Could you post the source of your test program?

Spinlocks are ... spinning, and should not call the Linux scheduler,
so I have no idea why a kernel change could modify your results.

Also I suspect you'll have better results with Fedora Core 8 (since
glibc was updated to use private futexes in v2.7), at least for the
barrier ops.

> Kernel 2.6.21
> Number of Threads                   2          4          6          8
> SpinLock (Time micro second)   10.5618   10.58538    10.5915     10.643
>          (Overhead)              0.073    0.05746   0.102805   0.154563
> Barrier  (Time micro second) 11.020410  11.678125    11.9889   12.38002
>          (Overhead)           0.531660     1.1502   1.500112   1.891617
>
> Each thread is bound to a particular core using pthread_setaffinity_np.
>
> Kernel 2.6.23.8
> Number of Threads                   2          4          6          8
> SpinLock (Time micro second) 14.849915  17.117603    14.4496    10.5990
>          (Overhead)           4.345417   6.617207   3.949435   0.110985
> Barrier  (Time micro second) 19.462255  20.285117   16.19395   12.37662
>          (Overhead)           8.957755   9.784722   5.699590   1.869518
>
> It is clear that the synchronization overhead increases as the number
> of threads increases in kernel 2.6.21. But the synchronization
> overhead actually decreases as the number of threads increases in
> kernel 2.6.23.8 (we observed the same behavior on kernel 2.6.22 as
> well). This certainly is not correct behavior. The kernels are
> configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
> CONFIG_PREEMPT_NONE and CONFIG_DISCONTIGMEM set. The complete kernel
> configuration file is in the attachment of this e-mail.
>
> From what we have read, a new scheduler (CFS) appeared as of 2.6.22.
> We are not sure whether the above behavior is caused by the new
> scheduler.
>
> Finally, our machine's cpu information is listed in the following:
>
> processor       : 0
> vendor_id       : AuthenticAMD
> cpu family      : 16
> model           : 2
> model name      : Quad-Core AMD Opteron(tm) Processor 2347
> stepping        : 10
> cpu MHz         : 1909.801
> cache size      : 512 KB
> physical id     : 0
> siblings        : 4
> core id         : 0
> cpu cores       : 4
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 5
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
> fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good pni
> cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm
> sse4a misalignsse 3dnowprefetch osvw
> bogomips        : 3822.95
> TLB size        : 1024 4K pages
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate
>
> In addition, we have the schedstat and sched_debug files in the /proc
> directory.
>
> Thank you for all your help in solving this puzzle. If you need more
> information, please let us know.
>
> P.S. I'd like to be cc'ed on the discussions related to this problem.
>
> ###############################################
> Jie Chen
> Scientific Computing Group
> Thomas Jefferson National Accelerator Facility
> 12000, Jefferson Ave.
> Newport News, VA 23606
>
> (757)269-5046 (office) (757)269-6248 (fax)
> chen@jlab.org
> ###############################################
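(Background on the private-futex remark: starting around 2.6.22 the kernel accepts FUTEX_PRIVATE_FLAG, which tells it a futex is process-private, so the hash key can be computed from the virtual address alone instead of resolving the backing mapping. glibc 2.7 passes this flag for process-private pthread objects. Below is a raw-syscall sketch of the flag; the wrapper names are invented for the example, and this is not how glibc itself is structured.)

/* futex_private.c -- illustration only.
 * Build: gcc -O2 futex_private.c -o futex_private
 */
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#ifndef FUTEX_WAIT
#define FUTEX_WAIT 0
#endif
#ifndef FUTEX_WAKE
#define FUTEX_WAKE 1
#endif
#ifndef FUTEX_PRIVATE_FLAG
#define FUTEX_PRIVATE_FLAG 128
#endif

static int futex_word = 0;

/* Sleep while *addr == expected, using the private-futex fast path. */
static long futex_wait_private(int *addr, int expected)
{
	return syscall(SYS_futex, addr, FUTEX_WAIT | FUTEX_PRIVATE_FLAG,
		       expected, NULL, NULL, 0);
}

/* Wake up to nwake waiters sleeping on addr. */
static long futex_wake_private(int *addr, int nwake)
{
	return syscall(SYS_futex, addr, FUTEX_WAKE | FUTEX_PRIVATE_FLAG,
		       nwake, NULL, NULL, 0);
}

int main(void)
{
	long ret;

	/* futex_word is 0, not 1, so the kernel returns immediately
	 * (EWOULDBLOCK) instead of putting us to sleep. */
	ret = futex_wait_private(&futex_word, 1);
	printf("wait returned %ld (%s)\n", ret, strerror(errno));

	/* Nobody is sleeping on the word, so zero waiters are woken. */
	ret = futex_wake_private(&futex_word, 1);
	printf("wake woke %ld waiters\n", ret);
	return 0;
}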
* Re: Possible bug from kernel 2.6.22 and above
  2007-11-21 22:14 ` Eric Dumazet
@ 2007-11-22  1:52   ` Jie Chen
  2007-11-22  2:32     ` Simon Holm Thøgersen
  0 siblings, 1 reply; 35+ messages in thread

From: Jie Chen @ 2007-11-22 1:52 UTC (permalink / raw)
To: Eric Dumazet; +Cc: linux-kernel

Eric Dumazet wrote:
> Jie Chen wrote:
>> Hi, there:
>>
>> We have a simple pthread program that measures the synchronization
>> overheads of various synchronization mechanisms such as spin locks,
>> barriers (the barrier is implemented using a queue-based barrier
>> algorithm) and so on. We have dual quad-core AMD Opteron (Barcelona)
>> clusters running the 2.6.23.8 kernel at this moment, using the Fedora
>> Core 7 distribution. Before we moved to this kernel, we ran kernel
>> 2.6.21. The two kernels are configured identically and compiled with
>> the same gcc 4.1.2 compiler. Under the old kernel, we observed that
>> the synchronization overhead increases as the number of threads
>> increases from 2 to 8. The following are the values of total time and
>> overhead for all threads acquiring a pthread spin lock and for all
>> threads executing a barrier synchronization call.
>
> Could you post the source of your test program?
>

Hi, Eric:

Thank you for the quick response. You can get the source code
containing the test code from ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz .
This is a data-parallel threading package for physics calculations.
The test code is pthread_sync in the src directory once you unpack the
gz file. Configuring and building this package is very simple:
configure and make. The test program is built by make check. The
number of threads is controlled by QMT_NUM_THREADS. The package uses
pthread spin locks, but the barrier is implemented using a queue-based
barrier algorithm proposed by J. B. Carter of the University of Utah
(2005).

> Spinlocks are ... spinning, and should not call the Linux scheduler,
> so I have no idea why a kernel change could modify your results.
>
> Also I suspect you'll have better results with Fedora Core 8 (since
> glibc was updated to use private futexes in v2.7), at least for the
> barrier ops.
>

I am not sure what the biggest change between kernels 2.6.21 and 2.6.22
(23) is. Is the scheduler the biggest change between these versions?
Can the kernel's scheduler somehow affect the performance? I know the
scheduler tries to do load balancing and so on. Can the scheduler move
threads to different cores according to the load balancing algorithm,
even though the threads are bound to cores using the
pthread_setaffinity_np call, when the number of threads is fewer than
the number of cores? I am wondering about this because the performance
of our test code is roughly the same for both kernels when the number
of threads equals the number of cores.

>> Kernel 2.6.21
>> Number of Threads                   2          4          6          8
>> SpinLock (Time micro second)   10.5618   10.58538    10.5915     10.643
>>          (Overhead)              0.073    0.05746   0.102805   0.154563
>> Barrier  (Time micro second) 11.020410  11.678125    11.9889   12.38002
>>          (Overhead)           0.531660     1.1502   1.500112   1.891617
>>
>> Each thread is bound to a particular core using pthread_setaffinity_np.
>>
>> Kernel 2.6.23.8
>> Number of Threads                   2          4          6          8
>> SpinLock (Time micro second) 14.849915  17.117603    14.4496    10.5990
>>          (Overhead)           4.345417   6.617207   3.949435   0.110985
>> Barrier  (Time micro second) 19.462255  20.285117   16.19395   12.37662
>>          (Overhead)           8.957755   9.784722   5.699590   1.869518
>>
>> It is clear that the synchronization overhead increases as the number
>> of threads increases in kernel 2.6.21. But the synchronization
>> overhead actually decreases as the number of threads increases in
>> kernel 2.6.23.8 (we observed the same behavior on kernel 2.6.22 as
>> well). This certainly is not correct behavior. The kernels are
>> configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
>> CONFIG_PREEMPT_NONE and CONFIG_DISCONTIGMEM set. The complete kernel
>> configuration file is in the attachment of this e-mail.
>>
>> From what we have read, a new scheduler (CFS) appeared as of 2.6.22.
>> We are not sure whether the above behavior is caused by the new
>> scheduler.
>>
>> Finally, our machine's cpu information is listed in the following:
>>
>> processor       : 0
>> vendor_id       : AuthenticAMD
>> cpu family      : 16
>> model           : 2
>> model name      : Quad-Core AMD Opteron(tm) Processor 2347
>> stepping        : 10
>> cpu MHz         : 1909.801
>> cache size      : 512 KB
>> physical id     : 0
>> siblings        : 4
>> core id         : 0
>> cpu cores       : 4
>> fpu             : yes
>> fpu_exception   : yes
>> cpuid level     : 5
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
>> fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good pni
>> cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm
>> sse4a misalignsse 3dnowprefetch osvw
>> bogomips        : 3822.95
>> TLB size        : 1024 4K pages
>> clflush size    : 64
>> cache_alignment : 64
>> address sizes   : 48 bits physical, 48 bits virtual
>> power management: ts ttp tm stc 100mhzsteps hwpstate
>>
>> In addition, we have the schedstat and sched_debug files in the /proc
>> directory.
>>
>> Thank you for all your help in solving this puzzle. If you need more
>> information, please let us know.
>>
>> P.S. I'd like to be cc'ed on the discussions related to this problem.
>>

Thank you for your help, and happy Thanksgiving!

-- 
#############################################################################
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# chen@jlab.org
# (757)269-5046 (office)
# (757)269-6248 (fax)
#############################################################################
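(The Carter queue-based barrier itself lives in the qmt source and is not reproduced in this thread. As a simplified sketch of the same family of designs - each waiter spins on a flag in its own cache line instead of on one shared counter - something like the following. This is NOT qmt's actual algorithm; the thread count and the assumed 64-byte cache line are arbitrary, and a real implementation would insert a cpu pause in the spin loops, as the burner.c program later in the thread does.)

/* flat_barrier.c -- illustration only.
 * Build: gcc -O2 -pthread flat_barrier.c -o flat_barrier
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* One flag per (assumed 64-byte) cache line to avoid false sharing. */
struct flag {
	volatile int v;
	char pad[60];
};
static struct flag arrived[NTHREADS], go[NTHREADS];

static void barrier(long me)
{
	int i;

	if (me == 0) {
		/* Wait for every other thread to arrive ... */
		for (i = 1; i < NTHREADS; i++) {
			while (!arrived[i].v)
				;
			arrived[i].v = 0;	/* consume the arrival */
		}
		__sync_synchronize();
		/* ... then release each waiter individually. */
		for (i = 1; i < NTHREADS; i++)
			go[i].v = 1;
	} else {
		__sync_synchronize();
		arrived[me].v = 1;
		while (!go[me].v)		/* spin only on our own line */
			;
		go[me].v = 0;
	}
}

static void *worker(void *arg)
{
	long me = (long)arg;
	int round;

	for (round = 0; round < 3; round++) {
		barrier(me);
		printf("thread %ld passed round %d\n", me, round);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	long i;

	for (i = 1; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	worker((void *)0);	/* thread 0 participates from main */
	for (i = 1; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}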
* Re: Possible bug from kernel 2.6.22 and above
  2007-11-22  1:52 ` Jie Chen
@ 2007-11-22  2:32   ` Simon Holm Thøgersen
  2007-11-22  2:58     ` Jie Chen
  0 siblings, 1 reply; 35+ messages in thread

From: Simon Holm Thøgersen @ 2007-11-22 2:32 UTC (permalink / raw)
To: Jie Chen; +Cc: Eric Dumazet, linux-kernel

On Wed, 21 Nov 2007 at 20:52 -0500, Jie Chen wrote:
> Eric Dumazet wrote:
> > Jie Chen wrote:
> >> Hi, there:
> >>
> >> We have a simple pthread program that measures the synchronization
> >> overheads of various synchronization mechanisms such as spin locks,
> >> barriers (the barrier is implemented using a queue-based barrier
> >> algorithm) and so on. We have dual quad-core AMD Opteron (Barcelona)
> >> clusters running the 2.6.23.8 kernel at this moment, using the
> >> Fedora Core 7 distribution. Before we moved to this kernel, we ran
> >> kernel 2.6.21. The two kernels are configured identically and
> >> compiled with the same gcc 4.1.2 compiler. Under the old kernel, we
> >> observed that the synchronization overhead increases as the number
> >> of threads increases from 2 to 8. The following are the values of
> >> total time and overhead for all threads acquiring a pthread spin
> >> lock and for all threads executing a barrier synchronization call.
> >
> > Could you post the source of your test program?
> >
>
> Hi, Eric:
>
> Thank you for the quick response. You can get the source code
> containing the test code from ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz .
> This is a data-parallel threading package for physics calculations.
> The test code is pthread_sync in the src directory once you unpack
> the gz file. Configuring and building this package is very simple:
> configure and make. The test program is built by make check. The
> number of threads is controlled by QMT_NUM_THREADS. The package uses
> pthread spin locks, but the barrier is implemented using a
> queue-based barrier algorithm proposed by J. B. Carter of the
> University of Utah (2005).
>
> > Spinlocks are ... spinning, and should not call the Linux
> > scheduler, so I have no idea why a kernel change could modify your
> > results.
> >
> > Also I suspect you'll have better results with Fedora Core 8 (since
> > glibc was updated to use private futexes in v2.7), at least for the
> > barrier ops.
> >
>
> I am not sure what the biggest change between kernels 2.6.21 and
> 2.6.22 (23) is. Is the scheduler the biggest change between these
> versions? Can the kernel's scheduler somehow affect the performance?
> I know the scheduler tries to do load balancing and so on. Can the
> scheduler move threads to different cores according to the load
> balancing algorithm, even though the threads are bound to cores using
> the pthread_setaffinity_np call, when the number of threads is fewer
> than the number of cores? I am wondering about this because the
> performance of our test code is roughly the same for both kernels
> when the number of threads equals the number of cores.
>

There is a backport of the CFS scheduler to 2.6.21, see
http://lkml.org/lkml/2007/11/19/127

> >> Kernel 2.6.21
> >> Number of Threads                   2          4          6          8
> >> SpinLock (Time micro second)   10.5618   10.58538    10.5915     10.643
> >>          (Overhead)              0.073    0.05746   0.102805   0.154563
> >> Barrier  (Time micro second) 11.020410  11.678125    11.9889   12.38002
> >>          (Overhead)           0.531660     1.1502   1.500112   1.891617
> >>
> >> Each thread is bound to a particular core using pthread_setaffinity_np.
> >>
> >> Kernel 2.6.23.8
> >> Number of Threads                   2          4          6          8
> >> SpinLock (Time micro second) 14.849915  17.117603    14.4496    10.5990
> >>          (Overhead)           4.345417   6.617207   3.949435   0.110985
> >> Barrier  (Time micro second) 19.462255  20.285117   16.19395   12.37662
> >>          (Overhead)           8.957755   9.784722   5.699590   1.869518
> >>
> >> It is clear that the synchronization overhead increases as the
> >> number of threads increases in kernel 2.6.21. But the
> >> synchronization overhead actually decreases as the number of
> >> threads increases in kernel 2.6.23.8 (we observed the same behavior
> >> on kernel 2.6.22 as well). This certainly is not correct behavior.
> >> The kernels are configured with CONFIG_SMP, CONFIG_NUMA,
> >> CONFIG_SCHED_MC, CONFIG_PREEMPT_NONE and CONFIG_DISCONTIGMEM set.
> >> The complete kernel configuration file is in the attachment of this
> >> e-mail.
> >>
> >> From what we have read, a new scheduler (CFS) appeared as of
> >> 2.6.22. We are not sure whether the above behavior is caused by the
> >> new scheduler.
> >>
> >> Finally, our machine's cpu information is listed in the following:
> >>
> >> processor       : 0
> >> vendor_id       : AuthenticAMD
> >> cpu family      : 16
> >> model           : 2
> >> model name      : Quad-Core AMD Opteron(tm) Processor 2347
> >> stepping        : 10
> >> cpu MHz         : 1909.801
> >> cache size      : 512 KB
> >> physical id     : 0
> >> siblings        : 4
> >> core id         : 0
> >> cpu cores       : 4
> >> fpu             : yes
> >> fpu_exception   : yes
> >> cpuid level     : 5
> >> wp              : yes
> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> >> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx
> >> mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc
> >> rep_good pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy
> >> altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
> >> bogomips        : 3822.95
> >> TLB size        : 1024 4K pages
> >> clflush size    : 64
> >> cache_alignment : 64
> >> address sizes   : 48 bits physical, 48 bits virtual
> >> power management: ts ttp tm stc 100mhzsteps hwpstate
> >>
> >> In addition, we have the schedstat and sched_debug files in the
> >> /proc directory.
> >>
> >> Thank you for all your help in solving this puzzle. If you need
> >> more information, please let us know.
> >>
> >> P.S. I'd like to be cc'ed on the discussions related to this
> >> problem.
> >>
>
> Thank you for your help, and happy Thanksgiving!

Simon Holm Thøgersen
* Re: Possible bug from kernel 2.6.22 and above
  2007-11-22  2:32 ` Simon Holm Thøgersen
@ 2007-11-22  2:58   ` Jie Chen
  2007-11-22 20:19     ` Matt Mackall
  2007-12-04 13:17     ` Possible bug from kernel 2.6.22 and above, 2.6.24-rc4 Ingo Molnar
  0 siblings, 2 replies; 35+ messages in thread

From: Jie Chen @ 2007-11-22 2:58 UTC (permalink / raw)
To: Simon Holm Thøgersen; +Cc: Eric Dumazet, linux-kernel

Simon Holm Thøgersen wrote:
> On Wed, 21 Nov 2007 at 20:52 -0500, Jie Chen wrote:
>
> There is a backport of the CFS scheduler to 2.6.21, see
> http://lkml.org/lkml/2007/11/19/127
>

Hi, Simon:

I will try that after the thanksgiving holiday to find out whether the
odd behavior will show up using 2.6.21 with the back-ported CFS.

>>>> Kernel 2.6.21
>>>> Number of Threads                   2          4          6          8
>>>> SpinLock (Time micro second)   10.5618   10.58538    10.5915     10.643
>>>>          (Overhead)              0.073    0.05746   0.102805   0.154563
>>>> Barrier  (Time micro second) 11.020410  11.678125    11.9889   12.38002
>>>>          (Overhead)           0.531660     1.1502   1.500112   1.891617
>>>>
>>>> Each thread is bound to a particular core using pthread_setaffinity_np.
>>>>
>>>> Kernel 2.6.23.8
>>>> Number of Threads                   2          4          6          8
>>>> SpinLock (Time micro second) 14.849915  17.117603    14.4496    10.5990
>>>>          (Overhead)           4.345417   6.617207   3.949435   0.110985
>>>> Barrier  (Time micro second) 19.462255  20.285117   16.19395   12.37662
>>>>          (Overhead)           8.957755   9.784722   5.699590   1.869518
>>>>
>
> Simon Holm Thøgersen
>

I just ran a simple test to show that the problem may be related to the
scheduler's load balancing. I first started 6 processes using "taskset
-c 2 donothing &", "taskset -c 3 donothing &", ..., "taskset -c 7
donothing &". These 6 processes run on cores 2 to 7. Then I started my
test program using two threads bound to cores 0 and 1. Here is the
result:

Two threads on Kernel 2.6.23.8:
SpinLock (Time micro second) 10.558255
         (Overhead)           0.068965
Barrier  (Time micro second) 10.865520
         (Overhead)           0.376230

Similarly, I started 4 donothing processes on cores 4, 5, 6 and 7, and
ran the test program. I got the following result:

Four threads on Kernel 2.6.23.8:
SpinLock (Time micro second) 10.579413
         (Overhead)           0.090023
Barrier  (Time micro second) 11.363193
         (Overhead)           0.873803

Finally, here is the result for 6 threads with two donothing processes
running on cores 6 and 7:

Six threads on Kernel 2.6.23.8:
SpinLock (Time micro second) 10.590030
         (Overhead)           0.100940
Barrier  (Time micro second) 11.977548
         (Overhead)           1.488458

Now the above results are very similar to the results obtained under
kernel 2.6.21. I hope this helps you guys in some way. Thank you.

-- 
#############################################################################
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# chen@jlab.org
# (757)269-5046 (office)
# (757)269-6248 (fax)
#############################################################################
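(The donothing helper is not shown anywhere in the thread; presumably it is nothing more than an infinite busy loop, since taskset already does the core pinning. A stand-in could be as small as the program below - an assumption, since the real helper was never posted.)

/* donothing.c -- a guess at the trivial load generator used above.
 * Build and run:
 *     gcc -O2 -o donothing donothing.c
 *     taskset -c 2 ./donothing &
 */
int main(void)
{
	volatile unsigned long counter = 0;

	for (;;)
		counter++;	/* spin forever, keeping one core busy */
	return 0;		/* never reached */
}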
* Re: Possible bug from kernel 2.6.22 and above
  2007-11-22  2:58 ` Jie Chen
@ 2007-11-22 20:19   ` Matt Mackall
  0 siblings, 0 replies; 35+ messages in thread

From: Matt Mackall @ 2007-11-22 20:19 UTC (permalink / raw)
To: Jie Chen; +Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Ingo Molnar

On Wed, Nov 21, 2007 at 09:58:10PM -0500, Jie Chen wrote:
> Simon Holm Thøgersen wrote:
> > On Wed, 21 Nov 2007 at 20:52 -0500, Jie Chen wrote:
> > There is a backport of the CFS scheduler to 2.6.21, see
> > http://lkml.org/lkml/2007/11/19/127
> >
> Hi, Simon:
>
> I will try that after the thanksgiving holiday to find out whether
> the odd behavior will show up using 2.6.21 with the back-ported CFS.
>
> >>>> Kernel 2.6.21
> >>>> Number of Threads                   2          4          6          8
> >>>> SpinLock (Time micro second)   10.5618   10.58538    10.5915     10.643
> >>>>          (Overhead)              0.073    0.05746   0.102805   0.154563
> >>>> Barrier  (Time micro second) 11.020410  11.678125    11.9889   12.38002
> >>>>          (Overhead)           0.531660     1.1502   1.500112   1.891617
> >>>>
> >>>> Each thread is bound to a particular core using pthread_setaffinity_np.
> >>>>
> >>>> Kernel 2.6.23.8
> >>>> Number of Threads                   2          4          6          8
> >>>> SpinLock (Time micro second) 14.849915  17.117603    14.4496    10.5990
> >>>>          (Overhead)           4.345417   6.617207   3.949435   0.110985
> >>>> Barrier  (Time micro second) 19.462255  20.285117   16.19395   12.37662
> >>>>          (Overhead)           8.957755   9.784722   5.699590   1.869518
> >>>>
> >
> > Simon Holm Thøgersen
> >
>
> I just ran a simple test to show that the problem may be related to
> the scheduler's load balancing. I first started 6 processes using
> "taskset -c 2 donothing &", "taskset -c 3 donothing &", ...,
> "taskset -c 7 donothing &". These 6 processes run on cores 2 to 7.
> Then I started my test program using two threads bound to cores 0
> and 1. Here is the result:
>
> Two threads on Kernel 2.6.23.8:
> SpinLock (Time micro second) 10.558255
>          (Overhead)           0.068965
> Barrier  (Time micro second) 10.865520
>          (Overhead)           0.376230
>
> Similarly, I started 4 donothing processes on cores 4, 5, 6 and 7,
> and ran the test program. I got the following result:
>
> Four threads on Kernel 2.6.23.8:
> SpinLock (Time micro second) 10.579413
>          (Overhead)           0.090023
> Barrier  (Time micro second) 11.363193
>          (Overhead)           0.873803
>
> Finally, here is the result for 6 threads with two donothing
> processes running on cores 6 and 7:
>
> Six threads on Kernel 2.6.23.8:
> SpinLock (Time micro second) 10.590030
>          (Overhead)           0.100940
> Barrier  (Time micro second) 11.977548
>          (Overhead)           1.488458
>
> Now the above results are very similar to the results obtained under
> kernel 2.6.21. I hope this helps you guys in some way. Thank you.

Yes, this really does look like a scheduling regression. I've added
Ingo to the cc: list.

Next time you should pick a more descriptive subject line - we've got
lots of email about possible bugs.

-- 
Mathematics is the supreme nostalgia of our time.
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-11-22  2:58 ` Possible bug from kernel 2.6.22 and above Jie Chen
  2007-11-22 20:19   ` Matt Mackall
@ 2007-12-04 13:17   ` Possible bug from kernel 2.6.22 and above, 2.6.24-rc4 Ingo Molnar
  2007-12-04 15:41     ` Jie Chen
  2007-12-05 15:29     ` Jie Chen
  1 sibling, 2 replies; 35+ messages in thread

From: Ingo Molnar @ 2007-12-04 13:17 UTC (permalink / raw)
To: Jie Chen; +Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel

* Jie Chen <chen@jlab.org> wrote:

> Simon Holm Thøgersen wrote:
>> On Wed, 21 Nov 2007 at 20:52 -0500, Jie Chen wrote:
>> There is a backport of the CFS scheduler to 2.6.21, see
>> http://lkml.org/lkml/2007/11/19/127
>>
> Hi, Simon:
>
> I will try that after the thanksgiving holiday to find out whether
> the odd behavior will show up using 2.6.21 with the back-ported CFS.

it would also be nice to test this with 2.6.24-rc4.

	Ingo
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-04 13:17 ` Possible bug from kernel 2.6.22 and above, 2.6.24-rc4 Ingo Molnar
@ 2007-12-04 15:41   ` Jie Chen
  2007-12-05 15:29   ` Jie Chen
  1 sibling, 0 replies; 35+ messages in thread

From: Jie Chen @ 2007-12-04 15:41 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
>
>> Simon Holm Thøgersen wrote:
>>> On Wed, 21 Nov 2007 at 20:52 -0500, Jie Chen wrote:
>>> There is a backport of the CFS scheduler to 2.6.21, see
>>> http://lkml.org/lkml/2007/11/19/127
>>>
>> Hi, Simon:
>>
>> I will try that after the thanksgiving holiday to find out whether
>> the odd behavior will show up using 2.6.21 with the back-ported CFS.
>
> it would also be nice to test this with 2.6.24-rc4.
>
> 	Ingo

Hi, Ingo:

I will test 2.6.24-rc4 this week and let you know the result. Thanks.

-- 
#############################################################################
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# chen@jlab.org
# (757)269-5046 (office)
# (757)269-6248 (fax)
#############################################################################
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-04 13:17 ` Possible bug from kernel 2.6.22 and above, 2.6.24-rc4 Ingo Molnar
  2007-12-04 15:41   ` Jie Chen
@ 2007-12-05 15:29   ` Jie Chen
  2007-12-05 15:40     ` Ingo Molnar
  1 sibling, 1 reply; 35+ messages in thread

From: Jie Chen @ 2007-12-05 15:29 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
>
>> Simon Holm Thøgersen wrote:
>>> On Wed, 21 Nov 2007 at 20:52 -0500, Jie Chen wrote:
>>> There is a backport of the CFS scheduler to 2.6.21, see
>>> http://lkml.org/lkml/2007/11/19/127
>>>
>> Hi, Simon:
>>
>> I will try that after the thanksgiving holiday to find out whether
>> the odd behavior will show up using 2.6.21 with the back-ported CFS.
>
> it would also be nice to test this with 2.6.24-rc4.
>
> 	Ingo

Hi, Ingo:

I just ran the same test on two 2.6.24-rc4 kernels: one with
CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED
off. The odd behavior I described in my previous e-mails was still
there for both kernels. Let me know if I can be any more help. Thank
you.

-- 
###############################################
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606

(757)269-5046 (office) (757)269-6248 (fax)
chen@jlab.org
###############################################
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 15:29 ` Jie Chen
@ 2007-12-05 15:40   ` Ingo Molnar
  2007-12-05 16:16     ` Eric Dumazet
  2007-12-05 16:22     ` Jie Chen
  0 siblings, 2 replies; 35+ messages in thread

From: Ingo Molnar @ 2007-12-05 15:40 UTC (permalink / raw)
To: Jie Chen
Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

* Jie Chen <chen@jlab.org> wrote:

> I just ran the same test on two 2.6.24-rc4 kernels: one with
> CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED
> off. The odd behavior I described in my previous e-mails was still
> there for both kernels. Let me know if I can be any more help. Thank
> you.

ok, i had a look at your data, and i think this is the result of the
scheduler balancing out to idle CPUs more aggressively than before.
Doing that is almost always a good idea though - but indeed it can
result in "bad" numbers if all you do is to measure the ping-pong
"performance" between two threads (with no real work done by either of
them).

the moment you saturate the system a bit more, the numbers should
improve even with such a ping-pong test.

do you have testcode (or a modification of your testcase sourcecode)
that simulates a real-life situation where 2.6.24-rc4 performs not as
well as you'd like it to? (or if qmt.tar.gz already contains that,
please point me towards that portion of the test and how i should run
it - thanks!)

	Ingo
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 15:40 ` Ingo Molnar
@ 2007-12-05 16:16   ` Eric Dumazet
  2007-12-05 16:25     ` Ingo Molnar
  1 sibling, 1 reply; 35+ messages in thread

From: Eric Dumazet @ 2007-12-05 16:16 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jie Chen, Simon Holm Thøgersen, linux-kernel, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 1873 bytes --]

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
>
>> I just ran the same test on two 2.6.24-rc4 kernels: one with
>> CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED
>> off. The odd behavior I described in my previous e-mails was still
>> there for both kernels. Let me know if I can be any more help. Thank
>> you.
>
> ok, i had a look at your data, and i think this is the result of the
> scheduler balancing out to idle CPUs more aggressively than before.
> Doing that is almost always a good idea though - but indeed it can
> result in "bad" numbers if all you do is to measure the ping-pong
> "performance" between two threads (with no real work done by either
> of them).
>
> the moment you saturate the system a bit more, the numbers should
> improve even with such a ping-pong test.
>
> do you have testcode (or a modification of your testcase sourcecode)
> that simulates a real-life situation where 2.6.24-rc4 performs not as
> well as you'd like it to? (or if qmt.tar.gz already contains that,
> please point me towards that portion of the test and how i should run
> it - thanks!)
>
> Ingo

I cooked up a program shorter than Jie's, to try to understand what is
going on. It's a pure cpu burner program, with no thread
synchronization (except for the pthread_join at the very end).

As each thread is bound to a given cpu, I am not sure the scheduler is
allowed to balance to an idle cpu. Unfortunately I don't have a 4-way
SMP idle machine available to test it.

$ gcc -O2 -o burner burner.c
$ ./burner
Time to perform the unit of work on one thread is 0.040328 s
Time to perform the unit of work on 2 threads is 0.040221 s

I tried it on a 64-way machine (thanks David :)) and noticed some
strange results that may be related to the Niagara hardware (the time
for 64 threads was nearly double the time for one thread).

[-- Attachment #2: burner.c --]
[-- Type: text/plain, Size: 2301 bytes --]

#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>	/* for memcmp(); missing in the original */
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

int blockthemall = 1;

static inline void cpupause(void)
{
#if defined(i386)
	/* "rep;nop" is the x86 pause hint; the "memory" clobber also
	 * forces blockthemall to be reloaded on each iteration. */
	asm volatile("rep;nop" ::: "memory");
#else
	asm volatile("" ::: "memory");
#endif
}

/*
 * Determines the number of cpus.
 * Can be overridden by the NR_CPUS environment variable.
 */
int number_of_cpus(void)
{
	char line[1024], *p;
	int cnt = 0;
	FILE *F;

	p = getenv("NR_CPUS");
	if (p)
		return atoi(p);
	F = fopen("/proc/cpuinfo", "r");
	if (F == NULL) {
		perror("/proc/cpuinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), F) != NULL) {
		if (memcmp(line, "processor", 9) == 0)
			cnt++;
	}
	fclose(F);
	return cnt;
}

void compute_elapsed(struct timeval *delta, const struct timeval *t0)
{
	struct timeval t1;

	gettimeofday(&t1, NULL);
	delta->tv_sec = t1.tv_sec - t0->tv_sec;
	delta->tv_usec = t1.tv_usec - t0->tv_usec;
	if (delta->tv_usec < 0) {
		delta->tv_usec += 1000000;
		delta->tv_sec--;
	}
}

int nr_loops = 20 * 1000000;
double incr = 0.3456;

void perform_work(void)
{
	int i;
	double t = 0.0;

	for (i = 0; i < nr_loops; i++)
		t += incr;
	/* Keep the result live so the loop is not optimized away. */
	if (t < 0.0)
		printf("well... should not happen\n");
}

void set_affinity(int cpu)
{
	long cpu_mask;
	int res;

	cpu_mask = 1L << cpu;
	/* The kernel just copies in a bitmask; the cast quiets glibc's
	 * cpu_set_t prototype. */
	res = sched_setaffinity(0, sizeof(cpu_mask), (cpu_set_t *)&cpu_mask);
	if (res)
		perror("sched_setaffinity");
}

void *thread_work(void *arg)
{
	int cpu = (long)arg;	/* cast via long: pointers are 64-bit here */

	set_affinity(cpu);
	while (blockthemall)
		cpupause();
	perform_work();
	return (void *)0;
}

int main(int argc, char *argv[])
{
	struct timeval t0, delta;
	int nr_cpus, i;
	pthread_t *tids;

	gettimeofday(&t0, NULL);
	perform_work();
	compute_elapsed(&delta, &t0);
	printf("Time to perform the unit of work on one thread is %ld.%06ld s\n",
	       (long)delta.tv_sec, (long)delta.tv_usec);

	nr_cpus = number_of_cpus();
	if (nr_cpus <= 1)
		return 0;
	tids = malloc(nr_cpus * sizeof(pthread_t));
	for (i = 1; i < nr_cpus; i++)
		pthread_create(tids + i, NULL, thread_work, (void *)(long)i);

	set_affinity(0);
	gettimeofday(&t0, NULL);
	blockthemall = 0;
	perform_work();
	for (i = 1; i < nr_cpus; i++)
		pthread_join(tids[i], NULL);
	compute_elapsed(&delta, &t0);
	printf("Time to perform the unit of work on %d threads is %ld.%06ld s\n",
	       nr_cpus, (long)delta.tv_sec, (long)delta.tv_usec);
	return 0;
}
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 16:16 ` Eric Dumazet
@ 2007-12-05 16:25   ` Ingo Molnar
  2007-12-05 16:29     ` Eric Dumazet
  0 siblings, 1 reply; 35+ messages in thread

From: Ingo Molnar @ 2007-12-05 16:25 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jie Chen, Simon Holm Thøgersen, linux-kernel, Peter Zijlstra

* Eric Dumazet <dada1@cosmosbay.com> wrote:

> $ gcc -O2 -o burner burner.c
> $ ./burner
> Time to perform the unit of work on one thread is 0.040328 s
> Time to perform the unit of work on 2 threads is 0.040221 s

ok, but this actually suggests that scheduling is fine for this,
correct?

	Ingo
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
  2007-12-05 16:25 ` Ingo Molnar
@ 2007-12-05 16:29   ` Eric Dumazet
  0 siblings, 0 replies; 35+ messages in thread

From: Eric Dumazet @ 2007-12-05 16:29 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Jie Chen, Simon Holm Thøgersen, linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
>
>> $ gcc -O2 -o burner burner.c
>> $ ./burner
>> Time to perform the unit of work on one thread is 0.040328 s
>> Time to perform the unit of work on 2 threads is 0.040221 s
>
> ok, but this actually suggests that scheduling is fine for this,
> correct?
>
> 	Ingo

Yes, but this machine runs an old kernel. I was just showing you how to
run it :)
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Jie Chen @ 2007-12-05 16:22 UTC
To: Ingo Molnar
Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
>
>> I just ran the same test on two 2.6.24-rc4 kernels: one with
>> CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED
>> off. The odd behavior I described in my previous e-mails was still
>> there for both kernels. Let me know if I can be of any more help.
>> Thank you.
>
> ok, i had a look at your data, and i think this is the result of the
> scheduler balancing out to idle CPUs more aggressively than before.
> Doing that is almost always a good idea though - but indeed it can
> result in "bad" numbers if all you do is measure the ping-pong
> "performance" between two threads (with no real work done by either of
> them).

My test code does not do much real work; it measures the overhead of
various synchronization mechanisms such as barriers and locks. I am
trying to see the scalability of different implementations/algorithms
on multi-core machines.

> the moment you saturate the system a bit more, the numbers should
> improve even with such a ping-pong test.

You are right. If I manually do load balancing (binding unrelated
processes to the other cores), my test code performs as well as it did
on kernel 2.6.21.

> do you have testcode (or a modification of your testcase sourcecode)
> that simulates a real-life situation where 2.6.24-rc4 does not perform
> as well as you'd like? (or if qmt.tar.gz already contains that, then
> please point me towards that portion of the test and how i should run
> it - thanks!)

The qmt.tar.gz code contains a simple test program called pthread_sync
under the src directory. You can change the number of threads by
setting the QMT_NUM_THREADS environment variable. You can build qmt by
doing configure --enable-public-release. Since I do not have Intel
quad-core machines, I am not sure whether the behavior will show up on
Intel platforms. Our cluster is dual quad-core Opteron, which has a
hardware problem of its own :-):
http://hardware.slashdot.org/article.pl?sid=07/12/04/237248&from=rss

> 	Ingo

Hi, Ingo:

My test code qmt can be found at ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz.
There is a minor performance issue in qmt, pointed out by Eric, whose
fix I have not put into the tarball yet. If I can be of any help,
please let me know. Thank you very much.

--
Jie Chen, Scientific Computing Group, Thomas Jefferson National Accelerator Facility
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Ingo Molnar @ 2007-12-05 16:47 UTC
To: Jie Chen
Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

* Jie Chen <chen@jlab.org> wrote:

>> the moment you saturate the system a bit more, the numbers should
>> improve even with such a ping-pong test.
>
> You are right. If I manually do load balancing (binding unrelated
> processes to the other cores), my test code performs as well as it did
> on kernel 2.6.21.

so right now the results don't seem to be too bad to me - the higher
overhead comes from two threads running on two different cores and
incurring the overhead of cross-core communications. In true
spread-out workloads that synchronize occasionally you'd get the same
kind of overhead, so in fact this behavior is more informative of the
real overhead, i guess. In 2.6.21 the two threads would stick to the
same core and produce artificially low latency - which would only be
representative of a real spread-out workload if all tasks ran on the
same core (which is hardly what you want with OpenMP).

In any case, if i misinterpreted your numbers, or if you just disagree,
or if you have a workload/test that shows worse performance than it
could/should, let me know.

	Ingo
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Jie Chen @ 2007-12-05 17:47 UTC
To: Ingo Molnar
Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
>
>>> the moment you saturate the system a bit more, the numbers should
>>> improve even with such a ping-pong test.
>>
>> You are right. If I manually do load balancing (binding unrelated
>> processes to the other cores), my test code performs as well as it
>> did on kernel 2.6.21.
>
> so right now the results don't seem to be too bad to me - the higher
> overhead comes from two threads running on two different cores and
> incurring the overhead of cross-core communications. In true
> spread-out workloads that synchronize occasionally you'd get the same
> kind of overhead, so in fact this behavior is more informative of the
> real overhead, i guess. In 2.6.21 the two threads would stick to the
> same core and produce artificially low latency - which would only be
> representative of a real spread-out workload if all tasks ran on the
> same core (which is hardly what you want with OpenMP).

I use the pthread_setaffinity_np call to bind each thread to one core.
Unless kernel 2.6.21 does not honor the affinity, I do not see why
running two threads on two cores should differ between the new kernel
and the old kernel. My test code does not do any numerical calculation,
but it does spin-wait on shared/non-shared flags. The reason I am using
affinity is to test the synchronization overheads among different
cores. On both the new and the old kernel, I see 200% CPU usage when I
run my test code with two threads. Does this mean the two threads are
running on two cores? I also verify that a thread is indeed bound to a
core by using pthread_getaffinity_np.

> In any case, if i misinterpreted your numbers, or if you just
> disagree, or if you have a workload/test that shows worse performance
> than it could/should, let me know.
>
> 	Ingo

Hi, Ingo:

Since I am using the affinity flag to bind each thread to a different
core, the synchronization overhead should increase as the number of
cores/threads increases. But what we observed in the new kernel is the
opposite. The barrier overhead for two threads is 8.93 microseconds vs
1.86 microseconds for 8 threads (on the old kernel it is 0.49 vs 1.86).
This will confuse most people who study synchronization/communication
scalability. I know my test code is not a real-world computation, which
usually uses up all cores. I hope I have explained myself clearly.
Thank you very much.

--
Jie Chen, Scientific Computing Group, Thomas Jefferson National Accelerator Facility
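For reference, the per-thread binding Jie describes can be sketched as
follows. This is an illustrative sketch, not the actual qmt code;
pthread_setaffinity_np is a GNU extension and needs _GNU_SOURCE:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core; returns 0 on success. */
int bind_self_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

The mask can be read back with pthread_getaffinity_np to verify the
binding, as Jie does above.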
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Ingo Molnar @ 2007-12-05 20:03 UTC
To: Jie Chen
Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

* Jie Chen <chen@jlab.org> wrote:

> Since I am using the affinity flag to bind each thread to a different
> core, the synchronization overhead should increase as the number of
> cores/threads increases. But what we observed in the new kernel is the
> opposite. The barrier overhead for two threads is 8.93 microseconds vs
> 1.86 microseconds for 8 threads (on the old kernel it is 0.49 vs
> 1.86). This will confuse most people who study
> synchronization/communication scalability. I know my test code is not
> a real-world computation, which usually uses up all cores. I hope I
> have explained myself clearly. Thank you very much.

btw., could you try not using the affinity mask, letting the scheduler
manage the spreading of tasks? It generally has better knowledge of how
tasks interrelate.

	Ingo
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Jie Chen @ 2007-12-05 20:23 UTC
To: Ingo Molnar
Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
>
>> Since I am using the affinity flag to bind each thread to a different
>> core, the synchronization overhead should increase as the number of
>> cores/threads increases. But what we observed in the new kernel is
>> the opposite. The barrier overhead for two threads is 8.93
>> microseconds vs 1.86 microseconds for 8 threads (on the old kernel it
>> is 0.49 vs 1.86). This will confuse most people who study
>> synchronization/communication scalability. I know my test code is not
>> a real-world computation, which usually uses up all cores. I hope I
>> have explained myself clearly. Thank you very much.
>
> btw., could you try not using the affinity mask, letting the scheduler
> manage the spreading of tasks? It generally has better knowledge of
> how tasks interrelate.
>
> 	Ingo

Hi, Ingo:

I just disabled the affinity mask and reran the test. There was no
significant change for two threads (the barrier overhead is still
around 9 microseconds). For 8 threads, the barrier overhead actually
drops a little, which is good. Let me know whether I can be of any
help. Thank you very much.

--
Jie Chen, Scientific Computing Group, Thomas Jefferson National Accelerator Facility
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Ingo Molnar @ 2007-12-05 20:46 UTC
To: Jie Chen
Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

* Jie Chen <chen@jlab.org> wrote:

> I just disabled the affinity mask and reran the test. There was no
> significant change for two threads (the barrier overhead is still
> around 9 microseconds). For 8 threads, the barrier overhead actually
> drops a little, which is good. Let me know whether I can be of any
> help. Thank you very much.

sorry to be dense, but could you give me instructions on how i could
remove the affinity mask and test the "barrier overhead" myself? I have
built "pthread_sync" and it outputs numbers for me - which one would be
the barrier overhead: Reference_time_1?

	Ingo
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Jie Chen @ 2007-12-05 20:52 UTC
To: Ingo Molnar
Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
>
>> I just disabled the affinity mask and reran the test. There was no
>> significant change for two threads (the barrier overhead is still
>> around 9 microseconds). For 8 threads, the barrier overhead actually
>> drops a little, which is good. Let me know whether I can be of any
>> help. Thank you very much.
>
> sorry to be dense, but could you give me instructions on how i could
> remove the affinity mask and test the "barrier overhead" myself? I
> have built "pthread_sync" and it outputs numbers for me - which one
> would be the barrier overhead: Reference_time_1?
>
> 	Ingo

Hi, Ingo:

To disable affinity, do configure --enable-public-release
--disable-thread_affinity. You should see barrier overhead output like
the following:

Computing BARRIER time

Sample_size   Average     Min         Max         S.D.       Outliers
20            19.486162   19.482250   19.491400   0.002740   0

BARRIER time = 19.486162 microseconds +/- 0.005371
BARRIER overhead = 8.996257 microseconds +/- 0.006575

Reference_time_1 is the elapsed time for a single thread doing a simple
loop without any synchronization. Thank you.

--
Jie Chen, Scientific Computing Group, Thomas Jefferson National Accelerator Facility
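The methodology behind these numbers can be sketched as follows (the
names are illustrative, not qmt's actual API): time a loop of empty
delay() calls to get the reference time, time the same loop with the
synchronization inside, and report the difference as the overhead.

#include <sys/time.h>

/* Wall-clock time in microseconds, via gettimeofday. */
static double now_us(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

/* Average per-iteration time, in microseconds, of calling body(). */
double time_loop(void (*body)(void), int innerreps)
{
    double start;
    int j;

    start = now_us();
    for (j = 0; j < innerreps; j++)
        body();
    return (now_us() - start) / innerreps;
}

/* overhead = time_loop(barrier_body, n) - time_loop(empty_body, n) */

With the output above: the 19.49 us barrier time minus the roughly
10.49 us reference time gives the reported overhead of about 9.00 us.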
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Ingo Molnar @ 2007-12-05 21:02 UTC
To: Jie Chen
Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

* Jie Chen <chen@jlab.org> wrote:

>> sorry to be dense, but could you give me instructions on how i could
>> remove the affinity mask and test the "barrier overhead" myself? I
>> have built "pthread_sync" and it outputs numbers for me - which one
>> would be the barrier overhead: Reference_time_1?
>
> To disable affinity, do configure --enable-public-release
> --disable-thread_affinity. You should see barrier overhead output like
> the following:
>
> Computing BARRIER time
>
> Sample_size   Average     Min         Max         S.D.       Outliers
> 20            19.486162   19.482250   19.491400   0.002740   0
>
> BARRIER time = 19.486162 microseconds +/- 0.005371
> BARRIER overhead = 8.996257 microseconds +/- 0.006575

ok, i did that and rebuilt. I also did "make check" and got
src/pthread_sync, which i can run. The only thing i'm missing: if i run
src/pthread_sync, it outputs "PARALLEL time":

PARALLEL time = 22.486103 microseconds +/- 3.944821
PARALLEL overhead = 10.638658 microseconds +/- 10.854154

not "BARRIER time". I've re-read the discussion and found no hint about
how to build and run a barrier test. Either i missed it or it's so
obvious to you that you didn't mention it :-)

	Ingo
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Jie Chen @ 2007-12-05 22:16 UTC
To: Ingo Molnar
Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
>
>>> sorry to be dense, but could you give me instructions on how i could
>>> remove the affinity mask and test the "barrier overhead" myself? I
>>> have built "pthread_sync" and it outputs numbers for me - which one
>>> would be the barrier overhead: Reference_time_1?
>>
>> To disable affinity, do configure --enable-public-release
>> --disable-thread_affinity. You should see barrier overhead output
>> like the following:
>>
>> Computing BARRIER time
>>
>> Sample_size   Average     Min         Max         S.D.       Outliers
>> 20            19.486162   19.482250   19.491400   0.002740   0
>>
>> BARRIER time = 19.486162 microseconds +/- 0.005371
>> BARRIER overhead = 8.996257 microseconds +/- 0.006575
>
> ok, i did that and rebuilt. I also did "make check" and got
> src/pthread_sync, which i can run. The only thing i'm missing: if i
> run src/pthread_sync, it outputs "PARALLEL time":
>
> PARALLEL time = 22.486103 microseconds +/- 3.944821
> PARALLEL overhead = 10.638658 microseconds +/- 10.854154
>
> not "BARRIER time". I've re-read the discussion and found no hint
> about how to build and run a barrier test. Either i missed it or it's
> so obvious to you that you didn't mention it :-)
>
> 	Ingo

Hi, Ingo:

Did you do configure --enable-public-release? My qmt is for QCD
calculations (a type of physics code). Without the above flag one can
only test the PARALLEL overhead. The PARALLEL benchmark actually shows
the same behavior as the BARRIER one. Thanks.

--
Jie Chen, Scientific Computing Group, Thomas Jefferson National Accelerator Facility
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Ingo Molnar @ 2007-12-06 10:43 UTC
To: Jie Chen
Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

* Jie Chen <chen@jlab.org> wrote:

>> not "BARRIER time". I've re-read the discussion and found no hint
>> about how to build and run a barrier test. Either i missed it or it's
>> so obvious to you that you didn't mention it :-)
>>
>> 	Ingo
>
> Hi, Ingo:
>
> Did you do configure --enable-public-release? My qmt is for QCD
> calculations (a type of physics code) [...]

yes, i did exactly as instructed.

> [...]. Without the above flag one can only test the PARALLEL overhead.
> The PARALLEL benchmark actually shows the same behavior as the BARRIER
> one. Thanks.

hm, but PARALLEL does not seem to do that much context switching. So
basically you create the threads and do a few short runs to establish
the overhead? Threads do not get fork-balanced at the moment - but
turning that on would be easy. Could you try the patch below - how does
it impact your results? (and please keep the affinity setting off)

	Ingo

----------->
Subject: sched: reactivate fork balancing
From: Ingo Molnar <mingo@elte.hu>

reactivate fork balancing.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/topology.h |    3 +++
 1 file changed, 3 insertions(+)

Index: linux/include/linux/topology.h
===================================================================
--- linux.orig/include/linux/topology.h
+++ linux/include/linux/topology.h
@@ -103,6 +103,7 @@
 	.forkexec_idx		= 0,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_IDLE		\
@@ -134,6 +135,7 @@
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_IDLE		\
@@ -165,6 +167,7 @@
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| BALANCE_FOR_PKG_POWER,\
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Jie Chen @ 2007-12-06 16:29 UTC
To: Ingo Molnar
Cc: Simon Holm Thøgersen, Eric Dumazet, linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
>
>>> not "BARRIER time". I've re-read the discussion and found no hint
>>> about how to build and run a barrier test. Either i missed it or
>>> it's so obvious to you that you didn't mention it :-)
>>>
>>> 	Ingo
>>
>> Hi, Ingo:
>>
>> Did you do configure --enable-public-release? My qmt is for QCD
>> calculations (a type of physics code) [...]
>
> yes, i did exactly as instructed.
>
>> [...]. Without the above flag one can only test the PARALLEL
>> overhead. The PARALLEL benchmark actually shows the same behavior as
>> the BARRIER one. Thanks.
>
> hm, but PARALLEL does not seem to do that much context switching. So
> basically you create the threads and do a few short runs to establish
> the overhead? Threads do not get fork-balanced at the moment - but
> turning that on would be easy. Could you try the patch below - how
> does it impact your results? (and please keep the affinity setting
> off)
>
> 	Ingo
>
> ----------->
> Subject: sched: reactivate fork balancing
> From: Ingo Molnar <mingo@elte.hu>
>
> reactivate fork balancing.
>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  include/linux/topology.h |    3 +++
>  1 file changed, 3 insertions(+)
>
> Index: linux/include/linux/topology.h
> ===================================================================
> --- linux.orig/include/linux/topology.h
> +++ linux/include/linux/topology.h
> @@ -103,6 +103,7 @@
>  	.forkexec_idx		= 0,			\
>  	.flags			= SD_LOAD_BALANCE	\
>  				| SD_BALANCE_NEWIDLE	\
> +				| SD_BALANCE_FORK	\
>  				| SD_BALANCE_EXEC	\
>  				| SD_WAKE_AFFINE	\
>  				| SD_WAKE_IDLE		\
> @@ -134,6 +135,7 @@
>  	.forkexec_idx		= 1,			\
>  	.flags			= SD_LOAD_BALANCE	\
>  				| SD_BALANCE_NEWIDLE	\
> +				| SD_BALANCE_FORK	\
>  				| SD_BALANCE_EXEC	\
>  				| SD_WAKE_AFFINE	\
>  				| SD_WAKE_IDLE		\
> @@ -165,6 +167,7 @@
>  	.forkexec_idx		= 1,			\
>  	.flags			= SD_LOAD_BALANCE	\
>  				| SD_BALANCE_NEWIDLE	\
> +				| SD_BALANCE_FORK	\
>  				| SD_BALANCE_EXEC	\
>  				| SD_WAKE_AFFINE	\
>  				| BALANCE_FOR_PKG_POWER,\

Hi, Ingo:

I did patch the header file and recompiled the kernel. I observed no
difference (the two-thread overhead stays too high). Thank you.

--
Jie Chen, Scientific Computing Group, Thomas Jefferson National Accelerator Facility
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Ingo Molnar @ 2007-12-10 10:59 UTC
To: Jie Chen
Cc: linux-kernel, Peter Zijlstra

* Jie Chen <chen@jlab.org> wrote:

> I did patch the header file and recompiled the kernel. I observed no
> difference (the two-thread overhead stays too high). Thank you.

ok, i think i found it. You do this in your qmt/pthread_sync.c
test-code:

double get_time_of_day_()
{
...
    err = gettimeofday(&ts, NULL);
...
}

and then you use this in the measurement loop:

    for (k=0; k<=OUTERREPS; k++){
        start = getclock();
        for (j=0; j<innerreps; j++){
#ifdef _QMT_PUBLIC
            delay((void *)0, 0);
#else
            delay(0, 0, 0, (void *)0);
#endif
        }
        times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
    }

the problem is, this does not take the overhead of gettimeofday into
account - an overhead that can easily reach 10 usecs (the size of the
observed regression). Could you try to eliminate the gettimeofday
overhead from your measurement?

gettimeofday overhead is something that might have changed from .21 to
.22 on your box.

	Ingo
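One way to quantify the timer cost Ingo is pointing at is to calibrate
it directly: time a tight loop of gettimeofday() calls and subtract the
per-call cost from subsequent measurements. A minimal sketch, not qmt
code:

#include <sys/time.h>

/* Average cost of one gettimeofday() call, in microseconds. */
double gettimeofday_cost_us(int calls)
{
    struct timeval t0, t1, tmp;
    int i;

    gettimeofday(&t0, NULL);
    for (i = 0; i < calls; i++)
        gettimeofday(&tmp, NULL);
    gettimeofday(&t1, NULL);
    return ((t1.tv_sec - t0.tv_sec) * 1e6 +
            (t1.tv_usec - t0.tv_usec)) / calls;
}

If the per-call cost differs between kernels, that difference shows up
directly in any benchmark that brackets short regions with
gettimeofday.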
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Jie Chen @ 2007-12-10 20:04 UTC
To: Ingo Molnar
Cc: linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
>
>> I did patch the header file and recompiled the kernel. I observed no
>> difference (the two-thread overhead stays too high). Thank you.
>
> ok, i think i found it. You do this in your qmt/pthread_sync.c
> test-code:
>
> double get_time_of_day_()
> {
> ...
>     err = gettimeofday(&ts, NULL);
> ...
> }
>
> and then you use this in the measurement loop:
>
>     for (k=0; k<=OUTERREPS; k++){
>         start = getclock();
>         for (j=0; j<innerreps; j++){
> #ifdef _QMT_PUBLIC
>             delay((void *)0, 0);
> #else
>             delay(0, 0, 0, (void *)0);
> #endif
>         }
>         times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
>     }
>
> the problem is, this does not take the overhead of gettimeofday into
> account - an overhead that can easily reach 10 usecs (the size of the
> observed regression). Could you try to eliminate the gettimeofday
> overhead from your measurement?
>
> gettimeofday overhead is something that might have changed from .21 to
> .22 on your box.
>
> 	Ingo

Hi, Ingo:

In my pthread_sync code, I first call the refer() subroutine, which
establishes the elapsed time (the reference time) for the
non-synchronized delay() using gettimeofday. Each synchronization
overhead value is then obtained by subtracting the reference time from
the elapsed time measured with the synchronization in place. The effect
of gettimeofday() should therefore be minimal, since it is the time
difference (the overhead value) that is of interest here - unless
gettimeofday behaves differently when running 8 threads vs. 2 threads.

I will try to replace gettimeofday with a lightweight timer call in my
test code. Thank you very much.

--
Jie Chen, Scientific Computing Group, Thomas Jefferson National Accelerator Facility
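One candidate for such a lightweight timer on x86 is the time-stamp
counter. The sketch below is only an illustration of what "lightweight
timer" could mean here, not what qmt ended up using; it assumes a
constant-rate TSC (the constant_tsc flag in /proc/cpuinfo) and threads
pinned to one core, since TSCs are not guaranteed to be synchronized
across sockets:

/* Read the x86 time-stamp counter; divide the cycle delta by the CPU
 * frequency in MHz to convert to microseconds. */
static inline unsigned long long rdtsc(void)
{
    unsigned int lo, hi;

    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}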
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Ingo Molnar @ 2007-12-11 10:51 UTC
To: Jie Chen
Cc: linux-kernel, Peter Zijlstra

* Jie Chen <chen@jlab.org> wrote:

>> and then you use this in the measurement loop:
>>
>>     for (k=0; k<=OUTERREPS; k++){
>>         start = getclock();
>>         for (j=0; j<innerreps; j++){
>> #ifdef _QMT_PUBLIC
>>             delay((void *)0, 0);
>> #else
>>             delay(0, 0, 0, (void *)0);
>> #endif
>>         }
>>         times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
>>     }
>>
>> the problem is, this does not take the overhead of gettimeofday into
>> account - an overhead that can easily reach 10 usecs (the size of the
>> observed regression). Could you try to eliminate the gettimeofday
>> overhead from your measurement?
>>
>> gettimeofday overhead is something that might have changed from .21
>> to .22 on your box.
>>
>> 	Ingo
>
> Hi, Ingo:
>
> In my pthread_sync code, I first call the refer() subroutine, which
> establishes the elapsed time (the reference time) for the
> non-synchronized delay() using gettimeofday. Each synchronization
> overhead value is then obtained by subtracting the reference time from
> the elapsed time measured with the synchronization in place. The
> effect of gettimeofday() should therefore be minimal, since it is the
> time difference (the overhead value) that is of interest here - unless
> gettimeofday behaves differently when running 8 threads vs. 2 threads.
>
> I will try to replace gettimeofday with a lightweight timer call in my
> test code. Thank you very much.

gettimeofday overhead is around 10 usecs here:

2740  1197359374.873214 gettimeofday({1197359374, 873225}, NULL) = 0 <0.000010>
2740  1197359374.970592 gettimeofday({1197359374, 970608}, NULL) = 0 <0.000010>

and that's the only thing going on when computing the reference time -
and i see a similar syscall pattern in the PARALLEL and BARRIER
calculations as well (with no real scheduling going on).

	Ingo
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Jie Chen @ 2007-12-11 15:28 UTC
To: Ingo Molnar
Cc: linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
>
>>> and then you use this in the measurement loop:
>>>
>>>     for (k=0; k<=OUTERREPS; k++){
>>>         start = getclock();
>>>         for (j=0; j<innerreps; j++){
>>> #ifdef _QMT_PUBLIC
>>>             delay((void *)0, 0);
>>> #else
>>>             delay(0, 0, 0, (void *)0);
>>> #endif
>>>         }
>>>         times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
>>>     }
>>>
>>> the problem is, this does not take the overhead of gettimeofday into
>>> account - an overhead that can easily reach 10 usecs (the size of
>>> the observed regression). Could you try to eliminate the
>>> gettimeofday overhead from your measurement?
>>>
>>> gettimeofday overhead is something that might have changed from .21
>>> to .22 on your box.
>>>
>>> 	Ingo
>>
>> Hi, Ingo:
>>
>> In my pthread_sync code, I first call the refer() subroutine, which
>> establishes the elapsed time (the reference time) for the
>> non-synchronized delay() using gettimeofday. Each synchronization
>> overhead value is then obtained by subtracting the reference time
>> from the elapsed time measured with the synchronization in place. The
>> effect of gettimeofday() should therefore be minimal, since it is the
>> time difference (the overhead value) that is of interest here -
>> unless gettimeofday behaves differently when running 8 threads vs. 2
>> threads.
>>
>> I will try to replace gettimeofday with a lightweight timer call in
>> my test code. Thank you very much.
>
> gettimeofday overhead is around 10 usecs here:
>
> 2740  1197359374.873214 gettimeofday({1197359374, 873225}, NULL) = 0 <0.000010>
> 2740  1197359374.970592 gettimeofday({1197359374, 970608}, NULL) = 0 <0.000010>
>
> and that's the only thing going on when computing the reference time -
> and i see a similar syscall pattern in the PARALLEL and BARRIER
> calculations as well (with no real scheduling going on).
>
> 	Ingo

Hi, Ingo:

I guess this is good news. I patched the 2.6.21.7 kernel with your cfs
patch. The results of pthread_sync are the same as with the non-patched
2.6.21 kernel. This means the performance issue is not related to the
scheduler. As for the overhead of gettimeofday, there is no difference
between 2.6.21 and 2.6.24-rc4: the reference time is around 10.5 us for
both kernels.

So what changed between 2.6.21 and 2.6.22? Any hints :-). Thank you
very much for all your help.

--
Jie Chen, Scientific Computing Group, Thomas Jefferson National Accelerator Facility
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Ingo Molnar @ 2007-12-11 15:52 UTC
To: Jie Chen
Cc: linux-kernel, Peter Zijlstra

* Jie Chen <chen@jlab.org> wrote:

> Hi, Ingo:
>
> I guess this is good news. I patched the 2.6.21.7 kernel with your cfs
> patch. The results of pthread_sync are the same as with the
> non-patched 2.6.21 kernel. This means the performance issue is not
> related to the scheduler. As for the overhead of gettimeofday, there
> is no difference between 2.6.21 and 2.6.24-rc4: the reference time is
> around 10.5 us for both kernels.

could you please paste again the relevant portion of the output you get
on a "good" .21 kernel versus the output you get on a "bad" .24 kernel?

> So what changed between 2.6.21 and 2.6.22? Any hints :-). Thank you
> very much for all your help.

we'll figure it out i'm sure :)

	Ingo
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Jie Chen @ 2007-12-11 16:39 UTC
To: Ingo Molnar
Cc: linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
>
>> Hi, Ingo:
>>
>> I guess this is good news. I patched the 2.6.21.7 kernel with your
>> cfs patch. The results of pthread_sync are the same as with the
>> non-patched 2.6.21 kernel. This means the performance issue is not
>> related to the scheduler. As for the overhead of gettimeofday, there
>> is no difference between 2.6.21 and 2.6.24-rc4: the reference time is
>> around 10.5 us for both kernels.
>
> could you please paste again the relevant portion of the output you
> get on a "good" .21 kernel versus the output you get on a "bad" .24
> kernel?
>
>> So what changed between 2.6.21 and 2.6.22? Any hints :-). Thank you
>> very much for all your help.
>
> we'll figure it out i'm sure :)
>
> 	Ingo

Hi, Ingo:

The following is the pthread_sync output for the 2.6.21.7-cfs-v24 #1
SMP kernel.

2 threads:

Computing reference time 1

Sample_size   Average     Min         Max         S.D.       Outliers
20            10.489085   10.488800   10.491100   0.000539   1

Reference_time_1 = 10.489085 microseconds +/- 0.001057

Computing PARALLEL time

Sample_size   Average     Min         Max         S.D.       Outliers
20            11.106580   11.105650   11.109700   0.001255   0

PARALLEL time = 11.106580 microseconds +/- 0.002460
PARALLEL overhead = 0.617590 microseconds +/- 0.003409

8 threads:

Computing reference time 1

Sample_size   Average     Min         Max         S.D.       Outliers
20            10.488735   10.488500   10.490700   0.000484   1

Reference_time_1 = 10.488735 microseconds +/- 0.000948

Computing PARALLEL time

Sample_size   Average     Min         Max         S.D.       Outliers
20            13.000647   12.991050   13.052700   0.012592   1

PARALLEL time = 13.000647 microseconds +/- 0.024680
PARALLEL overhead = 2.511907 microseconds +/- 0.025594

Output for kernel 2.6.24-rc4 #1 SMP:

2 threads:

Computing reference time 1

Sample_size   Average     Min         Max         S.D.       Outliers
20            10.510535   10.508600   10.518200   0.002237   1

Reference_time_1 = 10.510535 microseconds +/- 0.004384

Computing PARALLEL time

Sample_size   Average     Min         Max         S.D.       Outliers
20            19.668450   19.650200   19.679650   0.008052   0

PARALLEL time = 19.668450 microseconds +/- 0.015782
PARALLEL overhead = 9.157945 microseconds +/- 0.018217

8 threads:

Computing reference time 1

Sample_size   Average     Min         Max         S.D.       Outliers
20            10.491285   10.490100   10.494900   0.001085   1

Reference_time_1 = 10.491285 microseconds +/- 0.002127

Computing PARALLEL time

Sample_size   Average     Min         Max         S.D.       Outliers
20            13.090080   13.079150   13.131450   0.010995   1

PARALLEL time = 13.090080 microseconds +/- 0.021550
PARALLEL overhead = 2.598590 microseconds +/- 0.024534

For 8 threads, both kernels show similar performance numbers. But for 2
threads, 2.6.21 is much better than 2.6.24-rc4. Thank you.

--
Jie Chen, Scientific Computing Group, Thomas Jefferson National Accelerator Facility
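For reference, the Average and S.D. columns in this output can be
reproduced from the raw per-run times along these lines (an
illustrative sketch; qmt's own statistics code, including its outlier
test, may differ):

#include <math.h>

/* Mean and sample standard deviation of n timing samples;
 * link with -lm for sqrt(). */
void stats(const double *t, int n, double *avg, double *sd)
{
    double sum = 0.0, sq = 0.0;
    int i;

    for (i = 0; i < n; i++)
        sum += t[i];
    *avg = sum / n;
    for (i = 0; i < n; i++)
        sq += (t[i] - *avg) * (t[i] - *avg);
    *sd = sqrt(sq / (n - 1));
}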
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Ingo Molnar @ 2007-12-11 21:23 UTC
To: Jie Chen
Cc: linux-kernel, Peter Zijlstra

* Jie Chen <chen@jlab.org> wrote:

> The following is the pthread_sync output for the 2.6.21.7-cfs-v24 #1
> SMP kernel.

> 2 threads:

> PARALLEL time = 11.106580 microseconds +/- 0.002460
> PARALLEL overhead = 0.617590 microseconds +/- 0.003409

> Output for kernel 2.6.24-rc4 #1 SMP:

> PARALLEL time = 19.668450 microseconds +/- 0.015782
> PARALLEL overhead = 9.157945 microseconds +/- 0.018217

ok, so the problem is that this PARALLEL time has an additional +9
usecs of overhead, right? I don't see this myself on a Core2 CPU:

PARALLEL time = 10.446933 microseconds +/- 0.078849
PARALLEL overhead = 0.751732 microseconds +/- 0.177446

	Ingo
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Jie Chen @ 2007-12-11 22:11 UTC
To: Ingo Molnar
Cc: linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Jie Chen <chen@jlab.org> wrote:
>
>> The following is the pthread_sync output for the 2.6.21.7-cfs-v24 #1
>> SMP kernel.
>
>> 2 threads:
>
>> PARALLEL time = 11.106580 microseconds +/- 0.002460
>> PARALLEL overhead = 0.617590 microseconds +/- 0.003409
>
>> Output for kernel 2.6.24-rc4 #1 SMP:
>
>> PARALLEL time = 19.668450 microseconds +/- 0.015782
>> PARALLEL overhead = 9.157945 microseconds +/- 0.018217
>
> ok, so the problem is that this PARALLEL time has an additional +9
> usecs of overhead, right? I don't see this myself on a Core2 CPU:
>
> PARALLEL time = 10.446933 microseconds +/- 0.078849
> PARALLEL overhead = 0.751732 microseconds +/- 0.177446
>
> 	Ingo

Hi, Ingo:

Yes. The extra 9 usecs of overhead shows up when running two threads on
the 2.6.24 kernel, on a machine with a total of 8 cores (2 quad-core
Opterons). What is the total number of cores on your machine? I do not
have machines with dual quad-core Xeons here for a direct comparison.
Thank you.

--
Jie Chen, Scientific Computing Group, Thomas Jefferson National Accelerator Facility
* Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
From: Peter Zijlstra @ 2007-12-12 12:49 UTC
To: Jie Chen
Cc: Ingo Molnar, linux-kernel

On Tue, 2007-12-11 at 17:11 -0500, Jie Chen wrote:
> Ingo Molnar wrote:
>> * Jie Chen <chen@jlab.org> wrote:
>>
>>> The following is the pthread_sync output for the 2.6.21.7-cfs-v24 #1
>>> SMP kernel.
>>
>>> 2 threads:
>>
>>> PARALLEL time = 11.106580 microseconds +/- 0.002460
>>> PARALLEL overhead = 0.617590 microseconds +/- 0.003409
>>
>>> Output for kernel 2.6.24-rc4 #1 SMP:
>>
>>> PARALLEL time = 19.668450 microseconds +/- 0.015782
>>> PARALLEL overhead = 9.157945 microseconds +/- 0.018217
>>
>> ok, so the problem is that this PARALLEL time has an additional +9
>> usecs of overhead, right? I don't see this myself on a Core2 CPU:
>>
>> PARALLEL time = 10.446933 microseconds +/- 0.078849
>> PARALLEL overhead = 0.751732 microseconds +/- 0.177446

On my dual socket AMD Athlon MP:

2.6.20-13-generic
PARALLEL time = 22.751875 microseconds +/- 21.370942
PARALLEL overhead = 7.046595 microseconds +/- 24.370040

2.6.24-rc5
PARALLEL time = 17.365543 microseconds +/- 3.295133
PARALLEL overhead = 2.213722 microseconds +/- 4.797886
* Re: Possible bug from kernel 2.6.22 and above
From: Peter Zijlstra @ 2007-12-05 20:36 UTC
To: Jie Chen
Cc: linux-kernel, Eric Dumazet, Ingo Molnar, Simon Holm Thøgersen

On Wed, 2007-11-21 at 15:34 -0500, Jie Chen wrote:

> It is clearly that the synchronization overhead increases as the number
> of threads increases in the kernel 2.6.21. But the synchronization
> overhead actually decreases as the number of threads increases in the
> kernel 2.6.23.8 (We observed the same behavior on kernel 2.6.22 as
> well). This certainly is not a correct behavior. The kernels are
> configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel
> configuration file is in the attachment of this e-mail.
>
> From what we have read, there was a new scheduler (CFS) appeared from
> 2.6.22. We are not sure whether the above behavior is caused by the new
> scheduler.

If I read this correctly, you are saying that .22 is the first bad one,
right?

The new scheduler (CFS) was introduced in .23, so it seems another
change must be responsible for this.
* Re: Possible bug from kernel 2.6.22 and above
From: Jie Chen @ 2007-12-05 20:53 UTC
To: Peter Zijlstra
Cc: linux-kernel, Eric Dumazet, Ingo Molnar, Simon Holm Thøgersen

Peter Zijlstra wrote:
> On Wed, 2007-11-21 at 15:34 -0500, Jie Chen wrote:
>
>> It is clearly that the synchronization overhead increases as the
>> number of threads increases in the kernel 2.6.21. But the
>> synchronization overhead actually decreases as the number of threads
>> increases in the kernel 2.6.23.8 (We observed the same behavior on
>> kernel 2.6.22 as well). This certainly is not a correct behavior. The
>> kernels are configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC,
>> CONFIG_PREEMPT_NONE, CONFIG_DISCONTIGMEM set. The complete kernel
>> configuration file is in the attachment of this e-mail.
>>
>> From what we have read, there was a new scheduler (CFS) appeared from
>> 2.6.22. We are not sure whether the above behavior is caused by the
>> new scheduler.
>
> If I read this correctly, you are saying that .22 is the first bad
> one, right?
>
> The new scheduler (CFS) was introduced in .23, so it seems another
> change must be responsible for this.

Hi, Peter:

Yes. We did observe this in 2.6.22. Thank you.

--
Jie Chen, Scientific Computing Group, Thomas Jefferson National Accelerator Facility
End of thread.

Thread overview: 35+ messages

2007-11-21 20:34 Possible bug from kernel 2.6.22 and above Jie Chen
2007-11-21 22:14 ` Eric Dumazet
2007-11-22  1:52 ` Jie Chen
2007-11-22  2:32 ` Simon Holm Thøgersen
2007-11-22  2:58 ` Jie Chen
2007-11-22 20:19 ` Matt Mackall
2007-12-04 13:17 ` Possible bug from kernel 2.6.22 and above, 2.6.24-rc4 Ingo Molnar
2007-12-04 15:41 ` Jie Chen
2007-12-05 15:29 ` Jie Chen
2007-12-05 15:40 ` Ingo Molnar
2007-12-05 16:16 ` Eric Dumazet
2007-12-05 16:25 ` Ingo Molnar
2007-12-05 16:29 ` Eric Dumazet
2007-12-05 16:22 ` Jie Chen
2007-12-05 16:47 ` Ingo Molnar
2007-12-05 17:47 ` Jie Chen
2007-12-05 20:03 ` Ingo Molnar
2007-12-05 20:23 ` Jie Chen
2007-12-05 20:46 ` Ingo Molnar
2007-12-05 20:52 ` Jie Chen
2007-12-05 21:02 ` Ingo Molnar
2007-12-05 22:16 ` Jie Chen
2007-12-06 10:43 ` Ingo Molnar
2007-12-06 16:29 ` Jie Chen
2007-12-10 10:59 ` Ingo Molnar
2007-12-10 20:04 ` Jie Chen
2007-12-11 10:51 ` Ingo Molnar
2007-12-11 15:28 ` Jie Chen
2007-12-11 15:52 ` Ingo Molnar
2007-12-11 16:39 ` Jie Chen
2007-12-11 21:23 ` Ingo Molnar
2007-12-11 22:11 ` Jie Chen
2007-12-12 12:49 ` Peter Zijlstra
2007-12-05 20:36 ` Possible bug from kernel 2.6.22 and above Peter Zijlstra
2007-12-05 20:53 ` Jie Chen