* -mm merge plans for 2.6.23
@ 2007-07-10 8:31 UTC
From: Andrew Morton
To: linux-kernel

When replying, please rewrite the subject suitably and try to Cc: the
appropriate developer(s).

add-lzo1x-algorithm-to-the-kernel.patch
make-common-helpers-for-seq_files-that-work-with-list_head-s.patch
lots-of-architectures-enable-arbitary-speed-tty-support.patch

 Merge

serial-assert-dtr-for-serial-console-devices.patch

 Don't know. I worry about Russell's concern (see the changelog)

git-acpi-s390-struct-bin_attribute-changes.patch
cpuidle-add-rating-to-the-governors-and-pick-the-one-with-highest-rating-by-default-fix.patch
exit-acpi-processor-module-gracefully-if-acpi-is-disabled.patch
fix-empty-macros-in-acpi.patch
drivers-acpi-sbsc-remove-dead-code.patch
acpi-enable-c3-power-state-on-dell-inspiron-8200.patch
drivers-acpi-pci_linkc-lower-printk-severity.patch

 Sent to lenb

working-3d-dri-intel-agpko-resume-for-i815-chip.patch

 Sent to davej

cifs-use-simple_prepare_write-to-zero-page-data.patch
cifs-zero_user_page-conversion.patch

 Sent to sfrench

bugfix-cpufreq-in-combination-with-performance-governor.patch
restore-previously-used-governor-on-a-hot-replugged-cpu.patch

 Sent to davej

kcopyd-use-mutex-instead-of-semaphore.patch

 Sent to agk

powerpc-promc-remove-undef-printk.patch
8xx-mpc885ads-pcmcia-support.patch
dts-kill-hardcoded-phandles.patch
ppc-remove-dead-code-for-preventing-pread-and-pwrite-calls.patch
viotape-use-designated-initializers-for-fops-member.patch
make-drivers-char-hvc_consoleckhvcd-static.patch
powerpc-enable-arbitary-speed-tty-ioctls-and-split.patch
powerpc-tlb_32c-build-fix.patch
sky-cpu-and-nexus-code-style-improvement.patch
sky-cpu-and-nexus-include-ioh.patch
sky-cpu-and-nexus-check-for-platform_get_resource-ret.patch
sky-cpu-and-nexus-check-for-create_proc_entry-ret-code.patch
sky-cpu-use-c99-style-for-struct-init.patch

 Sent to paulus

revert-gregkh-driver-block-device.patch
driver-core-check-return-code-of-sysfs_create_link.patch
driver-core-coding-style-cleanup.patch
pm-do-not-use-saved_state-from-struct-dev_pm_info-on-arm.patch
nozomi-remove-termios-checks-from-various-old-char-serial-drivers.patch

 Sent to greg

git-dvb-saa7134-tvaudio-fix.patch
dvb_en_50221-convert-to-kthread-api.patch

 Sent to mchehab

hdaps-switch-to-using-input-polldev.patch
applesmc-switch-to-using-input-polldev.patch
applesmc-add-temperature-sensors-set-for-macbook.patch
ams-switch-to-using-input-polldev.patch

 Sent to mhoffman

sn-correct-rom-resource-length-for-bios-copy.patch

 Sent to Tony

make-input-layer-use-seq_list_xxx-helpers.patch
touchscreen-fujitsu-touchscreen-driver.patch
serio_raw_read-warning-fix.patch
tsdev-fix-broken-usecto-millisecs-conversion.patch

 Sent to Dmitry

use-posix-bre-in-headers-install-target.patch
modpost-white-list-pattern-adjustment.patch
strip-config_-automatically-in-kernel-configuration-search.patch
fix-the-warning-when-running-make-tags.patch
kconfig-reset-generated-values-only-if-kconfig-and-config-agree.patch

 Sent to Sam

led_colour_show-warning-fix.patch

 Sent to rpurdie

libata-config_pm=n-compile-fix.patch
pata_acpi-restore-driver.patch
libata-core-convert-to-use-cancel_rearming_delayed_work.patch
libata-implement-ata_wait_after_reset.patch
sata_promise-sata-hotplug-support.patch
libata-add-irq_flags-to-struct-pata_platform_info-fix.patch
ata-add-the-sw-ncq-support-to-sata_nv-for-mcp51-mcp55-mcp61.patch
sata_nv-allow-changing-queue-depth.patch
pata_hpt3x3-major-reworking-and-testing.patch
iomap-sort-out-the-broken-address-reporting-caused-by-the-iomap-layer.patch
ata-use-iomap_name.patch

 Sent to jgarzik

libata-check-for-an-support.patch
scsi-expose-an-to-user-space.patch
libata-expose-an-to-user-space.patch
scsi-save-disk-in-scsi_device.patch
libata-send-event-when-an-received.patch

 Am sitting on these due to confusion regarding the status of the
 ata-ahci patches.

ata-ahci-alpm-store-interrupt-value.patch
ata-ahci-alpm-expose-power-management-policy-option-to-users.patch
ata-ahci-alpm-enable-link-power-management-for-ata-drivers.patch
ata-ahci-alpm-enable-aggressive-link-power-management-for-ahci-controllers.patch

 These appear to need some work.

libata-add-human-readable-error-value-decoding.patch
libata-fix-hopefully-all-the-remaining-problems-with.patch
testing-patch-for-ali-pata-fixes-hopefully-for-the-problems-with-atapi-dma.patch
pata_ali-more-work.patch

 Dead/dying/abandoned ata things. Might drop.

mips-make-resources-for-ds1742-static-__initdata.patch

 Sent to Ralf.

tty-add-the-new-ioctls-and-definitionto-the-mips.patch

 Awaiting merge of
 lots-of-architectures-enable-arbitary-speed-tty-support.patch

mmc-at91_mci-typo.patch

 Sent to drzeus

mtd-onenand-build-fix.patch
nommu-present-backing-device-capabilities-for-mtd.patch
nommu-add-support-for-direct-mapping-through-mtdconcat.patch
nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch
romfs-printk-format-warnings.patch
mtd-add-module-license-to-mtdbdi.patch

 Sent to dwmw2

8139too-force-media-setting-fix.patch
blackfin-on-chip-ethernet-mac-controller-driver.patch
atari_pamsnetc-old-declaration-ritchie-style-fix.patch
sundance-phy-address-form-0-only-for-device-id-0x0200.patch
use-is_power_of_2-in-cxgb3-cxgb3_mainc.patch
use-is_power_of_2-in-myri10ge-myri10gec.patch
3csoho100-tx-needs-extra_preamble.patch

 Sent to jgarzik

3c59x-fix-pci-resource-management.patch
update-smc91x-driver-with-arm-versatile-board-info.patch
drivers-net-ns83820c-add-paramter-to-disable-auto.patch

 netdev patches which are stuck in limbo land.
make-atm-driver-use-seq_list_xxx-helpers.patch
make-some-network-related-proc-files-use-seq_list_xxx.patch
wrong-timeout-value-in-sk_wait_data-v2-fix.patch
use-mutex-instead-of-semaphore-in-vlsi-82c147-irda-controller-driver.patch
bonding-bond_mainc-make-2-functions-static.patch
net-make-struct-dccp_li_cachep-static.patch
net-ipv4-netfilter-ip_tablesc-lower-printk-severity.patch
rpc-remove-makefile-reference-to-obsolete-rxrpc-config.patch

 Sent to davem (mostly merged now, I think)

bluetooth-remove-the-redundant-non-seekable-llseek-method.patch
rfcomm-hangup-ttys-before-releasing-rfcomm_dev.patch

 Sent to Marcel

git-ioat-vs-git-md-accel.patch
ioat-warning-fix.patch
fix-i-oat-for-kexec.patch

 I don't seem to be able to get rid of these. Chris Leech appears to
 have vanished.

auth_gss-unregister-gss_domain-when-unloading-module.patch

 Sent to Trond and Bruce, needs work.

pa-risc-use-page-allocator-instead-of-slab-allocator.patch

 Sent to Kyle

pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
use-menuconfig-objects-pcmcia.patch

 Am a bit stuck with the pcmcia patches. Dominik has disappeared.

pcmcia-pccard-deadlock-fix.patch

 I think this isn't a good patch. Am holding onto it as a reminder
 that pcmcia deadlocks.

dont-optimise-away-baud-rate-changes-when-bother-is-used.patch
serial-add-support-for-ite-887x-chips.patch
serial_txx9-fix-modem-control-line-handling.patch
serial_txx9-cleanup-includes.patch

 Serial stuff. Will run these past rmk and Alan and will merge them
 if they survive.

revert-gregkh-pci-pci_bridge-device.patch
fix-gregkh-pci-pci-syscallc-switch-to-refcounting-api.patch
pci-x-pci-express-read-control-interfaces-fix.patch
remove-pci_dac_dma_-apis.patch
round_up-macro-cleanup-in-drivers-pci.patch
pcie-remove-spin_lock_unlocked.patch
add-pci_try_set_mwi.patch
cpci_hotplug-convert-to-use-the-kthread-api.patch
pci_set_power_state-check-for-pm-capabilities-earlier.patch

 Sent to Greg

s390-rename-cpu_idle-to-s390_cpu_idle.patch

 Sent to Martin.
restore-acpi-change-for-scsi.patch
git-scsi-misc-vs-greg-sysfs-stuff.patch
aacraid-rename-check_reset.patch
scsi-dont-build-scsi_dma_mapunmap-for-has_dma.patch
drivers-scsi-small-cleanups.patch
sym53c8xx_2-claims-cpqarray-device.patch
drivers-scsi-wd33c93c-cleanups.patch
make-seagate_st0x_detect-static.patch
pci-error-recovery-symbios-scsi-base-support.patch
pci-error-recovery-symbios-scsi-first-failure.patch
drivers-scsi-pcmcia-nsp_csc-remove-kernel-24-code.patch
drivers-message-i2o-devicec-remove-redundant-gfp_atomic-from-kmalloc.patch
drivers-scsi-aic7xxx_oldc-remove-redundant-gfp_atomic-from-kmalloc.patch
use-menuconfig-objects-ii-scsi.patch
remove-dead-references-to-module_parm-macro.patch
ppa-coding-police-and-printk-levels.patch
remove-the-dead-cyberstormiii_scsi-option.patch
config_scsi_fd_8xx-no-longer-exists.patch
use-mutex-instead-of-semaphore-in-megaraid-mailbox-driver.patch

 Sent to James.

scsi-lpfc-lpfc_initc-remove-unused-variable.patch

 Will add to the James queue once add-pci_try_set_mwi.patch is merged.

use-menuconfig-objects-block-layer.patch
use-menuconfig-objects-ib-block.patch
use-menuconfig-objects-ii-block-devices.patch
block-device-elevator-use-list_for_each_entry-instead-of-list_for_each.patch
update-documentation-block-barriertxt.patch

 Sent to Jens.

videopix-frame-grabber-fix-unreleased-lock-in-vfc_debug.patch

 Sent to davem

fix-gregkh-usb-usb-ehci-cpufreq-fix.patch
fix-gregkh-usb-usb-use-menuconfig-objects.patch
make-usb-autosuspend-timer-1-sec-jiffy-aligned.patch
drivers-block-ubc-use-list_for_each_entry.patch
ftdi_sio-fix-something.patch
usb-make-the-usb_device-numa_node-to-get-assigned-from.patch
mos7840c-turn-this-into-a-serial-driver.patch
pl2303-remove-bogus-checks-and-fix-speed-support-to-use.patch
visor-and-whiteheat-remove-bogus-termios-change-checks.patch
mos7720-remove-bogus-no-termios-change-check.patch
io_-remove-bogus-termios-no-change-checks.patch
usb-remove-makefile-reference-to-obsolete-ohci_at91.patch

 Sent to Greg.
use-list_for_each_entry-for-iteration-in-prism-54-driver.patch

 Sent to linville

revert-x86_64-mm-verify-cpu-rename.patch
add-kstrndup-fix.patch
xen-build-fix.patch
fix-x86_64-numa-fake-apicid_to_node-mapping-for-fake-numa-2.patch
fix-x86_64-mm-xen-xen-smp-guest-support.patch
more-fix-x86_64-mm-xen-xen-smp-guest-support.patch
fix-x86_64-mm-sched-clock-share.patch
fix-x86_64-mm-xen-add-xen-virtual-block-device-driver.patch
fix-x86_64-mm-add-common-orderly_poweroff.patch
fix-x86_64-mm-xen-xen-event-channels.patch
arch-i386-xen-mmuc-must-include-linux-schedh.patch
tidy-up-usermode-helper-waiting-a-bit-fix.patch
update-x86_64-mm-xen-use-iret-directly-where-possible.patch
i386-add-support-for-picopower-irq-router.patch
make-arch-i386-kernel-setupcremapped_pgdat_init-static.patch
arch-i386-kernel-i8253c-should-include-asm-timerh.patch
make-arch-i386-kernel-io_apicctimer_irq_works-static-again.patch
quicklist-support-for-x86_64.patch
x86_64-extract-helper-function-from-e820_register_active_regions.patch
x86_64-fix-e820_hole_size-based-on-address-ranges.patch
x86_64-acpi-disable-srat-when-numa-emulation-succeeds.patch
x86_64-slit-fake-pxm-to-node-mapping-for-fake-numa-2.patch
x86_64-numa-fake-apicid_to_node-mapping-for-fake-numa-2.patch
x86-use-elfnoteh-to-generate-vsyscall-notes-fix.patch
mmconfig-x86_64-i386-insert-unclaimed-mmconfig-resources.patch
x86_64-fix-smp_call_function_single-return-value.patch
x86_64-o_excl-on-dev-mcelog.patch
x86_64-support-poll-on-dev-mcelog.patch
x86_64-mcelog-tolerant-level-cleanup.patch
x86_64-mce-poll-at-idle_start-and-printk-fix.patch
i386-fix-machine-rebooting.patch
x86-fix-section-mismatch-warnings-in-mtrr.patch
x86_64-ratelimit-segfault-reporting-rate.patch
x86_64-pm_trace-support.patch
make-alt-sysrq-p-display-the-debug-register-contents.patch
i386-flush_tlb_kernel_range-add-reference-to-the-arguments.patch
round_jiffies-for-i386-and-x86-64-non-critical-corrected-mce-polling.patch
pci-disable-decode-of-io-memory-during-bar-sizing.patch
mmconfig-validate-against-acpi-motherboard-resources.patch
x86_64-irq-check-remote-irr-bit-before-migrating-level-triggered-irq-v3.patch
i386-remove-support-for-the-rise-cpu.patch
x86-64-calgary-generalize-calgary_increase_split_completion_timeout.patch
x86-64-calgary-update-copyright-notice.patch
x86-64-calgary-introduce-handle_quirks-for-various-chipset-quirks.patch
x86-64-calgary-introduce-chipset-specific-ops.patch
x86-64-calgary-abstract-how-we-find-the-iommu_table-for-a-device.patch
x86-64-calgary-introduce-calioc2-support.patch
x86-64-calgary-add-chip_ops-and-a-quirk-function-for-calioc2.patch
x86-64-calgary-implement-calioc2-tce-cache-flush-sequence.patch
x86-64-calgary-make-dump_error_regs-a-chip-op.patch
x86-64-calgary-grab-plssr-too-when-a-dma-error-occurs.patch
x86-64-calgary-reserve-tces-with-the-same-address-as-mem-regions.patch
x86-64-calgary-cleanup-of-unneeded-macros.patch
x86-64-calgary-tabify-and-trim-trailing-whitespace.patch
x86-64-calgary-only-reserve-the-first-1mb-of-io-space-for-calioc2.patch
x86-64-calgary-tidy-up-debug-printks.patch
i386-make-arch-i386-mm-pgtablecpgd_cdtor-static.patch
i386-fix-section-mismatch-warning-in-intel_cacheinfo.patch
i386-do-not-restore-reserved-memory-after-hibernation.patch
paravirt-helper-to-disable-all-io-space-fix.patch
dmi_match-patch-in-rebootc-for-sff-dell-optiplex-745-fixes-hang.patch
i386-hpet-check-if-the-counter-works.patch
i386-trim-memory-not-covered-by-wb-mtrrs.patch
kprobes-x86_64-fix-for-mark-ro-data.patch
kprobes-i386-fix-for-mark-ro-data.patch
divorce-config_x86_pae-from-config_highmem64g.patch
remove-unneeded-test-of-task-in-dump_trace.patch
i386-move-the-kernel-to-16mb-for-numa-q.patch
i386-show-unhandled-signals.patch
i386-minor-nx-handling-adjustment.patch
x86-smp-alt-once-option-is-only-useful-with-hotplug_cpu.patch
x86-64-remove-unused-variable-maxcpus.patch
move-functions-declarations-to-header-file.patch
x86_64-during-vm-oom-condition.patch
i386-during-vm-oom-condition.patch
x86-64-disable-the-gart-in-shutdown.patch
x86_84-move-iommu-declaration-from-proto-to-iommuh.patch
i386-uaccessh-replace-hard-coded-constant-with-appropriate-macro-from-kernelh.patch
i386-add-cpu_relax-to-cmos_lock.patch
x86_64-flush_tlb_kernel_range-warning-fix.patch
x86_64-add-ioapic-nmi-support.patch
x86_64-change-_map_single-to-static-in-pci_gartc-etc.patch
x86_64-geode-hw-random-number-generator-depend-on-x86_3.patch
x86_64-fix-wrong-comment-regarding-set_fixmap.patch
arch-x86_64-kernel-processc-lower-printk-severity.patch
nohz-fix-nohz-x86-dyntick-idle-handling.patch
acpi-move-timer-broadcast-and-pmtimer-access-before-c3-arbiter-shutdown.patch
clockevents-fix-typo-in-acpi_pmc.patch
timekeeping-fixup-shadow-variable-argument.patch
timerc-cleanup-recently-introduced-whitespace-damage.patch
clockevents-remove-prototypes-of-removed-functions.patch
clockevents-fix-resume-logic.patch
clockevents-fix-device-replacement.patch
tick-management-spread-timer-interrupt.patch
highres-improve-debug-output.patch
hrtimer-speedup-hrtimer_enqueue.patch
pcspkr-use-the-global-pit-lock.patch
ntp-move-the-cmos-update-code-into-ntpc.patch
i386-pit-stop-only-when-in-periodic-or-oneshot-mode.patch
i386-remove-volatile-in-apicc.patch
i386-hpet-assumes-boot-cpu-is-0.patch
i386-move-pit-function-declarations-and-constants-to-correct-header-file.patch
x86_64-untangle-asm-hpeth-from-asm-timexh.patch
x86_64-use-generic-cmos-update.patch
x86_64-remove-dead-code-and-other-janitor-work-in-tscc.patch
x86_64-fix-apic-typo.patch
x86_64-convert-to-cleckevents.patch
acpi-remove-the-useless-ifdef-code.patch
x86_64-hpet-restore-vread.patch
x86_64-restore-restore-nohpet-cmdline.patch
x86_64-block-irq-balancing-for-timer.patch
x86_64-prep-idle-loop-for-dynticks.patch
x86_64-enable-high-resolution-timers-and-dynticks.patch
x86_64-dynticks-disable-hpet_id_legsup-hpets.patch
xen-fix-x86-config-dependencies.patch
x86_64-get-mp_bus_to_node-as-early.patch
xen-suppress-abs-symbol-warnings-for-unused-reloc-pointers.patch
xen-cant-support-numa-yet.patch
x86-fix-iounmaps-use-of-vm_structs-size-field.patch
arch-x86_64-kernel-aperturec-lower-printk-severity.patch
arch-x86_64-kernel-e820c-lower-printk-severity.patch
ich-force-hpet-make-generic-time-capable-of-switching-broadcast-timer.patch
ich-force-hpet-restructure-hpet-generic-clock-code.patch
ich-force-hpet-ich7-or-later-quirk-to-force-detect-enable.patch
ich-force-hpet-late-initialization-of-hpet-after-quirk.patch
ich-force-hpet-ich5-quirk-to-force-detect-enable.patch
ich-force-hpet-ich5-fix-a-bug-with-suspend-resume.patch
ich-force-hpet-add-ich7_0-pciid-to-quirk-list.patch
geode-basic-infrastructure-support-for-amd-geode-class.patch
geode-mfgpt-support-for-geode-class-machines.patch
geode-mfgpt-clock-event-device-support.patch
i386-x86_64-insert-hpet-firmware-resource-after-pci-enumeration-has-completed.patch
i386-ioapic-remove-old-irq-balancing-debug-cruft.patch
i386-deactivate-the-test-for-the-dead-config_debug_page_type.patch

 Sent to Andi

fix-xfs_ioc_fsgeometry_v1-in-compat-mode.patch
fix-xfs_ioc__to_handle-and-xfs_ioc_openreadlink_by_handle-in-compat-mode.patch
fix-xfs_ioc_fsbulkstat_single-and-xfs_ioc_fsinumbers-in-compat-mode.patch

 Sent to Tim & David.

xtensa-enable-arbitary-tty-speed-setting-ioctls.patch

 Sent to czankel

kgdb-warning-fix.patch
kgdb-kconfig-fix.patch
kgdb-use-new-style-interrupt-flags.patch
kgdb-section-fix.patch
kgdb_skipexception-warning-fix.patch
kgdb-ia64-fixes.patch
kgdb-bust-on-ia64.patch
kgdb-build-fix-2.patch

 Sent to Jason

pci-x-pci-express-read-control-interfaces-myrinet.patch
pci-x-pci-express-read-control-interfaces-mthca.patch
pci-x-pci-express-read-control-interfaces-e1000.patch
pci-x-pci-express-read-control-interfaces-qla2xxx.patch

 Will send these to maintainers once
 gregkh-pci-pci-add-pci-x-pci-express-read-control-interfaces.patch
 gets merged.
gen_estimator-fix-locking-and-timer-related-bugs.patch
netpoll-fix-a-leak-n-bug-in-netpoll_cleanup.patch

 I think these might be defunct. Will let the net guys sort that out.

vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch

 This is scary. Will sit and admire it until it has been demonstrated
 to be a net gain.

console-more-buf-for-index-parsing.patch
console-console-handover-to-preferred-console.patch

 Merge

x86-initial-fixmap-support.patch
serial-convert-early_uart-to-earlycon-for-8250.patch
change-zonelist-order-zonelist-order-selection-logic.patch
hugetlb-remove-unnecessary-nid-initialization.patch
mm-use-div_round_up-in-mm-memoryc.patch
make-proc-slabinfo-use-seq_list_xxx-helpers.patch
mm-alloc_large_system_hash-can-free-some-memory-for.patch
remove-the-deprecated-kmem_cache_t-typedef-from-slabh.patch
slob-rework-freelist-handling.patch
slob-remove-bigblock-tracking.patch
slob-improved-alignment-handling.patch
vmscan-fix-comments-related-to-shrink_list.patch

 mm stuff: will merge.

mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
mm-merge-nopfn-into-fault.patch
convert-hugetlbfs-to-use-vm_ops-fault.patch
mm-remove-legacy-cruft.patch
mm-debug-check-for-the-fault-vs-invalidate-race.patch
mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
invalidate_mapping_pages-add-cond_resched.patch
ocfs2-release-page-lock-before-calling-page_mkwrite.patch
document-page_mkwrite-locking.patch

 The fault-vs-invalidate race fix. I have belatedly learned that
 these need more work, so their state is uncertain.
slub-support-slub_debug-on-by-default.patch
numa-mempolicy-dynamic-interleave-map-for-system-init.patch
oom-stop-allocating-user-memory-if-tif_memdie-is-set.patch
numa-mempolicy-trivial-debug-fixes.patch
mm-fix-improper-init-type-section-references.patch
page-table-handling-cleanup.patch
kill-vmalloc_earlyreserve.patch
mm-more-__meminit-annotations.patch
mm-slabc-start_cpu_timer-should-be-__cpuinit.patch
madvise_need_mmap_write-usage.patch
slob-initial-numa-support.patch
mm-page_allocc-lower-printk-severity.patch
mm-avoid-tlb-gather-restarts.patch
mm-remove-ptep_establish.patch
mm-remove-ptep_test_and_clear_dirty-and-ptep_clear_flush_dirty.patch

 mm misc: will merge.

mm-revert-kernel_ds-buffered-write-optimisation.patch
revert-81b0c8713385ce1b1b9058e916edcf9561ad76d6.patch
revert-6527c2bdf1f833cc18e8f42bd97973d583e4aa83.patch
mm-clean-up-buffered-write-code.patch
mm-debug-write-deadlocks.patch
mm-trim-more-holes.patch
mm-buffered-write-cleanup.patch
mm-write-iovec-cleanup.patch
mm-fix-pagecache-write-deadlocks.patch
mm-buffered-write-iterator.patch
fs-fix-data-loss-on-error.patch
fs-introduce-write_begin-write_end-and-perform_write-aops.patch
mm-restore-kernel_ds-optimisations.patch
implement-simple-fs-aops.patch
block_dev-convert-to-new-aops.patch
ext2-convert-to-new-aops.patch
ext3-convert-to-new-aops.patch
ext3-convert-to-new-aops-fix.patch
ext4-convert-to-new-aops.patch
ext4-convert-to-new-aops-fix.patch
xfs-convert-to-new-aops.patch
gfs2-convert-to-new-aops.patch
fs-new-cont-helpers.patch
fat-convert-to-new-aops.patch
#adfs-convert-to-new-aops.patch
hfs-convert-to-new-aops.patch
hfsplus-convert-to-new-aops.patch
hpfs-convert-to-new-aops.patch
bfs-convert-to-new-aops.patch
qnx4-convert-to-new-aops.patch
reiserfs-use-generic-write.patch
reiserfs-convert-to-new-aops.patch
reiserfs-use-generic_cont_expand_simple.patch
with-reiserfs-no-longer-using-the-weird-generic_cont_expand-remove-it-completely.patch
nfs-convert-to-new-aops.patch
smb-convert-to-new-aops.patch
fuse-convert-to-new-aops.patch
hostfs-convert-to-new-aops.patch
jffs2-convert-to-new-aops.patch
ufs-convert-to-new-aops.patch
udf-convert-to-new-aops.patch
sysv-convert-to-new-aops.patch
minix-convert-to-new-aops.patch
jfs-convert-to-new-aops.patch
fs-adfs-convert-to-new-aops.patch
fs-affs-convert-to-new-aops.patch
ocfs2-convert-to-new-aops.patch

 pagefault-in-write deadlock fixes. Will hold for 2.6.24.

fix-read-truncate-race.patch
make-sure-readv-stops-reading-when-it-hits-end-of-file.patch
fs-remove-some-aop_truncated_page.patch
remove-alloc_zeroed_user_highpage.patch

 Will merge if they're mergeable.

add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages.patch
add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated.patch
split-the-free-lists-for-movable-and-unmovable-allocations.patch
choose-pages-from-the-per-cpu-list-based-on-migration-type.patch
add-a-configure-option-to-group-pages-by-mobility.patch
drain-per-cpu-lists-when-high-order-allocations-fail.patch
move-free-pages-between-lists-on-steal.patch
group-short-lived-and-reclaimable-kernel-allocations.patch
group-high-order-atomic-allocations.patch
do-not-group-pages-by-mobility-type-on-low-memory-systems.patch
bias-the-placement-of-kernel-pages-at-lower-pfns.patch
be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback.patch
fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2.patch
bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks.patch
remove-page_group_by_mobility.patch
dont-group-high-order-atomic-allocations.patch
fix-calculation-in-move_freepages_block-for-counting-pages.patch
breakout-page_order-to-internalh-to-avoid-special-knowledge-of-the-buddy-allocator.patch
do-not-depend-on-max_order-when-grouping-pages-by-mobility.patch
print-out-statistics-in-relation-to-fragmentation-avoidance-to-proc-pagetypeinfo.patch

 Mel's page allocator work.
 Might merge this, but I'm still not hearing sufficiently convincing
 noises from a sufficient number of people over this.

create-the-zone_movable-zone.patch
allow-huge-page-allocations-to-use-gfp_high_movable.patch
handle-kernelcore=-generic.patch

 Mel's moveable-zone work. In a similar situation.

 We need to stop whatever we're doing and get down and work out what
 we're going to do with all this stuff.

maps2-uninline-some-functions-in-the-page-walker.patch
maps2-eliminate-the-pmd_walker-struct-in-the-page-walker.patch
maps2-remove-vma-from-args-in-the-page-walker.patch
maps2-propagate-errors-from-callback-in-page-walker.patch
maps2-add-callbacks-for-each-level-to-page-walker.patch
maps2-move-the-page-walker-code-to-lib.patch
maps2-simplify-interdependence-of-proc-pid-maps-and-smaps.patch
maps2-move-clear_refs-code-to-task_mmuc.patch
maps2-regroup-task_mmu-by-interface.patch
maps2-make-proc-pid-smaps-optional-under-config_embedded.patch
maps2-make-proc-pid-clear_refs-option-under-config_embedded.patch
maps2-add-proc-pid-pagemap-interface.patch
maps2-add-proc-kpagemap-interface.patch

 The advanced process-memory-inspection interfaces. These weren't
 quite ready for 2.6.22 and nothing has changed in the past month or
 two. Not looking like 2.6.23 material either.

lumpy-reclaim-v4.patch
have-kswapd-keep-a-minimum-order-free-other-than-order-0.patch
only-check-absolute-watermarks-for-alloc_high-and-alloc_harder-allocations.patch

 Lumpy reclaim. In a similar situation to Mel's patches. Stuck due
 to general lack of interest and effort.

mm-clean-up-and-kernelify-shrinker-registration.patch
mm-clean-up-and-kernelify-shrinker-registration-vs-git-nfs.patch

 Merge.

split-mmap.patch
only-allow-nonlinear-vmas-for-ram-backed-filesystems.patch
mm-document-fault_data-and-flags.patch
slub-mm-only-make-slub-the-default-slab-allocator.patch

 Merge.
slub-exploit-page-mobility-to-increase-allocation-order.patch
slub-reduce-antifrag-max-order.patch

 These are slub changes which are dependent on Mel's stuff, and I have
 a note here that there were reports of page allocation failures with
 these. What's up with that?

 Maybe I should just drop the 100-odd marginal-looking MM patches?
 We're simply not showing compelling reasons for merging them and
 quite a lot of them are stuck in a 90% complete state.

slub-change-error-reporting-format-to-follow-lockdep-loosely.patch
slub-use-list_for_each_entry-for-loops-over-all-slabs.patch
slub-slab-validation-move-tracking-information-alloc-outside-of.patch
slub-ensure-that-the-object-per-slabs-stays-low-for-high-orders.patch
slub-debug-fix-initial-object-debug-state-of-numa-bootstrap-objects.patch
slab-allocators-consolidate-code-for-krealloc-in-mm-utilc.patch
slab-allocators-consistent-zero_size_ptr-support-and-null-result-semantics.patch
slab-allocators-support-__gfp_zero-in-all-allocators.patch
slub-add-some-more-inlines-and-ifdef-config_slub_debug.patch
slub-extract-dma_kmalloc_cache-from-get_cache.patch
slub-do-proper-locking-during-dma-slab-creation.patch
slub-faster-more-efficient-slab-determination-for-__kmalloc.patch
slub-simplify-dma-index-size-calculation.patch
mm-slubc-make-code-static.patch
slub-style-fix-up-the-loop-to-disable-small-slabs.patch
slub-do-not-use-length-parameter-in-slab_alloc.patch
slab-allocators-cleanup-zeroing-allocations.patch
slab-allocators-replace-explicit-zeroing-with-__gfp_zero.patch
slub-do-not-allocate-object-bit-array-on-stack.patch
slub-move-sysfs-operations-outside-of-slub_lock.patch
slub-fix-config_slub_debug-use-for-config_numa.patch

 Slub stuff. Will merge whatever's mergeable after the above droppage
 and stalls.
add-vm_bug_on-in-case-someone-uses-page_mapping-on-a-slab-page.patch
mm-make-needlessly-global-hugetlb_no_page-static.patch

 Merge

fs-introduce-some-page-buffer-invariants.patch
nfs-invariant-fix.patch
fs-introduce-some-page-buffer-invariants-obnoxiousness.patch

 Re-review, maybe merge.

memory-unplug-v7-migration-by-kernel.patch
memory-unplug-v7-isolate_lru_page-fix.patch
memory-unplug-v7-memory-hotplug-cleanup.patch
memory-unplug-v7-page-isolation.patch
memory-unplug-v7-page-offline.patch
memory-unplug-v7-ia64-interface.patch

 These are new, and are dependent on Mel's stuff. Not for 2.6.23.

freezer-make-kernel-threads-nonfreezable-by-default.patch

 Merge, subject to re-review.

implement-file-posix-capabilities.patch
implement-file-posix-capabilities-fix.patch
file-capabilities-introduce-cap_setfcap.patch
file-capabilities-get_file_caps-cleanups.patch
file-caps-update-selinux-xattr-hooks.patch

 file-caps seems to be stuck. There has been some movement lately,
 might merge it subject to suitable acks from suitable parties.
frv-connect-up-new-syscalls.patch
frv-be-self-consistent-and-use-config_gdb_console-everywhere.patch
frv-remove-some-dead-code.patch

 Merge

blackfin-enable-arbitary-speed-serial-setting.patch

 Will send to Bryan when
 lots-of-architectures-enable-arbitary-speed-tty-support.patch is
 merged

nommu-stub-expand_stack-for-nommu-case.patch
m68knommu-use-trhead_size-instead-of-hard-constant.patch
m68knommu-remove-cruft-from-setup-code.patch
m68knommu-remove-old-cache-management-cruft-from-mm-code.patch

 Merge

h8300-enable-arbitary-speed-tty-port-setup.patch
h8300-zimage-support-update.patch

 Merge

alpha-fix-trivial-section-mismatch-warnings.patch
fix-alpha-isa-support.patch

 Merge

arm26-enable-arbitary-speed-tty-ioctls-and-split.patch
arm26-remove-broken-and-unused-macro.patch

 Will send to Ian

freezer-run-show_state-when-freezing-times-out.patch
pm-do-not-require-dev-spew-to-get-pm_debug.patch
swsusp-remove-incorrect-code-from-userc.patch
swsusp-remove-code-duplication-between-diskc-and-userc.patch
swsusp-introduce-restore-platform-operations.patch
swsusp-fix-hibernation-code-ordering.patch
hibernation-prepare-to-enter-the-low-power-state.patch
freezer-avoid-freezing-kernel-threads-prematurely.patch
freezer-use-__set_current_state-in-refrigerator.patch
freezer-return-int-from-freeze_processes.patch
freezer-remove-redundant-check-in-try_to_freeze_tasks.patch
pm-introduce-hibernation-and-suspend-notifiers.patch
pm-disable-usermode-helper-before-hibernation-and-suspend.patch
pm-prevent-frozen-user-mode-helpers-from-failing-the-freezing-of-tasks-rev-2.patch
pm-reduce-code-duplication-between-mainc-and-userc-updated.patch
acpi-do-not-prepare-for-hibernation-in-acpi_shutdown.patch
pm-introduce-pm_power_off_prepare.patch
pm-optional-beeping-during-resume-from-suspend-to-ram.patch
pm-integrate-beeping-flag-with-existing-acpi_sleep-flags.patch

 Merge

m32r-enable-arbitary-speed-tty-rate-setting.patch

 Merge

etrax-enable-arbitary-speed-setting-on-tty-ports.patch
cris-replace-old-style-member-inits-with-designated-inits.patch

 Merge

uml-fix-request-sector-update.patch
uml-use-get_free_pages-to-allocate-kernel-stacks.patch
add-generic-exit-time-stack-depth-checking-to-config_debug_stack_usage.patch
uml-debug_shirq-fixes.patch
uml-xterm-driver-tidying.patch
uml-pty-channel-tidying.patch
uml-handle-errors-on-opening-host-side-of-consoles.patch
uml-sigio-support-cleanup.patch
uml-simplify-helper-stack-handling.patch
uml-eliminate-kernel-allocator-wrappers.patch

 Merge

v850-enable-arbitary-speed-tty-ioctls.patch

 Merge

deprecate-smbfs-in-favour-of-cifs.patch

 Send to sfrench

cpuset-remove-sched-domain-hooks-from-cpusets.patch

 Stuck.

clone-flag-clone_parent_tidptr-leaves-invalid-results-in-memory.patch

 ebiederm no likee. Stuck.

cache-pipe-buf-page-address-for-non-highmem-arch.patch

 Ugly, will probably drop.

fix-rmmod-read-write-races-in-proc-entries.patch

 Merge

more-scheduled-oss-driver-removal.patch
doc-kernel-parameters-use-x86-32-tag-instead-of-ia-32.patch
introduce-write_trylock_irqsave.patch
use-write_trylock_irqsave-in-ptrace_attach.patch
use-menuconfig-objects-ii-auxdisplay.patch
use-menuconfig-objects-ii-edac.patch
use-menuconfig-objects-ii-ipmi.patch
use-menuconfig-objects-ii-misc-strange-dev.patch
use-menuconfig-objects-ii-module-menu.patch
use-menuconfig-objects-ii-oprofile.patch
use-menuconfig-objects-ii-telephony.patch
use-menuconfig-objects-ii-tpm.patch
use-menuconfig-objects-connector.patch
use-menuconfig-objects-crypto-hw.patch
use-menuconfig-objects-i2o.patch
use-menuconfig-objects-parport.patch
use-menuconfig-objects-pnp.patch
use-menuconfig-objects-w1.patch
fix-jvc-cdrom-drive-lockup.patch
use-no_pci_devices-in-pci-searchc.patch
introduce-boot-based-time.patch
use-boot-based-time-for-process-start-time-and-boot-time.patch
use-boot-based-time-for-uptime-in-proc.patch
udf-check-for-allocated-memory-for-data-of-new-inodes.patch
add-argv_split-fix.patch
add-common-orderly_poweroff-fix.patch
prevent-an-o_ndelay-writer-from-blocking-when-a-tty-write-is-blocked-by.patch
udf-check-for-allocated-memory-for-inode-data-v2.patch
fix-stop_machine_run-problem-with-naughty-real-time-process.patch
cpu-hotplug-fix-ksoftirqd-termination-on-cpu-hotplug-with-naughty-realtime-process.patch
use-mutexes-instead-of-semaphores-in-i2o-driver.patch
fuse-warning-fix.patch
vxfs-warning-fixes.patch
percpu_counters-use-cpu-notifiers.patch
percpu_counters-use-for_each_online_cpu.patch
make-afs-use-seq_list_xxx-helpers.patch
make-crypto-api-use-seq_list_xxx-helpers.patch
make-proc-misc-use-seq_list_xxx-helpers.patch
make-proc-modules-use-seq_list_xxx-helpers.patch
make-proc-tty-drivers-use-seq_list_xxx-helpers.patch
make-proc-self-mountstats-use-seq_list_xxx-helpers.patch
make-nfs-client-use-seq_list_xxx-helpers.patch
fat-gcc-43-warning-fix.patch
remove-unnecessary-includes-of-spinlockh-under-include-linux.patch
drivers-block-z2ram-remove-true-false-defines.patch
fix-compiler-warnings-in-acornc.patch
update-zilog-timeout.patch
edd-switch-to-pci_get-based-api.patch
fix-up-codingstyle-in-isofs.patch
define-config_bounce-to-avoid-useless-inclusion-of-bounce-buffer.patch
mpu401-warning-fixes.patch
introduce-config_virt_to_bus.patch
pie-randomization.patch
remove-unused-tif_notify_resume-flag.patch
rocketc-fix-unchecked-mutex_lock_interruptible.patch
only-send-sigxfsz-when-exceeding-rlimits.patch
procfs-directory-entry-cleanup.patch
8xx-fix-whitespace-and-indentation.patch
vdso-print-fatal-signals.patch
rtc-ratelimit-lost-interrupts-message.patch
reduce-cpusetc-write_lock_irq-to-read_lock.patch
char-n_hdlc-allow-restartsys-retval-of-tty-write.patch
afs-implement-file-locking.patch
tty_io-use-kzalloc.patch
remove-clockevents_releaserequest_device.patch
kconfig-no-strange-misc-devices.patch
afs-drop-explicit-extern.patch
remove-useless-tolower-in-isofs.patch
char-mxser_new-fix-sparse-warning.patch
char-tty_ioctl-use-wait_event_interruptible_timeout.patch
char-tty_ioctl-little-whitespace-cleanup.patch char-genrtc-use-wait_event_interruptible.patch char-n_r3964-use-wait_event_interruptible.patch char-ip2-use-msleep-for-sleeping.patch proc-environ-wrong-placing-of-ptrace_may_attach-check.patch udf-coding-style-conversion-lindent.patch ext2-fix-a-comment-when-ext2_release_file-is-called.patch mutex_unlock-later-in-seq_lseek.patch zs-move-to-the-serial-subsystem.patch fs-block_devc-use-list_for_each_entry.patch fault-injection-add-min-order-parameter-to-fail_page_alloc.patch fault-injection-fix-example-scripts-in-documentation.patch add-printktime-option-deprecate-time.patch fs-clarify-dummy-member-in-struct.patch dma-mapping-prevent-dma-dependent-code-from-linking-on.patch remove-odd-and-misleading-comments-from-uioh.patch add-a-flag-to-indicate-deferrable-timers-in-proc-timer_stats.patch buffer-kill-old-incorrect-comment.patch introduce-o_cloexec-take-2.patch o_cloexec-for-scm_rights.patch init-wait-for-asynchronously-scanned-block-devices.patch atmel_serial-fix-break-handling.patch documentation-proc-pid-stat-files.patch seq_file-more-atomicity-in-traverse.patch lib-add-idr_for_each.patch lib-add-idr_remove_all.patch remove-capabilityh-from-mmh.patch kernel-utf-8-handling.patch remove-sonypi_camera_command.patch drop-an-empty-isicomh-from-being-exported-to-user-space.patch ext3-ext4-orphan-list-check-on-destroy_inode.patch ext3-ext4-orphan-list-corruption-due-bad-inode.patch remove-apparently-useless-commented-apm_get_battery_status.patch taskstats-add-context-switch-counters.patch sony-laptop-use-null-for-pointer.patch undeprecate-raw-driver.patch hfsplus-change-kmalloc-memset-to-kzalloc.patch submitchecklist-update-fix-spelling-error.patch add-support-for-xilinx-systemace-compactflash-interface.patch fix-typo-in-prefetchh.patch zsc-drain-the-transmission-line.patch hugetlbfs-use-lib-parser-fix-docs.patch report-that-kernel-is-tainted-if-there-were-an-oops-before.patch 
intel-rng-undo-mess-made-by-an-80-column-extremist.patch improve-behaviour-of-spurious-irq-detect.patch audit-add-tty-input-auditing.patch remove-config_uts_ns-and-config_ipc_ns.patch Merge, subject to re-review. user-namespace-add-the-framework.patch I still think the magical root-user thing in here is odd and perhaps poorly thought-out. user-namespace-add-unshare.patch revert-vanishing-ioctl-handler-debugging.patch binfmt_elf-warning-fix.patch document-the-fact-that-rcu-callbacks-can-run-in-parallel.patch cobalt-remove-all-references-to-cobalt-nvram.patch allow-softlockup-to-be-runtime-disabled.patch dirty_writeback_centisecs_handler-cleanup.patch mm-fix-create_new_namespaces-return-value.patch add-a-kmem_cache-for-nsproxy-objects.patch ptrace_peekdata-consolidation.patch ptrace_pokedata-consolidation.patch adjust-nosmp-handling.patch ext3-fix-deadlock-in-ext3_remount-and-orphan-list-handling.patch ext4-fix-deadlock-in-ext4_remount-and-orphan-list-handling.patch remove-unused-lock_cpu_hotplug_interruptible-definition.patch kerneldoc-fix-in-audit_core_dumps.patch introduce-compat_u64-and-compat_s64-types.patch diskquota-32bit-quota-tools-on-64bit-architectures.patch remove-final-two-references-to-__obsolete_setup-macro.patch update-procfs-guide-doc-of-read_func.patch ext3-remove-extra-is_rdonly-check.patch namespace-ensure-clone_flags-are-always-stored-in-an-unsigned-long.patch doc-oops-tracing-add-code-decode-info.patch drop-obsolete-sys_ioctl-export.patch is_power_of_2-ext3-superc.patch is_power_of_2-jbd.patch Merge, subject to re-review sys_time-speedup.patch Am skeptical about this one. 
cdrom-replace-hard-coded-constants-by-kernelh-macro.patch update-description-in-documentation-filesystems-vfstxt-typo-fixed.patch futex-tidy-up-the-code-v2.patch add-documentation-sysctl-ctl_unnumberedtxt.patch sysctlc-add-text-telling-people-to-use-ctl_unnumbered.patch # drivers-pmc-msp71xx-gpio-char-driver.patch: david-b panned it drivers-pmc-msp71xx-gpio-char-driver.patch mistaken-ext4_inode_bitmap-for-ext4_block_bitmap.patch hfs-refactor-ascii-to-unicode-conversion-routine.patch hfs-add-custom-dentry-hash-and-comparison-operations.patch sprint_symbol-cleanup.patch Merge, subject to re-review. hwrng-add-type-categories.patch This generated a flamewar. Will probably drop. fs-namespacec-should-include-internalh.patch proper-prototype-for-proc_nr_files.patch replace-obscure-constructs-in-fs-block_devc.patch bd_claim_by_disk-fix-warning.patch fs-reiserfs-cleanups.patch adb_probe_task-remove-unneeded-flush_signals-call.patch kcdrwd-remove-unneeded-flush_signals-call.patch nbdcsock_xmit-cleanup-signal-related-code.patch move-seccomp-from-proc-to-a-prctl.patch make-seccomp-zerocost-in-schedule.patch is_power_of_2-kernel-kfifoc.patch parport_pc-it887x-fix.patch is_power_of_2-ufs-superc.patch codingstyle-add-information-about-trailing-whitespace.patch codingstyle-add-information-about-editor-modelines.patch uninline-check_signature.patch add-werror-implicit-function-declaration.patch generic-bug-use-show_regs-instead-of-dump_stack.patch udf-fix-function-name-from-udf_crc16-to-udf_crc.patch dma-make-dma-pool-to-use-kmalloc_node.patch unregister_chrdev-ignore-the-return-value.patch unregister_chrdev-return-void.patch unregister_blkdev-do-warn_on-on-failure.patch unregister_blkdev-delete-redundant-messages-in-callers.patch unregister_blkdev-delete-redundant-message.patch unregister_blkdev-return-void.patch add-missing-files-and-dirs-to-00-index-in-documentation.patch remove-the-last-few-umsdos-leftovers.patch update-documentation-filesystems-vfstxt-second-part.patch 
rename-cancel_rearming_delayed_work-to-cancel_delayed_work_sync.patch make-cancel_xxx_work_sync-return-a-boolean.patch ext3-fix-error-handling-in-ext3_create_journal.patch ext4-fix-error-handling-in-ext4_create_journal.patch modules-remove-modlist_lock.patch amiserial-remove-incorrect-no-termios-change-check.patch genericserial-remove-bogus-optimisation-check-and-dead-code-paths.patch synclink-remove-bogus-no-change-termios-optimisation.patch 68360serial-remove-broken-optimisation.patch serial-remove-termios-checks-from-various-old-char-serial.patch docs-static-initialization-of-spinlocks-is-ok.patch kernel-printkc-document-possible-deadlock-against-scheduler.patch remove-mm-backing-devccongestion_wait_interruptible.patch gitignore-update.patch isapnp-remove-pointless-check-of-type-against-0-in-isapnp_read_tag.patch fix-trivial-typos-in-anon_inodesc-comments.patch vsprintfc-optimizing-part-1-easy-and-obvious-stuff.patch vsprintfc-optimizing-part-2-base-10-conversion-speedup-v2.patch drivers-char-ipmi-ipmi_poweroffc-lower-printk-severity.patch drivers-char-ipmi-ipmi_si_intfc-lower-printk-severity.patch drivers-block-rdc-lower-printk-severity.patch ext2-statfs-speed-up.patch ext3-statfs-speed-up.patch ext4-statfs-speed-up.patch permit-mempool_freenull.patch nls-remove-obsolete-makefile-entries.patch compat32-ignore-the-loop_clr_fd-ioctl.patch ia64-arbitary-speed-tty-ioctl-support.patch Merge, subject to re-review. 
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch writeback-fix-comment-use-helper-function.patch writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch I guess these should be merged. There are still bugs in there which I think Ken Chen has fixed, but I haven't got onto that yet. introduce-i_sync.patch introduce-i_sync-fix.patch Merge, I guess. ibmasm-whitespace-cleanup.patch ibmasm-dont-use-extern-in-function-declarations.patch ibmasm-miscellaneous-fixes.patch ibmasm-must-depend-on-config_input.patch Merge. sync_sb_inodes-propagate-errors.patch Needs work. spi-controller-drivers-check-for-unsupported-modes.patch spi-add-3wire-mode-flag.patch crc7-support.patch spidev-compiler-warning-gone.patch spi_lm70llp-parport-adapter-driver.patch spi_mpc83xxc-underclocking-hotfix.patch atmel_spi-minor-updates.patch s3c24xx-spi-controllers-both-select-bitbang.patch spi-tle620x-power-switch-driver.patch spi-master-driver-for-xilinx-virtex.patch spi_mpc83xxc-support-qe-enabled-83xx-cpus-like-mpc832x.patch spi-omap2_mcspi-driver.patch spi_txx9-controller-driver.patch Merge move-page-writeback-acounting-out-of-macros.patch ext2-balloc-use-io_error-label.patch Might merge. ext2-reservations.patch Still needs decent testing. 
use-mutex-instead-of-semaphore-in-capi-20-driver.patch mismatching-declarations-of-revision-strings-in-hisax.patch make-isdn-capi-use-seq_list_xxx-helpers.patch update-isdn-tree-to-use-pci_get_device.patch sane-irq-initialization-in-sedlbauer-hisax.patch use-menuconfig-objects-isdn-config_isdn.patch use-menuconfig-objects-isdn-config_isdn_drv_gigaset.patch use-menuconfig-objects-isdn-config_isdn_capi.patch use-menuconfig-objects-isdn-config_capi_avm.patch use-menuconfig-objects-isdn-config_capi_eicon.patch isdn-capi-warning-fixes.patch i4l-leak-in-eicon-idifuncc.patch Merge use-menuconfig-objects-isdn-config_isdn_i4l.patch tilman didn't like it - might drop i2o_cfg_passthru-cleanup.patch wrong-memory-access-in-i2o_block_device_lock.patch i2o-message-leak-in-i2o_msg_post_wait_mem.patch i2o-proc-reading-oops.patch i2o-debug-output-cleanup.patch Merge knfsd-exportfs-add-exportfsh-header.patch knfsd-exportfs-remove-iget-abuse.patch knfsd-exportfs-add-procedural-interface-for-nfsd.patch knfsd-exportfs-remove-call-macro.patch knfsd-exportfs-untangle-isdir-logic-in-find_exported_dentry.patch knfsd-exportfs-move-acceptable-check-into-find_acceptable_alias.patch knfsd-exportfs-add-find_disconnected_root-helper.patch knfsd-exportfs-split-out-reconnecting-a-dentry-from-find_exported_dentry.patch nfsd-warning-fix.patch knfsd-lockd-nfsd4-use-same-grace-period-for-lockd-and-nfsd4.patch knfsd-nfsd4-fix-nfsv4-filehandle-size-units-confusion.patch knfsd-nfsd4-silence-a-compiler-warning-in-acl-code.patch knfsd-nfsd4-fix-enc_stateid_sz-for-nfsd-callbacks.patch knfsd-nfsd4-fix-handling-of-acl-errrors.patch knfsd-nfsd-remove-unused-header-interfaceh.patch knfsd-nfsd4-vary-maximum-delegation-limit-based-on-ram-size.patch knfsd-nfsd4-dont-delegate-files-that-have-had-conflicts.patch Merge couple-fixes-to-fs-ecryptfs-inodec.patch ecryptfs-move-ecryptfs-docs-into-documentation-filesystems.patch Merge rtc-ds1307-cleanups.patch rtc-rs5c372-becomes-a-new-style-i2c-driver.patch 
thecus-n2100-register-rtc-rs5c372-i2c-device.patch rtc-make-example-code-jump-to-done-instead-of-return-when-ioctl-not-supported.patch rtc-dev-return-enotty-in-ioctl-if-irq_set_freq-is-not-implemented-by-driver.patch driver-for-the-atmel-on-chip-rtc-on-at32ap700x-devices.patch rtc_class-is-no-longer-considered-experimental.patch rtc-kconfig-tweax.patch rtc-add-rtc-m41t80-driver-take-2.patch rtc-watchdog-support-for-rtc-m41t80-driver-take-2.patch rtc-add-support-for-the-st-m48t59-rtc.patch rtc-add-support-for-the-st-m48t59-rtc-vs-git-acpi.patch rtc-driver-for-ds1216-chips.patch rtc-driver-for-ds1216-chips-fix.patch rtc-ds1307-oscillator-restart-for-ds1337383940.patch Merge. revoke-special-mmap-handling.patch revoke-special-mmap-handling-vs-fault-vs-invalidate.patch revoke-core-code.patch revoke-support-for-ext2-and-ext3.patch revoke-add-documentation.patch revoke-wire-up-i386-system-calls.patch fs-introduce-write_begin-write_end-and-perform_write-aops-revoke.patch revoke-vs-git-block.patch Don't know. Need to ping suitable developers over this work. 
lguest-export-symbols-for-lguest-as-a-module.patch lguest-the-guest-code.patch lguest-the-host-code.patch lguest-the-host-code-lguest-vs-clockevents-fix-resume-logic.patch lguest-the-asm-offsets.patch lguest-the-makefile-and-kconfig.patch lguest-the-console-driver.patch lguest-the-net-driver.patch lguest-the-block-driver.patch lguest-the-documentation-example-launcher.patch Merge oss-trident-massive-whitespace-removal.patch oss-trident-fix-locking-around-write_voice_regs.patch oss-trident-replace-deprecated-pci_find_device-with-pci_get_device.patch remove-options-depending-on-oss_obsolete.patch Merge unprivileged-mounts-add-user-mounts-to-the-kernel.patch unprivileged-mounts-allow-unprivileged-umount.patch unprivileged-mounts-account-user-mounts.patch unprivileged-mounts-propagate-error-values-from-clone_mnt.patch unprivileged-mounts-allow-unprivileged-bind-mounts.patch unprivileged-mounts-put-declaration-of-put_filesystem-in-fsh.patch unprivileged-mounts-allow-unprivileged-mounts.patch unprivileged-mounts-allow-unprivileged-fuse-mounts.patch unprivileged-mounts-propagation-inherit-owner-from-parent.patch unprivileged-mounts-add-no-submounts-flag.patch Don't know. Need to ping suitable developers over this work. 
char-cyclades-add-firmware-loading.patch char-cyclades-fix-sparse-warning.patch char-isicom-cleanup-locking.patch char-isicom-del_timer-at-exit.patch char-isicom-proper-variables-types.patch char-moxa-eliminate-busy-waiting.patch char-specialix-remove-busy-waiting.patch char-riscom8-eliminate-busy-loop.patch char-vt-use-kzalloc.patch char-vt-use-array_size.patch char-kconfig-mxser_new-remove-experimental-comment.patch char-stallion-remove-user-class-report-request.patch char-istallion-initlocking-fixes-try-2.patch stallion-remove-unneeded-lock_kernel.patch Merge fbcon-smart-blitter-usage-for-scrolling.patch nvidiafb-adjust-flags-to-take-advantage-of-new-scroll-method.patch fbcon-cursor-blink-control.patch fbcon-use-struct-device-instead-of-struct-class_device.patch fbdev-move-arch-specific-bits-to-their-respective.patch fbdev-detect-primary-display-device.patch fbcon-allow-fbcon-to-use-the-primary-display-driver.patch radeonfb-add-support-for-radeon-xpress-200m-rs485.patch nvidiafb-add-proper-support-for-geforce-7600-chipset.patch pm2fb-white-spaces-clean-up.patch fbcon-set_con2fb_map-fixes.patch fbcon-revise-primary-device-selection.patch fbdev-fbcon-console-unregistration-from-unregister_framebuffer.patch vt-add-comment-for-unbind_con_driver.patch 68328fb-the-pseudo_palette-is-only-16-elements-long.patch controlfb-the-pseudo_palette-is-only-16-elements-long.patch cyblafb-fix-pseudo_palette-array-overrun-in-setcolreg.patch epson1355fb-color-setting-fixes.patch fm2fb-the-pseudo_palette-is-only-16-elements-long.patch gbefb-the-pseudo_palette-is-only-16-elements-long.patch macfb-fix-pseudo_palette-size-and-overrun.patch offb-the-pseudo_palette-is-only-16-elements-long.patch platinumfb-the-pseudo_palette-is-only-16-elements.patch pvr2fb-fix-pseudo_palette-array-overrun-and-typecast.patch q40fb-the-pseudo_palette-is-only-16-elements-long.patch sgivwfb-the-pseudo_palette-is-only-16-elements-long.patch tgafb-actually-allocate-memory-for-the-pseudo_palette.patch 
tridentfb-fix-pseudo_palette-array-overrun-in-setcolreg.patch tx3912fb-fix-improper-assignment-of-info-pseudo_palette.patch atyfb-the-pseudo_palette-is-only-16-elements-long.patch radeonfb-the-pseudo_palette-is-only-16-elements-long.patch i810fb-the-pseudo_palette-is-only-16-elements-long.patch intelfb-the-pseudo_palette-is-only-16-elements-long.patch sisfb-fix-pseudo_palette-array-size-and-overrun.patch matroxfb-color-setting-fixes.patch pm3fb-fillrect-acceleration.patch pm3fb-possible-cleanups.patch vt8623fbc-make-code-static.patch matroxfb-color-setting-fixes-fix.patch fb-epson1355fb-kill-off-dead-sh-support.patch fix-the-graphic-corruption-issue-on-ia64-machines.patch Merge omap-add-ti-omap-framebuffer-driver.patch omap-add-ti-omap1610-accelerator-entry.patch omap-add-ti-omap1-internal-lcd-controller.patch omap-add-ti-omap2-internal-display-controller-support.patch omap-add-ti-omap1-external-lcd-controller-support-sossi.patch omap-add-ti-omap2-external-lcd-controller-support-rfbi.patch omap-add-external-epson-hwa742-lcd-controller-support.patch omap-add-external-epson-blizzard-lcd-controller-support.patch omap-lcd-panel-support-for-the-ti-omap-h4-board.patch omap-lcd-panel-support-for-the-ti-omap-h3-board.patch omap-lcd-panel-support-for-the-palm-tungsten-e.patch omap-lcd-panel-support-for-palm-tungstent.patch omap-lcd-panel-support-for-the-palm-zire71.patch omap-lcd-panel-support-for-the-ti-omap1610-innovator-board.patch omap-lcd-panel-support-for-the-ti-omap1510-innovator-board.patch omap-lcd-panel-support-for-the-ti-omap-osk-board.patch omap-lcd-panel-support-for-the-siemens-sx1-mobile-phone.patch Merge use-menuconfig-objects-ii-md.patch md-improve-message-about-invalid-superblock-during-autodetect.patch md-improve-the-is_mddev_idle-test-fix.patch md-check-that-internal-bitmap-does-not-overlap-other-data.patch md-change-bitmap_unplug-and-others-to-void-functions.patch Merge raid5-add-the-stripe_queue-object-for-tracking-raid.patch 
raid5-use-stripe_queues-to-prioritize-the-most.patch Ping Neil readahead-introduce-pg_readahead.patch readahead-add-look-ahead-support-to-__do_page_cache_readahead.patch readahead-min_ra_pages-max_ra_pages-macros.patch readahead-data-structure-and-routines.patch readahead-on-demand-readahead-logic.patch readahead-convert-filemap-invocations.patch readahead-convert-splice-invocations.patch readahead-convert-ext3-ext4-invocations.patch readahead-remove-the-old-algorithm.patch readahead-move-synchronous-readahead-call-out-of-splice-loop.patch readahead-pass-real-splice-size.patch mm-share-pg_readahead-and-pg_reclaim.patch readahead-split-ondemand-readahead-interface-into-two-functions.patch readahead-sanify-file_ra_state-names.patch Merge fallocate-implementation-on-i86-x86_64-and-powerpc.patch fallocate-on-s390.patch fallocate-on-ia64.patch fallocate-on-ia64-fix.patch Merge. jprobes-make-struct-jprobeentry-a-void.patch jprobes-remove-jprobe_entry.patch jprobes-make-jprobes-a-little-safer-for-users.patch Merge. intel-iommu-dmar-detection-and-parsing-logic.patch intel-iommu-pci-generic-helper-function.patch intel-iommu-clflush_cache_range-now-takes-size-param.patch intel-iommu-iova-allocation-and-management-routines.patch intel-iommu-intel-iommu-driver.patch intel-iommu-avoid-memory-allocation-failures-in-dma-map-api-calls.patch intel-iommu-intel-iommu-cmdline-option-forcedac.patch intel-iommu-dmar-fault-handling-support.patch intel-iommu-iommu-gfx-workaround.patch intel-iommu-iommu-floppy-workaround.patch Don't know. I don't think there were any great objections, but I don't think much benefit has been demonstrated? define-new-percpu-interface-for-shared-data-version-4.patch use-the-new-percpu-interface-for-shared-data-version-4.patch Merge arch-personality-independent-stack-top.patch audit-rework-execve-audit.patch mm-variable-length-argument-support.patch Merge. 
ext4-zero_user_page-conversion.patch ext4-remove-extra-is_rdonly-check.patch is_power_of_2-ext4-superc.patch Send to tytso fs-introduce-vfs_path_lookup.patch sunrpc-use-vfs_path_lookup.patch nfsctl-use-vfs_path_lookup.patch fs-mark-link_path_walk-static.patch fs-remove-path_walk-export.patch Merge, after poking suitable maintainers kernel-doc-add-tools-doc-in-makefile.patch kernel-doc-fix-unnamed-struct-union-warning.patch kernel-doc-strip-c99-comments.patch kernel-doc-fix-leading-dot-in-man-mode-output.patch Merge coredump-masking-bound-suid_dumpable-sysctl.patch coredump-masking-reimplementation-of-dumpable-using-two-flags.patch coredump-masking-add-an-interface-for-core-dump-filter.patch coredump-masking-elf-enable-core-dump-filtering.patch coredump-masking-elf-fdpic-remove-an-unused-argument.patch coredump-masking-elf-fdpic-enable-core-dump-filtering.patch coredump-masking-documentation-for-proc-pid-coredump_filter.patch Merge kernel-relayc-make-functions-static.patch Merge configfsdlm-separate-out-__configfs_attr-into-configfsh.patch configfsdlmocfs2-convert-subsystem-semaphore-to-mutex.patch configfsdlm-rename-config_group_find_obj-and-state-semantics-clearly.patch Merge, subject to Joel acks use-data_data-in-cris.patch add-missing-data_data-in-powerpc.patch use-data_data-in-xtensa.patch Merge drivers-edac-add-edac_mc_find-api.patch drivers-edac-core-make-functions-static.patch drivers-edac-add-rddr2-memory-types.patch drivers-edac-split-out-functions-to-unique-files.patch drivers-edac-add-edac_device-class.patch drivers-edac-mc-sysfs-add-missing-mem-types.patch drivers-edac-change-from-semaphore-to-mutex-operation.patch drivers-edac-new-intel-5000-mc-driver.patch drivers-edac-new-intel-5000-mc-driver-fix.patch drivers-edac-coreh-fix-scrubdefs.patch drivers-edac-new-i82443bxgz-mc-driver.patch drivers-edac-new-i82443bxgz-mc-driver-broken.patch drivers-edac-add-new-nmi-rescan.patch drivers-edac-mod-use-edac_coreh.patch 
drivers-edac-add-dev_name-getter-function.patch drivers-edac-new-inte-30x0-mc-driver.patch drivers-edac-mod-mc-to-use-workq-instead-of-kthread.patch drivers-edac-updated-pci-monitoring.patch drivers-edac-mod-assert_error-check.patch drivers-edac-mod-pci-poll-names.patch drivers-edac-core-lindent-cleanup.patch drivers-edac-edac_device-sysfs-cleanup.patch drivers-edac-cleanup-workq-ifdefs.patch drivers-edac-lindent-amd76x.patch drivers-edac-lindent-i5000.patch drivers-edac-lindent-e7xxx.patch drivers-edac-lindent-i3000.patch drivers-edac-lindent-i82860.patch drivers-edac-lindent-i82875p.patch drivers-edac-lindent-e752x.patch drivers-edac-lindent-i82443bxgx.patch drivers-edac-lindent-r82600.patch drivers-edac-drivers-to-use-new-pci-operation.patch drivers-edac-add-device-sysfs-attributes.patch drivers-edac-device-output-clenaup.patch drivers-edac-add-info-kconfig.patch drivers-edac-update-maintainers-files-for-edac.patch drivers-edac-cleanup-spaces-gotos-after-lindent-messup.patch driver-edac-add-mips-and-ppc-visibility.patch driver-edac-mod-race-fix-i82875p.patch driver-edac-fix-ignored-return-i82875p.patch include-linux-pci_id-h-add-amd-northbridge-defines.patch driver-edac-i5000-define-typo.patch driver-edac-remove-null-from-statics.patch driver-edac-i5000-code-tidying.patch driver-edac-edac_device-code-tidying.patch driver-edac-mod-edac_align_ptr-function.patch driver-edac-mod-edac_opt_state_to_string-function.patch driver-edac-remove-file-edac_mc-h.patch Probably hold - there are sysfs issues and a large number of update patches in my inbox. Might merge, undecided. 
cpuset-zero-malloc-revert-the-old-cpuset-fix.patch containersv10-basic-container-framework.patch containersv10-basic-container-framework-fix.patch containersv10-basic-container-framework-fix-2.patch containersv10-basic-container-framework-fix-3.patch containersv10-basic-container-framework-fix-for-bad-lock-balance-in-containers.patch containersv10-example-cpu-accounting-subsystem.patch containersv10-example-cpu-accounting-subsystem-fix.patch containersv10-add-tasks-file-interface.patch containersv10-add-tasks-file-interface-fix.patch containersv10-add-tasks-file-interface-fix-2.patch containersv10-add-fork-exit-hooks.patch containersv10-add-fork-exit-hooks-fix.patch containersv10-add-container_clone-interface.patch containersv10-add-container_clone-interface-fix.patch containersv10-add-procfs-interface.patch containersv10-add-procfs-interface-fix.patch containersv10-make-cpusets-a-client-of-containers.patch containersv10-make-cpusets-a-client-of-containers-whitespace.patch containersv10-share-css_group-arrays-between-tasks-with-same-container-memberships.patch containersv10-share-css_group-arrays-between-tasks-with-same-container-memberships-fix.patch containersv10-share-css_group-arrays-between-tasks-with-same-container-memberships-cpuset-zero-malloc-fix-for-new-containers.patch containersv10-simple-debug-info-subsystem.patch containersv10-simple-debug-info-subsystem-fix.patch containersv10-simple-debug-info-subsystem-fix-2.patch containersv10-support-for-automatic-userspace-release-agents.patch containersv10-support-for-automatic-userspace-release-agents-whitespace.patch add-containerstats-v3.patch add-containerstats-v3-fix.patch update-getdelays-to-become-containerstats-aware.patch containers-implement-subsys-post_clone.patch containers-implement-namespace-tracking-subsystem-v3.patch Container stuff. Hold, I guess. I was expecting updates from Paul. 
fix-raw_spinlock_t-vs-lockdep.patch lockdep-sanitise-config_prove_locking.patch lockdep-reduce-the-ifdeffery.patch lockstat-core-infrastructure.patch lockstat-human-readability-tweaks.patch lockstat-hook-into-spinlock_t-rwlock_t-rwsem-and-mutex.patch Merge lockdep-various-fixes.patch lockdep-fixup-sk_callback_lock-annotation.patch lockstat-measure-lock-bouncing.patch lockstat-better-class-name-representation.patch lockdep-debugging-give-stacktrace-for-init_error.patch stacktrace-fix-header-file-for-config_stacktrace.patch Merge some-kmalloc-memset-kzalloc-tree-wide.patch Merge reiser4-sb_sync_inodes.patch reiser4-export-remove_from_page_cache.patch reiser4-export-radix_tree_preload.patch reiser4-export-find_get_pages.patch make-copy_from_user_inatomic-not-zero-the-tail-on-i386-vs-reiser4.patch reiser4.patch mm-clean-up-and-kernelify-shrinker-registration-reiser4.patch reiser4-fix-for-new-aops-patches.patch git-block-vs-reiser4.patch Hold. make-sure-nobodys-leaking-resources.patch journal_add_journal_head-debug.patch page-owner-tracking-leak-detector.patch releasing-resources-with-children.patch nr_blockdev_pages-in_interrupt-warning.patch detect-atomic-counter-underflows.patch device-suspend-debug.patch #slab-cache-shrinker-statistics.patch mm-debug-dump-pageframes-on-bad_page.patch make-frame_pointer-default=y.patch mutex-subsystem-synchro-test-module.patch slab-leaks3-default-y.patch profile-likely-unlikely-macros.patch put_bh-debug.patch acpi_format_exception-debug.patch lockdep-show-held-locks-when-showing-a-stackdump.patch add-debugging-aid-for-memory-initialisation-problems.patch kmap_atomic-debugging.patch shrink_slab-handle-bad-shrinkers.patch keep-track-of-network-interface-renaming.patch workaround-for-a-pci-restoring-bug.patch prio_tree-debugging-patch.patch check_dirty_inode_list.patch alloc_pages-debug.patch squash-ipc-warnings.patch random-warning-squishes.patch w1-build-fix.patch -mm only things. 
^ permalink raw reply [flat|nested] 535+ messages in thread
* intel iommu (Re: -mm merge plans for 2.6.23) 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton @ 2007-07-10 9:04 ` Jan Engelhardt 2007-07-10 9:07 ` -mm merge plans for 2.6.23 -- sys_fallocate Heiko Carstens ` (25 subsequent siblings) 26 siblings, 0 replies; 535+ messages in thread From: Jan Engelhardt @ 2007-07-10 9:04 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel On Jul 10 2007 01:31, Andrew Morton wrote: >intel-iommu-dmar-detection-and-parsing-logic.patch >intel-iommu-pci-generic-helper-function.patch >intel-iommu-clflush_cache_range-now-takes-size-param.patch >intel-iommu-iova-allocation-and-management-routines.patch >intel-iommu-intel-iommu-driver.patch >intel-iommu-avoid-memory-allocation-failures-in-dma-map-api-calls.patch >intel-iommu-intel-iommu-cmdline-option-forcedac.patch >intel-iommu-dmar-fault-handling-support.patch >intel-iommu-iommu-gfx-workaround.patch >intel-iommu-iommu-floppy-workaround.patch Here's some fix: Signed-off-by: Jan Engelhardt <jengelh@gmx.de> --- arch/x86_64/Kconfig | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) Index: linux-2.6.22-rc6/arch/x86_64/Kconfig =================================================================== --- linux-2.6.22-rc6.orig/arch/x86_64/Kconfig +++ linux-2.6.22-rc6/arch/x86_64/Kconfig @@ -753,11 +753,11 @@ config DMAR depends on PCI_MSI && ACPI && EXPERIMENTAL default y help - DMA remapping(DMAR) devices support enables independent address - translations for Direct Memory Access(DMA) from Devices. + DMA remapping (DMAR) devices support enables independent address + translations for Direct Memory Access (DMA) from devices. These DMA remapping devices are reported via ACPI tables - and includes pci device scope covered by these DMA - remapping device. + and include PCI device scope covered by these DMA + remapping devices. 
config DMAR_GFX_WA bool "Support for Graphics workaround" @@ -765,9 +765,9 @@ config DMAR_GFX_WA default y help Current Graphics drivers tend to use physical address - for DMA and avoid using DMA api's. Setting this config + for DMA and avoid using DMA APIs. Setting this config option permits the IOMMU driver to set a unity map for - all the OS visible memory. Hence the driver can continue + all the OS-visible memory. Hence the driver can continue to use physical addresses for DMA. config DMAR_FLOPPY_WA @@ -775,10 +775,10 @@ config DMAR_FLOPPY_WA depends on DMAR default y help - Floppy disk drivers are know to by pass dma api calls - their by failing to work when IOMMU is enabled. This - work around will setup a 1 to 1 mappings for the first - 16M to make floppy(isa device) work. + Floppy disk drivers are known to bypass DMA API calls + thereby failing to work when IOMMU is enabled. This + workaround will set up a 1:1 mapping for the first + 16M to make floppy (an ISA device) work. source "drivers/pci/pcie/Kconfig" ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 -- sys_fallocate 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton 2007-07-10 9:04 ` intel iommu (Re: -mm merge plans for 2.6.23) Jan Engelhardt @ 2007-07-10 9:07 ` Heiko Carstens 2007-07-10 9:22 ` Andrew Morton 2007-07-10 9:17 ` cpuset-remove-sched-domain-hooks-from-cpusets Paul Jackson ` (24 subsequent siblings) 26 siblings, 1 reply; 535+ messages in thread From: Heiko Carstens @ 2007-07-10 9:07 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Andi Kleen, Amit Arora, Martin Schwidefsky > fallocate-implementation-on-i86-x86_64-and-powerpc.patch Still broken: arch/x86_64/ia32/ia32entry.S wants compat_sys_fallocate instead of sys_fallocate. Also compat_sys_fallocate probably should be moved to fs/compat.c. > fallocate-on-s390.patch We reserved a different syscall number than the one that is used right now in the patch. Please drop this patch... Martin or I will wire up the syscall as soon as the x86 variant is merged. Everything else just causes trouble and confusion. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 -- sys_fallocate 2007-07-10 9:07 ` -mm merge plans for 2.6.23 -- sys_fallocate Heiko Carstens @ 2007-07-10 9:22 ` Andrew Morton 2007-07-10 15:45 ` Theodore Tso 0 siblings, 1 reply; 535+ messages in thread From: Andrew Morton @ 2007-07-10 9:22 UTC (permalink / raw) To: Heiko Carstens; +Cc: linux-kernel, Andi Kleen, Amit Arora, Martin Schwidefsky On Tue, 10 Jul 2007 11:07:37 +0200 Heiko Carstens <heiko.carstens@de.ibm.com> wrote: > > fallocate-implementation-on-i86-x86_64-and-powerpc.patch > > Still broken: arch/x86_64/ia32/ia32entry.S wants compat_sys_fallocate instead > of sys_fallocate. Also compat_sys_fallocate probably should be moved to > fs/compat.c. > > > fallocate-on-s390.patch > > We reserved a different syscall number than the one that is used right now > in the patch. Please drop this patch... Martin or I will wire up the syscall > as soon as the x86 variant is merged. Everything else just causes trouble and > confusion. OK, I dropped all the fallocate patches. That means that a few other syscalls (or at least, revoke) get renumbered. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 -- sys_fallocate 2007-07-10 9:22 ` Andrew Morton @ 2007-07-10 15:45 ` Theodore Tso 2007-07-10 17:27 ` Andrew Morton ` (2 more replies) 0 siblings, 3 replies; 535+ messages in thread From: Theodore Tso @ 2007-07-10 15:45 UTC (permalink / raw) To: Andrew Morton Cc: Heiko Carstens, linux-kernel, Andi Kleen, Amit Arora, Martin Schwidefsky On Tue, Jul 10, 2007 at 02:22:13AM -0700, Andrew Morton wrote: > On Tue, 10 Jul 2007 11:07:37 +0200 Heiko Carstens <heiko.carstens@de.ibm.com> wrote: > > We reserved a different syscall number than the one that is used right now > > in the patch. Please drop this patch... Martin or I will wire up the syscall > > as soon as the x86 variant is merged. Everything else just causes trouble and > > confusion. > > OK, I dropped all the fallocate patches. Andrew, I want to clarify who is going to push the fallocate patches. I can either push them to Linus as part of the ext4 patch set, or we can wait for you to push them. I thought since you had them in -mm we were going to wait for you to push them (and presumed that this was going to happen soon). Alternatively I can push them directly to Linus along with other ext4 patches. We can drop the s390 patch if Martin or Heiko wants to wire it up themselves. As far as I know there hasn't been any real contention on the actual syscall patches, other than the numbering issues, so it seems that pushing them to Linus sooner rather than later is the right thing to do. I don't particularly care who pushes them, just as long as they get pushed. :-) So if you've dropped them, shall I push them to Linus as part of the ext4 patches we've been planning on pushing? Regards, - Ted ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 -- sys_fallocate 2007-07-10 15:45 ` Theodore Tso @ 2007-07-10 17:27 ` Andrew Morton 2007-07-10 18:05 ` Heiko Carstens 2007-07-10 18:20 ` Mark Fasheh 2 siblings, 0 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-10 17:27 UTC (permalink / raw) To: Theodore Tso Cc: Heiko Carstens, linux-kernel, Andi Kleen, Amit Arora, Martin Schwidefsky On Tue, 10 Jul 2007 11:45:03 -0400 Theodore Tso <tytso@mit.edu> wrote: > On Tue, Jul 10, 2007 at 02:22:13AM -0700, Andrew Morton wrote: > > On Tue, 10 Jul 2007 11:07:37 +0200 Heiko Carstens <heiko.carstens@de.ibm.com> wrote: > > > We reserved a different syscall number than the one that is used right now > > > in the patch. Please drop this patch... Martin or I will wire up the syscall > > > as soon as the x86 variant is merged. Everything else just causes trouble and > > > confusion. > > > > OK, I dropped all the fallocate patches. > > Andrew, I want to clarify who is going to push the fallocate patches. > I can either push them to Linus as part of the ext4 patch set, or we > can wait for you to push them. I thought since you had them in -mm > and we were going to wait you to push them (and presume that this was > going to happen soon). How about you send them? The syscall numbers might need to be changed based upon when/whether the revoke patches get merged. > Alternatively I can push them directly to Linus along with other ext4 > patches. I note that nobody really bothered reviewing all those ext4 patches. Do you feel that they have been adequately reviewed? I don't. I guess I know what I'll be doing today :( > We can drop the s390 patch if Martin or Heiko wants to wire > it up themselves. ia64 needs changing too. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 -- sys_fallocate 2007-07-10 15:45 ` Theodore Tso 2007-07-10 17:27 ` Andrew Morton @ 2007-07-10 18:05 ` Heiko Carstens 2007-07-10 18:39 ` Amit K. Arora 2007-07-10 18:41 ` Andrew Morton 2007-07-10 18:20 ` Mark Fasheh 2 siblings, 2 replies; 535+ messages in thread From: Heiko Carstens @ 2007-07-10 18:05 UTC (permalink / raw) To: Theodore Tso, Andrew Morton, linux-kernel, Andi Kleen, Amit Arora, Martin Schwidefsky > Alternatively I can push them directly to Linus along with other ext4 > patches. We can drop the s390 patch if Martin or Heiko wants to wire > it up themselves. Yes, please drop the s390 patch. In general it seems to be better if only one architecture gets a syscall wired up initially and let other arches follow later. Just wondering if the x86_64 compat syscall gets ever fixed? I think I mentioned already three or four times to Amit that it is broken. Or is it that nobody cares? Dunno.. In addition there used to be a somewhat inofficial rule that new syscalls have to come with a test program, so people can easily test if they wired up the syscall correctly. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 -- sys_fallocate 2007-07-10 18:05 ` Heiko Carstens @ 2007-07-10 18:39 ` Amit K. Arora 2007-07-10 18:41 ` Andrew Morton 1 sibling, 0 replies; 535+ messages in thread From: Amit K. Arora @ 2007-07-10 18:39 UTC (permalink / raw) To: Heiko Carstens Cc: Theodore Tso, Andrew Morton, linux-kernel, Andi Kleen, Martin Schwidefsky On Tue, Jul 10, 2007 at 08:05:31PM +0200, Heiko Carstens wrote: > > Alternatively I can push them directly to Linus along with other ext4 > > patches. We can drop the s390 patch if Martin or Heiko wants to wire > > it up themselves. > > Yes, please drop the s390 patch. In general it seems to be better if only > one architecture gets a syscall wired up initially and let other arches > follow later. > > Just wondering if the x86_64 compat syscall gets ever fixed? I think > I mentioned already three or four times to Amit that it is broken. > Or is it that nobody cares? Dunno.. The last time it was brought up was when TAKE5 of the patchset was posted and I had planned to fix this in TAKE6 - which didn't happen since there was no final decision on the mode flags. Anyhow, the x86_64 compat syscall has already been fixed in the ext4 patch queue. I will repost all the patches rebased on 2.6.22 (as they are in the ext4 patch queue), since these have already been dropped from -mm. > In addition there used to be a somewhat inofficial rule that new syscalls > have to come with a test program, so people can easily test if they wired > up the syscall correctly. Ok. Will work on a small testcase and post it soon. -- Regards, Amit Arora ^ permalink raw reply [flat|nested] 535+ messages in thread
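[Editor's note: a minimal wiring test of the kind discussed here, sketched as userspace C. It invokes fallocate by syscall number via syscall(2) so it works before a libc wrapper exists. Assumptions: SYS_fallocate is defined for the build architecture, and the architecture is 64-bit so each loff_t fits in one argument slot (on 32-bit the arguments would need to be split, which is exactly the compat issue above).]

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdlib.h>

/* Returns 0 if the syscall is wired up and preallocates as expected;
 * an ENOSYS failure here would mean the syscall table entry is missing. */
static int test_fallocate_wiring(void)
{
    char tmpl[] = "/tmp/falloc-test-XXXXXX";
    int fd = mkstemp(tmpl);
    if (fd < 0)
        return -1;
    unlink(tmpl);                 /* file lives only as long as the fd */
    /* mode 0, offset 0, len 4096: plain preallocation of one page */
    if (syscall(SYS_fallocate, fd, 0, (off_t)0, (off_t)4096) != 0) {
        close(fd);
        return -1;
    }
    struct stat st;               /* default mode also extends i_size */
    int ok = (fstat(fd, &st) == 0 && st.st_size == 4096) ? 0 : -1;
    close(fd);
    return ok;
}
```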
* Re: -mm merge plans for 2.6.23 -- sys_fallocate 2007-07-10 18:05 ` Heiko Carstens 2007-07-10 18:39 ` Amit K. Arora @ 2007-07-10 18:41 ` Andrew Morton 2007-07-11 9:36 ` testcases, was " Christoph Hellwig 2007-07-11 9:40 ` Andi Kleen 1 sibling, 2 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-10 18:41 UTC (permalink / raw) To: Heiko Carstens Cc: Theodore Tso, linux-kernel, Andi Kleen, Amit Arora, Martin Schwidefsky On Tue, 10 Jul 2007 20:05:31 +0200 Heiko Carstens <heiko.carstens@de.ibm.com> wrote: > > Alternatively I can push them directly to Linus along with other ext4 > > patches. We can drop the s390 patch if Martin or Heiko wants to wire > > it up themselves. > > Yes, please drop the s390 patch. In general it seems to be better if only > one architecture gets a syscall wired up initially and let other arches > follow later. Yep. otoh, fallocate() was special, because we had so many problems working out how to organise the args so that certain kooky architectures can implement it. > Just wondering if the x86_64 compat syscall gets ever fixed? I think > I mentioned already three or four times to Amit that it is broken. > Or is it that nobody cares? Dunno.. > > In addition there used to be a somewhat inofficial rule that new syscalls > have to come with a test program, so people can easily test if they wired > up the syscall correctly. Yes please. I normally just slam the whole .c file into the changelog. I'd support an ununofficial rule that submitters of new syscalls also raise a patch against LTP, come to that... ^ permalink raw reply [flat|nested] 535+ messages in thread
* testcases, was Re: -mm merge plans for 2.6.23 -- sys_fallocate 2007-07-10 18:41 ` Andrew Morton @ 2007-07-11 9:36 ` Christoph Hellwig 2007-07-11 9:40 ` Nick Piggin 2007-07-11 9:40 ` Andi Kleen 1 sibling, 1 reply; 535+ messages in thread From: Christoph Hellwig @ 2007-07-11 9:36 UTC (permalink / raw) To: Andrew Morton Cc: Heiko Carstens, Theodore Tso, linux-kernel, Andi Kleen, Amit Arora, Martin Schwidefsky On Tue, Jul 10, 2007 at 11:41:19AM -0700, Andrew Morton wrote: > I'd support an ununofficial rule that submitters of new syscalls also raise > a patch against LTP, come to that... s/ununofficial//, please. And extend this to every new kernel interface that's not bound to a specific piece of hardware. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: testcases, was Re: -mm merge plans for 2.6.23 -- sys_fallocate 2007-07-11 9:36 ` testcases, was " Christoph Hellwig @ 2007-07-11 9:40 ` Nick Piggin 2007-07-11 10:36 ` Michael Kerrisk 0 siblings, 1 reply; 535+ messages in thread From: Nick Piggin @ 2007-07-11 9:40 UTC (permalink / raw) To: Christoph Hellwig Cc: Andrew Morton, Heiko Carstens, Theodore Tso, linux-kernel, Andi Kleen, Amit Arora, Martin Schwidefsky, Michael Kerrisk Christoph Hellwig wrote: > On Tue, Jul 10, 2007 at 11:41:19AM -0700, Andrew Morton wrote: > >>I'd support an ununofficial rule that submitters of new syscalls also raise >>a patch against LTP, come to that... > > > s/ununofficial//, please. And extend this to every new kernel interface > that's not bound to a specific piece of hardware. Agree, and cc manpages maintainer too? (preferably write most of the manpage body as well, IMO). -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: testcases, was Re: -mm merge plans for 2.6.23 -- sys_fallocate 2007-07-11 9:40 ` Nick Piggin @ 2007-07-11 10:36 ` Michael Kerrisk 0 siblings, 0 replies; 535+ messages in thread From: Michael Kerrisk @ 2007-07-11 10:36 UTC (permalink / raw) To: Nick Piggin, hch Cc: schwidefsky, aarora, ak, linux-kernel, tytso, heiko.carstens, akpm > Christoph Hellwig wrote: > > On Tue, Jul 10, 2007 at 11:41:19AM -0700, Andrew Morton wrote: > > > >>I'd support an ununofficial rule that submitters of new syscalls also > >> raise a patch against LTP, come to that... > > > > > > s/ununofficial//, please. And extend this to every new kernel interface > > that's not bound to a specific piece of hardware. > > Agree, and cc manpages maintainer too? (preferably write most > of the manpage body as well, IMO). Yes, please. Docs written by, or with input from, the implementer provide one of the few ways for everyone else to spot differences between implementation and intention. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 -- sys_fallocate 2007-07-10 18:41 ` Andrew Morton 2007-07-11 9:36 ` testcases, was " Christoph Hellwig @ 2007-07-11 9:40 ` Andi Kleen 1 sibling, 0 replies; 535+ messages in thread From: Andi Kleen @ 2007-07-11 9:40 UTC (permalink / raw) To: Andrew Morton Cc: Heiko Carstens, Theodore Tso, linux-kernel, Amit Arora, Martin Schwidefsky > I'd support an ununofficial rule that submitters of new syscalls also raise > a patch against LTP, come to that... And a patch for the manpages. Definitely in favor. -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 -- sys_fallocate 2007-07-10 15:45 ` Theodore Tso 2007-07-10 17:27 ` Andrew Morton 2007-07-10 18:05 ` Heiko Carstens @ 2007-07-10 18:20 ` Mark Fasheh 2007-07-10 20:28 ` Amit K. Arora 2 siblings, 1 reply; 535+ messages in thread From: Mark Fasheh @ 2007-07-10 18:20 UTC (permalink / raw) To: Theodore Tso, Andrew Morton, Heiko Carstens, linux-kernel, Andi Kleen, Amit Arora, Martin Schwidefsky On Tue, Jul 10, 2007 at 11:45:03AM -0400, Theodore Tso wrote: > On Tue, Jul 10, 2007 at 02:22:13AM -0700, Andrew Morton wrote: > > On Tue, 10 Jul 2007 11:07:37 +0200 Heiko Carstens <heiko.carstens@de.ibm.com> wrote: > > > We reserved a different syscall number than the one that is used right now > > > in the patch. Please drop this patch... Martin or I will wire up the syscall > > > as soon as the x86 variant is merged. Everything else just causes trouble and > > > confusion. > > > > OK, I dropped all the fallocate patches. > > Andrew, I want to clarify who is going to push the fallocate patches. > I can either push them to Linus as part of the ext4 patch set, or we > can wait for you to push them. I thought since you had them in -mm > and we were going to wait you to push them (and presume that this was > going to happen soon). > > Alternatively I can push them directly to Linus along with other ext4 > patches. We can drop the s390 patch if Martin or Heiko wants to wire > it up themselves. > > As far as I know there hasn't been any real contention on the actual > syscall patches, other than the numbering issues, so it seems that > pushing them to Linus sooner rather than later is the right thing to > do. Where is the latest and greatest version of those patches? Is it still the patch set distributed in 2.6.22-rc6-mm1? I'd mostly like to see the final set of flags we're planning on supporting. 
But yeah, I second the "sooner rather than later" :) --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh@oracle.com ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 -- sys_fallocate 2007-07-10 18:20 ` Mark Fasheh @ 2007-07-10 20:28 ` Amit K. Arora 0 siblings, 0 replies; 535+ messages in thread From: Amit K. Arora @ 2007-07-10 20:28 UTC (permalink / raw) To: Mark Fasheh Cc: Theodore Tso, Andrew Morton, Heiko Carstens, linux-kernel, Andi Kleen, Martin Schwidefsky On Tue, Jul 10, 2007 at 11:20:47AM -0700, Mark Fasheh wrote: > On Tue, Jul 10, 2007 at 11:45:03AM -0400, Theodore Tso wrote: > > On Tue, Jul 10, 2007 at 02:22:13AM -0700, Andrew Morton wrote: > > > On Tue, 10 Jul 2007 11:07:37 +0200 Heiko Carstens <heiko.carstens@de.ibm.com> wrote: > > > > We reserved a different syscall number than the one that is used right now > > > > in the patch. Please drop this patch... Martin or I will wire up the syscall > > > > as soon as the x86 variant is merged. Everything else just causes trouble and > > > > confusion. > > > > > > OK, I dropped all the fallocate patches. > > > > Andrew, I want to clarify who is going to push the fallocate patches. > > I can either push them to Linus as part of the ext4 patch set, or we > > can wait for you to push them. I thought since you had them in -mm > > and we were going to wait you to push them (and presume that this was > > going to happen soon). > > > > Alternatively I can push them directly to Linus along with other ext4 > > patches. We can drop the s390 patch if Martin or Heiko wants to wire > > it up themselves. > > > > As far as I know there hasn't been any real contention on the actual > > syscall patches, other than the numbering issues, so it seems that > > pushing them to Linus sooner rather than later is the right thing to > > do. > > Where is the latest and greatest version of those patches? Is it still the > patch set distributed in 2.6.22-rc6-mm1? I'd mostly like to see the final > set of flags we're planning on supporting. But yeah, I second the "sooner > rather than later" :) I have posted the latest fallocate patches as part of TAKE6. 
These patches are exactly the same as how they currently look in the ext4 patch queue being maintained by Ted. -- Regards, Amit Arora ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: cpuset-remove-sched-domain-hooks-from-cpusets 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton 2007-07-10 9:04 ` intel iommu (Re: -mm merge plans for 2.6.23) Jan Engelhardt 2007-07-10 9:07 ` -mm merge plans for 2.6.23 -- sys_fallocate Heiko Carstens @ 2007-07-10 9:17 ` Paul Jackson 2007-07-10 10:15 ` -mm merge plans for 2.6.23 Con Kolivas ` (23 subsequent siblings) 26 siblings, 0 replies; 535+ messages in thread From: Paul Jackson @ 2007-07-10 9:17 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Dinakar Guniguntala, Cliff Wickman Andrew wrote: > cpuset-remove-sched-domain-hooks-from-cpusets.patch > > Stuck. Well ... a few hours ago I just finished the 'unrelated task' that kept me from doing much cpuset work the last six months. So, after a little bit of saved up vacation (SGI sabbatical - yippee!), I should be able to dig into this, see what Dinakar and Cliff have been up to here, and make some progress on this. Cliff -- did I see some work from you go by relating to cpusets and the sched domain hooks in cpusets? As before, Andrew, if you're getting bored holding this patch, it's ok to drop it, even though I still figure you'll see it again, as part of a patch set to improve this cpuset to sched domain interaction. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (2 preceding siblings ...) 2007-07-10 9:17 ` cpuset-remove-sched-domain-hooks-from-cpusets Paul Jackson @ 2007-07-10 10:15 ` Con Kolivas [not found] ` <b21f8390707101802o2d546477n2a18c1c3547c3d7a@mail.gmail.com> ` (2 more replies) 2007-07-10 10:52 ` containers (was Re: -mm merge plans for 2.6.23) Srivatsa Vaddagiri ` (22 subsequent siblings) 26 siblings, 3 replies; 535+ messages in thread From: Con Kolivas @ 2007-07-10 10:15 UTC (permalink / raw) To: Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm; +Cc: linux-kernel On Tuesday 10 July 2007 18:31, Andrew Morton wrote: > When replying, please rewrite the subject suitably and try to Cc: the > appropriate developer(s). ~swap prefetch Nick's only remaining issue which I could remotely identify was to make it cpuset aware: http://marc.info/?l=linux-mm&m=117875557014098&w=2 as discussed with Paul Jackson it was cpuset aware: http://marc.info/?l=linux-mm&m=117895463120843&w=2 I fixed all bugs I could find and improved it as much as I could last kernel cycle. Put me and the users out of our misery and merge it now or delete it forever please. And if the meaningless handwaving that I 100% expect as a response begins again, then that's fine. I'll take that as a no and you can dump it. -- -ck ^ permalink raw reply [flat|nested] 535+ messages in thread
[parent not found: <b21f8390707101802o2d546477n2a18c1c3547c3d7a@mail.gmail.com>]
* Re: [ck] Re: -mm merge plans for 2.6.23 [not found] ` <b21f8390707101802o2d546477n2a18c1c3547c3d7a@mail.gmail.com> @ 2007-07-11 1:14 ` Andrew Morton [not found] ` <b8bf37780707101852g25d835b4ubbf8da5383755d4b@mail.gmail.com> ` (5 more replies) 0 siblings, 6 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-11 1:14 UTC (permalink / raw) To: Matthew Hawkins Cc: Con Kolivas, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Wed, 11 Jul 2007 11:02:56 +1000 "Matthew Hawkins" <darthmdh@gmail.com> wrote: > We all know swap prefetch has been tested out the wazoo since Moses was a > little boy, is compile-time and runtime selectable, and gives an important > and quantifiable performance increase to desktop systems. Always interested. Please provide us more details on your usage and testing of that code. Amount of memory, workload, observed results, etc? > Save a Redhat > employee some time reinventing the wheel and just merge it. This wheel > already has dope 21" rims, homes ;-) ooh, kernel bling. ^ permalink raw reply [flat|nested] 535+ messages in thread
[parent not found: <b8bf37780707101852g25d835b4ubbf8da5383755d4b@mail.gmail.com>]
* Fwd: [ck] Re: -mm merge plans for 2.6.23 [not found] ` <b8bf37780707101852g25d835b4ubbf8da5383755d4b@mail.gmail.com> @ 2007-07-11 1:53 ` André Goddard Rosa 0 siblings, 0 replies; 535+ messages in thread From: André Goddard Rosa @ 2007-07-11 1:53 UTC (permalink / raw) To: linux list On 7/10/07, Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 11 Jul 2007 11:02:56 +1000 "Matthew Hawkins" <darthmdh@gmail.com> wrote: > > > We all know swap prefetch has been tested out the wazoo since Moses was a > > little boy, is compile-time and runtime selectable, and gives an important > > and quantifiable performance increase to desktop systems. > > Always interested. Please provide us more details on your usage and > testing of that code. Amount of memory, workload, observed results, > etc? > > It keeps my machine responsive after some time of inactivity, i.e. when I try to use firefox in the morning after leaving it running overnight with multiple tabs open. I have 1Gb of memory in this machine. With regards, -- []s, André Goddard ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-11 1:14 ` [ck] " Andrew Morton [not found] ` <b8bf37780707101852g25d835b4ubbf8da5383755d4b@mail.gmail.com> @ 2007-07-11 2:21 ` Ira Snyder 2007-07-11 3:37 ` timotheus 2007-07-11 2:54 ` Matthew Hawkins ` (3 subsequent siblings) 5 siblings, 1 reply; 535+ messages in thread From: Ira Snyder @ 2007-07-11 2:21 UTC (permalink / raw) To: Andrew Morton Cc: Matthew Hawkins, linux-kernel, Con Kolivas, ck list, linux-mm, Paul Jackson [-- Attachment #1: Type: text/plain, Size: 1376 bytes --] On Tue, 10 Jul 2007 18:14:19 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 11 Jul 2007 11:02:56 +1000 "Matthew Hawkins" <darthmdh@gmail.com> wrote: > > > We all know swap prefetch has been tested out the wazoo since Moses was a > > little boy, is compile-time and runtime selectable, and gives an important > > and quantifiable performance increase to desktop systems. > > Always interested. Please provide us more details on your usage and > testing of that code. Amount of memory, workload, observed results, > etc? > I often leave long compiles running overnight (I'm a gentoo user). I always have the desktop running, with quite a few applications open, usually firefox, amarok, sylpheed, and liferea at the minimum. I've recently tried using a "stock" gentoo kernel, without the swap prefetch patch, and in the morning when I get on the computer, it hits the disk pretty hard pulling my applications (especially firefox) in from swap. With swap prefetch, the system responds like I expect: quick. It doesn't hit the swap at all, at least that I can tell. Swap prefetch definitely makes a difference for me: it makes my experience MUCH better. My system is a Core Duo 1.83GHz laptop, with 1GB ram and a 5400 rpm disk. With the disk being so slow, the less I hit swap, the better. I'll cast my vote to merge swap prefetch. 
^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-11 2:21 ` Ira Snyder @ 2007-07-11 3:37 ` timotheus 0 siblings, 0 replies; 535+ messages in thread From: timotheus @ 2007-07-11 3:37 UTC (permalink / raw) To: Andrew Morton Cc: Matthew Hawkins, linux-kernel, Con Kolivas, ck list, linux-mm, Paul Jackson [-- Attachment #1: Type: text/plain, Size: 2484 bytes --] Ira Snyder <kernel@irasnyder.com> writes: >> Always interested. Please provide us more details on your usage and >> testing of that code. Amount of memory, workload, observed results, >> etc? >> > > I often leave long compiles running overnight (I'm a gentoo user). I > always have the desktop running, with quite a few applications open, > usually firefox, amarok, sylpheed, and liferea at the minimum. I've > recently tried using a "stock" gentoo kernel, without the swap > prefetch patch, and in the morning when I get on the computer, it hits > the disk pretty hard pulling my applications (especially firefox) in > from swap. With swap prefetch, the system responds like I expect: > quick. It doesn't hit the swap at all, at least that I can tell. > > Swap prefetch definitely makes a difference for me: it makes my > experience MUCH better. > > My system is a Core Duo 1.83GHz laptop, with 1GB ram and a 5400 rpm > disk. With the disk being so slow, the less I hit swap, the better. > > I'll cast my vote to merge swap prefetch. Very similar experiences. Other usage patterns that swap prefetch can cause improvements with: - Idling VMware session with large memory. Since VMware (server) can use mixed swap/RAM, the prefetch allows it swap back into RAM without having to make the application active in the foreground. - Firefox, OO Office, long from-source compilations, all of the normal. - My largest RAM capacity machine is a Core 2 Duo Laptop with 2 GB of RAM. It still benefits from the prefetch after running long compilations or backups. - Also, I have an old Pentium 4 server (1.3 GHz, original RDRAM, ...) 
that uses the CK patches including swap prefetch. It has only 640 MB of RAM, and runs GBytes of data backup every night. The swap is split among multiple disks, and can easily fill 0.5 GBytes overnight. Applications that run in a VNC session, web browsers, office programs, etc., all resume much faster with the prefetch. Even the initial ssh-login appears snappier; but I think that is just CK's fine work elsewhere. :) I am curious how much of the benefit is due to prefetch, or due to CK using `vm_mapped' rather than `vm_swappiness'. Swappiness always seemed such an unbeneficial hack to me... (The past 6 months I've tried weeks/months of using various kernels, -mm, -ck, vanilla, genpatches, combinations thereof -- x86 and ppc.) I vote for prefetch and `vm_mapped'. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-11 1:14 ` [ck] " Andrew Morton [not found] ` <b8bf37780707101852g25d835b4ubbf8da5383755d4b@mail.gmail.com> 2007-07-11 2:21 ` Ira Snyder @ 2007-07-11 2:54 ` Matthew Hawkins 2007-07-11 5:18 ` Nick Piggin 2007-07-11 3:59 ` Grzegorz Kulewski ` (2 subsequent siblings) 5 siblings, 1 reply; 535+ messages in thread From: Matthew Hawkins @ 2007-07-11 2:54 UTC (permalink / raw) To: Andrew Morton Cc: Con Kolivas, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 7/11/07, Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 11 Jul 2007 11:02:56 +1000 "Matthew Hawkins" <darthmdh@gmail.com> wrote: > Always interested. Please provide us more details on your usage and > testing of that code. Amount of memory, workload, observed results, > etc? My usual workstation has 1Gb of ram & 2Gb of swap (single partition - though in the past with multiple drives I would spread swap around the less-used disks & fiddle with the priority). Its acting as server for my home network too (so it has squid, cups, bind, dhcpd, apache, mysql & postgresql) but for the most part I'll have Listen playing music while I switch between Flock &/or Firefox, Thunderbird, and xvncviewer. On the odd occasion I'll fire up some game (gewled, actioncube, critical mass). Compiling these days has been mostly limited to kernels, I've been building mostly -ck and -cfs - keeping up-to-date and also doing some odd things (like patching the non-SD -ck stuff on top of CFS). Mainly just to get swap prefetch, but also not to lose skills since I'm out of the daily coding routine now. Anyhow with swap prefetch, applications that may have been sitting there idle for a while become responsive in the single-digit seconds rather than double-digit or worse. 
The same goes for a morning wakeup (ie after nightly cron jobs throw things out) and also after doing basically anything that wants memory, like benchmarking the various kernels I'm messing with or doing some local DB work or coding a memory leak into a web application running under apache ;) -- Matt ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-11 2:54 ` Matthew Hawkins @ 2007-07-11 5:18 ` Nick Piggin 2007-07-11 5:47 ` Ray Lee 0 siblings, 1 reply; 535+ messages in thread From: Nick Piggin @ 2007-07-11 5:18 UTC (permalink / raw) To: Matthew Hawkins Cc: Andrew Morton, Con Kolivas, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Matthew Hawkins wrote: > On 7/11/07, Andrew Morton <akpm@linux-foundation.org> wrote: > Anyhow with swap prefetch, applications that may have been sitting > there idle for a while become responsive in the single-digit seconds > rather than double-digit or worse. The same goes for a morning wakeup > (ie after nightly cron jobs throw things out) OK that's a good data point. It would be really good to be able to do an analysis on your overnight IO patterns and the corresponding memory reclaim behaviour and see why things are getting evicted. Not that swap prefetching isn't a good solution for this situation, but the fact that things are getting swapped out for you also means that mapped files and possibly important pagecache and dentries are being flushed out, which we might be able to avoid. Thanks, Nick -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-11 5:18 ` Nick Piggin @ 2007-07-11 5:47 ` Ray Lee 2007-07-11 5:54 ` Nick Piggin 2007-07-11 6:00 ` [ck] Re: -mm merge plans for 2.6.23 Nick Piggin 0 siblings, 2 replies; 535+ messages in thread From: Ray Lee @ 2007-07-11 5:47 UTC (permalink / raw) To: Nick Piggin Cc: Matthew Hawkins, Andrew Morton, Con Kolivas, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 7/10/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Matthew Hawkins wrote: > > On 7/11/07, Andrew Morton <akpm@linux-foundation.org> wrote: > > > Anyhow with swap prefetch, applications that may have been sitting > > there idle for a while become responsive in the single-digit seconds > > rather than double-digit or worse. The same goes for a morning wakeup > > (ie after nightly cron jobs throw things out) > > OK that's a good data point. It would be really good to be able to > do an analysis on your overnight IO patterns and the corresponding > memory reclaim behaviour and see why things are getting evicted. Eviction can happen for multiple reasons, as I'm sure you're painfully aware. It can happen because of poor balancing choices, or it can happen because the system is just short of RAM for the workload. As for the former, you're absolutely right, it would be good to know where those come from and see if they can be addressed. However, it's the latter that swap prefetch can help and no amount of fiddling with the aging code can address. As an honest question, what's it going to take here? If I were to write something that watched the task stats at process exit (cool feature, that), and recorded the IO wait time or some such, and showed it was lower with a kernel with the prefetch, would *that* get us some forward motion on this? 
I mean, from my point of view, it's a simple mental proof to show that if you're out of RAM for your working set, things that you'll eventually need again will get kicked out, and prefetch will bring those back in before normal access patterns would fault them back in under today's behavior. That seems like an obvious win. Where's the corresponding obvious loss that makes this a questionable addition to the kernel? Ray ^ permalink raw reply [flat|nested] 535+ messages in thread
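[Editor's note: one concrete form of the per-task measurement Ray suggests. With per-task delay accounting, /proc/<pid>/stat exposes delayacct_blkio_ticks (field 42 in proc(5)), the cumulative time the task spent waiting on block I/O; sampling it for the same desktop applications with and without swap prefetch would turn the claim into numbers. This is a sketch with minimal error handling, not the netlink taskstats interface the kernel also provides.]

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read delayacct_blkio_ticks (field 42 of /proc/<pid>/stat), the total
 * time this task spent waiting on block I/O.  Parsing skips past the
 * comm field (field 2), which may itself contain spaces or parens. */
static long long blkio_delay_ticks(const char *pid)   /* e.g. "self" */
{
    char path[64], buf[4096];
    snprintf(path, sizeof(path), "/proc/%s/stat", pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    char *p = strrchr(buf, ')');      /* ')' ends the comm field */
    if (!p)
        return -1;
    char *tok = strtok(p + 1, " ");   /* tok = field 3 (state) */
    for (int field = 3; tok && field < 42; field++)
        tok = strtok(NULL, " ");      /* advance to field 42 */
    return tok ? atoll(tok) : -1;
}
```

Note that without CONFIG_TASK_DELAY_ACCT the field simply reads 0, so the comparison only works on kernels with delay accounting enabled.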
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-11 5:47 ` Ray Lee @ 2007-07-11 5:54 ` Nick Piggin 2007-07-11 6:04 ` Ray Lee 0 siblings, 1 reply; 535+ messages in thread From: Nick Piggin @ 2007-07-11 5:54 UTC (permalink / raw) To: Ray Lee Cc: Matthew Hawkins, Andrew Morton, Con Kolivas, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Ray Lee wrote: > On 7/10/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> Matthew Hawkins wrote: >> > On 7/11/07, Andrew Morton <akpm@linux-foundation.org> wrote: >> >> > Anyhow with swap prefetch, applications that may have been sitting >> > there idle for a while become responsive in the single-digit seconds >> > rather than double-digit or worse. The same goes for a morning wakeup >> > (ie after nightly cron jobs throw things out) >> >> OK that's a good data point. It would be really good to be able to >> do an analysis on your overnight IO patterns and the corresponding >> memory reclaim behaviour and see why things are getting evicted. > > > Eviction can happen for multiple reasons, as I'm sure you're painfully > aware. It can happen because of poor balancing choices, or it can s/balancing/reclaim, yes. And for the nightly cron job case, this could quite possibly be the cause. At least updatedb should be fairly easy to apply use-once heuristics for, so if they're not working then we should hopefully be able to improve it. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-11 5:54 ` Nick Piggin @ 2007-07-11 6:04 ` Ray Lee 2007-07-11 6:24 ` Nick Piggin 0 siblings, 1 reply; 535+ messages in thread From: Ray Lee @ 2007-07-11 6:04 UTC (permalink / raw) To: Nick Piggin Cc: Matthew Hawkins, Andrew Morton, Con Kolivas, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 7/10/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> OK that's a good data point. It would be really good to be able to > >> do an analysis on your overnight IO patterns and the corresponding > >> memory reclaim behaviour and see why things are getting evicted. > > > > Eviction can happen for multiple reasons, as I'm sure you're painfully > > aware. It can happen because of poor balancing choices, or it can > > s/balancing/reclaim, yes. And for the nightly cron job case, this is > could quite possibly be the cause. At least updatedb should be fairly > easy to apply use-once heuristics for, so if they're not working then > we should hopefully be able to improve it. <nod> Sorry, I'm not so clear on the terminology, am I. So, that's one part of it: one could argue that for that bit swap prefetch is a bit of a band-aid over the issue. A useful band-aid, that works today, isn't invasive, and can be ripped out at some future time if the underlying issue is eventually solved by a proper use-once aging mechanism, but nevertheless a band-aid. The other part is when I've got evolution and a few other things open, then I run gimp on a raw photo and do some work on it, quit out of gimp, do a couple of things in a shell to upload the photo to my server, then switch back to evolution. Hang, waiting on swap in. Well, the kernel had some free time there to repopulate evolution's working set, and swap prefetch would help that, while better (or perfect!) heuristics in the reclaim *won't*. That's the real issue here. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-11 6:04 ` Ray Lee @ 2007-07-11 6:24 ` Nick Piggin 2007-07-11 7:50 ` swap prefetch (Re: -mm merge plans for 2.6.23) Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Nick Piggin @ 2007-07-11 6:24 UTC (permalink / raw) To: Ray Lee Cc: Matthew Hawkins, Andrew Morton, Con Kolivas, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Ray Lee wrote: > On 7/10/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> >> OK that's a good data point. It would be really good to be able to >> >> do an analysis on your overnight IO patterns and the corresponding >> >> memory reclaim behaviour and see why things are getting evicted. >> > >> > Eviction can happen for multiple reasons, as I'm sure you're painfully >> > aware. It can happen because of poor balancing choices, or it can >> >> s/balancing/reclaim, yes. And for the nightly cron job case, this >> could quite possibly be the cause. At least updatedb should be fairly >> easy to apply use-once heuristics for, so if they're not working then >> we should hopefully be able to improve it. > > > <nod> Sorry, I'm not so clear on the terminology, am I. > > So, that's one part of it: one could argue that for that bit swap > prefetch is a bit of a band-aid over the issue. A useful band-aid, > that works today, isn't invasive, and can be ripped out at some future > time if the underlying issue is eventually solved by a proper use-once > aging mechanism, but nevertheless a band-aid. I think for some workloads it is probably a bandaid, and for others the concept of prefetching likely-to-be-used-again data back in is undeniably going to be a win. A lot of positive reports I have seen about this say that the desktop the next morning is more responsive. 
So I kind of want to know what's happening here -- as far as I can tell, swap prefetching shouldn't help a huge amount to recover from a simple updatedb alone -- although if other cron stuff happened that used a bit more memory afterwards and pushed out some of updatedb's cache, perhaps that's when swap prefetching finds its niche. I don't know. However, I don't like the fact that there is _any_ swap happening on 1GB desktops after a single updatedb run. Is something else running that hogs a huge amount of memory? Maybe that explains it, but I don't know. I do know that we probably don't do very good use-once algorithms on the dentry and inode caches, so updatedb might cause them to push pages out to swap. We could test that by winding the vfs reclaim right up. > The other part is when I've got evolution and a few other things open, > then I run gimp on a raw photo and do some work on it, quit out of > gimp, do a couple of things in a shell to upload the photo to my > server, then switch back to evolution. Hang, waiting on swap in. Well, > the kernel had some free time there to repopulate evolution's working > set, and swap prefetch would help that, while better (or perfect!) > heuristics in the reclaim *won't*. > > That's the real issue here. Yeah that's an issue, and swap prefetching has the potential to help there no doubt at all. How much is the saving? I don't think it will be like an order of magnitude because unfortunately we also get mapped pagecache being thrown out as well as swap, so for example all your evolution mailbox, libraries, executable, etc. are still going to have to be paged back in. Regarding swap prefetching. I'm not going to argue for or against it anymore because I have really stopped following where it is up to, for now. If the code and the results meet the standard that Andrew wants then I don't particularly mind if he merges it. 
It would be nice if some of you guys would still report and test problems with reclaim when prefetching is turned off -- I have never encountered the morning after sluggishness (although I don't doubt for a minute that it is a problem for some). -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: swap prefetch (Re: -mm merge plans for 2.6.23) 2007-07-11 6:24 ` Nick Piggin @ 2007-07-11 7:50 ` Ingo Molnar 0 siblings, 0 replies; 535+ messages in thread From: Ingo Molnar @ 2007-07-11 7:50 UTC (permalink / raw) To: Nick Piggin Cc: Ray Lee, Matthew Hawkins, Andrew Morton, Con Kolivas, ck list, Paul Jackson, linux-mm, linux-kernel * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Regarding swap prefetching. I'm not going to argue for or against it > anymore because I have really stopped following where it is up to, for > now. If the code and the results meet the standard that Andrew wants > then I don't particularly mind if he merges it. I have tested it and have read the code, and it looks fine to me. (i've reported my test results elsewhere already) We should include this in v2.6.23. Acked-by: Ingo Molnar <mingo@elte.hu> Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-11 5:47 ` Ray Lee 2007-07-11 5:54 ` Nick Piggin @ 2007-07-11 6:00 ` Nick Piggin 1 sibling, 0 replies; 535+ messages in thread From: Nick Piggin @ 2007-07-11 6:00 UTC (permalink / raw) To: Ray Lee Cc: Matthew Hawkins, Andrew Morton, Con Kolivas, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Ray Lee wrote: > As an honest question, what's it going to take here? If I were to > write something that watched the task stats at process exit (cool > feature, that), and recorded the IO wait time or some such, and showed > it was lower with a kernel with the prefetch, would *that* get us some > forward motion on this? Honest answer? Sure, why not. Numbers are good. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-11 1:14 ` [ck] " Andrew Morton ` (2 preceding siblings ...) 2007-07-11 2:54 ` Matthew Hawkins @ 2007-07-11 3:59 ` Grzegorz Kulewski 2007-07-11 12:26 ` Kevin Winchester 2007-07-12 12:06 ` Kacper Wysocki 5 siblings, 0 replies; 535+ messages in thread From: Grzegorz Kulewski @ 2007-07-11 3:59 UTC (permalink / raw) To: Andrew Morton Cc: Matthew Hawkins, linux-kernel, Con Kolivas, ck list, linux-mm, Paul Jackson On Tue, 10 Jul 2007, Andrew Morton wrote: > On Wed, 11 Jul 2007 11:02:56 +1000 "Matthew Hawkins" <darthmdh@gmail.com> wrote: > >> We all know swap prefetch has been tested out the wazoo since Moses was a >> little boy, is compile-time and runtime selectable, and gives an important >> and quantifiable performance increase to desktop systems. > > Always interested. Please provide us more details on your usage and > testing of that code. Amount of memory, workload, observed results, > etc? I have been using swap prefetch in -ck kernels since it was introduced. My machine: Athlon XP 2000MHz, 1GB DDR 266, fast SATA disk, different swap configurations but usually heaps of swap (2GB and/or 8GB). My workload: desktop usage, KDE, software development, Firefox (HUGE memory hog), Eclipse and all that stuff (HUGE memory hog), sometimes other applications, sometimes some game such as Americas Army (that one will eat all your memory in any configuration), Konsole with heaps of tabs, usually some heavy compilations in the background. Observed result (with non-broken swap prefetch versions): after closing some memory hog (for example stopping playing a game and starting to write some code, or reloading Firefox after it leaked enough memory to nearly bring the system down) the disk will work for some time and after that everything works as expected, no heavy swap-in when switching between applications and so on, nearly no lags in desktop usage. This is nearly unnoticeable. Unless I have to run pure mainline. 
In that case I can notice that swap prefetch is off very quickly because after closing such a memory hog and returning to some other application the system is slow for a long time. Worse: after it starts to work reasonably and I try to switch to some other application or even try to use some dialog window or module of the current application I have to wait, sometimes > 10s for it to swap back in (even if 70% of my RAM is free at that time, after the memory hog is gone). It is painful. I observed similar results on my laptop (Athlon 64, 512MB RAM, slow ATA disk, similar workload but reduced because hardware is weak). For me swap prefetch makes a huge difference. The system lags a lot less in such circumstances. Personally I think swap prefetch is a hack. Maybe not very dirty and ugly but still a hack. But since: * nobody proposed anything that can replace it and can be considered a no-hack, * swap prefetch is rather well tested and shouldn't cause regressions (no known regressions as far as I know, the patch does not look very invasive, was reviewed several times, ...), * Con said he won't make further -ck releases and won't port these patches to newer kernels, * there are at least several people who see the difference, * if somebody really hates it (s)he can turn it off I think it could get merged, at least temporarily, before somebody can suggest some better or extended solution. Personally I would be very happy to see it in so people like me don't have to patch it in or (worse) port it (possibly causing bugs and filing additional bug reports and asking additional questions on these lists). I even wonder whether adding the opposite of swap prefetch wouldn't be even better for many workloads. Something like: "when system and swap-disk is idle try to copy some pages to swap so when system needs memory swap-out could be much cheaper". 
I suspect a patch like that can reduce startup times (and other operations) of big memory hogs because the disk (the slowest device) will only have to read the application and won't have to swap out half of the RAM at the same time. I am happy to provide further info if needed. Thanks, Grzegorz Kulewski ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-11 1:14 ` [ck] " Andrew Morton ` (3 preceding siblings ...) 2007-07-11 3:59 ` Grzegorz Kulewski @ 2007-07-11 12:26 ` Kevin Winchester 2007-07-11 12:36 ` Jesper Juhl 2007-07-12 12:06 ` Kacper Wysocki 5 siblings, 1 reply; 535+ messages in thread From: Kevin Winchester @ 2007-07-11 12:26 UTC (permalink / raw) To: Andrew Morton Cc: Matthew Hawkins, Con Kolivas, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1411 bytes --] On Tue, 10 Jul 2007 18:14:19 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 11 Jul 2007 11:02:56 +1000 "Matthew Hawkins" <darthmdh@gmail.com> wrote: > > > We all know swap prefetch has been tested out the wazoo since Moses was a > > little boy, is compile-time and runtime selectable, and gives an important > > and quantifiable performance increase to desktop systems. > > Always interested. Please provide us more details on your usage and > testing of that code. Amount of memory, workload, observed results, > etc? > I only have 512 MB of memory on my Athlon64 desktop box, and I switch between -mm and mainline kernels regularly. I have noticed that -mm is always much more responsive, especially first thing in the morning. I believe this has been due to the new schedulers in -mm (because I notice an improvement in mainline now that CFS has been merged), as well as swap prefetch. I haven't tested swap prefetch alone to know for sure, but it seems pretty likely. My workload is compiling kernels, with sylpheed, pidgin and firefox[1] open, and sometimes MonoDevelop if I want to slow my system to a crawl. I will be getting another 512 MB of RAM at Christmas time, but from the other reports, it seems that swap prefetch will still be useful. [1] Is there a graphical browser for linux that doesn't suck huge amounts of RAM? 
-- Kevin Winchester ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-11 12:26 ` Kevin Winchester @ 2007-07-11 12:36 ` Jesper Juhl 0 siblings, 0 replies; 535+ messages in thread From: Jesper Juhl @ 2007-07-11 12:36 UTC (permalink / raw) To: Kevin Winchester Cc: Andrew Morton, Matthew Hawkins, Con Kolivas, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 11/07/07, Kevin Winchester <kjwinchester@gmail.com> wrote: [snip] > > [1] Is there a graphical browser for linux that doesn't suck huge amounts of RAM? > Dillo (http://www.dillo.org/) is really really tiny, with a memory footprint somewhere in the hundreds of K area IIRC. links 2 (http://links.twibright.com/) has a graphical mode in addition to the traditional text only mode (links -g) and the memory footprint is really tiny. -- Jesper Juhl <jesper.juhl@gmail.com> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please http://www.expita.com/nomime.html ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-11 1:14 ` [ck] " Andrew Morton ` (4 preceding siblings ...) 2007-07-11 12:26 ` Kevin Winchester @ 2007-07-12 12:06 ` Kacper Wysocki 2007-07-12 12:35 ` Avuton Olrich 5 siblings, 1 reply; 535+ messages in thread From: Kacper Wysocki @ 2007-07-12 12:06 UTC (permalink / raw) To: Andrew Morton Cc: Matthew Hawkins, linux-kernel, Con Kolivas, ck list, linux-mm, Paul Jackson On 7/11/07, Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 11 Jul 2007 11:02:56 +1000 "Matthew Hawkins" <darthmdh@gmail.com> wrote: > > > We all know swap prefetch has been tested out the wazoo since Moses was a > > little boy, is compile-time and runtime selectable, and gives an important > > and quantifiable performance increase to desktop systems. > > Always interested. Please provide us more details on your usage and > testing of that code. Amount of memory, workload, observed results, > etc? Swap prefetch has been around for years, and it's a complete boon for the desktop user and a noop in any other situation. In addition to the sp_tester tool which consistently shows a definite advantage, there are many user reports that show the noticeable improvements it has. The many people who have tried it out have generally chosen to switch to patched kernels because of the performance increase. It's been discussed on the lkml many times before, we've been over performance, testing and impact. The big question is: why *don't* we merge it? -Kacper ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-12 12:06 ` Kacper Wysocki @ 2007-07-12 12:35 ` Avuton Olrich 0 siblings, 0 replies; 535+ messages in thread From: Avuton Olrich @ 2007-07-12 12:35 UTC (permalink / raw) To: Kacper Wysocki Cc: Andrew Morton, Matthew Hawkins, linux-kernel, Con Kolivas, ck list, linux-mm, Paul Jackson On 7/12/07, Kacper Wysocki <kacperw@online.no> wrote: > performance, testing and impact. The big question is: why *don't* we > merge it? The strange thing to me is that this is like déjà vu. Many have asked this same question. When users were asked for comment before, many end users and some developers gave rave reviews. I don't remember anyone giving it the heavy thumbs-down, with the exception of when things needed fixing (over 6 months ago?). It continues to go unmerged. Is there a clear answer on what needs to happen for it to get merged? -- avuton -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-10 10:15 ` -mm merge plans for 2.6.23 Con Kolivas [not found] ` <b21f8390707101802o2d546477n2a18c1c3547c3d7a@mail.gmail.com> @ 2007-07-23 23:08 ` Jesper Juhl 2007-07-24 3:22 ` Nick Piggin 2007-07-24 0:08 ` Con Kolivas 2 siblings, 1 reply; 535+ messages in thread From: Jesper Juhl @ 2007-07-23 23:08 UTC (permalink / raw) To: Con Kolivas Cc: Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 10/07/07, Con Kolivas <kernel@kolivas.org> wrote: > On Tuesday 10 July 2007 18:31, Andrew Morton wrote: > > When replying, please rewrite the subject suitably and try to Cc: the > > appropriate developer(s). > > ~swap prefetch > > Nick's only remaining issue which I could remotely identify was to make it > cpuset aware: > http://marc.info/?l=linux-mm&m=117875557014098&w=2 > as discussed with Paul Jackson it was cpuset aware: > http://marc.info/?l=linux-mm&m=117895463120843&w=2 > > I fixed all bugs I could find and improved it as much as I could last kernel > cycle. > > Put me and the users out of our misery and merge it now or delete it forever > please. And if the meaningless handwaving that I 100% expect as a response > begins again, then that's fine. I'll take that as a no and you can dump it. > For what it's worth; put me down as supporting the merger of swap prefetch. I've found it useful in the past, Con has maintained it nicely and cleaned up everything that people have pointed out - it's mature, does no harm - let's just get it merged. It's too late for 2.6.23-rc1 now, but let's try and get this in by -rc2 - it's long overdue... -- Jesper Juhl <jesper.juhl@gmail.com> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please http://www.expita.com/nomime.html ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-23 23:08 ` Jesper Juhl @ 2007-07-24 3:22 ` Nick Piggin 2007-07-24 4:53 ` Ray Lee 0 siblings, 1 reply; 535+ messages in thread From: Nick Piggin @ 2007-07-24 3:22 UTC (permalink / raw) To: Jesper Juhl Cc: Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Jesper Juhl wrote: > On 10/07/07, Con Kolivas <kernel@kolivas.org> wrote: > >> On Tuesday 10 July 2007 18:31, Andrew Morton wrote: >> > When replying, please rewrite the subject suitably and try to Cc: the >> > appropriate developer(s). >> >> ~swap prefetch >> >> Nick's only remaining issue which I could remotely identify was to >> make it >> cpuset aware: >> http://marc.info/?l=linux-mm&m=117875557014098&w=2 >> as discussed with Paul Jackson it was cpuset aware: >> http://marc.info/?l=linux-mm&m=117895463120843&w=2 >> >> I fixed all bugs I could find and improved it as much as I could last >> kernel >> cycle. >> >> Put me and the users out of our misery and merge it now or delete it >> forever >> please. And if the meaningless handwaving that I 100% expect as a >> response >> begins again, then that's fine. I'll take that as a no and you can >> dump it. >> > For what it's worth; put me down as supporting the merger of swap > prefetch. I've found it useful in the past, Con has maintained it > nicely and cleaned up everything that people have pointed out - it's > mature, does no harm - let's just get it merged. It's too late for > 2.6.23-rc1 now, but let's try and get this in by -rc2 - it's long > overdue... Not talking about swap prefetch itself, but every time I have asked anyone to instrument or produce some workload where swap prefetch helps, they never do. Fair enough if swap prefetch helps them, but I also want to look at why that is the case and try to improve page reclaim in some of these situations (for example standard overnight cron jobs shouldn't need swap prefetch on a 1 or 2GB system, I would hope). 
Anyway, back to swap prefetch, I don't know why I've been singled out as the bad guy here. I'm one of the only people who has had a look at the damn thing and tried to point out areas where it could be improved to the point of being included, and outlining things that are needed for it to be merged (ie. numbers). If anyone thinks that makes me the bad guy then they have an utterly inverted understanding of what peer review is for. Finally, everyone who has ever hacked on these heuristic parts of the VM has heaps of patches that help some workload or some silly test case or (real or perceived) shortfall but have not been merged. It really isn't anything personal. If something really works, then it should be possible to get real numbers in real situations where it helps (OK, swap prefetching won't be as easy as a straight line performance improvement, but still much easier than trying to measure something like scheduler interactivity). Numbers are the best way to add weight to the pro-merge argument, so for all the people who are whining about merging this and don't want to actually work on the code -- post some numbers for where it helps you!! -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-24 3:22 ` Nick Piggin @ 2007-07-24 4:53 ` Ray Lee 2007-07-24 5:10 ` Jeremy Fitzhardinge ` (2 more replies) 0 siblings, 3 replies; 535+ messages in thread From: Ray Lee @ 2007-07-24 4:53 UTC (permalink / raw) To: Nick Piggin Cc: Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 7/23/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Not talking about swap prefetch itself, but everytime I have asked > anyone to instrument or produce some workload where swap prefetch > helps, they never do. [...] > so for all the people who a whining about merging this and don't want > to actually work on the code -- post some numbers for where it helps > you!! <Raised eyebrow> You sound frustrated. Perhaps we could be communicating better. I'll start. Unlike others on the cc: line, I don't get paid to hack on the kernel, not even indirectly. So if you find that my lack of providing numbers is giving you heartache, I can only apologize and point at my paying work that requires my attention. That said, I'm willing to run my day to day life through both a swap prefetch kernel and a normal one. *However*, before I go through all the work of instrumenting the damn thing, I'd really like Andrew (or Linus) to lay out his acceptance criteria on the feature. Exactly what *should* I be paying attention to? I've suggested keeping track of process swapin delay total time, and comparing with and without. Is that reasonable? Is it incomplete? Without Andrew's criteria, we're back to where we've been for a long time: lots of work, no forward motion. Perhaps it's a character flaw of mine, but I'd really like to know what would constitute proof here before I invest the effort. 
Especially given that Con has already written a test case that shows that swap prefetch works, and that I've given you a clear argument for why better (or even perfect) page reclaim can't provide full coverage to all the situations that swap prefetch helps. (Also, it's not like I've got tons free time, y'know? Just like all the rest of you all, I have to pick and choose my battles if I'm going to be effective.) Since this merge period has appeared particularly frazzling for Andrew, I've been keeping silent and waiting for him to get to a point where there's a breather. I didn't feel it would be polite to request yet more work out of him while he had a mess on his hands. But, given this has come to a head, I'm asking now. Andrew? You've always given the impression that you want this run more as an engineering effort than an artistic endeavour, so help us out here. What are your concerns with swap prefetch? What sort of comparative data would you like to see to justify its inclusion, or to prove that it's not needed? Or are we reading too much into the fact that it isn't merged? In short, communicate please, it will help. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-24 4:53 ` Ray Lee @ 2007-07-24 5:10 ` Jeremy Fitzhardinge 2007-07-24 5:18 ` Ray Lee 2007-07-24 5:16 ` Nick Piggin 2007-07-24 5:18 ` Andrew Morton 2 siblings, 1 reply; 535+ messages in thread From: Jeremy Fitzhardinge @ 2007-07-24 5:10 UTC (permalink / raw) To: Ray Lee Cc: Nick Piggin, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Ray Lee wrote: > That said, I'm willing to run my day to day life through both a swap > prefetch kernel and a normal one. *However*, before I go through all > the work of instrumenting the damn thing, I'd really like Andrew (or > Linus) to lay out his acceptance criteria on the feature. Exactly what > *should* I be paying attention to? I've suggested keeping track of > process swapin delay total time, and comparing with and without. Is > that reasonable? Is it incomplete? Um, isn't it up to you? The questions that need to be answered are: 1. What are you trying to achieve? Presumably you have some intended or desired effect you're trying to get. What's the intended audience? Who would be expected to see a benefit? Who suffers? 2. How does the code achieve that end? Is it nasty or nice? Has everyone who's interested in the affected areas at least looked at the changes, or ideally given them a good review? Does it need lots of tunables, or is it set-and-forget? 3. Does it achieve the intended end? Numbers are helpful here. 4. Does it make anything worse? A lot or a little? Rare corner cases, or a real world usage? Again, numbers make the case most strongly. I can't say I've been following this particular feature very closely, but these are the fundamental questions that need to be dealt with in merging any significant change. And as Nick says, historically point 4 is very important in VM tuning changes, because "obvious" improvements have often ended up giving pathologically bad results on unexpected workloads. 
J ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-24 5:10 ` Jeremy Fitzhardinge @ 2007-07-24 5:18 ` Ray Lee 0 siblings, 0 replies; 535+ messages in thread From: Ray Lee @ 2007-07-24 5:18 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Nick Piggin, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 7/23/07, Jeremy Fitzhardinge <jeremy@goop.org> wrote: > Ray Lee wrote: > > That said, I'm willing to run my day to day life through both a swap > > prefetch kernel and a normal one. *However*, before I go through all > > the work of instrumenting the damn thing, I'd really like Andrew (or > > Linus) to lay out his acceptance criteria on the feature. Exactly what > > *should* I be paying attention to? I've suggested keeping track of > > process swapin delay total time, and comparing with and without. Is > > that reasonable? Is it incomplete? > > Um, isn't it up to you? Huh? I'm not Linus or Andrew, with the power to merge a patch to the 2.6 kernel, so I think that the answer to that is a really clear 'No.' > 4. Does it make anything worse? A lot or a little? Rare corner > cases, or a real world usage? Again, numbers make the case most > strongly. > > I can't say I've been following this particular feature very closely, > but these are the fundamental questions that need to be dealt with in > merging any significant change. And as Nick says, historically point 4 > is very important in VM tuning changes, because "obvious" improvements > have often ended up giving pathologically bad results on unexpected > workloads. Dude. My whole question was *what* numbers. Please go back and read it all again. Maybe I was unclear, but I really don't think so. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-24 4:53 ` Ray Lee 2007-07-24 5:10 ` Jeremy Fitzhardinge @ 2007-07-24 5:16 ` Nick Piggin 2007-07-24 16:15 ` Ray Lee 2007-07-24 5:18 ` Andrew Morton 2 siblings, 1 reply; 535+ messages in thread From: Nick Piggin @ 2007-07-24 5:16 UTC (permalink / raw) To: Ray Lee Cc: Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Ray Lee wrote: > On 7/23/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > That said, I'm willing to run my day to day life through both a swap > prefetch kernel and a normal one. *However*, before I go through all > the work of instrumenting the damn thing, I'd really like Andrew (or > Linus) to lay out his acceptance criteria on the feature. Exactly what > *should* I be paying attention to? I've suggested keeping track of > process swapin delay total time, and comparing with and without. Is > that reasonable? Is it incomplete? I don't feel it is so useful without more context. For example, in most situations where pages get pushed to swap, there will *also* be useful file backed pages being thrown out. Swap prefetch might improve the total swapin delay time very significantly but that may be just a tiny portion of the real problem. Also a random day at the desktop, it is quite a broad scope and pretty well impossible to analyse. If we can first try looking at some specific problems that are easily identified. Looking at your past email, you have a 1GB desktop system and your overnight updatedb run is causing stuff to get swapped out such that swap prefetch makes it significantly better. This is really intriguing to me, and I would hope we can start by making this particular workload "not suck" without swap prefetch (and hopefully make it even better than it currently is with swap prefetch because we'll try not to evict useful file backed pages as well). 
After that we can look at other problems that swap prefetch helps with, or think of some ways to measure your "whole day" scenario. So when/if you have time, I can cook up a list of things to monitor and possibly a patch to add some instrumentation over this updatedb run. Anyway, I realise swap prefetching has some situations where it will fundamentally outperform even the page replacement oracle. This is why I haven't asked for it to be dropped: it isn't a bad idea at all. However, if we can improve basic page reclaim where it is obviously lacking, that is always preferable. eg: being a highly speculative operation, swap prefetch is not great for power efficiency -- but we still want laptop users to have a good experience as well, right? -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-24 5:16 ` Nick Piggin @ 2007-07-24 16:15 ` Ray Lee 2007-07-24 17:46 ` [ck] " Rashkae ` (2 more replies) 0 siblings, 3 replies; 535+ messages in thread From: Ray Lee @ 2007-07-24 16:15 UTC (permalink / raw) To: Nick Piggin Cc: Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 7/23/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Ray Lee wrote: > > That said, I'm willing to run my day to day life through both a swap > > prefetch kernel and a normal one. *However*, before I go through all > > the work of instrumenting the damn thing, I'd really like Andrew (or > > Linus) to lay out his acceptance criteria on the feature. Exactly what > > *should* I be paying attention to? I've suggested keeping track of > > process swapin delay total time, and comparing with and without. Is > > that reasonable? Is it incomplete? > > I don't feel it is so useful without more context. For example, in > most situations where pages get pushed to swap, there will *also* be > useful file backed pages being thrown out. Swap prefetch might > improve the total swapin delay time very significantly but that may > be just a tiny portion of the real problem. Agreed, it's important to make sure we're not being penny-wise and pound-foolish here. > Also a random day at the desktop, it is quite a broad scope and > pretty well impossible to analyse. It is pretty broad, but that's also what swap prefetch is targeting. As for hard to analyze, I'm not sure I agree. One can black-box test this stuff with only a few controls. e.g., if I use the same apps each day (mercurial, firefox, xorg, gcc), and the total I/O wait time consistently goes down on a swap prefetch kernel (normalized by some control statistic, such as application CPU time or total I/O, or something), then that's a useful measurement. > If we can first try looking at > some specific problems that are easily identified. Always easier, true. 
Let's start with "My mouse jerks around under memory load." A Google Summer of Code student working on X.Org claims that mlocking the mouse handling routines gives a smooth cursor under load ([1]). It's surprising that the kernel would swap that out in the first place. [1] http://vignatti.wordpress.com/2007/07/06/xorg-input-thread-summary-or-something/ > Looking at your past email, you have a 1GB desktop system and your > overnight updatedb run is causing stuff to get swapped out such that > swap prefetch makes it significantly better. This is really > intriguing to me, and I would hope we can start by making this > particular workload "not suck" without swap prefetch (and hopefully > make it even better than it currently is with swap prefetch because > we'll try not to evict useful file backed pages as well). updatedb is an annoying case, because one would hope that there would be a better way to deal with that highly specific workload. It's also pretty stat-dominant, which puts it roughly in the same category as a git diff. (They differ in that updatedb does a lot of open()s and getdents on directories, git merely does a ton of lstat()s instead.) Anyway, my point is that I worry that tuning for an unusual and infrequent workload (which updatedb certainly is) is the wrong way to go. > After that we can look at other problems that swap prefetch helps > with, or think of some ways to measure your "whole day" scenario. > > So when/if you have time, I can cook up a list of things to monitor > and possibly a patch to add some instrumentation over this updatedb > run. That would be appreciated. Don't spend huge amounts of time on it, okay? Point me in the right direction, and we'll see how far I can run with it. > Anyway, I realise swap prefetching has some situations where it will > fundamentally outperform even the page replacement oracle. This is > why I haven't asked for it to be dropped: it isn't a bad idea at all. 
<nod> > However, if we can improve basic page reclaim where it is obviously > lacking, that is always preferable. eg: being a highly speculative > operation, swap prefetch is not great for power efficiency -- but we > still want laptop users to have a good experience as well, right? Absolutely. Disk I/O is the enemy, and the best I/O is one you never had to do in the first place. ^ permalink raw reply [flat|nested] 535+ messages in thread
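The black-box measurement Ray proposes above (total I/O wait time across identical daily workloads) can be sampled without any kernel instrumentation, since the aggregate `cpu` line in /proc/stat accumulates iowait time in clock ticks. A minimal sketch, not part of the original thread; the helper names are made up, but the /proc/stat field layout is the standard one:

```python
def iowait_ticks(stat_text):
    """Return cumulative iowait clock ticks from /proc/stat contents.

    The aggregate line reads: cpu user nice system idle iowait irq softirq ...
    so iowait is the fifth numeric field after the 'cpu' label."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return int(line.split()[5])
    raise ValueError("no aggregate 'cpu' line found")

def sample_iowait():
    # Sample once before and once after the measured workload; the delta
    # divided by USER_HZ (usually 100) is seconds spent waiting on I/O.
    with open("/proc/stat") as f:
        return iowait_ticks(f.read())
```

Sampling this on a swap-prefetch kernel and a plain one over comparable days, then normalizing by a control statistic as Ray suggests, gives roughly the comparison he describes.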
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-24 16:15 ` Ray Lee @ 2007-07-24 17:46 ` Rashkae 2007-07-25 4:06 ` Nick Piggin 2007-07-25 4:46 ` david 2 siblings, 0 replies; 535+ messages in thread From: Rashkae @ 2007-07-24 17:46 UTC (permalink / raw) To: Ray Lee; +Cc: ck, linux-kernel > >> However, if we can improve basic page reclaim where it is obviously >> lacking, that is always preferable. eg: being a highly speculative >> operation, swap prefetch is not great for power efficiency -- but we >> still want laptop users to have a good experience as well, right? > Sounds like something that can be altered with a tuneable for workloads where power efficiency is more important than performance. As far as performance goes, empty memory is wasted memory. I think the most important 'measurement' people can make for swap prefetch, if this is even possible to capture, is a positive hit ratio. Under everyman's typical workload, what percentage of pages prefetched end up being used? And what percentage end up discarded? I'm pulling these numbers out of thin air, but I would say, if > 10% is referenced, and < 70% discarded, then that would be a significant performance boost, well worthwhile. To be clear, I don't know what I'm talking about. It just seems to me, however, that debating whether or not to implement a performance boost because we can better tune corner cases is silly. For as long as computers have used swap and unused memory, there will be a performance gain to background prefetching. That doesn't preclude developers from tuning the specific workloads that lead to such swapping. It's not as if this were a theoretical discussion of whether we should develop this or not. Prefetch is here, now, and working. The only questions I see are: Does the performance gain from Prefetch compensate for the prefetch code memory requirements? Is there someone who's comfortable with lkml politics willing to maintain the thing? ^ permalink raw reply [flat|nested] 535+ messages in thread
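The swap prefetch patch in -mm does not export hit/discard counters, so the ratio Rashkae asks for cannot be captured today. Assuming such counters existed, though, his rule of thumb reduces to a trivial check; the function name and thresholds below are hypothetical, taken from the mail's admittedly thin-air figures:

```python
def prefetch_worthwhile(prefetched, referenced, discarded,
                        min_referenced=0.10, max_discarded=0.70):
    """Rashkae's rule of thumb: prefetching pays off if more than
    min_referenced of the prefetched pages are later used and fewer than
    max_discarded are reclaimed untouched. Counter names are hypothetical;
    the kernel patch exports no such statistics."""
    if prefetched == 0:
        return False
    return (referenced / prefetched > min_referenced
            and discarded / prefetched < max_discarded)
```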
* Re: -mm merge plans for 2.6.23 2007-07-24 16:15 ` Ray Lee 2007-07-24 17:46 ` [ck] " Rashkae @ 2007-07-25 4:06 ` Nick Piggin 2007-07-25 4:55 ` Rene Herman ` (4 more replies) 2007-07-25 4:46 ` david 2 siblings, 5 replies; 535+ messages in thread From: Nick Piggin @ 2007-07-25 4:06 UTC (permalink / raw) To: Ray Lee Cc: Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Ray Lee wrote: > On 7/23/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: >> Also a random day at the desktop, it is quite a broad scope and >> pretty well impossible to analyse. > > > It is pretty broad, but that's also what swap prefetch is targeting. > As for hard to analyze, I'm not sure I agree. One can black-box test > this stuff with only a few controls. e.g., if I use the same apps each > day (mercurial, firefox, xorg, gcc), and the total I/O wait time > consistently goes down on a swap prefetch kernel (normalized by some > control statistic, such as application CPU time or total I/O, or > something), then that's a useful measurement. I'm not saying that we can't try to tackle that problem, but first of all you have a really nice narrow problem where updatedb seems to be causing the kernel to completely do the wrong thing. So we start on that. >> If we can first try looking at >> some specific problems that are easily identified. > > > Always easier, true. Let's start with "My mouse jerks around under > memory load." A Google Summer of Code student working on X.Org claims > that mlocking the mouse handling routines gives a smooth cursor under > load ([1]). It's surprising that the kernel would swap that out in the > first place. > > [1] > http://vignatti.wordpress.com/2007/07/06/xorg-input-thread-summary-or-something/ OK, I'm not sure what the point is though. Under heavy memory load, things are going to get swapped out... and swap prefetch isn't going to help there (at least, not during the memory load). 
There are also other issues like whether the CPU scheduler is at fault, etc. Interactive workloads are always the hardest to work out. updatedb is a walk in the park by comparison. >> Looking at your past email, you have a 1GB desktop system and your >> overnight updatedb run is causing stuff to get swapped out such that >> swap prefetch makes it significantly better. This is really >> intriguing to me, and I would hope we can start by making this >> particular workload "not suck" without swap prefetch (and hopefully >> make it even better than it currently is with swap prefetch because >> we'll try not to evict useful file backed pages as well). > > > updatedb is an annoying case, because one would hope that there would > be a better way to deal with that highly specific workload. It's also > pretty stat dominant, which puts it roughly in the same category as a > git diff. (They differ in that updatedb does a lot of open()s and > getdents on directories, git merely does a ton of lstat()s instead.) Yeah, and I suspect we might be able to do better use-once of inode and dentry caches. It isn't really highly specific: lots of things tend to just scan over a few files once -- updatedb just scans a lot so the problem becomes more noticeable. > Anyway, my point is that I worry that tuning for an unusual and > infrequent workload (which updatedb certainly is), is the wrong way to > go. Well it runs every day or so for every desktop Linux user, and it has similarities with other workloads. We don't want to optimise it at the expense of other things, but it _really_ should not be pushing a 1-2GB desktop into swap, I don't think. >> After that we can look at other problems that swap prefetch helps >> with, or think of some ways to measure your "whole day" scenario. >> >> So when/if you have time, I can cook up a list of things to monitor >> and possibly a patch to add some instrumentation over this updatedb >> run. > > > That would be appreciated. 
Don't spend huge amounts of time on it, > okay? Point me the right direction, and we'll see how far I can run > with it. I guess /proc/meminfo, /proc/zoneinfo, /proc/vmstat, /proc/slabinfo before and after the updatedb run with the latest kernel would be a first step. top and vmstat output during the run wouldn't hurt either. Thanks, Nick -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
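Nick's first step can be scripted so the before/after state is easy to diff later. A sketch, not posted in the thread; the file list is the one Nick names, and everything else is an assumption:

```python
import os
import time

PROC_FILES = ["/proc/meminfo", "/proc/zoneinfo", "/proc/vmstat", "/proc/slabinfo"]

def snapshot(paths, dest_root, tag):
    """Copy each pseudo-file into dest_root/<tag>-<epoch>/ and return that dir.

    /proc files report st_size == 0, so read them whole instead of relying
    on size-driven copy helpers."""
    dest = os.path.join(dest_root, "%s-%d" % (tag, int(time.time())))
    os.makedirs(dest, exist_ok=True)
    for path in paths:
        try:
            with open(path) as src:
                data = src.read()
        except OSError:
            continue  # e.g. /proc/slabinfo may need root on some kernels
        with open(os.path.join(dest, os.path.basename(path)), "w") as out:
            out.write(data)
    return dest

# Intended use around the updatedb run:
#   snapshot(PROC_FILES, "/tmp", "before"); <run updatedb>; snapshot(PROC_FILES, "/tmp", "after")
```

top and vmstat output during the run, as Nick suggests, would still be collected separately.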
* Re: -mm merge plans for 2.6.23 2007-07-25 4:06 ` Nick Piggin @ 2007-07-25 4:55 ` Rene Herman 2007-07-25 5:00 ` Nick Piggin ` (2 more replies) 2007-07-25 6:09 ` [ck] " Matthew Hawkins ` (3 subsequent siblings) 4 siblings, 3 replies; 535+ messages in thread From: Rene Herman @ 2007-07-25 4:55 UTC (permalink / raw) To: Nick Piggin Cc: Ray Lee, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 07/25/2007 06:06 AM, Nick Piggin wrote: > Ray Lee wrote: >> Anyway, my point is that I worry that tuning for an unusual and >> infrequent workload (which updatedb certainly is), is the wrong way to >> go. > > Well it runs every day or so for every desktop Linux user, and it has > similarities with other workloads. It certainly doesn't run for me ever. Always kind of a "that's not the point" comment but I just keep wondering whenever I see anyone complain about updatedb why the _hell_ they are running it in the first place. If anyone who never uses "locate" for anything simply disables updatedb, the problem will for a large part be solved. This is not just meant as a cheap comment; while I can think of a few similar loads even on the desktop (scanning a browser cache, a media player indexing a large amount of media files, ...) I've never heard of problems _other_ than updatedb. So just junk that crap and be happy. Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 4:55 ` Rene Herman @ 2007-07-25 5:00 ` Nick Piggin 2007-07-25 5:12 ` david 2007-07-25 5:30 ` Eric St-Laurent 2 siblings, 0 replies; 535+ messages in thread From: Nick Piggin @ 2007-07-25 5:00 UTC (permalink / raw) To: Rene Herman Cc: Ray Lee, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Rene Herman wrote: > On 07/25/2007 06:06 AM, Nick Piggin wrote: > >> Ray Lee wrote: > > >>> Anyway, my point is that I worry that tuning for an unusual and >>> infrequent workload (which updatedb certainly is), is the wrong way >>> to go. >> >> >> Well it runs every day or so for every desktop Linux user, and it has >> similarities with other workloads. > > > It certainly doesn't run for me ever. Always kind of a "that's not the > point" comment but I just keep wondering whenever I see anyone complain > about updatedb why the _hell_ they are running it in the first place. If > anyone who never uses "locate" for anything simply disables updatedb, the > problem will for a large part be solved. > > This is not just meant as a cheap comment; while I can think of a few > similar loads even on the desktop (scanning a browser cache, a media > player indexing a large amount of media files, ...) I've never heard of > problems _other_ than updatedb. So just junk that crap and be happy. OK, fair point, but the counterpoint is that there are real patterns that just use-once a lot of metadata (ls, for example; grep, even). -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 4:55 ` Rene Herman 2007-07-25 5:00 ` Nick Piggin @ 2007-07-25 5:12 ` david 2007-07-25 5:30 ` Rene Herman 2007-07-25 5:30 ` Eric St-Laurent 2 siblings, 1 reply; 535+ messages in thread From: david @ 2007-07-25 5:12 UTC (permalink / raw) To: Rene Herman Cc: Nick Piggin, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Wed, 25 Jul 2007, Rene Herman wrote: > On 07/25/2007 06:06 AM, Nick Piggin wrote: > >> Ray Lee wrote: > >> > Anyway, my point is that I worry that tuning for an unusual and >> > infrequent workload (which updatedb certainly is), is the wrong way to >> > go. >> >> Well it runs every day or so for every desktop Linux user, and it has >> similarities with other workloads. > > It certainly doesn't run for me ever. Always kind of a "that's not the point" > comment but I just keep wondering whenever I see anyone complain about > updatedb why the _hell_ they are running it in the first place. If anyone who > never uses "locate" for anything simply disables updatedb, the problem will > for a large part be solved. > > This is not just meant as a cheap comment; while I can think of a few similar > loads even on the desktop (scanning a browser cache, a media player indexing > a large amount of media files, ...) I've never heard of problems _other_ than > updatedb. So just junk that crap and be happy. but if you do use locate then the alternative becomes sitting around and waiting for find to complete on a regular basis. David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 5:12 ` david @ 2007-07-25 5:30 ` Rene Herman 2007-07-25 5:51 ` david ` (2 more replies) 0 siblings, 3 replies; 535+ messages in thread From: Rene Herman @ 2007-07-25 5:30 UTC (permalink / raw) To: david Cc: Nick Piggin, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 07/25/2007 07:12 AM, david@lang.hm wrote: > On Wed, 25 Jul 2007, Rene Herman wrote: >> It certainly doesn't run for me ever. Always kind of a "that's not the >> point" comment but I just keep wondering whenever I see anyone >> complain about updatedb why the _hell_ they are running it in the >> first place. If anyone who never uses "locate" for anything simply >> disables updatedb, the problem will for a large part be solved. >> >> This is not just meant as a cheap comment; while I can think of a few >> similar loads even on the desktop (scanning a browser cache, a media >> player indexing a large amount of media files, ...) I've never heard >> of problems _other_ than updatedb. So just junk that crap and be happy. > > but if you do use locate then the alternative becomes sitting around and > waiting for find to complete on a regular basis. Yes, but what's locate's usage scenario? I've never, ever wanted to use it. When do you know the name of something but not where it's located, other than situations which "which" wouldn't cover, and after just having installed/unpacked something, meaning locate doesn't know about it yet either? Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 5:30 ` Rene Herman @ 2007-07-25 5:51 ` david 2007-07-25 7:14 ` Valdis.Kletnieks 2007-07-25 16:02 ` Ray Lee 2 siblings, 0 replies; 535+ messages in thread From: david @ 2007-07-25 5:51 UTC (permalink / raw) To: Rene Herman Cc: Nick Piggin, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Wed, 25 Jul 2007, Rene Herman wrote: > On 07/25/2007 07:12 AM, david@lang.hm wrote: > >> On Wed, 25 Jul 2007, Rene Herman wrote: > >> > It certainly doesn't run for me ever. Always kind of a "that's not the >> > point" comment but I just keep wondering whenever I see anyone complain >> > about updatedb why the _hell_ they are running it in the first place. If >> > anyone who never uses "locate" for anything simply disables updatedb, the >> > problem will for a large part be solved. >> > >> > This is not just meant as a cheap comment; while I can think of a few >> > similar loads even on the desktop (scanning a browser cache, a media >> > player indexing a large amount of media files, ...) I've never heard of >> > problems _other_ than updatedb. So just junk that crap and be happy. >> >> but if you do use locate then the alternative becomes sitting around and >> waiting for find to complete on a regular basis. > > Yes, but what's locate's usage scenario? I've never, ever wanted to use it. > When do you know the name of something but not where it's located, other than > situations which "which" wouldn't cover and after just having > installed/unpacked something meaning locate doesn't know about it yet either? which only finds executables that are in the path. I commonly use locate to find config files (or sample config files) for packages that were installed at some point in the past with fairly default configs and now I want to go and tweak them. 
so I start reading documentation and then need to find out where $distro moved the files to this release (I commonly am working on machines with over a half dozen different distro releases, and none of them RedHat) David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 5:30 ` Rene Herman 2007-07-25 5:51 ` david @ 2007-07-25 7:14 ` Valdis.Kletnieks 2007-07-25 8:18 ` Rene Herman 2007-07-25 16:02 ` Ray Lee 2 siblings, 1 reply; 535+ messages in thread From: Valdis.Kletnieks @ 2007-07-25 7:14 UTC (permalink / raw) To: Rene Herman Cc: david, Nick Piggin, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1659 bytes --] On Wed, 25 Jul 2007 07:30:37 +0200, Rene Herman said: > Yes, but what's locate's usage scenario? I've never, ever wanted to use it. > When do you know the name of something but not where it's located, other > than situations which "which" wouldn't cover and after just having > installed/unpacked something meaning locate doesn't know about it yet either? My favorite use - with 5 Fedora kernels and as many -mm kernels on my laptop, doing a 'locate moby' finds all the moby.c and moby.o and moby.ko for the various releases. For bonus points, something like: ls -lt `locate iwl3945.ko` to find all 19 copies that are on my system, and remind me which ones were compiled when. Or just when you remember the name of some one-off 100-line Perl program that you wrote 6 months ago, but not sure which directory you left it in... ;) You want hard numbers? Here you go - 'locate' versus 'find' (/usr/src/ has about 290K files on it): % strace locate iwl3945.ko >| /tmp/foo3 2>&1 % wc /tmp/foo3 96 592 6252 /tmp/foo3 % strace find /usr/src /lib -name iwl3945.ko >| /tmp/foo4 2>&1 % wc /tmp/foo4 328380 1550032 15708205 /tmp/foo4 # echo 1 > /proc/sys/vm/drop_caches (to empty the caches) % time locate iwl3945.ko > /dev/null real 0m0.872s user 0m0.867s sys 0m0.008s % time find /usr/src /lib -name iwl3945.ko > /dev/null find: /usr/src/lost+found: Permission denied real 1m12.241s user 0m1.128s sys 0m3.566s So 96 system calls in 1 second, against 328K calls in a minute. There's your use case, right there. 
Now if we can just find a way for that find/updatedb to not be as painful to the rest of the system..... [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 7:14 ` Valdis.Kletnieks @ 2007-07-25 8:18 ` Rene Herman 2007-07-25 8:28 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Rene Herman @ 2007-07-25 8:18 UTC (permalink / raw) To: Valdis.Kletnieks Cc: david, Nick Piggin, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 07/25/2007 09:14 AM, Valdis.Kletnieks@vt.edu wrote: > On Wed, 25 Jul 2007 07:30:37 +0200, Rene Herman said: > >> Yes, but what's locate's usage scenario? I've never, ever wanted to use >> it. When do you know the name of something but not where it's located, >> other than situations which "which" wouldn't cover and after just >> having installed/unpacked something meaning locate doesn't know about >> it yet either? > > My favorite use - with 5 Fedora kernels and as many -mm kernels on my > laptop, doing a 'locate moby' finds all the moby.c and moby.o and moby.ko > for the various releases. Supposing you know the path in one tree, you know the path in all of them, right? :-? > You want hard numbers? Here you go - 'locate' versus 'find' These are of course not necessary. If you discount the time updatedb itself takes, it's utterly obvious that _if_ you use it, it's going to be wildly faster than find. Regardless, I'll stand by "[by disabling updatedb] the problem will for a large part be solved" as I expect approximately 94.372 percent of Linux desktop users couldn't care less about locate. Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 8:18 ` Rene Herman @ 2007-07-25 8:28 ` Ingo Molnar 2007-07-25 8:43 ` Rene Herman 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-25 8:28 UTC (permalink / raw) To: Rene Herman Cc: Valdis.Kletnieks, david, Nick Piggin, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Paul Jackson, linux-mm, linux-kernel * Rene Herman <rene.herman@gmail.com> wrote: > Regardless, I'll stand by "[by disabling updatedb] the problem will > for a large part be solved" as I expect approximately 94.372 percent > of Linux desktop users couldn't care less about locate. i think that approach is illogical: because Linux mis-handled a mixed workload the answer is to ... remove a portion of that workload? To bring your approach to the extreme: what if Linux sucked at running more than two CPU-intense tasks at once. Most desktop users don't do that, so probably more than 94.372 percent of Linux desktop users couldn't care less about a proper scheduler. Still, anyone who builds a kernel (the average desktop user won't do that) while using firefox will attest to the fact that it's quite handy that the Linux scheduler can handle mixed workloads pretty well. now, it might be the case that this mixed VM/VFS workload cannot be handled any more intelligently - but that wasn't your argument! The swap-prefetch patch certainly tried to do things more intelligently and the test-case (measurement app) Con provided showed visible improvements in swap-in latency. (and a good number of people posted those results) Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 8:28 ` Ingo Molnar @ 2007-07-25 8:43 ` Rene Herman 2007-07-25 11:34 ` Ingo Molnar [not found] ` <5c77e14b0707250353r48458316x5e6adde6dbce1fbd@mail.gmail.com> 0 siblings, 2 replies; 535+ messages in thread From: Rene Herman @ 2007-07-25 8:43 UTC (permalink / raw) To: Ingo Molnar Cc: Valdis.Kletnieks, david, Nick Piggin, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Paul Jackson, linux-mm, linux-kernel On 07/25/2007 10:28 AM, Ingo Molnar wrote: >> Regardless, I'll stand by "[by disabling updatedb] the problem will >> for a large part be solved" as I expect approximately 94.372 percent >> of Linux desktop users couldn't care less about locate. > > i think that approach is illogical: because Linux mis-handled a mixed > workload the answer is to ... remove a portion of that workload? No. It got snipped but I introduced the comment by saying it was a "that's not the point" kind of thing. Sometimes things that aren't the point are still true though and in the case of Linux desktop users complaining about updatedb runs, a comment that says that for many an obvious solution would be to stop running the damned thing is not in any sense illogical. Also note I'm not against swap prefetch or anything. I don't use it and do not believe I have a pressing need for it, but do suspect it has potential to make quite a bit of difference on some things -- if only to drastically reduce seeks if it means it's swapping in larger chunks than a randomly faulting program would. Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 8:43 ` Rene Herman @ 2007-07-25 11:34 ` Ingo Molnar 2007-07-25 11:40 ` Rene Herman ` (2 more replies) [not found] ` <5c77e14b0707250353r48458316x5e6adde6dbce1fbd@mail.gmail.com> 1 sibling, 3 replies; 535+ messages in thread From: Ingo Molnar @ 2007-07-25 11:34 UTC (permalink / raw) To: Rene Herman Cc: Valdis.Kletnieks, david, Nick Piggin, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Paul Jackson, linux-mm, linux-kernel * Rene Herman <rene.herman@gmail.com> wrote: > On 07/25/2007 10:28 AM, Ingo Molnar wrote: > > >>Regardless, I'll stand by "[by disabling updatedb] the problem will > >>for a large part be solved" as I expect approximately 94.372 percent > >>of Linux desktop users couldn't care less about locate. > > > > i think that approach is illogical: because Linux mis-handled a > > mixed workload the answer is to ... remove a portion of that > > workload? > > No. It got snipped but I introduced the comment by saying it was a > "that's not the point" kind of thing. [...] ok - with that qualification i understand. still, especially for someone like me who frequently deals with source code, 'locate' is indispensable. and the fact is: updatedb discards a considerable portion of the cache completely unnecessarily: on a reasonably complex box no way do all the inodes and dentries fit into all of RAM, so we just trash everything. Maybe the kernel could be extended with a method of opening files in a 'drop from the dcache after use' way. (beagled and backup tools could make use of that facility too.) (Or some other sort of file-cache-invalidation syscall that already exists, which would _also_ result in the immediate zapping of the dentry+inode from the dcache.) Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
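For the page-cache half of what Ingo describes, posix_fadvise(2) with POSIX_FADV_DONTNEED is the closest existing interface: it lets a scanner drop a file's cached pages after use, though it does not evict the dentry or inode, which is his main complaint. A sketch of how an updatedb-style tool could read files this way, using Python's binding of the same syscall; the helper name is made up:

```python
import os

def read_without_caching(path, chunk=1 << 20):
    """Read a whole file, then advise the kernel that its page-cache pages
    will not be needed again, so they are reclaimed ahead of useful data.
    Hypothetical helper: POSIX_FADV_DONTNEED drops clean cached pages only
    and leaves the dentry/inode caches untouched."""
    data = bytearray()
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            buf = os.read(fd, chunk)
            if not buf:
                break
            data.extend(buf)
        if hasattr(os, "posix_fadvise"):  # Linux; absent on some platforms
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return bytes(data)
```

An 'immediately zap the dentry and inode too' variant would indeed need the new kernel interface Ingo is speculating about.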
* Re: -mm merge plans for 2.6.23 2007-07-25 11:34 ` Ingo Molnar @ 2007-07-25 11:40 ` Rene Herman 2007-07-25 11:50 ` Ingo Molnar 2007-07-25 16:08 ` Valdis.Kletnieks 2007-07-25 22:05 ` Paul Jackson 2 siblings, 1 reply; 535+ messages in thread From: Rene Herman @ 2007-07-25 11:40 UTC (permalink / raw) To: Ingo Molnar Cc: Valdis.Kletnieks, david, Nick Piggin, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Paul Jackson, linux-mm, linux-kernel On 07/25/2007 01:34 PM, Ingo Molnar wrote: > and the fact is: updatedb discards a considerable portion of the cache > completely unnecessarily: on a reasonably complex box no way do all the > inodes and dentries fit into all of RAM, so we just trash everything. Okay, but unless I've now managed to really quite horribly confuse myself, that wouldn't have anything to do with _swap_ prefetch would it? Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 11:40 ` Rene Herman @ 2007-07-25 11:50 ` Ingo Molnar 0 siblings, 0 replies; 535+ messages in thread From: Ingo Molnar @ 2007-07-25 11:50 UTC (permalink / raw) To: Rene Herman Cc: Valdis.Kletnieks, david, Nick Piggin, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Paul Jackson, linux-mm, linux-kernel * Rene Herman <rene.herman@gmail.com> wrote: > > and the fact is: updatedb discards a considerable portion of the > > cache completely unnecessarily: on a reasonably complex box no way > > do all the inodes and dentries fit into all of RAM, so we just trash > > everything. > > Okay, but unless I've now managed to really quite horribly confuse > myself, that wouldn't have anything to do with _swap_ prefetch would > it? it's connected: it would remove updatedb from the VM picture altogether. (updatedb would just cycle through the files, leaving minimal cache disturbance.) hence swap-prefetch could concentrate on the cases where it makes sense to start swap prefetching _without_ destroying other, already cached content: such as when a large app exits and frees gobs of memory back into the buddy allocator. _That_ would be a definitive "no costs and side-effects" point for swap-prefetch to kick in, and it would eliminate this pretty artificial (and unnecessary) 'desktop versus server' controversy and would turn it into a 'helps everyone' feature. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 11:34 ` Ingo Molnar 2007-07-25 11:40 ` Rene Herman @ 2007-07-25 16:08 ` Valdis.Kletnieks 2007-07-25 22:05 ` Paul Jackson 2 siblings, 0 replies; 535+ messages in thread From: Valdis.Kletnieks @ 2007-07-25 16:08 UTC (permalink / raw) To: Ingo Molnar Cc: Rene Herman, david, Nick Piggin, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Paul Jackson, linux-mm, linux-kernel [-- Attachment #1: Type: text/plain, Size: 860 bytes --] On Wed, 25 Jul 2007 13:34:01 +0200, Ingo Molnar said: > Maybe the kernel could be extended with a method of opening files in a > 'drop from the dcache after use' way. (beagled and backup tools could > make use of that facility too.) (Or some other sort of > file-cache-invalidation syscall that already exist, which would _also_ > result in the immediate zapping of the dentry+inode from the dcache.) The semantic that would benefit my work patterns the most would not be "immediate zapping" - I have 2G of RAM, so often there's no memory pressure, and often a 'find' will be followed by another similar 'find' that will hit a lot of the same dentries and inodes, so may as well save them if we can. Flagging it as "the first to be heaved over the side the instant there *is* pressure" would suit just fine. Or is that the semantic you actually meant? [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 11:34 ` Ingo Molnar 2007-07-25 11:40 ` Rene Herman 2007-07-25 16:08 ` Valdis.Kletnieks @ 2007-07-25 22:05 ` Paul Jackson 2007-07-25 22:22 ` Zan Lynx ` (3 more replies) 2 siblings, 4 replies; 535+ messages in thread From: Paul Jackson @ 2007-07-25 22:05 UTC (permalink / raw) To: Ingo Molnar Cc: rene.herman, Valdis.Kletnieks, david, nickpiggin, ray-lk, jesper.juhl, akpm, ck, linux-mm, linux-kernel > and the fact is: updatedb discards a considerable portion of the cache > completely unnecessarily: on a reasonably complex box no way do all the I'm wondering how much of this updatedb problem is due to poor layout of swap and other file systems across disk spindles. I'll wager that those most impacted by updatedb have just one disk. I have the following three boxes - three different setups, each with different updatedb behaviour: The first box, with 1 GB ram, becomes dog slow as soon as it breathes on the swap device. Updatedb and backups are painful intrusions on any interactive work on that system. I sometimes wait a half minute for a response from an interactive application anytime it has to go to disk. This box has a single disk spindle, on an old cheap slow disk, with swap on the opposite end of the disk from root and the main usr partition. It's a worst case disk seek test device. The second box, also with 1 GB ram, has multiple disk spindles, and swap on its own spindle. I can still notice updatedb and backup, but it's far far less painful. The third box has dual CPU cores and 4 GB ram. Updatedb runs over the entire system in perhaps 30 seconds with no perceptible impact at all on interactive uses. Everything is still in memory from the previous updatedb run; the disk is just used to write out new stuff. Swap is never used on this (sweet) rig. 
I'd think that prefetch would help in the single disk spindle configuration, because it does the swap accesses separately, instead of intermingling them with root or usr partition accesses, which would require a lot of disk head seeking. Pretty much anytime that ordinary desktop users complain about performance as much as they have about this one, it's either disk head seeks or network delays. Nothing else is -that- slow, to be so noticeable to so many users just doing ordinary work. Question: Could those who have found this prefetch helps them a lot say how many disks they have? In particular, is their swap on the same disk spindle as their root and user files? Answer - for me: On my system where updatedb is a big problem, I have one, slow, disk. On my system where updatedb is a small problem, swap is on a separate spindle. On my system where updatedb is -no- problem, I have so much memory I never use swap. I'd expect the laptop crowd to mostly have a single, slow, disk, and hence to find updatedb more painful. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 22:05 ` Paul Jackson @ 2007-07-25 22:22 ` Zan Lynx 2007-07-25 22:27 ` Jesper Juhl ` (2 subsequent siblings) 3 siblings, 0 replies; 535+ messages in thread From: Zan Lynx @ 2007-07-25 22:22 UTC (permalink / raw) To: Paul Jackson Cc: Ingo Molnar, rene.herman, Valdis.Kletnieks, david, nickpiggin, ray-lk, jesper.juhl, akpm, ck, linux-mm, linux-kernel [-- Attachment #1: Type: text/plain, Size: 953 bytes --] On Wed, 2007-07-25 at 15:05 -0700, Paul Jackson wrote: [snip] > Question: > Could those who have found this prefetch helps them alot say how > many disks they have? In particular, is their swap on the same > disk spindle as their root and user files? > > Answer - for me: > On my system where updatedb is a big problem, I have one, slow, disk. > On my system where updatedb is a small problem, swap is on a separate > spindle. > On my system where updatedb is -no- problem, I have so much memory > I never use swap. > > I'd expect the laptop crowd to mostly have a single, slow, disk, and > hence to find updatedb more painful. A well done swap-to-flash would help here. I sometimes do it anyway to a 4GB CF card but I can tell it's hitting the read/update/write cycles on the flash blocks. The sad thing is that it is still a speed improvement over swapping to laptop disk. -- Zan Lynx <zlynx@acm.org> [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 22:05 ` Paul Jackson 2007-07-25 22:22 ` Zan Lynx @ 2007-07-25 22:27 ` Jesper Juhl 2007-07-25 22:28 ` [ck] " Michael Chang 2007-07-25 23:45 ` André Goddard Rosa 3 siblings, 0 replies; 535+ messages in thread
From: Jesper Juhl @ 2007-07-25 22:27 UTC (permalink / raw)
To: Paul Jackson
Cc: Ingo Molnar, rene.herman, Valdis.Kletnieks, david, nickpiggin, ray-lk, akpm, ck, linux-mm, linux-kernel

On 26/07/07, Paul Jackson <pj@sgi.com> wrote:
> > and the fact is: updatedb discards a considerable portion of the cache
> > completely unnecessarily: on a reasonably complex box no way do all the
>
> I'm wondering how much of this updatedb problem is due to poor layout
> of swap and other file systems across disk spindles.
>
> I'll wager that those most impacted by updatedb have just one disk.
[snip]
> Question:
> Could those who have found this prefetch helps them a lot say how
> many disks they have? In particular, is their swap on the same
> disk spindle as their root and user files?

Swap prefetch helps me. In my case I have a single (10K RPM, Ultra 160 SCSI) disk.

# fdisk -l /dev/sda

Disk /dev/sda: 36.7 GB, 36703918080 bytes
255 heads, 63 sectors/track, 4462 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1         974     7823623+  83  Linux
/dev/sda2             975        1218     1959930   83  Linux
/dev/sda3            1219        1341      987997+  82  Linux swap
/dev/sda4            1342        4462    25069432+  83  Linux

sda1 is "/", sda2 is "/usr/local/" and sda4 is "/home/".

But I don't think updatedb is the problem, at least not just updatedb on its own. My machine has 2GB of RAM, so a single updatedb run on its own will not cause it to start swapping, but it does eat up a chunk of memory, no doubt about that. The problem with updatedb is simply that it can be a contributing factor to stuff being swapped out, but any memory-hungry application can do that - just try building an allyesconfig kernel and see how much the linker eats towards the end.
What swap prefetch helps is not updatedb specifically. In my experience it helps any case where you have applications running, then start some memory-hungry job that runs for a limited time, pushes the previously started apps out to swap and then dies (like updatedb or a compile job). Without swap prefetch, those apps that were pushed to swap won't be brought back in before they are used (at which time the user is going to have to sit there and wait for them). With swap prefetch, the apps that got swapped out will slowly make their way back once the memory-hungry app has died and will then be fully or partly back in memory when the user comes back to them. That's how swap prefetch helps; it's got nothing to do with updatedb as such - at least not as I see it.

-- 
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
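The scenario Jesper describes can be sketched as a toy simulation. This is an illustrative model only, not the kernel's implementation: the page names, RAM size, and eviction order are all invented for the example. A short-lived memory hog evicts the desktop apps' pages to swap; once the hog dies, an idle-time prefetcher copies them back into the now-free RAM, so the user's next access is a hit instead of a major fault:

```python
# Toy model: RAM holds a fixed number of pages; a memory hog evicts the
# coldest resident pages to swap, then exits; the prefetcher reads the
# swapped pages back while the system is idle.
RAM_PAGES = 4
ram = ["firefox", "editor", "shell", "mailer"]   # resident, coldest first
swap = []                                        # pages evicted to swap

def run_memory_hog(pages_needed):
    """Evict coldest pages to make room, use the RAM, then die."""
    while len(ram) > RAM_PAGES - pages_needed:
        swap.append(ram.pop(0))          # eviction writes the page out

def prefetch_when_idle():
    """Copy swapped pages back into free RAM before anyone asks."""
    while swap and len(ram) < RAM_PAGES:
        ram.append(swap.pop())

def touch(page):
    """A resident page is a fast hit; otherwise it's a major fault."""
    return "hit" if page in ram else "major fault"

run_memory_hog(3)                        # e.g. updatedb or a big link step
without_prefetch = touch("firefox")      # user returns immediately: fault
prefetch_when_idle()                     # hog has died, system went idle
with_prefetch = touch("firefox")         # page was brought back in advance
print(without_prefetch, "->", with_prefetch)
```

The point of the model matches the post: prefetch does nothing for updatedb itself; it only shortens the wait after any transient memory-hungry job has come and gone.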
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-25 22:05 ` Paul Jackson 2007-07-25 22:22 ` Zan Lynx 2007-07-25 22:27 ` Jesper Juhl @ 2007-07-25 22:28 ` Michael Chang 2007-07-25 23:45 ` André Goddard Rosa 3 siblings, 0 replies; 535+ messages in thread
From: Michael Chang @ 2007-07-25 22:28 UTC (permalink / raw)
To: Paul Jackson
Cc: Ingo Molnar, david, nickpiggin, Valdis.Kletnieks, ray-lk, jesper.juhl, linux-kernel, ck, linux-mm, akpm, rene.herman

On 7/25/07, Paul Jackson <pj@sgi.com> wrote:
> Question:
> Could those who have found this prefetch helps them a lot say how
> many disks they have? In particular, is their swap on the same
> disk spindle as their root and user files?

I have found that swap prefetch helped on all four of the machines I have, although the effect is more noticeable on machines with slower disks. They all have one hard disk, and root and swap were always on the same disk. I have no idea how to determine how many disk spindles they have, but since the drives are mainly low-end consumer models sold with low-end sub-$500 PCs...
-- 
Michael Chang
Please avoid sending me Word or PowerPoint attachments. Send me ODT, RTF, or HTML instead. See http://www.gnu.org/philosophy/no-word-attachments.html Thank you.
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-25 22:05 ` Paul Jackson ` (2 preceding siblings ...) 2007-07-25 22:28 ` [ck] " Michael Chang @ 2007-07-25 23:45 ` André Goddard Rosa 3 siblings, 0 replies; 535+ messages in thread
From: André Goddard Rosa @ 2007-07-25 23:45 UTC (permalink / raw)
To: Paul Jackson
Cc: Ingo Molnar, david, nickpiggin, Valdis.Kletnieks, ray-lk, jesper.juhl, linux-kernel, ck, linux-mm, akpm, rene.herman

> Question:
> Could those who have found this prefetch helps them a lot say how
> many disks they have? In particular, is their swap on the same
> disk spindle as their root and user files?
>
> Answer - for me:
> On my system where updatedb is a big problem, I have one, slow, disk.

On both desktop and laptop.

Cheers,
-- 
[]s, André Goddard
[parent not found: <5c77e14b0707250353r48458316x5e6adde6dbce1fbd@mail.gmail.com>]
* Re: [ck] Re: -mm merge plans for 2.6.23 [not found] ` <5c77e14b0707250353r48458316x5e6adde6dbce1fbd@mail.gmail.com> @ 2007-07-25 11:06 ` Nick Piggin 2007-07-25 13:30 ` Rene Herman 1 sibling, 0 replies; 535+ messages in thread From: Nick Piggin @ 2007-07-25 11:06 UTC (permalink / raw) To: Jos Poortvliet Cc: Rene Herman, Ingo Molnar, david, Valdis.Kletnieks, Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton Jos Poortvliet wrote: > Nick > has been talking about 'fixing the updatedb thing' for years now, no patch > yet. Wrong Nick, I think. First I heard about the updatedb problem was a few months ago with people saying updatedb was causing their system to swap (that is, swap prefetching helped after updatedb). I haven't been able to even try to fix it because I can't reproduce it (I'm sitting on a machine with 256MB RAM), and nobody has wanted to help me. > Besides, he won't fix OO.o nor all other userspace stuff - so > actually, > he does NOT even promise an alternative. Not that I think fixing updatedb > would be cool, btw - it sure would, but it's no reason not to include swap > prefetch - it's mostly unrelated. > > I think everyone with >1 gb ram should stop saying 'I don't need it' > because > that's obvious for that hardware. Just like ppl having a dual- or quadcore > shouldn't even talk about scheduler interactivity stuff... Actually there are people with >1GB of ram who are saying it helps. Why do you want to shut people out of the discussion? > Desktop users want it, tests show it works, there is no alternative and the > maybe-promised-one won't even fix all cornercases. It's small, mostly > selfcontained. There is a maintainer. It's been stable for a long time. > It's > been in MM for a long time. > > Yet it doesn't make it. Andrew says 'some ppl have objections' (he means > Nick) and he doesn't see an advantage in it (at least 4 gig ram, right, > Andrew?). > > Do I miss things? You could try constructively contributing? 
> Apparently, it didn't get in yet - and I find it hard to believe Andrew
> holds swapprefetch for reasons like the above. So it must be something
> else.
>
> Nick is saying tests have already proven swap prefetch to be helpful,
> that's not the problem. He calls the requirements to get in 'fuzzy'. OK.

The test I have seen is the one that forces a huge amount of memory to swap out, waits, then touches it. That speeds up, and that's fine. That's a good sanity test to ensure it is working. Beyond that there are other considerations to getting something merged.

-- 
SUSE Labs, Novell Inc.
* Re: [ck] Re: -mm merge plans for 2.6.23 [not found] ` <5c77e14b0707250353r48458316x5e6adde6dbce1fbd@mail.gmail.com> 2007-07-25 11:06 ` Nick Piggin @ 2007-07-25 13:30 ` Rene Herman 2007-07-25 13:50 ` Ingo Molnar 1 sibling, 1 reply; 535+ messages in thread
From: Rene Herman @ 2007-07-25 13:30 UTC (permalink / raw)
To: Jos Poortvliet
Cc: Ingo Molnar, david, Nick Piggin, Valdis.Kletnieks, Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton

On 07/25/2007 12:53 PM, Jos Poortvliet wrote:
> On 7/25/07, Rene Herman <rene.herman@gmail.com> wrote:
>> Also note I'm not against swap prefetch or anything. I don't use it and
>> do not believe I have a pressing need for it, but do suspect it has
>> potential to make quite a bit of difference on some things -- if only
>> to drastically reduce seeks if it means it's swapping in larger chunks
>> than a randomly faulting program would.
>
> I wonder what your hardware is. Con talked about the diff in hardware
> between most endusers and the kernel developers.

I'm afraid you will need to categorize me more as an innocent bystander than a kernel developer and, as such, I have an endusery x86 with 768M (but I still think of myself as one of the cool kids!)

> Yes, swap prefetch doesn't help if you have 3 GB ram, but it DOES do a
> lot on a 256 mb laptop...

Rather, it does not help if you are not swapping or idling. Probably largely due to me being a rather serial person, I seem to never even push my git tree from cache. Hence my belief that "I don't have a pressing need for it".

Taking a laptop as an example is interesting in itself by the way, since a spun-down disk (most applicable to laptops) is an argument against swap prefetch even when idle, and if I'm not mistaken the feature actually disables itself when the machine's set to laptop mode...

> After using OO.o, the system continues to be slow for a long time. With
> swap prefetch, it's back up to speed much faster.
> Con has shown a benchmark
> for this with speedups of 10 times and more, users mentioned they liked
> it.

After using and quitting OO.o. If you simply don't have any memory free to prefetch into, swap prefetch won't help any. The fact that it helps the case of OO.o having pushed out firefox is fairly obvious.

> Nick has been talking about 'fixing the updatedb thing' for years now, no
> patch yet. Besides, he won't fix OO.o nor all other userspace stuff - so
> actually, he does NOT even promise an alternative. Not that I think
> fixing updatedb would be cool, btw - it sure would, but it's no reason
> not to include swap prefetch - it's mostly unrelated.

Well, the trouble at least to some is that they indeed seem to be rather unrelated. Why does the updatedb run even cause swapout? (itself of course a precondition for swap prefetch to help).

> I think everyone with >1 gb ram should stop saying 'I don't need it'
> because that's obvious for that hardware. Just like ppl having a dual-
> or quadcore shouldn't even talk about scheduler interactivity stuff...

Actually, interactivity is largely about latency, and latency is largely or partly independent of CPU speed -- if something's keeping the system from scheduling for too long, it's likely that it's hogging the CPU for a fixed number of usecs, and those pass in the same amount of time on all CPUs (we hope...). But that's a tangent anyway. I'm just glad that I get to say that I believe I don't need it with my 768M!

> Apparently, it didn't get in yet - and I find it hard to believe Andrew
> holds swapprefetch for reasons like the above. So it must be something
> else.
>
> Nick is saying tests have already proven swap prefetch to be helpful,
> that's not the problem. He calls the requirements to get in 'fuzzy'. OK.
> Beer is fuzzy, do we need to offer beer to someone? If Andrew promises
> to come to FOSDEM again next year, I'll offer him a beer, if that
> helps... Anything else? A nice massage?
Personally I'd go for sexual favours directly (but then again, I always do). But please also note that I even _literally_ said above that I myself am not against swap prefetch or anything, and yet I get what appears to be an at least somewhat adversarial rant directed at me. Which in itself is fine, but not too helpful...

Nick Piggin is the person to convince it seems, and if I've read things right (I only stepped into this thing at the updatedb mention, so maybe I haven't) his main question is _why_ the hell it helps updatedb. As long as you don't know this, then even a solution that helps could be papering over a problem which you'd much rather fix at the root rather than at the symptom.

Rene.
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-25 13:30 ` Rene Herman @ 2007-07-25 13:50 ` Ingo Molnar 2007-07-25 17:33 ` Satyam Sharma 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-25 13:50 UTC (permalink / raw) To: Rene Herman Cc: Jos Poortvliet, david, Nick Piggin, Valdis.Kletnieks, Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton * Rene Herman <rene.herman@gmail.com> wrote: > Nick Piggin is the person to convince it seems and if I've read things > right (I only stepped into this thing at the updatedb mention, so > maybe I haven't) his main question is _why_ the hell it helps > updatedb. [...] btw., i'd like to make this clear: if you want stuff to go upstream, do not concentrate on 'convincing the maintainer'. Instead concentrate on understanding the _problem_, concentrate on making sure that both you and the maintainer understands the problem correctly, possibly write some testcase that clearly exposes it, and help the maintainer debug the problem. _Optionally_, if you find joy in it, you are also free to write a proposed solution for that problem and submit it to the maintainer. But a "here is a solution, take it or leave it" approach, before having communicated the problem to the maintainer and before having debugged the problem is the wrong way around. It might still work out fine if the solution is correct (especially if the patch is small and obvious), but if there are any non-trivial tradeoffs involved, or if nontrivial amount of code is involved, you might see your patch at the end of a really long (and constantly growing) waiting list of patches. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-25 13:50 ` Ingo Molnar @ 2007-07-25 17:33 ` Satyam Sharma 2007-07-25 20:35 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread
From: Satyam Sharma @ 2007-07-25 17:33 UTC (permalink / raw)
To: Ingo Molnar
Cc: Rene Herman, Jos Poortvliet, david, Nick Piggin, Valdis.Kletnieks, Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton

Hi Ingo,

[ Going off-topic, nothing related to swap/prefetch/etc. Just getting the hang of how development goes on here ... ]

On 7/25/07, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Rene Herman <rene.herman@gmail.com> wrote:
>
> > Nick Piggin is the person to convince it seems and if I've read things
> > right (I only stepped into this thing at the updatedb mention, so
> > maybe I haven't) his main question is _why_ the hell it helps
> > updatedb. [...]
>
> btw., i'd like to make this clear: if you want stuff to go upstream, do
> not concentrate on 'convincing the maintainer'.

It's not so easy or clear-cut, see below.

> Instead concentrate on understanding the _problem_,

Of course -- that's a given.

> concentrate on
> making sure that both you and the maintainer understands the problem
> correctly,

This itself may require some "convincing" to do. What if the maintainer just doesn't recognize the problem? Note that the development model here is more about the "social" thing than purely a "technical" thing. People do handwave, possibly due to innocent misunderstandings, possibly without. Often it's just a case of seeing different reasons behind the "problematic behaviour". Or it could be a case of all of the above.

> possibly write some testcase that clearly exposes it, and

Oh yes -- that'll be helpful, but definitely not necessarily a prerequisite for all issues, and then you can't even expect everybody to write or test/benchmark with testcases. (oh, btw, this is assuming you do find consensus on a testcase)

> help the maintainer debug the problem.

Umm ... well.
Should this "dance-with-the-maintainer" and all be really necessary? What you're saying is easy if a "bug" is simple and objective, with mathematically few (probably just one) possible correct solutions. Often (most often, in fact) it's a subjective issue -- could be about APIs, high level design, tradeoffs, even little implementation nits ... with one person wanting to do it one way, another thinks there's something hacky or "band-aidy" about it and a more beautiful/elegant solution exists elsewhere. I think there's a similar deadlock here (?) > _Optionally_, if you find joy in > it, you are also free to write a proposed solution for that problem Oh yes. But why "optionally"? This is *precisely* what the spirit of development in such open / distributed projects is ... unless Linux wants to die the same, slow, ivory-towered, miserable death that *BSD have. > and > submit it to the maintainer. Umm, ok ... pretty unlikely Linus or Andrew would take patches for any kernel subsystem (that isn't obvious/trivial) from anybody just like that, so you do need to Cc: the ones they trust (maintainer) to ensure they review/ack your work and pick it up. > But a "here is a solution, take it or leave it" approach, Agreed. That's definitely not the way to go. > before having > communicated the problem to the maintainer Umm, well this could depend from problem-to-problem. > and before having debugged > the problem Again, agreed -- but people can plausibly see different root causes for the same symptoms -- and different solutions. > is the wrong way around. It might still work out fine if the > solution is correct (especially if the patch is small and obvious), but > if there are any non-trivial tradeoffs involved, or if nontrivial amount > of code is involved, you might see your patch at the end of a really > long (and constantly growing) waiting list of patches. That's the whole point. 
For non-trivial / non-obvious / subjective issues, the "process" you laid out above could itself become a problem ...

Satyam
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-25 17:33 ` Satyam Sharma @ 2007-07-25 20:35 ` Ingo Molnar 2007-07-26 2:32 ` Bartlomiej Zolnierkiewicz 0 siblings, 1 reply; 535+ messages in thread
From: Ingo Molnar @ 2007-07-25 20:35 UTC (permalink / raw)
To: Satyam Sharma
Cc: Rene Herman, Jos Poortvliet, david, Nick Piggin, Valdis.Kletnieks, Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton

* Satyam Sharma <satyam.sharma@gmail.com> wrote:

> > concentrate on making sure that both you and the maintainer
> > understands the problem correctly,
>
> This itself may require some "convincing" to do. What if the
> maintainer just doesn't recognize the problem? Note that the
> development model here is more about the "social" thing than purely a
> "technical" thing. People do handwave, possibly due to innocent
> misunderstandings, possibly without. Often it's just a case of seeing
> different reasons behind the "problematic behaviour". Or it could be a
> case of all of the above.

sure - but i was really not talking from the user's perspective, but from the enterprising kernel developer's perspective who'd like to solve a particular problem. And the nice thing about concentrating on the problem: if you do that well, it does not really matter what the maintainer thinks!

( Talking to the maintainer can of course be of enormous help in the quest for understanding the problem and figuring out the best fix - the maintainer will most likely know more about the subject than yourself. More communication never hurts. It's an additional bonus if you manage to convince the maintainer to take up the matter for himself.
It's not a given right though - a maintainer's main task is to judge code that is being submitted, to keep a subsystem running smoothly and to not let it regress - but otherwise there can easily be different priorities of what tasks to tackle first, and in that sense the maintainer is just one of the many overworked kernel developers who has no real obligation what to tackle first. )

If the maintainer rejects something despite it being well-reasoned, well-researched and robustly implemented with no tradeoffs and maintenance problems at all then it's a bad maintainer. (but from all i've seen in the past few years the VM maintainers do their job pretty damn fine.) And note that i _do_ disagree with them in this particular swap-prefetch case, but still, the non-merging of swap-prefetch was not a final decision at all. It was more of a "hm, dunno, i still don't really like it - shouldn't this be done differently? Could we debug this a bit better?" reaction. Yes, it can be frustrating after more than one year.

> > possibly write some testcase that clearly exposes it, and
>
> Oh yes -- that'll be helpful, but definitely not necessarily a
> prerequisite for all issues, and then you can't even expect everybody
> to write or test/benchmark with testcases. (oh, btw, this is assuming
> you do find consensus on a testcase)

no, but Con is/was certainly more than capable to write testcases and to debug various scenarios. That's the way new maintainers are found within Linux: people take matters into their own hands and improve a subsystem so that they'll either peacefully co-work with the other maintainers or they replace them (sometimes not so peacefully - like in the IDE/SATA/PATA saga).

> > help the maintainer debug the problem.
>
> Umm ... well. Should this "dance-with-the-maintainer" and all be
> really necessary? What you're saying is easy if a "bug" is simple and
> objective, with mathematically few (probably just one) possible
> correct solutions.
> Often (most often, in fact) it's a subjective issue
> -- could be about APIs, high level design, tradeoffs, even little
> implementation nits ... with one person wanting to do it one way,
> another thinks there's something hacky or "band-aidy" about it and a
> more beautiful/elegant solution exists elsewhere. I think there's a
> similar deadlock here (?)

you don't _have to_ cooperate with the maintainer, but it's certainly useful to work with good maintainers, if your goal is to improve Linux. Or if for some reason communication is not working out fine then grow into the job and replace the maintainer by doing a better job.

> > _Optionally_, if you find joy in it, you are also free to write a
> > proposed solution for that problem
>
> Oh yes. But why "optionally"? This is *precisely* what the spirit of
> development in such open / distributed projects is ... unless Linux
> wants to die the same, slow, ivory-towered, miserable death that *BSD
> have.

perhaps you misunderstood how i meant the 'optional': it is certainly not required to write a solution for every problem you are reporting. Best-case the maintainer picks the issue up and solves it. Worst-case you get ignored. But you always have the option to take matters into your own hands and solve the problem.

> > and submit it to the maintainer.
>
> Umm, ok ... pretty unlikely Linus or Andrew would take patches for any
> kernel subsystem (that isn't obvious/trivial) from anybody just like
> that, so you do need to Cc: the ones they trust (maintainer) to ensure
> they review/ack your work and pick it up.

actually, it happens pretty frequently, and NACK-ing perfectly reasonable patches is a sure way towards getting replaced as a maintainer.

> > is the wrong way around.
> > It might still work out fine if the
> > solution is correct (especially if the patch is small and obvious),
> > but if there are any non-trivial tradeoffs involved, or if
> > nontrivial amount of code is involved, you might see your patch at
> > the end of a really long (and constantly growing) waiting list of
> > patches.
>
> That's the whole point. For non-trivial / non-obvious / subjective
> issues, the "process" you laid out above could itself become a problem
> ...

firstly, there's rarely any 'subjective' issue in maintenance decisions, even when it comes to complex patches. The 'subjective' issue becomes a factor mostly when a problem has not been researched well enough, when it becomes more of a faith thing ('i believe it helps me') than a fully fact-backed argument. Maintainers tend to dodge such issues until they become more clearly fact-backed.

providing more and more facts gradually reduces the 'judgement/taste' leeway of maintainers, down to an almost algorithmic level.

but in any case there's always the ultimate way out: prove that you can do a better job yourself and replace the maintainer. But providing an overwhelming, irresistible body of facts in favor of a patch does the trick too in 99.9% of the cases.

	Ingo
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-25 20:35 ` Ingo Molnar @ 2007-07-26 2:32 ` Bartlomiej Zolnierkiewicz 2007-07-26 4:13 ` Jeff Garzik 0 siblings, 1 reply; 535+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2007-07-26 2:32 UTC (permalink / raw)
To: Ingo Molnar
Cc: Satyam Sharma, Rene Herman, Jos Poortvliet, david, Nick Piggin, Valdis.Kletnieks, Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton

Hi,

Some general thoughts about submitter/maintainer responsibilities, not necessarily connected with the recent events (I haven't been following them closely - some people don't have that much free time on their hands ;)...

On Wednesday 25 July 2007, Ingo Molnar wrote:
>
> * Satyam Sharma <satyam.sharma@gmail.com> wrote:
>
> > > concentrate on making sure that both you and the maintainer
> > > understands the problem correctly,
> >
> > This itself may require some "convincing" to do. What if the
> > maintainer just doesn't recognize the problem? Note that the
> > development model here is more about the "social" thing than purely a
> > "technical" thing. People do handwave, possibly due to innocent
> > misunderstandings, possibly without. Often it's just a case of seeing
> > different reasons behind the "problematic behaviour". Or it could be a
> > case of all of the above.
>
> sure - but i was really not talking about from the user's perspective,
> but from the enterprising kernel developer's perspective who'd like to
> solve a particular problem. And the nice thing about concentrating on
> the problem: if you do that well, it does not really matter what the
> maintainer thinks!

Yes, this is a really good strategy to get your changes upstream (and it works) - just make changes so perfect that nobody can really complain.
:)

The only problem is that the bigger the change becomes, the less likely it is to get it perfect, so for really big changes it is also useful to show the maintainer that you take responsibility for your changes (by taking bug reports and potential review issues very seriously instead of ignoring them; the past history of your merged changes also has a big influence here) so he will know that you won't leave him in the cold with your code when bug reports happen - and be _sure_ that they will happen with bigger changes.

> ( Talking to the maintainer can of course be of enormous help in the
> quest for understanding the problem and figuring out the best fix -
> the maintainer will most likely know more about the subject than
> yourself. More communication never hurts. It's an additional bonus if
> you manage to convince the maintainer to take up the matter for
> himself. It's not a given right though - a maintainer's main task is
> to judge code that is being submitted, to keep a subsystem running
> smoothly and to not let it regress - but otherwise there can easily be
> different priorities of what tasks to tackle first, and in that sense
> the maintainer is just one of the many overworked kernel developers
> who has no real obligation what to tackle first. )

Yep, and the patch author should try to help the maintainer understand both the problem he is trying to fix and the solution, i.e. throwing some undocumented patches at the maintainer and screaming at him to merge them is not the way to go.

> If the maintainer rejects something despite it being well-reasoned,
> well-researched and robustly implemented with no tradeoffs and
> maintenance problems at all then it's a bad maintainer. (but from all
> i've seen in the past few years the VM maintainers do their job pretty
> damn fine.) And note that i _do_ disagree with them in this particular
> swap-prefetch case, but still, the non-merging of swap-prefetch was not
> a final decision at all.
> It was more of a "hm, dunno, i still don't
> really like it - shouldn't this be done differently? Could we debug this
> a bit better?" reaction. Yes, it can be frustrating after more than one
> year.
>
> > > possibly write some testcase that clearly exposes it, and
> >
> > Oh yes -- that'll be helpful, but definitely not necessarily a
> > prerequisite for all issues, and then you can't even expect everybody
> > to write or test/benchmark with testcases. (oh, btw, this is assuming
> > you do find consensus on a testcase)
>
> no, but Con is/was certainly more than capable to write testcases and to
> debug various scenarios. That's the way new maintainers are found
> within Linux: people take matters into their own hands and improve a
> subsystem so that they'll either peacefully co-work with the other
> maintainers or they replace them (sometimes not so peacefully - like in
> the IDE/SATA/PATA saga).

Heh, now that you've raised the IDE saga I feel obligated to stand up and say a few words...

The latest opening of the IDE saga was quite interesting in the current context because we had exactly the reversed situation there - "independent" maintainer and "enterprise" developer (imagine the amount of frustration on both sides) but the root cause was quite similar (inability to get changes merged). IMO the root cause of the conflict lay in coming from different perspectives and having a bit different priorities (stabilising/cleaning current code vs adding new features on top of a pile of crap). In such situations it is very important to be able to stop for a moment and look at the situation from the other person's perspective.

In summary: the IDE wars are a thing of the past, so let's learn from the IDE-world mistakes instead of repeating them in other subsystems, OK? :)

> > > help the maintainer debug the problem.
> >
> > Umm ... well. Should this "dance-with-the-maintainer" and all be
> > really necessary?
> > What you're saying is easy if a "bug" is simple and
> > objective, with mathematically few (probably just one) possible
> > correct solutions. Often (most often, in fact) it's a subjective issue
> > -- could be about APIs, high level design, tradeoffs, even little
> > implementation nits ... with one person wanting to do it one way,
> > another thinks there's something hacky or "band-aidy" about it and a
> > more beautiful/elegant solution exists elsewhere. I think there's a
> > similar deadlock here (?)
>
> you don't _have to_ cooperate with the maintainer, but it's certainly
> useful to work with good maintainers, if your goal is to improve Linux.
> Or if for some reason communication is not working out fine then grow
> into the job and replace the maintainer by doing a better job.

The idea of growing into the job and replacing the maintainer by proving that you are doing a better job was viable a few years ago but may not be feasible today. If the maintainer is an "enterprise" developer and maintaining is part of his job, replacing him may not be possible at all because you simply lack the time to do the job. You may actually be better but you can't afford to show it, and without showing it you won't replace him (catch-22).

Oh, and it could happen that if the maintainer works for a distro he sticks his competing solution to the problem into the distro kernel and suddenly gets an order of magnitude more testers and sometimes even contributors. How are you supposed to win such a competition? [ A: You can't. ]

I'm not even mentioning the situation when the maintainer is just a genius and one of the best kernel hackers ever (I'm talking about you actually :) so your chances are pretty slim from the start...

> > > _Optionally_, if you find joy in it, you are also free to write a
> > > proposed solution for that problem
> >
> > Oh yes. But why "optionally"? This is *precisely* what the spirit of
> > development in such open / distributed projects is ...
unless Linux > > wants to die the same, slow, ivory-towered, miserable death that *BSD > > have. > > perhaps you misunderstood how i meant the 'optional': it is certainly > not required to write a solution for every problem you are reporting. > Best-case the maintainer picks the issue up and solves it. Worst-case > you get ignored. But you always have the option to take matters into > your own hands and solve the problem. > > > >and submit it to the maintainer. > > > > Umm, ok ... pretty unlikely Linus or Andrew would take patches for any > > kernel subsystem (that isn't obvious/trivial) from anybody just like > > that, so you do need to Cc: the ones they trust (maintainer) to ensure > > they review/ack your work and pick it up. > > actually, it happens pretty frequently, and NACK-ing perfectly It actually happens really rarely (there are pretty good reasons for that). > reasonable patches is a sure way towards getting replaced as a > maintainer. "reasonable" is highly subjective > > > is the wrong way around. It might still work out fine if the > > > solution is correct (especially if the patch is small and obvious), > > > but if there are any non-trivial tradeoffs involved, or if > > > nontrivial amount of code is involved, you might see your patch at > > > the end of a really long (and constantly growing) waiting list of > > > patches. > > > > That's the whole point. For non-trivial / non-obvious / subjective > > issues, the "process" you laid out above could itself become a problem > > ... > > firstly, there's rarely any 'subjective' issue in maintainance > decisions, even when it comes to complex patches. The 'subjective' issue > becomes a factor mostly when a problem has not been researched well > enough, when it becomes more of a faith thing ('i believe it helps me') > than a fully fact-backed argument. Maintainers tend to dodge such issues > until they become more clearly fact-backed. Yep. 
However, there is some reasonable time limit for this dodging, and two years isn't reasonable. As a maintainer you frequently have to sacrifice your own goals and instead work on other people's changes first (sometimes even on changes that you don't find particularly interesting or important). Sure, it doesn't give you the same credit you'd get for your own changes, but you're investing in people who will help you in the long term. Can you afford the luxury of losing these people? Another problem is that sometimes it seems that independent developers have to jump through more hoops than enterprise ones, and it is a really frustrating experience for them. There is no conspiracy here - it is only the natural mechanism of trusting more in the code of the people you work with more. > providing more and more facts gradually reduces the 'judgement/taste' > leeway of maintainers, down to an almost algorithmic level. > but in any case there's always the ultimate way out: prove that you can > do a better job yourself and replace the maintainer. But providing an As stated before - this is nearly impossible in some cases. I'm not proposing any kind of justice or fair chances here, I'm just saying that in the long term it is going to hurt said maintainer because he will lose talented people willing to work on the code that he maintains. > overwhelming, irresistable body of facts in favor of a patch does the > trick too in 99.9% of the cases. Now could I ask people to stop all these -ck threads and give the developers involved in the recent events some time to calmly rethink the whole case. Please? Thanks, Bart ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-26 2:32 ` Bartlomiej Zolnierkiewicz @ 2007-07-26 4:13 ` Jeff Garzik 2007-07-26 10:22 ` Bartlomiej Zolnierkiewicz 0 siblings, 1 reply; 535+ messages in thread From: Jeff Garzik @ 2007-07-26 4:13 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz Cc: Ingo Molnar, Satyam Sharma, Rene Herman, Jos Poortvliet, david, Nick Piggin, Valdis.Kletnieks, Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton Bartlomiej Zolnierkiewicz wrote: > On Wednesday 25 July 2007, Ingo Molnar wrote: >> you dont _have to_ cooperative with the maintainer, but it's certainly >> useful to work with good maintainers, if your goal is to improve Linux. >> Or if for some reason communication is not working out fine then grow >> into the job and replace the maintainer by doing a better job. > > The idea of growing into the job and replacing the maintainer by proving > the you are doing better job was viable few years ago but may not be > feasible today. IMO... Tejun is an excellent counter-example. He showed up as an independent developer, put a bunch of his own spare time and energy into the codebase, and is probably libata's main engineer (in terms of code output) today. If I get hit by a bus tomorrow, I think the Linux community would be quite happy with him as the libata maintainer. > The another problem is that sometimes it seems that independent developers > has to go through more hops than entreprise ones and it is really frustrating > experience for them. There is no conspiracy here - it is only the natural > mechanism of trusting more in the code of people who you are working with more. I think Tejun is a counter-example here too :) Everyone's experience is different, but from my perspective, Tejun "appeared out of nowhere" producing good code, and so, it got merged rapidly. Personally, for merging code, I tend to trust people who are most in tune with "the Linux Way(tm)." 
It is hard to quantify, but quite often, independent developers "get it" when enterprise developers do not. > Now could I ask people to stop all this -ck threads and give the developers > involved in the recent events some time to calmly rethink the whole case. Indeed... Jeff ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-26 4:13 ` Jeff Garzik @ 2007-07-26 10:22 ` Bartlomiej Zolnierkiewicz 0 siblings, 0 replies; 535+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2007-07-26 10:22 UTC (permalink / raw) To: Jeff Garzik Cc: Ingo Molnar, Satyam Sharma, Rene Herman, Jos Poortvliet, david, Nick Piggin, Valdis.Kletnieks, Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton On Thursday 26 July 2007, Jeff Garzik wrote: > Bartlomiej Zolnierkiewicz wrote: > > On Wednesday 25 July 2007, Ingo Molnar wrote: > >> you dont _have to_ cooperative with the maintainer, but it's certainly > >> useful to work with good maintainers, if your goal is to improve Linux. > >> Or if for some reason communication is not working out fine then grow > >> into the job and replace the maintainer by doing a better job. > > > > The idea of growing into the job and replacing the maintainer by proving > > the you are doing better job was viable few years ago but may not be > > feasible today. > > IMO... Tejun is an excellent counter-example. He showed up as an IMO this doesn't qualify as a counter-example here at all unless you are trying to say that Tejun does your job much better and that we should just replace you. ;) > independent developer, put a bunch of his own spare time and energy into > the codebase, and is probably libata's main engineer (in terms of code > output) today. If I get hit by a bus tomorrow, I think the Linux > community would be quite happy with him as the libata maintainer. Fully agreed on this part. > > The another problem is that sometimes it seems that independent developers > > has to go through more hops than entreprise ones and it is really frustrating > > experience for them. There is no conspiracy here - it is only the natural > > mechanism of trusting more in the code of people who you are working with more.
> > I think Tejun is a counter-example here too :) Everyone's experience is > different, but from my perspective, Tejun "appeared out of nowhere" > producing good code, and so, it got merged rapidly. Tejun (like any other developer) spent some time in the making, and that time was in large part spent in IDE-land, and yes, I'm also very glad of the results. :) Thanks, Bart ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 5:30 ` Rene Herman 2007-07-25 5:51 ` david 2007-07-25 7:14 ` Valdis.Kletnieks @ 2007-07-25 16:02 ` Ray Lee 2007-07-25 20:55 ` Zan Lynx 2007-07-26 1:15 ` [ck] " Matthew Hawkins 2 siblings, 2 replies; 535+ messages in thread From: Ray Lee @ 2007-07-25 16:02 UTC (permalink / raw) To: Rene Herman Cc: david, Nick Piggin, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 7/24/07, Rene Herman <rene.herman@gmail.com> wrote: > Yes, but what's locate's usage scenario? I've never, ever wanted to use it. > When do you know the name of something but not where it's located, other > than situations which "which" wouldn't cover and after just having > installed/unpacked something meaning locate doesn't know about it yet either? I use it to find source files and documents all the time. One of my work boxes has <runs a locate work | wc -l> ~38500 files and directories under my source directory. And then there's the "I wrote that tech doc two years ago, where was that. Hmm, what did I name it? Bet it had 323 in the name, and doc in the path." I'd just like updatedb to amortize its work better. If we had some way to track all filesystem events, updatedb could keep a live and accurate index on the filesystem. And this isn't just updatedb that wants that, beagle and tracker et al also want to know filesystem events so that they can index the documents themselves as well as the metadata. And if they do it live, that spreads the cost out, including the VM pressure. Ray ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 16:02 ` Ray Lee @ 2007-07-25 20:55 ` Zan Lynx 2007-07-25 21:28 ` Ray Lee 2007-07-26 1:15 ` [ck] " Matthew Hawkins 1 sibling, 1 reply; 535+ messages in thread From: Zan Lynx @ 2007-07-25 20:55 UTC (permalink / raw) To: Ray Lee Cc: Rene Herman, david, Nick Piggin, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Wed, 2007-07-25 at 09:02 -0700, Ray Lee wrote: > I'd just like updatedb to amortize its work better. If we had some way > to track all filesystem events, updatedb could keep a live and > accurate index on the filesystem. And this isn't just updatedb that > wants that, beagle and tracker et al also want to know filesystem > events so that they can index the documents themselves as well as the > metadata. And if they do it live, that spreads the cost out, including > the VM pressure. That would be nice. It'd be great if there was a per-filesystem inotify mode. I can't help but think it'd be more efficient than recursing every directory and adding a watch. Or maybe a netlink thing that could buffer events since filesystem mount until a daemon could get around to starting, so none were lost. -- Zan Lynx <zlynx@acm.org> ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 20:55 ` Zan Lynx @ 2007-07-25 21:28 ` Ray Lee 0 siblings, 0 replies; 535+ messages in thread From: Ray Lee @ 2007-07-25 21:28 UTC (permalink / raw) To: Zan Lynx Cc: Rene Herman, david, Nick Piggin, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 7/25/07, Zan Lynx <zlynx@acm.org> wrote: > On Wed, 2007-07-25 at 09:02 -0700, Ray Lee wrote: > > > I'd just like updatedb to amortize its work better. If we had some way > > to track all filesystem events, updatedb could keep a live and > > accurate index on the filesystem. And this isn't just updatedb that > > wants that, beagle and tracker et al also want to know filesystem > > events so that they can index the documents themselves as well as the > > metadata. And if they do it live, that spreads the cost out, including > > the VM pressure. > > That would be nice. It'd be great if there was a per-filesystem inotify > mode. I can't help but think it'd be more efficient than recursing > every directory and adding a watch. > > Or maybe a netlink thing that could buffer events since filesystem mount > until a daemon could get around to starting, so none were lost. See "Filesystem Event Reporter" by Yi Yang, that does pretty much exactly that. http://lkml.org/lkml/2006/9/30/98 . Author had things to update, never resubmitted it as far as I can tell. Ray ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-25 16:02 ` Ray Lee 2007-07-25 20:55 ` Zan Lynx @ 2007-07-26 1:15 ` Matthew Hawkins 2007-07-26 1:32 ` Ray Lee 2007-07-26 22:30 ` Michael Chang 1 sibling, 2 replies; 535+ messages in thread From: Matthew Hawkins @ 2007-07-26 1:15 UTC (permalink / raw) To: Ray Lee; +Cc: linux-kernel, ck list, linux-mm On 7/26/07, Ray Lee <ray-lk@madrabbit.org> wrote: > I'd just like updatedb to amortize its work better. If we had some way > to track all filesystem events, updatedb could keep a live and > accurate index on the filesystem. And this isn't just updatedb that > wants that, beagle and tracker et al also want to know filesystem > events so that they can index the documents themselves as well as the > metadata. And if they do it live, that spreads the cost out, including > the VM pressure. We already have this, it's called inotify (and if I'm not mistaken, beagle already uses it). Several years ago when it was still a little flakey patch, I built a custom filesystem indexer into an enterprise search engine using it (I needed to pull apart Unix mbox files). The only trouble of course is that the action is triggered immediately, which may not always be ideal (but that's a userspace problem) -- Matt ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-26 1:15 ` [ck] " Matthew Hawkins @ 2007-07-26 1:32 ` Ray Lee 2007-07-26 3:16 ` Matthew Hawkins 2007-07-26 22:30 ` Michael Chang 1 sibling, 1 reply; 535+ messages in thread From: Ray Lee @ 2007-07-26 1:32 UTC (permalink / raw) To: Matthew Hawkins; +Cc: linux-kernel, ck list, linux-mm On 7/25/07, Matthew Hawkins <darthmdh@gmail.com> wrote: > On 7/26/07, Ray Lee <ray-lk@madrabbit.org> wrote: > > I'd just like updatedb to amortize its work better. If we had some way > > to track all filesystem events, updatedb could keep a live and > > accurate index on the filesystem. And this isn't just updatedb that > > wants that, beagle and tracker et al also want to know filesystem > > events so that they can index the documents themselves as well as the > > metadata. And if they do it live, that spreads the cost out, including > > the VM pressure. > > We already have this, its called inotify (and if I'm not mistaken, > beagle already uses it). Yeah, I know about inotify, but it doesn't scale. ray@phoenix:~$ find ~ -type d | wc -l 17933 ray@phoenix:~$ That's not fun with inotify, and that's just my home directory. The vast majority of those are quiet the vast majority of the time, which is the crux of the problem, and why inotify isn't a great fit for on-demand virus scanners or indexers. Ray ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-26 1:32 ` Ray Lee @ 2007-07-26 3:16 ` Matthew Hawkins 0 siblings, 0 replies; 535+ messages in thread From: Matthew Hawkins @ 2007-07-26 3:16 UTC (permalink / raw) To: Ray Lee; +Cc: linux-kernel, ck list, linux-mm On 7/26/07, Ray Lee <ray-lk@madrabbit.org> wrote: > Yeah, I know about inotify, but it doesn't scale. Yeah, the nonrecursive behaviour is a bugger. Also I found it helped to queue operations in userspace and execute periodically rather than trying to execute on every single notification. Worked well for indexing, for virus scanning though you'd want to do some risk analysis. It'd be nice to have a filesystem that handled that sort of thing internally *cough*winfs*cough*. That was my hope for reiserfs a very long time ago with its pluggable fs modules feature. -- Matt ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-26 1:15 ` [ck] " Matthew Hawkins 2007-07-26 1:32 ` Ray Lee @ 2007-07-26 22:30 ` Michael Chang 1 sibling, 0 replies; 535+ messages in thread From: Michael Chang @ 2007-07-26 22:30 UTC (permalink / raw) To: Matthew Hawkins; +Cc: Ray Lee, ck list, linux-mm, linux-kernel On 7/25/07, Matthew Hawkins <darthmdh@gmail.com> wrote: > On 7/26/07, Ray Lee <ray-lk@madrabbit.org> wrote: > > I'd just like updatedb to amortize its work better. If we had some way > > to track all filesystem events, updatedb could keep a live and > > accurate index on the filesystem. And this isn't just updatedb that > > wants that, beagle and tracker et al also want to know filesystem > > events so that they can index the documents themselves as well as the > > metadata. And if they do it live, that spreads the cost out, including > > the VM pressure. > > We already have this, its called inotify (and if I'm not mistaken, > beagle already uses it). Several years ago when it was still a little > flakey patch, I built a custom filesystem indexer into an enterprise > search engine using it (I needed to pull apart Unix mbox files). The > only trouble of course is the action is triggered immediately, which > may not always be ideal (but that's a userspace problem) > With all this discussion about updatedb and locate and such, I thought I'd do a Google search (considering I've never heard of locate before but I've seen updatedb here and there in ps lists) and I found this: http://www.linux.com/articles/114029 That page mentions something called "rlocate", which seems to provide some sort of almost-real-time mechanism, although the way it does so bothers me -- it uses a 2.6 kernel module AND a userspace daemon. And from what I can tell, there's no indication that this almost "real-time" (--I see mentions of a 2 second lag--) system replaces/eliminates updatedb in any way, shape, or form.
http://rlocate.sourceforge.net/ - Project "Web Site" http://sourceforge.net/projects/rlocate/ - Source Forge Project Summary The last release also appears a bit dated on sourceforge... release 0.4.0 on 2006-01-15. Just thought I'd mention it. -- Michael Chang Please avoid sending me Word or PowerPoint attachments. Send me ODT, RTF, or HTML instead. See http://www.gnu.org/philosophy/no-word-attachments.html Thank you. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 4:55 ` Rene Herman 2007-07-25 5:00 ` Nick Piggin 2007-07-25 5:12 ` david @ 2007-07-25 5:30 ` Eric St-Laurent 2007-07-25 5:37 ` Nick Piggin 2 siblings, 1 reply; 535+ messages in thread From: Eric St-Laurent @ 2007-07-25 5:30 UTC (permalink / raw) To: Rene Herman Cc: Nick Piggin, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Wed, 2007-25-07 at 06:55 +0200, Rene Herman wrote: > It certainly doesn't run for me ever. Always kind of a "that's not the > point" comment but I just keep wondering whenever I see anyone complain > about updatedb why the _hell_ they are running it in the first place. If > anyone who never uses "locate" for anything simply disable updatedb, the > problem will for a large part be solved. > > This not just meant as a cheap comment; while I can think of a few similar > loads even on the desktop (scanning a browser cache, a media player indexing > a large amount of media files, ...) I've never heard of problems _other_ > than updatedb. So just junk that crap and be happy. From my POV there are two different problems discussed recently: - updatedb type of workloads that add tons of inodes and dentries in the slab caches which of course use the pagecache. - streaming large files (read or copying) that fill the pagecache with useless used-once data swap prefetch fixes the first case, drop-behind fixes the second case. Both have the same symptoms but the cause is different. Personally updatedb doesn't really hurt me. But I don't have that many files on my desktop. I've tried the swap prefetch patch in the past and it was not so noticeable for me. (I don't doubt it's helpful for others) But every time I read or copy a large file around (usually from a server) the slowdown is noticeable for some moments. I just wanted to point this out, if it wasn't clear enough for everyone. I hope both problems get fixed. 
Best regards, - Eric ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 5:30 ` Eric St-Laurent @ 2007-07-25 5:37 ` Nick Piggin 2007-07-25 5:53 ` david ` (4 more replies) 0 siblings, 5 replies; 535+ messages in thread From: Nick Piggin @ 2007-07-25 5:37 UTC (permalink / raw) To: Eric St-Laurent Cc: Rene Herman, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Eric St-Laurent wrote: > On Wed, 2007-25-07 at 06:55 +0200, Rene Herman wrote: > > >>It certainly doesn't run for me ever. Always kind of a "that's not the >>point" comment but I just keep wondering whenever I see anyone complain >>about updatedb why the _hell_ they are running it in the first place. If >>anyone who never uses "locate" for anything simply disable updatedb, the >>problem will for a large part be solved. >> >>This not just meant as a cheap comment; while I can think of a few similar >>loads even on the desktop (scanning a browser cache, a media player indexing >>a large amount of media files, ...) I've never heard of problems _other_ >>than updatedb. So just junk that crap and be happy. > > >>From my POV there's two different problems discussed recently: > > - updatedb type of workloads that add tons of inodes and dentries in the > slab caches which of course use the pagecache. > > - streaming large files (read or copying) that fill the pagecache with > useless used-once data > > swap prefetch fix the first case, drop-behind fix the second case. OK, this is where I start to worry. Swap prefetch AFAIKS doesn't fix the updatedb problem very well, because if updatedb has caused swapout then it has filled memory, and swap prefetch doesn't run unless there is free memory (not to mention that updatedb would have paged out other files as well). And drop behind doesn't fix your usual problem where you are downloading from a server, because that is use-once write(2) data which is the problem. 
And this readahead-based drop behind also doesn't help if data you were reading happened to be a sequence of small files, or otherwise not in good readahead order. Not to say that neither fix some problems, but for such conceptually big changes, it should take a little more effort than a constructed test case and no consideration of the alternatives to get it merged. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 5:37 ` Nick Piggin @ 2007-07-25 5:53 ` david 2007-07-25 6:04 ` Nick Piggin 2007-07-25 6:19 ` [ck] " Matthew Hawkins ` (3 subsequent siblings) 4 siblings, 1 reply; 535+ messages in thread From: david @ 2007-07-25 5:53 UTC (permalink / raw) To: Nick Piggin Cc: Eric St-Laurent, Rene Herman, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Wed, 25 Jul 2007, Nick Piggin wrote: > Eric St-Laurent wrote: >> On Wed, 2007-25-07 at 06:55 +0200, Rene Herman wrote: >> >> >> > It certainly doesn't run for me ever. Always kind of a "that's not the >> > point" comment but I just keep wondering whenever I see anyone complain >> > about updatedb why the _hell_ they are running it in the first place. If >> > anyone who never uses "locate" for anything simply disable updatedb, the >> > problem will for a large part be solved. >> > >> > This not just meant as a cheap comment; while I can think of a few >> > similar loads even on the desktop (scanning a browser cache, a media >> > player indexing a large amount of media files, ...) I've never heard of >> > problems _other_ than updatedb. So just junk that crap and be happy. >> >> >> >From my POV there's two different problems discussed recently: >> >> - updatedb type of workloads that add tons of inodes and dentries in the >> slab caches which of course use the pagecache. >> >> - streaming large files (read or copying) that fill the pagecache with >> useless used-once data >> >> swap prefetch fix the first case, drop-behind fix the second case. > > OK, this is where I start to worry. Swap prefetch AFAIKS doesn't fix > the updatedb problem very well, because if updatedb has caused swapout > then it has filled memory, and swap prefetch doesn't run unless there > is free memory (not to mention that updatedb would have paged out other > files as well). 
> > And drop behind doesn't fix your usual problem where you are downloading > from a server, because that is use-once write(2) data which is the > problem. And this readahead-based drop behind also doesn't help if data > you were reading happened to be a sequence of small files, or otherwise > not in good readahead order. > > Not to say that neither fix some problems, but for such conceptually > big changes, it should take a little more effort than a constructed test > case and no consideration of the alternatives to get it merged. well, there appears to be a fairly large group of people who have subjective opinions that it helps them. but those were dismissed because they aren't measurements. so now the measurements of the constructed test case aren't acceptable. what sort of test case would be acceptable? David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 5:53 ` david @ 2007-07-25 6:04 ` Nick Piggin 2007-07-25 6:23 ` david 2007-07-25 10:41 ` Jesper Juhl 0 siblings, 2 replies; 535+ messages in thread From: Nick Piggin @ 2007-07-25 6:04 UTC (permalink / raw) To: david Cc: Eric St-Laurent, Rene Herman, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel david@lang.hm wrote: > On Wed, 25 Jul 2007, Nick Piggin wrote: >> OK, this is where I start to worry. Swap prefetch AFAIKS doesn't fix >> the updatedb problem very well, because if updatedb has caused swapout >> then it has filled memory, and swap prefetch doesn't run unless there >> is free memory (not to mention that updatedb would have paged out other >> files as well). >> >> And drop behind doesn't fix your usual problem where you are downloading >> from a server, because that is use-once write(2) data which is the >> problem. And this readahead-based drop behind also doesn't help if data >> you were reading happened to be a sequence of small files, or otherwise >> not in good readahead order. >> >> Not to say that neither fix some problems, but for such conceptually >> big changes, it should take a little more effort than a constructed test >> case and no consideration of the alternatives to get it merged. > > > well, there appears to be a fairly large group of people who have > subjective opinions that it helps them. but those were dismissed becouse > they aren't measurements. Not at all. But there is also seems to be some people also experiencing problems with basic page reclaim on some of the workloads where these things help. I am not dismissing anybody's claims about anything; I want to try to solve some of these problems. Interestingly, some of the people ranting the most about how the VM sucks are the ones helping least in solving these basic problems. > so now the measurements of the constructed test case aren't acceptable. > > what sort of test case would be acceptable? 
Well I never said real world tests aren't acceptable, they are. There is a difference between an "it feels better for me", and some actual real measurement and analysis of said workload. And constructed test cases of course are useful as well, I didn't say they weren't. I don't know what you mean by "acceptable", but you should read my last paragraph again. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 6:04 ` Nick Piggin @ 2007-07-25 6:23 ` david 2007-07-25 7:25 ` Nick Piggin 2007-07-25 10:41 ` Jesper Juhl 1 sibling, 1 reply; 535+ messages in thread From: david @ 2007-07-25 6:23 UTC (permalink / raw) To: Nick Piggin Cc: Eric St-Laurent, Rene Herman, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Wed, 25 Jul 2007, Nick Piggin wrote: > david@lang.hm wrote: >> On Wed, 25 Jul 2007, Nick Piggin wrote: > >> > OK, this is where I start to worry. Swap prefetch AFAIKS doesn't fix >> > the updatedb problem very well, because if updatedb has caused swapout >> > then it has filled memory, and swap prefetch doesn't run unless there >> > is free memory (not to mention that updatedb would have paged out other >> > files as well). >> > >> > And drop behind doesn't fix your usual problem where you are downloading >> > from a server, because that is use-once write(2) data which is the >> > problem. And this readahead-based drop behind also doesn't help if data >> > you were reading happened to be a sequence of small files, or otherwise >> > not in good readahead order. >> > >> > Not to say that neither fix some problems, but for such conceptually >> > big changes, it should take a little more effort than a constructed test >> > case and no consideration of the alternatives to get it merged. >> >> >> well, there appears to be a fairly large group of people who have >> subjective opinions that it helps them. but those were dismissed becouse >> they aren't measurements. > > Not at all. But there is also seems to be some people also experiencing > problems with basic page reclaim on some of the workloads where these > things help. I am not dismissing anybody's claims about anything; I want > to try to solve some of these problems. > > Interestingly, some of the people ranting the most about how the VM sucks > are the ones helping least in solving these basic problems. 
> > >> so now the measurements of the constructed test case aren't acceptable. >> >> what sort of test case would be acceptable? > > Well I never said real world tests aren't acceptable, they are. There is > a difference between an "it feels better for me", and some actual real > measurement and analysis of said workload. > > And constructed test cases of course are useful as well, I didn't say > they weren't. I don't know what you mean by "acceptable", but you should > read my last paragraph again. this problem has been around for many years, with many different people working on solutions. it's hardly a case of getting a proposal and trying to get it in without anyone looking at other options. it seems that there are some people (not necessarily including you) who will oppose this feature until a test is created that shows that it's better. the question is what sort of test will be accepted as valid? I'm not using this patch, but it sounds as if the people who are using it are interested in doing whatever testing is required, but so far the situation seems to be a series of "here's a test", "that test isn't valid, try again" loops. which don't seem to be doing anyone any good and are frustrating lots of people, so like several people over the last few days I'm asking the question, "what sort of test would be acceptable as proof that this patch does some good?" David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 6:23 ` david @ 2007-07-25 7:25 ` Nick Piggin 2007-07-25 7:49 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Nick Piggin @ 2007-07-25 7:25 UTC (permalink / raw) To: david Cc: Eric St-Laurent, Rene Herman, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel david@lang.hm wrote: > On Wed, 25 Jul 2007, Nick Piggin wrote: >> And constructed test cases of course are useful as well, I didn't say >> they weren't. I don't know what you mean by "acceptable", but you should >> read my last paragraph again. > > > this problem has been around for many years, with many different people > working on solutions. it's hardly a case of getting a proposal and > trying to get it in without anyone looking at other options. What is "this problem"? People have an updatedb problem that is solved by swap prefetching which I want to fix in a different way. There would be a different problem of "run something that uses heaps of memory and swap everything else out, then quit it, wait for a while, and swap prefetching helps". OK, definitely swap prefetching would help there. How much? I don't know. I'd be slightly surprised if it was like an order of magnitude, because not only swap but everything else has been thrown out too. > it seems that there are some people (not nessasarily including you) who > will oppose this feature until a test is created that shows that it's > better. the question is what sort of test will be accepted as valid? I'm > not useing this patch, but it sounds as if the people who are useing it > are interested in doing whatever testing is required, but so far the > situation seems to be a series of "here's a test", "that test isn't > valid, try again" loops. which don't seem to be doing anyone any good And yet despite my repeated pleas, none of those people has yet spent a bit of time with me to help analyse what is happening. 
> and are frustrating lots of people, so like several people over the last > few days I'm asking the question, "what sort of test would be acceptable > as proof that this patch does some good?" I don't think any further proof is needed that the patch does "some" good. Rig up a test case and you could see some seconds shaved off it. Maybe you want to know "how to get this patch merged"? And I don't know that one. I do know that it is fuzzy, and probably doesn't include demanding things of Andrew or Linus. BTW. If you find out the answer to that one, let me know because I have this lockless pagecache patch that has also been around for years, is also just a few hundred lines in the VM, and does do some good too. I'm sure the buffered AIO people and many others would also like to know. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 7:25 ` Nick Piggin @ 2007-07-25 7:49 ` Ingo Molnar 2007-07-25 7:58 ` Nick Piggin 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-25 7:49 UTC (permalink / raw) To: Nick Piggin Cc: david, Eric St-Laurent, Rene Herman, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Paul Jackson, linux-mm, linux-kernel * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > And yet despite my repeated pleas, none of those people has yet spent > a bit of time with me to help analyse what is happening. btw., it might help to give specific, precise instructions about what people should do to help you analyze this problem. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 7:49 ` Ingo Molnar @ 2007-07-25 7:58 ` Nick Piggin 2007-07-25 8:15 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Nick Piggin @ 2007-07-25 7:58 UTC (permalink / raw) To: Ingo Molnar Cc: david, Eric St-Laurent, Rene Herman, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Paul Jackson, linux-mm, linux-kernel Ingo Molnar wrote: > * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > >>And yet despite my repeated pleas, none of those people has yet spent >>a bit of time with me to help analyse what is happening. > > > btw., it might help to give specific, precise instructions about what > people should do to help you analyze this problem. Ray has been the first one to offer (thank you), and yes I have asked him for precise details of info to collect to hopefully work out what is happening with his first problem. For the general "it feels better for me" it is harder, but not as hard as CPU scheduler. We can measure various types of IO waits, swap in/out events, swap prefetch events and successfulness; see what happens to those as we change swappiness or vfs_cache_pressure etc. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
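The counters Nick mentions are exported through /proc/vmstat. As an illustration only (not from this thread), here is a minimal sketch of the kind of measurement he is asking for, diffing the swap-in/swap-out counters across an interval; the `pswpin`/`pswpout` field names are the standard Linux ones, while the helper names are invented for the example:

```python
import time

def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[name] = int(value)
    return stats

def swap_activity(interval=10):
    """Report swap-in/swap-out page counts over an interval."""
    def snap():
        with open("/proc/vmstat") as f:
            return parse_vmstat(f.read())
    before = snap()
    time.sleep(interval)
    after = snap()
    # delta of pages swapped in/out during the interval
    return {k: after[k] - before[k]
            for k in ("pswpin", "pswpout") if k in before and k in after}
```

Running something like this around an updatedb pass, once with swap prefetch enabled and once without, would turn "it feels better" into comparable numbers.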
* Re: -mm merge plans for 2.6.23 2007-07-25 7:58 ` Nick Piggin @ 2007-07-25 8:15 ` Ingo Molnar 0 siblings, 0 replies; 535+ messages in thread From: Ingo Molnar @ 2007-07-25 8:15 UTC (permalink / raw) To: Nick Piggin Cc: david, Eric St-Laurent, Rene Herman, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Paul Jackson, linux-mm, linux-kernel * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > And yet despite my repeated pleas, none of those people has yet > > > spent a bit of time with me to help analyse what is happening. > > > > btw., it might help to give specific, precise instructions about > > what people should do to help you analyze this problem. > > Ray has been the first one to offer (thank you), and yes I have asked > him for precise details of info to collect to hopefully work out what > is happening with his first problem. do you mean this paragraph: | I guess /proc/meminfo, /proc/zoneinfo, /proc/vmstat, /proc/slabinfo | before and after the updatedb run with the latest kernel would be a | first step. top and vmstat output during the run wouldn't hurt either. correct? Does "latest kernel" mean v2.6.22.1, or does it have to be v2.6.23-rc1? I guess v2.6.22.1 would be fine as this is a VM problem, not a scheduling problem. the following script will gather all the above information for a 10-second interval: http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh Ray, please run this script before the updatedb run, once during the updatedb run and once after the updatedb run, and send Nick the 3 files it creates. (feel free to Cc: me too) Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
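For reference, the data gathering Ingo describes (snapshotting those /proc files before, during and after the updatedb run) can be sketched in a few lines. This is only an illustration of the idea, not the actual cfs-debug-info.sh script; the output file naming is invented:

```python
import os
import time

# The files Ingo lists in his quoted paragraph.
FILES = ["/proc/meminfo", "/proc/zoneinfo", "/proc/vmstat", "/proc/slabinfo"]

def snapshot(tag, outdir="."):
    """Dump the listed /proc files into one timestamped text file."""
    path = os.path.join(outdir, "vm-snapshot-%s-%d.txt" % (tag, int(time.time())))
    with open(path, "w") as out:
        for name in FILES:
            out.write("==> %s <==\n" % name)
            try:
                with open(name) as f:
                    out.write(f.read())
            except OSError as e:
                # e.g. /proc/slabinfo may need root
                out.write("(unreadable: %s)\n" % e)
    return path
```

Calling `snapshot("before")`, `snapshot("during")` and `snapshot("after")` around the run yields the three files Ingo asks Ray to send.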
* Re: -mm merge plans for 2.6.23 2007-07-25 6:04 ` Nick Piggin 2007-07-25 6:23 ` david @ 2007-07-25 10:41 ` Jesper Juhl 1 sibling, 0 replies; 535+ messages in thread From: Jesper Juhl @ 2007-07-25 10:41 UTC (permalink / raw) To: Nick Piggin Cc: david, Eric St-Laurent, Rene Herman, Ray Lee, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 25/07/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: [snip] > > Well I never said real world tests aren't acceptable, they are. There is > a difference between an "it feels better for me", and some actual real > measurement and analysis of said workload. > Let me tell you about the use-case where swap prefetch helps me. I don't have actual numbers currently, only a subjective "it feels better", but when I get home from work tonight I'll try to collect some actual numbers for you. Anyway, here's a description of the scenario (machine is an AMD Athlon64 X2 4400+, 2GB RAM, 1GB swap, running 32bit kernel & userspace): A KDE desktop with the following running is common for me - A few (konsole) shells open running vim, pine, less, ssh sessions etc. - Eclipse (with CDT) with 20-30 files open in a project. - Firefox with 30+ tabs open. - LyX running with a 200+ page document I'm working on open. - Gimp running, usually with at least one or two images open (~1280x1024). - Amarok open and playing my playlist (a few days worth of music). - At least one Konqueror window in filemanager mode running. - More often than not OpenOffice is running with a spreadsheet or text document open. - In the background the machine is running Apache, MySQL, BIND and NFS services for my local LAN, but they see very little actual use. Now, a thing I commonly do is fire up a new shell, pull the latest changes from Linus' git tree and start a script running that builds an allnoconfig kernel, an allmodconfig kernel, an allyesconfig kernel and then 30 randconfig kernels.
Obviously that script takes quite a while to run and loads the box quite a bit, so I usually just leave the box alone for a few hours until it is done (sometimes I leave it overnight, in which case updatedb also gets added to the mix during the night). This usually pushes the box to use some amount of swap. Without swap prefetch, when I start working with one of the apps I had running before starting the compile job it always feels a little laggy at first. With swap prefetch app response time is not laggy when I come back. The "laggyness" doesn't last too long and is hard to quantify, but I'll try getting some numbers (if in no other way, then perhaps by using a stopwatch).... Fact is, this is a scenario that is common to me and one where swap prefetch definitely makes the box feel nicer to work with. -- Jesper Juhl <jesper.juhl@gmail.com> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please http://www.expita.com/nomime.html ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-25 5:37 ` Nick Piggin 2007-07-25 5:53 ` david @ 2007-07-25 6:19 ` Matthew Hawkins 2007-07-25 6:30 ` Nick Piggin 2007-07-25 6:47 ` Mike Galbraith 2007-07-25 6:44 ` Eric St-Laurent ` (2 subsequent siblings) 4 siblings, 2 replies; 535+ messages in thread From: Matthew Hawkins @ 2007-07-25 6:19 UTC (permalink / raw) To: Nick Piggin Cc: Eric St-Laurent, Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton, Rene Herman On 7/25/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Not to say that neither fix some problems, but for such conceptually > big changes, it should take a little more effort than a constructed test > case and no consideration of the alternatives to get it merged. Swap Prefetch has existed since September 5, 2005. Please Nick, enlighten us all with your "alternatives" which have been offered (in practical, not theoretical form) in the past 23 months, along with their non-constructed benchmarks proving their case and the hordes of happy users and kernel developers who have tested them out the wazoo and given their backing. Or just take a nice steaming jug of STFU. -- Matt ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-25 6:19 ` [ck] " Matthew Hawkins @ 2007-07-25 6:30 ` Nick Piggin 2007-07-25 6:47 ` Mike Galbraith 1 sibling, 0 replies; 535+ messages in thread From: Nick Piggin @ 2007-07-25 6:30 UTC (permalink / raw) To: Matthew Hawkins Cc: Eric St-Laurent, Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton, Rene Herman Matthew Hawkins wrote: > On 7/25/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> Not to say that neither fix some problems, but for such conceptually >> big changes, it should take a little more effort than a constructed test >> case and no consideration of the alternatives to get it merged. > > > Swap Prefetch has existed since September 5, 2005. Please Nick, > enlighten us all with your "alternatives" which have been offered (in > practical, not theoretical form) in the past 23 months, along with > their non-constructed benchmarks proving their case and the hordes of > happy users and kernel developers who have tested them out the wazoo > and given their backing. Or just take a nice steaming jug of STFU. The alternatives comment was in relation to the readahead based drop behind patch, for which an alternative would be improving use-once, possibly in the way I described. As for swap prefetch, I don't know, I'm not in charge of it being merged or not merged. I do know some people have reported that their updatedb problem gets much better with swap prefetch turned on, and I am trying to work on that too. For you? You also have the alternative to help improve things yourself, and you can modify your own kernel. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-25 6:19 ` [ck] " Matthew Hawkins 2007-07-25 6:30 ` Nick Piggin @ 2007-07-25 6:47 ` Mike Galbraith 2007-07-25 7:19 ` Eric St-Laurent 1 sibling, 1 reply; 535+ messages in thread From: Mike Galbraith @ 2007-07-25 6:47 UTC (permalink / raw) To: Matthew Hawkins Cc: Nick Piggin, Eric St-Laurent, Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton, Rene Herman On Wed, 2007-07-25 at 16:19 +1000, Matthew Hawkins wrote: > On 7/25/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > Not to say that neither fix some problems, but for such conceptually > > big changes, it should take a little more effort than a constructed test > > case and no consideration of the alternatives to get it merged. > > Swap Prefetch has existed since September 5, 2005. Please Nick, > enlighten us all with your "alternatives" which have been offered (in > practical, not theoretical form) in the past 23 months, along with > their non-constructed benchmarks proving their case and the hordes of > happy users and kernel developers who have tested them out the wazoo > and given their backing. Or just take a nice steaming jug of STFU. Heh. Here we have a VM developer expressing his interest in the problem space, and you offer him a steaming jug of STFU because he doesn't say what you want to hear. I wonder how many killfiles you just entered. -Mike ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-25 6:47 ` Mike Galbraith @ 2007-07-25 7:19 ` Eric St-Laurent 0 siblings, 0 replies; 535+ messages in thread From: Eric St-Laurent @ 2007-07-25 7:19 UTC (permalink / raw) To: Mike Galbraith Cc: Matthew Hawkins, Nick Piggin, Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton, Rene Herman On Wed, 2007-25-07 at 08:47 +0200, Mike Galbraith wrote: > Heh. Here we have a VM developer expressing his interest in the problem > space, and you offer him a steaming jug of STFU because he doesn't say > what you want to hear. I wonder how many killfiles you just entered. > Agreed. (a bit OT) People should understand that it's not (I think) about a desktop workload vs enterprise workloads war. I see it mostly as a progress-versus-regression trade-off. And adding potentially useless or unmaintained code is a regression from the maintainers' POV. The best way to justify a patch and have it integrated is to have a scientific testing method with repeatable numbers. Con has done so for his patch, his benchmark demonstrated good improvements. But I feel some of his supporters have indirectly harmed his cause by their comments. Also, the fact that Con recently stopped maintaining his work out of frustration doesn't help get his patch merged. Again I'm not personally pushing this patch, I don't need it. Con has worked for many years on two areas that still cause problems for desktop users: scheduler interactivity and pagecache thrashing. Now that the scheduler has been fixed, let's have the VM fixed too. Sorry for the slightly OT post, and please don't start a flame war... - Eric ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 5:37 ` Nick Piggin 2007-07-25 5:53 ` david 2007-07-25 6:19 ` [ck] " Matthew Hawkins @ 2007-07-25 6:44 ` Eric St-Laurent 2007-07-25 16:09 ` Ray Lee 2007-07-25 17:55 ` Frank A. Kingswood 4 siblings, 0 replies; 535+ messages in thread From: Eric St-Laurent @ 2007-07-25 6:44 UTC (permalink / raw) To: Nick Piggin Cc: Rene Herman, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Wed, 2007-25-07 at 15:37 +1000, Nick Piggin wrote: > OK, this is where I start to worry. Swap prefetch AFAIKS doesn't fix > the updatedb problem very well, because if updatedb has caused swapout > then it has filled memory, and swap prefetch doesn't run unless there > is free memory (not to mention that updatedb would have paged out other > files as well). > > And drop behind doesn't fix your usual problem where you are downloading > from a server, because that is use-once write(2) data which is the > problem. And this readahead-based drop behind also doesn't help if data > you were reading happened to be a sequence of small files, or otherwise > not in good readahead order. > > Not to say that neither fix some problems, but for such conceptually > big changes, it should take a little more effort than a constructed test > case and no consideration of the alternatives to get it merged. Sorry for the confusion. For swap prefetch I should have said "some people claim that it fixes their problem". I didn't want to hurt anybody's feelings, some people are tired of hearing others speak hypothetically about this patch, as it works-for-them (TM). I don't experience the problem. Can't help. For drop behind it fixes half the problem. The read case is handled perfectly by Peter's patch. And the copy (read+write) is unchanged. My test case demonstrates it very easily, just look at the numbers. So, I agree with you that drop behind doesn't fix the write() case. Peter has said so himself when I offered to test his patch.
As I do experience this problem, I have written a small test program and batch file to help push the patch for acceptance. I'm very willing to help improve the test cases, test patches and write code, time permitting. On this very subject, earlier this year Andrew suggested I come up with a test case to demonstrate my problem; I've finally done so. http://lkml.org/lkml/2007/3/3/164 http://lkml.org/lkml/2007/3/3/166 Lastly, I would go as far as to say that the use-once read then copy fix must also work with copies over NFS. I don't know if NFS changes the workload on the client station versus the local case, and I don't know if it's still possible to consider data copied this way as use-once. - Eric ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 5:37 ` Nick Piggin ` (2 preceding siblings ...) 2007-07-25 6:44 ` Eric St-Laurent @ 2007-07-25 16:09 ` Ray Lee 2007-07-26 4:57 ` Andrew Morton 2007-07-25 17:55 ` Frank A. Kingswood 4 siblings, 1 reply; 535+ messages in thread From: Ray Lee @ 2007-07-25 16:09 UTC (permalink / raw) To: Nick Piggin Cc: Eric St-Laurent, Rene Herman, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Hey Eric, On 7/24/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Eric St-Laurent wrote: > > On Wed, 2007-25-07 at 06:55 +0200, Rene Herman wrote: > > > > > >>It certainly doesn't run for me ever. Always kind of a "that's not the > >>point" comment but I just keep wondering whenever I see anyone complain > >>about updatedb why the _hell_ they are running it in the first place. If > >>anyone who never uses "locate" for anything simply disable updatedb, the > >>problem will for a large part be solved. > >> > >>This not just meant as a cheap comment; while I can think of a few similar > >>loads even on the desktop (scanning a browser cache, a media player indexing > >>a large amount of media files, ...) I've never heard of problems _other_ > >>than updatedb. So just junk that crap and be happy. > > > > > >>From my POV there's two different problems discussed recently: > > > > - updatedb type of workloads that add tons of inodes and dentries in the > > slab caches which of course use the pagecache. > > > > - streaming large files (read or copying) that fill the pagecache with > > useless used-once data No, there's a third case which I find the most annoying. I have multiple working sets, the sum of which won't fit into RAM. When I finish one, the kernel had time to preemptively swap back in the other, and yet it didn't. So, I sit around, twiddling my thumbs, waiting for my music player to come back to life, or thunderbird, or... ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 16:09 ` Ray Lee @ 2007-07-26 4:57 ` Andrew Morton 2007-07-26 5:53 ` Nick Piggin ` (4 more replies) 0 siblings, 5 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-26 4:57 UTC (permalink / raw) To: Ray Lee Cc: Nick Piggin, Eric St-Laurent, Rene Herman, Jesper Juhl, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Wed, 25 Jul 2007 09:09:01 -0700 "Ray Lee" <ray-lk@madrabbit.org> wrote: > No, there's a third case which I find the most annoying. I have > multiple working sets, the sum of which won't fit into RAM. When I > finish one, the kernel had time to preemptively swap back in the > other, and yet it didn't. So, I sit around, twiddling my thumbs, > waiting for my music player to come back to life, or thunderbird, > or... Yes, I'm thinking that's a good problem statement and it isn't something which the kernel even vaguely attempts to address, apart from normal demand paging. We could perhaps improve things with larger and smarter fault readaround, perhaps guided by refault-rate measurement. But that's still demand-paged rather than being proactive/predictive/whatever. None of this is swap-specific though: exactly the same problem would need to be solved for mmapped files and even plain old pagecache. In fact I'd restate the problem as "system is in steady state A, then there is a workload shift causing transition to state B, then the system goes idle. We now wish to reinstate state A in anticipation of a resumption of the original workload". swap-prefetch solves a part of that. A complete solution for anon and file-backed memory could be implemented (ta-da) in userspace using the kernel inspection tools in -mm's maps2-* patches. We would need to add a means by which userspace can repopulate swapcache, but that doesn't sound too hard (especially when you haven't thought about it). 
And userspace can right now work out which pages from which files are in pagecache so this application can handle pagecache, swap and file-backed memory. (file-backed memory might not even need special treatment, given that it's pagecache anyway). And userspace can do a much better implementation of this how-to-handle-large-load-shifts problem, because it is really quite complex. The system needs to be monitored to determine what is the "usual" state (ie: the thing we wish to reestablish when the transient workload subsides). The system then needs to be monitored to determine when the exceptional workload has started, and when it has subsided, and userspace then needs to decide when to start reestablishing the old working set, at what rate, when to abort doing that, etc. All this would end up needing runtime configurability and tweakability and customisability. All standard fare for userspace stuff - much easier than patching the kernel. So. We can a) provide a way for userspace to reload pagecache and b) merge maps2 (once it's finished) (pokes mpm) and we're done? ^ permalink raw reply [flat|nested] 535+ messages in thread
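The "work out which pages from which files are in pagecache" step Andrew mentions is, for regular files, exactly what the mincore(2) syscall reports. A rough userspace sketch follows (assumptions: Linux, CPython with ctypes; the helper name and structure are mine, not from any tool in this thread):

```python
import ctypes
import ctypes.util
import mmap
import os

libc = ctypes.CDLL(ctypes.util.find_library("c") or None, use_errno=True)

def resident_pages(path):
    """Return (resident, total) pagecache page counts for a file, per mincore(2)."""
    size = os.path.getsize(path)
    npages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    if npages == 0:
        return (0, 0)
    with open(path, "rb") as f:
        # MAP_PRIVATE + PROT_WRITE only because ctypes refuses to take the
        # address of a read-only buffer; nothing is ever written to the map.
        mm = mmap.mmap(f.fileno(), size, flags=mmap.MAP_PRIVATE,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
    vec = (ctypes.c_ubyte * npages)()      # one residency byte per page
    buf = (ctypes.c_char * size).from_buffer(mm)
    rc = libc.mincore(ctypes.c_void_p(ctypes.addressof(buf)),
                      ctypes.c_size_t(size), vec)
    del buf                                # release the buffer export before closing
    mm.close()
    if rc != 0:
        raise OSError(ctypes.get_errno(), "mincore failed")
    return (sum(1 for b in vec if b & 1), npages)
```

This only covers the file-backed half of the problem; the swapcache-repopulation interface Andrew proposes would still be new kernel work.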
* Re: -mm merge plans for 2.6.23 2007-07-26 4:57 ` Andrew Morton @ 2007-07-26 5:53 ` Nick Piggin 2007-07-26 6:06 ` Andrew Morton 2007-07-26 6:33 ` Ray Lee ` (3 subsequent siblings) 4 siblings, 1 reply; 535+ messages in thread From: Nick Piggin @ 2007-07-26 5:53 UTC (permalink / raw) To: Andrew Morton Cc: Ray Lee, Eric St-Laurent, Rene Herman, Jesper Juhl, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Andrew Morton wrote: > All this would end up needing runtime configurability and tweakability and > customisability. All standard fare for userspace stuff - much easier than > patching the kernel. > > > So. We can > > a) provide a way for userspace to reload pagecache and > > b) merge maps2 (once it's finished) (pokes mpm) > > and we're done? The userspace solution has been brought up before. It could be a good way to go. I was thinking about how to do refetching of file backed pages from the kernel, and it isn't impossible, but it seems like locking would be quite hard and it would be pretty complex and inflexible compared to a userspace solution. Userspace might know what to chuck out, what to keep, what access patterns to use... Not that I want to say anything about swap prefetch getting merged: my inbox is already full of enough "helpful suggestions" about that, so I'll just be happy to have a look at little things like updatedb. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-26 5:53 ` Nick Piggin @ 2007-07-26 6:06 ` Andrew Morton 2007-07-26 6:17 ` Nick Piggin 0 siblings, 1 reply; 535+ messages in thread From: Andrew Morton @ 2007-07-26 6:06 UTC (permalink / raw) To: Nick Piggin Cc: Ray Lee, Eric St-Laurent, Rene Herman, Jesper Juhl, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Thu, 26 Jul 2007 15:53:37 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Not that I want to say anything about swap prefetch getting merged: my > inbox is already full of enough "helpful suggestions" about that, give them the kernel interfaces, they can do it themselves ;) > so I'll > just be happy to have a look at little things like updatedb. Yes, that is a little thing. I mean, even if the kernel's behaviour during an updatedb run was "perfect" (ie: does what we the designers currently intend it to do (whatever that is)) then the core problem isn't solved: short-term workload evicts your working set and you have to synchronously reestablish it. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-26 6:06 ` Andrew Morton @ 2007-07-26 6:17 ` Nick Piggin 0 siblings, 0 replies; 535+ messages in thread From: Nick Piggin @ 2007-07-26 6:17 UTC (permalink / raw) To: Andrew Morton Cc: Ray Lee, Eric St-Laurent, Rene Herman, Jesper Juhl, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Andrew Morton wrote: > On Thu, 26 Jul 2007 15:53:37 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > >>Not that I want to say anything about swap prefetch getting merged: my >>inbox is already full of enough "helpful suggestions" about that, > > > give them the kernel interfaces, they can do it themselves ;) It is a good idea if we can give enough to get started. Then if they run into something they really need to do in the kernel, we can take a look. Page eviction order / prefetch-back-in-order might be tricky to expose. >>so I'll >>just be happy to have a look at little things like updatedb. > > > Yes, that is a little thing. I mean, even if the kernel's behaviour > during an updatedb run was "perfect" (ie: does what we the designers > currently intend it to do (whatever that is)) then the core problem isn't > solved: short-term workload evicts your working set and you have to > synchronously reestablish it. Sure, I know and I was never against swap (and/or file) prefetching to solve this problem. I'm just saying, I'm staying out of that :) -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-26 4:57 ` Andrew Morton 2007-07-26 5:53 ` Nick Piggin @ 2007-07-26 6:33 ` Ray Lee 2007-07-26 6:50 ` Andrew Morton 2007-07-26 14:19 ` [ck] " Michael Chang ` (2 subsequent siblings) 4 siblings, 1 reply; 535+ messages in thread From: Ray Lee @ 2007-07-26 6:33 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, Eric St-Laurent, Rene Herman, Jesper Juhl, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 7/25/07, Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 25 Jul 2007 09:09:01 -0700 > "Ray Lee" <ray-lk@madrabbit.org> wrote: > > > No, there's a third case which I find the most annoying. I have > > multiple working sets, the sum of which won't fit into RAM. When I > > finish one, the kernel had time to preemptively swap back in the > > other, and yet it didn't. So, I sit around, twiddling my thumbs, > > waiting for my music player to come back to life, or thunderbird, > > or... > > Yes, I'm thinking that's a good problem statement and it isn't something > which the kernel even vaguely attempts to address, apart from normal > demand paging. > > We could perhaps improve things with larger and smarter fault readaround, > perhaps guided by refault-rate measurement. But that's still demand-paged > rather than being proactive/predictive/whatever. > > None of this is swap-specific though: exactly the same problem would need > to be solved for mmapped files and even plain old pagecache. <nod> Could be what I'm noticing, but it's important to note that as others have shown improvement with Con's swap prefetch, it's easily arguable that targeting just swap is good enough for a first approximation. > In fact I'd restate the problem as "system is in steady state A, then there > is a workload shift causing transition to state B, then the system goes > idle. We now wish to reinstate state A in anticipation of a resumption of > the original workload". Yes, that's a fair transformation / generalization. 
It's always nice talking to someone with more clarity than one's self. > swap-prefetch solves a part of that. > > A complete solution for anon and file-backed memory could be implemented > (ta-da) in userspace using the kernel inspection tools in -mm's maps2-* > patches. > We would need to add a means by which userspace can repopulate > swapcache, Okay, let's run with that for argument's sake. > but that doesn't sound too hard (especially when you haven't > thought about it). I've always thought your sense of humor was underappreciated. > And userspace can right now work out which pages from which files are in > pagecache so this application can handle pagecache, swap and file-backed > memory. (file-backed memory might not even need special treatment, given > that it's pagecache anyway). So in your proposed scheme, would userspace be polling, er, <goes and looks through email for maps2 stuff, only finds Rusty's patches to it>, well, /proc/<pids>/something_or_another? A userspace daemon that wakes up regularly to poll a bunch of proc files fills me with glee. Wait, is that glee? I think, no... wait... horror, yes, horror is what I'm feeling. I'm wrong, right? I love being wrong about this kind of stuff. > And userspace can do a much better implementation of this > how-to-handle-large-load-shifts problem, because it is really quite > complex. The system needs to be monitored to determine what is the "usual" > state (ie: the thing we wish to reestablish when the transient workload > subsides). The system then needs to be monitored to determine when the > exceptional workload has started, and when it has subsided, and userspace > then needs to decide when to start reestablishing the old working set, at > what rate, when to abort doing that, etc. Oy. I mean this in the most respectful way possible, but you're too smart for your own good. 
I mean, sure, it's possible one could have multiply-chained transient workloads each of which have their optimum workingset, of which there's little overlap with the previous. Mainframes made their names on such loads. Workingset A starts, generates data, finishes and invokes workingset B, of which the only thing they share in common is said data. B finishes and invokes C, etc. So, yeah, that's way too complex to stuff into the kernel. Even if it were possible to do so, I cringe at the thought. And I can't believe that would be a common enough pattern nowadays to justify any heuristics on anyone's part. It's certainly complex enough that I'd like to punt that scenario out of the conversation entirely -- I think it has the potential to give a false impression as to how involved of a process we're talking about here. Let's go back to your restatement: > In fact I'd restate the problem as "system is in steady state A, then there > is a workload shift causing transition to state B, then the system goes > idle. We now wish to reinstate state A in anticipation of a resumption of > the original workload". I'll take an 80% solution for that one problem, and happily declare that the kernel's job is done. In particular, when a resource hog exits (or whatever heuristics prefetch is currently hooking in to), the kernel (or userspace, if that interface could be made sane) could exercise a completely workload agnostic refetch of the last n things evicted, where n is determined by what's suddenly become free (or whatever Con came up with). Just, y'know, MRU style. > All this would end up needing runtime configurability and tweakability and > customisability. All standard fare for userspace stuff - much easier than > patching the kernel. We're talking about patching the kernel for whatever API you're coming up with to repopulate pagecache, swap, and inodes, aren't we? If we are, it doesn't seem like we're saving any work here.
Also we're talking about creating a new user-visible API instead of augmenting a pre-existing heuristic -- page replacement -- that the kernel doesn't export and so can change at a moment's notice. Augmenting an opaque heuristic seems a lot more friendly to long-term maintenance. > So. We can > > a) provide a way for userspace to reload pagecache and > > b) merge maps2 (once it's finished) (pokes mpm) > > and we're done? Eh, dunno. Maybe? We're assuming we come up with an API for userspace to get notifications of evictions (without polling, though poll() would be fine -- you know what I mean), and an API for re-victing those things on demand. If you think that adding that API and maintaining it is simpler/better than including a variation on the above heuristic I offered, then yeah, I guess we are. It'll all have that vague userspace s2ram odor about it, but I'm sure it could be made to work. As I think I've successfully Peter Principled my way through this conversation to my level of incompetence, I'll shut up now. Ray ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-26 6:33 ` Ray Lee @ 2007-07-26 6:50 ` Andrew Morton 2007-07-26 7:43 ` Ray Lee 2007-07-28 0:24 ` Matt Mackall 0 siblings, 2 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-26 6:50 UTC (permalink / raw) To: Ray Lee Cc: Nick Piggin, Eric St-Laurent, Rene Herman, Jesper Juhl, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Wed, 25 Jul 2007 23:33:24 -0700 "Ray Lee" <ray-lk@madrabbit.org> wrote: > > So. We can > > > > a) provide a way for userspace to reload pagecache and > > > > b) merge maps2 (once it's finished) (pokes mpm) > > > > and we're done? > > Eh, dunno. Maybe? > > We're assuming we come up with an API for userspace to get > notifications of evictions (without polling, though poll() would be > fine -- you know what I mean), and an API for re-victing those things > on demand. I was assuming that polling would work OK. I expect it would. > If you think that adding that API and maintaining it is > simpler/better than including a variation on the above heuristic I > offered, then yeah, I guess we are. It'll all have that vague > userspace s2ram odor about it, but I'm sure it could be made to work. Actually, I overdesigned the API, I suspect. What we _could_ do is to provide a way of allowing userspace to say "pretend process A touched page B": adopt its mm and go touch the page. We in fact already have that: PTRACE_PEEKTEXT. So I suspect this could all be done by polling maps2 and using PEEKTEXT. The tricky part would be working out when to poll, and when to reestablish. A neater implementation than PEEKTEXT would be to make the maps2 files writeable(!) so as a party trick you could tar 'em up and then, when you want to reestablish firefox's previous working set, do an untar in /proc/$(pidof firefox)/ ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-26 6:50 ` Andrew Morton @ 2007-07-26 7:43 ` Ray Lee 2007-07-26 7:59 ` Nick Piggin 2007-07-28 0:24 ` Matt Mackall 1 sibling, 1 reply; 535+ messages in thread From: Ray Lee @ 2007-07-26 7:43 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, Eric St-Laurent, Rene Herman, Jesper Juhl, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 7/25/07, Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 25 Jul 2007 23:33:24 -0700 "Ray Lee" <ray-lk@madrabbit.org> wrote: > > If you think that adding that API and maintaining it is > > simpler/better than including a variation on the above heuristic I > > offered, then yeah, I guess we are. It'll all have that vague > > userspace s2ram odor about it, but I'm sure it could be made to work. > > Actually, I overdesigned the API, I suspect. What we _could_ do is to > provide a way of allowing userspace to say "pretend process A touched page > B": adopt its mm and go touch the page. We in fact already have that: > PTRACE_PEEKTEXT. Huh. All right. > So I suspect this could all be done by polling maps2 and using PEEKTEXT. > The tricky part would be working out when to poll, and when to reestablish. Welllllll.... there is the taskstats interface. It's not required right now, though, and lacks most of what userspace would need, I think. It does at least currently provide a notification of process exit, which is a clue for when to start reestablishment. Gotta be another way we can get at that... Oh, stat on /proc, does that work? Huh, it does, sort of. It seems to be off by 12 or 13, but hey, that's something. Wish I had the time to look at the maps2 stuff, but regardless, it probably currently provides too much detail for continual polling? 
I suspect what we'd want to do is to take a detailed snapshot a little after the beginning of a process's lifetime (once the block-in counts subside), then poll aggregate residency or eviction counts to know which processes are suffering the burden of the transient workload. Eh, wait, that doesn't help with inodes. No matter, I guess; I'm the one who said targeting swap-in would be good enough for a first pass. On process exit, if userspace can get a hold of an estimate of the size of what just freed up, it could then spend min(that,evicted_count) on repopulation. That's probably already available by polling whatever `free` calls. > A neater implementation than PEEKTEXT would be to make the maps2 files > writeable(!) so as a party trick you could tar 'em up and then, when you > want to reestablish firefox's previous working set, do an untar in > /proc/$(pidof firefox)/ I'm going to get into trouble if I wake up the other person in the house with my laughter. That's laughter in a positive sense, not a "you're daft" kind of way. Huh. <thinks> So, to go back a little bit, I guess one of my problems with polling is that it means that userspace can only approximate an MRU of what's been evicted. Perhaps an approximation is good enough, I don't know, but that's something to keep in mind. (Hmm, how many pages can an average desktop evict per second? If we poll everything once per second, that's how off we could be.) Another is a more philosophical hangup -- running a process that polls periodically to improve system performance seems backward. Okay, so that's my problem to get over, not yours. Another problem is what poor sod would be willing to write and test this, given that there's already a written and tested kernel patch to do much the same thing? Yeah, that's sorta rhetorical, but it's sorta not. Given that swap prefetch could be ripped out of 2.6.n+1 if it's introduced in 2.6.n, and nothing in userspace would be the wiser, where's the burden? 
There is some, just as any kernel code has some, and as it's core code (versus, say, a driver), the burden is correspondingly greater per line, but given the massive changesets flowing through each release now, I have to think that the burden this introduces is marginal compared to the rest of the bulk sweeping through the kernel weekly. This is obviously where I'm totally conjecturing, and you'll know far, far better than I. Offline for about 20 hours or so, not that anyone would probably notice :-). Ray ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-26 7:43 ` Ray Lee @ 2007-07-26 7:59 ` Nick Piggin 0 siblings, 0 replies; 535+ messages in thread From: Nick Piggin @ 2007-07-26 7:59 UTC (permalink / raw) To: Ray Lee Cc: Andrew Morton, Eric St-Laurent, Rene Herman, Jesper Juhl, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Ray Lee wrote: > Another is a more philosophical hangup -- running a process that polls > periodically to improve system performance seems backward. You mean like the kprefetchd of swap prefetch? ;) > Okay, so > that's my problem to get over, not yours. If it was a problem you could add some event trigger to wake it up. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-26 6:50 ` Andrew Morton 2007-07-26 7:43 ` Ray Lee @ 2007-07-28 0:24 ` Matt Mackall 1 sibling, 0 replies; 535+ messages in thread From: Matt Mackall @ 2007-07-28 0:24 UTC (permalink / raw) To: Andrew Morton Cc: Ray Lee, Nick Piggin, Eric St-Laurent, Rene Herman, Jesper Juhl, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Wed, Jul 25, 2007 at 11:50:37PM -0700, Andrew Morton wrote: > On Wed, 25 Jul 2007 23:33:24 -0700 "Ray Lee" <ray-lk@madrabbit.org> wrote: > > > > So. We can > > > > > > a) provide a way for userspace to reload pagecache and > > > > > > b) merge maps2 (once it's finished) (pokes mpm) > > > > > > and we're done? > > > > Eh, dunno. Maybe? > > > > We're assuming we come up with an API for userspace to get > > notifications of evictions (without polling, though poll() would be > > fine -- you know what I mean), and an API for re-victing those things > > on demand. > > I was assuming that polling would work OK. I expect it would. > > > If you think that adding that API and maintaining it is > > simpler/better than including a variation on the above heuristic I > > offered, then yeah, I guess we are. It'll all have that vague > > userspace s2ram odor about it, but I'm sure it could be made to work. > > Actually, I overdesigned the API, I suspect. What we _could_ do is to > provide a way of allowing userspace to say "pretend process A touched page > B": adopt its mm and go touch the page. We in fact already have that: > PTRACE_PEEKTEXT. > > So I suspect this could all be done by polling maps2 and using PEEKTEXT. > The tricky part would be working out when to poll, and when to reestablish. > > A neater implementation than PEEKTEXT would be to make the maps2 files > writeable(!) so as a party trick you could tar 'em up and then, when you > want to reestablish firefox's previous working set, do an untar in > /proc/$(pidof firefox)/ Sick. But thankfully, unnecessary. 
The pagemaps give you more than just a present bit, which is all we care about here. We simply need to record which pages are mapped, then reference them all back to life. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-26 4:57 ` Andrew Morton 2007-07-26 5:53 ` Nick Piggin 2007-07-26 6:33 ` Ray Lee @ 2007-07-26 14:19 ` Michael Chang 2007-07-26 18:13 ` Andrew Morton 2007-07-28 0:12 ` Matt Mackall 2007-07-28 3:42 ` Daniel Cheng 4 siblings, 1 reply; 535+ messages in thread From: Michael Chang @ 2007-07-26 14:19 UTC (permalink / raw) To: Andrew Morton Cc: Ray Lee, Nick Piggin, Eric St-Laurent, linux-kernel, ck list, linux-mm, Paul Jackson, Jesper Juhl, Rene Herman On 7/26/07, Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 25 Jul 2007 09:09:01 -0700 > "Ray Lee" <ray-lk@madrabbit.org> wrote: > > > No, there's a third case which I find the most annoying. I have > > multiple working sets, the sum of which won't fit into RAM. When I > > finish one, the kernel had time to preemptively swap back in the > > other, and yet it didn't. So, I sit around, twiddling my thumbs, > > waiting for my music player to come back to life, or thunderbird, > > or... > > In fact I'd restate the problem as "system is in steady state A, then there > is a workload shift causing transition to state B, then the system goes > idle. We now wish to reinstate state A in anticipation of a resumption of > the original workload". > > swap-prefetch solves a part of that. > > A complete solution for anon and file-backed memory could be implemented > (ta-da) in userspace using the kernel inspection tools in -mm's maps2-* > patches. We would need to add a means by which userspace can repopulate > swapcache, but that doesn't sound too hard (especially when you haven't > thought about it). > > And userspace can right now work out which pages from which files are in > pagecache so this application can handle pagecache, swap and file-backed > memory. (file-backed memory might not even need special treatment, given > that it's pagecache anyway). 
> > And userspace can do a much better implementation of this > how-to-handle-large-load-shifts problem, because it is really quite > complex. The system needs to be monitored to determine what is the "usual" > state (ie: the thing we wish to reestablish when the transient workload > subsides). The system then needs to be monitored to determine when the > exceptional workload has started, and when it has subsided, and userspace > then needs to decide when to start reestablishing the old working set, at > what rate, when to abort doing that, etc. > > All this would end up needing runtime configurability and tweakability and > customisability. All standard fare for userspace stuff - much easier than > patching the kernel. Maybe I'm missing something here, but if the problem is resource allocation when switching from state A to state B, and from B to C, etc.; wouldn't it be a bad thing if state B happened to be (in the future) this state-shifting userspace daemon of which you speak? (Or is that likely to be impossible/unlikely for some other reason which eludes me at the moment?) -- Michael Chang Please avoid sending me Word or PowerPoint attachments. Send me ODT, RTF, or HTML instead. See http://www.gnu.org/philosophy/no-word-attachments.html Thank you. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-26 14:19 ` [ck] " Michael Chang @ 2007-07-26 18:13 ` Andrew Morton 2007-07-26 22:04 ` Dirk Schoebel 0 siblings, 1 reply; 535+ messages in thread From: Andrew Morton @ 2007-07-26 18:13 UTC (permalink / raw) To: Michael Chang Cc: Ray Lee, Nick Piggin, Eric St-Laurent, linux-kernel, ck list, linux-mm, Paul Jackson, Jesper Juhl, Rene Herman On Thu, 26 Jul 2007 10:19:06 -0400 "Michael Chang" <thenewme91@gmail.com> wrote: > > All this would end up needing runtime configurability and tweakability and > > customisability. All standard fare for userspace stuff - much easier than > > patching the kernel. > > Maybe I'm missing something here, but if the problem is resource > allocation when switching from state A to state B, and from B to C, > etc.; wouldn't it be a bad thing if state B happened to be (in the > future) this state-shifting userspace daemon of which you speak? (Or > is that likely to be impossible/unlikely for some other reason which > eludes me at the moment?) Well. I was assuming that the daemon wouldn't be a great memory pig. I suspect it would do practically zero IO and would use little memory. It could even be mlocked, but I doubt if that would be needed. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-26 18:13 ` Andrew Morton @ 2007-07-26 22:04 ` Dirk Schoebel 2007-07-26 22:33 ` Dirk Schoebel 0 siblings, 1 reply; 535+ messages in thread From: Dirk Schoebel @ 2007-07-26 22:04 UTC (permalink / raw) To: ck Cc: Andrew Morton, Michael Chang, Nick Piggin, Ray Lee, Eric St-Laurent, linux-kernel, linux-mm, Paul Jackson, Jesper Juhl, Rene Herman [-- Attachment #1: Type: text/plain, Size: 1323 bytes --] I don't really understand the reasons for all those discussions. As long as you have a maintainer, why don't you just put swap prefetch into the kernel, marked experimental, deactivated by default so anyone who just make[s] oldconfig (or defaultconfig) won't get it? If anyone finds a good solution for all those cache 'poisoning' problems and the problems concerning swapin on workload changes and such, swap prefetch can easily be taken out again, and no one has to complain about continuing to maintain it. Actually the same goes for plugsched (having it might have kept Con as a valuable developer). I have actually been waiting for more than 2 years for reiser4 to make it into the kernel (sure, marked experimental or even highly experimental) so the patch-journey for every new kernel comes to an end. And most things in-kernel will surely be tested more intensively, so bugs will come up much faster. (Constantly running -mm kernels is not really an option, since many things in there can't be deactivated if they don't work as expected, and lots of patches also concern 'vital' parts of the kernel.) ...just 2 cents from a happy CK user, for it made it possible to watch a movie while compiling the system - which was impossible with the mainline kernel, even on a dual core 2.2 GHz AMD64 with 4G RAM! Dirk. [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-26 22:04 ` Dirk Schoebel @ 2007-07-26 22:33 ` Dirk Schoebel 2007-07-26 23:27 ` Jeff Garzik 0 siblings, 1 reply; 535+ messages in thread From: Dirk Schoebel @ 2007-07-26 22:33 UTC (permalink / raw) To: ck Cc: Nick Piggin, Ray Lee, Eric St-Laurent, linux-kernel, linux-mm, Paul Jackson, Jesper Juhl, Andrew Morton, Rene Herman [-- Attachment #1: Type: text/plain, Size: 1126 bytes --] ...sorry for the reply to myself. As a Gentoo user I'm happy about the freedom of choice in almost every aspect of the system. But with the kernel this freedom is taken away and I'm left largely with choices other people made. Sure, I can get the sources and patch and change the kernel myself in every aspect, but that's more the possibility given by the freedom of the open source of the kernel. I think the kernel should be more open to new ideas, and as long as the maintainer follows the kernel development things can be left in, if the maintainer can't follow anymore they are taken out quite fast again. (This statement mostly counts for parts of the kernel where a choice is possible or the coding overhead of making such choice possible is quite low.) A problem of Linux is that there are 100s of patches and patchsets out there, but if you want to have more than one (since they offer advantages or functionality in different places) you are mostly left alone, starting to integrate all the patches by hand and solving so many rejects. ..still a happy CK user, but sad that Con left the scene. Dirk. [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-26 22:33 ` Dirk Schoebel @ 2007-07-26 23:27 ` Jeff Garzik 2007-07-26 23:29 ` david 0 siblings, 1 reply; 535+ messages in thread From: Jeff Garzik @ 2007-07-26 23:27 UTC (permalink / raw) To: Dirk Schoebel Cc: ck, Nick Piggin, Ray Lee, Eric St-Laurent, linux-kernel, linux-mm, Paul Jackson, Jesper Juhl, Andrew Morton, Rene Herman Dirk Schoebel wrote: > as long as the > maintainer follows the kernel development things can be left in, if the > maintainer can't follow anymore they are taken out quite fast again. (This > statement mostly counts for parts of the kernel where a choice is possible or > the coding overhead of making such choice possible is quite low.) This is just not good engineering. It is axiomatic that it is easy to add code, but difficult to remove code. It takes -years- to remove code that no one uses. Long after the maintainer disappears, the users (and bug reports!) remain. It is also axiomatic that adding code, particularly core code, often exponentially increases complexity. Jeff ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-26 23:27 ` Jeff Garzik @ 2007-07-26 23:29 ` david 2007-07-26 23:39 ` Jeff Garzik 0 siblings, 1 reply; 535+ messages in thread From: david @ 2007-07-26 23:29 UTC (permalink / raw) To: Jeff Garzik Cc: Dirk Schoebel, ck, Nick Piggin, Ray Lee, Eric St-Laurent, linux-kernel, linux-mm, Paul Jackson, Jesper Juhl, Andrew Morton, Rene Herman On Thu, 26 Jul 2007, Jeff Garzik wrote: > Dirk Schoebel wrote: >> as long as the maintainer follows the kernel development things can be >> left in, if the maintainer can't follow anymore they are taken out quite >> fast again. (This statement mostly counts for parts of the kernel where a >> choice is possible or the coding overhead of making such choice possible >> is quite low.) > > > This is just not good engineering. > > It is axiomatic that it is easy to add code, but difficult to remove code. > It takes -years- to remove code that no one uses. Long after the maintainer > disappears, the users (and bug reports!) remain. I'll point out that the code that's so hard to remove is the code that exposes an API to userspace. code that's an internal implementation (like a couple of the things being discussed) gets removed much faster. > It is also axiomatic that adding code, particularly core code, often > exponentially increases complexity. this is true and may be a valid argument (depending on how large and how intrusive the proposed patch is) David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-26 23:29 ` david @ 2007-07-26 23:39 ` Jeff Garzik 2007-07-27 0:12 ` david 0 siblings, 1 reply; 535+ messages in thread From: Jeff Garzik @ 2007-07-26 23:39 UTC (permalink / raw) To: david Cc: Dirk Schoebel, ck, Nick Piggin, Ray Lee, Eric St-Laurent, linux-kernel, linux-mm, Paul Jackson, Jesper Juhl, Andrew Morton, Rene Herman david@lang.hm wrote: > On Thu, 26 Jul 2007, Jeff Garzik wrote: > >> Dirk Schoebel wrote: >>> as long as the maintainer follows the kernel development things can be >>> left in, if the maintainer can't follow anymore they are taken out >>> quite >>> fast again. (This statement mostly counts for parts of the kernel >>> where a >>> choice is possible or the coding overhead of making such choice >>> possible >>> is quite low.) >> >> >> This is just not good engineering. >> >> It is axiomatic that it is easy to add code, but difficult to remove >> code. It takes -years- to remove code that no one uses. Long after >> the maintainer disappears, the users (and bug reports!) remain. > > I'll point out that the code that's so hard to remove is the code that > exposes an API to userspace. True. > code that's an internal implementation (like a couple of the things > being discussed) gets removed much faster. Not true. It is highly unlikely that code will get removed if it has active users, even if the maintainer has disappeared. The only things that get removed rapidly are those things mathematically guaranteed to be dead code. _Behavior changes_, driver removals, feature removals happen more frequently than userspace ABI changes -- true -- but the rate of removal is still very, very slow. It is axiomatic that we are automatically burdened with new code for at least 10 years :) That's what you have to assume, when accepting anything. Jeff ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-26 23:39 ` Jeff Garzik @ 2007-07-27 0:12 ` david 0 siblings, 0 replies; 535+ messages in thread From: david @ 2007-07-27 0:12 UTC (permalink / raw) To: Jeff Garzik Cc: Dirk Schoebel, ck, Nick Piggin, Ray Lee, Eric St-Laurent, linux-kernel, linux-mm, Paul Jackson, Jesper Juhl, Andrew Morton, Rene Herman On Thu, 26 Jul 2007, Jeff Garzik wrote: > david@lang.hm wrote: >> On Thu, 26 Jul 2007, Jeff Garzik wrote: >> >> > Dirk Schoebel wrote: >> > > as long as the maintainer follows the kernel development things can >> > > be >> > > left in, if the maintainer can't follow anymore they are taken out >> > > quite >> > > fast again. (This statement mostly counts for parts of the kernel >> > > where a >> > > choice is possible or the coding overhead of making such choice >> > > possible >> > > is quite low.) >> > >> > >> > This is just not good engineering. >> > >> > It is axiomatic that it is easy to add code, but difficult to remove >> > code. It takes -years- to remove code that no one uses. Long after the >> > maintainer disappears, the users (and bug reports!) remain. >> >> I'll point out that the code that's so hard to remove is the code that >> exposes an API to userspace. > > True. > > >> code that's an internal implementation (like a couple of the things being >> discussed) gets removed much faster. > > Not true. It is highly unlikely that code will get removed if it has active > users, even if the maintainer has disappeared. if you propose removing code in such a way that performance suffers then yes, it's hard to remove (and it should be). but if it has no API the code is only visible to the users as a side effect of its use. if the new code works better then it can be replaced. the scheduler change that we're going through right now is an example, new code came along that was better and the old code went away very quickly. the SLAB/SLOB/SLUB/S**B debate is another example. 
currently the different versions have different performance advantages and disadvantages; as one drops behind to the point where one of the others is better at all times, it goes away. > The only things that get removed rapidly are those things mathematically > guaranteed to be dead code. > > _Behavior changes_, driver removals, feature removals happen more frequently > than userspace ABI changes -- true -- but the rate of removal is still very, > very slow. a large part of this is that it's so hard to get a replacement that works better (this is very definitely a compliment to the kernel coders :-) > It is axiomatic that we are automatically burdened with new code for at > least 10 years :) That's what you have to assume, when accepting > anything. for userspace APIs 10 years is reasonable, for internal features it's not. there is a LOT of internal stuff that was in the kernel 10 (or even 5) years ago that isn't there now. the key is that the behavior as far as users is concerned is better now. David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-26 4:57 ` Andrew Morton ` (2 preceding siblings ...) 2007-07-26 14:19 ` [ck] " Michael Chang @ 2007-07-28 0:12 ` Matt Mackall 2007-07-28 3:42 ` Daniel Cheng 4 siblings, 0 replies; 535+ messages in thread From: Matt Mackall @ 2007-07-28 0:12 UTC (permalink / raw) To: Andrew Morton Cc: Ray Lee, Nick Piggin, Eric St-Laurent, Rene Herman, Jesper Juhl, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Wed, Jul 25, 2007 at 09:57:17PM -0700, Andrew Morton wrote: > So. We can > > a) provide a way for userspace to reload pagecache and > > b) merge maps2 (once it's finished) (pokes mpm) Consider me poked, despite not being cc:ed. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-26 4:57 ` Andrew Morton ` (3 preceding siblings ...) 2007-07-28 0:12 ` Matt Mackall @ 2007-07-28 3:42 ` Daniel Cheng 2007-07-28 9:35 ` Stefan Richter 4 siblings, 1 reply; 535+ messages in thread From: Daniel Cheng @ 2007-07-28 3:42 UTC (permalink / raw) To: linux-kernel; +Cc: ck, linux-mm, linux-kernel, linux-mm Andrew Morton wrote: [...] > > And userspace can do a much better implementation of this > how-to-handle-large-load-shifts problem, because it is really quite > complex. The system needs to be monitored to determine what is the "usual" [...] > All this would end up needing runtime configurability and tweakability and > customisability. All standard fare for userspace stuff - much easier than > patching the kernel. But a patch already exists. Which is easier: (1) apply the patch; or (2) write a new patch? > > So. We can > a) provide a way for userspace to reload pagecache and > b) merge maps2 (once it's finished) (pokes mpm) > and we're done? Might be. But merging maps2 has higher risk, and should be done in a development branch (er... 2.7, but we don't have it now). -- ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-28 3:42 ` Daniel Cheng @ 2007-07-28 9:35 ` Stefan Richter 0 siblings, 0 replies; 535+ messages in thread From: Stefan Richter @ 2007-07-28 9:35 UTC (permalink / raw) To: Daniel Cheng; +Cc: linux-kernel, ck, linux-mm Daniel Cheng wrote: > but merging maps2 have higher risk which should be done in a development > branch (er... 2.7, but we don't have it now). This is off-topic and has been discussed to death, but: Rather than one stable branch and one development branch, we have a few stable branches and a lot of development branches. Some are located at git.kernel.org. Among other things, this gives you a predictable release rhythm and very timely updated stable branches. -- Stefan Richter -=====-=-=== -=== ===-- http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 5:37 ` Nick Piggin ` (3 preceding siblings ...) 2007-07-25 16:09 ` Ray Lee @ 2007-07-25 17:55 ` Frank A. Kingswood 4 siblings, 0 replies; 535+ messages in thread From: Frank A. Kingswood @ 2007-07-25 17:55 UTC (permalink / raw) To: linux-kernel; +Cc: ck, linux-mm Nick Piggin wrote: > OK, this is where I start to worry. Swap prefetch AFAIKS doesn't fix > the updatedb problem very well, because if updatedb has caused swapout > then it has filled memory, and swap prefetch doesn't run unless there > is free memory (not to mention that updatedb would have paged out other > files as well). It is *not* about updatedb. That is just a trivial case which people notice. Therefore fixing updatedb to be nicer, as was discussed at various points in this thread, is *not* the solution. Most users are also *not*at*all* interested in kernel builds as a metric of system performance. When I'm at work, I run a large, commercial, engineering application. While running, it takes most of the system memory (4GB and up), and it reads and writes very large files. Swap prefetch noticeably helps my desktop too. Can I measure it? Not sure. Can people on lkml fix the application? Certainly not. Frank ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-25 4:06 ` Nick Piggin 2007-07-25 4:55 ` Rene Herman @ 2007-07-25 6:09 ` Matthew Hawkins 2007-07-25 6:18 ` Nick Piggin 2007-07-25 16:19 ` Ray Lee ` (2 subsequent siblings) 4 siblings, 1 reply; 535+ messages in thread From: Matthew Hawkins @ 2007-07-25 6:09 UTC (permalink / raw) To: Nick Piggin Cc: Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton On 7/25/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > I'm not saying that we can't try to tackle that problem, but first of > all you have a really nice narrow problem where updatedb seems to be > causing the kernel to completely do the wrong thing. So we start on > that. updatedb isn't the only problem, its just an obvious one. I like the idea of looking into the vfs for this and other one-shot applications (rather than looking at updatedb itself specifically) Many modern applications have a lot of open file handles. For example, I just fired up my usual audio player and sys/fs/file-nr showed another 600 open files (funnily enough, I have roughly that many audio files :) I'm not exactly sure what happens when this one gets swapped out for whatever reason (firefox/java/vmware/etc chews ram, updatedb, whatever) but I'm fairly confident what happens between kswapd and the vfs and whatever else we're caching is not optimal come time for this process to context-switch back in. We're not running a highly-optimised number-crunching scientific app on desktops, we're running a full herd of poorly-coded hogs simultaneously through smaller pens. I don't think anyone is trying to claim that swap prefetch is the be all and end all of this problem's solution, however without it the effects are an order of magnitude worse (I've cited numbers elsewhere, as have several others); its relatively non-intrusive (600+ lines of the 755 changed ones are self-contained), is compile and runtime selectable, and still has a maintainer now that Con has retired. 
If there was a better solution, it should have been developed sometime in the past 23 months that swap prefetch has addressed it. That's how we got rmap versus aa, and so on. But nobody chose to do so, and continuing to hold out on merging it on the promise of vapourware is ridiculous. That has never been the way linux kernel development has operated. -- Matt ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-25 6:09 ` [ck] " Matthew Hawkins @ 2007-07-25 6:18 ` Nick Piggin 0 siblings, 0 replies; 535+ messages in thread From: Nick Piggin @ 2007-07-25 6:18 UTC (permalink / raw) To: Matthew Hawkins Cc: Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton Matthew Hawkins wrote: > On 7/25/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> I'm not saying that we can't try to tackle that problem, but first of >> all you have a really nice narrow problem where updatedb seems to be >> causing the kernel to completely do the wrong thing. So we start on >> that. > > > updatedb isn't the only problem, its just an obvious one. I like the > idea of looking into the vfs for this and other one-shot applications > (rather than looking at updatedb itself specifically) That's the point, it is an obvious one. So it should be easy to work out why it is going wrong, and fix it. (And hopefully that fixes some of the less obvious problems too.) > Many modern applications have a lot of open file handles. For > example, I just fired up my usual audio player and sys/fs/file-nr > showed another 600 open files (funnily enough, I have roughly that > many audio files :) I'm not exactly sure what happens when this one > gets swapped out for whatever reason (firefox/java/vmware/etc chews > ram, updatedb, whatever) but I'm fairly confident what happens between > kswapd and the vfs and whatever else we're caching is not optimal come > time for this process to context-switch back in. We're not running a > highly-optimised number-crunching scientific app on desktops, we're > running a full herd of poorly-coded hogs simultaneously through > smaller pens. And yet nobody wants to take the time to properly analyse why these things are going wrong and reporting their findings? Or if they have, where is that documented? -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 4:06 ` Nick Piggin 2007-07-25 4:55 ` Rene Herman 2007-07-25 6:09 ` [ck] " Matthew Hawkins @ 2007-07-25 16:19 ` Ray Lee 2007-07-25 20:46 ` Andi Kleen 2007-07-31 16:37 ` [ck] Re: -mm merge plans for 2.6.23 Matthew Hawkins 4 siblings, 0 replies; 535+ messages in thread From: Ray Lee @ 2007-07-25 16:19 UTC (permalink / raw) To: Nick Piggin Cc: Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 7/24/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Ray Lee wrote: > > On 7/23/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> If we can first try looking at > >> some specific problems that are easily identified. > > > > Always easier, true. Let's start with "My mouse jerks around under > > memory load." A Google Summer of Code student working on X.Org claims > > that mlocking the mouse handling routines gives a smooth cursor under > > load ([1]). It's surprising that the kernel would swap that out in the > > first place. > > > > [1] > > http://vignatti.wordpress.com/2007/07/06/xorg-input-thread-summary-or-something/ > > OK, I'm not sure what the point is though. Under heavy memory load, > things are going to get swapped out... and swap prefetch isn't going > to help there (at least, not during the memory load). Sorry, I headed slightly off-topic. Or perhaps 'up-topic' to the larger issue, which is that the desktop experience has some suckiness to it. My point is that the page replacement algorithm has some choice as to what to evict. The xorg input handler never should have been evicted. It was hopefully a hard example of where the current page replacement policy is falling flat on its face. All that said, this could really easily be handled by xorg mlocking the critical realtime stuff. > There are also other issues like whether the CPU scheduler is at fault, > etc. Interactive workloads are always the hardest to work out. 
This one is not a scheduler issue, as mlock()ing the mouse handling routines gives a smooth cursor. It's just a pure page replacement problem, as the kernel should never have swapped that out in the first place. <snip things I agreed with> <snip list of things to watch during updatedb run> Ray ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 4:06 ` Nick Piggin ` (2 preceding siblings ...) 2007-07-25 16:19 ` Ray Lee @ 2007-07-25 20:46 ` Andi Kleen 2007-07-26 8:38 ` Frank Kingswood 2007-07-31 16:37 ` [ck] Re: -mm merge plans for 2.6.23 Matthew Hawkins 4 siblings, 1 reply; 535+ messages in thread From: Andi Kleen @ 2007-07-25 20:46 UTC (permalink / raw) To: Nick Piggin Cc: Ray Lee, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Nick Piggin <nickpiggin@yahoo.com.au> writes: > Ray Lee wrote: > > On 7/23/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > >> Also a random day at the desktop, it is quite a broad scope and > >> pretty well impossible to analyse. > > It is pretty broad, but that's also what swap prefetch is targetting. > > As for hard to analyze, I'm not sure I agree. One can black-box test > > this stuff with only a few controls. e.g., if I use the same apps each > > day (mercurial, firefox, xorg, gcc), and the total I/O wait time > > consistently goes down on a swap prefetch kernel (normalized by some > > control statistic, such as application CPU time or total I/O, or > > something), then that's a useful measurement. > > I'm not saying that we can't try to tackle that problem, but first of > all you have a really nice narrow problem where updatedb seems to be > causing the kernel to completely do the wrong thing. So we start on > that. One simple way to fix this would be to implement a fadvise() flag that puts the dentry/inode on a "soon to be expired" list if there are no other references. Then if a dentry allocation needs more memory try to reuse dentries from that list (or better queue) first. Any other access will remove the dentry from the list. Disadvantage would be that the userland would need to be patched, but I guess it's better than adding very dubious heuristics to the kernel. Similar thing could be done for directory buffers although they are probably less of a problem. 
I expect that C. Lameter's directed dentry/inode freeing in slub will also make a big difference. People who have problems with updatedb should definitely try -mm, which I believe has it, and enable SLUB. -Andi (who always thought swap prefetch was just a workaround, not a real solution) ^ permalink raw reply [flat|nested] 535+ messages in thread
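The fadvise() flag Andi proposes for dentries/inodes does not exist; the nearest thing in the existing API is posix_fadvise() with POSIX_FADV_DONTNEED, which covers only a file's cached data pages, not its dentry or inode. A minimal sketch of how a scan-once tool could use that existing hint (read_once() is a hypothetical helper; assumes Linux, where Python exposes os.posix_fadvise):

```python
import os
import tempfile

def read_once(path):
    """Read a file, then advise the kernel its cached pages are expendable.

    This is the page-cache analog of Andi's proposal: an updatedb-like
    scanner that touches every file exactly once could issue this hint
    after each read.  It is advisory only and does not touch the dcache.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        data = b""
        while chunk := os.read(fd, 65536):
            data += chunk
        # Hint: we will not touch these pages again (len=0 means "to EOF").
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return data
    finally:
        os.close(fd)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello, dcache\n")
    name = f.name
print(read_once(name).decode(), end="")
os.remove(name)
```

As the thread notes, this leaves the dentry/inode side of the problem untouched; only a new kernel-side flag (or the vfs_cache_pressure knob discussed below in the thread) reaches those caches.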
* Re: -mm merge plans for 2.6.23 2007-07-25 20:46 ` Andi Kleen @ 2007-07-26 8:38 ` Frank Kingswood 2007-07-26 9:20 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Frank Kingswood @ 2007-07-26 8:38 UTC (permalink / raw) To: linux-kernel; +Cc: ck, linux-mm Andi Kleen wrote: > One simple way to fix this would be to implement a fadvise() flag > that puts the dentry/inode on a "soon to be expired" list if there > are no other references. Then if a dentry allocation needs more > memory try to reuse dentries from that list (or better queue) first. Any other > access will remove the dentry from the list. > > Disadvantage would be that the userland would need to be patched, > but I guess it's better than adding very dubious heuristics to the > kernel. Are you going to change every single large memory application in the world? As I wrote before, it is *not* about updatedb, but about all applications that use a lot of memory, and then terminate. Frank ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-26 8:38 ` Frank Kingswood @ 2007-07-26 9:20 ` Ingo Molnar 2007-07-26 9:34 ` Andrew Morton 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-26 9:20 UTC (permalink / raw) To: Frank Kingswood Cc: Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, Andrew Morton, ck list, Paul Jackson, linux-mm, linux-kernel * Frank Kingswood <frank@kingswood-consulting.co.uk> wrote: > > Disadvantage would be that the userland would need to be patched, > > but I guess it's better than adding very dubious heuristics to the > > kernel. > > Are you going to change every single large memory application in the > world? As I wrote before, it is *not* about updatedb, but about all > applications that use a lot of memory, and then terminate. it is about multiple problems; _one_ problem is updatedb. The _second_ problem is large memory applications. note that updatedb is not a "large memory application". It simply scans through the filesystem and has a pretty minimal memory footprint. the _kernel_ ends up blowing up the dentry cache to a rather large size (because it has no idea that updatedb uses every dentry only once). Once we give the kernel the knowledge that the dentry won't be used again by this app, the kernel can make a much more intelligent decision and not balloon the dentry cache. ( we _do_ want to balloon the dentry cache otherwise - for things like "find" - having a fast VFS is important. But known-use-once things like the daily updatedb job can clearly be annotated properly. ) the 'large memory apps' are a second category of problems. And those are where swap-prefetch could indeed help. (as long as it only 'fills up' the free memory that a large-memory-exit left behind it.) the 'morning after' phenomenon that the majority of testers complained about will likely be resolved by the updatedb change. The second category is likely an improvement too, for swap-happy desktop (and server) workloads. 
Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-26 9:20 ` Ingo Molnar @ 2007-07-26 9:34 ` Andrew Morton 2007-07-26 9:40 ` RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Andrew Morton @ 2007-07-26 9:34 UTC (permalink / raw) To: Ingo Molnar Cc: Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Thu, 26 Jul 2007 11:20:25 +0200 Ingo Molnar <mingo@elte.hu> wrote: > Once we give the kernel the knowledge that the dentry wont be used again > by this app, the kernel can do a lot more intelligent decision and not > baloon the dentry cache. > > ( we _do_ want to baloon the dentry cache otherwise - for things like > "find" - having a fast VFS is important. But known-use-once things > like the daily updatedb job can clearly be annotated properly. ) Mutter. /proc/sys/vm/vfs_cache_pressure has been there for what, three years? Are any distros raising it during the updatedb run yet? ^ permalink raw reply [flat|nested] 535+ messages in thread
* RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-26 9:34 ` Andrew Morton @ 2007-07-26 9:40 ` Ingo Molnar 2007-07-26 10:09 ` Andrew Morton 2007-07-26 10:20 ` Al Viro 0 siblings, 2 replies; 535+ messages in thread From: Ingo Molnar @ 2007-07-26 9:40 UTC (permalink / raw) To: Andrew Morton Cc: Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel * Andrew Morton <akpm@linux-foundation.org> wrote: > On Thu, 26 Jul 2007 11:20:25 +0200 Ingo Molnar <mingo@elte.hu> wrote: > > > Once we give the kernel the knowledge that the dentry wont be used again > > by this app, the kernel can do a lot more intelligent decision and not > > baloon the dentry cache. > > > > ( we _do_ want to baloon the dentry cache otherwise - for things like > > "find" - having a fast VFS is important. But known-use-once things > > like the daily updatedb job can clearly be annotated properly. ) > > Mutter. /proc/sys/vm/vfs_cache_pressure has been there for what, > three years? Are any distros raising it during the updatedb run yet? but ... that's system-wide, and the 'don't balloon the dcache' is only a property of updatedb. Still, it's useful to debug this thing. below is an updatedb hack that sets vfs_cache_pressure down to 0 during an updatedb run. Could someone who is affected by the 'morning after' problem give it a try? If this works then we can think about any other measures ... Ingo --- /etc/cron.daily/mlocate.cron.orig +++ /etc/cron.daily/mlocate.cron @@ -1,4 +1,7 @@ #!/bin/sh nodevs=$(< /proc/filesystems awk '$1 == "nodev" { print $2 }') renice +19 -p $$ >/dev/null 2>&1 +PREV=`cat /proc/sys/vm/vfs_cache_pressure 2>/dev/null` +echo 0 > /proc/sys/vm/vfs_cache_pressure 2>/dev/null /usr/bin/updatedb -f "$nodevs" +[ "$PREV" != "" ] && echo $PREV > /proc/sys/vm/vfs_cache_pressure 2>/dev/null ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-26 9:40 ` RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] Ingo Molnar @ 2007-07-26 10:09 ` Andrew Morton 2007-07-26 10:24 ` Ingo Molnar ` (2 more replies) 2007-07-26 10:20 ` Al Viro 1 sibling, 3 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-26 10:09 UTC (permalink / raw) To: Ingo Molnar Cc: Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Thu, 26 Jul 2007 11:40:24 +0200 Ingo Molnar <mingo@elte.hu> wrote: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > On Thu, 26 Jul 2007 11:20:25 +0200 Ingo Molnar <mingo@elte.hu> wrote: > > > > > Once we give the kernel the knowledge that the dentry wont be used again > > > by this app, the kernel can do a lot more intelligent decision and not > > > baloon the dentry cache. > > > > > > ( we _do_ want to baloon the dentry cache otherwise - for things like > > > "find" - having a fast VFS is important. But known-use-once things > > > like the daily updatedb job can clearly be annotated properly. ) > > > > Mutter. /proc/sys/vm/vfs_cache_pressure has been there for what, > > three years? Are any distros raising it during the updatedb run yet? > > but ... that's system-wide, and the 'dont baloon the dcache' is only a > property of updatedb. Sure, but it's practical, isn't it? Who runs (and cares about) vfs-intensive workloads during their wee-small-hours updatedb run? (OK, I do, but I kill the damn thing if it goes off) > Still, it's useful to debug this thing. > > below is an updatedb hack that sets vfs_cache_pressure down to 0 during > an updatedb run. Could someone who is affected by the 'morning after' > problem give it a try? If this works then we can think about any other > measures ... 
> > Ingo > > --- /etc/cron.daily/mlocate.cron.orig > +++ /etc/cron.daily/mlocate.cron > @@ -1,4 +1,7 @@ > #!/bin/sh > nodevs=$(< /proc/filesystems awk '$1 == "nodev" { print $2 }') > renice +19 -p $$ >/dev/null 2>&1 > +PREV=`cat /proc/sys/vm/vfs_cache_pressure 2>/dev/null` > +echo 0 > /proc/sys/vm/vfs_cache_pressure 2>/dev/null > /usr/bin/updatedb -f "$nodevs" > +[ "$PREV" != "" ] && echo $PREV > /proc/sys/vm/vfs_cache_pressure 2>/dev/null Setting it to zero will maximise the preservation of the vfs caches. You wanted 10000 there. <bets that nobody will test this> ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-26 10:09 ` Andrew Morton @ 2007-07-26 10:24 ` Ingo Molnar 2007-07-27 0:33 ` [ck] " Matthew Hawkins 2007-07-26 10:27 ` Ingo Molnar 2007-07-26 12:46 ` Mike Galbraith 2 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-26 10:24 UTC (permalink / raw) To: Andrew Morton Cc: Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel * Andrew Morton <akpm@linux-foundation.org> wrote: > Setting it to zero will maximise the preservation of the vfs caches. > You wanted 10000 there. ok, updated patch below :-) > <bets that nobody will test this> wrong, it's active on three of my boxes already :) But then again, i never had these hangover problems. (not really expected with gigs of RAM anyway) Ingo --- /etc/cron.daily/mlocate.cron.orig +++ /etc/cron.daily/mlocate.cron @@ -1,4 +1,7 @@ #!/bin/sh nodevs=$(< /proc/filesystems awk '$1 == "nodev" { print $2 }') renice +19 -p $$ >/dev/null 2>&1 +PREV=`cat /proc/sys/vm/vfs_cache_pressure 2>/dev/null` +echo 10000 > /proc/sys/vm/vfs_cache_pressure 2>/dev/null /usr/bin/updatedb -f "$nodevs" +[ "$PREV" != "" ] && echo $PREV > /proc/sys/vm/vfs_cache_pressure 2>/dev/null ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-26 10:24 ` Ingo Molnar @ 2007-07-27 0:33 ` Matthew Hawkins 2007-07-30 9:33 ` Helge Hafting 0 siblings, 1 reply; 535+ messages in thread From: Matthew Hawkins @ 2007-07-27 0:33 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Nick Piggin, Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andi Kleen, Frank Kingswood On 7/26/07, Ingo Molnar <mingo@elte.hu> wrote: > wrong, it's active on three of my boxes already :) But then again, i > never had these hangover problems. (not really expected with gigs of RAM > anyway) [...] > --- /etc/cron.daily/mlocate.cron.orig [...] mlocate by design doesn't thrash the cache as much. People using slocate (distros other than Redhat ;) are going to be hit worse. See http://carolina.mff.cuni.cz/~trmac/blog/mlocate/ updatedb by itself doesn't really bug me, it's just that on occasion it's still running at 7am, which then doesn't assist my single spindle come swapin of the other apps! I'm considering getting one of the old IDE drives out in the garage and shifting swap onto it. The swap prefetch patch has mainly assisted me in the "state A -> B -> A" scenario. A lot. -- Matt ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 0:33 ` [ck] " Matthew Hawkins @ 2007-07-30 9:33 ` Helge Hafting 0 siblings, 0 replies; 535+ messages in thread From: Helge Hafting @ 2007-07-30 9:33 UTC (permalink / raw) To: Matthew Hawkins Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andi Kleen, Frank Kingswood Matthew Hawkins wrote: > updatedb by itself doesn't really bug me, its just that on occasion > its still running at 7am You should start it earlier then - assuming it doesn't already start at the earliest opportunity? Helge Hafting ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-26 10:09 ` Andrew Morton 2007-07-26 10:24 ` Ingo Molnar @ 2007-07-26 10:27 ` Ingo Molnar 2007-07-26 10:38 ` Andrew Morton 2007-07-26 12:46 ` Mike Galbraith 2 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-26 10:27 UTC (permalink / raw) To: Andrew Morton Cc: Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel * Andrew Morton <akpm@linux-foundation.org> wrote: > > > > ( we _do_ want to baloon the dentry cache otherwise - for things like > > > > "find" - having a fast VFS is important. But known-use-once things > > > > like the daily updatedb job can clearly be annotated properly. ) > > > > > > Mutter. /proc/sys/vm/vfs_cache_pressure has been there for what, > > > three years? Are any distros raising it during the updatedb run yet? > > > > but ... that's system-wide, and the 'dont baloon the dcache' is only a > > property of updatedb. > > Sure, but it's practical, isn't it? Who runs (and cares about) > vfs-intensive workloads during their wee-small-hours updatedb run? there's another side-effect: it likely results in the zapping of thousands of dentries that were cached nicely before. So we might exchange the 'all my apps are swapped out' experience with 'all file access is slow'. The latter is _probably_ still an improvement over the ballooning, but i'm not sure. What we _really_ want is an updatedb that does not disturb the dcache. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-26 10:27 ` Ingo Molnar @ 2007-07-26 10:38 ` Andrew Morton 0 siblings, 0 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-26 10:38 UTC (permalink / raw) To: Ingo Molnar Cc: Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Thu, 26 Jul 2007 12:27:30 +0200 Ingo Molnar <mingo@elte.hu> wrote: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > > ( we _do_ want to baloon the dentry cache otherwise - for things like > > > > > "find" - having a fast VFS is important. But known-use-once things > > > > > like the daily updatedb job can clearly be annotated properly. ) > > > > > > > > Mutter. /proc/sys/vm/vfs_cache_pressure has been there for what, > > > > three years? Are any distros raising it during the updatedb run yet? > > > > > > but ... that's system-wide, and the 'dont baloon the dcache' is only a > > > property of updatedb. > > > > Sure, but it's practical, isn't it? Who runs (and cares about) > > vfs-intensive workloads during their wee-small-hours updatedb run? > > there's another side-effect: it likely results in the zapping of > thousands of dentries that were cached nicely before. So we might > exchange 'all my apps are swapped out' experience with 'all file access > is slow'. The latter is _probably_ still an improvement over the > balooning, but i'm not sure. Yup. Nobody has begun to think about preserving dcache/icache across load shifts yet, afaik. Hard. > What we _really_ want is an updatedb that > does not disturb the dcache. Well. Hopefully this time next year you can prep a 16MB container and toss your updatedb inside that. Maybe set its peak disk bandwidth utilisation too. However that won't work ;) because I don't think anyone is looking at containerisation of vfs cache memory yet. Perhaps full-on openvz has it, dunno. 
But updatedb is a special case, because it is so vfs-intensive. For lots of other workloads (those which use heaps of pagecache), resource management via containerisation will work nicely. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-26 10:09 ` Andrew Morton 2007-07-26 10:24 ` Ingo Molnar 2007-07-26 10:27 ` Ingo Molnar @ 2007-07-26 12:46 ` Mike Galbraith 2007-07-26 18:05 ` Andrew Morton 2 siblings, 1 reply; 535+ messages in thread From: Mike Galbraith @ 2007-07-26 12:46 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Thu, 2007-07-26 at 03:09 -0700, Andrew Morton wrote: > Setting it to zero will maximise the preservation of the vfs caches. You > wanted 10000 there. > > <bets that nobody will test this> drops caches prior to both updatedb runs. root@Homer: df -i Filesystem Inodes IUsed IFree IUse% Mounted on /dev/hdc3 12500992 1043544 11457448 9% / udev 129162 1567 127595 2% /dev /dev/hdc1 26104 87 26017 1% /boot /dev/hda1 108144 90676 17468 84% /windows/C /dev/hda5 11136 3389 7747 31% /windows/D /dev/hda6 0 0 0 - /windows/E vfs_cache_pressure=10000, updatedb freshly completed: procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 0 48 76348 420356 104748 0 0 0 0 1137 912 3 1 97 0 ext3_inode_cache 315153 316274 524 7 1 : tunables 54 27 8 : slabdata 45182 45182 0 dentry_cache 224829 281358 136 29 1 : tunables 120 60 8 : slabdata 9702 9702 0 buffer_head 156624 159728 56 67 1 : tunables 120 60 8 : slabdata 2384 2384 0 vfs_cache_pressure=100 (stock), updatedb freshly completed: procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 0 148 83824 270088 116340 0 0 0 0 1095 330 2 1 97 0 ext3_inode_cache 467257 502495 524 7 1 : tunables 54 27 8 : slabdata 71785 71785 0 dentry_cache 292695 408958 136 29 1 : tunables 120 60 8 : slabdata 14102 14102 0 buffer_head 118329 184384 56 67 1 : tunables 120 60 8 : slabdata 2752 2752 1 Note: updatedb 
doesn't bother my box, not running enough leaky apps I guess. -Mike ^ permalink raw reply [flat|nested] 535+ messages in thread
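The slabinfo lines Mike pasted can be reduced to approximate memory totals: each cache occupies num_slabs × pagesperslab pages. A rough sketch for /proc/slabinfo-style lines of that 2.6-era layout (name, active_objs, num_objs, objsize, objperslab, pagesperslab, then tunables, then slabdata with active_slabs/num_slabs); parse_slab() is a hypothetical helper, and the 4 KB page size is an assumption matching the x86 box above:

```python
PAGE_SIZE = 4096  # assumed; real code would use os.sysconf("SC_PAGE_SIZE")

def parse_slab(line):
    """Return (cache_name, approx_bytes) for one /proc/slabinfo line."""
    head, _, tail = line.partition(" : tunables")
    name, *nums = head.split()
    # nums = [active_objs, num_objs, objsize, objperslab, pagesperslab]
    pagesperslab = int(nums[4])
    # after "slabdata": active_slabs num_slabs sharedavail
    num_slabs = int(tail.split("slabdata")[1].split()[1])
    return name, num_slabs * pagesperslab * PAGE_SIZE

# dentry_cache line from Mike's vfs_cache_pressure=10000 run above:
sample = ("dentry_cache 224829 281358 136 29 1 : tunables 120 60 8 "
          ": slabdata 9702 9702 0")
name, nbytes = parse_slab(sample)
print(f"{name}: {nbytes / 2**20:.1f} MB")
```

By this estimate the dentry cache in Mike's 10000 run still held roughly 38 MB, against about 55 MB at the stock setting (14102 slabs), which matches the pattern Andrew points out next: higher pressure trims slab but does not by itself protect pagecache.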
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-26 12:46 ` Mike Galbraith @ 2007-07-26 18:05 ` Andrew Morton 2007-07-27 5:12 ` Mike Galbraith 0 siblings, 1 reply; 535+ messages in thread From: Andrew Morton @ 2007-07-26 18:05 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Thu, 26 Jul 2007 14:46:58 +0200 Mike Galbraith <efault@gmx.de> wrote: > On Thu, 2007-07-26 at 03:09 -0700, Andrew Morton wrote: > > > Setting it to zero will maximise the preservation of the vfs caches. You > > wanted 10000 there. > > > > <bets that nobody will test this> > > drops caches prior to both updatedb runs. I think that was the wrong thing to do. That will leave gobs of free memory for updatedb to populate with dentries and inodes. Instead, fill all of memory up with pagecache, then do the updatedb. See how much pagecache is left behind and see how large the vfs caches end up. 
> root@Homer: df -i > Filesystem Inodes IUsed IFree IUse% Mounted on > /dev/hdc3 12500992 1043544 11457448 9% / > udev 129162 1567 127595 2% /dev > /dev/hdc1 26104 87 26017 1% /boot > /dev/hda1 108144 90676 17468 84% /windows/C > /dev/hda5 11136 3389 7747 31% /windows/D > /dev/hda6 0 0 0 - /windows/E > > vfs_cache_pressure=10000, updatedb freshly completed: > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 0 48 76348 420356 104748 0 0 0 0 1137 912 3 1 97 0 > > ext3_inode_cache 315153 316274 524 7 1 : tunables 54 27 8 : slabdata 45182 45182 0 > dentry_cache 224829 281358 136 29 1 : tunables 120 60 8 : slabdata 9702 9702 0 > buffer_head 156624 159728 56 67 1 : tunables 120 60 8 : slabdata 2384 2384 0 > > vfs_cache_pressure=100 (stock), updatedb freshly completed: > > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 0 148 83824 270088 116340 0 0 0 0 1095 330 2 1 97 0 > > ext3_inode_cache 467257 502495 524 7 1 : tunables 54 27 8 : slabdata 71785 71785 0 > dentry_cache 292695 408958 136 29 1 : tunables 120 60 8 : slabdata 14102 14102 0 > buffer_head 118329 184384 56 67 1 : tunables 120 60 8 : slabdata 2752 2752 1 > > Note: updatedb doesn't bother my box, not running enough leaky apps I > guess. > So you ended up with a couple hundred MB of pagecache preserved. Capturing before-and-after /proc/meminfo would be nice - it's a useful summary. ^ permalink raw reply [flat|nested] 535+ messages in thread
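The before-and-after /proc/meminfo comparison Andrew asks for is easy to script. A sketch with hypothetical helper names (values are in kB, as /proc/meminfo reports them; the sample numbers are adapted from Mike's vmstat output above, not from a real snapshot pair):

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:  value kB' lines into a dict of kB."""
    out = {}
    for line in text.strip().splitlines():
        key, _, rest = line.partition(":")
        out[key.strip()] = int(rest.split()[0])  # value in kB
    return out

def diff_meminfo(before, after):
    """Return {key: delta_kB} for every field that changed."""
    a, b = parse_meminfo(before), parse_meminfo(after)
    return {k: b[k] - a[k] for k in a if k in b and b[k] != a[k]}

before = "MemFree: 76348 kB\nBuffers: 420356 kB\nCached: 104748 kB"
after  = "MemFree: 83824 kB\nBuffers: 270088 kB\nCached: 116340 kB"
for key, delta in diff_meminfo(before, after).items():
    print(f"{key}: {delta:+d} kB")
```

In a real run one would read /proc/meminfo once before the updatedb run and once after, and diff the two captures the same way.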
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-26 18:05 ` Andrew Morton @ 2007-07-27 5:12 ` Mike Galbraith 2007-07-27 7:23 ` Mike Galbraith 0 siblings, 1 reply; 535+ messages in thread From: Mike Galbraith @ 2007-07-27 5:12 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Thu, 2007-07-26 at 11:05 -0700, Andrew Morton wrote: > On Thu, 26 Jul 2007 14:46:58 +0200 Mike Galbraith <efault@gmx.de> wrote: > > > On Thu, 2007-07-26 at 03:09 -0700, Andrew Morton wrote: > > > > > Setting it to zero will maximise the preservation of the vfs caches. You > > > wanted 10000 there. > > > > > > <bets that nobody will test this> > > > > drops caches prior to both updatedb runs. > > I think that was the wrong thing to do. That will leave gobs of free > memory for updatedb to populate with dentries and inodes. > > Instead, fill all of memory up with pagecache, then do the updatedb. See > how much pagecache is left behind and see how large the vfs caches end up. Yeah. Before these two runs just to see what difference there was in caches with those two settings, I tried running with a heavier than normal (for me) desktop application mix, to see if it would start swapping, but it didn't. Seems that 1GB ram is enough space for everything I do, and everything updatedb does as well. You need a larger working set to feel the pain I guess. -Mike ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 5:12 ` Mike Galbraith @ 2007-07-27 7:23 ` Mike Galbraith 2007-07-27 8:47 ` Andrew Morton 0 siblings, 1 reply; 535+ messages in thread From: Mike Galbraith @ 2007-07-27 7:23 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Fri, 2007-07-27 at 07:13 +0200, Mike Galbraith wrote: > On Thu, 2007-07-26 at 11:05 -0700, Andrew Morton wrote: > > > drops caches prior to both updatedb runs. > > > > I think that was the wrong thing to do. That will leave gobs of free > > memory for updatedb to populate with dentries and inodes. > > > > Instead, fill all of memory up with pagecache, then do the updatedb. See > > how much pagecache is left behind and see how large the vfs caches end up. I didn't _fill_ memory, but loaded it up a bit with some real workload data... I tried time sh -c 'git diff v2.6.11 HEAD > /dev/null' to populate the cache, and tried different values for vfs_cache_pressure. Nothing prevented git's data from being trashed by updatedb. Turning the knob downward rapidly became very unpleasant due to swap, (with 0 not surprisingly being a true horror) but turning it up didn't help git one bit. The amount of data that had to be re-read with stock 100 or 10000 was the same, or at least so close that you couldn't see a difference in vmstat and wall-clock. Cache sizes varied, but the bottom line didn't. (wasn't surprised, seems quite reasonable that git's data looks old and useless to the reclaim logic when updatedb runs in between git runs) -Mike ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 7:23 ` Mike Galbraith @ 2007-07-27 8:47 ` Andrew Morton 2007-07-27 8:54 ` Al Viro ` (3 more replies) 0 siblings, 4 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-27 8:47 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Fri, 27 Jul 2007 09:23:41 +0200 Mike Galbraith <efault@gmx.de> wrote: > On Fri, 2007-07-27 at 07:13 +0200, Mike Galbraith wrote: > > On Thu, 2007-07-26 at 11:05 -0700, Andrew Morton wrote: > > > > drops caches prior to both updatedb runs. > > > > > > I think that was the wrong thing to do. That will leave gobs of free > > > memory for updatedb to populate with dentries and inodes. > > > > > > Instead, fill all of memory up with pagecache, then do the updatedb. See > > > how much pagecache is left behind and see how large the vfs caches end up. > > I didn't _fill_ memory, but loaded it up a bit with some real workload > data... > > I tried time sh -c 'git diff v2.6.11 HEAD > /dev/null' to populate the > cache, and tried different values for vfs_cache_pressure. Nothing > prevented git's data from being trashed by updatedb. Turning the knob > downward rapidly became very unpleasant due to swap, (with 0 not > surprisingly being a true horror) but turning it up didn't help git one > bit. The amount of data that had to be re-read with stock 100 or 10000 > was the same, or at least so close that you couldn't see a difference in > vmstat and wall-clock. Cache sizes varied, but the bottom line didn't. > (wasn't surprised, seems quite reasonable that git's data looks old and > useless to the reclaim logic when updatedb runs in between git runs) > Did a bit of playing with this with 128MB of memory. 
- drop caches - read a 1MB file - run slocate.cron With vfs_cache_pressure=100: MemTotal: 116316 kB MemFree: 3196 kB Buffers: 54408 kB Cached: 5128 kB SwapCached: 0 kB Active: 41728 kB Inactive: 27540 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 116316 kB LowFree: 3196 kB SwapTotal: 1020116 kB SwapFree: 1019496 kB Dirty: 0 kB Writeback: 0 kB AnonPages: 9760 kB Mapped: 3808 kB Slab: 40468 kB SReclaimable: 34824 kB SUnreclaim: 5644 kB PageTables: 720 kB NFS_Unstable: 0 kB Bounce: 0 kB CommitLimit: 1078272 kB Committed_AS: 25988 kB VmallocTotal: 901112 kB VmallocUsed: 656 kB VmallocChunk: 900412 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 Hugepagesize: 4096 kB WIth vfs_cache_pressure=10000: MemTotal: 116316 kB MemFree: 3060 kB Buffers: 80792 kB Cached: 5052 kB SwapCached: 0 kB Active: 59432 kB Inactive: 36140 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 116316 kB LowFree: 3060 kB SwapTotal: 1020116 kB SwapFree: 1019512 kB Dirty: 0 kB Writeback: 0 kB AnonPages: 9756 kB Mapped: 3832 kB Slab: 14304 kB SReclaimable: 7992 kB SUnreclaim: 6312 kB PageTables: 732 kB NFS_Unstable: 0 kB Bounce: 0 kB CommitLimit: 1078272 kB Committed_AS: 26000 kB VmallocTotal: 901112 kB VmallocUsed: 656 kB VmallocChunk: 900412 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 Hugepagesize: 4096 kB so we reaped quite a lot more slab with the higher vfs_cache_pressure. What I think is killing us here is the blockdev pagecache: the pagecache which backs those directory entries and inodes. These pages get read multiple times because they hold multiple directory entries and multiple inodes. These multiple touches will put those pages onto the active list so they stick around for a long time and everything else gets evicted. I've never been very sure about this policy for the metadata pagecache. We read the filesystem objects into the dcache and icache and then we won't read from that page again for a long time (I expect). But the page will still hang around for a long time. 
It could be that we should leave those pages inactive.

<tries it>

diff -puN include/linux/buffer_head.h~a include/linux/buffer_head.h
--- a/include/linux/buffer_head.h~a
+++ a/include/linux/buffer_head.h
@@ -130,7 +130,7 @@ BUFFER_FNS(Eopnotsupp, eopnotsupp)
 BUFFER_FNS(Unwritten, unwritten)
 
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
-#define touch_buffer(bh)	mark_page_accessed(bh->b_page)
+#define touch_buffer(bh)	do { } while(0)
 
 /* If we *know* page->private refers to buffer_heads */
 #define page_buffers(page)					\
_

vfs_cache_pressure=100:

vmm:/home/akpm# cat /proc/meminfo
MemTotal:       116524 kB
MemFree:          2692 kB
Buffers:         51044 kB
Cached:           5440 kB
SwapCached:          0 kB
Active:          19248 kB
Inactive:        46996 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       116524 kB
LowFree:          2692 kB
SwapTotal:     1020116 kB
SwapFree:      1019492 kB
Dirty:            2008 kB
Writeback:           0 kB
AnonPages:        9772 kB
Mapped:           3812 kB
Slab:            44336 kB
SReclaimable:    38792 kB
SUnreclaim:       5544 kB
PageTables:        720 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   1078376 kB
Committed_AS:    26108 kB
VmallocTotal:   901112 kB
VmallocUsed:       648 kB
VmallocChunk:   900412 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     4096 kB

vfs_cache_pressure=10000:

vmm:/home/akpm# cat /proc/meminfo
MemTotal:       116524 kB
MemFree:          3720 kB
Buffers:         79832 kB
Cached:           6260 kB
SwapCached:          0 kB
Active:          18276 kB
Inactive:        77584 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       116524 kB
LowFree:          3720 kB
SwapTotal:     1020116 kB
SwapFree:      1019500 kB
Dirty:            2228 kB
Writeback:           0 kB
AnonPages:        9788 kB
Mapped:           3828 kB
Slab:            13680 kB
SReclaimable:     7676 kB
SUnreclaim:       6004 kB
PageTables:        736 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   1078376 kB
Committed_AS:    26112 kB
VmallocTotal:   901112 kB
VmallocUsed:       648 kB
VmallocChunk:   900412 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     4096 kB

So again, slab was trimmed a lot more, but all our pagecache still got evicted.
ah, but I started that pagecache out on the inactive list. Try again. This time, instead of reading a 1GB file once, let's read an 80MB file four times. <no difference> OK, I saw what happened then. The inode for my 80MB file got reclaimed from icache and that instantly reclaimed all 80MB of pagecache. A single large file probably isn't a good testcase, but the same will happen with multiple files. Higher vfs_cache_pressure will worsen this effect. But it won't happen with mapped files because their inodes aren't reclaimable. More sophisticated testing is needed - there's something in ext3-tools which will mmap, page in and hold a file for you. Anyway, blockdev pagecache is a problem, I expect. It's worth playing with that patch. Another problem is atime updates. You really do want to mount noatime. Because with atimes enabled, each touch of a file will touch its inode and will keep its backing blockdev pagecache page in core. ^ permalink raw reply [flat|nested] 535+ messages in thread
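The noatime advice above is a one-line mount-option change. A hedged example fstab entry (the device, mount point and options are placeholders, not taken from the thread; relatime was not yet the kernel default in this era):

```
# /etc/fstab -- noatime stops every read from dirtying the file's inode,
# which otherwise keeps the inode's backing blockdev pagecache page in core
/dev/hda1   /   ext3   defaults,noatime   0   1
```

The same can be applied to a live mount with `mount -o remount,noatime /`.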
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 8:47 ` Andrew Morton @ 2007-07-27 8:54 ` Al Viro 2007-07-27 9:02 ` Andrew Morton 2007-07-27 9:40 ` Mike Galbraith ` (2 subsequent siblings) 3 siblings, 1 reply; 535+ messages in thread From: Al Viro @ 2007-07-27 8:54 UTC (permalink / raw) To: Andrew Morton Cc: Mike Galbraith, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Fri, Jul 27, 2007 at 01:47:49AM -0700, Andrew Morton wrote: > What I think is killing us here is the blockdev pagecache: the pagecache > which backs those directory entries and inodes. These pages get read > multiple times because they hold multiple directory entries and multiple > inodes. These multiple touches will put those pages onto the active list > so they stick around for a long time and everything else gets evicted. I wonder what happens if you try that on ext2. There we'd get directory contents in per-directory page cache, so the picture might change... ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 8:54 ` Al Viro @ 2007-07-27 9:02 ` Andrew Morton 0 siblings, 0 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-27 9:02 UTC (permalink / raw) To: Al Viro Cc: Mike Galbraith, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Fri, 27 Jul 2007 09:54:41 +0100 Al Viro <viro@ftp.linux.org.uk> wrote: > On Fri, Jul 27, 2007 at 01:47:49AM -0700, Andrew Morton wrote: > > What I think is killing us here is the blockdev pagecache: the pagecache > > which backs those directory entries and inodes. These pages get read > > multiple times because they hold multiple directory entries and multiple > > inodes. These multiple touches will put those pages onto the active list > > so they stick around for a long time and everything else gets evicted. > > I wonder what happens if you try that on ext2. There we'd get directory > contents in per-directory page cache, so the picture might change... afaict ext2 just forgets to run mark_page_accessed for directory pages altogether, so it'll be equivalent to ext3 with that one-liner, I expect. The directory pagecache on ext2 might get reclaimed faster because those pages are eligible for reclaiming via the reclaim of their inodes, whereas ext3's directories are in blockdev pagecache, for which the reclaim-via-inode mechanism cannot happen. I should do some testing with mmapped files. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 8:47 ` Andrew Morton 2007-07-27 8:54 ` Al Viro @ 2007-07-27 9:40 ` Mike Galbraith 2007-07-27 10:00 ` Andrew Morton 2007-07-29 1:33 ` Rik van Riel 3 siblings, 0 replies; 535+ messages in thread From: Mike Galbraith @ 2007-07-27 9:40 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Fri, 2007-07-27 at 01:47 -0700, Andrew Morton wrote: > Anyway, blockdev pagecache is a problem, I expect. It's worth playing with > that patch. (may tinker a bit, but i'm way rusty. ain't had the urge to mutilate anything down there in quite a while... works just fine for me these days) > Another problem is atime updates. You really do want to mount noatime. > Because with atimes enabled, each touch of a file will touch its inode and > will keep its backing blockdev pagecache page in core. Yeah, I mount noatime,nodiratime,data=writeback. ext3's journal with my crusty old disk/fs is painful as heck. -Mike ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 8:47 ` Andrew Morton 2007-07-27 8:54 ` Al Viro 2007-07-27 9:40 ` Mike Galbraith @ 2007-07-27 10:00 ` Andrew Morton 2007-07-27 10:25 ` Mike Galbraith 2007-07-29 1:33 ` Rik van Riel 3 siblings, 1 reply; 535+ messages in thread From: Andrew Morton @ 2007-07-27 10:00 UTC (permalink / raw) To: Mike Galbraith, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Fri, 27 Jul 2007 01:47:49 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > More sophisticated testing is needed - there's something in > ext3-tools which will mmap, page in and hold a file for you. So much for that theory. afaict mmapped, active pagecache is immune to updatedb activity. It just sits there while updatedb continues munching away at the slab and blockdev pagecache which it instantiated. I assume we're never getting the VM into enough trouble to tip it over the start-reclaiming-mapped-pages threshold (ie: /proc/sys/vm/swappiness). Start the updatedb on this 128MB machine with 80MB of mapped pagecache, it falls to 55MB fairly soon and then never changes. So hrm. Are we sure that updatedb is the problem? There are quite a few heavyweight things which happen in the wee small hours. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 10:00 ` Andrew Morton @ 2007-07-27 10:25 ` Mike Galbraith 2007-07-27 17:45 ` Daniel Hazelton 0 siblings, 1 reply; 535+ messages in thread From: Mike Galbraith @ 2007-07-27 10:25 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Fri, 2007-07-27 at 03:00 -0700, Andrew Morton wrote: > On Fri, 27 Jul 2007 01:47:49 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > > > More sophisticated testing is needed - there's something in > > ext3-tools which will mmap, page in and hold a file for you. > > So much for that theory. afaict mmapped, active pagecache is immune to > updatedb activity. It just sits there while updatedb continues munching > away at the slab and blockdev pagecache which it instantiated. I assume > we're never getting the VM into enough trouble to tip it over the > start-reclaiming-mapped-pages threshold (ie: /proc/sys/vm/swappiness). > > Start the updatedb on this 128MB machine with 80MB of mapped pagecache, it > falls to 55MB fairly soon and then never changes. > > So hrm. Are we sure that updatedb is the problem? There are quite a few > heavyweight things which happen in the wee small hours. The balance in _my_ world seems just fine. I don't let any of those system maintenance things run while I'm using the system, and it doesn't bother me if my working set has to be reconstructed after heavy-weight maintenance things are allowed to run. I'm not seeing anything I wouldn't expect to see when running a job the size of updatedb. -Mike ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 10:25 ` Mike Galbraith @ 2007-07-27 17:45 ` Daniel Hazelton 2007-07-27 18:16 ` Rene Herman 2007-07-27 22:08 ` Mike Galbraith 0 siblings, 2 replies; 535+ messages in thread From: Daniel Hazelton @ 2007-07-27 17:45 UTC (permalink / raw) To: Mike Galbraith Cc: Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Friday 27 July 2007 06:25:18 Mike Galbraith wrote: > On Fri, 2007-07-27 at 03:00 -0700, Andrew Morton wrote: > > On Fri, 27 Jul 2007 01:47:49 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > > > More sophisticated testing is needed - there's something in > > > ext3-tools which will mmap, page in and hold a file for you. > > > > So much for that theory. afaict mmapped, active pagecache is immune to > > updatedb activity. It just sits there while updatedb continues munching > > away at the slab and blockdev pagecache which it instantiated. I assume > > we're never getting the VM into enough trouble to tip it over the > > start-reclaiming-mapped-pages threshold (ie: /proc/sys/vm/swappiness). > > > > Start the updatedb on this 128MB machine with 80MB of mapped pagecache, > > it falls to 55MB fairly soon and then never changes. > > > > So hrm. Are we sure that updatedb is the problem? There are quite a few > > heavyweight things which happen in the wee small hours. > > The balance in _my_ world seems just fine. I don't let any of those > system maintenance things run while I'm using the system, and it doesn't > bother me if my working set has to be reconstructed after heavy-weight > maintenance things are allowed to run. I'm not seeing anything I > wouldn't expect to see when running a job the size of updatedb. > > -Mike Do you realize you've totally missed the point? It isn't about what is fine in the Kernel Developers world, but what is fine in the *USERS* world. 
There are dozens of big businesses pushing Linux for Enterprise performance. Rather than discussing the merit of those patches - some of which just improve the performance of a specific application by 1 or 2 percent - they get a nod and go into the kernel. But when a group of users that don't represent one of those businesses says "Hey, this helps with problems I see on my system" there is a big discussion and ultimately those patches get rejected. Why? Because they'll give an example using a program that they see causing part of the problem and be told "Use program X - it does things differently and shouldn't cause the problem" or "But what causes the problem to happen? The patch treats a symptom of a larger problem". The fucked up part of that is that the (mass of) kernel developers will see a similar report saying "MySQL has a performance problem because of X, this fixes it" and not blink twice - even if it is "treating the symptom and not the cause". It's this attitude more than anything that caused Con to "retire" - at least that is the impression I got from the interviews he's given. (The exact impression was "I'm sick of the kernel developers doing everything they can to help enterprise users and ignoring the home users") So...

The problem: Updatedb or another process that uses the FS heavily runs on a user's 256MB P3-800 (when it is idle) and the VFS caches grow, causing memory pressure that causes other applications to be swapped to disk. In the morning the user has to wait for the system to swap those applications back in.

Questions about it:

Q) Does swap-prefetch help with this?
A) [From all reports I've seen (*)] Yes, it does.

Q) Why does it help?
A) Because it pro-actively swaps stuff back in when the memory pressure that caused it to be swapped out is gone.

Q) What causes the problem?
A) The VFS layer not keeping a limited cache. Instead the VFS will chew through available memory in the name of "increasing performance".
Solution(s) to the problem:

1) Limit the amount of memory the pagecache and other VFS caches can consume
2) Implement swap prefetch

If I had a (more) complete understanding of how the VFS cache(s) work I'd try to code a patch to do #1 myself. Patches to do #2 already exist and have been shown to work for the users that have tried it. My question is thus, simply: What is the reason that it is argued against?(**)

DRH

PS: Yes, I realize I've repeated some people from earlier in this thread, but it seems that people have been forgetting the point.

(*) I've followed this thread and all of its splinters. The reports that are in them, where the person making the report has a system that has the limited memory needed for the problem to exhibit itself, all show that swap-prefetch helps.

(**) No, I won't settle for "It's treating a symptom". The fact is that this isn't a *SYMPTOM* of anything. It treats the cause of the lag the users that have less than (for the sake of argument) 1G of memory are seeing. And no, changing userspace isn't a solution - updatedb may be the program that has been used as an example, but there are others. The proper fix is to change the kernel to either make the situation impossible (limit the VFS and other kernel caches) or make the situation as painless as possible (implement swap prefetch to alleviate the lag of swapping data back in). -- Dialup is like pissing through a pipette. Slow and excruciatingly painful. ^ permalink raw reply [flat|nested] 535+ messages in thread
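For solution 1, mainline already has a coarse knob: vm.vfs_cache_pressure biases dcache/icache reclaim rather than hard-capping it, which is exactly the distinction the measurements earlier in the thread demonstrate. A hedged sysctl fragment (the values are purely illustrative, not recommendations from the thread):

```
# /etc/sysctl.conf
vm.vfs_cache_pressure = 10000   # reclaim dentries/inodes much more eagerly
vm.swappiness = 60              # default; lower values protect mapped pages longer
```

The same knob can be poked at runtime via `echo 10000 > /proc/sys/vm/vfs_cache_pressure`.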
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 17:45 ` Daniel Hazelton @ 2007-07-27 18:16 ` Rene Herman 2007-07-27 19:43 ` david ` (2 more replies) 2007-07-27 22:08 ` Mike Galbraith 1 sibling, 3 replies; 535+ messages in thread From: Rene Herman @ 2007-07-27 18:16 UTC (permalink / raw) To: Daniel Hazelton Cc: Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 07/27/2007 07:45 PM, Daniel Hazelton wrote: > Updatedb or another process that uses the FS heavily runs on a users > 256MB P3-800 (when it is idle) and the VFS caches grow, causing memory > pressure that causes other applications to be swapped to disk. In the > morning the user has to wait for the system to swap those applications > back in. > > Questions about it: > Q) Does swap-prefetch help with this? > A) [From all reports I've seen (*)] Yes, it does. No it does not. If updatedb filled memory to the point of causing swapping (which no one is reproducing anyway) it HAS FILLED MEMORY and swap-prefetch hasn't any memory to prefetch into -- updatedb itself doesn't use any significant memory. Here's swap-prefetch's author saying the same: http://lkml.org/lkml/2007/2/9/112 | It can't help the updatedb scenario. Updatedb leaves the ram full and | swap prefetch wants to cost as little as possible so it will never | move anything out of ram in preference for the pages it wants to swap | back in. Now please finally either understand this, or tell us how we're wrong. Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 18:16 ` Rene Herman @ 2007-07-27 19:43 ` david 2007-07-28 7:19 ` Rene Herman 2007-07-27 20:28 ` Daniel Hazelton 2007-07-27 23:15 ` Björn Steinbrink 2 siblings, 1 reply; 535+ messages in thread From: david @ 2007-07-27 19:43 UTC (permalink / raw) To: Rene Herman Cc: Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Fri, 27 Jul 2007, Rene Herman wrote: > On 07/27/2007 07:45 PM, Daniel Hazelton wrote: > >> Updatedb or another process that uses the FS heavily runs on a users >> 256MB P3-800 (when it is idle) and the VFS caches grow, causing memory >> pressure that causes other applications to be swapped to disk. In the >> morning the user has to wait for the system to swap those applications >> back in. >> >> Questions about it: >> Q) Does swap-prefetch help with this? A) [From all reports I've seen (*)] >> Yes, it does. > > No it does not. If updatedb filled memory to the point of causing swapping > (which no one is reproducing anyway) it HAS FILLED MEMORY and swap-prefetch > hasn't any memory to prefetch into -- updatedb itself doesn't use any > significant memory. however there are other programs which are known to take up significant amounts of memory and will cause the issue being described (openoffice for example) please don't get hung up on the text 'updatedb' and accept that there are programs that do run intermittently and do use a significant amount of ram and then free it. David Lang > Here's swap-prefetch's author saying the same: > > http://lkml.org/lkml/2007/2/9/112 > > | It can't help the updatedb scenario. Updatedb leaves the ram full and | swap prefetch wants to cost as little as possible so it will never | move anything out of ram in preference for the pages it wants to swap | back in.
> Now please finally either understand this, or tell us how we're wrong. > > Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 19:43 ` david @ 2007-07-28 7:19 ` Rene Herman 2007-07-28 8:55 ` david 0 siblings, 1 reply; 535+ messages in thread From: Rene Herman @ 2007-07-28 7:19 UTC (permalink / raw) To: david Cc: Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 07/27/2007 09:43 PM, david@lang.hm wrote: > On Fri, 27 Jul 2007, Rene Herman wrote: > >> On 07/27/2007 07:45 PM, Daniel Hazelton wrote: >> >>> Questions about it: >>> Q) Does swap-prefetch help with this? >>> A) [From all reports I've seen (*)] >>> Yes, it does. >> >> No it does not. If updatedb filled memory to the point of causing >> swapping (which no one is reproducing anyway) it HAS FILLED MEMORY and >> swap-prefetch hasn't any memory to prefetch into -- updatedb itself >> doesn't use any significant memory. > > however there are other programs which are known to take up significant > amounts of memory and will cause the issue being described (openoffice > for example) > > please don't get hung up on the text 'updatedb' and accept that there > are programs that do run intermittently and do use a significant amount > of ram and then free it. Different issue. One that's worth pursuing perhaps, but a different issue from the VFS caches issue that people have been trying to track down. Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-28 7:19 ` Rene Herman @ 2007-07-28 8:55 ` david 2007-07-28 10:11 ` Rene Herman 2007-07-28 15:56 ` Daniel Hazelton 0 siblings, 2 replies; 535+ messages in thread From: david @ 2007-07-28 8:55 UTC (permalink / raw) To: Rene Herman Cc: Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Sat, 28 Jul 2007, Rene Herman wrote: > On 07/27/2007 09:43 PM, david@lang.hm wrote: > >> On Fri, 27 Jul 2007, Rene Herman wrote: >> >> > On 07/27/2007 07:45 PM, Daniel Hazelton wrote: >> > >> > > Questions about it: >> > > Q) Does swap-prefetch help with this? >> > > A) [From all reports I've seen (*)] >> > > Yes, it does. >> > >> > No it does not. If updatedb filled memory to the point of causing >> > swapping (which no one is reproducing anyway) it HAS FILLED MEMORY and >> > swap-prefetch hasn't any memory to prefetch into -- updatedb itself >> > doesn't use any significant memory. >> >> however there are other programs which are known to take up significant >> amounts of memory and will cause the issue being described (openoffice for >> example) >> >> please don't get hung up on the text 'updatedb' and accept that there are >> programs that do run intermittently and do use a significant amount of ram >> and then free it. > > Different issue. One that's worth pursuing perhaps, but a different issue > from the VFS caches issue that people have been trying to track down. people are trying to track down the problem of their machine being slow until enough data is swapped back in to operate normally.
in some situations swap prefetch can help because something that used memory freed it, so there is free memory that could be filled with data (which is something that Linux does aggressively in most other situations)

in some other situations swap prefetch cannot help because useless data is getting cached at the expense of useful data.

nobody is arguing that swap prefetch helps in the second case. what people are arguing is that there are situations where it helps for the first case. on some machines and versions of updatedb the nightly run of updatedb can cause both sets of problems. but the nightly updatedb run is not the only thing that can cause problems

but let's talk about the concept here for a little bit

the design is to use CPU and I/O capacity that's otherwise idle to fill free memory with data from swap.

pro:
- more ram has potentially useful data in it

con:
- it takes a little extra effort to give this memory to another app (the page must be removed from the list and zeroed at the time it's needed; I assume that the data is left in swap so that it doesn't have to be written out again)
- it adds some complexity to the kernel (~500 lines IIRC from this thread)
- by undoing recent swapouts it can potentially mask problems with swapout

it looks to me like unless the code was really bad (and after 23 months in -mm it doesn't sound like it is) the only significant con left is the potential to mask other problems. however there are many legitimate cases where it is definitely doing the right thing (swapout was correct in pushing out the pages, but now the cause of that pressure is gone).
the amount of benefit from this will vary from situation to situation, but it's not reasonable to claim that this provides no benefit (you have benchmark numbers that show it in synthetic benchmarks, and you have user reports that show it in the real world). there are lots of things in the kernel whose job is to pre-fill the memory with data that may (or may not) be useful in the future. this is just another method of filling the cache. it does so by saying "the user wanted these pages in the recent past, so it's a reasonable guess to say that the user will want them again in the future" David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-28 8:55 ` david @ 2007-07-28 10:11 ` Rene Herman 2007-07-28 11:21 ` Alan Cox 2007-07-28 21:00 ` david 1 sibling, 2 replies; 535+ messages in thread From: Rene Herman @ 2007-07-28 10:11 UTC (permalink / raw) To: david Cc: Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 07/28/2007 10:55 AM, david@lang.hm wrote: > in some situations swap prefetch can help because something that used > memory freed it so there is free memory that could be filled with data > (which is something that Linux does aggressively in most other situations) > > in some other situations swap prefetch cannot help because useless data > is getting cached at the expense of useful data. > > nobody is arguing that swap prefetch helps in the second case. Oh yes they are. Daniel for example did twice, telling me to turn my brain on in between (if you read it, you may have noticed I got a little annoyed at that point). > but let's talk about the concept here for a little bit > > the design is to use CPU and I/O capacity that's otherwise idle to fill > free memory with data from swap. > > pro: > more ram has potentially useful data in it > > con: > it takes a little extra effort to give this memory to another app (the > page must be removed from the list and zeroed at the time it's needed, I > assume that the data is left in swap so that it doesn't have to be > written out again) It is. Prefetched pages can be dropped on the floor without additional I/O.
> it adds some complexity to the kernel (~500 lines IIRC from this thread) > > by undoing recent swapouts it can potentially mask problems with swapout > > it looks to me like unless the code was really bad (and after 23 months > in -mm it doesn't sound like it is) Not to sound pretentious or anything but I assume that Andrew has a fairly good overview of exactly how broken -mm can be at times. How many -mm users use it anyway? He himself said he's not convinced of usefulness having not seen it help for him (and notice that most developers are also users), turned it off due to it annoying him at some point and hasn't seen a serious investigation into potential downsides. > that the only significant con left is the potential to mask other > problems. Which is not a made-up issue, mind you. As an example, I just now tried GNU locate and saw it's a complete pig and specifically unsuitable for the low memory boxes under discussion. Upon completion, it actually frees enough memory that swap-prefetch _could_ help on some boxes, while the real issue is that they should first and foremost dump GNU locate. > however there are many legitimate cases where it is definitely doing the > right thing (swapout was correct in pushing out the pages, but now the > cause of that pressure is gone). the amount of benefit from this will > vary from situation to situation, but it's not reasonable to claim that > this provides no benefit (you have benchmark numbers that show it in > synthetic benchmarks, and you have user reports that show it in the > real world) I certainly would not want to argue anything of the sort, no. As said a few times, I agree that swap-prefetch makes sense and has at least the potential to help some situations that you really wouldn't even want to try and fix any other way, simply because nothing's broken. > there are lots of things in the kernel whose job is to pre-fill the > memory with data that may (or may not) be useful in the future.
> this is just another method of filling the cache. it does so by saying > "the user wanted these pages in the recent past, so it's a reasonable > guess to say that the user will want them again in the future" Well, _that_ is what the kernel is already going to great lengths to do, and it decided that those pages us poor overnight OO.o users want in in the morning weren't reasonable guesses. The kernel also won't any time soon be reading our minds, so any solution would need either user intervention (we could devise a way to tell the kernel "hey ho, I consider these pages to be very important -- try not to swap them out" possibly even with a "and if you do, please pull them back in when possible") or we can let swap-prefetch do the "just in case" thing it is doing. While swap-prefetch may not be the be-all end-all of solutions I agree that having a machine sit around with free memory and applications in swap seems not too useful if (as is the case) fetched pages can be dropped immediately when it turns out swap-prefetch made the wrong decision. So that's for the concept. As to implementation, if I try and look at the code, it seems to be trying hard to really be free and as such, potential downsides seem limited. It's a rather core concept though and as such needs someone with a _lot_ more VM clue to ack. Sorry for not knowing, but who's maintaining/submitting the thing now that Con's not? He or she should preferably address any concerns it seems. Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-28 10:11 ` Rene Herman @ 2007-07-28 11:21 ` Alan Cox 2007-07-28 16:29 ` Ray Lee ` (2 more replies) 2007-07-28 21:00 ` david 1 sibling, 3 replies; 535+ messages in thread From: Alan Cox @ 2007-07-28 11:21 UTC (permalink / raw) To: Rene Herman Cc: david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel > It is. Prefetched pages can be dropped on the floor without additional I/O. Which is essentially free for most cases. In addition your disk access may well have been in idle time (and should be for this sort of stuff) and if it was in the same chunk as something nearby was effectively free anyway. Actual physical disk ops are precious resource and anything that mostly reduces the number will be a win - not to say swap prefetch is the right answer but accidentally or otherwise there are good reasons it may happen to help. Bigger more linear chunks of writeout/readin is much more important I suspect than swap prefetching. > good overview of exactly how broken -mm can be at times. How many -mm users > use it anyway? He himself said he's not convinced of usefulness having not I've been using it for months with no noticed problem. I turn it on because it might as well get tested. I've not done comparison tests so I can't comment on if it's worth it. Lots of -mm testers turn *everything* on because it's a test kernel. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-28 11:21 ` Alan Cox @ 2007-07-28 16:29 ` Ray Lee 2007-07-28 21:03 ` david 2007-07-29 8:11 ` Rene Herman 2 siblings, 0 replies; 535+ messages in thread From: Ray Lee @ 2007-07-28 16:29 UTC (permalink / raw) To: Alan Cox Cc: Rene Herman, david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 7/28/07, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote: > Actual physical disk ops are precious resource and anything that mostly > reduces the number will be a win - not to say swap prefetch is the right > answer but accidentally or otherwise there are good reasons it may happen > to help. > > Bigger more linear chunks of writeout/readin is much more important I > suspect than swap prefetching. <nod>. The larger the chunks are that we swap out, the less it actually hurts to swap, which might make all this a moot point. Not all I/O is created equal... Ray ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-28 11:21 ` Alan Cox 2007-07-28 16:29 ` Ray Lee @ 2007-07-28 21:03 ` david 2007-07-29 8:11 ` Rene Herman 2 siblings, 0 replies; 535+ messages in thread From: david @ 2007-07-28 21:03 UTC (permalink / raw) To: Alan Cox Cc: Rene Herman, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Sat, 28 Jul 2007, Alan Cox wrote: >> It is. Prefetched pages can be dropped on the floor without additional I/O. > > Which is essentially free for most cases. In addition your disk access > may well have been in idle time (and should be for this sort of stuff) > and if it was in the same chunk as something nearby it was effectively free > anyway. As I understand it, the swap-prefetch only kicks in if the device is idle. > Actual physical disk ops are a precious resource and anything that mostly > reduces the number will be a win - not to say swap prefetch is the right > answer but accidentally or otherwise there are good reasons it may happen > to help. > > Bigger more linear chunks of writeout/readin is much more important I > suspect than swap prefetching. I'm sure this is true while you are doing the swapout or swapin and the system is waiting for it. But with prefetch you may be able to avoid doing the swapin at a time when the system is waiting for it by doing it at a time when the system is otherwise idle. David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-28 11:21 ` Alan Cox 2007-07-28 16:29 ` Ray Lee 2007-07-28 21:03 ` david @ 2007-07-29 8:11 ` Rene Herman 2007-07-29 13:12 ` Alan Cox 2 siblings, 1 reply; 535+ messages in thread From: Rene Herman @ 2007-07-29 8:11 UTC (permalink / raw) To: Alan Cox Cc: david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 07/28/2007 01:21 PM, Alan Cox wrote: >> It is. Prefetched pages can be dropped on the floor without additional >> I/O. > > Which is essentially free for most cases. In addition your disk access > may well have been in idle time (and should be for this sort of stuff) Yes. The swap-prefetch patch ensures that the machine (well, the VM) is very idle before it allows itself to kick in. > and if it was in the same chunk as something nearby it was effectively free > anyway. > > Actual physical disk ops are a precious resource and anything that mostly > reduces the number will be a win - not to say swap prefetch is the right > answer but accidentally or otherwise there are good reasons it may > happen to help. > > Bigger more linear chunks of writeout/readin is much more important I > suspect than swap prefetching. Yes, I believe this might be an important point. Earlier I posted a dumb little VM thrasher: http://lkml.org/lkml/2007/7/25/85 Contrived thing and all, but what it does do is show exactly how bad seeking all over swap-space is. If you push it out before hitting enter, the time it takes easily grows past 10 minutes (with my 768M) versus sub-second (!) when it's all in to start with. What are the tradeoffs here? What wants small chunks? Also, as far as I'm aware Linux does not do things like up the granularity when it notices it's swapping in heavily? That sounds sort of promising... >> good overview of exactly how broken -mm can be at times. 
How many -mm users >> use it anyway? He himself said he's not convinced of usefulness having not > > I've been using it for months with no noticed problem. I turn it on > because it might as well get tested. I've not done comparison tests so I > can't comment on if it's worth it. > > Lots of -mm testers turn *everything* on because it's a test kernel. Okay. Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 8:11 ` Rene Herman @ 2007-07-29 13:12 ` Alan Cox 2007-07-29 14:07 ` Rene Herman 0 siblings, 1 reply; 535+ messages in thread From: Alan Cox @ 2007-07-29 13:12 UTC (permalink / raw) To: Rene Herman Cc: david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel > Contrived thing and all, but what it does do is show exactly how bad seeking > all over swap-space is. If you push it out before hitting enter, the time it > takes easily grows past 10 minutes (with my 768M) versus sub-second (!) when > it's all in to start with. Think in "operations/second" and you get a better view of the disk. > What are the tradeoffs here? What wants small chunks? Also, as far as I'm > aware Linux does not do things like up the granularity when it notices it's > swapping in heavily? That sounds sort of promising... Small chunks means you get better efficiency of memory use - large chunks mean you may well page in a lot more than you needed to each time (and cause more paging in turn). Your disk would prefer you fed it big linear I/O's - 512KB would probably be my first guess at tuning a large box under load for paging chunk size. More radically if anyone wants to do real researchy type work - how about log structured swap with a cleaner ? Alan ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 13:12 ` Alan Cox @ 2007-07-29 14:07 ` Rene Herman 2007-07-29 14:58 ` Ray Lee 0 siblings, 1 reply; 535+ messages in thread From: Rene Herman @ 2007-07-29 14:07 UTC (permalink / raw) To: Alan Cox Cc: david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 07/29/2007 03:12 PM, Alan Cox wrote: >> What are the tradeoffs here? What wants small chunks? Also, as far as >> I'm aware Linux does not do things like up the granularity when it >> notices it's swapping in heavily? That sounds sort of promising... > > Small chunks means you get better efficiency of memory use - large chunks > mean you may well page in a lot more than you needed to each time (and > cause more paging in turn). Your disk would prefer you fed it big linear > I/O's - 512KB would probably be my first guess at tuning a large box > under load for paging chunk size. That probably kills my momentary hope that I was looking at yet another good use of large soft-pages seeing as how 512K would be going overboard a bit right? :-/ > More radically if anyone wants to do real researchy type work - how about > log structured swap with a cleaner ? Right over my head. Why does log-structure help anything? Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 14:07 ` Rene Herman @ 2007-07-29 14:58 ` Ray Lee 2007-07-29 14:59 ` Rene Herman 0 siblings, 1 reply; 535+ messages in thread From: Ray Lee @ 2007-07-29 14:58 UTC (permalink / raw) To: Rene Herman Cc: Alan Cox, david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 7/29/07, Rene Herman <rene.herman@gmail.com> wrote: > On 07/29/2007 03:12 PM, Alan Cox wrote: > > More radically if anyone wants to do real researchy type work - how about > > log structured swap with a cleaner ? > > Right over my head. Why does log-structure help anything? Log structured disk layouts allow for better placement of writeout, so that you can eliminate most or all seeks. Seeks are the enemy when trying to get full disk bandwidth. google on log structured disk layout, or somesuch, for details. Ray ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 14:58 ` Ray Lee @ 2007-07-29 14:59 ` Rene Herman 2007-07-29 15:20 ` Ray Lee 0 siblings, 1 reply; 535+ messages in thread From: Rene Herman @ 2007-07-29 14:59 UTC (permalink / raw) To: Ray Lee Cc: Alan Cox, david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 07/29/2007 04:58 PM, Ray Lee wrote: > On 7/29/07, Rene Herman <rene.herman@gmail.com> wrote: >> On 07/29/2007 03:12 PM, Alan Cox wrote: >>> More radically if anyone wants to do real researchy type work - how about >>> log structured swap with a cleaner ? >> Right over my head. Why does log-structure help anything? > > Log structured disk layouts allow for better placement of writeout, so > that you can eliminate most or all seeks. Seeks are the enemy when > trying to get full disk bandwidth. > > google on log structured disk layout, or somesuch, for details. I understand what log structure is generally, but how does it help swapin? Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 14:59 ` Rene Herman @ 2007-07-29 15:20 ` Ray Lee 2007-07-29 15:36 ` Rene Herman 2007-07-29 19:33 ` Paul Jackson 0 siblings, 2 replies; 535+ messages in thread From: Ray Lee @ 2007-07-29 15:20 UTC (permalink / raw) To: Rene Herman Cc: Alan Cox, david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 7/29/07, Rene Herman <rene.herman@gmail.com> wrote: > On 07/29/2007 04:58 PM, Ray Lee wrote: > > On 7/29/07, Rene Herman <rene.herman@gmail.com> wrote: > >> Right over my head. Why does log-structure help anything? > > > > Log structured disk layouts allow for better placement of writeout, so > > that you can eliminate most or all seeks. Seeks are the enemy when > > trying to get full disk bandwidth. > > > > google on log structured disk layout, or somesuch, for details. > > I understand what log structure is generally, but how does it help swapin? Look at the swap out case first. Right now, when swapping out, the kernel places whatever it can wherever it can inside the swap space. The closer you are to filling your swap space, the more likely that those swapped out blocks will be all over the place, rather than in one nice chunk. Contrast that with a log structured scheme, where the writeout happens to sequential spaces on the drive instead of scattered about. So, at some point when the system needs to fault those blocks back in, it now has a linear span of sectors to read instead of asking the drive to bounce over twenty tracks for a hundred blocks. So, it eliminates the seeks. My laptop drive can read (huh, how odd, it got slower, need to retest in single user mode), hmm, let's go with about 25 MB/s. 
If we ask for a single block from each track, though, that'll drop to 4k * (1 second / seek time) which is about a megabyte a second if we're lucky enough to read from consecutive tracks. Even worse if it's not. Seeks are the enemy any time you need to hit the drive for anything, be it swapping or optimizing a database. Ray ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 15:20 ` Ray Lee @ 2007-07-29 15:36 ` Rene Herman 2007-07-29 16:04 ` Ray Lee 2007-07-29 19:33 ` Paul Jackson 1 sibling, 1 reply; 535+ messages in thread From: Rene Herman @ 2007-07-29 15:36 UTC (permalink / raw) To: Ray Lee Cc: Alan Cox, david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 07/29/2007 05:20 PM, Ray Lee wrote: >> I understand what log structure is generally, but how does it help swapin? > > Look at the swap out case first. > > Right now, when swapping out, the kernel places whatever it can > wherever it can inside the swap space. The closer you are to filling > your swap space, the more likely that those swapped out blocks will be > all over the place, rather than in one nice chunk. Contrast that with a > log structured scheme, where the writeout happens to sequential spaces > on the drive instead of scattered about. This seems to be now fixing the different problem of swap-space filling up. I'm quite willing to for now assume I've got plenty free. > So, at some point when the system needs to fault those blocks > back in, it now has a linear span of sectors to read instead of asking > the drive to bounce over twenty tracks for a hundred blocks. Moreover though -- what I know about log structure is that generally it optimises for write (swapout) and might make read (swapin) worse due to fragmentation that wouldn't happen with a regular fs structure. I guess that cleaner that Alan mentioned might be involved there -- I don't know how/what it would be doing. > So, it eliminates the seeks. My laptop drive can read (huh, how odd, > it got slower, need to retest in single user mode), hmm, let's go with > about 25 MB/s. 
If we ask for a single block from each track, though, > that'll drop to 4k * (1 second / seek time) which is about a megabyte > a second if we're lucky enough to read from consecutive tracks. Even > worse if it's not. > > Seeks are the enemy any time you need to hit the drive for anything, > be it swapping or optimizing a database. I am very aware of the costs of seeks (on current magnetic media). Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 15:36 ` Rene Herman @ 2007-07-29 16:04 ` Ray Lee 2007-07-29 16:59 ` Rene Herman 0 siblings, 1 reply; 535+ messages in thread From: Ray Lee @ 2007-07-29 16:04 UTC (permalink / raw) To: Rene Herman Cc: Alan Cox, david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 7/29/07, Rene Herman <rene.herman@gmail.com> wrote: > On 07/29/2007 05:20 PM, Ray Lee wrote: > This seems to be now fixing the different problem of swap-space filling up. > I'm quite willing to for now assume I've got plenty free. I was trying to point out that currently, as an example, memory that is linear in a process' space could be fragmented on disk when swapped out. That's today. Under a log-structured scheme, one could set it up such that something that was linear in RAM could be swapped out linearly on the drive, minimizing seeks on writeout, which will naturally minimize seeks on swap in of that same data. > > So, at some point when the system needs to fault those blocks that > > back in, it now has a linear span of sectors to read instead of asking > > the drive to bounce over twenty tracks for a hundred blocks. > > Moreover though -- what I know about log structure is that generally it > optimises for write (swapout) and might make read (swapin) worse due to > fragmentation that wouldn't happen with a regular fs structure. It looks like I'm not doing a very good job of explaining this, I'm afraid. Suffice it to say that a log structured swap would give optimization options that we don't have today. > I guess that cleaner that Alan mentioned might be involved there -- I don't > know how/what it would be doing. Then you should google on `log structured filesystem (primer OR introduction)` and read a few of the links that pop up. You might find it interesting. 
> I am very aware of the costs of seeks (on current magnetic media). Then perhaps you can just take it on faith -- log structured layouts are designed to help minimize seeks, read and write. Ray ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 16:04 ` Ray Lee @ 2007-07-29 16:59 ` Rene Herman 2007-07-29 17:19 ` Ray Lee 0 siblings, 1 reply; 535+ messages in thread From: Rene Herman @ 2007-07-29 16:59 UTC (permalink / raw) To: Ray Lee Cc: Alan Cox, david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 07/29/2007 06:04 PM, Ray Lee wrote: >> I am very aware of the costs of seeks (on current magnetic media). > > Then perhaps you can just take it on faith -- log structured layouts > are designed to help minimize seeks, read and write. I am particularly bad at faith. Let's take that stupid program that I posted: http://lkml.org/lkml/2007/7/25/85 You push it out before you hit enter, it's written out to swap, at whatever speed. How should it be laid out so that it's swapped in most efficiently after hitting enter? Reading bigger chunks would quite obviously help, but the layout? The program is not a real-world issue and if you do not consider it a useful boundary condition either (okay I guess), how would log structured swap help if I just assume I have plenty of free swap to begin with? Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 16:59 ` Rene Herman @ 2007-07-29 17:19 ` Ray Lee 2007-07-29 17:33 ` Rene Herman 0 siblings, 1 reply; 535+ messages in thread From: Ray Lee @ 2007-07-29 17:19 UTC (permalink / raw) To: Rene Herman Cc: Alan Cox, david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 7/29/07, Rene Herman <rene.herman@gmail.com> wrote: > On 07/29/2007 06:04 PM, Ray Lee wrote: > >> I am very aware of the costs of seeks (on current magnetic media). > > > > Then perhaps you can just take it on faith -- log structured layouts > > are designed to help minimize seeks, read and write. > > I am particularly bad at faith. Let's take that stupid program that I posted: You only think you are :-). I'm sure there are lots of things you have faith in. Gravity, for example :-). > The program is not a real-world issue and if you do not consider it a useful > boundary condition either (okay I guess), how would log structured swap help > if I just assume I have plenty of free swap to begin with? Is that generally the case on your systems? Every linux system I've run, regardless of RAM, has always pushed things out to swap. And once there's something already in swap, you now have a packing problem when you want to swap something else out. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 17:19 ` Ray Lee @ 2007-07-29 17:33 ` Rene Herman 2007-07-29 17:52 ` Ray Lee 2007-07-29 17:53 ` Alan Cox 0 siblings, 2 replies; 535+ messages in thread From: Rene Herman @ 2007-07-29 17:33 UTC (permalink / raw) To: Ray Lee Cc: Alan Cox, david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 07/29/2007 07:19 PM, Ray Lee wrote: >> The program is not a real-world issue and if you do not consider it a useful >> boundary condition either (okay I guess), how would log structured swap help >> if I just assume I have plenty of free swap to begin with? > > Is that generally the case on your systems? Every linux system I've > run, regardless of RAM, has always pushed things out to swap. For me, it is generally the case yes. We are still discussing this in the context of desktop machines and their problems with being slow as things have been swapped out and generally I expect a desktop to have plenty of swap which it's not regularly going to fill up significantly since then the machine's unworkably slow as a desktop anyway. > And once there's something already in swap, you now have a packing > problem when you want to swap something else out. Once we're crammed, it gets to be a different situation yes. As far as I'm concerned that's for another thread though. I'm spending too much time on LKML as it is... Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 17:33 ` Rene Herman @ 2007-07-29 17:52 ` Ray Lee 2007-07-29 19:05 ` Rene Herman 2007-07-29 17:53 ` Alan Cox 1 sibling, 1 reply; 535+ messages in thread From: Ray Lee @ 2007-07-29 17:52 UTC (permalink / raw) To: Rene Herman Cc: Alan Cox, david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 7/29/07, Rene Herman <rene.herman@gmail.com> wrote: > On 07/29/2007 07:19 PM, Ray Lee wrote: > For me, it is generally the case yes. We are still discussing this in the > context of desktop machines and their problems with being slow as things > have been swapped out and generally I expect a desktop to have plenty of > swap which it's not regularly going to fill up significantly since then the > machine's unworkably slow as a desktop anyway. <Shrug> Well, that doesn't match my systems. My laptop has 400MB in swap:

ray@phoenix:~$ free
             total       used       free     shared    buffers     cached
Mem:        894208     883920      10288          0       3044     163224
-/+ buffers/cache:     717652     176556
Swap:      1116476     393132     723344

> > And once there's something already in swap, you now have a packing > problem when you want to swap something else out. > > Once we're crammed, it gets to be a different situation yes. As far as I'm > concerned that's for another thread though. I'm spending too much time on > LKML as it is... No, it's not even when crammed. It's just when there are holes. mm/swapfile.c does try to cluster things, but doesn't work too hard at it as we don't want to spend all our time looking for a perfect fit that may not exist. Ray ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 17:52 ` Ray Lee @ 2007-07-29 19:05 ` Rene Herman 0 siblings, 0 replies; 535+ messages in thread From: Rene Herman @ 2007-07-29 19:05 UTC (permalink / raw) To: Ray Lee Cc: Alan Cox, david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 07/29/2007 07:52 PM, Ray Lee wrote: > <Shrug> Well, that doesn't match my systems. My laptop has 400MB in swap: Which in your case is slightly more than 1/3 of available swap space. Quite a lot for a desktop indeed. And if it's more than a few percent fragmented, please fix current swapout instead of log structuring it. Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 17:33 ` Rene Herman 2007-07-29 17:52 ` Ray Lee @ 2007-07-29 17:53 ` Alan Cox 1 sibling, 0 replies; 535+ messages in thread From: Alan Cox @ 2007-07-29 17:53 UTC (permalink / raw) To: Rene Herman Cc: Ray Lee, david, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel > > Is that generally the case on your systems? Every linux system I've > > run, regardless of RAM, has always pushed things out to swap. > > For me, it is generally the case yes. We are still discussing this in the > context of desktop machines and their problems with being slow as things > have been swapped out and generally I expect a desktop to have plenty of > swap which it's not regularly going to fill up significantly since then the > machine's unworkably slow as a desktop anyway. A simple log optimises writeout (which is latency critical) and can otherwise stall an entire system. In a log you can also have multiple copies of the same page on disk easily, some stale - so you can write out chunks of data that are not all of them removed from memory, just so you get them back more easily if you then do (and I guess you'd mark them accordingly). The second element is a cleaner - something to go around removing stuff from the log that is needed when the disks are idle - and also to repack data in nice linear chunks. So instead of using the empty disk time for page-in you use it for packing data and optimising future paging. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 15:20 ` Ray Lee 2007-07-29 15:36 ` Rene Herman @ 2007-07-29 19:33 ` Paul Jackson 2007-07-29 20:00 ` Ray Lee 1 sibling, 1 reply; 535+ messages in thread From: Paul Jackson @ 2007-07-29 19:33 UTC (permalink / raw) To: Ray Lee Cc: rene.herman, alan, david, dhazelton, efault, akpm, mingo, frank, andi, nickpiggin, jesper.juhl, ck, linux-mm, linux-kernel Ray wrote: > a log structured scheme, where the writeout happens to sequential spaces > on the drive instead of scattered about. If the problem is reading stuff back in from swap quickly when needed, then this likely helps, by reducing the seeks needed. If the problem is reading stuff back in from swap at the *same time* that the application is reading stuff from some user file system, and if that user file system is on the same drive as the swap partition (typical on laptops), then interleaving the user file system accesses with the swap partition accesses might overwhelm all other performance problems, due to the frequent long seeks between the two. In that case, swap layout and swap i/o block size are secondary. However, pre-fetching, so that swap read back is not interleaved with application file accesses, could help dramatically. === Perhaps we could have a 'wake-up' command, analogous to the various sleep and hibernate commands. The 'wake-up' command could do whatever of the following it knew to do, in order to optimize for an anticipated change in usage patterns: 1) pre-fetch swap 2) clean (write out) dirty pages 3) maximize free memory 4) layout swap nicely 5) pre-fetch a favorite set of apps Stumble out of bed in the morning, press 'wake-up', start boiling the water for your coffee, and in another ten minutes, one is ready to rock and roll. 
In case Andrew is so bored he read this far -- yes this wake-up sounds like user space code, with minimal kernel changes to support any particular lower level operation that we can't do already. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 19:33 ` Paul Jackson @ 2007-07-29 20:00 ` Ray Lee 2007-07-29 20:18 ` Paul Jackson 2007-07-29 21:06 ` Daniel Hazelton 0 siblings, 2 replies; 535+ messages in thread From: Ray Lee @ 2007-07-29 20:00 UTC (permalink / raw) To: Paul Jackson Cc: rene.herman, alan, david, dhazelton, efault, akpm, mingo, frank, andi, nickpiggin, jesper.juhl, ck, linux-mm, linux-kernel On 7/29/07, Paul Jackson <pj@sgi.com> wrote: > If the problem is reading stuff back in from swap at the *same time* > that the application is reading stuff from some user file system, and if > that user file system is on the same drive as the swap partition > (typical on laptops), then interleaving the user file system accesses > with the swap partition accesses might overwhelm all other performance > problems, due to the frequent long seeks between the two. Ah, so in a normal scenario where a working-set is getting faulted back in, we have the swap storage as well as the file-backed stuff that needs to be read as well. So even if swap is organized perfectly, we're still seeking. Damn. On the other hand, that explains another thing that swap prefetch could be helping with -- if it preemptively faults the swap back in, then the file-backed stuff can be faulted back more quickly, just by the virtue of not needing to seek back and forth to swap for its stuff. Hadn't thought of that. That also implies that people running with swap files rather than swap partitions will see less of an issue. I should dig out my old compact flash card and try putting swap on that for a week. > In that case, swap layout and swap i/o block size are secondary. > However, pre-fetching, so that swap read back is not interleaved > with application file accesses, could help dramatically. <Nod> > Perhaps we could have a 'wake-up' command, analogous to the various sleep > and hibernate commands. [...] 
> In case Andrew is so bored he read this far -- yes this wake-up sounds > like user space code, with minimal kernel changes to support any > particular lower level operation that we can't do already. He'd suggested using, uhm, ptrace_peek or somesuch for just such a purpose. The second half of the issue is to know when and what to target. Ray ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 20:00 ` Ray Lee @ 2007-07-29 20:18 ` Paul Jackson 2007-07-29 20:23 ` Ray Lee 2007-07-29 21:06 ` Daniel Hazelton 1 sibling, 1 reply; 535+ messages in thread From: Paul Jackson @ 2007-07-29 20:18 UTC (permalink / raw) To: Ray Lee Cc: rene.herman, alan, david, dhazelton, efault, akpm, mingo, frank, andi, nickpiggin, jesper.juhl, ck, linux-mm, linux-kernel Ray wrote: > Ah, so in a normal scenario where a working-set is getting faulted > back in, we have the swap storage as well as the file-backed stuff > that needs to be read as well. So even if swap is organized perfectly, > we're still seeking. Damn. Perhaps this applies in some cases ... perhaps. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 20:18 ` Paul Jackson @ 2007-07-29 20:23 ` Ray Lee 0 siblings, 0 replies; 535+ messages in thread From: Ray Lee @ 2007-07-29 20:23 UTC (permalink / raw) To: Paul Jackson Cc: rene.herman, alan, david, dhazelton, efault, akpm, mingo, frank, andi, nickpiggin, jesper.juhl, ck, linux-mm, linux-kernel On 7/29/07, Paul Jackson <pj@sgi.com> wrote: > Ray wrote: > > Ah, so in a normal scenario where a working-set is getting faulted > > back in, we have the swap storage as well as the file-backed stuff > > that needs to be read as well. So even if swap is organized perfectly, > > we're still seeking. Damn. > > Perhaps this applies in some cases ... perhaps. Yeah, point taken: better data would make this a lot easier to figure out and target fixes. Ray ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 20:00 ` Ray Lee 2007-07-29 20:18 ` Paul Jackson @ 2007-07-29 21:06 ` Daniel Hazelton 1 sibling, 0 replies; 535+ messages in thread From: Daniel Hazelton @ 2007-07-29 21:06 UTC (permalink / raw) To: Ray Lee Cc: Paul Jackson, rene.herman, alan, david, efault, akpm, mingo, frank, andi, nickpiggin, jesper.juhl, ck, linux-mm, linux-kernel On Sunday 29 July 2007 16:00:22 Ray Lee wrote: > On 7/29/07, Paul Jackson <pj@sgi.com> wrote: > > If the problem is reading stuff back in from swap at the *same time* > > that the application is reading stuff from some user file system, and if > > that user file system is on the same drive as the swap partition > > (typical on laptops), then interleaving the user file system accesses > > with the swap partition accesses might overwhelm all other performance > > problems, due to the frequent long seeks between the two. > > Ah, so in a normal scenario where a working-set is getting faulted > back in, we have the swap storage as well as the file-backed stuff > that needs to be read as well. So even if swap is organized perfectly, > we're still seeking. Damn. That is one reason why I try to have swap on a device dedicated just for it. It helps keep the system from having to seek all over the drive for data. (I remember that this was recommended years ago with Windows - back when you could tell Windows where to put the swap file) > On the other hand, that explains another thing that swap prefetch > could be helping with -- if it preemptively faults the swap back in, > then the file-backed stuff can be faulted back more quickly, just by > the virtue of not needing to seek back and forth to swap for its > stuff. Hadn't thought of that. For it to really help, swap-prefetch would have to be more aggressive. At the moment (if I'm reading the code correctly) the system has to have close to zero activity for it to kick in. 
A tunable knob controlling how much activity is too much for the prefetch to kick in would help with finding a sane default. IMHO it should be the one that provides the most benefit with the least hit to performance. > That also implies that people running with swap files rather than swap > partitions will see less of an issue. I should dig out my old compact > flash card and try putting swap on that for a week. Maybe. It all depends on how much seeking is needed to track down the pages in the swapfile and such. What would really make the situation even better would be doing the log structured swap + cleaner. The log structured swap + cleaner should provide a performance boost by itself - add in the prefetch mechanism and the benefits are even more visible. Another way to improve performance would require making the page replacement mechanism more intelligent. There are bounds to what can be done in the kernel without negatively impacting performance, but, if I've read the code correctly, there might be a better way to decide which pages to evict. One way to do this would be to implement some mechanism that allows the system to choose a single group of contiguous pages (or, say, a large soft-page) over swapping out a single page at a time. (some form of memory defrag would also be nice, but I can't think of a way to do that without massively breaking everything) <snip> > > In case Andrew is so bored he read this far -- yes this wake-up sounds > > like user space code, with minimal kernel changes to support any > > particular lower level operation that we can't do already. > > He'd suggested using, uhm, ptrace_peek or somesuch for just such a > purpose. The second half of the issue is to know when and what to > target. The userspace suggestion that was thrown out earlier would have been as error-prone and problematic as FUSE. A solution like you suggest would be workable - it's small and does a task that is best done in userspace (IMHO).
(IIRC, the original suggestion involved merging maps2 and another patchset into mainline and using that, combined with PEEKTEXT to provide for a userspace swap daemon. Swap, IMHO, should never be handled outside the kernel) What might be useful is a userspace daemon that tracks memory pressure and uses a concise API to trigger various levels of prefetch and/or swap aggressiveness. DRH -- Dialup is like pissing through a pipette. Slow and excruciatingly painful. ^ permalink raw reply [flat|nested] 535+ messages in thread
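The memory-pressure daemon floated above can be sketched in userspace. This is purely illustrative: `read_meminfo`, `pressure_level`, and the thresholds are all invented for the sketch, and the "concise API" into the kernel for triggering prefetch does not exist, so the sketch stops at classifying the pressure level rather than acting on it.

```python
# Hypothetical sketch of a userspace memory-pressure monitor. All names
# and thresholds are invented for illustration only.

def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into a dict of kB values."""
    info = {}
    with open(path) as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.strip().split()[0])  # value is in kB
    return info

def pressure_level(info):
    """Return 0..2: how aggressively prefetch could be allowed to run."""
    free = info.get("MemFree", 0) + info.get("Cached", 0)
    total = info.get("MemTotal", 1)
    ratio = free / total
    if ratio > 0.5:
        return 2   # lots of reclaimable headroom: prefetch aggressively
    if ratio > 0.2:
        return 1   # some headroom: prefetch conservatively
    return 0       # memory is tight: do nothing
```

A real daemon would poll this in a loop and hand the level to whatever tunable the kernel side exposed; the point of the sketch is only that the policy half of the problem fits comfortably in userspace.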
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-28 10:11 ` Rene Herman 2007-07-28 11:21 ` Alan Cox @ 2007-07-28 21:00 ` david 2007-07-29 10:09 ` Rene Herman 1 sibling, 1 reply; 535+ messages in thread From: david @ 2007-07-28 21:00 UTC (permalink / raw) To: Rene Herman Cc: Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Sat, 28 Jul 2007, Rene Herman wrote: > On 07/28/2007 10:55 AM, david@lang.hm wrote: > >> it looks to me like unless the code was really bad (and after 23 months in >> -mm it doesn't sound like it is) > > Not to sound pretentious or anything but I assume that Andrew has a fairly > good overview of exactly how broken -mm can be at times. How many -mm users > use it anyway? He himself said he's not convinced of usefulness having not > seen it help for him (and notice that most developers are also users), turned > it off due to it annoying him at some point and hasn't seen a serious > investigation into potential downsides. if that was the case then people should be responding to the request to get it merged with 'but it caused problems for me when I tried it' I haven't seen any comments like that. >> that the only significant con left is the potential to mask other >> problems. > > Which is not a made-up issue, mind you. As an example, I just now tried GNU > locate and saw it's a complete pig and specifically unsuitable for the low > memory boxes under discussion. Upon completion, it actually frees enough > memory that swap-prefetch _could_ help on some boxes, while the real issue is > that they should first and foremost dump GNU locate. I see the conclusion as being exactly the opposite. here is a workload with some badly designed userspace software that the kernel can make much more pleasant for users.
arguing that users should never use badly designed software in userspace doesn't seem like an argument that will gain much traction. I'm not saying the kernel needs to fix the software itself (a la the sched_yield issues), but the kernel should try and keep such software from hurting the rest of the system where it can. in this case it can't help while the bad software is running, but it could minimize the impact after it finishes. >> however there are many legitimate cases where it is definitely doing the >> right thing (swapout was correct in pushing out the pages, but now the >> cause of that pressure is gone). the amount of benefit from this will vary >> from situation to situation, but it's not reasonable to claim that this >> provides no benefit (you have benchmark numbers that show it in synthetic >> benchmarks, and you have user reports that show it in the real world) > I certainly would not want to argue anything of the sort no. As said a few > times, I agree that swap-prefetch makes sense and has at least the potential > to help some situations that you really wouldn't even want to try and fix any > other way, simply because nothing's broken. so there is a legitimate situation where swap-prefetch will help significantly, what is the downside that prevents it from being included? (reading this thread it sometimes seems like the downside is that updatedb shouldn't cause this problem and so if you fixed updatedb there would be no legitimate benefit, or alternately this patch doesn't help updatedb so there's no legitimate benefit) >> there are lots of things in the kernel whose job is to pre-fill the memory >> with data that may (or may not) be useful in the future. this is just >> another method of filling the cache.
it does so by saying "the user >> wanted these pages in the recent past, so it's a reasonable guess to say >> that the user will want them again in the future" > Well, _that_ is what the kernel is already going to great lengths at doing, > and it decided that those pages us poor overnight OO.o users want in in the > morning weren't reasonable guesses. The kernel also won't any time soon be > reading our minds, so any solution would need either user intervention (we > could devise a way to tell the kernel "hey ho, I consider these pages to be > very important -- try not to swap them out" possibly even with a "and if you > do, please pull them back in when possible") or we can let swap-prefetch do > the "just in case" thing it is doing. it's not that they shouldn't have been swapped out (they should have been), it's that the reason they were swapped out no longer exists. > While swap-prefetch may not be the be-all end-all of solutions I agree that > having a machine sit around with free memory and applications in swap seems > not too useful if (as is the case) fetched pages can be dropped immediately > when it turns out swap-prefetch made the wrong decision. > > So that's for the concept. As to implementation, if I try and look at the > code, it seems to be trying hard to really be free and as such, potential > downsides seem limited. It's a rather core concept though and as such needs > someone with a _lot_ more VM clue to ack. Sorry for not knowing, but who's > maintaining/submitting the thing now that Con's not? He or she should > preferably address any concerns it seems. I've seen it mentioned that there is still a maintainer but I missed who it is, but I haven't seen any concerns that can be addressed, they all seem to be 'this is a core concept, people need to think about it' or 'but someone may find a better answer in the future' type of things. it's impossible to address these concerns directly.
David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-28 21:00 ` david @ 2007-07-29 10:09 ` Rene Herman 2007-07-29 11:41 ` david 0 siblings, 1 reply; 535+ messages in thread From: Rene Herman @ 2007-07-29 10:09 UTC (permalink / raw) To: david Cc: Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 07/28/2007 11:00 PM, david@lang.hm wrote: >> many -mm users use it anyway? He himself said he's not convinced of >> usefulness having not seen it help for him (and notice that most >> developers are also users), turned it off due to it annoying him at >> some point and hasn't seen a serious investigation into potential >> downsides. > > if that was the case then people should be responding to the request to > get it merged with 'but it caused problems for me when I tried it' > > I haven't seen any comments like that. So you're saying Andrew did not say that? You're jumping to the conclusion that I am saying that it's causing problems. >>> that the only significant con left is the potential to mask other >>> problems. >> >> Which is not a madeup issue, mind you. As an example, I just now tried >> GNU locate and saw it's a complete pig and specifically unsuitable for >> the low memory boxes under discussion. Upon completion, it actually >> frees enough memory that swap-prefetch _could_ help on some boxes, >> while the real issue is that they should first and foremost dump GNU >> locate. > > I see the conclusion as being exactly the opposite. And now you do it again :-) There is no conclusion -- just the inescapable observation that swap-prefetch was (or may have been) masking the problem of GNU locate being a program that noone in their right mind should be using. > so there is a legitimate situation where swap-prefetch will help > significantly, what is the downside that prevents it from being > included? 
People being unconvinced it helps all that much, no serious investigation into possible downsides and no consideration of alternatives are three I've personally heard. You don't want to merge a conceptually core VM feature if you're not really convinced. It's not a part of the kernel you can throw a feature into like you could some driver saying "ah, heck, if it makes someone happy" since everything in the VM ends up interacting -- that in fact is actually the hard part of VM as far as I've seen it. And in this situation the proposed feature is something that "papers over a problem" by design -- where it could certainly be that the problem is not solvable in another way simply due to the kernel not growing the possibility to read users' minds anytime soon (which some might even like to rephrase as "due to no problem existing") but that this gets people a bit anxious is not surprising. > I've seen it mentioned that there is still a maintainer but I missed who > it is, but I haven't seen any concerns that can be addressed, they all > seem to be 'this is a core concept, people need to think about it' or > 'but someone may find a better answer in the future' type of things. it's > impossible to address these concerns directly. So do it indirectly. But please don't just say "it helps some people (not me mind you!) so merge it and if you don't it's all just politics and we can't do anything about it anyway". Because that's mostly what I've been hearing. And no, I'm not subscribed to any ck mailing lists nor do I hang around its IRC community which can account for part of that. I expect though that the same holds for the people that actually matter in this, such as Andrew Morton and Nick Piggin. -- 1: people being unconvinced it helps all that much At least partly caused by the updatedb i/dcache red herring that infected this issue.
Also, at the point VM pressure has mounted high enough to cause enough to be swapped out to give you a bad experience, a lot of other things have been dropped already as well. It's unsurprising though that it would for example help the issue of openoffice with a large open spreadsheet having been thrown out overnight meaning it's a matter of deciding whether or not this is an important enough issue to fix inside the VM with something like swap-prefetch. Personally -- no opinion, I do not experience the problem (I even switch off the machine at night and do not run cron at all). -- 2: no serious investigation into possible downsides Swap-prefetch tries hard to be as free as possible and it seems to largely be succeeding at that. The thing that (obviously -- as in I wouldn't want to state it's the only possible worry anyone could have left) remains is the "papering over effect" it has by design that one might not care for. -- 3: no serious consideration of possible alternatives Tweaking existing use-once logic is one I've heard but if we consider the i/dcache issue dead, I believe that one is as well. Going to userspace is another one. Largest theoretical potential. I myself am extremely sceptical about the Linux userland, and largely equate it with "smallest _practical_ potential" -- but that might just be me. A larger swap granularity, possibly even a self-training granularity. Up to now, seeks only get costlier and costlier with respect to reads with every generation of disk (flash would largely overcome it though) and doing more in one read/write _greatly_ improves throughput, maybe up to the point that swap-prefetch is no longer very useful. I myself don't know about the tradeoffs involved. Any other alternatives? Any 4th and higher points? Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
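The granularity point above can be put in rough numbers. This is a sketch under assumed figures (roughly 20 MB/s sequential throughput and 10 ms per seek for a laptop drive of that era; both numbers are illustrative, not measured), showing why reading swap back in larger contiguous chunks changes the picture so much:

```python
def swapin_time_s(total_mb, chunk_kb, seek_ms=10.0, bw_mb_s=20.0):
    """Rough time to read total_mb of swapped-out pages back in,
    in chunk_kb-sized contiguous chunks, paying one seek per chunk.
    seek_ms and bw_mb_s are illustrative laptop-drive figures."""
    chunks = (total_mb * 1024) / chunk_kb
    seek_time = chunks * seek_ms / 1000.0       # total seek cost
    transfer_time = total_mb / bw_mb_s          # sequential read cost
    return seek_time + transfer_time

# Swapping 128 MB back in, 4 KB (one page) per seek vs 256 KB clusters:
#   swapin_time_s(128, 4)   -> 32768 seeks, ~327.7 s seeking + 6.4 s reading
#   swapin_time_s(128, 256) ->   512 seeks,   ~5.1 s seeking + 6.4 s reading
```

With per-page granularity the time is almost entirely seeks, which is exactly the "seeks get costlier with respect to reads every disk generation" point: a larger granularity attacks the dominant term directly, whether or not prefetch is also present.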
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 10:09 ` Rene Herman @ 2007-07-29 11:41 ` david 2007-07-29 14:01 ` Rene Herman 0 siblings, 1 reply; 535+ messages in thread From: david @ 2007-07-29 11:41 UTC (permalink / raw) To: Rene Herman Cc: Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Sun, 29 Jul 2007, Rene Herman wrote: > On 07/28/2007 11:00 PM, david@lang.hm wrote: > >> > many -mm users use it anyway? He himself said he's not convinced of >> > usefulness having not seen it help for him (and notice that most >> > developers are also users), turned it off due to it annoying him at some >> > point and hasn't seen a serious investigation into potential downsides. >> >> if that was the case then people should be responding to the request to >> get it merged with 'but it caused problems for me when I tried it' >> >> I haven't seen any comments like that. > > So you're saying Andrew did not say that? You're jumping to the conclusion > that I am saying that it's causing problems. I don't remember anyone saying that it actually caused problems (including both you and andrew). I (and others) have been trying to learn what problems people believe it has in the hope that they can be addressed one way or another. >> > > that the only significant con left is the potential to mask other >> > > problems. >> > >> > Which is not a madeup issue, mind you. As an example, I just now tried >> > GNU locate and saw it's a complete pig and specifically unsuitable for >> > the low memory boxes under discussion. Upon completion, it actually >> > frees enough memory that swap-prefetch _could_ help on some boxes, while >> > the real issue is that they should first and foremost dump GNU locate. >> >> I see the conclusion as being exactly the opposite. 
> And now you do it again :-) There is no conclusion -- just the inescapable > observation that swap-prefetch was (or may have been) masking the problem of > GNU locate being a program that no one in their right mind should be using. isn't your conclusion then that if people just stopped using that version of updatedb the problem would be solved and there would be no need for the swap prefetch patch? that seemed to be what you were strongly implying (if not saying outright) >> so there is a legitimate situation where swap-prefetch will help >> significantly, what is the downside that prevents it from being included? > > People being unconvinced it helps all that much, no serious investigation > into possible downsides and no consideration of alternatives are three I've > personally heard. > > You don't want to merge a conceptually core VM feature if you're not really > convinced. It's not a part of the kernel you can throw a feature into like > you could some driver saying "ah, heck, if it makes someone happy" since > everything in the VM ends up interacting -- that in fact is actually the hard > part of VM as far as I've seen it. > > And in this situation the proposed feature is something that "papers over a > problem" by design -- where it could certainly be that the problem is not > solvable in another way simply due to the kernel not growing the possibility > to read users' minds anytime soon (which some might even like to rephrase as > "due to no problem existing") but that this gets people a bit anxious is not > surprising. people who have lots of memory and so don't use swap will never see the benefit of this patch. over the years many people have investigated the problem and tried to address it in other ways (the better version of updatedb is an attempt to fix it for that program as an example), but there is still a problem.
I agree that tinkering with the core VM code should not be done lightly, but this has been put through the proper process and is stalled with no hints on how to move forward. >> I've seen it mentioned that there is still a maintainer but I missed who >> it is, but I haven't seen any concerns that can be addressed, they all >> seem to be 'this is a core concept, people need to think about it' or 'but >> someone may find a better answer in the future' type of things. it's >> impossible to address these concerns directly. > So do it indirectly. But please don't just say "it helps some people (not me > mind you!) so merge it and if you don't it's all just politics and we can't > do anything about it anyway". Because that's mostly what I've been hearing. > > And no, I'm not subscribed to any ck mailing lists nor do I hang around its > IRC community which can account for part of that. I expect though that > the same holds for the people that actually matter in this, such as Andrew > Morton and Nick Piggin. > > -- 1: people being unconvinced it helps all that much > > At least partly caused by the updatedb i/dcache red herring that infected > this issue. Also, at the point VM pressure has mounted high enough to cause > enough to be swapped out to give you a bad experience, a lot of other things > have been dropped already as well. > > It's unsurprising though that it would for example help the issue of > openoffice with a large open spreadsheet having been thrown out overnight > meaning it's a matter of deciding whether or not this is an important enough > issue to fix inside the VM with something like swap-prefetch. > > Personally -- no opinion, I do not experience the problem (I even switch off > the machine at night and do not run cron at all). forget the nightly cron jobs for the moment. think of this scenario.
you have your memory fairly full with apps that you have open (including firefox with many tabs), you receive a spreadsheet you need to look at, so you fire up openoffice to look at it. then you exit openoffice and try to go back to firefox (after a pause while you walk to the printer to get the printout of the spreadsheet), only to find that it's going to be sluggish because it got swapped out due to the pressure from openoffice. no nightly cron job needed, just enough of a memory hog or a small enough amount of RAM to have your working set exceed it. > -- 2: no serious investigation into possible downsides > > Swap-prefetch tries hard to be as free as possible and it seems to largely be > succeeding at that. The thing that (obviously -- as in I wouldn't want to state > it's the only possible worry anyone could have left) remains is the "papering > over effect" it has by design that one might not care for. > > -- 3: no serious consideration of possible alternatives > > Tweaking existing use-once logic is one I've heard but if we consider the > i/dcache issue dead, I believe that one is as well. Going to userspace is > another one. Largest theoretical potential. I myself am extremely sceptical > about the Linux userland, and largely equate it with "smallest _practical_ > potential" -- but that might just be me. > > A larger swap granularity, possibly even a self-training granularity. Up to > now, seeks only get costlier and costlier with respect to reads with every > generation of disk (flash would largely overcome it though) and doing more in > one read/write _greatly_ improves throughput, maybe up to the point that > swap-prefetch is no longer very useful. I myself don't know about the > tradeoffs involved.
larger swap granularity may help, but waiting for the user to need the RAM and have to wait for it to be read back in is always going to be worse for the user than pre-populating the free memory (for the case where the pre-population is right, for other cases it's the same). so I see this as a red herring > Any other alternatives? > > Any 4th and higher points? there are fully legitimate situations where this is useful, the 'papering over' effect is not referring to these, it's referring to other possible problems in the future. I see this argument as being in the same category as people wanting to remove the old, working driver for some hardware in favor of the new, unreliable driver for the same hardware in order to get more bug reports to find the holes in the new driver. that's causing users unnecessary pain and within the last week Linus was taking a driver author to task for attempting exactly that IIRC (and yes, there does come a point where there are no further bugs known in the new driver and it appears to do everything the old driver did, at which point you do remove the old driver, but you don't remove it early to help the new driver stabilize) David Lang > Rene. > > ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 11:41 ` david @ 2007-07-29 14:01 ` Rene Herman 2007-07-29 21:19 ` david 0 siblings, 1 reply; 535+ messages in thread From: Rene Herman @ 2007-07-29 14:01 UTC (permalink / raw) To: david Cc: Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 07/29/2007 01:41 PM, david@lang.hm wrote: >> And now you do it again :-) There is no conclusion -- just the >> inescapable observation that swap-prefetch was (or may have been) >> masking the problem of GNU locate being a program that no one in their >> right mind should be using. > isn't your conclusion then that if people just stopped using that > version of updatedb the problem would be solved and there would be no > need for the swap prefetch patch? that seemed to be what you were > strongly implying (if not saying outright) No. What I said outright, every single time, is that swap-prefetch in itself seems to make sense. And specifically that even if the _direct_ problem is a crummy program, it _still_ makes sense generally. Every single time. But see -- you failed to notice this because you guys are stuck in this dumb adversary "us against them" thing so inherent in (online) communities, where you sit around your own habitats patting each other on the back for extended periods of time and then every once in a while go out clinging on to each other vigorously and going "boo! hiss!" at the big bad outside world. I already got overly violent at one point in this thread so I'll leave out any further references to sense-deprived fanboy-culture but please, I said every single time that I'm not against swap-prefetch. I cannot communicate when I'm not being read.
> I agree that tinkering with the core VM code should not be done lightly, > but this has been put through the proper process and is stalled with no > hints on how to move forward. It has not. Concerns that were raised (by specifically Nick Piggin) weren't being addressed. > forget the nightly cron jobs for the moment. think of this scenario. you > have your memory fairly full with apps that you have open (including > firefox with many tabs), you receive a spreadsheet you need to look at, > so you fire up openoffice to look at it. then you exit openoffice and try > to go back to firefox (after a pause while you walk to the printer to > get the printout of the spreadsheet) And swinging a dead rat from its tail facing eastwards while reciting Documentation/CodingStyle. Okay, very very sorry, that was particularly childish, but that "walking to the printer" is of course completely constructed and this _is_ something to take into account. Swap-prefetch wants to be free, which (also again) it is doing a good job at it seems, but this also means that it waits for the VM to be _very_ idle before it does anything and as such, we cannot just forget the "nightly" scenario and pretend it's about something else entirely. As long as the machine's being used, swap-prefetch doesn't kick in. Which is a good feature for swap-prefetch, but also something that needs to be weighed alongside its other features in a discussion of alternatives, where for example something like a larger swap granularity would not have anything of the sort to take into account. If it were about walks to the printer, we could shelve the issue as being of too limited practical use for inclusion. >> -- 2: no serious investigation into possible downsides >> >> Swap-prefetch tries hard to be as free as possible and it seems to >> largely be succeeding at that.
Thing that (obviously -- as in I >> wouldn't want to state it's the only possible worry anyone could have >> left) remains is the "papering over effect" it has by design that one >> might not care for. Arjan van de Ven made another point here about seeking away due to swap-prefetch (just) before the next request comes in, but that's probably a bit of a non-issue in practice with the "very idle" precondition. >> -- 3: no serious consideration of possible alternatives >> >> Tweaking existing use-oce logic is one I've heard but if we consider >> the i/dcache issue dead, I believe that one is as well. Going to >> userspace is another one. Largest theoretical potential. I myself am >> extremely sceptical about the Linux userland, and largely equate it >> with "smallest _practical_ potential" -- but that might just be me. >> >> A larger swap granularity, possible even a self-training granularity. >> Up to now, seeks only get costlier and costlier with respect to reads >> with every generation of disk (flash would largely overcome it though) >> and doing more in one read/write _greatly_ improves throughput, maybe >> up to the point that swap-prefetch is no longer very useful. I myself >> don't know about the tradeoffs involved. > > larger swap granularity may help, but waiting for the user to need the > ram and have to wait for it to be read back in is always going to be > worse for the user then pre-populating the free memory (for the case > where the pre-population is right, for other cases it's the same). so I > see this as a red herring I saw Chris Snook make a good post here and am going to defer this part to that discussion: http://lkml.org/lkml/2007/7/27/421 But no, it's not a red herring if _practically_ speaking the swapin is fast enough once started that people don't actually mind anymore since in that case you could simply do without yet more additional VM complexity (and kernel daemon). 
> there are fully legitimate situations where this is useful, the 'papering > over' effect is not referring to these, it's referring to other possible > problems in the future. No, it's not just future. Just look at the various things under discussion now such as improved use-once and better swapin. Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 14:01 ` Rene Herman @ 2007-07-29 21:19 ` david 2007-08-06 2:14 ` Nick Piggin 0 siblings, 1 reply; 535+ messages in thread From: david @ 2007-07-29 21:19 UTC (permalink / raw) To: Rene Herman Cc: Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Sun, 29 Jul 2007, Rene Herman wrote: > On 07/29/2007 01:41 PM, david@lang.hm wrote: > >> I agree that tinkering with the core VM code should not be done lightly, >> but this has been put through the proper process and is stalled with no >> hints on how to move forward. > > It has not. Concerns that were raised (by specifically Nick Piggin) weren't > being addressed. I may have missed them, but what I saw from him weren't specific issues, but instead a nebulous 'something better may come along later' >> forget the nightly cron jobs for the moment. think of this scenerio. you >> have your memory fairly full with apps that you have open (including >> firefox with many tabs), you receive a spreadsheet you need to look at, so >> you fire up openoffice to look at it. then you exit openoffice and try >> to go back to firefox (after a pause while you walk to the printer to >> get the printout of the spreadsheet) > > And swinging a dead rat from its tail facing east-wards while reciting > Documentation/CodingStyle. > > Okay, very very sorry, that was particularly childish, but that "walking to > the printer" is ofcourse completely constructed and this _is_ something to > take into account. yes it was contrived for simplicity. the same effect would happen if instead of going back to firefox the user instead went to their e-mail software and read some mail. doing so should still make the machine idle enough to let prefetch kick in. 
> Swap-prefetch wants to be free, which (also again) it is > doing a good job at it seems, but this also means that it waits for the VM to > be _very_ idle before it does anything and as such, we cannot just forget the > "nightly" scenario and pretend it's about something else entirely. As long as > the machine's being used, swap-prefetch doesn't kick in. how long does the machine need to be idle? if someone spends 30 seconds reading an e-mail that's an incredibly long time for the system and I would think it should be enough to let the prefetch kick in. >> > -- 3: no serious consideration of possible alternatives >> > >> > Tweaking existing use-oce logic is one I've heard but if we consider >> > the i/dcache issue dead, I believe that one is as well. Going to >> > userspace is another one. Largest theoretical potential. I myself am >> > extremely sceptical about the Linux userland, and largely equate it >> > with "smallest _practical_ potential" -- but that might just be me. >> > >> > A larger swap granularity, possible even a self-training >> > granularity. Up to now, seeks only get costlier and costlier with >> > respect to reads with every generation of disk (flash would largely >> > overcome it though) and doing more in one read/write _greatly_ >> > improves throughput, maybe up to the point that swap-prefetch is no >> > longer very useful. I myself don't know about the tradeoffs >> > involved. >> >> larger swap granularity may help, but waiting for the user to need the >> ram and have to wait for it to be read back in is always going to be >> worse for the user then pre-populating the free memory (for the case >> where the pre-population is right, for other cases it's the same). 
so >> I see this as a red herring > > I saw Chris Snook make a good post here and am going to defer this part to > that discussion: > > http://lkml.org/lkml/2007/7/27/421 > > But no, it's not a red herring if _practically_ speaking the swapin is fast > enough once started that people don't actually mind anymore since in that > case you could simply do without yet more additional VM complexity (and > kernel daemon). swapin will always require disk access, and avoiding doing disk access while the user is waiting for it by doing it when the system isn't using the disk will always be a win (possibly not as large of a win, but still a win) on slow laptop drives where you may only get 20MB/second of reads under optimal situations it doesn't take much reading to be noticed by the user. >> there are fully legitimate situations where this is useful, the 'papering >> over' effect is not referring to these, it's referring to other possible >> problems in the future. > > No, it's not just future. Just look at the various things under discussion > now such as improved use-once and better swapin. and these things do not conflict with prefetch, they complement it. improved use-once will avoid pushing things out to swap in the first place. this will help during normal workloads so is valuable in any case. better swapin (I assume you are talking about things like larger swap granularity) will also help during normal workloads when you are thrashing into swap. prefetch will help when you have pushed things out to swap and now have free memory and a momentarily idle system. David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
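The cost david cites can be put in rough numbers. A back-of-envelope sketch: the 20 MB/s sequential-read figure is from the message above, while the working-set size and the seek-penalty factor are assumptions for illustration only, not measurements from this thread.

```python
# Back-of-envelope cost of demand swap-in vs. idle-time prefetch.
# 20 MB/s is the laptop-drive figure quoted above; the 200 MB working
# set and the 2x seek penalty are hypothetical, for illustration.
def swapin_wait_seconds(working_set_mb, read_mb_per_s, seek_penalty=2.0):
    """Rough user-visible wait to fault a working set back from swap.

    seek_penalty models demand swap-in being seeky rather than
    sequential (an assumed factor, not a benchmark result).
    """
    return working_set_mb / read_mb_per_s * seek_penalty

# Even a modest working set means a wait the user notices -- which is
# the latency that prefetching during idle time would hide.
print(f"{swapin_wait_seconds(200, 20):.0f}s")  # 20s of visible waiting
```

The same disk work done while the system is idle costs the user nothing, which is the whole argument for prefetch in this sub-thread.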
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 21:19 ` david @ 2007-08-06 2:14 ` Nick Piggin 2007-08-06 2:22 ` david 0 siblings, 1 reply; 535+ messages in thread From: Nick Piggin @ 2007-08-06 2:14 UTC (permalink / raw) To: david Cc: Rene Herman, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel david@lang.hm wrote: > On Sun, 29 Jul 2007, Rene Herman wrote: > >> On 07/29/2007 01:41 PM, david@lang.hm wrote: >> >>> I agree that tinkering with the core VM code should not be done >>> lightly, >>> but this has been put through the proper process and is stalled with no >>> hints on how to move forward. >> >> >> It has not. Concerns that were raised (by specifically Nick Piggin) >> weren't being addressed. > > > I may have missed them, but what I saw from him weren't specific issues, > but instead a nebulous 'something better may come along later' Something better, ie. the problems with page reclaim being fixed. Why is that nebulous? -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-08-06 2:14 ` Nick Piggin @ 2007-08-06 2:22 ` david 2007-08-06 9:21 ` Nick Piggin 0 siblings, 1 reply; 535+ messages in thread From: david @ 2007-08-06 2:22 UTC (permalink / raw) To: Nick Piggin Cc: Rene Herman, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Mon, 6 Aug 2007, Nick Piggin wrote: > david@lang.hm wrote: >> On Sun, 29 Jul 2007, Rene Herman wrote: >> >> > On 07/29/2007 01:41 PM, david@lang.hm wrote: >> > >> > > I agree that tinkering with the core VM code should not be done >> > > lightly, >> > > but this has been put through the proper process and is stalled with >> > > no >> > > hints on how to move forward. >> > >> > >> > It has not. Concerns that were raised (by specifically Nick Piggin) >> > weren't being addressed. >> >> >> I may have missed them, but what I saw from him weren't specific issues, >> but instead a nebulous 'something better may come along later' > > Something better, ie. the problems with page reclaim being fixed. > Why is that nebulous? because that doesn't begin to address all the benefits. the approach of fixing page reclaim and updatedb is pretending that if you only do everything right pages won't get pushed to swap in the first place, and therefore swap prefetch won't be needed. this completely ignores the use case where the swapping was exactly the right thing to do, but memory has been freed up from a program exiting so that you could now fill that empty ram with data that was swapped out. David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-08-06 2:22 ` david @ 2007-08-06 9:21 ` Nick Piggin 2007-08-06 9:55 ` Paolo Ciarrocchi 0 siblings, 1 reply; 535+ messages in thread From: Nick Piggin @ 2007-08-06 9:21 UTC (permalink / raw) To: david Cc: Rene Herman, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel --- david@lang.hm wrote: > On Mon, 6 Aug 2007, Nick Piggin wrote: > > > david@lang.hm wrote: > >> On Sun, 29 Jul 2007, Rene Herman wrote: > >> > >> > On 07/29/2007 01:41 PM, david@lang.hm wrote: > >> > > >> > > I agree that tinkering with the core VM code > should not be done > >> > > lightly, > >> > > but this has been put through the proper > process and is stalled with > >> > > no > >> > > hints on how to move forward. > >> > > >> > > >> > It has not. Concerns that were raised (by > specifically Nick Piggin) > >> > weren't being addressed. > >> > >> > >> I may have missed them, but what I saw from him > weren't specific issues, > >> but instead a nebulous 'something better may > come along later' > > > > Something better, ie. the problems with page > reclaim being fixed. > > Why is that nebulous? > > because that doesn't begin to address all the > benefits. What do you mean "address the benefits"? What I want to address is the page reclaim problems. > the approach of fixing page reclaim and updatedb is > pretending that if you > only do everything right pages won't get pushed to > swap in the first > place, and therefore swap prefetch won't be needed. You should read what I wrote. Anyway, the fact of the matter is that there are still fairly significant problems with page reclaim in this workload which I would like to see fixed. 
I personally still think some of the low hanging fruit *might* be better fixed before swap prefetch gets merged, but I've repeatedly said I'm sick of getting dragged back into the whole debate so I'm happy with whatever Andrew decides to do with it. I think it is sad to turn it off for laptops, if it really makes the "desktop" experience so much better. Surely for _most_ workloads we should be able to manage 1-2GB of RAM reasonably well. > this completely ignores the use case where the > swapping was exactly the > right thing to do, but memory has been freed up from > a program exiting so > that you could now fill that empty ram with data that > was swapped out. Yeah. However, merging patches (especially when changing heuristics, especially in page reclaim) is not about just thinking up a use-case that it works well for and telling people that they're putting their heads in the sand if they say anything against it. Read this thread and you'll find other examples of patches that have been around for as long or longer and also have some good use-cases and also have not been merged. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-08-06 9:21 ` Nick Piggin @ 2007-08-06 9:55 ` Paolo Ciarrocchi 0 siblings, 0 replies; 535+ messages in thread From: Paolo Ciarrocchi @ 2007-08-06 9:55 UTC (permalink / raw) To: Nick Piggin, Andrew Morton Cc: david, Rene Herman, Daniel Hazelton, Mike Galbraith, Ingo Molnar, Frank Kingswood, Andi Kleen, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 8/6/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: [...] > > this completely ignores the use case where the > > swapping was exactly the > > right thing to do, but memory has been freed up from > > a program exiting so > > that you could now fill that empty ram with data that > > was swapped out. > > Yeah. However, merging patches (especially when > changing heuristics, especially in page reclaim) is > not about just thinking up a use-case that it works > well for and telling people that they're putting their > heads in the sand if they say anything against it. > Read this thread and you'll find other examples of > patches that have been around for as long or longer > and also have some good use-cases and also have not > been merged. What do you think, Andrew? Swap prefetch is not the panacea, it's not going to solve all the problems but it seems to improve the "desktop experience" and it has been discussed and reviewed a lot (it has even been discussed more than it should have been). Are you going to push the patch upstream? Ciao, -- Paolo http://paolo.ciarrocchi.googlepages.com/ ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-28 8:55 ` david 2007-07-28 10:11 ` Rene Herman @ 2007-07-28 15:56 ` Daniel Hazelton 2007-07-28 21:06 ` david 1 sibling, 1 reply; 535+ messages in thread From: Daniel Hazelton @ 2007-07-28 15:56 UTC (permalink / raw) To: david Cc: Rene Herman, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Saturday 28 July 2007 04:55:58 david@lang.hm wrote: > On Sat, 28 Jul 2007, Rene Herman wrote: > > On 07/27/2007 09:43 PM, david@lang.hm wrote: > >> On Fri, 27 Jul 2007, Rene Herman wrote: > >> > On 07/27/2007 07:45 PM, Daniel Hazelton wrote: > >> > > Questions about it: > >> > > Q) Does swap-prefetch help with this? > >> > > A) [From all reports I've seen (*)] > >> > > Yes, it does. > >> > > >> > No it does not. If updatedb filled memory to the point of causing > >> > swapping (which no one is reproducing anyway) it HAS FILLED MEMORY and > >> > swap-prefetch hasn't any memory to prefetch into -- updatedb itself > >> > doesn't use any significant memory. > >> > >> however there are other programs which are known to take up significant > >> amounts of memory and will cause the issue being described (openoffice > >> for example) > >> > >> please don't get hung up on the text 'updatedb' and accept that there > >> are programs that do run intermittently and do use a significant amount > >> of ram and then free it. > > > > Different issue. One that's worth pursuing perhaps, but a different > > issue from the VFS caches issue that people have been trying to track > > down. > > people are trying to track down the problem of their machine being slow > until enough data is swapped back in to operate normally. 
> > in some situations swap prefetch can help because something that used > memory freed it so there is free memory that could be filled with data > (which is something that Linux does aggressively in most other situations) > > in some other situations swap prefetch cannot help because useless data is > getting cached at the expense of useful data. > > nobody is arguing that swap prefetch helps in the second case. Actually, I made a mistake when tracking the thread and reading the code for the patch and started to argue just that. But I have to admit I made a mistake - the patch's author has stated (as Rene was kind enough to point out) that swap prefetch can't help when memory is filled. > what people are arguing is that there are situations where it helps for > the first case. on some machines and versions of updatedb the nightly run of > updatedb can cause both sets of problems. but the nightly updatedb run is > not the only thing that can cause problems Solving the cache filling memory case is difficult. There have been a number of discussions about it. The simplest solution, IMHO, would be to place a (configurable) hard limit on the maximum size any of the kernel's caches can grow to. (The only solution that was discussed, however, is a complex beast) > > but let's talk about the concept here for a little bit > > the design is to use CPU and I/O capacity that's otherwise idle to fill > free memory with data from swap. 
> > pro: > more ram has potentially useful data in it > > con: > it takes a little extra effort to give this memory to another app (the > page must be removed from the list and zeroed at the time it's needed, I > assume that the data is left in swap so that it doesn't have to be written > out again) > > it adds some complexity to the kernel (~500 lines IIRC from this thread) > > by undoing recent swapouts it can potentially mask problems with swapout > > it looks to me like unless the code was really bad (and after 23 months in > -mm it doesn't sound like it is) that the only significant con left is the > potential to mask other problems. I'll second this. But with the swap system itself having seen as heavy testing as it has I don't know if it would be masking other problems. That is why I've been asking "What is so wrong with it?" - while it definitely doesn't help with programs that cause caches to balloon (that problem does need another solution) it does help to speed things up when a memory hog has exited. (And since it's a pretty safe assumption that swap is going to be noticeably slower than RAM this patch seems to me to be a rather visible and obvious solution to that problem) > however there are many legitimate cases where it is definitely doing the > right thing (swapout was correct in pushing out the pages, but now the > cause of that pressure is gone). the amount of benefit from this will vary > from situation to situation, but it's not reasonable to claim that this > provides no benefit (you have benchmark numbers that show it in synthetic > benchmarks, and you have user reports that show it in the real-world) Exactly. Though I have seen posts which (to me at least) appear to claim exactly that. It was part of the reason why I got a bit incensed. (The other was that it looked like the kernel devs with the ultra-powerful machines were claiming 'I don't see the problem on my machine, so it doesn't exist'. 
That sort of attitude is fine, in some cases, but not, IMHO, where performance is concerned) > there are lots of things in the kernel whose job is to pre-fill the memory > with data that may (or may not) be useful in the future. this is just > another method of filling the cache. it does so by saying "the user wanted > these pages in the recent past, so it's a reasonable guess to say that the > user will want them again in the future" Yep. And it's a pretty obvious step forward. The VFS system already does readahead and caching for mounted volumes to improve performance - why not do something similar to improve the performance of swap? The only real downside is that swap-prefetch won't be effective in all cases and it will cause some extra power consumption. (drives can't spin-down as soon as they would without it, etc...) While I can only make some suggestions as to how to fix the problem of ballooning caches (I've been wading through the VM code for a few days now and still don't fully understand any of it), the solution to the power consumption seems obvious - swap prefetch doesn't work when the system is running on battery (or UPS or whatever) DRH -- Dialup is like pissing through a pipette. Slow and excruciatingly painful. ^ permalink raw reply [flat|nested] 535+ messages in thread
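The policy david lays out in the pro/con message above (prefetch only when idle, leave the on-disk copy in place, hand the frame back cheaply when an app wants it) can be sketched as a toy model. This is an illustrative userspace sketch of that policy, not Con Kolivas's actual kernel swap-prefetch code; all class and method names are invented.

```python
# Toy model of the prefetch policy described above: when the system is
# idle and there is genuinely free memory, copy recently swapped-out
# pages back into free RAM, leaving the on-disk copy valid so the page
# can later be dropped without a writeback.  Illustrative only.
from collections import deque

class SwapPrefetchModel:
    def __init__(self, ram_pages):
        self.free = ram_pages          # free page frames
        self.prefetched = {}           # page -> data, clean copies in RAM
        self.swapped = deque()         # swapped-out pages, oldest first

    def swap_out(self, page, data):
        self.swapped.append((page, data))

    def tick(self, system_idle):
        """One prefetch step; does nothing unless the system is idle."""
        if system_idle and self.free > 0 and self.swapped:
            page, data = self.swapped.pop()   # most recently evicted first
            self.prefetched[page] = data
            self.free -= 1                    # frame now holds a clean copy

    def allocate(self):
        """An app wants a frame: reclaiming a prefetched one is cheap --
        drop the clean copy (still valid in swap) and hand the frame over."""
        if self.free > 0:
            self.free -= 1
        elif self.prefetched:
            self.prefetched.popitem()         # no writeback needed
        else:
            raise MemoryError("would have to evict a live page")

m = SwapPrefetchModel(ram_pages=2)
m.swap_out("firefox", "tabs")
m.tick(system_idle=False)   # machine busy: prefetch stays out of the way
m.tick(system_idle=True)    # idle: firefox's pages come back for free
print("firefox" in m.prefetched)  # True
```

The `allocate` path captures the con david lists: giving a prefetched frame to another app costs a little bookkeeping, but never a write-out, because the swap copy was never invalidated.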
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-28 15:56 ` Daniel Hazelton @ 2007-07-28 21:06 ` david 2007-07-28 21:48 ` Daniel Hazelton 0 siblings, 1 reply; 535+ messages in thread From: david @ 2007-07-28 21:06 UTC (permalink / raw) To: Daniel Hazelton Cc: Rene Herman, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Sat, 28 Jul 2007, Daniel Hazelton wrote: > > On Saturday 28 July 2007 04:55:58 david@lang.hm wrote: >> On Sat, 28 Jul 2007, Rene Herman wrote: >>> On 07/27/2007 09:43 PM, david@lang.hm wrote: >>>> On Fri, 27 Jul 2007, Rene Herman wrote: >>>>> On 07/27/2007 07:45 PM, Daniel Hazelton wrote: >> >> nobody is arguing that swap prefetch helps in the second case. > > Actually, I made a mistake when tracking the thread and reading the code for > the patch and started to argue just that. But I have to admit I made a > mistake - the patch's author has stated (as Rene was kind enough to point > out) that swap prefetch can't help when memory is filled. I stand corrected, thanks for speaking up and correcting your position. >> what people are arguing is that there are situations where it helps for >> the first case. on some machines and versions of updatedb the nightly run of >> updatedb can cause both sets of problems. but the nightly updatedb run is >> not the only thing that can cause problems > > Solving the cache filling memory case is difficult. There have been a number > of discussions about it. The simplest solution, IMHO, would be to place a > (configurable) hard limit on the maximum size any of the kernel's caches can > grow to. (The only solution that was discussed, however, is a complex beast) limiting the size of the cache is also the wrong thing to do in many situations. 
it's only right if the cache pushes out other data you care about, if you are trying to do one thing as fast as you can you really do want the system to use all the memory it can for the cache. David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-28 21:06 ` david @ 2007-07-28 21:48 ` Daniel Hazelton 0 siblings, 0 replies; 535+ messages in thread From: Daniel Hazelton @ 2007-07-28 21:48 UTC (permalink / raw) To: david Cc: Rene Herman, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Saturday 28 July 2007 17:06:50 david@lang.hm wrote: > On Sat, 28 Jul 2007, Daniel Hazelton wrote: > > On Saturday 28 July 2007 04:55:58 david@lang.hm wrote: > >> On Sat, 28 Jul 2007, Rene Herman wrote: > >>> On 07/27/2007 09:43 PM, david@lang.hm wrote: > >>>> On Fri, 27 Jul 2007, Rene Herman wrote: > >>>>> On 07/27/2007 07:45 PM, Daniel Hazelton wrote: > >> > >> nobody is arguing that swap prefetch helps in the second case. > > > > Actually, I made a mistake when tracking the thread and reading the code > > for the patch and started to argue just that. But I have to admit I made > > a mistake - the patch's author has stated (as Rene was kind enough to > > point out) that swap prefetch can't help when memory is filled. > > I stand corrected, thanks for speaking up and correcting your position. If you had made the statement before I decided to speak up you would have been correct :) Anyway, I try to always admit when I've made a mistake - it's part of my philosophy. (There have been times when I haven't done it, but I'm trying to make that stop entirely) > >> what people are arguing is that there are situations where it helps for > >> the first case. on some machines and versions of updatedb the nightly run > >> of updatedb can cause both sets of problems. but the nightly updatedb > >> run is not the only thing that can cause problems > > > > Solving the cache filling memory case is difficult. There have been a > > number of discussions about it. 
The simplest solution, IMHO, would be to > > place a (configurable) hard limit on the maximum size any of the kernel's > > caches can grow to. (The only solution that was discussed, however, is a > > complex beast) > > limiting the size of the cache is also the wrong thing to do in many > situations. it's only right if the cache pushes out other data you care > about, if you are trying to do one thing as fast as you can you really do > want the system to use all the memory it can for the cache. After thinking about this you are partially correct. There are those sorts of situations where you want the system to use all the memory it can for caches. OTOH, if those situations could be described in some sort of simple heuristic, then a soft-limit that uses those heuristics to determine when to let the cache expand could exploit the benefits of having both a limited and unlimited cache. (And, potentially, if the heuristic has allowed a cache to expand beyond the limit then, when the heuristic shows the oversize cache is no longer necessary, it could trigger an automatic reclaim of that memory.) (I'm willing to help write and test code to do exactly this. There is no guarantee that I'll be able to help with more than testing - I don't understand the parts of the code involved all that well) DRH -- Dialup is like pissing through a pipette. Slow and excruciatingly painful. ^ permalink raw reply [flat|nested] 535+ messages in thread
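The soft-limit idea Daniel sketches above can be made concrete with a toy cache. This is a hypothetical illustration of the heuristic-driven soft limit he proposes, not code from any kernel cache: the hit-rate threshold is an invented stand-in for whatever "is the extra capacity paying off?" heuristic one would actually use, and all names are made up.

```python
# Toy soft-limit cache: it may grow past its limit while a heuristic
# (here: recent hit rate, an invented stand-in) says the extra
# capacity is earning its keep, and excess entries are reclaimed once
# it stops doing so.  Illustrative only.
from collections import OrderedDict

class SoftLimitCache:
    def __init__(self, soft_limit, expand_hit_rate=0.8):
        self.soft_limit = soft_limit
        self.expand_hit_rate = expand_hit_rate
        self.entries = OrderedDict()       # key -> value, LRU order
        self.hits = self.lookups = 0

    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0

    def get(self, key):
        self.lookups += 1
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as recently used
            return self.entries[key]
        return None

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        # Reclaim only when over the limit AND the heuristic says the
        # oversize cache is no longer necessary.
        while (len(self.entries) > self.soft_limit
               and self.hit_rate() < self.expand_hit_rate):
            self.entries.popitem(last=False)   # drop coldest entry
```

Under a hot workload the cache expands freely (the "do one thing as fast as you can" case david defends); once the hit rate collapses, the automatic reclaim Daniel describes shrinks it back toward the limit.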
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 18:16 ` Rene Herman 2007-07-27 19:43 ` david @ 2007-07-27 20:28 ` Daniel Hazelton 2007-07-28 5:19 ` Rene Herman 2007-07-27 23:15 ` Björn Steinbrink 2 siblings, 1 reply; 535+ messages in thread From: Daniel Hazelton @ 2007-07-27 20:28 UTC (permalink / raw) To: Rene Herman Cc: Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Friday 27 July 2007 14:16:32 Rene Herman wrote: > On 07/27/2007 07:45 PM, Daniel Hazelton wrote: > > Updatedb or another process that uses the FS heavily runs on a user's > > 256MB P3-800 (when it is idle) and the VFS caches grow, causing memory > > pressure that causes other applications to be swapped to disk. In the > > morning the user has to wait for the system to swap those applications > > back in. > > > > Questions about it: > > Q) Does swap-prefetch help with this? > > A) [From all reports I've seen (*)] Yes, it does. > > No it does not. If updatedb filled memory to the point of causing swapping > (which no one is reproducing anyway) it HAS FILLED MEMORY and swap-prefetch > hasn't any memory to prefetch into -- updatedb itself doesn't use any > significant memory. Check the attitude at the door, then re-read what I actually said: > > Updatedb or another process that uses the FS heavily runs on a user's > > 256MB P3-800 (when it is idle) and the VFS caches grow, causing memory > > pressure that causes other applications to be swapped to disk. In the > > morning the user has to wait for the system to swap those applications > > back in. I never said that it was the *program* itself - or *any* specific program (I used "Updatedb" because it has been the big name in the discussion) - doing the filling of memory. I actually said that the problem is that the kernel's caches - VFS and others - will grow *WITHOUT* *LIMIT*, filling all available memory. 
Swap prefetch on its own will not alleviate *all* of the problem, but it appears to fix enough of it that the problem doesn't seem to bother people anymore. (As I noted later on there are things that can be changed that would also fix things. Those changes, however, are quite tricky and involve changes to the page faulting mechanism, the way the various caches work and a number of other things) In light of the fact that swap prefetch appears to solve the problem for the people that have been vocal about it, and because it is a less intrusive change than the other potential solutions, I'd like to know why all the complaints and arguments against it come down to "It's treating the symptom". I mean it - because I fail to see how it isn't getting at the root of the problem - which is, pretty much, that Swap has classically been and, in the case of most modern systems, still is damned slow. By prefetching those pages that have most recently been evicted the problem of "slow swap" is being directly addressed. You want to know what causes the problem? The current design of the caches. They will extend without much limit, to the point of actually pushing pages to disk so they can grow even more. > Here's swap-prefetch's author saying the same: > > http://lkml.org/lkml/2007/2/9/112 > > | It can't help the updatedb scenario. Updatedb leaves the ram full and > | swap prefetch wants to cost as little as possible so it will never > | move anything out of ram in preference for the pages it wants to swap > | back in. > > Now please finally either understand this, or tell us how we're wrong. > > Rene. I already did. You completely ignored it because I happened to use the magic words "updatedb" and "swap prefetch". Did I ever say it was about "updatedb" in particular? You've got the statement in the part of my post that you quoted. Nope, appears that I used the name as a specific example - and one that has been used previously in the thread. 
Now drop the damned attitude and start using your brain. Okay? DRH -- Dialup is like pissing through a pipette. Slow and excruciatingly painful. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 20:28 ` Daniel Hazelton @ 2007-07-28 5:19 ` Rene Herman 0 siblings, 0 replies; 535+ messages in thread From: Rene Herman @ 2007-07-28 5:19 UTC (permalink / raw) To: Daniel Hazelton Cc: Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel, B.Steinbrink On 07/27/2007 10:28 PM, Daniel Hazelton wrote: > Check the attitude at the door then re-read what I actually said: Attitude? You wanted attitude, dear boy? >>> Updatedb or another process that uses the FS heavily runs on a users >>> 256MB P3-800 (when it is idle) and the VFS caches grow, causing memory >>> pressure that causes other applications to be swapped to disk. In the >>> morning the user has to wait for the system to swap those applications >>> back in. > > I never said that it was the *program* itself - or *any* specific program (I > used "Updatedb" because it has been the big name in the discussion) - doing > the filling of memory. I actually said that the problem is that the kernel's > caches - VFS and others - will grow *WITHOUT* *LIMIT*, filling all available > memory. WHICH SWAP-PREFETCH DOES NOT HELP WITH. WHICH SWAP-PREFETCH DOES NOT HELP WITH. WHICH SWAP-PREFETCH DOES NOT HELP WITH. And now finally get that through your thick skull or shut up, right fucking now. > You want to know what causes the problem? The current design of the caches. > They will extend without much limit, to the point of actually pushing pages > to disk so they can grow even more. Due to being a generally nice guy, I am going to try _once_ more to try and make you understand. Not twice, once. So pay attention. Right now. Those caches are NOT causing any problem under discussion. 
If any caches grow to the point of causing swap-out, they have filled memory and swap-prefetch cannot and will not do anything since it needs free (as in not occupied by caches) memory. As such, people maintaining that swap-prefetch helps their situation are not being hit by caches. The only way swap-prefetch can (and will) do anything is when something that by itself takes up lots of memory runs and exits. So can we now please finally drop the fucking red herring and start talking about swap-prefetch? If we accept that some of the people maintaining that swap-prefetch helps them are not in fact deluded -- a bit of a stretch seeing as how not a single one of them is substantiating anything -- we have a number of slightly different possibilities for "something" in the above. -- 1) It could be an inefficient updatedb. Although he isn't experiencing the problem, Bjoern Steinbrink is posting numbers (weeee!) that show that at least the GNU version spawns a large memory "sort" process meaning that on a low-memory box updatedb itself can be what causes the observed problem. While in this situation switching to a different updatedb (slocate, mlocate) obviously makes sense it's the kind of situation where swap-prefetch will help. -- 2) It could be something else entirely such as a backup run. I suppose people would know if they were running anything of the sort though and wouldn't blame anything on updatedb. Other than that, it's again the situation where swap-prefetch would help. -- 3) The something else entirely can also run _after_ updatedb, kicking out the VFS caches and leaving free memory upon exit. I still suppose the same thing as under (2) but this is the only way how updatedb / VFS caches can even be part of any problem, if the _combined_ memory pressure is just enough to make the difference. The direct problem is still just the "something else entirely" and needs someone affected to tell us what it is. > I already did. 
You completely ignored it because I happened to use the magic > words "updatedb" and "swap prefetch". No I did not. This thread is about swap-prefetch and you used the magic words VFS caches. I don't give a fryin' fuck if their filling is caused by updatedb or the cat sleeping on the "find /<enter>" keys on your keyboard, they're still not causing anything swap-prefetch helps with. This thread has seen input from a selection of knowledgeable people and Morton was even running benchmarks to look at this supposed VFS cache problem and not finding it. The only further input this thread needs is someone affected by the supposed problem. Which I of course notice in a follow-up of yours you are not either -- you're just here to blabber, not to solve anything. Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 18:16 ` Rene Herman 2007-07-27 19:43 ` david 2007-07-27 20:28 ` Daniel Hazelton @ 2007-07-27 23:15 ` Björn Steinbrink 2007-07-27 23:29 ` Andi Kleen 2007-07-28 7:35 ` Rene Herman 2 siblings, 2 replies; 535+ messages in thread From: Björn Steinbrink @ 2007-07-27 23:15 UTC (permalink / raw) To: Rene Herman Cc: Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 2007.07.27 20:16:32 +0200, Rene Herman wrote: > On 07/27/2007 07:45 PM, Daniel Hazelton wrote: > >> Updatedb or another process that uses the FS heavily runs on a users >> 256MB P3-800 (when it is idle) and the VFS caches grow, causing memory >> pressure that causes other applications to be swapped to disk. In the >> morning the user has to wait for the system to swap those applications >> back in. >> Questions about it: >> Q) Does swap-prefetch help with this? A) [From all reports I've seen (*)] >> Yes, it does. > > No it does not. If updatedb filled memory to the point of causing swapping > (which no one is reproducing anyway) it HAS FILLED MEMORY and swap-prefetch > hasn't any memory to prefetch into -- updatedb itself doesn't use any > significant memory. > > Here's swap-prefetch's author saying the same: > > http://lkml.org/lkml/2007/2/9/112 > > | It can't help the updatedb scenario. Updatedb leaves the ram full and > | swap prefetch wants to cost as little as possible so it will never > | move anything out of ram in preference for the pages it wants to swap > | back in. > > Now please finally either understand this, or tell us how we're wrong. Con might have been wrong there for boxes with really little memory. My desktop box has not even 300k inodes in use (IIRC someone posted a df -i output showing 1 million inodes in use). 
Still, the memory footprint of the "sort" process grows up to about 50MB. Assuming that the average filename length stays the same, that would mean 150MB for the 1 million inode case, just for the "sort" process. Now, sort cannot produce any output before it's got all its input, so that RSS usage exists at least as long as the VFS cache is growing due to the ongoing search for files. And then, all that memory that "sort" uses is required, because sort needs to output its results. So if there's memory pressure, the VFS cache is likely to be dropped, because "sort" needs its data, for sorting and producing output. And then sort terminates and leaves that whole lot of memory _unused_. The other actions of updatedb only touch the locate db, which is just a few megs (4.5MB here) big so the cache won't grow that much again. OK, so we got about, say, at least 128MB of totally unused memory, maybe even more. If you look at the vmstat output I sent, you see that I had between 90MB and 128MB free, depending on the swappiness setting, with increased inode usage, that could very well scale up. Conclusion: updatedb does _not_ leave the RAM full. And for a box with little memory (say 256MB) it might even be 50% or more memory that is free after updatedb ran. Might that make swap prefetch kick in? Any faults in that reasoning? Thanks, Björn ^ permalink raw reply [flat|nested] 535+ messages in thread
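Björn's extrapolation above is plain linear scaling from his observed numbers (300k inodes, ~50MB RSS); the text rounds the 1-million-inode case down to 150MB. As a sketch, with everything except his two measurements being arithmetic:

```python
# Back-of-the-envelope check of the extrapolation: if sort's RSS is
# dominated by the pathname list, memory should scale roughly linearly
# with the inode count (assuming constant average pathname length).

def estimate_sort_rss_mb(inodes, observed_inodes=300_000, observed_rss_mb=50):
    """Linear extrapolation of sort's resident set size in MB."""
    return observed_rss_mb * inodes / observed_inodes

# The 1-million-inode case from the df -i report:
print(round(estimate_sort_rss_mb(1_000_000)))  # 167 -- same ballpark as the 150MB above
```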
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 23:15 ` Björn Steinbrink @ 2007-07-27 23:29 ` Andi Kleen 2007-07-28 0:08 ` Björn Steinbrink ` (2 more replies) 2007-07-28 7:35 ` Rene Herman 1 sibling, 3 replies; 535+ messages in thread From: Andi Kleen @ 2007-07-27 23:29 UTC (permalink / raw) To: Björn Steinbrink, Rene Herman, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel > Any faults in that reasoning? GNU sort uses a merge sort with temporary files on disk. Not sure how much it keeps in memory during that, but it's probably less than 150MB. At some point the dirty limit should kick in and write back the data of the temporary files; so it's not quite the same as anonymous memory. But it's not that different given. It would be better to measure than to guess. At least Andrew's measurements on 128MB actually didn't show updatedb being really that big a problem. Perhaps some people have much more files or simply a less efficient updatedb implementation? I guess the people who complain here that loudly really need to supply some real numbers. -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
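The behaviour Andi describes -- sort fixed-size runs in memory, spill each run to a temporary file, then merge the runs back -- is the classic external merge sort. A minimal sketch of the technique (illustration only, not coreutils code; the tiny run_size stands in for sort's memory buffer):

```python
import heapq
import tempfile

# External merge sort sketch: cut input into runs, sort each in RAM,
# spill to a temp file, then k-way merge the sorted runs from disk.

def external_sort(lines, run_size=3):
    run_files = []
    # Phase 1: sort each run in memory and spill it to a temporary file.
    for i in range(0, len(lines), run_size):
        run = sorted(lines[i:i + run_size])
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(l + "\n" for l in run)
        f.seek(0)
        run_files.append(f)
    # Phase 2: merge the runs; this is where the temporary files are
    # read back in -- and where their pages land in the page cache.
    merged = [l.rstrip("\n") for l in heapq.merge(*run_files)]
    for f in run_files:
        f.close()
    return merged

print(external_sort(["banana", "cherry", "apple", "date", "fig", "elderberry"]))
# ['apple', 'banana', 'cherry', 'date', 'elderberry', 'fig']
```

The point relevant to the thread: phase 2 re-reads everything that was spilled, so the temp-file data passes through the page cache once more right before being deleted.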
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 23:29 ` Andi Kleen @ 2007-07-28 0:08 ` Björn Steinbrink 2007-07-28 1:10 ` Daniel Hazelton 2007-07-29 12:53 ` Paul Jackson 2 siblings, 0 replies; 535+ messages in thread From: Björn Steinbrink @ 2007-07-28 0:08 UTC (permalink / raw) To: Andi Kleen Cc: Rene Herman, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 2007.07.28 01:29:19 +0200, Andi Kleen wrote: > > Any faults in that reasoning? > > GNU sort uses a merge sort with temporary files on disk. Not sure > how much it keeps in memory during that, but it's probably less > than 150MB. At some point the dirty limit should kick in and write back the > data of the temporary files; so it's not quite the same as anonymous memory. > But it's not that different given. Hm, does that change anything? The files need to be read at the end (so they go into the cache) and are deleted afterwards (cache gets freed I guess?). > It would be better to measure than to guess. At least Andrew's measurements > on 128MB actually didn't show updatedb being really that big a problem. Here's a before/after memory usage for an updatedb run:

root@atjola:~# free -m
             total       used       free     shared    buffers     cached
Mem:          2011       1995         15          0        269        779
-/+ buffers/cache:        946       1064
Swap:         1945          0       1945
root@atjola:~# updatedb
root@atjola:~# free -m
             total       used       free     shared    buffers     cached
Mem:          2011       1914         96          0        209        746
-/+ buffers/cache:        958       1052
Swap:         1945          0       1944

81MB more unused RAM afterwards. If anyone can make use of that, here's a snippet from /proc/$PID/smaps of updatedb's sort process, when it was at about its peak memory usage (according to the RSS column in top), which was about 50MB.
2b90ab3c1000-2b90ae4c3000 rw-p 2b90ab3c1000 00:00 0
Size:             50184 kB
Rss:              50184 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:    50184 kB
Referenced:       50184 kB

> Perhaps some people have much more files or simply a less efficient > updatedb implementation? sort (GNU coreutils) 5.97 GNU updatedb version 4.2.31 > I guess the people who complain here that loudly really need to supply > some real numbers. Just to clarify: I'm not complaining either way, neither about not merging swap prefetch, nor about someone wanting that to be merged. It was rather the "discussion" that caught my attention... Just in case ;-) Björn ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 23:29 ` Andi Kleen 2007-07-28 0:08 ` Björn Steinbrink @ 2007-07-28 1:10 ` Daniel Hazelton 2007-07-29 12:53 ` Paul Jackson 2 siblings, 0 replies; 535+ messages in thread From: Daniel Hazelton @ 2007-07-28 1:10 UTC (permalink / raw) To: Andi Kleen Cc: Björn Steinbrink, Rene Herman, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Friday 27 July 2007 19:29:19 Andi Kleen wrote: > > Any faults in that reasoning? > > GNU sort uses a merge sort with temporary files on disk. Not sure > how much it keeps in memory during that, but it's probably less > than 150MB. At some point the dirty limit should kick in and write back the > data of the temporary files; so it's not quite the same as anonymous > memory. But it's not that different given. Yes, this should occur. But how many programs use temporary files like that? From what I can tell FireFox and OpenOffice both keep all their data in memory, only using a single file for some buffering purposes. When they get pushed out by a memory hog (either short term or long term) it takes several seconds for them to be swapped back in. (I'm on a P4-1.3GHz machine with 1G of ram and rarely run more than four programs (Mail Client, XChat, FireFox and a console window) and I've seen this lag in FireFox when switching to it after starting OOo. I've also seen the same sort of lag when exiting OOo. I'll see about getting some numbers about this) > It would be better to measure than to guess. At least Andrew's measurements > on 128MB actually didn't show updatedb being really that big a problem. I agree. As I've said previously, it isn't updatedb itself which causes the problem. It's the way the VFS cache seems to just expand and expand - to the point of evicting pages to make room for itself.
However, I may be wrong about that - I haven't actually tested it for myself, just looked at the numbers and other information that has been posted in this thread. > Perhaps some people have much more files or simply a less efficient > updatedb implementation? Yes, it could be the proliferation of files. It could also be some other sort of problem that is exposing a corner-case in the VFS cache or the MM. I, personally, am of the opinion that it is likely the aforementioned corner case for people reporting the "updatedb" problem. If it is, then swap-prefetch is just papering over the problem. However I do not have the knowledge and understanding of the subsystems involved to be able to do much more than make a (probably wrong) guess. > I guess the people who complain here that loudly really need to supply > some real numbers. I've seen numerous "real numbers" posted about this. As was said earlier in the thread "every time numbers are posted they are claimed to be no good". But hey, nobody's perfect :) Anyway, the discussion seems to be turning to the technical merits of swap-prefetch... Now, a completely different question: During the research (and lots of thinking) I've been doing while this thread has been going on I've often wondered why swap prefetch wasn't already in the kernel. The problem of slow swap-in has long been known, and, given current hardware, the optimal solution would be some sort of data prefetch - similar to what is done to speed up normal disk reads. Swap prefetch looks like it does exactly that. The algo could be argued over and/or improved (to suggest ways to do that I'd have to give it more than a 10 minute look) but it does provide a speed-up. This speed increase will probably be enjoyed more by the home users, but the performance increase could also help on enterprise systems.
Now I'll be the first one to admit that there is a trade-off there - it will cause more power to be used because the disks don't get a chance to spin down (or go through a cycle every time the prefetch system starts) but that could, potentially, be alleviated by having "laptop mode" switch it off. (And no, I'm not claiming that it is perfect - but then, what is when it's first merged into the kernel?) DRH -- Dialup is like pissing through a pipette. Slow and excruciatingly painful. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 23:29 ` Andi Kleen 2007-07-28 0:08 ` Björn Steinbrink 2007-07-28 1:10 ` Daniel Hazelton @ 2007-07-29 12:53 ` Paul Jackson 2 siblings, 0 replies; 535+ messages in thread From: Paul Jackson @ 2007-07-29 12:53 UTC (permalink / raw) To: Andi Kleen Cc: B.Steinbrink, rene.herman, dhazelton, efault, akpm, mingo, frank, andi, nickpiggin, ray-lk, jesper.juhl, ck, linux-mm, linux-kernel Andi wrote: > GNU sort uses a merge sort with temporary files on disk. Not sure > how much it keeps in memory during that, but it's probably less > than 150MB. If I'm reading the source code for GNU sort correctly, then the following snippet of shell code displays how much memory it uses for its primary buffer on typical GNU/Linux systems:

head -2 /proc/meminfo | awk '
	NR == 1 { memtotal = $2 }
	NR == 2 { memfree = $2 }
	END {
		if (memfree > memtotal/8)
			m = memfree
		else
			m = memtotal/8
		print "sort size:", m/2, "kB"
	}
'

That is, over simplifying, GNU sort looks at the first two entries in /proc/meminfo, which for example on a machine near me happen to be:

MemTotal:      2336472 kB
MemFree:        110600 kB

and then uses one-half of whichever is -greater- of MemTotal/8 or MemFree. ... However ... for the typical GNU locate updatedb run, it is sorting the list of pathnames for almost all files on the system, which is usually larger than fits in one of these sized buffers. So it ends up using quite a few of the temporary files you mention, which tends to chew up easily freed memory. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 535+ messages in thread
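Paul's shell snippet boils down to one sizing rule. A Python transcription of his reading of GNU sort (this follows his awk, not the coreutils source, so it carries the same over-simplification):

```python
# GNU sort's primary buffer, per Paul's reading: half of whichever is
# greater, MemTotal/8 or MemFree (both in kB, from /proc/meminfo).

def sort_buffer_kb(memtotal_kb, memfree_kb):
    m = memfree_kb if memfree_kb > memtotal_kb / 8 else memtotal_kb / 8
    return m / 2

# Paul's example machine: MemTotal 2336472 kB, MemFree 110600 kB.
# MemTotal/8 = 292059 kB beats MemFree, so the buffer comes out at
# about 146029 kB (~143 MB).
print(sort_buffer_kb(2336472, 110600))  # 146029.5
```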
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 23:15 ` Björn Steinbrink 2007-07-27 23:29 ` Andi Kleen @ 2007-07-28 7:35 ` Rene Herman 2007-07-28 8:51 ` Rene Herman 1 sibling, 1 reply; 535+ messages in thread From: Rene Herman @ 2007-07-28 7:35 UTC (permalink / raw) To: Björn Steinbrink, Rene Herman, Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 07/28/2007 01:15 AM, Björn Steinbrink wrote: > On 2007.07.27 20:16:32 +0200, Rene Herman wrote: >> Here's swap-prefetch's author saying the same: >> >> http://lkml.org/lkml/2007/2/9/112 >> >> | It can't help the updatedb scenario. Updatedb leaves the ram full and >> | swap prefetch wants to cost as little as possible so it will never >> | move anything out of ram in preference for the pages it wants to swap >> | back in. >> >> Now please finally either understand this, or tell us how we're wrong. > > Con might have been wrong there for boxes with really little memory. Note -- with "the updatedb scenario" both he in the above and I are talking about the "VFS caches filling memory cause the problem" not updatedb in particular. > My desktop box has not even 300k inodes in use (IIRC someone posted a df > -i output showing 1 million inodes in use). Still, the memory footprint > of the "sort" process grows up to about 50MB. Assuming that the average > filename length stays, that would mean 150MB for the 1 million inode > case, just for the "sort" process. Even if it's not 150MB, 50MB is already a lot on a 128 or even a 256MB box. So, yes, we're now at the expected scenario of some hog pushing out things and freeing it upon exit again and it's something swap-prefetch definitely has potential to help with. 
Said early in the thread that it's hard to imagine how it would not help in any such situation, so the discussion may, as far as I'm concerned, at that point concentrate on whether swap-prefetch hurts anything in others. Some people, I believe, are not convinced it helps very significantly due to _everything_ having been thrown out at that point, but a copy of openoffice with a large spreadsheet open should come back to life much quicker, it would seem. > Any faults in that reasoning? No. If the machine goes idle after some memory hog _itself_ pushes things out and then exits, swap-prefetch helps, at the very least potentially. By the way -- I'm unable to make my slocate grow substantial here but I'll try what GNU locate does. If it's really as bad as I hear then regardless of anything else it should really be either fixed or dumped... Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-28 7:35 ` Rene Herman @ 2007-07-28 8:51 ` Rene Herman 0 siblings, 0 replies; 535+ messages in thread From: Rene Herman @ 2007-07-28 8:51 UTC (permalink / raw) To: Björn Steinbrink Cc: Daniel Hazelton, Mike Galbraith, Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On 07/28/2007 09:35 AM, Rene Herman wrote: > By the way -- I'm unable to make my slocate grow substantial here but > I'll try what GNU locate does. If it's really as bad as I hear then > regardless of anything else it should really be either fixed or dumped... Yes. GNU locate is broken and nobody should be using it. The updatedb from (my distribution standard) "slocate" uses around 2M allocated total during an entire run while GNU locate allocates some 30M to the sort process alone. GNU locate is also close to 4 times as slow (although that of course only matters on cached runs anyway). So, GNU locate is just a pig pushing things out, with or without any added VFS cache pressure from the things it does by design. As such, we can trust people complaining about it but should first tell them to switch to a halfway sane locate implementation. If you run memory hogs on small memory boxes, you're going to suffer. Leaves the fact that swap-prefetch sometimes helps alleviate these and other kinds of memory-hog situations and as such, might not (again) be a bad idea in itself. Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 17:45 ` Daniel Hazelton 2007-07-27 18:16 ` Rene Herman @ 2007-07-27 22:08 ` Mike Galbraith 2007-07-27 22:51 ` Daniel Hazelton 1 sibling, 1 reply; 535+ messages in thread From: Mike Galbraith @ 2007-07-27 22:08 UTC (permalink / raw) To: Daniel Hazelton Cc: Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Fri, 2007-07-27 at 13:45 -0400, Daniel Hazelton wrote: > On Friday 27 July 2007 06:25:18 Mike Galbraith wrote: > > On Fri, 2007-07-27 at 03:00 -0700, Andrew Morton wrote: > > > So hrm. Are we sure that updatedb is the problem? There are quite a few > > > heavyweight things which happen in the wee small hours. > > > > The balance in _my_ world seems just fine. I don't let any of those > > system maintenance things run while I'm using the system, and it doesn't > > bother me if my working set has to be reconstructed after heavy-weight > > maintenance things are allowed to run. I'm not seeing anything I > > wouldn't expect to see when running a job the size of updatedb. > > > > -Mike > > Do you realize you've totally missed the point? Did you notice that I didn't make one disparaging remark about the patch or the concept behind it? Did you notice that I took _my time_ to test, to actually look at the problem? No, you're too busy running your mouth to appreciate the efforts of others. <snips load of useless spleen venting> Do yourself a favor, go dig into the VM source. Read it, understand it (not terribly easy), _then_ come back and preach to me. Have a nice day. -Mike ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 22:08 ` Mike Galbraith @ 2007-07-27 22:51 ` Daniel Hazelton 2007-07-28 7:48 ` Mike Galbraith 0 siblings, 1 reply; 535+ messages in thread From: Daniel Hazelton @ 2007-07-27 22:51 UTC (permalink / raw) To: Mike Galbraith Cc: Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Friday 27 July 2007 18:08:44 Mike Galbraith wrote: > On Fri, 2007-07-27 at 13:45 -0400, Daniel Hazelton wrote: > > On Friday 27 July 2007 06:25:18 Mike Galbraith wrote: > > > On Fri, 2007-07-27 at 03:00 -0700, Andrew Morton wrote: > > > > So hrm. Are we sure that updatedb is the problem? There are quite a > > > > few heavyweight things which happen in the wee small hours. > > > > > > The balance in _my_ world seems just fine. I don't let any of those > > > system maintenance things run while I'm using the system, and it > > > doesn't bother me if my working set has to be reconstructed after > > > heavy-weight maintenance things are allowed to run. I'm not seeing > > > anything I wouldn't expect to see when running a job the size of > > > updatedb. > > > > > > -Mike > > > > Do you realize you've totally missed the point? > > Did you notice that I didn't make one disparaging remark about the patch > or the concept behind it? Did you notice that I took _my time_ to > test, to actually look at the problem? No, you're too busy running > your mouth to appreciate the efforts of others. If you're done being an ass, take note of the fact that I never even said you were doing that. What I was commenting on was the fact that you (and a lot of the other developers) seem to keep saying "It doesn't happen here, so it doesn't matter!" - ie: If I don't see something happening, it doesn't matter. > <snips load of useless spleen venting> > > Do yourself a favor, go dig into the VM source. 
Read it, understand it > (not terribly easy), _then_ come back and preach to me. I've been trying to do that since the thread started. Note that you snipped where I said (and I'm going to paraphrase myself) "There is another way to fix this, but I don't have the understanding necessary". Now, once more, I'm going to ask: What is so terribly wrong with swap prefetch? Why does it seem that everyone against it says "Its treating a symptom, so it can't go in"? Try coming up with an answer that isn't "I don't see the problem on my $10K system" or similar - try explaining it based on the *technical* merits. Does it cause the processor cache to get thrashed? Does it create locking problems? I stand by my statements, as vitriolic as you and Rene seem to want to get over it. So far in this thread I have not seen one bit of *technical* discussion over the merits, just the bits I've simplified and stated before. > Have a nice day. I am. You being nasty when somebody gets fed up with a line of BS doesn't stop me from having a nice day. Only thing that could make my life any better would be to have the questions I've asked answered, rather than having supposedly intelligent people act like trolls. DRH -- Dialup is like pissing through a pipette. Slow and excruciatingly painful. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 22:51 ` Daniel Hazelton @ 2007-07-28 7:48 ` Mike Galbraith 2007-07-28 15:36 ` Daniel Hazelton 0 siblings, 1 reply; 535+ messages in thread From: Mike Galbraith @ 2007-07-28 7:48 UTC (permalink / raw) To: Daniel Hazelton Cc: Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Fri, 2007-07-27 at 18:51 -0400, Daniel Hazelton wrote: > Now, once more, I'm going to ask: What is so terribly wrong with swap > prefetch? Why does it seem that everyone against it says "Its treating a > symptom, so it can't go in"? And once again, I personally have nothing against swap-prefetch, or something like it. I can see how it or something like it could be made to improve the lives of people who get up in the morning to find their apps sitting on disk due to memory pressure generated by over-night system maintenance operations. The author himself however, says his implementation can't help with updatedb (though people seem to be saying that it does), or anything else that leaves memory full. That IMHO, makes it of questionable value toward solving what people are saying they want swap-prefetch for in the first place. I personally don't care if swap-prefetch goes in or not. -Mike ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-28 7:48 ` Mike Galbraith @ 2007-07-28 15:36 ` Daniel Hazelton 0 siblings, 0 replies; 535+ messages in thread From: Daniel Hazelton @ 2007-07-28 15:36 UTC (permalink / raw) To: Mike Galbraith Cc: Andrew Morton, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Saturday 28 July 2007 03:48:13 Mike Galbraith wrote: > On Fri, 2007-07-27 at 18:51 -0400, Daniel Hazelton wrote: > > Now, once more, I'm going to ask: What is so terribly wrong with swap > > prefetch? Why does it seem that everyone against it says "Its treating a > > symptom, so it can't go in"? > > And once again, I personally have nothing against swap-prefetch, or > something like it. I can see how it or something like it could be made > to improve the lives of people who get up in the morning to find their > apps sitting on disk due to memory pressure generated by over-night > system maintenance operations. > > The author himself however, says his implementation can't help with > updatedb (though people seem to be saying that it does), or anything > else that leaves memory full. That IMHO, makes it of questionable value > toward solving what people are saying they want swap-prefetch for in the > first place. Okay. I have to agree with the author that, in such a situation, it wouldn't help. However there are, without a doubt, other situations where it would help immensely. (memory hogs forcing everything to disk and quitting, one off tasks that don't balloon the cache (kernel compiles, et al) - in those situations swap prefetch would really shine.) DRH -- Dialup is like pissing through a pipette. Slow and excruciatingly painful. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-27 8:47 ` Andrew Morton ` (2 preceding siblings ...) 2007-07-27 10:00 ` Andrew Morton @ 2007-07-29 1:33 ` Rik van Riel 2007-07-29 3:39 ` Andrew Morton 3 siblings, 1 reply; 535+ messages in thread From: Rik van Riel @ 2007-07-29 1:33 UTC (permalink / raw) To: Andrew Morton Cc: Mike Galbraith, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel Andrew Morton wrote: > What I think is killing us here is the blockdev pagecache: the pagecache > which backs those directory entries and inodes. These pages get read > multiple times because they hold multiple directory entries and multiple > inodes. These multiple touches will put those pages onto the active list > so they stick around for a long time and everything else gets evicted. > > I've never been very sure about this policy for the metadata pagecache. We > read the filesystem objects into the dcache and icache and then we won't > read from that page again for a long time (I expect). But the page will > still hang around for a long time. > > It could be that we should leave those pages inactive. Good idea for updatedb. However, it may be a bad idea for files that are often written to. Turning an inode write into a read plus a write does not sound like such a hot idea, we really want to keep those in the cache. I think what you need is to ignore multiple references to the same page when they all happen in one time interval, counting them only if they happen in multiple time intervals. The use-once cleanup (which takes a page flag for PG_new, I know...) would solve that problem. However, it would introduce the problem of having to scan all the pages on the list before a page becomes freeable. We would have to add some background scanning (or a separate list for PG_new pages) to make the initial pageout run use an acceptable amount of CPU time. 
Not sure that complexity will be worth it... -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-29 1:33 ` Rik van Riel @ 2007-07-29 3:39 ` Andrew Morton 0 siblings, 0 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-29 3:39 UTC (permalink / raw) To: Rik van Riel Cc: Mike Galbraith, Ingo Molnar, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Sat, 28 Jul 2007 21:33:59 -0400 Rik van Riel <riel@redhat.com> wrote: > Andrew Morton wrote: > > > What I think is killing us here is the blockdev pagecache: the pagecache > > which backs those directory entries and inodes. These pages get read > > multiple times because they hold multiple directory entries and multiple > > inodes. These multiple touches will put those pages onto the active list > > so they stick around for a long time and everything else gets evicted. > > > > I've never been very sure about this policy for the metadata pagecache. We > > read the filesystem objects into the dcache and icache and then we won't > > read from that page again for a long time (I expect). But the page will > > still hang around for a long time. > > > > It could be that we should leave those pages inactive. > > Good idea for updatedb. > > However, it may be a bad idea for files that are often > written to. Turning an inode write into a read plus a > write does not sound like such a hot idea, we really > want to keep those in the cache. Remember that this problem applies to both inode blocks and to directory blocks. Yes, it might be useful to hold onto an inode block for a future write (atime, mtime, usually), but not a directory block. > I think what you need is to ignore multiple references > to the same page when they all happen in one time > interval, counting them only if they happen in multiple > time intervals. Yes, the sudden burst of accesses for adjacent inode/dirents will be a common pattern, and it'd make heaps of sense to treat that as a single touch. 
It'd have to be done in the fs I guess, and it might be a bit hard to do. And it turns out that embedding the touch_buffer() all the way down in __find_get_block() was convenient, but it's going to be tricky to change. For now I'm fairly inclined to just nuke the touch_buffer() on the read side and maybe add one on the modification codepaths and see what happens. As always, testing is the problem. > The use-once cleanup (which takes a page flag for PG_new, > I know...) would solve that problem. > > However, it would introduce the problem of having to scan > all the pages on the list before a page becomes freeable. > We would have to add some background scanning (or a separate > list for PG_new pages) to make the initial pageout run use > an acceptable amount of CPU time. > > Not sure that complexity will be worth it... > I suspect that the situation we have now is so bad that pretty much anything we do will be an improvement. I've always wondered "ytf is there so much blockdev pagecache?" This machine I'm typing at:

MemTotal:      3975080 kB
MemFree:        750400 kB
Buffers:        547736 kB
Cached:        1299532 kB
SwapCached:      12772 kB
Active:        1789864 kB
Inactive:       861420 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      3975080 kB
LowFree:        750400 kB
SwapTotal:     4875716 kB
SwapFree:      4715660 kB
Dirty:              76 kB
Writeback:           0 kB
Mapped:         638036 kB
Slab:           522724 kB
CommitLimit:   6863256 kB
Committed_AS:  1115632 kB
PageTables:      14452 kB
VmallocTotal: 34359738367 kB
VmallocUsed:     36432 kB
VmallocChunk: 34359696379 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

More than a quarter of my RAM in fs metadata! Most of it I'll bet is on the active list. And the fs on which I do most of the work is mounted noatime.. ^ permalink raw reply [flat|nested] 535+ messages in thread
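The policy Rik proposes and Andrew agrees with -- a burst of references within one time interval counts as a single touch -- can be modeled in a few lines. This is purely an illustrative userspace model of the aging policy, not kernel code, and the two-touch activation threshold is an assumption standing in for the active-list promotion rule:

```python
# Model of interval-coalesced page touches: repeated references to the
# same page within one interval count once, so an updatedb-style burst
# of adjacent inode/dirent lookups no longer promotes the page the way
# repeated touch_buffer() calls do.

class Page:
    def __init__(self):
        self.touches = 0
        self.last_interval = None

    def reference(self, interval):
        # Only the first reference in any given interval counts.
        if interval != self.last_interval:
            self.touches += 1
            self.last_interval = interval

    def is_active(self, threshold=2):
        # Assumed stand-in for "promoted to the active list".
        return self.touches >= threshold

page = Page()
for _ in range(100):          # updatedb-style burst, all in interval 0
    page.reference(interval=0)
print(page.is_active())       # False: the burst coalesced to one touch

page.reference(interval=1)    # touched again in a later interval
print(page.is_active())       # True: genuinely re-used across intervals
```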
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-26 9:40 ` RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] Ingo Molnar 2007-07-26 10:09 ` Andrew Morton @ 2007-07-26 10:20 ` Al Viro 2007-07-26 12:23 ` Andi Kleen 2007-07-27 19:19 ` Paul Jackson 1 sibling, 2 replies; 535+ messages in thread From: Al Viro @ 2007-07-26 10:20 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Thu, Jul 26, 2007 at 11:40:24AM +0200, Ingo Molnar wrote: > below is an updatedb hack that sets vfs_cache_pressure down to 0 during > an updatedb run. Could someone who is affected by the 'morning after' > problem give it a try? If this works then we can think about any other > measures ... BTW, I really wonder how much pain could be avoided if updatedb recorded mtime of directories and checked it. I.e. instead of just doing blind find(1), walk the stored directory tree comparing timestamps with those in filesystem. If directory mtime has not changed, don't bother rereading it and just go for (stored) subdirectories. If it has changed - reread the sucker. If we have a match for stored subdirectory of changed directory, check inumber; if it doesn't match, consider the entire subtree as new one. AFAICS, that could eliminate quite a bit of IO... ^ permalink raw reply [flat|nested] 535+ messages in thread
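Al's scheme can be sketched entirely in userspace. This is hypothetical code, not GNU updatedb: it assumes a snapshot from the previous run mapping each directory path to its mtime and its list of subdirectory names (the inumber check he mentions is omitted for brevity):

```python
import os
import tempfile

def walk_changed(root, snapshot, out):
    """snapshot: {path: (mtime, [subdir names])} recorded by the last run.
    Appends to out every directory that must be re-read this run."""
    st = os.stat(root)
    prev = snapshot.get(root)
    if prev is not None and prev[0] == st.st_mtime:
        # mtime unchanged: don't reread this directory, just descend
        # into the stored subdirectories.
        for sub in prev[1]:
            walk_changed(os.path.join(root, sub), snapshot, out)
        return
    out.append(root)  # changed or new: reread the sucker
    with os.scandir(root) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                walk_changed(entry.path, snapshot, out)

# Tiny demo on a throwaway tree:
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "sub"))
changed = []
walk_changed(root, {}, changed)   # empty snapshot: everything is "new"
print(len(changed))               # 2: both directories were read

snap = {p: (os.stat(p).st_mtime, ["sub"] if p == root else []) for p in changed}
changed2 = []
walk_changed(root, snap, changed2)
print(changed2)                   # []: nothing needed re-reading
```

Unchanged subtrees still get one stat() per directory here; Al's follow-up point is that even that could go away if a dirty bit were propagated toward the root.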
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-26 10:20 ` Al Viro @ 2007-07-26 12:23 ` Andi Kleen 2007-07-26 14:59 ` Al Viro 2007-07-27 19:19 ` Paul Jackson 1 sibling, 1 reply; 535+ messages in thread From: Andi Kleen @ 2007-07-26 12:23 UTC (permalink / raw) To: Al Viro Cc: Ingo Molnar, Andrew Morton, Frank Kingswood, Andi Kleen, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel > BTW, I really wonder how much pain could be avoided if updatedb recorded > mtime of directories and checked it. I.e. instead of just doing blind > find(1), walk the stored directory tree comparing timestamps with those > in filesystem. If directory mtime has not changed, don't bother rereading > it and just go for (stored) subdirectories. If it has changed - reread the > sucker. If we have a match for stored subdirectory of changed directory, > check inumber; if it doesn't match, consider the entire subtree as new > one. AFAICS, that could eliminate quite a bit of IO... That would just save reading the directories. Not sure it helps that much. Much better would be actually if it didn't stat the individual files (and force their dentries/inodes in). I bet it does that to find out if they are directories or not. But in a modern system it could just check the type in the dirent on file systems that support that and not do a stat. Then you would get much less dentries/inodes. Also I expect in general the new slub dcache freeing that is pending will improve things a lot. But even if updatedb was fixed to be more efficient we probably still need a general solution for other tree walking programs that cannot be optimized this way. -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
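Andi's d_type point is directly visible from Python, whose os.scandir exposes readdir's d_type: DirEntry.is_dir() is answered from the dirent on filesystems that fill d_type in, with no per-file stat(). A sketch (not updatedb's code):

```python
import os
import tempfile

# Count subdirectories without stat()ing regular files: on filesystems
# that support d_type, is_dir(follow_symlinks=False) comes straight
# from the dirent, so only directories pull dentries/inodes in.

def count_dirs_without_stat(root):
    ndirs = 0
    with os.scandir(root) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                ndirs += 1
    return ndirs

# Demo: one directory and one file.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "d"))
open(os.path.join(root, "f"), "w").close()
print(count_dirs_without_stat(root))  # 1
```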
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-26 12:23 ` Andi Kleen @ 2007-07-26 14:59 ` Al Viro 2007-07-11 20:41 ` Pavel Machek 0 siblings, 1 reply; 535+ messages in thread From: Al Viro @ 2007-07-26 14:59 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Andrew Morton, Frank Kingswood, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel On Thu, Jul 26, 2007 at 02:23:30PM +0200, Andi Kleen wrote: > That would just save reading the directories. Not sure > it helps that much. Much better would be actually if it didn't stat the > individual files (and force their dentries/inodes in). I bet it does that to > find out if they are directories or not. But in a modern system it could just > check the type in the dirent on file systems that support > that and not do a stat. Then you would get much less dentries/inodes. FWIW, find(1) does *not* stat non-directories (and neither would this approach). So it's just dentries for directories and you can't realistically skip those. OK, you could - if you had banned cross-directory rename for directories and propagated "dirty since last look" towards root (note that it would be a boolean, not a timestamp). Then we could skip unchanged subtrees completely... ^ permalink raw reply [flat|nested] 535+ messages in thread
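[Editorial sketch] The "dirty since last look" scheme Al outlines, as a toy in-memory model — illustrative only, since doing this for real would need VFS support, and as he notes it only works because the flag is a boolean propagated toward the root (with cross-directory directory renames banned):

```python
class DirNode:
    """One directory in a toy in-memory tree with a 'dirty since last
    look' boolean (not a timestamp) that propagates toward the root."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.dirty = False
        if parent is not None:
            parent.children.append(self)

    def touch(self):
        """A change happened here: walk toward the root, stopping at
        the first ancestor that is already marked dirty."""
        node = self
        while node is not None and not node.dirty:
            node.dirty = True
            node = node.parent

def walk_changed(node, visited):
    """Visit only subtrees that changed since the last walk, clearing
    the flags as we go (the walk itself is the 'last look')."""
    if not node.dirty:
        return visited  # entire subtree unchanged: skipped completely
    visited.append(node.name)
    node.dirty = False
    for child in node.children:
        walk_changed(child, visited)
    return visited

root = DirNode("/")
usr = DirNode("usr", root)
share = DirNode("share", usr)
home = DirNode("home", root)
share.touch()                  # something changed under /usr/share
print(walk_changed(root, []))  # ['/', 'usr', 'share'] -- /home skipped
```

Because propagation stops at the first already-dirty ancestor, marking a change is O(depth) at worst and usually O(1), while the rescan skips every clean subtree without touching it.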
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-26 14:59 ` Al Viro @ 2007-07-11 20:41 ` Pavel Machek 0 siblings, 0 replies; 535+ messages in thread From: Pavel Machek @ 2007-07-11 20:41 UTC (permalink / raw) To: Al Viro Cc: Andi Kleen, Ingo Molnar, Andrew Morton, Frank Kingswood, Nick Piggin, Ray Lee, Jesper Juhl, ck list, Paul Jackson, linux-mm, linux-kernel Hi! > > That would just save reading the directories. Not sure > > it helps that much. Much better would be actually if it didn't stat the > > individual files (and force their dentries/inodes in). I bet it does that to > > find out if they are directories or not. But in a modern system it could just > > check the type in the dirent on file systems that support > > that and not do a stat. Then you would get much less dentries/inodes. > > FWIW, find(1) does *not* stat non-directories (and neither would this > approach). So it's just dentries for directories and you can't realistically > skip those. OK, you could - if you had banned cross-directory rename > for directories and propagated "dirty since last look" towards root (note > that it would be a boolean, not a timestamp). Then we could skip unchanged > subtrees completely... Could we help it a little from the kernel and set 'dirty since last look' on directory renames? I mean, this is not only updatedb. KDE startup is limited by this, too. It would be nice to have an effective 'what changed in tree' operation. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: RFT: updatedb "morning after" problem [was: Re: -mm merge plans for 2.6.23] 2007-07-26 10:20 ` Al Viro 2007-07-26 12:23 ` Andi Kleen @ 2007-07-27 19:19 ` Paul Jackson 1 sibling, 0 replies; 535+ messages in thread From: Paul Jackson @ 2007-07-27 19:19 UTC (permalink / raw) To: Al Viro Cc: mingo, akpm, frank, andi, nickpiggin, ray-lk, jesper.juhl, ck, linux-mm, linux-kernel Al Viro wrote: > BTW, I really wonder how much pain could be avoided if updatedb recorded > mtime of directories and checked it. Someone mentioned a variant of slocate above that they called mlocate, and that Red Hat ships, that seems to do this (if I understand you and what mlocate does correctly). -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-25 4:06 ` Nick Piggin ` (3 preceding siblings ...) 2007-07-25 20:46 ` Andi Kleen @ 2007-07-31 16:37 ` Matthew Hawkins 2007-08-06 2:11 ` Nick Piggin 4 siblings, 1 reply; 535+ messages in thread From: Matthew Hawkins @ 2007-07-31 16:37 UTC (permalink / raw) To: Nick Piggin Cc: Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton [-- Attachment #1: Type: text/plain, Size: 1199 bytes --] On 7/25/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > I guess /proc/meminfo, /proc/zoneinfo, /proc/vmstat, /proc/slabinfo > before and after the updatedb run with the latest kernel would be a > first step. top and vmstat output during the run wouldn't hurt either. Hi Nick, I've attached two files with this kind of info. Being up at the cron hours of the morning meant I got a better picture of what my system is doing. Here's a short summary of what I saw in top: beagleindexer used gobs of ram. 600M or so (I have 1G) updatedb didn't use much ram, but while it was running kswapd kept on frequenting the top 10 cpu hogs - it would stick around for 5 seconds or so then disappear for no more than 10 seconds, then come back again. This behaviour persisted during the run. updatedb ran third (beagleindexer was first, then update-dlocatedb) I'm going to do this again, this time under a CFS kernel & use Ingo's sched_debug script to see what the scheduler is doing also. Let me know if there's anything else you wish to see. The running kernel at the time was 2.6.22.1-ck. There's no slabinfo since I'm using slub instead (and I don't have slub debug enabled). 
Cheers, -- Matt [-- Attachment #2: beaglecron.ck --] [-- Type: application/octet-stream, Size: 4539 bytes --] MemTotal: 1028016 kB MemFree: 9368 kB Buffers: 56932 kB Cached: 115820 kB SwapCached: 19968 kB Active: 463284 kB Inactive: 418952 kB SwapTotal: 2096472 kB SwapFree: 1855632 kB Dirty: 2272 kB Writeback: 0 kB AnonPages: 695744 kB Mapped: 42732 kB Slab: 89452 kB SReclaimable: 74204 kB SUnreclaim: 15248 kB PageTables: 14656 kB NFS_Unstable: 0 kB Bounce: 0 kB CommitLimit: 2610480 kB Committed_AS: 1478988 kB VmallocTotal: 34359738367 kB VmallocUsed: 313292 kB VmallocChunk: 34359424507 kB nr_free_pages 3415 nr_inactive 102651 nr_active 115308 nr_anon_pages 169943 nr_mapped 10351 nr_file_pages 49521 nr_dirty 774 nr_writeback 0 nr_slab_reclaimable 20088 nr_slab_unreclaimable 3898 nr_page_table_pages 3664 nr_unstable 0 nr_bounce 0 nr_vmscan_write 64203 pgpgin 595981 pgpgout 472600 pswpin 3408 pswpout 63208 pgalloc_dma 3 pgalloc_dma32 1976896 pgalloc_normal 0 pgfree 1980436 pgactivate 117968 pgdeactivate 213723 pgfault 3468419 pgmajfault 3466 pgrefill_dma 0 pgrefill_dma32 562570 pgrefill_normal 0 pgsteal_dma 0 pgsteal_dma32 173803 pgsteal_normal 0 pgscan_kswapd_dma 0 pgscan_kswapd_dma32 241568 pgscan_kswapd_normal 0 pgscan_direct_dma 0 pgscan_direct_dma32 21792 pgscan_direct_normal 0 pginodesteal 3 slabs_scanned 455552 kswapd_steal 162131 kswapd_inodesteal 2519 pageoutrun 2497 allocstall 159 pgrotated 63202 00000000-0009f7ff : System RAM 00000000-00000000 : Crash kernel 0009f800-0009ffff : reserved 000cc800-000cffff : pnp 00:0c 000f0000-000fffff : reserved 00100000-3ffeffff : System RAM 00200000-0043db5f : Kernel code 0043db60-0056f6cf : Kernel data 3fff0000-3fff2fff : ACPI Non-volatile Storage 3fff3000-3fffffff : ACPI Tables d0000000-dfffffff : reserved e0000000-efffffff : PCI Bus #05 e0000000-efffffff : 0000:05:00.0 f0000000-f3ffffff : PCI Bus #05 f0000000-f1ffffff : 0000:05:00.0 f2000000-f2ffffff : 0000:05:00.0 f2000000-f2ffffff : nvidia f3000000-f301ffff : 
0000:05:00.0 f4000000-f4000fff : 0000:00:04.0 f4000000-f4000fff : NVidia CK804 f4001000-f4001fff : 0000:00:07.0 f4001000-f4001fff : sata_nv f4002000-f4002fff : 0000:00:08.0 f4002000-f4002fff : sata_nv f4003000-f4003fff : 0000:00:0a.0 f4003000-f4003fff : forcedeth f4004000-f4004fff : 0000:00:02.0 f4004000-f4004fff : ohci_hcd feb00000-feb000ff : 0000:00:02.1 feb00000-feb000ff : ehci_hcd fec00000-fec00fff : IOAPIC 0 fee00000-fee00fff : Local APIC Node 0, zone DMA pages free 736 min 11 low 13 high 16 scanned 0 (a: 9 i: 9) spanned 4096 present 2858 nr_free_pages 736 nr_inactive 0 nr_active 0 nr_anon_pages 0 nr_mapped 1 nr_file_pages 0 nr_dirty 0 nr_writeback 0 nr_slab_reclaimable 0 nr_slab_unreclaimable 2 nr_page_table_pages 0 nr_unstable 0 nr_bounce 0 nr_vmscan_write 0 protection: (0, 994, 994) pagesets cpu: 0 pcp: 0 count: 0 high: 0 batch: 1 cpu: 0 pcp: 1 count: 0 high: 0 batch: 1 vm stats threshold: 4 cpu: 1 pcp: 0 count: 0 high: 0 batch: 1 cpu: 1 pcp: 1 count: 0 high: 0 batch: 1 vm stats threshold: 4 all_unreclaimable: 1 prev_priority: 12 start_pfn: 0 Node 0, zone DMA32 pages free 1786 min 1002 low 1252 high 1503 scanned 0 (a: 29 i: 23) spanned 258032 present 254505 nr_free_pages 1786 nr_inactive 93406 nr_active 118044 nr_anon_pages 154539 nr_mapped 9819 nr_file_pages 58466 nr_dirty 115 nr_writeback 0 nr_slab_reclaimable 27261 nr_slab_unreclaimable 4205 nr_page_table_pages 3664 nr_unstable 0 nr_bounce 0 nr_vmscan_write 78966 protection: (0, 0, 0) pagesets cpu: 0 pcp: 0 count: 29 high: 186 batch: 31 cpu: 0 pcp: 1 count: 3 high: 62 batch: 15 vm stats threshold: 16 cpu: 1 pcp: 0 count: 2 high: 186 batch: 31 cpu: 1 pcp: 1 count: 13 high: 62 batch: 15 vm stats threshold: 16 all_unreclaimable: 0 prev_priority: 12 start_pfn: 4096 [-- Attachment #3: updatedbcron.ck --] [-- Type: application/octet-stream, Size: 5382 bytes --] nr_free_pages 3478 nr_inactive 81057 nr_active 131617 nr_anon_pages 92818 nr_mapped 9092 nr_file_pages 130427 nr_dirty 4145 nr_writeback 0 
nr_slab_reclaimable 22106 nr_slab_unreclaimable 5950 nr_page_table_pages 3572 nr_unstable 0 nr_bounce 0 nr_vmscan_write 145131 pgpgin 1305961 pgpgout 1321420 pswpin 21001 pswpout 144028 pgalloc_dma 3 pgalloc_dma32 2649338 pgalloc_normal 0 pgfree 2652884 pgactivate 251412 pgdeactivate 252293 pgfault 3565858 pgmajfault 5864 pgrefill_dma 0 pgrefill_dma32 836934 pgrefill_normal 0 pgsteal_dma 0 pgsteal_dma32 352216 pgsteal_normal 0 pgscan_kswapd_dma 0 pgscan_kswapd_dma32 510400 pgscan_kswapd_normal 0 pgscan_direct_dma 0 pgscan_direct_dma32 22112 pgscan_direct_normal 0 pginodesteal 3 slabs_scanned 809088 kswapd_steal 340352 kswapd_inodesteal 7286 pageoutrun 4661 allocstall 161 pgrotated 144024 00000000-0009f7ff : System RAM 00000000-00000000 : Crash kernel 0009f800-0009ffff : reserved 000cc800-000cffff : pnp 00:0c 000f0000-000fffff : reserved 00100000-3ffeffff : System RAM 00200000-0043db5f : Kernel code 0043db60-0056f6cf : Kernel data 3fff0000-3fff2fff : ACPI Non-volatile Storage 3fff3000-3fffffff : ACPI Tables d0000000-dfffffff : reserved e0000000-efffffff : PCI Bus #05 e0000000-efffffff : 0000:05:00.0 f0000000-f3ffffff : PCI Bus #05 f0000000-f1ffffff : 0000:05:00.0 f2000000-f2ffffff : 0000:05:00.0 f2000000-f2ffffff : nvidia f3000000-f301ffff : 0000:05:00.0 f4000000-f4000fff : 0000:00:04.0 f4000000-f4000fff : NVidia CK804 f4001000-f4001fff : 0000:00:07.0 f4001000-f4001fff : sata_nv f4002000-f4002fff : 0000:00:08.0 f4002000-f4002fff : sata_nv f4003000-f4003fff : 0000:00:0a.0 f4003000-f4003fff : forcedeth f4004000-f4004fff : 0000:00:02.0 f4004000-f4004fff : ohci_hcd feb00000-feb000ff : 0000:00:02.1 feb00000-feb000ff : ehci_hcd fec00000-fec00fff : IOAPIC 0 fee00000-fee00fff : Local APIC MemTotal: 1028016 kB MemFree: 26000 kB Buffers: 352752 kB Cached: 98672 kB SwapCached: 54272 kB Active: 513476 kB Inactive: 321336 kB SwapTotal: 2096472 kB SwapFree: 1523124 kB Dirty: 884 kB Writeback: 0 kB AnonPages: 371432 kB Mapped: 36368 kB Slab: 116016 kB SReclaimable: 91864 kB 
SUnreclaim: 24152 kB PageTables: 14288 kB NFS_Unstable: 0 kB Bounce: 0 kB CommitLimit: 2610480 kB Committed_AS: 1424352 kB VmallocTotal: 34359738367 kB VmallocUsed: 313292 kB VmallocChunk: 34359424507 kB nr_free_pages 4210 nr_inactive 81387 nr_active 129217 nr_anon_pages 92867 nr_mapped 9100 nr_file_pages 128320 nr_dirty 548 nr_writeback 0 nr_slab_reclaimable 23296 nr_slab_unreclaimable 6120 nr_page_table_pages 3572 nr_unstable 0 nr_bounce 0 nr_vmscan_write 145131 pgpgin 1356125 pgpgout 1348944 pswpin 21066 pswpout 144028 pgalloc_dma 3 pgalloc_dma32 2687673 pgalloc_normal 0 pgfree 2691918 pgactivate 257106 pgdeactivate 260491 pgfault 3566843 pgmajfault 5880 pgrefill_dma 0 pgrefill_dma32 859380 pgrefill_normal 0 pgsteal_dma 0 pgsteal_dma32 367095 pgsteal_normal 0 pgscan_kswapd_dma 0 pgscan_kswapd_dma32 525312 pgscan_kswapd_normal 0 pgscan_direct_dma 0 pgscan_direct_dma32 22112 pgscan_direct_normal 0 pginodesteal 3 slabs_scanned 828928 kswapd_steal 355231 kswapd_inodesteal 7286 pageoutrun 4897 allocstall 161 pgrotated 144024 Node 0, zone DMA pages free 736 min 11 low 13 high 16 scanned 0 (a: 18 i: 18) spanned 4096 present 2858 nr_free_pages 736 nr_inactive 0 nr_active 0 nr_anon_pages 0 nr_mapped 1 nr_file_pages 0 nr_dirty 0 nr_writeback 0 nr_slab_reclaimable 0 nr_slab_unreclaimable 2 nr_page_table_pages 0 nr_unstable 0 nr_bounce 0 nr_vmscan_write 0 protection: (0, 994, 994) pagesets cpu: 0 pcp: 0 count: 0 high: 0 batch: 1 cpu: 0 pcp: 1 count: 0 high: 0 batch: 1 vm stats threshold: 4 cpu: 1 pcp: 0 count: 0 high: 0 batch: 1 cpu: 1 pcp: 1 count: 0 high: 0 batch: 1 vm stats threshold: 4 all_unreclaimable: 1 prev_priority: 12 start_pfn: 0 Node 0, zone DMA32 pages free 2846 min 1002 low 1252 high 1503 scanned 0 (a: 0 i: 20) spanned 258032 present 254505 nr_free_pages 2846 nr_inactive 78337 nr_active 131382 nr_anon_pages 92885 nr_mapped 9099 nr_file_pages 127433 nr_dirty 411 nr_writeback 0 nr_slab_reclaimable 24595 nr_slab_unreclaimable 6314 nr_page_table_pages 3572 
nr_unstable 0 nr_bounce 0 nr_vmscan_write 145131 protection: (0, 0, 0) pagesets cpu: 0 pcp: 0 count: 29 high: 186 batch: 31 cpu: 0 pcp: 1 count: 13 high: 62 batch: 15 vm stats threshold: 16 cpu: 1 pcp: 0 count: 9 high: 186 batch: 31 cpu: 1 pcp: 1 count: 12 high: 62 batch: 15 vm stats threshold: 16 all_unreclaimable: 0 prev_priority: 12 start_pfn: 4096 ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-31 16:37 ` [ck] Re: -mm merge plans for 2.6.23 Matthew Hawkins @ 2007-08-06 2:11 ` Nick Piggin 0 siblings, 0 replies; 535+ messages in thread From: Nick Piggin @ 2007-08-06 2:11 UTC (permalink / raw) To: Matthew Hawkins Cc: Ray Lee, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson, Andrew Morton Matthew Hawkins wrote: > On 7/25/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >>I guess /proc/meminfo, /proc/zoneinfo, /proc/vmstat, /proc/slabinfo >>before and after the updatedb run with the latest kernel would be a >>first step. top and vmstat output during the run wouldn't hurt either. > > > Hi Nick, > > I've attached two files with this kind of info. Being up at the cron > hours of the morning meant I got a better picture of what my system is > doing. Here's a short summary of what I saw in top: > > beagleindexer used gobs of ram. 600M or so (I have 1G) Hmm OK, beagleindexer. I thought beagle didn't need frequent reindexing because of inotify? Oh well... > updatedb didn't use much ram, but while it was running kswapd kept on > frequenting the top 10 cpu hogs - it would stick around for 5 seconds > or so then disappear for no more than 10 seconds, then come back > again. This behaviour persisted during the run. updatedb ran third > (beagleindexer was first, then update-dlocatedb) Kswapd will use CPU when memory is low, even if there is no swapping. Your "buffers" grew by 600% (from 50MB to 350MB), and slab also grew by a few thousand entries. This is not just a problem when it pushes out swap, it will also harm filebacked working set. This (which Ray's traces also show) is a bit of a problem. As Andrew noticed, use-once isn't working well for buffer cache, and it doesn't really for dentry and inode cache either (although those don't seem to be as much of a problem on your workload). Andrew has done a little test patch for this in -mm, but it probably wants more work and testing. 
If you can test the -mm kernel and see if things are improved, that would help. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-24 16:15 ` Ray Lee 2007-07-24 17:46 ` [ck] " Rashkae 2007-07-25 4:06 ` Nick Piggin @ 2007-07-25 4:46 ` david 2007-07-25 8:00 ` Rene Herman 2007-07-25 15:55 ` Ray Lee 2 siblings, 2 replies; 535+ messages in thread From: david @ 2007-07-25 4:46 UTC (permalink / raw) To: Ray Lee Cc: Nick Piggin, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Tue, 24 Jul 2007, Ray Lee wrote: > On 7/23/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: >> Ray Lee wrote: > >> Looking at your past email, you have a 1GB desktop system and your >> overnight updatedb run is causing stuff to get swapped out such that >> swap prefetch makes it significantly better. This is really >> intriguing to me, and I would hope we can start by making this >> particular workload "not suck" without swap prefetch (and hopefully >> make it even better than it currently is with swap prefetch because >> we'll try not to evict useful file backed pages as well). > > updatedb is an annoying case, because one would hope that there would > be a better way to deal with that highly specific workload. It's also > pretty stat dominant, which puts it roughly in the same category as a > git diff. (They differ in that updatedb does a lot of open()s and > getdents on directories, git merely does a ton of lstat()s instead.) > > Anyway, my point is that I worry that tuning for an unusual and > infrequent workload (which updatedb certainly is), is the wrong way to > go. updatedb pushing out program data may be able to be improved on with drop behind or similar. however another scenario that causes a similar problem is when a user is busy using one of the big memory hogs and then switches to another (think switching between openoffice and firefox) >> After that we can look at other problems that swap prefetch helps >> with, or think of some ways to measure your "whole day" scenario. 
>> >> So when/if you have time, I can cook up a list of things to monitor >> and possibly a patch to add some instrumentation over this updatedb >> run. > > That would be appreciated. Don't spend huge amounts of time on it, > okay? Point me the right direction, and we'll see how far I can run > with it. you could make a synthetic test by writing a memory hog that allocates 3/4 of your ram then pauses waiting for input and then randomly accesses the memory for a while (say randomly accessing 2x # of pages allocated) and then pausing again before repeating run two of these, alternating which one is running at any one time. time how long it takes to do the random accesses. the difference in this time should be a fair example of how much it would impact the user. by the way, I've also seen comments on the Postgres performance mailing list about how slow linux is compared to other OS's in pulling data back in that's been pushed out to swap (not a factor on dedicated database machines, but a big factor on multi-purpose machines) >> Anyway, I realise swap prefetching has some situations where it will >> fundamentally outperform even the page replacement oracle. This is >> why I haven't asked for it to be dropped: it isn't a bad idea at all. > > <nod> > >> However, if we can improve basic page reclaim where it is obviously >> lacking, that is always preferable. eg: being a highly speculative >> operation, swap prefetch is not great for power efficiency -- but we >> still want laptop users to have a good experience as well, right? > > Absolutely. Disk I/O is the enemy, and the best I/O is one you never > had to do in the first place. almost always true, however there is some amount of I/O that is free with today's drives (remember, they read the entire track into ram and then give you the sectors on the track that you asked for). and if you have a raid array this is even more true. 
if you read one sector in from a raid5 array you have done all the same I/O that you would have to do to read in the entire stripe, but I don't believe that the current system will keep it all around if it exceeds the readahead limit. so in many cases readahead may end up being significantly cheaper than you expect. David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 4:46 ` david @ 2007-07-25 8:00 ` Rene Herman 2007-07-25 8:07 ` david 2007-07-25 15:55 ` Ray Lee 1 sibling, 1 reply; 535+ messages in thread From: Rene Herman @ 2007-07-25 8:00 UTC (permalink / raw) To: david Cc: Ray Lee, Nick Piggin, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel [-- Attachment #1: Type: text/plain, Size: 760 bytes --] On 07/25/2007 06:46 AM, david@lang.hm wrote: > you could make a synthetic test by writing a memory hog that allocates > 3/4 of your ram then pauses waiting for input and then randomly accesses > the memory for a while (say randomly accessing 2x # of pages allocated) > and then pausing again before repeating Something like this? > run two of these, alternating which one is running at any one time. time > how long it takes to do the random accesses. > > the difference in this time should be a fair example of how much it > would impact the user. Notenotenote, not sure what you're going to show with it (times are simply as horrendous as I'd expect) but thought I'd try to inject something other than steaming cups of 4-letter beverages. Rene. 
[-- Attachment #2: hog.c --]
[-- Type: text/plain, Size: 974 bytes --]

/* gcc -W -Wall -o hog hog.c */
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
	int pages, pagesize, i;
	unsigned char *mem;
	struct timeval tv;

	pages = sysconf(_SC_PHYS_PAGES);
	if (pages < 0) {
		perror("_SC_PHYS_PAGES");
		return EXIT_FAILURE;
	}
	pages = (3 * pages) / 4;

	pagesize = sysconf(_SC_PAGESIZE);
	if (pagesize < 0) {
		perror("_SC_PAGESIZE");
		return EXIT_FAILURE;
	}

	mem = malloc(pages * pagesize);
	if (!mem) {
		fprintf(stderr, "out of memory\n");
		return EXIT_FAILURE;
	}

	/* Touch every page once so the allocation is actually backed. */
	for (i = 0; i < pages; i++)
		mem[i * pagesize] = 0;

	gettimeofday(&tv, NULL);
	srand((unsigned int)tv.tv_sec);

	while (1) {
		struct timeval start;

		getchar();
		gettimeofday(&start, NULL);
		for (i = 0; i < 2 * pages; i++)
			mem[(rand() / (RAND_MAX / pages + 1)) * pagesize] = 0;
		gettimeofday(&tv, NULL);
		timersub(&tv, &start, &tv);
		printf("%lu.%06lu\n", tv.tv_sec, tv.tv_usec);
	}
	return EXIT_SUCCESS;
}

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 8:00 ` Rene Herman @ 2007-07-25 8:07 ` david 2007-07-25 8:29 ` Rene Herman 0 siblings, 1 reply; 535+ messages in thread From: david @ 2007-07-25 8:07 UTC (permalink / raw) To: Rene Herman Cc: Ray Lee, Nick Piggin, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Wed, 25 Jul 2007, Rene Herman wrote: > On 07/25/2007 06:46 AM, david@lang.hm wrote: > >> you could make a synthetic test by writing a memory hog that allocates 3/4 >> of your ram then pauses waiting for input and then randomly accesses the >> memory for a while (say randomly accessing 2x # of pages allocated) and >> then pausing again before repeating > > Something like this? > >> run two of these, alternating which one is running at any one time. time >> how long it takes to do the random accesses. >> >> the difference in this time should be a fair example of how much it would >> impact the user. > > Notenotenote, not sure what you're going to show with it (times are simply as > horrendous as I'd expect) but thought I'd try to inject something other than > steaming cups of 4-letter beverages. when the swap readahead is enabled does it make a significant difference in the time to do the random access? if it does that should show a direct benefit of the patch in a simulation of a relatively common workflow (start up a memory hog like openoffice then try and go back to your prior work) David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 8:07 ` david @ 2007-07-25 8:29 ` Rene Herman 2007-07-25 8:31 ` david 0 siblings, 1 reply; 535+ messages in thread From: Rene Herman @ 2007-07-25 8:29 UTC (permalink / raw) To: david Cc: Ray Lee, Nick Piggin, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 07/25/2007 10:07 AM, david@lang.hm wrote: > On Wed, 25 Jul 2007, Rene Herman wrote: >> Something like this? [ ... ] > when the swap readahead is enabled does it make a significant difference > in the time to do the random access? I don't use swap prefetch (nor -ck or -mm). If someone who has the patch applied waits to hit enter until swap prefetch has prefetched it all back in again, it certainly will. Swap prefetch's potential to do larger reads back from swapspace than a random segfaulting app could well be very significant. Reads are dwarfed by seeks. If this program does what you wanted, please use it to show us. Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 8:29 ` Rene Herman @ 2007-07-25 8:31 ` david 2007-07-25 8:33 ` david 0 siblings, 1 reply; 535+ messages in thread From: david @ 2007-07-25 8:31 UTC (permalink / raw) To: Rene Herman Cc: Ray Lee, Nick Piggin, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Wed, 25 Jul 2007, Rene Herman wrote: > On 07/25/2007 10:07 AM, david@lang.hm wrote: > >> On Wed, 25 Jul 2007, Rene Herman wrote: > >> > Something like this? > > [ ... ] > >> when the swap readahead is enabled does it make a significant difference >> in the time to do the random access? > > I don't use swap prefetch (nor -ck or -mm). If someone who has the patch > applied waits to hit enter until swap prefetch has prefetched it all back in > again, it certainly will. > > Swap prefetch's potential to do larger reads back from swapspace than a > random segfaulting app could well be very significant. Reads are dwarfed by > seeks. If this program does what you wanted, please use it to show us. I haven't used swap prefetch either, the call was put out for what could be used to test the performance, and I was suggesting a test. if nobody else follows up on this I'll try to get some time to test it myself in a day or two. David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 8:31 ` david @ 2007-07-25 8:33 ` david 2007-07-25 10:58 ` Rene Herman 0 siblings, 1 reply; 535+ messages in thread From: david @ 2007-07-25 8:33 UTC (permalink / raw) To: Rene Herman Cc: Ray Lee, Nick Piggin, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Wed, 25 Jul 2007, david@lang.hm wrote: > Subject: Re: -mm merge plans for 2.6.23 > > On Wed, 25 Jul 2007, Rene Herman wrote: > >> On 07/25/2007 10:07 AM, david@lang.hm wrote: >> >> > On Wed, 25 Jul 2007, Rene Herman wrote: >> >> > > Something like this? >> >> [ ... ] >> >> > when the swap readahead is enabled does it make a significant >> > difference >> > in the time to do the random access? >> >> I don't use swap prefetch (nor -ck or -mm). If someone who has the patch >> applied waits to hit enter until swap prefetch has prefetched it all back >> in again, it certainly will. >> >> Swap prefetch's potential to do larger reads back from swapspace than a >> random segfaulting app could well be very significant. Reads are dwarfed >> by seeks. If this program does what you wanted, please use it to show us. > > I haven't used swap prefetch either, the call was put out for what could be > used to test the performance, and I was suggesting a test. > > if nobody else follows up on this I'll try to get some time to test it myself > in a day or two. this assumes that this isn't ruled an invalid test in the meantime. in any case thanks for coding this up so quickly. David Lang ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 8:33 ` david @ 2007-07-25 10:58 ` Rene Herman 0 siblings, 0 replies; 535+ messages in thread From: Rene Herman @ 2007-07-25 10:58 UTC (permalink / raw) To: david Cc: Ray Lee, Nick Piggin, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On 07/25/2007 10:33 AM, david@lang.hm wrote: >> I haven't used swap prefetch either, the call was put out for what >> could be used to test the performance, and I was suggesting a test. >> >> if nobody else follows up on this I'll try to get some time to test it >> myself in a day or two. > > this assumes that this isn't ruled an invalid test in the meantime. Let's save a little time and guess. While two instances of the hog are running no physical memory is free (as together they take up 1.5x physical) meaning that swap-prefetch wouldn't get a chance to do anything and wouldn't make a difference. As such, the two-instance test as you suggested would in fact not be testing anything it seems. However, if you quit one, and idle long enough to continue with the other one until swap-prefetch prefetched all its memory back in, it should be a difference on the order of minutes, even total if swap prefetch fetched it back in without seeking all over swap-space, and "total" isn't applicable if the idle time really is free. A program randomly touching single pages all over memory is a contrived worst case scenario and not a real-world issue. It is a boundary condition though, and it's simply quite impossible to think of any example where swap-prefetch would _not_ give you a snappier feeling machine after you've been idling. So really the only question would seem to be -- does it hurt any if you have _not_ been? Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-25 4:46 ` david 2007-07-25 8:00 ` Rene Herman @ 2007-07-25 15:55 ` Ray Lee 2007-07-25 20:16 ` Al Boldi 1 sibling, 1 reply; 535+ messages in thread From: Ray Lee @ 2007-07-25 15:55 UTC (permalink / raw) To: david, Al Boldi Cc: Nick Piggin, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel Hoo boy, lots of messages this morning. (Al? I've added you to the CC: because of your swap-in vs swap-out speed report from January. See below -- half-way down or so -- for more details.) On 7/24/07, david@lang.hm <david@lang.hm> wrote: > On Tue, 24 Jul 2007, Ray Lee wrote: > > > On 7/23/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> Ray Lee wrote: > > > >> Looking at your past email, you have a 1GB desktop system and your > >> overnight updatedb run is causing stuff to get swapped out such that > >> swap prefetch makes it significantly better. This is really > >> intriguing to me, and I would hope we can start by making this > >> particular workload "not suck" without swap prefetch (and hopefully > >> make it even better than it currently is with swap prefetch because > >> we'll try not to evict useful file backed pages as well). > > > > updatedb is an annoying case, because one would hope that there would > > be a better way to deal with that highly specific workload. It's also > > pretty stat dominant, which puts it roughly in the same category as a > > git diff. (They differ in that updatedb does a lot of open()s and > > getdents on directories, git merely does a ton of lstat()s instead.) > > > > Anyway, my point is that I worry that tuning for an unusual and > > infrequent workload (which updatedb certainly is), is the wrong way to > > go. > > updatedb pushing out program data may be able to be improved on with drop > behind or similar. Hmm, I thought drop-behind wasn't going to be able to target metadata? 
> however another scenario that causes a similar problem is when a user is > busy using one of the big memory hogs and then switches to another (think > switching between openoffice and firefox) Yes, and that was the core of my original report months ago. I'm working for a while on one task, go to openoffice to view a report, or gimp to tweak the colors on a photo before uploading it, and then go back to my email and... and... and... there we go. The faults that occur when I context switch are what's most annoying. > >> After that we can look at other problems that swap prefetch helps > >> with, or think of some ways to measure your "whole day" scenario. > >> > >> So when/if you have time, I can cook up a list of things to monitor > >> and possibly a patch to add some instrumentation over this updatedb > >> run. > > > > That would be appreciated. Don't spend huge amounts of time on it, > > okay? Point me the right direction, and we'll see how far I can run > > with it. > > you could make a synthetic test by writing a memory hog that allocates 3/4 > of your ram then pauses waiting for input and then randomly accesses the > memory for a while (say randomly accessing 2x # of pages allocated) and > then pausing again before repeating Con wrote a benchmark much like that. It showed measurable improvement with swap prefetch. > by the way, I've also seen comments on the Postgres performance mailing > list about how slow linux is compared to other OS's in pulling data back > in that's been pushed out to swap (not a factor on dedicated database > machines, but a big factor on multi-purpose machines) Yeah, akpm and... one of the usual suspects, had mentioned something such as 2.6 is half the speed of 2.4 for swapin. (Let's see if I can find a reference for that, it's been a year or more...) Okay, misremembered. Swap in is half the speed of swap out ( http://lkml.org/lkml/2007/1/22/173 ). 
Al Boldi (added to the CC:, poor sod) is the one who knows how to measure that, I'm guessing.

Al? How are you coming up with those figures? I'm interested in reproducing it. It could be due to something stupid, such as the VM faulting things out in reverse order or something...

> >> Anyway, I realise swap prefetching has some situations where it will
> >> fundamentally outperform even the page replacement oracle. This is
> >> why I haven't asked for it to be dropped: it isn't a bad idea at all.
> >
> > <nod>
> >
> >> However, if we can improve basic page reclaim where it is obviously
> >> lacking, that is always preferable. eg: being a highly speculative
> >> operation, swap prefetch is not great for power efficiency -- but we
> >> still want laptop users to have a good experience as well, right?
> >
> > Absolutely. Disk I/O is the enemy, and the best I/O is one you never
> > had to do in the first place.
>
> almost always true, however there is some amount of I/O that is free with
> todays drives (remember, they read the entire track into ram and then
> give you the sectors on the track that you asked for). and if you have a
> raid array this is even more true.

Yeah, I knew I'd get called on that one :-). It's the seeks that'll really kill you, and as you say once you're on the track the rest is practically free (which is why the VM should prefer to evict larger chunks at a time rather than lots of small things; see http://lkml.org/lkml/2007/7/23/214 for something that's heading in the right direction, though the side-effects are unfortunate).

> if you read one sector in from a raid5 array you have done all the same
> I/O that you would have to do to read in the entire stripe, but I don't
> believe that the current system will keep it all around if it exceeds the
> readahead limit.

Fengguang Wu is doing lots of active work on making readahead suck less. Ping him and he'll likely take an active interest in the RAID stuff.
Ray ^ permalink raw reply [flat|nested] 535+ messages in thread
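The synthetic test david describes above (allocate ~3/4 of RAM, pause waiting for input, then randomly touch about 2x as many pages as were allocated) could be sketched roughly as below. This is an illustrative userspace sketch, not Con's actual benchmark; the function names and the fixed 4096-byte page size are assumptions.

```c
/* Illustrative sketch (not Con's actual benchmark) of the synthetic
 * swap test described above: allocate a large anonymous buffer and
 * fault every page in, then -- after pausing so the pages can be
 * pushed out to swap -- randomly touch ~2x as many pages as were
 * allocated.  A 4096-byte page size is assumed for simplicity. */
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ 4096

/* Allocate `mb` megabytes and fault every page in; NULL on failure. */
char *alloc_hog(size_t mb)
{
	size_t bytes = mb * 1024 * 1024;
	char *buf = malloc(bytes);

	if (buf)
		memset(buf, 1, bytes);	/* touch every page once */
	return buf;
}

/* Touch `touches` randomly chosen pages in a buffer of `npages`
 * pages; returns the number of pages touched. */
size_t random_touch(char *buf, size_t npages, size_t touches)
{
	size_t i;

	for (i = 0; i < touches; i++)
		buf[((size_t)rand() % npages) * PAGE_SZ] ^= 1;
	return touches;
}
```

A driver would call alloc_hog() with roughly 3/4 of RAM, block on getchar() while the user works (or while a second hog instance runs), then time random_touch(buf, npages, 2 * npages) to see how painful the swap-in is.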
* Re: -mm merge plans for 2.6.23
2007-07-25 15:55 ` Ray Lee
@ 2007-07-25 20:16 ` Al Boldi
2007-07-27 0:28 ` Magnus Naeslund
0 siblings, 1 reply; 535+ messages in thread
From: Al Boldi @ 2007-07-25 20:16 UTC (permalink / raw)
To: Ray Lee, david
Cc: Nick Piggin, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel

Ray Lee wrote:
> On 7/24/07, david@lang.hm <david@lang.hm> wrote:
> > by the way, I've also seen comments on the Postgres performance mailing
> > list about how slow linux is compared to other OS's in pulling data back
> > in that's been pushed out to swap (not a factor on dedicated database
> > machines, but a big factor on multi-purpose machines)
>
> Yeah, akpm and... one of the usual suspects, had mentioned something
> such as 2.6 is half the speed of 2.4 for swapin. (Let's see if I can
> find a reference for that, it's been a year or more...) Okay,
> misremembered. Swap in is half the speed of swap out (
> http://lkml.org/lkml/2007/1/22/173 ). Al Boldi (added to the CC:, poor
> sod), is the one who knows how to measure that, I'm guessing.
>
> Al? How are you coming up with those figures? I'm interested in
> reproducing it. It could be due to something stupid, such as the VM
> faulting things out in reverse order or something...

Thanks for asking. I'm rather surprised that nobody's noticing any of this slowdown. To be fair, it's not really a regression; on the contrary, 2.4 is a lot worse wrt swapin and swapout, and Rik van Riel even considers a 50% swapin slowdown wrt swapout something like better than expected (see thread '[RFC] kswapd: Kernel Swapper performance'). He probably meant random swapin, which seems to offer a 4x slowdown.

There are two ways to reproduce this:

1. swsusp to disk reports ~44mb/s swapout, and ~25mb/s swapin during resume

2.
tmpfs swapout is superfast, whereas swapin is really slow (see thread '[PATCH] free swap space when (re)activating page')

Here is an excerpt from that thread (note the machine config in the first line):

============================================
RAM 512mb, SWAP 1G
# mount -t tmpfs -o size=1G none /dev/shm
# time cat /dev/full > /dev/shm/x.dmp    15sec
# time cat /dev/shm/x.dmp > /dev/null    58sec
# time cat /dev/shm/x.dmp > /dev/null    72sec
# time cat /dev/shm/x.dmp > /dev/null    85sec
# time cat /dev/shm/x.dmp > /dev/null    93sec
# time cat /dev/shm/x.dmp > /dev/null    99sec
============================================

As you can see, swapout runs at full wire speed, whereas swapin not only is 4x slower, but increasingly gets the VM tangled up, ending at a ~6x slowdown.

So again, I'm really surprised people haven't noticed. Thanks!

--
Al

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23
2007-07-25 20:16 ` Al Boldi
@ 2007-07-27 0:28 ` Magnus Naeslund
0 siblings, 0 replies; 535+ messages in thread
From: Magnus Naeslund @ 2007-07-27 0:28 UTC (permalink / raw)
To: Al Boldi
Cc: Ray Lee, david, Nick Piggin, Jesper Juhl, Andrew Morton, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel

Al Boldi wrote:
>
> Thanks for asking. I'm rather surprised that nobody's noticing any of this
> slowdown. To be fair, it's not really a regression; on the contrary, 2.4 is
> a lot worse wrt swapin and swapout, and Rik van Riel even considers a 50%
> swapin slowdown wrt swapout something like better than expected (see thread
> '[RFC] kswapd: Kernel Swapper performance'). He probably meant random
> swapin, which seems to offer a 4x slowdown.
>

Sorry for the late reply. I think I reported this, or another swap/tmpfs performance issue, earlier ( http://marc.info/?t=116542915700004&r=1&w=2 ). We got the suggestion to increase /proc/sys/vm/page-cluster to 5, but we never got around to trying it. Maybe that was the reason my report was almost entirely ignored; sorry for that.

Regards,
Magnus

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23
2007-07-24 4:53 ` Ray Lee
2007-07-24 5:10 ` Jeremy Fitzhardinge
2007-07-24 5:16 ` Nick Piggin
@ 2007-07-24 5:18 ` Andrew Morton
2007-07-24 6:01 ` Ray Lee
2007-07-25 1:26 ` [ck] " Matthew Hawkins
2 siblings, 2 replies; 535+ messages in thread
From: Andrew Morton @ 2007-07-24 5:18 UTC (permalink / raw)
To: Ray Lee
Cc: Nick Piggin, Jesper Juhl, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel

On Mon, 23 Jul 2007 21:53:38 -0700 "Ray Lee" <ray-lk@madrabbit.org> wrote:
>
> Since this merge period has appeared particularly frazzling for
> Andrew, I've been keeping silent and waiting for him to get to a point
> where there's a breather. I didn't feel it would be polite to request
> yet more work out of him while he had a mess on his hands.

Let it just be noted that Con is not the only one who has expended effort on this patch. It's been in -mm for nearly two years and it has meant ongoing effort for me and, to a lesser extent, other MM developers to keep it alive.

> But, given this has come to a head, I'm asking now.
>
> Andrew? You've always given the impression that you want this run more
> as an engineering effort than an artistic endeavour, so help us out
> here. What are your concerns with swap prefetch? What sort of
> comparative data would you like to see to justify its inclusion, or to
> prove that it's not needed?

Criteria are different for each patch, but it usually comes down to a cost/benefit judgement. Does the benefit of the patch exceed its maintenance cost over the lifetime of the kernel (whatever that is)? In this case the answer to that has never been clear to me. The (much older) fs-aio patches were (are) in a similar situation.

The other consideration here is, as Nick points out, are the problems which people see this patch solving for them solveable in other, better ways? IOW, is this patch fixing up preexisting deficiencies post-facto?
To attack the second question we could start out with bug reports: system A with workload B produces result C. I think result C is wrong for <reasons> and would prefer to see result D. > Or are we reading too much into the fact that it isn't merged? In > short, communicate please, it will help. Well. The above, plus there's always a lot of stuff happening in MM land, and I haven't seen much in the way of enthusiasm from the usual MM developers. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23
2007-07-24 5:18 ` Andrew Morton
@ 2007-07-24 6:01 ` Ray Lee
2007-07-24 6:10 ` Andrew Morton
2007-07-24 9:38 ` Tilman Schmidt
2007-07-25 1:26 ` [ck] " Matthew Hawkins
1 sibling, 2 replies; 535+ messages in thread
From: Ray Lee @ 2007-07-24 6:01 UTC (permalink / raw)
To: Andrew Morton
Cc: Nick Piggin, Jesper Juhl, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel

On 7/23/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> Let it just be noted that Con is not the only one who has expended effort
> on this patch. It's been in -mm for nearly two years and it has meant
> ongoing effort for me and, to a lesser extent, other MM developers to keep
> it alive.

<nods> Yes, keeping patches from crufting up and stepping on other patches' toes is hard work; I did it for a bit, and it was one of the more thankless tasks I've tried a hand at. So, thanks.

> Criteria are different for each patch, but it usually comes down to a
> cost/benefit judgement. Does the benefit of the patch exceed its
> maintenance cost over the lifetime of the kernel (whatever that is)?

Well, I suspect it's 'lifetime of the feature,' in this case, as it's no more user visible than the page replacement algorithm in the first place.

> The other consideration here is, as Nick points out, are the problems which
> people see this patch solving for them solveable in other, better ways?
> IOW, is this patch fixing up preexisting deficiencies post-facto?

In some cases, it almost certainly is. It also has the troubling aspect of mitigating future regressions without anyone terribly noticing, due to it being able to paper over those hypothetical future deficiencies when they're introduced.

> To attack the second question we could start out with bug reports: system A
> with workload B produces result C. I think result C is wrong for <reasons>
> and would prefer to see result D.
I spend a lot of time each day watching my computer fault my working set back in when I switch contexts. I'd rather I didn't have to do that. Unfortunately, that's a pretty subjective problem report. For whatever it's worth, we have pretty subjective solution reports pointing to swap prefetch as providing a fix for them.

My concern is that a subjective problem report may not be good enough. So, what do I measure to make this an objective problem report? And if I do that (and it shows a positive result), will that be good enough to argue for inclusion?

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-24 6:01 ` Ray Lee @ 2007-07-24 6:10 ` Andrew Morton 2007-07-24 9:38 ` Tilman Schmidt 1 sibling, 0 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-24 6:10 UTC (permalink / raw) To: Ray Lee Cc: Nick Piggin, Jesper Juhl, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Mon, 23 Jul 2007 23:01:41 -0700 "Ray Lee" <ray-lk@madrabbit.org> wrote: > So, what do I measure to make this an objective problem report? Ideal would be to find a reproducible-by-others testcase which does what you believe to be the wrong thing. > And if > I do that (and it shows a positive result), will that be good enough > to argue for inclusion? That depends upon whether there are more suitable ways of fixing "the wrong thing". There may not be - it could well be that present behaviour is correct for the testcase, but it leaves the system in the wrong state for your large workload shift. In that case, prefetching (ie: restoring system state approximately to that which prevailed prior to "testcase") might well be a suitable fix. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23
2007-07-24 6:01 ` Ray Lee
2007-07-24 6:10 ` Andrew Morton
@ 2007-07-24 9:38 ` Tilman Schmidt
1 sibling, 0 replies; 535+ messages in thread
From: Tilman Schmidt @ 2007-07-24 9:38 UTC (permalink / raw)
To: Ray Lee
Cc: Andrew Morton, Nick Piggin, Jesper Juhl, ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1106 bytes --]

Ray Lee wrote:
> I spend a lot of time each day watching my computer fault my
> working set back in when I switch contexts. I'd rather I didn't have to
> do that. Unfortunately, that's a pretty subjective problem report. For
> whatever it's worth, we have pretty subjective solution reports
> pointing to swap prefetch as providing a fix for them.

Add me.

> My concern is that a subjective problem report may not be good enough.

That's my impression too, seeing the insistence on numbers.

> So, what do I measure to make this an objective problem report?

That seems to be the crux of the matter: how to measure subjective usability issues (aka user experience) when simple reports along the lines of "A is much better than B for everyday work" are not enough. The same problem already impaired the "fair scheduler" discussion. It would really help to have a clear direction there.

--
Tilman Schmidt E-Mail: tilman@imap.cc Bonn, Germany
This message consists of 100% recycled bits. Best before (if unopened): see reverse.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23 2007-07-24 5:18 ` Andrew Morton 2007-07-24 6:01 ` Ray Lee @ 2007-07-25 1:26 ` Matthew Hawkins 2007-07-25 1:35 ` David Miller 1 sibling, 1 reply; 535+ messages in thread From: Matthew Hawkins @ 2007-07-25 1:26 UTC (permalink / raw) To: Andrew Morton Cc: Ray Lee, Nick Piggin, Jesper Juhl, linux-kernel, ck list, linux-mm, Paul Jackson On 7/24/07, Andrew Morton <akpm@linux-foundation.org> wrote: > The other consideration here is, as Nick points out, are the problems which > people see this patch solving for them solveable in other, better ways? > IOW, is this patch fixing up preexisting deficiencies post-facto? So let me get this straight - you don't want to merge swap prefetch which exists now and solves issues many people are seeing, and has been tested more than a gazillion other bits & pieces that do get merged - because it could be possible that in the future some other patch, which doesn't yet exist and nobody is working on, may solve the problem better? You know what, just release Linux 0.02 as 2.6.23 because, using your logic, everything that was merged since October 5, 1991 could be replaced by something better. Perhaps. So there's obviously no point having it there in the first place & there'll be untold savings in storage costs and compilation time for the kernel tree, also bandwidth for the mirror sites etc. in the mean time while we wait for the magic pixies to come and deliver the one true piece of code that cannot be improved upon. > Well. The above, plus there's always a lot of stuff happening in MM land, > and I haven't seen much in the way of enthusiasm from the usual MM > developers. I haven't seen much in the way of enthusiasm from developers, period. People are tired of maintaining patches for years that never get merged into mainline because of totally bullshit reasons (usually amounting to NIH syndrome) -- Matt ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [ck] Re: -mm merge plans for 2.6.23
2007-07-25 1:26 ` [ck] " Matthew Hawkins
@ 2007-07-25 1:35 ` David Miller
0 siblings, 0 replies; 535+ messages in thread
From: David Miller @ 2007-07-25 1:35 UTC (permalink / raw)
To: darthmdh
Cc: akpm, ray-lk, nickpiggin, jesper.juhl, linux-kernel, ck, linux-mm, pj

From: "Matthew Hawkins" <darthmdh@gmail.com>
Date: Wed, 25 Jul 2007 11:26:57 +1000

> On 7/24/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> > The other consideration here is, as Nick points out, are the problems which
> > people see this patch solving for them solveable in other, better ways?
> > IOW, is this patch fixing up preexisting deficiencies post-facto?
>
> So let me get this straight - you don't want to merge swap prefetch
> which exists now and solves issues many people are seeing, and has
> been tested more than a gazillion other bits & pieces that do get
> merged - because it could be possible that in the future some other
> patch, which doesn't yet exist and nobody is working on, may solve the
> problem better?

I have to generally agree that the objections to the swap prefetch patches have been conjecture and in general a waste of time that frustrates people. There is a point at which it might be wise to just step back, let the river run its course, and see what happens.

Initially, it's good to play games of "what if", but after several months it's not a productive thing and slows down progress for no good reason.

If a better mechanism gets implemented, great! We can easily replace the swap prefetch stuff at such time. But until then swap prefetch is what we have, and it's sat long enough in -mm with no major problems to merge it.

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-10 10:15 ` -mm merge plans for 2.6.23 Con Kolivas [not found] ` <b21f8390707101802o2d546477n2a18c1c3547c3d7a@mail.gmail.com> 2007-07-23 23:08 ` Jesper Juhl @ 2007-07-24 0:08 ` Con Kolivas 2 siblings, 0 replies; 535+ messages in thread From: Con Kolivas @ 2007-07-24 0:08 UTC (permalink / raw) To: Andrew Morton; +Cc: ck list, Ingo Molnar, Paul Jackson, linux-mm, linux-kernel On Tuesday 10 July 2007 20:15, Con Kolivas wrote: > On Tuesday 10 July 2007 18:31, Andrew Morton wrote: > > When replying, please rewrite the subject suitably and try to Cc: the > > appropriate developer(s). > > ~swap prefetch > > Nick's only remaining issue which I could remotely identify was to make it > cpuset aware: > http://marc.info/?l=linux-mm&m=117875557014098&w=2 > as discussed with Paul Jackson it was cpuset aware: > http://marc.info/?l=linux-mm&m=117895463120843&w=2 > > I fixed all bugs I could find and improved it as much as I could last > kernel cycle. > > Put me and the users out of our misery and merge it now or delete it > forever please. And if the meaningless handwaving that I 100% expect as a > response begins again, then that's fine. I'll take that as a no and you can > dump it. The window for 2.6.23 has now closed and your position on this is clear. I've been supporting this code in -mm for 21 months since 16-Oct-2005 without any obvious decision for this code forwards or backwards. I am no longer part of your operating system's kernel's world; thus I cannot support this code any longer. Unless someone takes over the code base for swap prefetch you have to assume it is now unmaintained and should delete it. Please respect my request to not be contacted further regarding this or any other kernel code. -- -ck ^ permalink raw reply [flat|nested] 535+ messages in thread
* containers (was Re: -mm merge plans for 2.6.23) 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (3 preceding siblings ...) 2007-07-10 10:15 ` -mm merge plans for 2.6.23 Con Kolivas @ 2007-07-10 10:52 ` Srivatsa Vaddagiri 2007-07-10 11:19 ` Ingo Molnar 2007-07-10 18:34 ` Paul Menage 2007-07-10 11:52 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch (was: " Theodore Tso ` (21 subsequent siblings) 26 siblings, 2 replies; 535+ messages in thread From: Srivatsa Vaddagiri @ 2007-07-10 10:52 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, menage, containers On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: > cpuset-zero-malloc-revert-the-old-cpuset-fix.patch > containersv10-basic-container-framework.patch > containersv10-basic-container-framework-fix.patch > containersv10-basic-container-framework-fix-2.patch > containersv10-basic-container-framework-fix-3.patch > containersv10-basic-container-framework-fix-for-bad-lock-balance-in-containers.patch > containersv10-example-cpu-accounting-subsystem.patch > containersv10-example-cpu-accounting-subsystem-fix.patch > containersv10-add-tasks-file-interface.patch > containersv10-add-tasks-file-interface-fix.patch > containersv10-add-tasks-file-interface-fix-2.patch > containersv10-add-fork-exit-hooks.patch > containersv10-add-fork-exit-hooks-fix.patch > containersv10-add-container_clone-interface.patch > containersv10-add-container_clone-interface-fix.patch > containersv10-add-procfs-interface.patch > containersv10-add-procfs-interface-fix.patch > containersv10-make-cpusets-a-client-of-containers.patch > containersv10-make-cpusets-a-client-of-containers-whitespace.patch > containersv10-share-css_group-arrays-between-tasks-with-same-container-memberships.patch > containersv10-share-css_group-arrays-between-tasks-with-same-container-memberships-fix.patch > containersv10-share-css_group-arrays-between-tasks-with-same-container-memberships-cpuset-zero-malloc-fix-for-new-containers.patch 
> containersv10-simple-debug-info-subsystem.patch > containersv10-simple-debug-info-subsystem-fix.patch > containersv10-simple-debug-info-subsystem-fix-2.patch > containersv10-support-for-automatic-userspace-release-agents.patch > containersv10-support-for-automatic-userspace-release-agents-whitespace.patch > add-containerstats-v3.patch > add-containerstats-v3-fix.patch > update-getdelays-to-become-containerstats-aware.patch > containers-implement-subsys-post_clone.patch > containers-implement-namespace-tracking-subsystem-v3.patch > > Container stuff. Hold, I guess. I was expecting updates from Paul. Paul, Are you working on a new version? I thought it was mostly ready for mainline. -- Regards, vatsa ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-10 10:52 ` containers (was Re: -mm merge plans for 2.6.23) Srivatsa Vaddagiri @ 2007-07-10 11:19 ` Ingo Molnar 2007-07-10 18:34 ` Paul Menage 1 sibling, 0 replies; 535+ messages in thread From: Ingo Molnar @ 2007-07-10 11:19 UTC (permalink / raw) To: Srivatsa Vaddagiri; +Cc: Andrew Morton, linux-kernel, menage, containers * Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote: > On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: > > cpuset-zero-malloc-revert-the-old-cpuset-fix.patch > > containersv10-basic-container-framework.patch > > containersv10-basic-container-framework-fix.patch > > containersv10-basic-container-framework-fix-2.patch > > containersv10-basic-container-framework-fix-3.patch > > containersv10-basic-container-framework-fix-for-bad-lock-balance-in-containers.patch > > containersv10-example-cpu-accounting-subsystem.patch > > containersv10-example-cpu-accounting-subsystem-fix.patch > > containersv10-add-tasks-file-interface.patch > > containersv10-add-tasks-file-interface-fix.patch > > containersv10-add-tasks-file-interface-fix-2.patch > > containersv10-add-fork-exit-hooks.patch > > containersv10-add-fork-exit-hooks-fix.patch > > containersv10-add-container_clone-interface.patch > > containersv10-add-container_clone-interface-fix.patch > > containersv10-add-procfs-interface.patch > > containersv10-add-procfs-interface-fix.patch > > containersv10-make-cpusets-a-client-of-containers.patch > > containersv10-make-cpusets-a-client-of-containers-whitespace.patch > > containersv10-share-css_group-arrays-between-tasks-with-same-container-memberships.patch > > containersv10-share-css_group-arrays-between-tasks-with-same-container-memberships-fix.patch > > containersv10-share-css_group-arrays-between-tasks-with-same-container-memberships-cpuset-zero-malloc-fix-for-new-containers.patch > > containersv10-simple-debug-info-subsystem.patch > > 
containersv10-simple-debug-info-subsystem-fix.patch > > containersv10-simple-debug-info-subsystem-fix-2.patch > > containersv10-support-for-automatic-userspace-release-agents.patch > > containersv10-support-for-automatic-userspace-release-agents-whitespace.patch > > add-containerstats-v3.patch > > add-containerstats-v3-fix.patch > > update-getdelays-to-become-containerstats-aware.patch > > containers-implement-subsys-post_clone.patch > > containers-implement-namespace-tracking-subsystem-v3.patch > > > > Container stuff. Hold, I guess. I was expecting updates from Paul. > > Paul, > Are you working on a new version? I thought it was mostly ready > for mainline. in particular CONFIG_FAIR_GROUP_SCHED depends on these APIs. Once the APIs to configure are upstream, the group scheduler can be enabled too. (basically all the group scheduling bits are upstream now as part of CFS.) Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23)
2007-07-10 10:52 ` containers (was Re: -mm merge plans for 2.6.23) Srivatsa Vaddagiri
2007-07-10 11:19 ` Ingo Molnar
@ 2007-07-10 18:34 ` Paul Menage
2007-07-10 18:53 ` Andrew Morton
1 sibling, 1 reply; 535+ messages in thread
From: Paul Menage @ 2007-07-10 18:34 UTC (permalink / raw)
To: vatsa; +Cc: Andrew Morton, linux-kernel, containers

On 7/10/07, Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:
> > Container stuff. Hold, I guess. I was expecting updates from Paul.
>
> Paul,
> Are you working on a new version? I thought it was mostly ready
> for mainline.

There are definitely some big changes that I want to make internally to the framework, but I guess they don't have to block pushing the basic framework to mainline.

I've got a new patchset that's primarily got all the various -mm fix patches rolled into the appropriate original patches, along with some small tweaks:

- changed the Kconfig files to avoid using "select"
- adding the subsystem name as a prefix for each control file to enforce namespace scoping
- misc contributions from others

Short-term I also want to:

- rethink the linked list that runs through each task to its css_group object, since that seemed to hurt performance a bit, but for now that can probably be solved by just ripping it out and going back to scanning the tasklist to enumerate tasks in a container.

- extend the options parsing, so we can have more than just a list of subsystems. Probably changing the existing -o<subsys1>,<subsys2>,... to be one of:

  -osubsys=<subsys1>:<subsys2>:...,<otheropt>=<otherval>
  -osubsys=<subsys1>,subsys=<subsys2>,subsys=...,<otheropt>=<otherval>

(what's the preferred convention for fs mount options with multiple values?)
I'd not realised that anything else depending on containers was ready for upstream merge, but if CFS group support is ready then merging a subset of them is probably a good idea, since this is an application that I can see a lot of people wanting to play with. Andrew, how about we merge enough of the container framework to support CFS? Bits we could leave out for now include container_clone() support and the nsproxy subsystem, fork/exit callback hooks, and possibly leave cpusets alone for now (which would also mean we could skip the automatic release-agent stuff). I'm in Tokyo for the Linux Foundation Japan symposium right now, but I should be able to get the new patchset to you for Friday afternoon. Paul ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-10 18:34 ` Paul Menage @ 2007-07-10 18:53 ` Andrew Morton 2007-07-10 19:05 ` Paul Menage 2007-07-11 4:55 ` Srivatsa Vaddagiri 0 siblings, 2 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-10 18:53 UTC (permalink / raw) To: Paul Menage; +Cc: vatsa, linux-kernel, containers On Tue, 10 Jul 2007 11:34:38 -0700 "Paul Menage" <menage@google.com> wrote: > Andrew, how about we merge enough of the container framework to > support CFS? Bits we could leave out for now include container_clone() > support and the nsproxy subsystem, fork/exit callback hooks, and > possibly leave cpusets alone for now (which would also mean we could > skip the automatic release-agent stuff). I'm in Tokyo for the Linux > Foundation Japan symposium right now, but I should be able to get the > new patchset to you for Friday afternoon. mm.. Given that you propose leaving bits out for the 2.6.23 merge, and that changes are still pending and that nothing will _use_ the framework in 2.6.23 I'd be inclined to err on the side of caution and hold it all back from 2.6.23. This has the advantage that the merge will happen after the kernel-summit containers discussion which I suspect will be an important point in the life of this project... ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-10 18:53 ` Andrew Morton @ 2007-07-10 19:05 ` Paul Menage 2007-07-11 4:55 ` Srivatsa Vaddagiri 1 sibling, 0 replies; 535+ messages in thread From: Paul Menage @ 2007-07-10 19:05 UTC (permalink / raw) To: Andrew Morton; +Cc: vatsa, linux-kernel, containers On 7/10/07, Andrew Morton <akpm@linux-foundation.org> wrote: > > Andrew, how about we merge enough of the container framework to > > support CFS? Bits we could leave out for now include container_clone() > > support and the nsproxy subsystem, fork/exit callback hooks, and > > possibly leave cpusets alone for now (which would also mean we could > > skip the automatic release-agent stuff). I'm in Tokyo for the Linux > > Foundation Japan symposium right now, but I should be able to get the > > new patchset to you for Friday afternoon. > > mm.. Given that you propose leaving bits out for the 2.6.23 merge, and > that changes are still pending and that nothing will _use_ the framework in > 2.6.23 That's what I was originally thinking too, but since CFS has been merged, CFS group scheduling would use it. Paul ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-10 18:53 ` Andrew Morton 2007-07-10 19:05 ` Paul Menage @ 2007-07-11 4:55 ` Srivatsa Vaddagiri 2007-07-11 5:29 ` Andrew Morton 1 sibling, 1 reply; 535+ messages in thread From: Srivatsa Vaddagiri @ 2007-07-11 4:55 UTC (permalink / raw) To: Andrew Morton; +Cc: Paul Menage, linux-kernel, containers, Ingo Molnar On Tue, Jul 10, 2007 at 11:53:19AM -0700, Andrew Morton wrote: > On Tue, 10 Jul 2007 11:34:38 -0700 > "Paul Menage" <menage@google.com> wrote: > > > Andrew, how about we merge enough of the container framework to > > support CFS? Bits we could leave out for now include container_clone() > > support and the nsproxy subsystem, fork/exit callback hooks, and > > possibly leave cpusets alone for now (which would also mean we could > > skip the automatic release-agent stuff). I'm in Tokyo for the Linux > > Foundation Japan symposium right now, but I should be able to get the > > new patchset to you for Friday afternoon. > > mm.. Given that you propose leaving bits out for the 2.6.23 merge, and > that changes are still pending and that nothing will _use_ the framework in > 2.6.23 [...] Andrew, The cpu group scheduler is ready and waiting for the container patches in 2.6.23 :) Here are some options with us: a. (As Paul says) merge enough of container patches to enable its use with cfs group scheduler (and possibly cpusets?) b. Enable group scheduling bits in 2.6.23 using the user-id grouping mechanism (aka fair user scheduler). For 2.6.24, we could remove this interface and use Paul's container patches instead. Since this means change of API interface between 2.6.23 and 2.6.24, I don't prefer this option. c. Enable group scheduling bits only in -mm for now (2.6.23-mmX), using Paul's container patches. I can send you a short patch that hooks up cfs group scheduler with Paul's container infrastructure. If a. is not possible, I would prefer c. Let me know your thoughts .. 
-- Regards, vatsa ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 4:55 ` Srivatsa Vaddagiri @ 2007-07-11 5:29 ` Andrew Morton 2007-07-11 6:03 ` Srivatsa Vaddagiri ` (2 more replies) 0 siblings, 3 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-11 5:29 UTC (permalink / raw) To: vatsa; +Cc: Paul Menage, linux-kernel, containers, Ingo Molnar On Wed, 11 Jul 2007 10:25:16 +0530 Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote: > On Tue, Jul 10, 2007 at 11:53:19AM -0700, Andrew Morton wrote: > > On Tue, 10 Jul 2007 11:34:38 -0700 > > "Paul Menage" <menage@google.com> wrote: > > > > > Andrew, how about we merge enough of the container framework to > > > support CFS? Bits we could leave out for now include container_clone() > > > support and the nsproxy subsystem, fork/exit callback hooks, and > > > possibly leave cpusets alone for now (which would also mean we could > > > skip the automatic release-agent stuff). I'm in Tokyo for the Linux > > > Foundation Japan symposium right now, but I should be able to get the > > > new patchset to you for Friday afternoon. > > > > mm.. Given that you propose leaving bits out for the 2.6.23 merge, and > > that changes are still pending and that nothing will _use_ the framework in > > 2.6.23 [...] > > Andrew, > The cpu group scheduler is ready and waiting for the container patches > in 2.6.23 :) > > Here are some options with us: > > a. (As Paul says) merge enough of container patches to enable > its use with cfs group scheduler (and possibly cpusets?) > > b. Enable group scheduling bits in 2.6.23 using the user-id grouping > mechanism (aka fair user scheduler). For 2.6.24, we could remove > this interface and use Paul's container patches instead. Since this > means change of API interface between 2.6.23 and 2.6.24, I don't > prefer this option. > > c. Enable group scheduling bits only in -mm for now (2.6.23-mmX), using > Paul's container patches. 
I can send you a short patch that hooks up > cfs group scheduler with Paul's container infrastructure. > > If a. is not possible, I would prefer c. > > Let me know your thoughts .. I'm inclined to take the cautious route here - I don't think people will be dying for the CFS thingy (which I didn't even know about?) in .23, and it's rather a lot of infrastructure to add for a CPU scheduler configurator gadget (what does it do, anyway?) We have plenty of stuff for 2.6.23 already ;) Is this liveable with?? ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 5:29 ` Andrew Morton @ 2007-07-11 6:03 ` Srivatsa Vaddagiri 2007-07-11 9:04 ` Ingo Molnar 2007-07-11 19:44 ` Paul Menage 2 siblings, 0 replies; 535+ messages in thread From: Srivatsa Vaddagiri @ 2007-07-11 6:03 UTC (permalink / raw) To: Andrew Morton; +Cc: Paul Menage, linux-kernel, containers, Ingo Molnar On Tue, Jul 10, 2007 at 10:29:42PM -0700, Andrew Morton wrote: > I'm inclined to take the cautious route here - I don't think people will be > dying for the CFS thingy (which I didn't even know about?) in .23, and it's > rather a lot of infrastructure to add for a CPU scheduler configurator > gadget (what does it do, anyway?) Hmm ok, if you think the container patches are too early for 2.6.23, fine. We should definitely target having it in 2.6.24, by which time I am thinking the memory rss controller will also be in good shape. > We have plenty of stuff for 2.6.23 already ;) > > Is this liveable with?? Fine. I will request you to enable group cpu scheduling in 2.6.23-rcX-mmY at least, so that it gets some amount of testing. The essential group scheduling bits are already in Linus' tree now (as part of the cfs merge), so what you need in -mm is a slim patch to hook it up with Paul's container infrastructure (which I trust will continue to be in -mm until it goes mainline). I will send across that slim patch later (to be included in 2.6.23-rc1-mm1 perhaps). -- Regards, vatsa ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 5:29 ` Andrew Morton 2007-07-11 6:03 ` Srivatsa Vaddagiri @ 2007-07-11 9:04 ` Ingo Molnar 2007-07-11 9:23 ` Paul Jackson 2007-07-11 19:44 ` Paul Menage 2 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-11 9:04 UTC (permalink / raw) To: Andrew Morton; +Cc: vatsa, Paul Menage, linux-kernel, containers * Andrew Morton <akpm@linux-foundation.org> wrote: > > c. Enable group scheduling bits only in -mm for now (2.6.23-mmX), using > > Paul's container patches. I can send you a short patch that hooks up > > cfs group scheduler with Paul's container infrastructure. > > > > If a. is not possible, I would prefer c. > > > > Let me know your thoughts .. > > I'm inclined to take the cautious route here - I don't think people > will be dying for the CFS thingy (which I didn't even know about?) in > .23, and it's rather a lot of infrastructure to add for a CPU > scheduler configurator gadget (what does it do, anyway?) > > We have plenty of stuff for 2.6.23 already ;) > > Is this liveable with?? another option would be to trivially hook up CONFIG_FAIR_GROUP_SCHED with cpusets, and to offer CONFIG_FAIR_GROUP_SCHED in the Kconfig, dependent on CPUSETS and defaulting to off. That would give it a chance to be tested, benchmarked, etc. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
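[Editorial note: Ingo's suggestion — offer CONFIG_FAIR_GROUP_SCHED in Kconfig, dependent on CPUSETS and defaulting to off — corresponds to a Kconfig entry roughly like the following. The option and dependency names are taken from the thread; the prompt and help text are illustrative, not the actual patch.]

```kconfig
config FAIR_GROUP_SCHED
	bool "Fair group CPU scheduler"
	depends on CPUSETS
	default n
	help
	  Enable the group-scheduling extensions of CFS, using cpusets
	  as the interim grouping interface. Defaults to off so it can
	  be tested and benchmarked without affecting existing setups.
```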
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 9:04 ` Ingo Molnar @ 2007-07-11 9:23 ` Paul Jackson 2007-07-11 10:03 ` Srivatsa Vaddagiri 0 siblings, 1 reply; 535+ messages in thread From: Paul Jackson @ 2007-07-11 9:23 UTC (permalink / raw) To: Ingo Molnar; +Cc: akpm, vatsa, menage, linux-kernel, containers Ingo wrote: > another option would be to trivially hook up CONFIG_FAIR_GROUP_SCHED > with cpusets, ... ah ... you triggered my procmail filter for 'cpuset' ... ;). What would it mean to hook up CFS with cpusets? I've a pretty good idea what a cpuset is, but don't know what kind of purpose you have in mind for such a hook. Could you say a few words to that? Thanks. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 9:23 ` Paul Jackson @ 2007-07-11 10:03 ` Srivatsa Vaddagiri 2007-07-11 10:19 ` Ingo Molnar 2007-07-11 11:10 ` Paul Jackson 0 siblings, 2 replies; 535+ messages in thread From: Srivatsa Vaddagiri @ 2007-07-11 10:03 UTC (permalink / raw) To: Paul Jackson; +Cc: Ingo Molnar, akpm, menage, linux-kernel, containers On Wed, Jul 11, 2007 at 02:23:52AM -0700, Paul Jackson wrote: > Ingo wrote: > > another option would be to trivially hook up CONFIG_FAIR_GROUP_SCHED > > with cpusets, ... > > ah ... you triggered my procmail filter for 'cpuset' ... ;). :-) > What would it mean to hook up CFS with cpusets? CFS is the new cpu scheduler in Linus's tree (http://lwn.net/Articles/241085/). It has some group scheduling capabilities added, i.e. the core scheduler now recognizes the concept of a task-group and provides fair cpu time to each task-group (in addition to providing fair time to each task in a group). The core scheduler however is not concerned with how task groups are formed and/or how tasks migrate between groups. That's where a patch like Paul Menage's container infrastructure comes in handy - to provide a user interface for managing task-groups (create/delete task groups, migrate tasks from one group to another etc). Whatever the chosen user interface is, the cpu scheduler needs to know about such task-group creation/destruction, migration of tasks across groups etc. Unfortunately, the group-scheduler bits will be ready in 2.6.23 while Paul Menage's container patches aren't ready for 2.6.23 yet. So Ingo was proposing we use cpuset as that user interface to manage task-groups. This will be only for 2.6.23. In 2.6.24, when hopefully Paul Menage's container patches will be ready and will be merged, the group cpu scheduler will stop using cpuset as that interface and use the container infrastructure instead. If you recall, I have attempted to use cpuset for such an interface in the past (metered cpusets - see [1]).
It brings in some semantic changes for cpusets, most notably:

- metered cpusets cannot have grand-children
- all cpusets under a metered cpuset need to share the same set of cpus.

Is it fine if I introduce these semantic changes, only for 2.6.23 and only when CONFIG_FAIR_GROUP_SCHED is enabled? This will let the group cpu scheduler receive some amount of testing. The other alternative is to hook up the group scheduler with user-ids (again only for 2.6.23). > I've a pretty > good idea what a cpuset is, but don't know what kind of purpose > you have in mind for such a hook. Could you say a few words to > that? Thanks. Reference: 1. http://marc.info/?l=linux-kernel&m=115946525811848&w=2 -- Regards, vatsa ^ permalink raw reply [flat|nested] 535+ messages in thread
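[Editorial note: for readers unfamiliar with the interface being discussed — task-groups are managed through a mounted filesystem: creating a directory creates a group, and writing a pid into that group's `tasks` file migrates the task. The sketch below simulates that flow against an ordinary directory tree, with no kernel support; the path names are illustrative stand-ins, not the real mounted container fs.]

```python
import os
import tempfile

def move_task_to_uid_group(root, pid, uid):
    """Mimic the container-fs flow: mkdir a per-uid group directory
    (idempotent, like the daemon's mkdir+EEXIST handling), then append
    the task's pid to that group's 'tasks' file."""
    group = os.path.join(root, str(uid))
    os.makedirs(group, exist_ok=True)        # mkdir == create task-group
    with open(os.path.join(group, "tasks"), "a") as f:
        f.write("%d\n" % pid)                # write pid == migrate task

# Simulate uid-change events: two tasks for uid 500, one for uid 501.
root = tempfile.mkdtemp(prefix="cpuctl-sim-")
for pid, uid in [(1234, 500), (1235, 500), (1300, 501)]:
    move_task_to_uid_group(root, pid, uid)

with open(os.path.join(root, "500", "tasks")) as f:
    print(f.read().split())   # -> ['1234', '1235']
```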
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 10:03 ` Srivatsa Vaddagiri @ 2007-07-11 10:19 ` Ingo Molnar 2007-07-11 11:39 ` Srivatsa Vaddagiri 2007-07-11 11:10 ` Paul Jackson 1 sibling, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-11 10:19 UTC (permalink / raw) To: Srivatsa Vaddagiri; +Cc: Paul Jackson, akpm, menage, linux-kernel, containers * Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote: > The other alternative is to hook up group scheduler with user-id's > (again only for 2.6.23). could you just try this and send as simple a patch as possible? This is actually something that non-container people would be interested in as well. (btw., if this goes into 2.6.23 then we cannot possibly turn it off in 2.6.24, so it must be sane - but per UID task groups are certainly sane, the only question is how to configure the per-UID weight after bootup. [the default after-bootup should be something along the lines of 'same weight for all users, a bit more for root'.]) This would make it possible for users to test that thing. (it would also help X-heavy workloads.) Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 10:19 ` Ingo Molnar @ 2007-07-11 11:39 ` Srivatsa Vaddagiri 2007-07-11 11:42 ` Paul Jackson 2007-07-11 12:30 ` Srivatsa Vaddagiri 0 siblings, 2 replies; 535+ messages in thread From: Srivatsa Vaddagiri @ 2007-07-11 11:39 UTC (permalink / raw) To: Ingo Molnar; +Cc: containers, menage, Paul Jackson, akpm, linux-kernel [-- Attachment #1: Type: text/plain, Size: 2593 bytes --] On Wed, Jul 11, 2007 at 12:19:58PM +0200, Ingo Molnar wrote: > * Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote: > > > The other alternative is to hook up group scheduler with user-id's > > (again only for 2.6.23). > > could you just try this and send an as simple patch as possible? This is > actually something that non-container people would be interested in as > well. Note that interfacing with the container infrastructure doesn't preclude the possibility of doing fair-user scheduling (what a normal university server or desktop user would want). All that is needed is a daemon which listens for uid-change events from the kernel (using the process-event connector) and moves the task (whose uid is changing) to an appropriate container for that user. Primitive source for such a daemon is attached. > (btw., if this goes into 2.6.23 then we cannot possibly turn it off in 2.6.24, The fact that we will have two interfaces for the group scheduler in 2.6.24 is what worries me a bit (one user-id based and the other container based). We would need some mechanism for the admin to choose only one interface (and not both together, otherwise the group definitions may conflict), which doesn't sound very clean to me. Ideally I would have liked to hook onto only the container infrastructure and let user-space decide grouping policy (whether user-id based or something else). Hmm ..would it help if I maintain a patch outside the mainline which turns on fair-user scheduling in 2.6.23-rcX?
Folks will have to apply that patch on top of 2.6.23-rcX to use and test fair-user scheduling. In 2.6.24, when the container infrastructure goes in, people can get fair-user scheduling off-the-shelf by simply starting the daemon attached at bootup/initrd time. Or would you rather prefer that I add the user-id based interface permanently and in 2.6.24 introduce a compile/run-time switch for the admin to select one of the two interfaces (user-id based or container-based)? > so it must be sane - but per UID task groups are > certainly sane, the only question is how to configure the per-UID weight > after bootup. Yeah ..the container based infrastructure allows for configuring such things very easily using a fs-based interface. In the absence of that, we either provide some /proc interface or settle for the non-configurable default that you mention below. > [the default after-bootup should be something along the > lines of 'same weight for all users, a bit more for root'.]) This would > make it possible for users to test that thing. (it would also help > X-heavy workloads.) -- Regards, vatsa

[-- Attachment #2: cpuctld.c --]
[-- Type: text/plain, Size: 7072 bytes --]

/*
 * cpuctl_group_changer.c
 *
 * Used to change the group of running tasks to the correct
 * uid container.
 *
 * Copyright IBM Corporation, 2007
 * Author: Dhaval Giani <dhaval@linux.vnet.ibm.com>
 * Derived from test_cn_proc.c by Matt Helsley
 * Original copyright notice follows
 *
 * Copyright (C) Matt Helsley, IBM Corp. 2005
 * Derived from fcctl.c by Guillaume Thouvenin
 * Original copyright notice follows:
 *
 * Copyright (C) 2005 BULL SA.
 * Written by Guillaume Thouvenin <guillaume.thouvenin@bull.net>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/param.h>
#include <linux/connector.h>
#include <linux/netlink.h>
#include "linux/cn_proc.h"
#include <errno.h>
#include <signal.h>
#include <setjmp.h>

#define SEND_MESSAGE_LEN (NLMSG_LENGTH(sizeof(struct cn_msg) + \
				       sizeof(enum proc_cn_mcast_op)))
#define RECV_MESSAGE_LEN (NLMSG_LENGTH(sizeof(struct cn_msg) + \
				       sizeof(struct proc_event)))

#define SEND_MESSAGE_SIZE (NLMSG_SPACE(SEND_MESSAGE_LEN))
#define RECV_MESSAGE_SIZE (NLMSG_SPACE(RECV_MESSAGE_LEN))

#define max(x,y) ((y)<(x)?(x):(y))
#define min(x,y) ((y)>(x)?(x):(y))

#define BUFF_SIZE (max(max(SEND_MESSAGE_SIZE, RECV_MESSAGE_SIZE), 1024))
#define MIN_RECV_SIZE (min(SEND_MESSAGE_SIZE, RECV_MESSAGE_SIZE))

#define PROC_CN_MCAST_LISTEN (1)
#define PROC_CN_MCAST_IGNORE (2)

/*
 * SIGINT causes the program to exit gracefully;
 * this could happen any time after the LISTEN message has
 * been sent
 */
#define INTR_SIG SIGINT
sigjmp_buf g_jmp;

char cpuctl_fs_path[MAXPATHLEN];

void handle_intr(int signum)
{
	siglongjmp(g_jmp, signum);
}

static inline void itos(int i, char *str)
{
	sprintf(str, "%d", i);
}

int set_notify_release(int val)
{
	FILE *f;

	f = fopen("notify_on_release", "r+");
	fprintf(f, "%d\n", val);
	fclose(f);
	return 0;
}

int add_task_pid(int pid)
{
	FILE *f;

	f = fopen("tasks", "a");
	fprintf(f, "%d\n", pid);
	fclose(f);
	return 0;
}

int set_value(char *file, char *str)
{
	FILE *f;

	f = fopen(file, "w");
	fprintf(f, "%s", str);
	fclose(f);
	return 0;
}

int change_group(int pid, int uid)
{
	char str[100];
	int ret;

	ret = chdir(cpuctl_fs_path);
	itos(uid, str);
	ret = mkdir(str, 0777);
	if (ret == -1) {
		/*
		 * If the folder already exists, then it is alright. anything
		 * else should be killed
		 */
		if (errno != EEXIST) {
			perror("mkdir");
			return -1;
		}
	}
	ret = chdir(str);
	if (ret == -1) {
		/* Again, i am just quitting the program! */
		perror("chdir");
		return -1;
	}
	/* If using cpusets set cpus and mems:
	 *
	 * set_value("cpus", "0");
	 * set_value("mems", "0");
	 */
	set_notify_release(1);
	add_task_pid(pid);
	return 0;
}

int handle_msg(struct cn_msg *cn_hdr)
{
	struct proc_event *ev;
	int ret = 0;	/* initialized so non-UID events report success */

	ev = (struct proc_event *)cn_hdr->data;
	switch (ev->what) {
	case PROC_EVENT_UID:
		printf("UID Change happening\n");
		printf("UID = %d\tPID=%d\n", ev->event_data.id.e.euid,
		       ev->event_data.id.process_pid);
		ret = change_group(ev->event_data.id.process_pid,
				   ev->event_data.id.r.ruid);
		break;
	case PROC_EVENT_FORK:
	case PROC_EVENT_EXEC:
	case PROC_EVENT_EXIT:
	default:
		break;
	}
	return ret;
}

int main(int argc, char **argv)
{
	int sk_nl;
	int err;
	struct sockaddr_nl my_nla, kern_nla, from_nla;
	socklen_t from_nla_len;
	char buff[BUFF_SIZE];
	int rc = -1;
	struct nlmsghdr *nl_hdr;
	struct cn_msg *cn_hdr;
	enum proc_cn_mcast_op *mcop_msg;
	size_t recv_len = 0;
	FILE *f;

	if (argc == 1)
		strcpy(cpuctl_fs_path, "/dev/cpuctl");
	else
		strcpy(cpuctl_fs_path, argv[1]);

	chdir(cpuctl_fs_path);

	f = fopen("tasks", "r");
	if (f == NULL) {
		printf("Container not mounted at %s\n", cpuctl_fs_path);
		return -1;
	}
	fclose(f);

	f = fopen("notify_on_release", "r");
	if (f == NULL) {
		printf("Container not mounted at %s\n", cpuctl_fs_path);
		return -1;
	}
	fclose(f);

	if (getuid() != 0) {
		printf("Only root can start/stop the fork connector\n");
		return 0;
	}

	/*
	 * Create an endpoint for communication. Use the kernel user
	 * interface device (PF_NETLINK) which is a datagram oriented
	 * service (SOCK_DGRAM). The protocol used is the connector
	 * protocol (NETLINK_CONNECTOR)
	 */
	sk_nl = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
	if (sk_nl == -1) {
		printf("socket sk_nl error");
		return rc;
	}

	my_nla.nl_family = AF_NETLINK;
	my_nla.nl_groups = CN_IDX_PROC;
	my_nla.nl_pid = getpid();

	kern_nla.nl_family = AF_NETLINK;
	kern_nla.nl_groups = CN_IDX_PROC;
	kern_nla.nl_pid = 1;

	err = bind(sk_nl, (struct sockaddr *)&my_nla, sizeof(my_nla));
	if (err == -1) {
		printf("binding sk_nl error");
		goto close_and_exit;
	}

	nl_hdr = (struct nlmsghdr *)buff;
	cn_hdr = (struct cn_msg *)NLMSG_DATA(nl_hdr);
	mcop_msg = (enum proc_cn_mcast_op *)&cn_hdr->data[0];

	printf("sending proc connector: PROC_CN_MCAST_LISTEN... ");
	memset(buff, 0, sizeof(buff));
	*mcop_msg = PROC_CN_MCAST_LISTEN;
	signal(INTR_SIG, handle_intr);

	/* fill the netlink header */
	nl_hdr->nlmsg_len = SEND_MESSAGE_LEN;
	nl_hdr->nlmsg_type = NLMSG_DONE;
	nl_hdr->nlmsg_flags = 0;
	nl_hdr->nlmsg_seq = 0;
	nl_hdr->nlmsg_pid = getpid();

	/* fill the connector header */
	cn_hdr->id.idx = CN_IDX_PROC;
	cn_hdr->id.val = CN_VAL_PROC;
	cn_hdr->seq = 0;
	cn_hdr->ack = 0;
	cn_hdr->len = sizeof(enum proc_cn_mcast_op);

	if (send(sk_nl, nl_hdr, nl_hdr->nlmsg_len, 0) != nl_hdr->nlmsg_len) {
		printf("failed to send proc connector mcast ctl op!\n");
		goto close_and_exit;
	}
	printf("sent\n");

	for (memset(buff, 0, sizeof(buff)), from_nla_len = sizeof(from_nla);
	     ; memset(buff, 0, sizeof(buff)), from_nla_len = sizeof(from_nla)) {
		struct nlmsghdr *nlh = (struct nlmsghdr *)buff;

		memcpy(&from_nla, &kern_nla, sizeof(from_nla));
		recv_len = recvfrom(sk_nl, buff, BUFF_SIZE, 0,
				    (struct sockaddr *)&from_nla, &from_nla_len);
		if (recv_len < 1)
			continue;
		while (NLMSG_OK(nlh, recv_len)) {
			cn_hdr = NLMSG_DATA(nlh);
			if (nlh->nlmsg_type == NLMSG_NOOP)
				continue;
			if ((nlh->nlmsg_type == NLMSG_ERROR) ||
			    (nlh->nlmsg_type == NLMSG_OVERRUN))
				break;
			if (handle_msg(cn_hdr) < 0)
				goto close_and_exit;
			if (nlh->nlmsg_type == NLMSG_DONE)
				break;
			nlh = NLMSG_NEXT(nlh, recv_len);
		}
	}

close_and_exit:
	close(sk_nl);
	exit(rc);

	return 0;
}

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 11:39 ` Srivatsa Vaddagiri @ 2007-07-11 11:42 ` Paul Jackson 2007-07-11 12:06 ` Peter Zijlstra 2007-07-11 12:30 ` Srivatsa Vaddagiri 1 sibling, 1 reply; 535+ messages in thread From: Paul Jackson @ 2007-07-11 11:42 UTC (permalink / raw) To: vatsa; +Cc: mingo, containers, menage, akpm, linux-kernel Srivatsa wrote: > The fact that we will have two interface for group scheduler in 2.6.24 > is what worries me a bit (one user-id based and other container based). Yeah. One -could- take linear combinations, as Peter drew in his ascii art, but would one -want- to do that? I imagine some future time, when users of this wonder why the API is more complicated than seems necessary, with two factors determining task-groups where one seems sufficient, and the answer is "the other factor, user-id's, is just there because we needed it as an interim mechanism, and then had to keep it, to preserve ongoing compatibility. That's not a very persuasive justification. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 11:42 ` Paul Jackson @ 2007-07-11 12:06 ` Peter Zijlstra 2007-07-11 17:03 ` Paul Jackson 0 siblings, 1 reply; 535+ messages in thread From: Peter Zijlstra @ 2007-07-11 12:06 UTC (permalink / raw) To: Paul Jackson; +Cc: vatsa, mingo, containers, menage, akpm, linux-kernel On Wed, 2007-07-11 at 04:42 -0700, Paul Jackson wrote: > Srivatsa wrote: > > The fact that we will have two interface for group scheduler in 2.6.24 > > is what worries me a bit (one user-id based and other container based). > > Yeah. > > One -could- take linear combinations, as Peter drew in his ascii art, > but would one -want- to do that? I'd very much like to have it, but that is just me. We could take a weight of 0 to mean disabling of that grouping and default to that. That way it would not complicate regular behaviour. It could be implemented with a simple hashing scheme where sched_group_hash(tsk) and sched_group_cmp(tsk, group->some_task) could be used to identify a schedule group. pseudo code:

u64 sched_group_hash(struct task_struct *tsk)
{
	u64 hash = 0;

	if (tsk->pid->weight)
		hash_add(&hash, tsk->pid);
	if (tsk->pgrp->weight)
		hash_add(&hash, tsk->pgrp);
	if (tsk->uid->weight)
		hash_add(&hash, tsk->uid);
	if (tsk->container->weight)
		hash_add(&hash, tsk->container);
	...
	return hash;
}

s64 sched_group_cmp(struct task_struct *t1, struct task_struct *t2)
{
	s64 cmp;

	if (t1->pid->weight || t2->pid->weight) {
		cmp = t1->pid->weight - t2->pid->weight;
		if (cmp)
			return cmp;
	}
	...
	return 0;
}

u64 sched_group_weight(struct task_struct *tsk)
{
	u64 weight = 1024; /* 1 fixed point 10 bits */

	if (tsk->pid->weight) {
		weight *= tsk->pid->weight;
		weight /= 1024;
	}
	....
	return weight;
}

^ permalink raw reply [flat|nested] 535+ messages in thread
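[Editorial note: Peter's pseudo code above multiplies per-dimension weights in 10-bit fixed point, with a weight of 0 meaning "this grouping is disabled". A small Python model of just that arithmetic, for readers who want to check the numbers — the function name is made up for illustration; this is not kernel code.]

```python
UNIT = 1024  # 1.0 in 10-bit fixed point, as in Peter's sched_group_weight()

def combined_weight(weights):
    """Combine per-dimension weights (pid, pgrp, uid, container, ...).
    A weight of 0 disables that dimension, per the proposal."""
    w = UNIT
    for dim in weights:
        if dim:                  # 0 => grouping disabled, skip it
            w = w * dim // UNIT  # fixed-point multiply
    return w

# uid weight 2.0, container weight 0.5, pid/pgrp grouping disabled:
# 2.0 * 0.5 == 1.0, i.e. 1024 in fixed point.
print(combined_weight([0, 0, 2048, 512]))  # -> 1024
```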
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 12:06 ` Peter Zijlstra @ 2007-07-11 17:03 ` Paul Jackson 2007-07-11 18:47 ` Peter Zijlstra 0 siblings, 1 reply; 535+ messages in thread From: Paul Jackson @ 2007-07-11 17:03 UTC (permalink / raw) To: Peter Zijlstra; +Cc: vatsa, mingo, containers, menage, akpm, linux-kernel Peter wrote: > I'd very much like to have it, but that is just me. Why? [linear combinations of uid, container, pid, pgrp weighting] You provide some implementation details and complications, but no motivation that I noticed. Well ... a little motivation ... "just me", which would go a long way if your first name was Linus. For the rest of us ... ;). -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 17:03 ` Paul Jackson @ 2007-07-11 18:47 ` Peter Zijlstra 0 siblings, 0 replies; 535+ messages in thread From: Peter Zijlstra @ 2007-07-11 18:47 UTC (permalink / raw) To: Paul Jackson; +Cc: vatsa, mingo, containers, menage, akpm, linux-kernel On Wed, 2007-07-11 at 10:03 -0700, Paul Jackson wrote: > Peter wrote: > > I'd very much like to have it, but that is just me. > > Why? [linear combinations of uid, container, pid, pgrp weighting] Good question, and I really have no other answer than that it seems useful and not impossible (or even hard) to implement :-/ I'm not even that interested in using it, it just seems like a nice idea. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 11:39 ` Srivatsa Vaddagiri 2007-07-11 11:42 ` Paul Jackson @ 2007-07-11 12:30 ` Srivatsa Vaddagiri 1 sibling, 0 replies; 535+ messages in thread From: Srivatsa Vaddagiri @ 2007-07-11 12:30 UTC (permalink / raw) To: Ingo Molnar; +Cc: containers, menage, Paul Jackson, akpm, linux-kernel On Wed, Jul 11, 2007 at 05:09:53PM +0530, Srivatsa Vaddagiri wrote: > > (btw., if this goes into 2.6.23 then we cannot possibly turn it off in 2.6.24, > > The fact that we will have two interface for group scheduler in 2.6.24 > is what worries me a bit (one user-id based and other container based). I know breaking a user interface is a bad thing across releases. But in this particular case, it's probably ok (since fair-group scheduling is a brand new feature in Linux)? If we have that option of breaking the API between 2.6.23 and 2.6.24 for the fair-group scheduler, then we are in a much more flexible position. For 2.6.23, I can send a user-id based interface for the fair-group scheduler (with some /proc interface to tune group nice value). For 2.6.24, this user-id interface will be removed and we will instead switch to the container based interface. Fair-user scheduling will continue to work; it's just that users will have to use a daemon (sources sent in previous mail) to enable it on top of the container-based interface. Hmm..? -- Regards, vatsa ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 10:03 ` Srivatsa Vaddagiri 2007-07-11 10:19 ` Ingo Molnar @ 2007-07-11 11:10 ` Paul Jackson 2007-07-11 11:24 ` Peter Zijlstra 1 sibling, 1 reply; 535+ messages in thread From: Paul Jackson @ 2007-07-11 11:10 UTC (permalink / raw) To: vatsa; +Cc: mingo, akpm, menage, linux-kernel, containers Srivatsa wrote: > So Ingo was proposing we use cpuset as that user interface to manage > task-groups. This will be only for 2.6.23. Good explanation - thanks. In short, the proposal was to use the task partition defined by cpusets to define CFS task-groups, until the real process containers are available. Or, I see in the next message, Ingo responding favorably to your alternative, using task uid's to partition the tasks into CFS task-groups. Yeah, Ingo's preference for using uid's (or gid's ??) sounds right to me - a sustainable API. Wouldn't want to be adding a cpuset API for a single 2.6.N release. ... gid's -- why not? -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 11:10 ` Paul Jackson @ 2007-07-11 11:24 ` Peter Zijlstra 2007-07-11 11:30 ` Peter Zijlstra 0 siblings, 1 reply; 535+ messages in thread From: Peter Zijlstra @ 2007-07-11 11:24 UTC (permalink / raw) To: Paul Jackson; +Cc: vatsa, mingo, akpm, menage, linux-kernel, containers On Wed, 2007-07-11 at 04:10 -0700, Paul Jackson wrote: > Srivatsa wrote: > > So Ingo was proposing we use cpuset as that user interface to manage > > task-groups. This will be only for 2.6.23. > > Good explanation - thanks. > > In short, the proposal was to use the task partition defined by cpusets > to define CFS task-groups, until the real process containers are > available. > > Or, I see in the next message, Ingo responding favorably to your > alternative, using task uid's to partition the tasks into CFS > task-groups. > > Yeah, Ingo's preference for using uid's (or gid's ??) sounds right to > me - a sustainable API. > > Wouldn't want to be adding a cpuset API for a single 2.6.N release. > > .... gid's -- why not? Or process or process groups, or all of the above :-) One thing to think on though, we cannot have per process,uid,gid,pgrp scheduling for one release only. So we'd have to manage interaction with process containers. It might be that a simple weight multiplication scheme is good enough: weight = uid_weight * pgrp_weight * container_weight Of course, if we'd only have a single level group scheduler (as was proposed IIRC) it'd have to create intersection sets (as there might be non trivial overlaps) based on these various weights and schedule these resulting sets instead of the initial groupings. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 11:24 ` Peter Zijlstra @ 2007-07-11 11:30 ` Peter Zijlstra 2007-07-11 13:14 ` Srivatsa Vaddagiri 0 siblings, 1 reply; 535+ messages in thread From: Peter Zijlstra @ 2007-07-11 11:30 UTC (permalink / raw) To: Paul Jackson; +Cc: vatsa, mingo, akpm, menage, linux-kernel, containers On Wed, 2007-07-11 at 13:24 +0200, Peter Zijlstra wrote: > On Wed, 2007-07-11 at 04:10 -0700, Paul Jackson wrote: > > Srivatsa wrote: > > > So Ingo was proposing we use cpuset as that user interface to manage > > > task-groups. This will be only for 2.6.23. > > > > Good explanation - thanks. > > > > In short, the proposal was to use the task partition defined by cpusets > > to define CFS task-groups, until the real process containers are > > available. > > > > Or, I see in the next message, Ingo responding favorably to your > > alternative, using task uid's to partition the tasks into CFS > > task-groups. > > > > Yeah, Ingo's preference for using uid's (or gid's ??) sounds right to > > me - a sustainable API. > > > > Wouldn't want to be adding a cpuset API for a single 2.6.N release. > > > > .... gid's -- why not? > > > Or process or process groups, or all of the above :-) > > One thing to think on though, we cannot have per process,uid,gid,pgrp > scheduling for one release only. So we'd have to manage interaction with > process containers. It might be that a simple weight multiplication > scheme is good enough: > > weight = uid_weight * pgrp_weight * container_weight > > Of course, if we'd only have a single level group scheduler (as was > proposed IIRC) it'd have to create intersection sets (as there might be > non trivial overlaps) based on these various weights and schedule these > resulting sets instead of the initial groupings. 
Let's illustrate with some ASCII art: so we have this dual level weight grouping (uid, container)

uid:       a a a a a b b b b b c c c c c
container: A A A A A A A B B B B B B B B

set:       1 1 1 1 1 2 2 3 3 3 4 4 4 4 4

resulting in schedule sets 1,2,3,4 so that (for instance) weight_2 = weight_b * weight_A ^ permalink raw reply [flat|nested] 535+ messages in thread
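[Editorial note: the schedulable sets in the ASCII art are simply the distinct (uid, container) label pairs, numbered in order of first appearance. A few lines of Python reproduce the example; the function is illustrative only, not part of any proposed patch.]

```python
def schedule_sets(*groupings):
    """Map each task to a set id; tasks sharing every grouping label
    share a set. Ids are assigned in order of first appearance."""
    ids = {}
    out = []
    for labels in zip(*groupings):      # one (uid, container, ...) per task
        if labels not in ids:
            ids[labels] = len(ids) + 1
        out.append(ids[labels])
    return out

uid       = "a a a a a b b b b b c c c c c".split()
container = "A A A A A A A B B B B B B B B".split()
print(schedule_sets(uid, container))
# -> [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4]
```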
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 11:30 ` Peter Zijlstra @ 2007-07-11 13:14 ` Srivatsa Vaddagiri 0 siblings, 0 replies; 535+ messages in thread From: Srivatsa Vaddagiri @ 2007-07-11 13:14 UTC (permalink / raw) To: Peter Zijlstra Cc: Paul Jackson, akpm, linux-kernel, containers, menage, mingo On Wed, Jul 11, 2007 at 01:30:40PM +0200, Peter Zijlstra wrote: > > One thing to think on though, we cannot have per process,uid,gid,pgrp > > scheduling for one release only. So we'd have to manage interaction with > > process containers. It might be that a simple weight multiplication > > scheme is good enough: > > > > weight = uid_weight * pgrp_weight * container_weight We would need something like this to flatten hierarchy, so that for example it is possible to do fair-container scheduling + fair-user/process scheduling inside a container using a hierarchy depth of just 1 (containers) that core scheduler understands. We discussed this a bit at http://marc.info/?l=linux-kernel&m=118054481416140&w=2 and is very much on my todo list to experiment with. > > Of course, if we'd only have a single level group scheduler (as was > > proposed IIRC) it'd have to create intersection sets (as there might be > > non trivial overlaps) based on these various weights and schedule these > > resulting sets instead of the initial groupings. > > Lets illustrate with some ASCII art:
>
> so we have this dual level weight grouping (uid, container)
>
> uid:       a a a a a b b b b b c c c c c
> container: A A A A A A A B B B B B B B B
>
> set:       1 1 1 1 1 2 2 3 3 3 4 4 4 4 4
>
> resulting in schedule sets 1,2,3,4

Wouldn't it be simpler if admin created these sets as containers directly? i.e:

uid:       a a a a a b b b b b c c c c c
container: 1 1 1 1 1 2 2 3 3 3 4 4 4 4 4

That way scheduler will not have to "guess" such intersecting schedulable sets/groups. It seems much simpler to me this way.
Surely there is some policy which is driving some tasks of userid 'b' to be in container A and some to be in B. It should be trivial enough to hook onto that policy making script and create separate containers like above. > so that (for instance) weight_2 = weight_b * weight_A -- Regards, vatsa ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 5:29 ` Andrew Morton 2007-07-11 6:03 ` Srivatsa Vaddagiri 2007-07-11 9:04 ` Ingo Molnar @ 2007-07-11 19:44 ` Paul Menage 2007-07-12 5:39 ` Srivatsa Vaddagiri 2 siblings, 1 reply; 535+ messages in thread From: Paul Menage @ 2007-07-11 19:44 UTC (permalink / raw) To: Andrew Morton; +Cc: vatsa, linux-kernel, containers, Ingo Molnar On 7/10/07, Andrew Morton <akpm@linux-foundation.org> wrote: > > I'm inclined to take the cautious route here - I don't think people will be > dying for the CFS thingy (which I didn't even know about?) in .23, and it's > rather a lot of infrastructure to add for a CPU scheduler configurator Selecting the relevant patches to give enough of the container framework to support a CFS container subsystem (slightly tweaked/updated versions of the base patch, procfs interface patch and tasks file interface patch) is about 1600 lines in kernel/container.c and another 200 in kernel/container.h, which is about 99% of the non-documentation changes. So not tiny, but it's not very intrusive on the rest of the kernel, and would avoid having to introduce a temporary API based on uids. Paul ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: containers (was Re: -mm merge plans for 2.6.23) 2007-07-11 19:44 ` Paul Menage @ 2007-07-12 5:39 ` Srivatsa Vaddagiri 0 siblings, 0 replies; 535+ messages in thread From: Srivatsa Vaddagiri @ 2007-07-12 5:39 UTC (permalink / raw) To: Paul Menage; +Cc: Andrew Morton, containers, Ingo Molnar, linux-kernel On Wed, Jul 11, 2007 at 12:44:42PM -0700, Paul Menage wrote: > >I'm inclined to take the cautious route here - I don't think people will be > >dying for the CFS thingy (which I didn't even know about?) in .23, and it's > >rather a lot of infrastructure to add for a CPU scheduler configurator > > Selecting the relevant patches to give enough of the container > framework to support a CFS container subsystem (slightly > tweaked/updated versions of the base patch, procfs interface patch and > tasks file interface patch) is about 1600 lines in kernel/container.c > and another 200 in kernel/container.h, which is about 99% of the > non-documentation changes. > > So not tiny, but it's not very intrusive on the rest of the kernel, > and would avoid having to introduce a temporary API based on uids. Yes that would be good. As long as the user-land interface for process containers doesn't change (much?) between 2.6.23 and later releases this should be a good workaround for us. -- Regards, vatsa ^ permalink raw reply [flat|nested] 535+ messages in thread
* fallocate-implementation-on-i86-x86_64-and-powerpc.patch (was: re: -mm merge plans for 2.6.23) 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (4 preceding siblings ...) 2007-07-10 10:52 ` containers (was Re: -mm merge plans for 2.6.23) Srivatsa Vaddagiri @ 2007-07-10 11:52 ` Theodore Tso 2007-07-10 17:15 ` Andrew Morton 2007-07-10 12:37 ` clam Andy Whitcroft ` (20 subsequent siblings) 26 siblings, 1 reply; 535+ messages in thread From: Theodore Tso @ 2007-07-10 11:52 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, Amit Arora, Andi Kleen, Paul Mackerras, Benjamin Herrenschmidt, Arnd Bergmann, Luck, Tony, Heiko Carstens, Martin Schwidefsky, Theodore Ts'o, Mark Fasheh, Andrew Morton On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: > Merge > > fallocate-implementation-on-i86-x86_64-and-powerpc.patch Andrew, Could you replace the comment/header section of fallocate-implementation-on-i86-x86_64-and-powerpc.patch with the following (attached below)? This is from the ext4 patches, where Amit had cleaned up the description, which will make for a cleaner and easier to understand submission into the git tree. I've reviewed the other fallocate patches, noting the request to drop the s390 patches since Martin has said he will wire it up after this hits mainline, and the only other change that I've found between what we have in the ext4 tree and -mm is that we have fallocate-on-ia64.patch and fallocate-on-ia64-fix.patch merged into a single patch. It probably would be better to merge them before sending it off to Linus, in the interests of cleanliness and making the tree more git-bisect friendly. Regards, - Ted From: Amit Arora <aarora@in.ibm.com> sys_fallocate() implementation on i386, x86_64 and powerpc fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. 
Each file system implementation that wants to use this feature will need to support an inode operation called ->fallocate(). Applications can use this feature to avoid fragmentation to a certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if the system later becomes full. Currently, glibc provides an interface called posix_fallocate() which can be used for a similar purpose. Though this has the advantage of working on all file systems, it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and, in case the kernel/filesystem does not implement it, fall back to the current implementation of writing zeroes to the new blocks. ToDos: 1. Implementation on other architectures (other than i386, x86_64, ppc64 and s390(x)). David Chinner has already posted a patch for ia64. 2. A generic file system operation to handle fallocate (generic_fallocate), for filesystems that do _not_ have the fallocate inode operation implemented. 3. Changes to glibc, a) to support fallocate() system call b) to make posix_fallocate() and posix_fallocate64() call fallocate() Signed-off-by: Amit Arora <aarora@in.ibm.com> Cc: Andi Kleen <ak@suse.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: fallocate-implementation-on-i86-x86_64-and-powerpc.patch (was: re: -mm merge plans for 2.6.23) 2007-07-10 11:52 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch (was: " Theodore Tso @ 2007-07-10 17:15 ` Andrew Morton 2007-07-10 17:44 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch Jeff Garzik 2007-07-10 19:07 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch (was: re: -mm merge plans for 2.6.23) Theodore Tso 0 siblings, 2 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-10 17:15 UTC (permalink / raw) To: Theodore Tso Cc: linux-kernel, Amit Arora, Andi Kleen, Paul Mackerras, Benjamin Herrenschmidt, Arnd Bergmann, Luck, Tony, Heiko Carstens, Martin Schwidefsky, Mark Fasheh On Tue, 10 Jul 2007 07:52:51 -0400 Theodore Tso <tytso@mit.edu> wrote: > On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: > > Merge > > > > fallocate-implementation-on-i86-x86_64-and-powerpc.patch > > Andrew, > > Could you replace the comment/header section of > fallocate-implementation-on-i86-x86_64-and-powerpc.patch with the > following (attached below) ? This is from the ext4 patches, where > Amit had cleaned up description, which will make for a cleaner and > easier to understand submission into the git tree. There were issues with the x86 patch, the s390 patch was wrong and Tony wants the ia64 patch to use a different syscall number. So I dropped everything. Let's start again from scratch. I'd suggest that for now we go with just an i386/x86_64 implementation, let the arch maintainers wire things up when that has settled down. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: fallocate-implementation-on-i86-x86_64-and-powerpc.patch 2007-07-10 17:15 ` Andrew Morton @ 2007-07-10 17:44 ` Jeff Garzik 2007-07-10 23:27 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch Paul Mackerras 2007-07-10 19:07 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch (was: re: -mm merge plans for 2.6.23) Theodore Tso 1 sibling, 1 reply; 535+ messages in thread From: Jeff Garzik @ 2007-07-10 17:44 UTC (permalink / raw) To: Andrew Morton Cc: Theodore Tso, linux-kernel, Amit Arora, Andi Kleen, Paul Mackerras, Benjamin Herrenschmidt, Arnd Bergmann, Luck, Tony, Heiko Carstens, Martin Schwidefsky, Mark Fasheh, linux-arch Andrew Morton wrote: > So I dropped everything. Let's start again from scratch. I'd suggest that > for now we go with just an i386/x86_64 implementation, let the arch > maintainers wire things up when that has settled down. It's my observation that that plan usually works the best. Arch maintainers come along and wire up batches of syscalls when they have a chance to glance at the ABI, and catch up with x86[-64]. Jeff ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: fallocate-implementation-on-i86-x86_64-and-powerpc.patch 2007-07-10 17:44 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch Jeff Garzik @ 2007-07-10 23:27 ` Paul Mackerras 2007-07-11 0:16 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch Andrew Morton 0 siblings, 1 reply; 535+ messages in thread From: Paul Mackerras @ 2007-07-10 23:27 UTC (permalink / raw) To: Jeff Garzik Cc: Andrew Morton, Theodore Tso, linux-kernel, Amit Arora, Andi Kleen, Benjamin Herrenschmidt, Arnd Bergmann, Luck, Tony, Heiko Carstens, Martin Schwidefsky, Mark Fasheh, linux-arch Jeff Garzik writes: > Andrew Morton wrote: > > So I dropped everything. Let's start again from scratch. I'd suggest that > > for now we go with just an i386/x86_64 implementation, let the arch > > maintainers wire things up when that has settled down. > > > It's my observation that that plan usually works the best. Arch ... except when the initial implementer picks an argument order which doesn't work on some archs, as happened with sys_sync_file_range. That is also the case with fallocate IIRC. We did come up with an order that worked for everybody, but that discussion seemed to get totally ignored by the ext4 developers. Paul. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: fallocate-implementation-on-i86-x86_64-and-powerpc.patch 2007-07-10 23:27 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch Paul Mackerras @ 2007-07-11 0:16 ` Andrew Morton 2007-07-11 0:50 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch Paul Mackerras 0 siblings, 1 reply; 535+ messages in thread From: Andrew Morton @ 2007-07-11 0:16 UTC (permalink / raw) To: Paul Mackerras Cc: Jeff Garzik, Theodore Tso, linux-kernel, Amit Arora, Andi Kleen, Benjamin Herrenschmidt, Arnd Bergmann, Luck, Tony, Heiko Carstens, Martin Schwidefsky, Mark Fasheh, linux-arch On Wed, 11 Jul 2007 09:27:40 +1000 Paul Mackerras <paulus@samba.org> wrote: > We did come up with an order that worked for everybody, but that > discussion seemed to get totally ignored by the ext4 developers. It was a long discussion. Can someone please remind us what the signature of the syscall (and the compat handler) should be? ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: fallocate-implementation-on-i86-x86_64-and-powerpc.patch 2007-07-11 0:16 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch Andrew Morton @ 2007-07-11 0:50 ` Paul Mackerras 2007-07-11 15:39 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch Theodore Tso 0 siblings, 1 reply; 535+ messages in thread From: Paul Mackerras @ 2007-07-11 0:50 UTC (permalink / raw) To: Andrew Morton Cc: Jeff Garzik, Theodore Tso, linux-kernel, Amit Arora, Andi Kleen, Benjamin Herrenschmidt, Arnd Bergmann, Luck, Tony, Heiko Carstens, Martin Schwidefsky, Mark Fasheh, linux-arch Andrew Morton writes: > On Wed, 11 Jul 2007 09:27:40 +1000 > Paul Mackerras <paulus@samba.org> wrote: > > > We did come up with an order that worked for everybody, but that > > discussion seemed to get totally ignored by the ext4 developers. > > It was a long discussion. > > Can someone please remind us what the signature of the syscall > (and the compat handler) should be? long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) should work for everybody. The compat handler would be long compat_sys_fallocate(u32 offset_hi, u32 offset_lo, u32 len_hi, u32 len_lo, int fd, int mode) for big-endian, or swap hi/lo for little-endian. (Actually it would be good to have an arch-dependent "stitch two args together" macro and call them offset_0, offset_1 etc.) Paul. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: fallocate-implementation-on-i86-x86_64-and-powerpc.patch 2007-07-11 0:50 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch Paul Mackerras @ 2007-07-11 15:39 ` Theodore Tso 2007-07-11 18:47 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch Heiko Carstens 0 siblings, 1 reply; 535+ messages in thread From: Theodore Tso @ 2007-07-11 15:39 UTC (permalink / raw) To: Paul Mackerras Cc: Andrew Morton, Jeff Garzik, linux-kernel, Amit Arora, Andi Kleen, Benjamin Herrenschmidt, Arnd Bergmann, Luck, Tony, Heiko Carstens, Martin Schwidefsky, Mark Fasheh, linux-arch On Wed, Jul 11, 2007 at 10:50:49AM +1000, Paul Mackerras wrote: > > On Wed, 11 Jul 2007 09:27:40 +1000 > > Paul Mackerras <paulus@samba.org> wrote: > > > > > We did come up with an order that worked for everybody, but that > > > discussion seemed to get totally ignored by the ext4 developers. Well, in the end it was a toss-up between asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) and asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) There were a number of folks who preferred having int fd first, and I *thought* Amit had gotten agreement from either Martin or Heiko that it was ok to do this as an exception, even though it was extra work for that arch. But if not, we can try going back to the second alternative, or even the 6 32-bit args (off_high, off_low, len_high, len_low) approach, but I think that drew even more fire. Basically, no one approach made everyone happy, and at the end of the day sometimes you have to choose. I thought we had settled this in May with something that people could live with, but if we need to reopen the discussion, better now than later..... - Ted ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: fallocate-implementation-on-i86-x86_64-and-powerpc.patch 2007-07-11 15:39 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch Theodore Tso @ 2007-07-11 18:47 ` Heiko Carstens 2007-07-11 20:32 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch Martin Schwidefsky 0 siblings, 1 reply; 535+ messages in thread From: Heiko Carstens @ 2007-07-11 18:47 UTC (permalink / raw) To: Theodore Tso, Paul Mackerras, Andrew Morton, Jeff Garzik, linux-kernel, Amit Arora, Andi Kleen, Benjamin Herrenschmidt, Arnd Bergmann, Luck, Tony, Martin Schwidefsky, Mark Fasheh, linux-arch On Wed, Jul 11, 2007 at 11:39:39AM -0400, Theodore Tso wrote: > On Wed, Jul 11, 2007 at 10:50:49AM +1000, Paul Mackerras wrote: > > > On Wed, 11 Jul 2007 09:27:40 +1000 > > > Paul Mackerras <paulus@samba.org> wrote: > > > > > > > We did come up with an order that worked for everybody, but that > > > > discussion seemed to get totally ignored by the ext4 developers. > > Well, in the end it was a toss-up between > > asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) > > and > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) > > There were a number of folks who preferred having int fd first, and I > *thought* Amit had gotten agreement from either Martin or Heiko that > it was ok to do this as an exception, even though it was extra work > for that arch. But if not, we can try going back the second > alternative, or even the 6 32-bits args (off_high, off_low, len_high, > len_low) approach, but I think that drew even more fire. The second approach would work for all architectures.. but some people didn't like (no technical reason) not having fd as first argument. Just go ahead with the current approach. s390 seems to be the only architecture which suffers from this and I wouldn't like to start this discussion again. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: fallocate-implementation-on-i86-x86_64-and-powerpc.patch 2007-07-11 18:47 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch Heiko Carstens @ 2007-07-11 20:32 ` Martin Schwidefsky 0 siblings, 0 replies; 535+ messages in thread From: Martin Schwidefsky @ 2007-07-11 20:32 UTC (permalink / raw) To: Heiko Carstens Cc: Theodore Tso, Paul Mackerras, Andrew Morton, Jeff Garzik, linux-kernel, Amit Arora, Andi Kleen, Benjamin Herrenschmidt, Arnd Bergmann, Luck, Tony, Mark Fasheh, linux-arch On Wed, 2007-07-11 at 20:47 +0200, Heiko Carstens wrote: > > There were a number of folks who preferred having int fd first, and I > > *thought* Amit had gotten agreement from either Martin or Heiko that > > it was ok to do this as an exception, even though it was extra work > > for that arch. But if not, we can try going back the second > > alternative, or even the 6 32-bits args (off_high, off_low, len_high, > > len_low) approach, but I think that drew even more fire. > > The second approach would work for all architectures.. but some people > didn't like (no technical reason) not having fd as first argument. For s390 we would have liked the second approach with the two int's as last arguments since it would avoid the wrapper in the kernel. It does not avoid the wrapper in user space since the call uses 6 registers on 31 bit. So the fallocate call needs special treatment in glibc, so I don't mind that it needs another wrapper in the kernel. > Just go ahead with the current approach. s390 seems to be the only > architecture which suffers from this and I wouldn't like to start this > discussion again. Yes, don't worry about s390 for fallocate, the patch that had been in -mm only had an incorrect system call number. The wrapper is fine. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: fallocate-implementation-on-i86-x86_64-and-powerpc.patch (was: re: -mm merge plans for 2.6.23) 2007-07-10 17:15 ` Andrew Morton 2007-07-10 17:44 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch Jeff Garzik @ 2007-07-10 19:07 ` Theodore Tso 2007-07-10 19:31 ` Andrew Morton 1 sibling, 1 reply; 535+ messages in thread From: Theodore Tso @ 2007-07-10 19:07 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, Amit Arora, Andi Kleen, Paul Mackerras, Benjamin Herrenschmidt, Arnd Bergmann, Luck, Tony, Heiko Carstens, Martin Schwidefsky, Mark Fasheh On Tue, Jul 10, 2007 at 10:15:58AM -0700, Andrew Morton wrote: > So I dropped everything. Let's start again from scratch. I'd suggest that > for now we go with just an i386/x86_64 implementation, let the arch > maintainers wire things up when that has settled down. Ok, so no objections if we push the i386/x86_64 implementation (only), plus the ext4 support to Linus? - Ted ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: fallocate-implementation-on-i86-x86_64-and-powerpc.patch (was: re: -mm merge plans for 2.6.23) 2007-07-10 19:07 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch (was: re: -mm merge plans for 2.6.23) Theodore Tso @ 2007-07-10 19:31 ` Andrew Morton 0 siblings, 0 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-10 19:31 UTC (permalink / raw) To: Theodore Tso Cc: linux-kernel, Amit Arora, Andi Kleen, Paul Mackerras, Benjamin Herrenschmidt, Arnd Bergmann, Luck, Tony, Heiko Carstens, Martin Schwidefsky, Mark Fasheh On Tue, 10 Jul 2007 15:07:35 -0400 Theodore Tso <tytso@mit.edu> wrote: > On Tue, Jul 10, 2007 at 10:15:58AM -0700, Andrew Morton wrote: > > So I dropped everything. Let's start again from scratch. I'd suggest that > > for now we go with just an i386/x86_64 implementation, let the arch > > maintainers wire things up when that has settled down. > > Ok, so no objections if we push the i386/x86_64 implementation (only), > plus the ext4 support to Linus? > Sounds like a plan. I haven't seriously looked at ext4 code in many months. When I did I found the changes to be quite incomprehensible, very, very poorly commented and with quite a lot of odd-looking things about which I asked but got, iirc, no useful reply. Hopefully it got better. Which patches are you proposing merging into 2.6.23? ^ permalink raw reply [flat|nested] 535+ messages in thread
* clam 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (5 preceding siblings ...) 2007-07-10 11:52 ` fallocate-implementation-on-i86-x86_64-and-powerpc.patch (was: " Theodore Tso @ 2007-07-10 12:37 ` Andy Whitcroft 2007-07-11 9:34 ` Re: -mm merge plans -- lumpy reclaim Andy Whitcroft 2007-07-10 15:08 ` -mm merge plans for 2.6.23 Serge E. Hallyn ` (19 subsequent siblings) 26 siblings, 1 reply; 535+ messages in thread From: Andy Whitcroft @ 2007-07-10 12:37 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Mel Gorman, Christoph Lameter Andrew Morton wrote: [...] > lumpy-reclaim-v4.patch > have-kswapd-keep-a-minimum-order-free-other-than-order-0.patch > only-check-absolute-watermarks-for-alloc_high-and-alloc_harder-allocations.patch > > Lumpy reclaim. In a similar situation to Mel's patches. Stuck due to > general lack or interest and effort. The lumpy reclaim patches originally came out of work to support Mel's anti-fragmentation work. As such I think they have become somewhat attached to those patches. Whilst lumpy is most effective where placement controls are in place as offered by Mel's work, we see benefit from reduction in the "blunderbuss" effect when we reclaim at higher orders. While placement control is pretty much required for the very highest orders such as huge page size, lower order allocations are benefited in terms of lower collateral damage. There are now a few areas other than huge page allocations which can benefit. Stacks are still order 1. Jumbo frames want higher order contiguous pages for their incoming hardware buffers. SLUB is showing performance benefits from moving to a higher allocation order. All of these should benefit from more aggressive targeted reclaim, indeed I have been surprised just how often my test workloads trigger lumpy at order 1 to get new stacks. Truly representative work loads are hard to generate for some of these. 
Though we have heard some encouraging noises from those who can reproduce these problems. [...] -apw ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: Re: -mm merge plans -- lumpy reclaim 2007-07-10 12:37 ` clam Andy Whitcroft @ 2007-07-11 9:34 ` Andy Whitcroft 2007-07-11 16:46 ` Andrew Morton 0 siblings, 1 reply; 535+ messages in thread From: Andy Whitcroft @ 2007-07-11 9:34 UTC (permalink / raw) To: Andy Whitcroft Cc: Andrew Morton, linux-kernel, Mel Gorman, Christoph Lameter, Peter Zijlstra [Seems a PEBKAC occurred on the subject line, resending lest it become a victim of "oh that's spam".] Andy Whitcroft wrote: > Andrew Morton wrote: > > [...] >> lumpy-reclaim-v4.patch >> have-kswapd-keep-a-minimum-order-free-other-than-order-0.patch >> only-check-absolute-watermarks-for-alloc_high-and-alloc_harder-allocations.patch >> >> Lumpy reclaim. In a similar situation to Mel's patches. Stuck due to >> general lack or interest and effort. > > The lumpy reclaim patches originally came out of work to support Mel's > anti-fragmentation work. As such I think they have become somewhat > attached to those patches. Whilst lumpy is most effective where > placement controls are in place as offered by Mel's work, we see benefit > from reduction in the "blunderbuss" effect when we reclaim at higher > orders. While placement control is pretty much required for the very > highest orders such as huge page size, lower order allocations are > benefited in terms of lower collateral damage. > > There are now a few areas other than huge page allocations which can > benefit. Stacks are still order 1. Jumbo frames want higher order > contiguous pages for there incoming hardware buffers. SLUB is showing > performance benefits from moving to a higher allocation order. All of > these should benefit from more aggressive targeted reclaim, indeed I > have been surprised just how often my test workloads trigger lumpy at > order 1 to get new stacks. > > Truly representative work loads are hard to generate for some of these. > Though we have heard some encouraging noises from those who can > reproduce these problems. > > [...] 
> > -apw ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans -- lumpy reclaim 2007-07-11 9:34 ` Re: -mm merge plans -- lumpy reclaim Andy Whitcroft @ 2007-07-11 16:46 ` Andrew Morton 2007-07-11 18:38 ` Andy Whitcroft 2007-07-16 10:37 ` Mel Gorman 0 siblings, 2 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-11 16:46 UTC (permalink / raw) To: Andy Whitcroft Cc: linux-kernel, Mel Gorman, Christoph Lameter, Peter Zijlstra On Wed, 11 Jul 2007 10:34:31 +0100 Andy Whitcroft <apw@shadowen.org> wrote: > [Seems a PEBKAC occured on the subject line, resending lest it become a > victim of "oh thats spam".] > > Andy Whitcroft wrote: > > Andrew Morton wrote: > > > > [...] > >> lumpy-reclaim-v4.patch > >> have-kswapd-keep-a-minimum-order-free-other-than-order-0.patch > >> only-check-absolute-watermarks-for-alloc_high-and-alloc_harder-allocations.patch > >> > >> Lumpy reclaim. In a similar situation to Mel's patches. Stuck due to > >> general lack or interest and effort. > > > > The lumpy reclaim patches originally came out of work to support Mel's > > anti-fragmentation work. As such I think they have become somewhat > > attached to those patches. Whilst lumpy is most effective where > > placement controls are in place as offered by Mel's work, we see benefit > > from reduction in the "blunderbuss" effect when we reclaim at higher > > orders. While placement control is pretty much required for the very > > highest orders such as huge page size, lower order allocations are > > benefited in terms of lower collateral damage. > > > > There are now a few areas other than huge page allocations which can > > benefit. Stacks are still order 1. Jumbo frames want higher order > > contiguous pages for there incoming hardware buffers. SLUB is showing > > performance benefits from moving to a higher allocation order. All of > > these should benefit from more aggressive targeted reclaim, indeed I > > have been surprised just how often my test workloads trigger lumpy at > > order 1 to get new stacks. 
> > > > Truly representative work loads are hard to generate for some of these. > > Though we have heard some encouraging noises from those who can > > reproduce these problems. I'd expect that the main application for lumpy-reclaim is in keeping a pool of order-2 (say) pages in reserve for GFP_ATOMIC allocators. ie: jumbo frames. At present this relies upon the wakeup_kswapd(..., order) mechanism. How effective is this at solving the jumbo frame problem? (And do we still have a jumbo frame problem? Reports seems to have subsided) ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans -- lumpy reclaim 2007-07-11 16:46 ` Andrew Morton @ 2007-07-11 18:38 ` Andy Whitcroft 2007-07-16 10:37 ` Mel Gorman 1 sibling, 0 replies; 535+ messages in thread From: Andy Whitcroft @ 2007-07-11 18:38 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Mel Gorman, Christoph Lameter, Peter Zijlstra Andrew Morton wrote: >>>> lumpy-reclaim-v4.patch >>>> have-kswapd-keep-a-minimum-order-free-other-than-order-0.patch >>>> only-check-absolute-watermarks-for-alloc_high-and-alloc_harder-allocations.patch >>>> >>>> Lumpy reclaim. In a similar situation to Mel's patches. Stuck due to >>>> general lack or interest and effort. >>> The lumpy reclaim patches originally came out of work to support Mel's >>> anti-fragmentation work. As such I think they have become somewhat >>> attached to those patches. Whilst lumpy is most effective where >>> placement controls are in place as offered by Mel's work, we see benefit >>> from reduction in the "blunderbuss" effect when we reclaim at higher >>> orders. While placement control is pretty much required for the very >>> highest orders such as huge page size, lower order allocations are >>> benefited in terms of lower collateral damage. >>> >>> There are now a few areas other than huge page allocations which can >>> benefit. Stacks are still order 1. Jumbo frames want higher order >>> contiguous pages for there incoming hardware buffers. SLUB is showing >>> performance benefits from moving to a higher allocation order. All of >>> these should benefit from more aggressive targeted reclaim, indeed I >>> have been surprised just how often my test workloads trigger lumpy at >>> order 1 to get new stacks. >>> >>> Truly representative work loads are hard to generate for some of these. >>> Though we have heard some encouraging noises from those who can >>> reproduce these problems. 
> I'd expect that the main application for lumpy-reclaim is in keeping a pool > of order-2 (say) pages in reserve for GFP_ATOMIC allocators. ie: jumbo > frames. > > At present this relies upon the wakeup_kswapd(..., order) mechanism. > > How effective is this at solving the jumbo frame problem? The tie-in between allocator and kswapd is essentially unchanged, so if allocators are dropping below the watermarks at the specified order, reclaim will be triggered at that order. Reclaim continues until we return above the high watermarks, at the order at which we are reclaiming. What lumpy brings is a greater targeting of effort to get the pages. kswapd now uses the desired allocator order when applying reclaim. This leads to pressure being applied to contiguous areas at the required order, and so a higher chance of that order becoming available. Traditional reclaim could end up applying pressure to a number of pages, but not all pages in any area at the required order, leading to a very low chance of success. By targeting areas at the required order we significantly increase the chances of success for any given amount of reclaim. As we will reclaim until we have the desired number of free pages, we will have to reclaim less to achieve this compared to random reclaim. This certainly is appealing intuitively, and our testing at higher orders shows that the cost of each reclaimed page is lower and more importantly the time to reclaim each page is reduced. So for a 'continuing' consumer like an incoming packet stream, we should have to do much less work and thus disrupt the system as a whole much less to get its pages. Where demand for atomic higher order pages is not heavy we would expect kswapd to maintain free page levels more readily, and so cope better under higher demand. Though it should be stressed that without placement control success rates drop off significantly at higher orders as the probability of reclaim succeeding on all pages in the area tends to zero. 
> (And do we still have a jumbo frame problem? Reports seems to have subsided) It is not in the least bit clear if the problem is resolved or if the reporters have simply gone quiet. Overall the approach taken in lumpy reclaim seems to be a logical extension of the regular reclaim algorithm, leading to more efficient reclaim. -apw ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans -- lumpy reclaim 2007-07-11 16:46 ` Andrew Morton 2007-07-11 18:38 ` Andy Whitcroft @ 2007-07-16 10:37 ` Mel Gorman 1 sibling, 0 replies; 535+ messages in thread From: Mel Gorman @ 2007-07-16 10:37 UTC (permalink / raw) To: Andrew Morton Cc: Andy Whitcroft, linux-kernel, Christoph Lameter, Peter Zijlstra On (11/07/07 09:46), Andrew Morton didst pronounce: > On Wed, 11 Jul 2007 10:34:31 +0100 Andy Whitcroft <apw@shadowen.org> wrote: > > > [Seems a PEBKAC occured on the subject line, resending lest it become a > > victim of "oh thats spam".] > > > > Andy Whitcroft wrote: > > > Andrew Morton wrote: > > > > > > [...] > > >> lumpy-reclaim-v4.patch > > >> have-kswapd-keep-a-minimum-order-free-other-than-order-0.patch > > >> only-check-absolute-watermarks-for-alloc_high-and-alloc_harder-allocations.patch > > >> > > >> Lumpy reclaim. In a similar situation to Mel's patches. Stuck due to > > >> general lack or interest and effort. > > > > > > The lumpy reclaim patches originally came out of work to support Mel's > > > anti-fragmentation work. As such I think they have become somewhat > > > attached to those patches. Whilst lumpy is most effective where > > > placement controls are in place as offered by Mel's work, we see benefit > > > from reduction in the "blunderbuss" effect when we reclaim at higher > > > orders. While placement control is pretty much required for the very > > > highest orders such as huge page size, lower order allocations are > > > benefited in terms of lower collateral damage. > > > > > > There are now a few areas other than huge page allocations which can > > > benefit. Stacks are still order 1. Jumbo frames want higher order > > > contiguous pages for there incoming hardware buffers. SLUB is showing > > > performance benefits from moving to a higher allocation order. 
All of > > > these should benefit from more aggressive targeted reclaim, indeed I > > > have been surprised just how often my test workloads trigger lumpy at > > > order 1 to get new stacks. > > > > > > Truly representative work loads are hard to generate for some of these. > > > Though we have heard some encouraging noises from those who can > > > reproduce these problems. > > I'd expect that the main application for lumpy-reclaim is in keeping a pool > of order-2 (say) pages in reserve for GFP_ATOMIC allocators. ie: jumbo > frames. > > At present this relies upon the wakeup_kswapd(..., order) mechanism. > > How effective is this at solving the jumbo frame problem? > > (And do we still have a jumbo frame problem? Reports seem to have subsided) The patches have an application with hugepage pool resizing. When lumpy-reclaim is used with ZONE_MOVABLE, the hugepages pool can be resized with greater reliability. Testing on a desktop machine with 2GB of RAM showed that growing the hugepage pool with ZONE_MOVABLE on its own was very slow as the success rate was quite low. Without lumpy-reclaim, each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages. With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical. -- Mel Gorman Part-time PhD Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 535+ messages in thread
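[Editorial sketch: the pool resizing Mel describes is driven through the standard /proc/sys/vm/nr_hugepages knob, and the yield of each attempt can be read back from the HugePages_Total field of /proc/meminfo. The helper name and the optional file-path argument below are illustrative, not from the thread; the pool write itself needs root, so it is shown only as a comment.]

```shell
#!/bin/sh
# Measure how many hugepages one pool-growth attempt actually yields.
# An optional meminfo-format file path ($1) allows dry-running outside /proc.
meminfo="${1:-/proc/meminfo}"

hugepages_total() {
    # HugePages_Total is the kernel's count of pages currently in the pool.
    awk '/^HugePages_Total:/ {print $2}' "$meminfo"
}

before=$(hugepages_total); before=${before:-0}
# On a live system (as root), a single attempt to grow the pool by 100 would be:
#   echo $((before + 100)) > /proc/sys/vm/nr_hugepages
after=$(hugepages_total); after=${after:-0}
echo "attempt yielded $((after - before)) hugepages"
```

Repeating this attempt with lumpy-reclaim and ZONE_MOVABLE in place is what produced the 40-to-70-page yields Mel reports.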
* Re: -mm merge plans for 2.6.23 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (6 preceding siblings ...) 2007-07-10 12:37 ` clam Andy Whitcroft @ 2007-07-10 15:08 ` Serge E. Hallyn 2007-07-10 15:11 ` Rafael J. Wysocki ` (18 subsequent siblings) 26 siblings, 0 replies; 535+ messages in thread From: Serge E. Hallyn @ 2007-07-10 15:08 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Andrew Morgan Quoting Andrew Morton (akpm@linux-foundation.org): ... > implement-file-posix-capabilities.patch > implement-file-posix-capabilities-fix.patch > file-capabilities-introduce-cap_setfcap.patch > file-capabilities-get_file_caps-cleanups.patch > file-caps-update-selinux-xattr-hooks.patch > > file-caps seems to be stuck. There has been some movement lately, might > merge it subject to suitable acks from suitable parties. Andrew Morgan has requested a series of changes. Since one of these would involve a change in the on-disk format of file capabilities, I guess these should (sigh) wait another cycle. I will try to get that change out the door next, as soon as possible, so that hopefully there are no more definite blocking requests. thanks, -serge ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (7 preceding siblings ...) 2007-07-10 15:08 ` -mm merge plans for 2.6.23 Serge E. Hallyn @ 2007-07-10 15:11 ` Rafael J. Wysocki 2007-07-10 16:29 ` -mm merge plans for 2.6.23 (pcmcia) Randy Dunlap ` (17 subsequent siblings) 26 siblings, 0 replies; 535+ messages in thread From: Rafael J. Wysocki @ 2007-07-10 15:11 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Nigel Cunningham, Pavel Machek On Tuesday, 10 July 2007 10:31, Andrew Morton wrote: [--snip--] > > freezer-make-kernel-threads-nonfreezable-by-default.patch > > Merge, subject to re-review. Hmm, I'm not sure what that means. Am I supposed to do anything about it? Greetings, Rafael -- "Premature optimization is the root of all evil." - Donald Knuth ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 (pcmcia) 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (8 preceding siblings ...) 2007-07-10 15:11 ` Rafael J. Wysocki @ 2007-07-10 16:29 ` Randy Dunlap 2007-07-10 17:30 ` Andrew Morton 2007-07-10 16:31 ` -mm merge plans for 2.6.23 - ioat/dma engine Kok, Auke ` (16 subsequent siblings) 26 siblings, 1 reply; 535+ messages in thread From: Randy Dunlap @ 2007-07-10 16:29 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-pcmcia On Tue, 10 Jul 2007 01:31:52 -0700 Andrew Morton wrote: > > When replying, please rewrite the subject suitably and try to Cc: the > appropriate developer(s). ... > pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch > use-menuconfig-objects-pcmcia.patch > > Am a bit stuck with the pcmcia patches. Dominik has disappeared. The menuconfig patch looks fine. I looked at the May-2007 discussion of ioctl removal. I don't see that much has changed since then, so it's either be brave/foolish/whatever and see what happens or just wait. I'll gladly send a patch to update the removal date in feature-removal-schedule.txt --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 (pcmcia) 2007-07-10 16:29 ` -mm merge plans for 2.6.23 (pcmcia) Randy Dunlap @ 2007-07-10 17:30 ` Andrew Morton 0 siblings, 0 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-10 17:30 UTC (permalink / raw) To: Randy Dunlap; +Cc: linux-kernel, linux-pcmcia On Tue, 10 Jul 2007 09:29:58 -0700 Randy Dunlap <randy.dunlap@oracle.com> wrote: > On Tue, 10 Jul 2007 01:31:52 -0700 Andrew Morton wrote: > > > > > When replying, please rewrite the subject suitably and try to Cc: the > > appropriate developer(s). > > ... > > > pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch > > use-menuconfig-objects-pcmcia.patch > > > > Am a bit stuck with the pcmcia patches. Dominik has disappeared. > > The menuconfig patch looks fine. Yeah, I'll merge that. > I looked at the May-2007 discussion of ioctl removal. > I don't see that much has changed since then, so it's either be > brave/foolish/whatever and see what happens or just wait. > I'll gladly send a patch to update the removal date in > feature-removal-schedule.txt I have a note here that the ioctl-removal patch needs Dominik's consideration. I see no rush on it so I'll just sit on it. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 - ioat/dma engine 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (9 preceding siblings ...) 2007-07-10 16:29 ` -mm merge plans for 2.6.23 (pcmcia) Randy Dunlap @ 2007-07-10 16:31 ` Kok, Auke 2007-07-10 18:05 ` Nelson, Shannon 2007-07-10 17:42 ` ata and netdev (was Re: -mm merge plans for 2.6.23) Jeff Garzik ` (15 subsequent siblings) 26 siblings, 1 reply; 535+ messages in thread From: Kok, Auke @ 2007-07-10 16:31 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Nelson, Shannon, Leech, Christopher Andrew Morton wrote: > git-ioat-vs-git-md-accel.patch > ioat-warning-fix.patch > fix-i-oat-for-kexec.patch > > I don't seem to be able to get rid of these. Chris Leech appears to have > vanished. Chris is a moving target. Thankfully we have Shannon Nelson taking over Chris' duties. Shannon, can you take a look at these and see what needs to happen to them? Most likely these just need to be pushed to the right person. Cheers, Auke PS: I think we should add an I/OAT / DMA engine section in the MAINTAINERS... ^ permalink raw reply [flat|nested] 535+ messages in thread
* RE: -mm merge plans for 2.6.23 - ioat/dma engine 2007-07-10 16:31 ` -mm merge plans for 2.6.23 - ioat/dma engine Kok, Auke @ 2007-07-10 18:05 ` Nelson, Shannon 2007-07-10 18:47 ` Andrew Morton 0 siblings, 1 reply; 535+ messages in thread From: Nelson, Shannon @ 2007-07-10 18:05 UTC (permalink / raw) To: Kok, Auke-jan H, Andrew Morton; +Cc: linux-kernel, Leech, Christopher Kok, Auke wrote: >Andrew Morton wrote: >> git-ioat-vs-git-md-accel.patch >> ioat-warning-fix.patch >> fix-i-oat-for-kexec.patch >> >> I don't seem to be able to get rid of these. Chris Leech >appears to have >> vanished. > >Chris is a moving target. Thankfully we have Shannon Nelson >taking over Chris' >duties. Shannon, can you take a look at these and see what >needs to happen to it >? Most likely these just need to be pushed to the right person. Auke: Thanks for the introduction :-). Andrew: All three of these patches are reasonable and can be pushed on up. You can add my sign-off to all three: Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> > >Cheers, > >Auke > > >PS: I think we should add an I/OAT / DMA engine section in the >MAINTAINERS... > I'll be posting a MAINTAINERS patch Real Soon Now with my name on IOAT/DMA. sln ====================================================================== Mr. Shannon Nelson LAN Access Division, Intel Corp. Shannon.Nelson@intel.com I don't speak for Intel (503) 712-7659 Parents can't afford to be squeamish. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 - ioat/dma engine 2007-07-10 18:05 ` Nelson, Shannon @ 2007-07-10 18:47 ` Andrew Morton 2007-07-10 21:18 ` Nelson, Shannon 0 siblings, 1 reply; 535+ messages in thread From: Andrew Morton @ 2007-07-10 18:47 UTC (permalink / raw) To: Nelson, Shannon; +Cc: Kok, Auke-jan H, linux-kernel, Leech, Christopher On Tue, 10 Jul 2007 11:05:45 -0700 "Nelson, Shannon" <shannon.nelson@intel.com> wrote: > Kok, Auke wrote: > >Andrew Morton wrote: > >> git-ioat-vs-git-md-accel.patch > >> ioat-warning-fix.patch > >> fix-i-oat-for-kexec.patch > >> > >> I don't seem to be able to get rid of these. Chris Leech > >appears to have > >> vanished. > > > >Chris is a moving target. Thankfully we have Shannon Nelson > >taking over Chris' > >duties. Shannon, can you take a look at these and see what > >needs to happen to it > >? Most likely these just need to be pushed to the right person. > > Auke: Thanks for the introduction :-). Hi, Shannon. > Andrew: All three of these patches are reasonable and can be pushed on > up. You can add my sign-off to all three: > Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> OK, the way it works is that I send these patches to the git tree maintainer, then the git tree maintainer merges them (this step is unreliable) and then when I repull that git tree maintainer's tree I see that they got merged so I drop them from -mm. The git tree maintainer decides when to send them to Linus. I am presently pulling git://lost.foo-projects.org/~cleech/linux-2.6#master into -mm. Will you be taking over the IOAT git tree? If so, please send me a suitable git URL when it's ready. The above tree has several changes in it from January (see ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/git-ioat.patch). Please take a look at those, work out what we should do with it all. ^ permalink raw reply [flat|nested] 535+ messages in thread
* RE: -mm merge plans for 2.6.23 - ioat/dma engine 2007-07-10 18:47 ` Andrew Morton @ 2007-07-10 21:18 ` Nelson, Shannon 0 siblings, 0 replies; 535+ messages in thread From: Nelson, Shannon @ 2007-07-10 21:18 UTC (permalink / raw) To: Andrew Morton; +Cc: Kok, Auke-jan H, linux-kernel, Leech, Christopher Andrew Morton [mailto:akpm@linux-foundation.org] > >I am presently pulling >git://lost.foo-projects.org/~cleech/linux-2.6#master >into -mm. > >Will you be taking over the IOAT git tree? If so, please send me a >suitable git URL when it's ready. I'll be getting there Real Soon Now. The transition seems to be a little rocky at the moment... >The above tree has several changes in it from January (see >ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2 >.6.22-rc6/2.6.22-rc6-mm1/broken-out/git-ioat.patch). > Please take a look at those, work out what we should do with it all. Will do. Thanks for your patience. sln ====================================================================== Mr. Shannon Nelson LAN Access Division, Intel Corp. Shannon.Nelson@intel.com I don't speak for Intel (503) 712-7659 Parents can't afford to be squeamish. ^ permalink raw reply [flat|nested] 535+ messages in thread
* ata and netdev (was Re: -mm merge plans for 2.6.23) 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (10 preceding siblings ...) 2007-07-10 16:31 ` -mm merge plans for 2.6.23 - ioat/dma engine Kok, Auke @ 2007-07-10 17:42 ` Jeff Garzik 2007-07-10 18:24 ` Andrew Morton 2007-07-10 19:56 ` Sergei Shtylyov 2007-07-10 17:49 ` ext2 reservations (Re: " Alexey Dobriyan ` (14 subsequent siblings) 26 siblings, 2 replies; 535+ messages in thread From: Jeff Garzik @ 2007-07-10 17:42 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, IDE/ATA development list, netdev, Tejun Heo (just to provide my indicator of status) Andrew Morton wrote: > libata-config_pm=n-compile-fix.patch that's for a branch that you don't get via libata-dev#ALL, #mv-ahci-pata. > pata_acpi-restore-driver.patch see Alan's comments. I've been ignoring pata_acpi for a while, because IMO it always needed work. > libata-core-convert-to-use-cancel_rearming_delayed_work.patch will merge > libata-implement-ata_wait_after_reset.patch I'm pretty sure this is obsolete. Tejun? > sata_promise-sata-hotplug-support.patch will merge > libata-add-irq_flags-to-struct-pata_platform_info-fix.patch are other pata_platform people happy with this? I don't know embedded well enough to know if adding this struct member will break things. > ata-add-the-sw-ncq-support-to-sata_nv-for-mcp51-mcp55-mcp61.patch > sata_nv-allow-changing-queue-depth.patch should be combined, really. will merge eventually. basic concept OK, but need to review in depth. > pata_hpt3x3-major-reworking-and-testing.patch > iomap-sort-out-the-broken-address-reporting-caused-by-the-iomap-layer.patch > ata-use-iomap_name.patch generally OK > libata-check-for-an-support.patch > scsi-expose-an-to-user-space.patch > libata-expose-an-to-user-space.patch > scsi-save-disk-in-scsi_device.patch > libata-send-event-when-an-received.patch > > Am sitting on these due to confusion regarding the status of the ata-ahci > patches. 
I will apply what I can, but it seems there are lifetime problems > ata-ahci-alpm-store-interrupt-value.patch > ata-ahci-alpm-expose-power-management-policy-option-to-users.patch > ata-ahci-alpm-enable-link-power-management-for-ata-drivers.patch > ata-ahci-alpm-enable-aggressive-link-power-management-for-ahci-controllers.patch > > These appear to need some work. seemed mostly OK to me. what comments did I miss? > libata-add-human-readable-error-value-decoding.patch still pondering; in my mbox queue > libata-fix-hopefully-all-the-remaining-problems-with.patch > testing-patch-for-ali-pata-fixes-hopefully-for-the-problems-with-atapi-dma.patch > pata_ali-more-work.patch No idea. I would poke Alan. Probably drop. > 8139too-force-media-setting-fix.patch > blackfin-on-chip-ethernet-mac-controller-driver.patch > atari_pamsnetc-old-declaration-ritchie-style-fix.patch > sundance-phy-address-form-0-only-for-device-id-0x0200.patch Needs a bug fix, so that the newly modified loop doesn't scan the final phy id, then loop back around to scan the first again. > 3c59x-fix-pci-resource-management.patch > update-smc91x-driver-with-arm-versatile-board-info.patch > drivers-net-ns83820c-add-paramter-to-disable-auto.patch > > netdev patches which are stuck in limbo land. ? I don't think I've seen these. > bonding-bond_mainc-make-2-functions-static.patch FWIW bonding stuff should go to me, since it lives mostly in drivers/net > x86-initial-fixmap-support.patch Andi material? > mm-revert-kernel_ds-buffered-write-optimisation.patch > revert-81b0c8713385ce1b1b9058e916edcf9561ad76d6.patch > revert-6527c2bdf1f833cc18e8f42bd97973d583e4aa83.patch > mm-clean-up-buffered-write-code.patch > mm-debug-write-deadlocks.patch > mm-trim-more-holes.patch > mm-buffered-write-cleanup.patch > mm-write-iovec-cleanup.patch > mm-fix-pagecache-write-deadlocks.patch > mm-buffered-write-iterator.patch > fs-fix-data-loss-on-error.patch > mm-restore-kernel_ds-optimisations.patch > pagefault-in-write deadlock fixes. 
Will hold for 2.6.24. Any of the above worth 2.6.23? Just wondering if they were useful cleanups / minor fixes prior to new aops patches? > more-scheduled-oss-driver-removal.patch ACK > oss-trident-massive-whitespace-removal.patch > oss-trident-fix-locking-around-write_voice_regs.patch > oss-trident-replace-deprecated-pci_find_device-with-pci_get_device.patch > remove-options-depending-on-oss_obsolete.patch > > Merge what about just removing the OSS drivers in question? :) > intel-iommu-dmar-detection-and-parsing-logic.patch > intel-iommu-pci-generic-helper-function.patch > intel-iommu-clflush_cache_range-now-takes-size-param.patch > intel-iommu-iova-allocation-and-management-routines.patch > intel-iommu-intel-iommu-driver.patch > intel-iommu-avoid-memory-allocation-failures-in-dma-map-api-calls.patch > intel-iommu-intel-iommu-cmdline-option-forcedac.patch > intel-iommu-dmar-fault-handling-support.patch > intel-iommu-iommu-gfx-workaround.patch > intel-iommu-iommu-floppy-workaround.patch > > Don't know. I don't think there were any great objections, but I don't > think much benefit has been demonstrated? Just the general march of progress on new hardware :) I would like to see this support merged in /some/ form. We've been telling Intel for years they were sillyheads for not bothering with an IOMMU. Now that they have, we should give them a cookie and support good technology. Jeff ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: ata and netdev (was Re: -mm merge plans for 2.6.23) 2007-07-10 17:42 ` ata and netdev (was Re: -mm merge plans for 2.6.23) Jeff Garzik @ 2007-07-10 18:24 ` Andrew Morton 2007-07-10 18:55 ` James Bottomley ` (3 more replies) 2007-07-10 19:56 ` Sergei Shtylyov 1 sibling, 4 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-10 18:24 UTC (permalink / raw) To: Jeff Garzik Cc: linux-kernel, IDE/ATA development list, netdev, Tejun Heo, Alan Cox, Deepak Saxena, Dan Faerch, Benjamin LaHaise On Tue, 10 Jul 2007 13:42:16 -0400 Jeff Garzik <jeff@garzik.org> wrote: > > (just to provide my indicator of status) Thanks. > > libata-add-irq_flags-to-struct-pata_platform_info-fix.patch > > are other pata_platform people happy with this? I don't know embedded > well enough to know if adding this struct member will break things. This is just a silly remove-unneeded-cast-of-void* cleanup. I wrote this as a fixup against libata-add-irq_flags-to-struct-pata_platform_info.patch with the intention of folding it into that base patch, but you went and merged the submitter's original patch so this trivial fixup got stranded in -mm. Feel free to give it the piss-off-too-trivial treatment. > > ata-ahci-alpm-store-interrupt-value.patch > > ata-ahci-alpm-expose-power-management-policy-option-to-users.patch > > ata-ahci-alpm-enable-link-power-management-for-ata-drivers.patch > > ata-ahci-alpm-enable-aggressive-link-power-management-for-ahci-controllers.patch > > > > These appear to need some work. > > seemed mostly OK to me. what comments did I miss? Oh, I thought these were the patches which affected scsi and which James had issues with. I guess I got confused. > > > libata-add-human-readable-error-value-decoding.patch > > still pondering; in my mbox queue > > > libata-fix-hopefully-all-the-remaining-problems-with.patch > > testing-patch-for-ali-pata-fixes-hopefully-for-the-problems-with-atapi-dma.patch > > pata_ali-more-work.patch > > No idea. I would poke Alan. Probably drop. 
> Alan: poke. > > > > 8139too-force-media-setting-fix.patch > > blackfin-on-chip-ethernet-mac-controller-driver.patch > > atari_pamsnetc-old-declaration-ritchie-style-fix.patch > > sundance-phy-address-form-0-only-for-device-id-0x0200.patch > > Needs a bug fix, so that the newly modified loop doesn't scan the final > phy id, then loop back around to scan the first again. > > > > 3c59x-fix-pci-resource-management.patch > > update-smc91x-driver-with-arm-versatile-board-info.patch > > drivers-net-ns83820c-add-paramter-to-disable-auto.patch > > > > netdev patches which are stuck in limbo land. > > ? I don't think I've seen these. > 3c59x-fix-pci-resource-management.patch: you wrote it ;) I have a comment here: - I don't remember the story with cardbus either. Presumably once upon a time the cardbus layer was claiming IO regions on behalf of cardbus devices (?) Need to think about that. update-smc91x-driver-with-arm-versatile-board-info.patch: See comment from rmk in changelog: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/update-smc91x-driver-with-arm-versatile-board-info.patch Deepak, can we move this along a bit please? drivers-net-ns83820c-add-paramter-to-disable-auto.patch: See comments in changelog: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/drivers-net-ns83820c-add-paramter-to-disable-auto.patch Dan, Ben: is there any prospect of progress here? > > > bonding-bond_mainc-make-2-functions-static.patch > > FWIW bonding stuff should go to me, since it lives mostly in drivers/net > Ah, noted. > > x86-initial-fixmap-support.patch > > Andi material? > Spose so. But it's buried in the middle of a series of four patches. 
> > > mm-revert-kernel_ds-buffered-write-optimisation.patch > > revert-81b0c8713385ce1b1b9058e916edcf9561ad76d6.patch > > revert-6527c2bdf1f833cc18e8f42bd97973d583e4aa83.patch > > mm-clean-up-buffered-write-code.patch > > mm-debug-write-deadlocks.patch > > mm-trim-more-holes.patch > > mm-buffered-write-cleanup.patch > > mm-write-iovec-cleanup.patch > > mm-fix-pagecache-write-deadlocks.patch > > mm-buffered-write-iterator.patch > > fs-fix-data-loss-on-error.patch > > mm-restore-kernel_ds-optimisations.patch > > pagefault-in-write deadlock fixes. Will hold for 2.6.24. > > Any of the above worth 2.6.23? Just wondering if they were useful > cleanups / minor fixes prior to new aops patches? > The first few patches will a) fix up our writev performance regression and b) reintroduce the writev() deadlock which the writev()-regression-adding patch fixed. So it's all a bit ugly. > > > oss-trident-massive-whitespace-removal.patch > > oss-trident-fix-locking-around-write_voice_regs.patch > > oss-trident-replace-deprecated-pci_find_device-with-pci_get_device.patch > > remove-options-depending-on-oss_obsolete.patch > > > > Merge > > what about just removing the OSS drivers in question? :) > Hey, I only work here. > > > intel-iommu-dmar-detection-and-parsing-logic.patch > > intel-iommu-pci-generic-helper-function.patch > > intel-iommu-clflush_cache_range-now-takes-size-param.patch > > intel-iommu-iova-allocation-and-management-routines.patch > > intel-iommu-intel-iommu-driver.patch > > intel-iommu-avoid-memory-allocation-failures-in-dma-map-api-calls.patch > > intel-iommu-intel-iommu-cmdline-option-forcedac.patch > > intel-iommu-dmar-fault-handling-support.patch > > intel-iommu-iommu-gfx-workaround.patch > > intel-iommu-iommu-floppy-workaround.patch > > > > Don't know. I don't think there were any great objections, but I don't > > think much benefit has been demonstrated? 
> > Just the general march of progress on new hardware :) > > I would like to see this support merged in /some/ form. We've been > telling Intel for years they were sillyheads for not bothering with an > IOMMU. Now that they have, we should give them a cookie and support > good technology. OK, thanks. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: ata and netdev (was Re: -mm merge plans for 2.6.23) 2007-07-10 18:24 ` Andrew Morton @ 2007-07-10 18:55 ` James Bottomley 2007-07-10 18:57 ` Jeff Garzik ` (2 subsequent siblings) 3 siblings, 0 replies; 535+ messages in thread From: James Bottomley @ 2007-07-10 18:55 UTC (permalink / raw) To: Andrew Morton Cc: Jeff Garzik, linux-kernel, IDE/ATA development list, netdev, Tejun Heo, Alan Cox, Deepak Saxena, Dan Faerch, Benjamin LaHaise On Tue, 2007-07-10 at 11:24 -0700, Andrew Morton wrote: > > > ata-ahci-alpm-store-interrupt-value.patch > > > ata-ahci-alpm-expose-power-management-policy-option-to-users.patch > > > ata-ahci-alpm-enable-link-power-management-for-ata-drivers.patch > > > ata-ahci-alpm-enable-aggressive-link-power-management-for-ahci-controllers.patch > > > > > > These appear to need some work. > > > > seemed mostly OK to me. what comments did I miss? > > Oh, I thought these were the patches which affected scsi and which James > had issues with. I guess I got confused. Well ... my concern was really how to make them more generic ... ahci isn't the only controller that can do phy power management, and it also seemed to me that the most generic entity for power management was the transport rather than the SCSI mid-layer, but that debate is still ongoing. James ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: ata and netdev (was Re: -mm merge plans for 2.6.23) 2007-07-10 18:24 ` Andrew Morton 2007-07-10 18:55 ` James Bottomley @ 2007-07-10 18:57 ` Jeff Garzik 2007-07-10 20:31 ` Sergei Shtylyov 2007-07-11 16:47 ` Dan Faerch 3 siblings, 0 replies; 535+ messages in thread From: Jeff Garzik @ 2007-07-10 18:57 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, IDE/ATA development list, netdev, Tejun Heo, Alan Cox, Deepak Saxena, Dan Faerch, Benjamin LaHaise Andrew Morton wrote: > On Tue, 10 Jul 2007 13:42:16 -0400 > Jeff Garzik <jeff@garzik.org> wrote: > >> (just to provide my indicator of status) > > Thanks. > >>> libata-add-irq_flags-to-struct-pata_platform_info-fix.patch >> are other pata_platform people happy with this? I don't know embedded >> well enough to know if adding this struct member will break things. > > This is just a silly remove-unneeded-cast-of-void* cleanup. I wrote this > as a fixup against > libata-add-irq_flags-to-struct-pata_platform_info.patch with the intention > of folding it into that base patch, but you went and merged the submitter's > original patch so this trivial fixup got stranded in -mm. Feel free to give > it the piss-off-too-trivial treatment. I'm sorry, I didn't look closely enough. I was referring to the add-irq-flags patch itself, not your small fix. >>> ata-ahci-alpm-store-interrupt-value.patch >>> ata-ahci-alpm-expose-power-management-policy-option-to-users.patch >>> ata-ahci-alpm-enable-link-power-management-for-ata-drivers.patch >>> ata-ahci-alpm-enable-aggressive-link-power-management-for-ahci-controllers.patch >>> >>> These appear to need some work. >> seemed mostly OK to me. what comments did I miss? > > Oh, I thought these were the patches which affected scsi and which James > had issues with. I guess I got confused. hrm. ISTR James wanted some cleanups, Kristen did some cleanups, then looking at the cleanups decided they were needed / appropriate at this time. 
Anyway, these are in my mbox queue and the libata portions (of which the code is the majority) seem OK. Need to give them a final review. Jeff ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: ata and netdev (was Re: -mm merge plans for 2.6.23) 2007-07-10 18:24 ` Andrew Morton 2007-07-10 18:55 ` James Bottomley 2007-07-10 18:57 ` Jeff Garzik @ 2007-07-10 20:31 ` Sergei Shtylyov 2007-07-10 20:35 ` Andrew Morton 2007-07-11 16:47 ` Dan Faerch 3 siblings, 1 reply; 535+ messages in thread From: Sergei Shtylyov @ 2007-07-10 20:31 UTC (permalink / raw) To: Andrew Morton; +Cc: Jeff Garzik, linux-kernel, netdev Hello. Andrew Morton wrote: > 3c59x-fix-pci-resource-management.patch: you wrote it ;) I have a comment No, I did, almost a year ago already. :-) > here: > - I don't remember the story with cardbus either. Presumably once upon a > time the cardbus layer was claiming IO regions on behalf of cardbus > devices (?) IIRC, that's your own comment. > Need to think about that. WBR, Sergei ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: ata and netdev (was Re: -mm merge plans for 2.6.23) 2007-07-10 20:31 ` Sergei Shtylyov @ 2007-07-10 20:35 ` Andrew Morton 0 siblings, 0 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-10 20:35 UTC (permalink / raw) To: Sergei Shtylyov; +Cc: Jeff Garzik, linux-kernel, netdev On Wed, 11 Jul 2007 00:31:23 +0400 Sergei Shtylyov <sshtylyov@ru.mvista.com> wrote: > Hello. > > Andrew Morton wrote: > > > 3x59x-fix-pci-resource-management.patch: you wrote it ;) I have a comment > > No, I did, almost a year ago already. :-) I thought that was odd. I fixed the attribution. > > here: > > > - I don't remember the story with cardbus either. Presumably once upon a > > time the cardbus layer was claiming IO regions on behalf of cardbus > > devices (?) > > IIRC, that's your own comment. yup, that's what "I have a comment" meant ;) The comment seems rather bogus actually. Let's just merge it. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: ata and netdev (was Re: -mm merge plans for 2.6.23) 2007-07-10 18:24 ` Andrew Morton ` (2 preceding siblings ...) 2007-07-10 20:31 ` Sergei Shtylyov @ 2007-07-11 16:47 ` Dan Faerch 3 siblings, 0 replies; 535+ messages in thread From: Dan Faerch @ 2007-07-11 16:47 UTC (permalink / raw) To: Andrew Morton Cc: Jeff Garzik, linux-kernel, IDE/ATA development list, netdev, Tejun Heo, Alan Cox, Deepak Saxena, Benjamin LaHaise Andrew Morton wrote: > drivers-net-ns83820c-add-paramter-to-disable-auto.patch: > > See comments in changelog: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/broken-out/drivers-net-ns83820c-add-paramter-to-disable-auto.patch > > Dan, Ben: is there any prospect of progress here? Mmm.. Ben had 2 comments last year: In regards to the ethtool stuff I coded: > This part is good, although doing something for copper cards needs doing, I know very little about hardware and only own the fiber version of this card. Even if I tried to make code for the copper version, it would probably blow up the phy and set the switches on fire ;). And in regards to the '"disable_autoneg" module argument': > This is the part I disagree with. Are you sure it isn't a bug in the > link autonegotiation state machine for fibre cards? It should be defaulting > to 1Gbit/full duplex if no autonegotiation is happening, and if it isn't > then that should be fixed instead of papering over things with a config > option. This is pretty much Russian to me. I wouldn't know where to find the "link-autonegotiation-state-machine-for-fibre-cards" or know what to do with it anyway :). The "disable_autoneg" is a convenient feature (for me and the other guy who made the same patch last year) and I consider it a harmless feature in every way. It is simply an 'if'-statement, that skips the "start autoneg" function upon load. We can simply remove the feature entirely if it is deemed undesirable. 
So in conclusion: - I vote "use the patch as-is", but I'm fine with it being changed. - If it needs support for copper, someone else has to code it. Regards - Dan ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: ata and netdev (was Re: -mm merge plans for 2.6.23) 2007-07-10 17:42 ` ata and netdev (was Re: -mm merge plans for 2.6.23) Jeff Garzik 2007-07-10 18:24 ` Andrew Morton @ 2007-07-10 19:56 ` Sergei Shtylyov 1 sibling, 0 replies; 535+ messages in thread From: Sergei Shtylyov @ 2007-07-10 19:56 UTC (permalink / raw) To: Jeff Garzik; +Cc: Andrew Morton, linux-kernel, netdev Hello. Jeff Garzik wrote: >> 3c59x-fix-pci-resource-management.patch Now that the fix for CONFIG_PCI=n has been merged, what's left is to test this on EISA (at least Andrew wanted it :-). >> netdev patches which are stuck in limbo land. > ? I don't think I've seen these. You should have, I was sending it to you. WBR, Sergei ^ permalink raw reply [flat|nested] 535+ messages in thread
* ext2 reservations (Re: -mm merge plans for 2.6.23) 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (11 preceding siblings ...) 2007-07-10 17:42 ` ata and netdev (was Re: -mm merge plans for 2.6.23) Jeff Garzik @ 2007-07-10 17:49 ` Alexey Dobriyan 2007-07-10 18:34 ` PCI probing changes Jesse Barnes ` (13 subsequent siblings) 26 siblings, 0 replies; 535+ messages in thread From: Alexey Dobriyan @ 2007-07-10 17:49 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-ext4 > ext2-reservations.patch > > Still needs decent testing. Was this oops silently fixed? http://lkml.org/lkml/2007/3/2/138 2.6.21-rc2-mm1: EIP is at ext2_discard_reservation+0x1c/0x52 I still have that ext2 partition backed up. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: PCI probing changes 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (12 preceding siblings ...) 2007-07-10 17:49 ` ext2 reservations (Re: " Alexey Dobriyan @ 2007-07-10 18:34 ` Jesse Barnes 2007-07-10 18:55 ` Andrew Morton 2007-07-10 18:44 ` agp / cpufreq Dave Jones ` (12 subsequent siblings) 26 siblings, 1 reply; 535+ messages in thread From: Jesse Barnes @ 2007-07-10 18:34 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel On Tuesday, July 10, 2007 1:31:52 Andrew Morton wrote: > pci-disable-decode-of-io-memory-during-bar-sizing.patch This is a core PCI change, should probably go through Greg and/or linux-pci instead? Or just send it to Linus directly, iirc everyone was ok with the change for 2.6.23. Also, you can add a Signed-off-by: Jesse Barnes <jesse.barnes@intel.com> to it. Thanks, Jesse ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: PCI probing changes 2007-07-10 18:34 ` PCI probing changes Jesse Barnes @ 2007-07-10 18:55 ` Andrew Morton 0 siblings, 0 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-10 18:55 UTC (permalink / raw) To: Jesse Barnes; +Cc: linux-kernel On Tue, 10 Jul 2007 11:34:03 -0700 Jesse Barnes <jesse.barnes@intel.com> wrote: > On Tuesday, July 10, 2007 1:31:52 Andrew Morton wrote: > > pci-disable-decode-of-io-memory-during-bar-sizing.patch > > This is a core PCI change, should probably go through Greg and/or > linux-pci instead? Or just send it to Linus directly, iirc everyone > was ok with the change for 2.6.23. Ah, thanks. I moved it to the gregkh-pci queue. > Also, you can add a > Signed-off-by: Jesse Barnes <jesse.barnes@intel.com> > to it. Updated, thanks. ^ permalink raw reply [flat|nested] 535+ messages in thread
* agp / cpufreq. 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (13 preceding siblings ...) 2007-07-10 18:34 ` PCI probing changes Jesse Barnes @ 2007-07-10 18:44 ` Dave Jones 2007-07-10 20:09 ` -mm merge plans for 2.6.23 Christoph Lameter ` (11 subsequent siblings) 26 siblings, 0 replies; 535+ messages in thread From: Dave Jones @ 2007-07-10 18:44 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: > working-3d-dri-intel-agpko-resume-for-i815-chip.patch > > Sent to davej You managed to sneak this to me just hours before I handed AGP maintainership to Dave Airlie. FWIW, I think this needs to be redone in a much more generic manner before it goes mainline. > bugfix-cpufreq-in-combination-with-performance-governor.patch > restore-previously-used-governor-on-a-hot-replugged-cpu.patch > > Sent to davej Will start merging the backlog once someone (ahem) pulls the last lot of stuff. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (14 preceding siblings ...) 2007-07-10 18:44 ` agp / cpufreq Dave Jones @ 2007-07-10 20:09 ` Christoph Lameter 2007-07-11 9:42 ` Mel Gorman 2007-07-11 11:35 ` Christoph Hellwig ` (10 subsequent siblings) 26 siblings, 1 reply; 535+ messages in thread From: Christoph Lameter @ 2007-07-10 20:09 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Mel Gorman On Tue, 10 Jul 2007, Andrew Morton wrote: > slub-exploit-page-mobility-to-increase-allocation-order.patch > slub-reduce-antifrag-max-order.patch > > These are slub changes which are dependent on Mel's stuff, and I have a note > here that there were reports of page allocation failures with these. What's > up with that? Those were fixed and all has been well since as far as I know. > Maybe I should just drop the 100-odd marginal-looking MM patches? We're > simply not showing compelling reasons for merging them and quite a lot of them > are stuck in a 90% complete state. As far as I can tell the antifrag patches are stable and are significantly enhancing various aspects of the VM and also make it more reliable. SLUB can use it to increase scalability. MM has been using order 3 allocs via SLUB for months now without a problem. Without the antifrag patches order 1 allocs could cause OOMs. It opens the door for functionality that we have wanted for a long time, such as memory unplug. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-10 20:09 ` -mm merge plans for 2.6.23 Christoph Lameter @ 2007-07-11 9:42 ` Mel Gorman 2007-07-11 17:49 ` Christoph Lameter 0 siblings, 1 reply; 535+ messages in thread From: Mel Gorman @ 2007-07-11 9:42 UTC (permalink / raw) To: Christoph Lameter; +Cc: Andrew Morton, linux-kernel On (10/07/07 13:09), Christoph Lameter didst pronounce: > On Tue, 10 Jul 2007, Andrew Morton wrote: > > > slub-exploit-page-mobility-to-increase-allocation-order.patch > > slub-reduce-antifrag-max-order.patch > > > > These are slub changes which are dependent on Mel's stuff, and I have a note > > here that there were reports of page allocation failures with these. What's > > up with that? > > Those were fixed and all has been well since as far as I know. > SLUB using high orders without page allocation failures does depend on two very questionable patches that I brought to attention in my initial merge mail. If grouping pages by mobility goes through, I'll be revisiting that properly to make sure it can work without ever deadlocking under any circumstances. Right now, it theoretically could livelock, although I've never been able to reproduce it. The patches as they are will work for high-order allocations if you are willing to wait and reclaim memory. The more stressful users need more effort, but it's already been shown that it can be made to work with one approach, as the last few months in -mm have shown. > > Maybe I should just drop the 100-odd marginal-looking MM patches? We're > > simply not showing compelling reasons for merging them and quite a lot of them > > are stuck in a 90% complete state. > > As far as I can tell the antifrag patches are stable and are significantly > enhancing various aspects of the VM and also make it more reliable. SLUB > can use it to increase scalability. MM has been using order 3 allocs via > SLUB for months now without a problem. Without the antifrag patches order > 1 allocs could cause OOMs. 
> > It opens the door for functionality that we have wanted for a long time, such as memory unplug. And I want to avoid a catch-22 here where the features that depend on grouping pages by mobility have to exist before grouping pages by mobility is pushed through. I would like the patches to go through on the grounds that higher-order allocations can succeed. However, I am also happy to say that order-0 pages should be used as much as possible, that case should always be made as fast as possible, and the world must not end if a high-order allocation fails. -- Mel Gorman, Part-time PhD Student, Linux Technology Center, University of Limerick / IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-11 9:42 ` Mel Gorman @ 2007-07-11 17:49 ` Christoph Lameter 0 siblings, 0 replies; 535+ messages in thread From: Christoph Lameter @ 2007-07-11 17:49 UTC (permalink / raw) To: Mel Gorman; +Cc: Andrew Morton, linux-kernel On Wed, 11 Jul 2007, Mel Gorman wrote: > I would like the patches to go through on the grounds that higher order > allocations can succeed. However, I am also happy to say that order-0 > pages should be used as much as possible, that case should always be > made as fast as possible and the world must not end if a high-order > allocation fails. SLUB can easily be made to not use higher order pages. If the SLUB mobility patches are not merged then higher order page use can be explicitly enabled via passing the following to the kernel on boot slub_max_order=<desired max order> If they are merged then the higher order page use can be disabled in case of trouble via slub_max_order=0 ^ permalink raw reply [flat|nested] 535+ messages in thread
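The `slub_max_order=<n>` boot parameter Christoph mentions is parsed early from the kernel command line; in the kernel this is wired up with a `__setup("slub_max_order=", ...)` handler calling `get_option()`. The standalone sketch below only mimics that effect with `sscanf()` so the logic can be exercised outside the kernel; the default value shown is illustrative, not the kernel's.

```c
#include <assert.h>
#include <stdio.h>

/* Standalone mimic of a "slub_max_order=<n>" boot parameter.  The real
 * kernel uses __setup("slub_max_order=", ...) and get_option(); the
 * sscanf() parsing and the default of 1 here are purely illustrative. */
static int slub_max_order = 1;

/* Returns 1 if the command-line token was consumed, 0 if it wasn't ours. */
static int setup_slub_max_order(const char *token)
{
	int order;

	if (sscanf(token, "slub_max_order=%d", &order) != 1)
		return 0;
	slub_max_order = order;
	return 1;
}
```

Passing `slub_max_order=0` is the "disable in case of trouble" knob Christoph describes, while a larger value opts into higher-order pages.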
* Re: -mm merge plans for 2.6.23 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (15 preceding siblings ...) 2007-07-10 20:09 ` -mm merge plans for 2.6.23 Christoph Lameter @ 2007-07-11 11:35 ` Christoph Hellwig 2007-07-11 11:39 ` David Woodhouse 2007-07-11 11:37 ` scsi, was " Christoph Hellwig ` (9 subsequent siblings) 26 siblings, 1 reply; 535+ messages in thread From: Christoph Hellwig @ 2007-07-11 11:35 UTC (permalink / raw) To: Andrew Morton; +Cc: dwmw2, linux-kernel, linux-fsdevel On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: > romfs-printk-format-warnings.patch NACK on this one. This bloats romfs by almost half of its previous size to add mtd support to it. Given that romfs is a completely trivial filesystem, it's much better to have a separate filesystem driver handling the format on mtd instead of adding all these indirections. In addition to that argument, the switch on the underlying subsystem is done horribly. There are lots of ifdefs instead of proper function pointers, there's one file containing both block and mtd code instead of separate files, etc. And the get_unmapped_area method in a bare filesystem needs a _lot_ of explanation. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-11 11:35 ` Christoph Hellwig @ 2007-07-11 11:39 ` David Woodhouse 2007-07-11 17:21 ` Andrew Morton 0 siblings, 1 reply; 535+ messages in thread From: David Woodhouse @ 2007-07-11 11:39 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Andrew Morton, linux-kernel, linux-fsdevel On Wed, 2007-07-11 at 13:35 +0200, Christoph Hellwig wrote: > On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: > > romfs-printk-format-warnings.patch > > NACK on this one. The rest of it is nacked anyway, until we unify the point and get_unmapped_area methods of the MTD API. -- dwmw2 ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-11 11:39 ` David Woodhouse @ 2007-07-11 17:21 ` Andrew Morton 2007-07-11 17:28 ` Randy Dunlap 0 siblings, 1 reply; 535+ messages in thread From: Andrew Morton @ 2007-07-11 17:21 UTC (permalink / raw) To: David Woodhouse Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, David Howells On Wed, 11 Jul 2007 12:39:42 +0100 David Woodhouse <dwmw2@infradead.org> wrote: > On Wed, 2007-07-11 at 13:35 +0200, Christoph Hellwig wrote: > > On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: > > > romfs-printk-format-warnings.patch > > > > NACK on this one. > > The rest of it is nacked anyway, until we unify the point and > get_unmapped_area methods of the MTD API. > Methinks you meant nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, not romfs-printk-format-warnings.patch. I'll drop nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, thanks. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-11 17:21 ` Andrew Morton @ 2007-07-11 17:28 ` Randy Dunlap 0 siblings, 0 replies; 535+ messages in thread From: Randy Dunlap @ 2007-07-11 17:28 UTC (permalink / raw) To: Andrew Morton Cc: David Woodhouse, Christoph Hellwig, linux-kernel, linux-fsdevel, David Howells On Wed, 11 Jul 2007 10:21:03 -0700 Andrew Morton wrote: > On Wed, 11 Jul 2007 12:39:42 +0100 David Woodhouse <dwmw2@infradead.org> wrote: > > > On Wed, 2007-07-11 at 13:35 +0200, Christoph Hellwig wrote: > > > On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote: > > > > romfs-printk-format-warnings.patch > > > > > > NACK on this one. > > > > The rest of it is nacked anyway, until we unify the point and > > get_unmapped_area methods of the MTD API. > > > > Methinks you meant > nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, not > romfs-printk-format-warnings.patch. > > I'll drop nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, thamks. Thanks. I was certainly getting confused. --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** ^ permalink raw reply [flat|nested] 535+ messages in thread
* scsi, was Re: -mm merge plans for 2.6.23 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (16 preceding siblings ...) 2007-07-11 11:35 ` Christoph Hellwig @ 2007-07-11 11:37 ` Christoph Hellwig 2007-07-11 17:22 ` Andrew Morton 2007-07-11 11:39 ` buffered write patches, " Christoph Hellwig ` (8 subsequent siblings) 26 siblings, 1 reply; 535+ messages in thread From: Christoph Hellwig @ 2007-07-11 11:37 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-scsi > restore-acpi-change-for-scsi.patch > git-scsi-misc-vs-greg-sysfs-stuff.patch > aacraid-rename-check_reset.patch > scsi-dont-build-scsi_dma_mapunmap-for-has_dma.patch > drivers-scsi-small-cleanups.patch > sym53c8xx_2-claims-cpqarray-device.patch > drivers-scsi-wd33c93c-cleanups.patch > make-seagate_st0x_detect-static.patch > pci-error-recovery-symbios-scsi-base-support.patch > pci-error-recovery-symbios-scsi-first-failure.patch > drivers-scsi-pcmcia-nsp_csc-remove-kernel-24-code.patch > drivers-message-i2o-devicec-remove-redundant-gfp_atomic-from-kmalloc.patch > drivers-scsi-aic7xxx_oldc-remove-redundant-gfp_atomic-from-kmalloc.patch > use-menuconfig-objects-ii-scsi.patch > remove-dead-references-to-module_parm-macro.patch > ppa-coding-police-and-printk-levels.patch > remove-the-dead-cyberstormiii_scsi-option.patch > config_scsi_fd_8xx-no-longer-exists.patch > use-mutex-instead-of-semaphore-in-megaraid-mailbox-driver.patch > > Sent to James. Care to drop the patches James NACKed every single time? ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: scsi, was Re: -mm merge plans for 2.6.23 2007-07-11 11:37 ` scsi, was " Christoph Hellwig @ 2007-07-11 17:22 ` Andrew Morton 0 siblings, 0 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-11 17:22 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-kernel, linux-scsi On Wed, 11 Jul 2007 13:37:18 +0200 Christoph Hellwig <hch@lst.de> wrote: > > restore-acpi-change-for-scsi.patch > > git-scsi-misc-vs-greg-sysfs-stuff.patch > > aacraid-rename-check_reset.patch > > scsi-dont-build-scsi_dma_mapunmap-for-has_dma.patch > > drivers-scsi-small-cleanups.patch > > sym53c8xx_2-claims-cpqarray-device.patch > > drivers-scsi-wd33c93c-cleanups.patch > > make-seagate_st0x_detect-static.patch > > pci-error-recovery-symbios-scsi-base-support.patch > > pci-error-recovery-symbios-scsi-first-failure.patch > > drivers-scsi-pcmcia-nsp_csc-remove-kernel-24-code.patch > > drivers-message-i2o-devicec-remove-redundant-gfp_atomic-from-kmalloc.patch > > drivers-scsi-aic7xxx_oldc-remove-redundant-gfp_atomic-from-kmalloc.patch > > use-menuconfig-objects-ii-scsi.patch > > remove-dead-references-to-module_parm-macro.patch > > ppa-coding-police-and-printk-levels.patch > > remove-the-dead-cyberstormiii_scsi-option.patch > > config_scsi_fd_8xx-no-longer-exists.patch > > use-mutex-instead-of-semaphore-in-megaraid-mailbox-driver.patch > > > > Sent to James. > > Care to drop the patches James NACKed every single time? I'm not aware of any which fit that description. There may be a couple in there which fix real bugs in an unapproved way. But I keep such patches as a matter of policy, so people keep on getting pestered about their bugs. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: buffered write patches, -mm merge plans for 2.6.23 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (17 preceding siblings ...) 2007-07-11 11:37 ` scsi, was " Christoph Hellwig @ 2007-07-11 11:39 ` Christoph Hellwig 2007-07-11 17:23 ` Andrew Morton 2007-07-11 11:55 ` Christoph Hellwig ` (7 subsequent siblings) 26 siblings, 1 reply; 535+ messages in thread From: Christoph Hellwig @ 2007-07-11 11:39 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-mm > pagefault-in-write deadlock fixes. Will hold for 2.6.24. Why that? This stuff has been in forever and is needed at various levels. We need this in for anything to move forward on the buffered write front. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: buffered write patches, -mm merge plans for 2.6.23 2007-07-11 11:39 ` buffered write patches, " Christoph Hellwig @ 2007-07-11 17:23 ` Andrew Morton 0 siblings, 0 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-11 17:23 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-kernel, linux-mm, Nick Piggin On Wed, 11 Jul 2007 13:39:44 +0200 Christoph Hellwig <hch@lst.de> wrote: > > pagefault-in-write deadlock fixes. Will hold for 2.6.24. > > Why that? At Nick's request. More work is needed and the code hasn't had a lot of testing/thought/exposure/review. > This stuff has been in forever and is needed at various > levels. We need this in for anything to move forward on the buffered > write front. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (18 preceding siblings ...) 2007-07-11 11:39 ` buffered write patches, " Christoph Hellwig @ 2007-07-11 11:55 ` Christoph Hellwig 2007-07-11 12:00 ` fallocate, " Christoph Hellwig ` (6 subsequent siblings) 26 siblings, 0 replies; 535+ messages in thread From: Christoph Hellwig @ 2007-07-11 11:55 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel > mutex_unlock-later-in-seq_lseek.patch > zs-move-to-the-serial-subsystem.patch > fs-block_devc-use-list_for_each_entry.patch > > introduce-o_cloexec-take-2.patch > o_cloexec-for-scm_rights.patch > Umm, Andrew - mixing new userspace interfaces, completely rewritten drivers and simple fixes in a single misc category doesn't exactly help reading this list :) ^ permalink raw reply [flat|nested] 535+ messages in thread
* fallocate, Re: -mm merge plans for 2.6.23 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (19 preceding siblings ...) 2007-07-11 11:55 ` Christoph Hellwig @ 2007-07-11 12:00 ` Christoph Hellwig 2007-07-11 12:23 ` lguest, " Christoph Hellwig ` (5 subsequent siblings) 26 siblings, 0 replies; 535+ messages in thread From: Christoph Hellwig @ 2007-07-11 12:00 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel > fallocate-implementation-on-i86-x86_64-and-powerpc.patch > fallocate-on-s390.patch > fallocate-on-ia64.patch > fallocate-on-ia64-fix.patch > > Merge. Hopefully this will be done during the 2.6.23 merge window, but right now it's not (yet). ^ permalink raw reply [flat|nested] 535+ messages in thread
* lguest, Re: -mm merge plans for 2.6.23 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (20 preceding siblings ...) 2007-07-11 12:00 ` fallocate, " Christoph Hellwig @ 2007-07-11 12:23 ` Christoph Hellwig 2007-07-11 15:45 ` Randy Dunlap ` (2 more replies) 2007-07-11 12:43 ` x86 status was " Andi Kleen ` (4 subsequent siblings) 26 siblings, 3 replies; 535+ messages in thread From: Christoph Hellwig @ 2007-07-11 12:23 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, rusty, linux-mm > lguest-export-symbols-for-lguest-as-a-module.patch __put_task_struct is one of those "no way in hell should this be exported" things, because we don't want modules messing with task lifetimes. Fortunately I can't find anything actually using this in lguest, so it looks like the issue has been solved in the meantime. I also have a rather bad feeling about exporting access_process_vm. This is the proverbial sledgehammer for access to user vm addresses and I'd rather keep it away from module programmers with "if all you have is a hammer ..." in mind. In lguest this is used by send_dma, which from my short reading of the code seems to be the central IPC mechanism. The double copy here doesn't look very efficient to me either. Maybe some VM folks could look into a better way to achieve this that might be both more efficient and not require the export. > lguest-the-guest-code.patch > lguest-the-host-code.patch > lguest-the-host-code-lguest-vs-clockevents-fix-resume-logic.patch > lguest-the-asm-offsets.patch > lguest-the-makefile-and-kconfig.patch > lguest-the-console-driver.patch > lguest-the-net-driver.patch > lguest-the-block-driver.patch > lguest-the-documentation-example-launcher.patch Just started reading this (again), so no useful comment here, but it would be nice if the code could follow CodingStyle and place the || and && at the end of the line in multiline conditionals instead of at the beginning of the next one. 
^ permalink raw reply [flat|nested] 535+ messages in thread
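For readers unfamiliar with the style point being debated, the two layouts of a multiline conditional look like this (illustrative functions only; both behave identically, the disagreement is purely about where the operator sits):

```c
#include <assert.h>

/* CodingStyle-preferred layout: && at the end of each continued line. */
static int end_of_line_style(int a, int b, int c)
{
	if (a > 0 &&
	    b > 0 &&
	    c > 0)
		return 1;
	return 0;
}

/* The layout Christoph objected to: && at the start of the next line. */
static int start_of_line_style(int a, int b, int c)
{
	if (a > 0
	    && b > 0
	    && c > 0)
		return 1;
	return 0;
}
```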
* Re: lguest, Re: -mm merge plans for 2.6.23 2007-07-11 12:23 ` lguest, " Christoph Hellwig @ 2007-07-11 15:45 ` Randy Dunlap 2007-07-11 18:04 ` Andrew Morton 2007-07-12 1:21 ` Rusty Russell 2 siblings, 0 replies; 535+ messages in thread From: Randy Dunlap @ 2007-07-11 15:45 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Andrew Morton, linux-kernel, rusty, linux-mm On Wed, 11 Jul 2007 14:23:24 +0200 Christoph Hellwig wrote: ... > > lguest-the-guest-code.patch > > lguest-the-host-code.patch > > lguest-the-host-code-lguest-vs-clockevents-fix-resume-logic.patch > > lguest-the-asm-offsets.patch > > lguest-the-makefile-and-kconfig.patch > > lguest-the-console-driver.patch > > lguest-the-net-driver.patch > > lguest-the-block-driver.patch > > lguest-the-documentation-example-launcher.patch > > Just started to reading this (again) so no useful comment here, but it > would be nice if the code could follow CodingStyle and place the || and > && at the end of the line in multiline conditionals instead of at the > beginning of the new one. I prefer them at the ends of lines also, but that's not in CodingStyle, it's just how we do it most of the time (so "coding style", without caps). --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: lguest, Re: -mm merge plans for 2.6.23 2007-07-11 12:23 ` lguest, " Christoph Hellwig 2007-07-11 15:45 ` Randy Dunlap @ 2007-07-11 18:04 ` Andrew Morton 2007-07-12 1:21 ` Rusty Russell 2 siblings, 0 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-11 18:04 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-kernel, rusty, linux-mm On Wed, 11 Jul 2007 14:23:24 +0200 Christoph Hellwig <hch@lst.de> wrote: > > lguest-export-symbols-for-lguest-as-a-module.patch > > __put_task_struct is one of those no way in hell should this be exported > things because we don't want modules messing with task lifetimes. > > Fortunately I can't find anything actually using this in lguest, so > it looks the issue has been solved in the meantime. > Ther are a couple of calls to put_task_struct() in there, and that needs __put_task_struct() exported. > > I also have a rather bad feeling about exporting access_process_vm. > This is the proverbial sledge hammer for access to user vm addresses > and I'd rather keep it away from module programmers with "if all > you have is a hammer ..." in mind. hm, well, access_process_vm() is a convenience wrapper around get_user_pages(), whcih is exported. > In lguest this is used by send_dma which from my short reading of the > code seems to be the central IPC mechanism. The double copy here > doesn't look very efficient to me either. Maybe some VM folks could > look into a better way to archive this that might be both more > efficient and not require the export. 
> > > > lguest-the-guest-code.patch > > lguest-the-host-code.patch > > lguest-the-host-code-lguest-vs-clockevents-fix-resume-logic.patch > > lguest-the-asm-offsets.patch > > lguest-the-makefile-and-kconfig.patch > > lguest-the-console-driver.patch > > lguest-the-net-driver.patch > > lguest-the-block-driver.patch > > lguest-the-documentation-example-launcher.patch > > Just started to reading this (again) so no useful comment here, but it > would be nice if the code could follow CodingStyle and place the || and > && at the end of the line in multiline conditionals instead of at the > beginning of the new one. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: lguest, Re: -mm merge plans for 2.6.23 2007-07-11 12:23 ` lguest, " Christoph Hellwig 2007-07-11 15:45 ` Randy Dunlap 2007-07-11 18:04 ` Andrew Morton @ 2007-07-12 1:21 ` Rusty Russell 2007-07-12 2:28 ` David Miller 2 siblings, 1 reply; 535+ messages in thread From: Rusty Russell @ 2007-07-12 1:21 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Andrew Morton, linux-kernel, linux-mm On Wed, 2007-07-11 at 14:23 +0200, Christoph Hellwig wrote: > > lguest-export-symbols-for-lguest-as-a-module.patch > > __put_task_struct is one of those no way in hell should this be exported > things because we don't want modules messing with task lifetimes. > > Fortunately I can't find anything actually using this in lguest, so > it looks the issue has been solved in the meantime. To do inter-guest (ie. inter-process) I/O you really have to make sure the other side doesn't go away. > I also have a rather bad feeling about exporting access_process_vm. > This is the proverbial sledge hammer for access to user vm addresses > and I'd rather keep it away from module programmers with "if all > you have is a hammer ..." in mind. > > In lguest this is used by send_dma which from my short reading of the > code seems to be the central IPC mechanism. The double copy here > doesn't look very efficient to me either. Maybe some VM folks could > look into a better way to archive this that might be both more > efficient and not require the export. It's not a double copy: it's a map & copy. If KVM develops inter-guest I/O then this could all be extracted into a helper function and made more efficient. > Just started to reading this (again) so no useful comment here, but it > would be nice if the code could follow CodingStyle and place the || and > && at the end of the line in multiline conditionals instead of at the > beginning of the new one. Surprisingly, you have a point here. Since the key purpose of lguest is as demonstration code, it meticulously match kernel style. 
I shall immediately prepare a patch to convert the rest of the kernel to the correct "&& at beginning of line" style. Rusty. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: lguest, Re: -mm merge plans for 2.6.23 2007-07-12 1:21 ` Rusty Russell @ 2007-07-12 2:28 ` David Miller 2007-07-12 2:48 ` Rusty Russell 0 siblings, 1 reply; 535+ messages in thread From: David Miller @ 2007-07-12 2:28 UTC (permalink / raw) To: rusty; +Cc: hch, akpm, linux-kernel, linux-mm From: Rusty Russell <rusty@rustcorp.com.au> Date: Thu, 12 Jul 2007 11:21:51 +1000 > To do inter-guest (ie. inter-process) I/O you really have to make sure > the other side doesn't go away. You should just let it exit and when it does you receive some kind of exit notification that resets your virtual device channel. I think the reference counting approach is error and deadlock prone. Be more loose and let the events reset the virtual devices when guests go splat. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: lguest, Re: -mm merge plans for 2.6.23 2007-07-12 2:28 ` David Miller @ 2007-07-12 2:48 ` Rusty Russell 2007-07-12 2:51 ` David Miller 2007-07-12 4:24 ` Andrew Morton 0 siblings, 2 replies; 535+ messages in thread From: Rusty Russell @ 2007-07-12 2:48 UTC (permalink / raw) To: David Miller; +Cc: hch, akpm, linux-kernel, linux-mm On Wed, 2007-07-11 at 19:28 -0700, David Miller wrote: > From: Rusty Russell <rusty@rustcorp.com.au> > Date: Thu, 12 Jul 2007 11:21:51 +1000 > > > To do inter-guest (ie. inter-process) I/O you really have to make sure > > the other side doesn't go away. > > You should just let it exit and when it does you receive some kind of > exit notification that resets your virtual device channel. > > I think the reference counting approach is error and deadlock prone. > Be more loose and let the events reset the virtual devices when > guests go splat. There are two places where we grab the task refcnt. One might be avoidable (will test and get back) but the deferred wakeup isn't really:

/* We cache one process to wakeup: helps for batching & wakes outside locks. */
void set_wakeup_process(struct lguest *lg, struct task_struct *p)
{
	if (p == lg->wake)
		return;

	if (lg->wake) {
		wake_up_process(lg->wake);
		put_task_struct(lg->wake);
	}
	lg->wake = p;
	if (lg->wake)
		get_task_struct(lg->wake);
}

We drop the lock after I/O, and then do this wakeup. Meanwhile the other task might have exited. I could get rid of it, but I don't think there's anything wrong with the code... Cheers, Rusty. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: lguest, Re: -mm merge plans for 2.6.23 2007-07-12 2:48 ` Rusty Russell @ 2007-07-12 2:51 ` David Miller 2007-07-12 3:15 ` Rusty Russell 2007-07-12 4:24 ` Andrew Morton 1 sibling, 1 reply; 535+ messages in thread From: David Miller @ 2007-07-12 2:51 UTC (permalink / raw) To: rusty; +Cc: hch, akpm, linux-kernel, linux-mm From: Rusty Russell <rusty@rustcorp.com.au> Date: Thu, 12 Jul 2007 12:48:41 +1000 > We drop the lock after I/O, and then do this wakeup. Meanwhile the > other task might have exited. I already understand what you're doing. Is it possible to use exit notifiers to handle this case? That's what I'm trying to suggest. :) ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: lguest, Re: -mm merge plans for 2.6.23 2007-07-12 2:51 ` David Miller @ 2007-07-12 3:15 ` Rusty Russell 2007-07-12 3:35 ` David Miller 0 siblings, 1 reply; 535+ messages in thread From: Rusty Russell @ 2007-07-12 3:15 UTC (permalink / raw) To: David Miller; +Cc: hch, akpm, linux-kernel, linux-mm On Wed, 2007-07-11 at 19:51 -0700, David Miller wrote: > From: Rusty Russell <rusty@rustcorp.com.au> > Date: Thu, 12 Jul 2007 12:48:41 +1000 > > > We drop the lock after I/O, and then do this wakeup. Meanwhile the > > other task might have exited. > > I already understand what you're doing. > > Is it possible to use exit notifiers to handle this case? > That's what I'm trying to suggest. :) Sure, the process has /dev/lguest open, so I can do something in the close routine. Instead of keeping a reference to the tsk, I can keep a reference to the struct lguest (currently it doesn't have or need a refcnt). Then I need another lock, to protect lg->tsk. This seems like a lot of dancing to avoid one export. If it's that important I'd far rather drop the code and do a normal wakeup under the big lguest lock for 2.6.23. Cheers, Rusty. ^ permalink raw reply [flat|nested] 535+ messages in thread
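The alternative Rusty sketches here — dropping the task reference and instead clearing the cached pointer from the /dev/lguest close routine, under a new lock — can be modeled in miniature as below. Everything in this sketch is a simplified stand-in (an int for the spinlock, a dummy struct task, lguest_release() for the file's .release callback), not the actual lguest code:

```c
#include <assert.h>
#include <stddef.h>

/* Miniature model of the refcount-free scheme: rather than pinning the
 * task with get_task_struct(), the /dev/lguest close routine clears
 * lg->tsk, and anyone waking the guest re-checks the pointer under the
 * same lock.  All names and types are illustrative stand-ins. */
struct task;			/* opaque stand-in for struct task_struct */

struct lguest {
	int lock;		/* would be a spinlock protecting tsk */
	struct task *tsk;
	int woken;		/* records that a wakeup was delivered */
};

/* Called when the owning process closes /dev/lguest (or exits). */
static void lguest_release(struct lguest *lg)
{
	lg->lock = 1;
	lg->tsk = NULL;		/* no further wakeups may touch the task */
	lg->lock = 0;
}

static void wake_owner(struct lguest *lg)
{
	lg->lock = 1;
	if (lg->tsk)		/* safe: tsk is cleared under the same lock */
		lg->woken = 1;	/* would be wake_up_process(lg->tsk) */
	lg->lock = 0;
}
```

This is the "lot of dancing to avoid one export" Rusty describes: the task pointer is never pinned, but every waker pays for the extra lock round-trip.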
* Re: lguest, Re: -mm merge plans for 2.6.23 2007-07-12 3:15 ` Rusty Russell @ 2007-07-12 3:35 ` David Miller 0 siblings, 0 replies; 535+ messages in thread From: David Miller @ 2007-07-12 3:35 UTC (permalink / raw) To: rusty; +Cc: hch, akpm, linux-kernel, linux-mm From: Rusty Russell <rusty@rustcorp.com.au> Date: Thu, 12 Jul 2007 13:15:18 +1000 > Sure, the process has /dev/lguest open, so I can do something in the > close routine. Instead of keeping a reference to the tsk, I can keep a > reference to the struct lguest (currently it doesn't have or need a > refcnt). Then I need another lock, to protect lg->tsk. > > This seems like a lot of dancing to avoid one export. If it's that > important I'd far rather drop the code and do a normal wakeup under the > big lguest lock for 2.6.23. I'm not against the export, so use it if it really helps. Ref-counting just seems clumsy to me given how the hw assisted virtualization stuff works on platforms I am intimately familiar with :) ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: lguest, Re: -mm merge plans for 2.6.23 2007-07-12 2:48 ` Rusty Russell 2007-07-12 2:51 ` David Miller @ 2007-07-12 4:24 ` Andrew Morton 2007-07-12 4:52 ` Rusty Russell 1 sibling, 1 reply; 535+ messages in thread From: Andrew Morton @ 2007-07-12 4:24 UTC (permalink / raw) To: Rusty Russell; +Cc: David Miller, hch, linux-kernel, linux-mm On Thu, 12 Jul 2007 12:48:41 +1000 Rusty Russell <rusty@rustcorp.com.au> wrote: > On Wed, 2007-07-11 at 19:28 -0700, David Miller wrote: > > From: Rusty Russell <rusty@rustcorp.com.au> > > Date: Thu, 12 Jul 2007 11:21:51 +1000 > > > > > To do inter-guest (ie. inter-process) I/O you really have to make sure > > > the other side doesn't go away. > > > > You should just let it exit and when it does you receive some kind of > > exit notification that resets your virtual device channel. > > > > I think the reference counting approach is error and deadlock prone. > > Be more loose and let the events reset the virtual devices when > > guests go splat. > > There are two places where we grab the task refcnt. One might be avoidable > (will test and get back) but the deferred wakeup isn't really:
>
> /* We cache one process to wakeup: helps for batching & wakes outside locks. */
> void set_wakeup_process(struct lguest *lg, struct task_struct *p)
> {
> 	if (p == lg->wake)
> 		return;
>
> 	if (lg->wake) {
> 		wake_up_process(lg->wake);
> 		put_task_struct(lg->wake);
> 	}
> 	lg->wake = p;
> 	if (lg->wake)
> 		get_task_struct(lg->wake);
> }

<handwaving> We seem to be taking the reference against the wrong thing here. It should be against the mm, not against a task_struct? ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: lguest, Re: -mm merge plans for 2.6.23 2007-07-12 4:24 ` Andrew Morton @ 2007-07-12 4:52 ` Rusty Russell 2007-07-12 11:10 ` Avi Kivity 2007-07-19 17:27 ` Christoph Hellwig 0 siblings, 2 replies; 535+ messages in thread From: Rusty Russell @ 2007-07-12 4:52 UTC (permalink / raw) To: Andrew Morton; +Cc: David Miller, hch, linux-kernel, linux-mm On Wed, 2007-07-11 at 21:24 -0700, Andrew Morton wrote: > We seem to be taking the reference against the wrong thing here. It should > be against the mm, not against a task_struct? This is solely for the wakeup: you don't wake an mm 8) The mm reference is held as well under the big lguest_mutex (mm gets destroyed before files get closed, so we definitely do need to hold a reference). I just completed benchmarking: the cached wakeup with the current naive drivers makes no difference (at one stage I was playing with batched hypercalls, where it seemed to help). Thanks Christoph, DaveM! === Remove export of __put_task_struct, and usage in lguest lguest takes a reference count of tasks for two reasons. The first is bogus: the /dev/lguest close callback will be called before the task is destroyed anyway, so no need to take a reference on open. The second is code to defer waking up tasks for inter-guest I/O, but the current lguest drivers are too simplistic to benefit (only batched hypercalls will see an effect, and it's likely that lguests' entire I/O model will be replaced with virtio and ringbuffers anyway). 
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> --- drivers/lguest/hypercalls.c | 1 - drivers/lguest/io.c | 18 +----------------- drivers/lguest/lg.h | 1 - drivers/lguest/lguest_user.c | 2 -- kernel/fork.c | 1 - 5 files changed, 1 insertion(+), 22 deletions(-) =================================================================== --- a/drivers/lguest/hypercalls.c +++ b/drivers/lguest/hypercalls.c @@ -189,5 +189,4 @@ void do_hypercalls(struct lguest *lg) do_hcall(lg, lg->regs); clear_hcall(lg); } - set_wakeup_process(lg, NULL); } =================================================================== --- a/drivers/lguest/io.c +++ b/drivers/lguest/io.c @@ -296,7 +296,7 @@ static int dma_transfer(struct lguest *s /* Do this last so dst doesn't simply sleep on lock. */ set_bit(dst->interrupt, dstlg->irqs_pending); - set_wakeup_process(srclg, dstlg->tsk); + wake_up_process(dstlg->tsk); return i == dst->num_dmas; fail: @@ -333,7 +333,6 @@ again: /* Give any recipients one chance to restock. */ up_read(&current->mm->mmap_sem); mutex_unlock(&lguest_lock); - set_wakeup_process(lg, NULL); empty++; goto again; } @@ -360,21 +359,6 @@ void release_all_dma(struct lguest *lg) unlink_dma(&lg->dma[i]); } up_read(&lg->mm->mmap_sem); -} - -/* We cache one process to wakeup: helps for batching & wakes outside locks. */ -void set_wakeup_process(struct lguest *lg, struct task_struct *p) -{ - if (p == lg->wake) - return; - - if (lg->wake) { - wake_up_process(lg->wake); - put_task_struct(lg->wake); - } - lg->wake = p; - if (lg->wake) - get_task_struct(lg->wake); } /* Userspace wants a dma buffer from this guest. 
*/ =================================================================== --- a/drivers/lguest/lg.h +++ b/drivers/lguest/lg.h @@ -240,7 +240,6 @@ void release_all_dma(struct lguest *lg); void release_all_dma(struct lguest *lg); unsigned long get_dma_buffer(struct lguest *lg, unsigned long key, unsigned long *interrupt); -void set_wakeup_process(struct lguest *lg, struct task_struct *p); /* hypercalls.c: */ void do_hypercalls(struct lguest *lg); =================================================================== --- a/drivers/lguest/lguest_user.c +++ b/drivers/lguest/lguest_user.c @@ -141,7 +141,6 @@ static int initialize(struct file *file, setup_guest_gdt(lg); init_clockdev(lg); lg->tsk = current; - get_task_struct(lg->tsk); lg->mm = get_task_mm(lg->tsk); init_waitqueue_head(&lg->break_wq); lg->last_pages = NULL; @@ -205,7 +204,6 @@ static int close(struct inode *inode, st hrtimer_cancel(&lg->hrt); release_all_dma(lg); free_guest_pagetable(lg); - put_task_struct(lg->tsk); mmput(lg->mm); if (!IS_ERR(lg->dead)) kfree(lg->dead); =================================================================== --- a/kernel/fork.c +++ b/kernel/fork.c @@ -128,7 +128,6 @@ void __put_task_struct(struct task_struc if (!profile_handoff_task(tsk)) free_task(tsk); } -EXPORT_SYMBOL_GPL(__put_task_struct); void __init fork_init(unsigned long mempages) { ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: lguest, Re: -mm merge plans for 2.6.23 2007-07-12 4:52 ` Rusty Russell @ 2007-07-12 11:10 ` Avi Kivity 2007-07-12 23:20 ` Rusty Russell 2007-07-19 17:27 ` Christoph Hellwig 1 sibling, 1 reply; 535+ messages in thread From: Avi Kivity @ 2007-07-12 11:10 UTC (permalink / raw) To: Rusty Russell; +Cc: Andrew Morton, David Miller, hch, linux-kernel, linux-mm Rusty Russell wrote: > Remove export of __put_task_struct, and usage in lguest > > lguest takes a reference count of tasks for two reasons. The first is > bogus: the /dev/lguest close callback will be called before the task > is destroyed anyway, so no need to take a reference on open. > > What about Open /dev/lguest transfer fd using SCM_RIGHTS (or clone()?) close fd in original task exit() ? My feeling is that if you want to be bound to a task, not a file, you need to use syscalls, not ioctls. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: lguest, Re: -mm merge plans for 2.6.23 2007-07-12 11:10 ` Avi Kivity @ 2007-07-12 23:20 ` Rusty Russell 0 siblings, 0 replies; 535+ messages in thread From: Rusty Russell @ 2007-07-12 23:20 UTC (permalink / raw) To: Avi Kivity; +Cc: Andrew Morton, David Miller, hch, linux-kernel, carsteno On Thu, 2007-07-12 at 14:10 +0300, Avi Kivity wrote: > Rusty Russell wrote: > > Remove export of __put_task_struct, and usage in lguest > > > > lguest takes a reference count of tasks for two reasons. The first is > > bogus: the /dev/lguest close callback will be called before the task > > is destroyed anyway, so no need to take a reference on open. > > > > > > What about > > Open /dev/lguest > transfer fd using SCM_RIGHTS (or clone()?) > close fd in original task > exit() > > ? > > My feeling is that if you want to be bound to a task, not a file, you > need to use syscalls, not ioctls. "Don't do that". You'll lose the ability to access the operations on the fd once you are no longer the original task (explicit check). It's not an exact match, but a file is a remarkably convenient abstraction for a non-ABI such as lguest. Of course, Carsten was talking about unifying the lguest & kvm userspace interface, so this could well change anyway. Cheers, Rusty. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: lguest, Re: -mm merge plans for 2.6.23 2007-07-12 4:52 ` Rusty Russell 2007-07-12 11:10 ` Avi Kivity @ 2007-07-19 17:27 ` Christoph Hellwig 2007-07-20 3:27 ` Rusty Russell 1 sibling, 1 reply; 535+ messages in thread From: Christoph Hellwig @ 2007-07-19 17:27 UTC (permalink / raw) To: Rusty Russell; +Cc: Andrew Morton, David Miller, hch, linux-kernel, linux-mm On Thu, Jul 12, 2007 at 02:52:23PM +1000, Rusty Russell wrote: > This is solely for the wakeup: you don't wake an mm 8) > > The mm reference is held as well under the big lguest_mutex (mm gets > destroyed before files get closed, so we definitely do need to hold a > reference). > > I just completed benchmarking: the cached wakeup with the current naive > drivers makes no difference (at one stage I was playing with batched > hypercalls, where it seemed to help). > > Thanks Christoph, DaveM! The version that just got into mainline still has the __put_task_struct export despite not needing it anymore. Care to fix this up? ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: lguest, Re: -mm merge plans for 2.6.23 2007-07-19 17:27 ` Christoph Hellwig @ 2007-07-20 3:27 ` Rusty Russell 2007-07-20 7:15 ` Christoph Hellwig 0 siblings, 1 reply; 535+ messages in thread From: Rusty Russell @ 2007-07-20 3:27 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Andrew Morton, David Miller, linux-kernel, linux-mm On Thu, 2007-07-19 at 19:27 +0200, Christoph Hellwig wrote: > The version that just got into mainline still has the __put_task_struct > export despite not needing it anymore. Care to fix this up? No, it got patched in then immediately patched out again. Andrew mis-mixed my patches, but there have been so many of them I find it hard to blame him. Rusty. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: lguest, Re: -mm merge plans for 2.6.23 2007-07-20 3:27 ` Rusty Russell @ 2007-07-20 7:15 ` Christoph Hellwig 0 siblings, 0 replies; 535+ messages in thread From: Christoph Hellwig @ 2007-07-20 7:15 UTC (permalink / raw) To: Rusty Russell Cc: Christoph Hellwig, Andrew Morton, David Miller, linux-kernel, linux-mm On Fri, Jul 20, 2007 at 01:27:26PM +1000, Rusty Russell wrote: > On Thu, 2007-07-19 at 19:27 +0200, Christoph Hellwig wrote: > > The version that just got into mainline still has the __put_task_struct > > export despite not needing it anymore. Care to fix this up? > > No, it got patched in then immediately patched out again. Andrew > mis-mixed my patches, but there have been so many of them I find it hard > to blame him. Indeed, the export is gone in the latest mainline. ^ permalink raw reply [flat|nested] 535+ messages in thread
* x86 status was Re: -mm merge plans for 2.6.23 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (21 preceding siblings ...) 2007-07-11 12:23 ` lguest, " Christoph Hellwig @ 2007-07-11 12:43 ` Andi Kleen 2007-07-11 17:33 ` Jesse Barnes ` (3 more replies) 2007-07-11 23:03 ` generic clockevents/ (hr)time(r) patches " Thomas Gleixner ` (3 subsequent siblings) 26 siblings, 4 replies; 535+ messages in thread From: Andi Kleen @ 2007-07-11 12:43 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, tglx, jeremy, Tim Hockin, jesse.barnes Andrew Morton <akpm@linux-foundation.org> writes: > revert-x86_64-mm-verify-cpu-rename.patch > add-kstrndup-fix.patch > xen-build-fix.patch > fix-x86_64-numa-fake-apicid_to_node-mapping-for-fake-numa-2.patch > fix-x86_64-mm-xen-xen-smp-guest-support.patch > more-fix-x86_64-mm-xen-xen-smp-guest-support.patch > fix-x86_64-mm-sched-clock-share.patch > fix-x86_64-mm-xen-add-xen-virtual-block-device-driver.patch > fix-x86_64-mm-add-common-orderly_poweroff.patch > fix-x86_64-mm-xen-xen-event-channels.patch > arch-i386-xen-mmuc-must-include-linux-schedh.patch > tidy-up-usermode-helper-waiting-a-bit-fix.patch > update-x86_64-mm-xen-use-iret-directly-where-possible.patch Xen is probably going to be merged. I'm still not fully happy about the review status of the drivers and xenbus, but there doesn't seem to be much value in delaying it further. I'll consolidate the fixes and fixes-to-fixes. 
These all need re-review: > i386-add-support-for-picopower-irq-router.patch > make-arch-i386-kernel-setupcremapped_pgdat_init-static.patch > arch-i386-kernel-i8253c-should-include-asm-timerh.patch > make-arch-i386-kernel-io_apicctimer_irq_works-static-again.patch > quicklist-support-for-x86_64.patch > x86_64-extract-helper-function-from-e820_register_active_regions.patch > x86_64-fix-e820_hole_size-based-on-address-ranges.patch > x86_64-acpi-disable-srat-when-numa-emulation-succeeds.patch > x86_64-slit-fake-pxm-to-node-mapping-for-fake-numa-2.patch > x86_64-numa-fake-apicid_to_node-mapping-for-fake-numa-2.patch > x86-use-elfnoteh-to-generate-vsyscall-notes-fix.patch > mmconfig-x86_64-i386-insert-unclaimed-mmconfig-resources.patch > x86_64-fix-smp_call_function_single-return-value.patch > x86_64-o_excl-on-dev-mcelog.patch > x86_64-support-poll-on-dev-mcelog.patch It's still not clear to me that this is useful. The current code can run a program on MCE, which should really be fast enough for machine check handling. > i386-fix-machine-rebooting.patch > x86-fix-section-mismatch-warnings-in-mtrr.patch > x86_64-ratelimit-segfault-reporting-rate.patch I think that one was bogus. 
> x86_64-pm_trace-support.patch > make-alt-sysrq-p-display-the-debug-register-contents.patch > i386-flush_tlb_kernel_range-add-reference-to-the-arguments.patch > round_jiffies-for-i386-and-x86-64-non-critical-corrected-mce-polling.patch > pci-disable-decode-of-io-memory-during-bar-sizing.patch > mmconfig-validate-against-acpi-motherboard-resources.patch > x86_64-irq-check-remote-irr-bit-before-migrating-level-triggered-irq-v3.patch > i386-remove-support-for-the-rise-cpu.patch > i386-make-arch-i386-mm-pgtablecpgd_cdtor-static.patch > i386-fix-section-mismatch-warning-in-intel_cacheinfo.patch > i386-do-not-restore-reserved-memory-after-hibernation.patch > paravirt-helper-to-disable-all-io-space-fix.patch > dmi_match-patch-in-rebootc-for-sff-dell-optiplex-745-fixes-hang.patch > i386-hpet-check-if-the-counter-works.patch > i386-trim-memory-not-covered-by-wb-mtrrs.patch Might need more testing? More review: > kprobes-x86_64-fix-for-mark-ro-data.patch > kprobes-i386-fix-for-mark-ro-data.patch > divorce-config_x86_pae-from-config_highmem64g.patch > remove-unneeded-test-of-task-in-dump_trace.patch > i386-move-the-kernel-to-16mb-for-numa-q.patch > i386-show-unhandled-signals.patch > i386-minor-nx-handling-adjustment.patch > x86-smp-alt-once-option-is-only-useful-with-hotplug_cpu.patch > x86-64-remove-unused-variable-maxcpus.patch > move-functions-declarations-to-header-file.patch > x86_64-during-vm-oom-condition.patch > i386-during-vm-oom-condition.patch > x86-64-disable-the-gart-in-shutdown.patch > x86_84-move-iommu-declaration-from-proto-to-iommuh.patch > i386-uaccessh-replace-hard-coded-constant-with-appropriate-macro-from-kernelh.patch > i386-add-cpu_relax-to-cmos_lock.patch > x86_64-flush_tlb_kernel_range-warning-fix.patch > x86_64-add-ioapic-nmi-support.patch > x86_64-change-_map_single-to-static-in-pci_gartc-etc.patch > x86_64-geode-hw-random-number-generator-depend-on-x86_3.patch > x86_64-fix-wrong-comment-regarding-set_fixmap.patch > 
arch-x86_64-kernel-processc-lower-printk-severity.patch > nohz-fix-nohz-x86-dyntick-idle-handling.patch > acpi-move-timer-broadcast-and-pmtimer-access-before-c3-arbiter-shutdown.patch > clockevents-fix-typo-in-acpi_pmc.patch > timekeeping-fixup-shadow-variable-argument.patch > timerc-cleanup-recently-introduced-whitespace-damage.patch > clockevents-remove-prototypes-of-removed-functions.patch > clockevents-fix-resume-logic.patch > clockevents-fix-device-replacement.patch > tick-management-spread-timer-interrupt.patch > highres-improve-debug-output.patch > hrtimer-speedup-hrtimer_enqueue.patch > pcspkr-use-the-global-pit-lock.patch > ntp-move-the-cmos-update-code-into-ntpc.patch > i386-pit-stop-only-when-in-periodic-or-oneshot-mode.patch > i386-remove-volatile-in-apicc.patch > i386-hpet-assumes-boot-cpu-is-0.patch > i386-move-pit-function-declarations-and-constants-to-correct-header-file.patch > x86_64-untangle-asm-hpeth-from-asm-timexh.patch > x86_64-use-generic-cmos-update.patch > x86_64-remove-dead-code-and-other-janitor-work-in-tscc.patch > x86_64-fix-apic-typo.patch > x86_64-convert-to-cleckevents.patch > acpi-remove-the-useless-ifdef-code.patch > x86_64-hpet-restore-vread.patch > x86_64-restore-restore-nohpet-cmdline.patch > x86_64-block-irq-balancing-for-timer.patch > x86_64-prep-idle-loop-for-dynticks.patch > x86_64-enable-high-resolution-timers-and-dynticks.patch > x86_64-dynticks-disable-hpet_id_legsup-hpets.patch I'm sceptical about the dynticks code. It just rips out the x86-64 timing code completely, which needs a lot more review and testing. 
Probably not .23 More review: > xen-fix-x86-config-dependencies.patch > x86_64-get-mp_bus_to_node-as-early.patch > xen-suppress-abs-symbol-warnings-for-unused-reloc-pointers.patch > xen-cant-support-numa-yet.patch > x86-fix-iounmaps-use-of-vm_structs-size-field.patch > arch-x86_64-kernel-aperturec-lower-printk-severity.patch > arch-x86_64-kernel-e820c-lower-printk-severity.patch > ich-force-hpet-make-generic-time-capable-of-switching-broadcast-timer.patch > ich-force-hpet-restructure-hpet-generic-clock-code.patch > ich-force-hpet-ich7-or-later-quirk-to-force-detect-enable.patch > ich-force-hpet-late-initialization-of-hpet-after-quirk.patch > ich-force-hpet-ich5-quirk-to-force-detect-enable.patch > ich-force-hpet-ich5-fix-a-bug-with-suspend-resume.patch > ich-force-hpet-add-ich7_0-pciid-to-quirk-list.patch > geode-basic-infrastructure-support-for-amd-geode-class.patch > geode-mfgpt-support-for-geode-class-machines.patch > geode-mfgpt-clock-event-device-support.patch > i386-x86_64-insert-hpet-firmware-resource-after-pci-enumeration-has-completed.patch > i386-ioapic-remove-old-irq-balancing-debug-cruft.patch > i386-deactivate-the-test-for-the-dead-config_debug_page_type.patch -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 12:43 ` x86 status was " Andi Kleen @ 2007-07-11 17:33 ` Jesse Barnes 2007-07-11 17:42 ` Ingo Molnar ` (2 subsequent siblings) 3 siblings, 0 replies; 535+ messages in thread From: Jesse Barnes @ 2007-07-11 17:33 UTC (permalink / raw) To: Andi Kleen; +Cc: Andrew Morton, linux-kernel > > i386-trim-memory-not-covered-by-wb-mtrrs.patch > > Might need more testing? For the mtrr trim patch at least, I think the coverage we've received in -mm is probably sufficient (the failure mode would be fairly obvious). The only thing I'm nervous about is adding AMD support for the quirk, since I don't have any way of testing it. We can easily add that later though, if a tester steps forward or we see demand for it (should just be an extra conditional in the trim code). Thanks, Jesse ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 12:43 ` x86 status was " Andi Kleen 2007-07-11 17:33 ` Jesse Barnes @ 2007-07-11 17:42 ` Ingo Molnar 2007-07-11 21:02 ` Randy Dunlap ` (2 more replies) 2007-07-11 18:14 ` Jeremy Fitzhardinge 2007-07-12 19:33 ` Christoph Lameter 3 siblings, 3 replies; 535+ messages in thread From: Ingo Molnar @ 2007-07-11 17:42 UTC (permalink / raw) To: Andi Kleen Cc: Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Linus Torvalds, Chris Wright * Andi Kleen <andi@firstfloor.org> wrote: > > clockevents-fix-typo-in-acpi_pmc.patch > > timekeeping-fixup-shadow-variable-argument.patch > > timerc-cleanup-recently-introduced-whitespace-damage.patch > > clockevents-remove-prototypes-of-removed-functions.patch > > clockevents-fix-resume-logic.patch > > clockevents-fix-device-replacement.patch > > tick-management-spread-timer-interrupt.patch > > highres-improve-debug-output.patch > > hrtimer-speedup-hrtimer_enqueue.patch > > pcspkr-use-the-global-pit-lock.patch > > ntp-move-the-cmos-update-code-into-ntpc.patch > > i386-pit-stop-only-when-in-periodic-or-oneshot-mode.patch > > i386-remove-volatile-in-apicc.patch > > i386-hpet-assumes-boot-cpu-is-0.patch > > i386-move-pit-function-declarations-and-constants-to-correct-header-file.patch > > x86_64-untangle-asm-hpeth-from-asm-timexh.patch > > x86_64-use-generic-cmos-update.patch > > x86_64-remove-dead-code-and-other-janitor-work-in-tscc.patch > > x86_64-fix-apic-typo.patch > > x86_64-convert-to-cleckevents.patch > > acpi-remove-the-useless-ifdef-code.patch > > x86_64-hpet-restore-vread.patch > > x86_64-restore-restore-nohpet-cmdline.patch > > x86_64-block-irq-balancing-for-timer.patch > > x86_64-prep-idle-loop-for-dynticks.patch > > x86_64-enable-high-resolution-timers-and-dynticks.patch > > x86_64-dynticks-disable-hpet_id_legsup-hpets.patch > > I'm sceptical about the dynticks code. 
It just rips out the x86-64 > timing code completely, which needs a lot more review and testing. > Probably not .23 What you just did here is a slap in the face to a lot of contributors who worked hard on this code :( Let me tell you about the history of this project first. Arjan wrote the first version of it a year ago, and it was added to -rt and tested there by many people and went through many iterations and fixes. Chris Wright then created a x86_64 clockevents cleanup and dynticks enabling patchset from it this spring and sent it to lkml three and a half months ago, on March 31: http://lwn.net/Articles/229094/ Thomas, the high resolution timers and clockevents maintainer, immediately picked up Chris' splitup/splitout/cleanup work and fixed and extended it, and sent a first cut to lkml on May 6th: http://lwn.net/Articles/233226/ Thomas then sent an updated version of the x86_64 clockevents cleanup and dynticks code to lkml (on June 10th), for a second round of review: http://lwn.net/Articles/237687/ As Thomas stated it in his submission: " The patch set has been tested in the -hrt and -rt trees for quite a while and the initial problems have been sorted out. Thanks to the folks from the PowerTop project for testing and feedback. " Then on June 16th Thomas sent the third series: http://lwn.net/Articles/238834/ (which too was in -rt and was tested there on numerous machines. It was also added to -mm.) Then on June 23rd Thomas sent the fourth series of the x86_64 clockevents and dynticks code: http://lwn.net/Articles/239620/ We finally have someone (Thomas) with core kernel clue who actually _cares_ about the x86 time code and does not see it as an ugly chore, one who collects the right patches and maintains the -hrt tree and co-maintains the -rt tree and interacts with other contributors. 
What he did was _hard_ to do but we are making really good progress: http://lkml.org/lkml/2007/7/5/242 " All in all, personally I'm very happy to see Linux making such a huge step forward with tickless and can't wait for this step to be available in all distros and for all architectures... " Yes, touching the time code is a pain because both the hardware and the kernel have skeletons hidden in the closet, but we are mapping them one by one, and we already fixed several kernel skeletons in the process. The code is in -mm and there is no open regression related to this queue of patches at the moment. But what is curiously absent from all this positive and dynamic activity around these patches on lkml? There is not a single email from Andi Kleen, the official maintainer of the x86_64 tree, directly reacting to this submission in any way, shape or form. Not a single email from you thanking Arjan, Chris and Thomas for this amount of cleanup to the architecture you are maintaining: 31 files changed, 698 insertions(+), 1367 deletions(-) Not a single email from you reviewing the patchset in any meaningful way. Not a single email from you to indicate that you even did so much as boot into this patchset. What contribution do we have from you instead? A week before the .23 merge window is closed, at the very last possible moment, we finally get your first reaction to this patchset, albeit in the form of three terse sentences: > I'm sceptical about the dynticks code. It just rips out the x86-64 > timing code completely, which needs a lot more review and testing. > Probably not .23 In the past 3+ months there was not a single email from you indicating that you are "doubtful" about this submission, and that you think that it needs "lot more review and testing". You don't offer any alternative, you don't offer any feedback, no review, no testing, no support, just a simple rejection on lkml that prevents this project from going upstream. 
Yes, maintainers have veto power and often have to make hard decisions, but, and let me stress this properly: Only if they actually act as honest maintainers! Altogether 197 emails on lkml discussed these patches, and you were Cc:-ed to every one of them. Over a dozen kernel developers reviewed it or reacted to the patchset in one way or another. And your only reaction to this is silence and a rejection claiming that it needs "lot more review"? I'm utterly speechless. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 17:42 ` Ingo Molnar @ 2007-07-11 21:02 ` Randy Dunlap 2007-07-11 21:39 ` Thomas Gleixner 2007-07-11 21:16 ` Andi Kleen 2007-07-11 21:42 ` x86 status was Re: -mm merge plans for 2.6.23 Linus Torvalds 2 siblings, 1 reply; 535+ messages in thread From: Randy Dunlap @ 2007-07-11 21:02 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Linus Torvalds, Chris Wright On Wed, 11 Jul 2007 19:42:52 +0200 Ingo Molnar wrote: > > * Andi Kleen <andi@firstfloor.org> wrote: > > > > clockevents-fix-typo-in-acpi_pmc.patch > > > timekeeping-fixup-shadow-variable-argument.patch > > > timerc-cleanup-recently-introduced-whitespace-damage.patch > > > clockevents-remove-prototypes-of-removed-functions.patch > > > clockevents-fix-resume-logic.patch > > > clockevents-fix-device-replacement.patch > > > tick-management-spread-timer-interrupt.patch > > > highres-improve-debug-output.patch > > > hrtimer-speedup-hrtimer_enqueue.patch > > > pcspkr-use-the-global-pit-lock.patch > > > ntp-move-the-cmos-update-code-into-ntpc.patch > > > i386-pit-stop-only-when-in-periodic-or-oneshot-mode.patch > > > i386-remove-volatile-in-apicc.patch > > > i386-hpet-assumes-boot-cpu-is-0.patch > > > i386-move-pit-function-declarations-and-constants-to-correct-header-file.patch > > > x86_64-untangle-asm-hpeth-from-asm-timexh.patch > > > x86_64-use-generic-cmos-update.patch > > > x86_64-remove-dead-code-and-other-janitor-work-in-tscc.patch > > > x86_64-fix-apic-typo.patch > > > x86_64-convert-to-cleckevents.patch > > > acpi-remove-the-useless-ifdef-code.patch > > > x86_64-hpet-restore-vread.patch > > > x86_64-restore-restore-nohpet-cmdline.patch > > > x86_64-block-irq-balancing-for-timer.patch > > > x86_64-prep-idle-loop-for-dynticks.patch > > > x86_64-enable-high-resolution-timers-and-dynticks.patch > > > x86_64-dynticks-disable-hpet_id_legsup-hpets.patch > > > > I'm sceptical about the 
dynticks code. It just rips out the x86-64 > > timing code completely, which needs a lot more review and testing. > > Probably not .23 > > What you just did here is a slap in the face to a lot of contributors > who worked hard on this code :( > > Let me tell you about the history of this project first. ... [lwn.net articles and other quotes snipped] > But what is curiously absent from all this positive and dynamic activity > around these patches on lkml? There is not a single email from Andi > Kleen, the official maintainer of the x86_64 tree directly reacting to > this submission in any way, shape or form. Not a single email from you > thanking Arjan, Chris and Thomas for this amount of cleanup to the > architecture you are maintaining: > > 31 files changed, 698 insertions(+), 1367 deletions(-) Hm, I don't usually get thanks emails. Do other people? > Not a single email from you reviewing the patchset in any meaningful > way. Not a single email from you to indicate that you even did so much > as boot into this patchset. > > What contribution do we have from you instead? A week before the .23 > merge window is closed, in the very last possible moment, we finally get > your first reaction to this patchset, albeit in the form of three terse > sentences: > > > I'm sceptical about the dynticks code. It just rips out the x86-64 > > timing code completely, which needs a lot more review and testing. > > Probably not .23 > > In the past 3+ months there was not a single email from you indicating > that you are "doubtful" about this submission, and that you think that > it needs "lot more review and testing". You dont offer any alternative, > you dont offer any feedback, no review, no testing, no support, just a > simple rejection on lkml that prevents this project from going upstream. > > Yes, maintainers have veto power and often have to make hard decisions, > but, and let me stress this properly: > > Only if they actually act as honest maintainers! 
> > Altogether 197 emails on lkml discussed these patches, and you were > Cc:-ed to every one of them. Over a dozen kernel developers reviewed it > or reacted to the patchset in one way or another. And your only reaction > to this is silence and a rejection claiming that it needs "lot more > review"? I'm utterly speechless. I can understand being disappointed, but not quite as upset as you appear to be. so have you (Ingo) reviewed the ext4 patches? or reiser4 patches? or lumpy reclaim? or anti-fragmentation? I certainly haven't. I can barely keep up with reading about 1/2 of lkml emails. And in my non-scientific method, I think that we are suffering from both (a) more patch submittals and (b) fewer qualified reviewers (per kernel KLOC) than we had 3-5 years ago. I don't see how you can expect Andrew to review these or any other specific patchset. Do you have some suggestions on how to clone Andrew? --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 21:02 ` Randy Dunlap @ 2007-07-11 21:39 ` Thomas Gleixner 2007-07-11 23:21 ` Randy Dunlap 0 siblings, 1 reply; 535+ messages in thread From: Thomas Gleixner @ 2007-07-11 21:39 UTC (permalink / raw) To: Randy Dunlap Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Linus Torvalds, Chris Wright Randy, On Wed, 2007-07-11 at 14:02 -0700, Randy Dunlap wrote: > I certainly haven't. I can barely keep up with reading about 1/2 > of lkml emails. And in my non-scientific method, I think that we > are suffering from both (a) more patch submittals and (b) fewer > qualified reviewers (per kernel KLOC) than we had 3-5 years ago. > > I don't see how you can expect Andrew to review these or any other > specific patchset. Do you have some suggestions on how to clone > Andrew? Ingo was talking to Andi, the x86_64 maintainer, not to Andrew. And I share his opinion that the maintainer of the subsystem which is affected by such a fundamental patch could have at least shown any public sign of interest, disgust, comment or whatever in a 3+ month time frame. Especially about a patch which is a logical consequence of an almost two-year public and transparent effort to consolidate the time code in the kernel. I for my part have no problem maintaining the set for another round out of tree and weeding out eventual problems in -mm, but my expectation of a qualified response from the responsible maintainer is exactly zero right now. Thanks, tglx ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 21:39 ` Thomas Gleixner @ 2007-07-11 23:21 ` Randy Dunlap 0 siblings, 0 replies; 535+ messages in thread From: Randy Dunlap @ 2007-07-11 23:21 UTC (permalink / raw) To: Thomas Gleixner Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Linus Torvalds, Chris Wright On Wed, 11 Jul 2007 23:39:19 +0200 Thomas Gleixner wrote: > Randy, > > On Wed, 2007-07-11 at 14:02 -0700, Randy Dunlap wrote: > > I certainly haven't. I can barely keep up with reading about 1/2 > > of lkml emails. And in my non-scientific method, I think that we > > are suffering from both (a) more patch submittals and (b) fewer > > qualified reviewers (per kernel KLOC) than we had 3-5 years ago. > > > > I don't see how you can expect Andrew to review these or any other > > specific patchset. Do you have some suggestions on how to clone > > Andrew? > > Ingo was talking to Andi, the x86_64 maintainer, not to Andrew. Yep, I see that when I re-read it. I apologize. --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 17:42 ` Ingo Molnar 2007-07-11 21:02 ` Randy Dunlap @ 2007-07-11 21:16 ` Andi Kleen 2007-07-11 21:46 ` Valdis.Kletnieks ` (2 more replies) 2007-07-11 21:42 ` x86 status was Re: -mm merge plans for 2.6.23 Linus Torvalds 2 siblings, 3 replies; 535+ messages in thread From: Andi Kleen @ 2007-07-11 21:16 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Linus Torvalds, Chris Wright Well I spent a lot of time making the x86-64 timing code work well on a variety of machines; working around a wide variety of hardware and platform bugs. I obviously don't agree with your description of its maintenance state. > What contribution do we have from you instead? A week before the .23 I told him my objections privately earlier. Basically I would like to see an actually debuggable step-by-step change, not a rip-everything-out. If that isn't possible it needs very careful review, which just hasn't happened yet. But I'm not convinced that a step-by-step approach is impossible here. I thought it was clear that ripping everything out is rarely a good idea in Linux land? That's really not something I should need to harp on repeatedly. -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 21:16 ` Andi Kleen @ 2007-07-11 21:46 ` Valdis.Kletnieks 2007-07-11 21:54 ` Chris Wright 2007-07-11 22:12 ` Linus Torvalds 2007-07-11 21:46 ` Thomas Gleixner 2007-07-11 21:46 ` Andrea Arcangeli 2 siblings, 2 replies; 535+ messages in thread From: Valdis.Kletnieks @ 2007-07-11 21:46 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Linus Torvalds, Chris Wright [-- Attachment #1: Type: text/plain, Size: 1125 bytes --] On Wed, 11 Jul 2007 23:16:38 +0200, Andi Kleen said: (Note - I'm just a usually-confused crash test dummy here...) > Well I spent a lot of time making the x86-64 timing code work > well on a variety of machines; working around a wide variety > of hardware and platform bugs. I obviously don't agree on your description > of its maintenance state. I'm seeing a bit of a disconnect here. If you spent all that time making it work, how come the guys who developed the patch are saying you didn't provide any feedback about the patchset? > > What contribution do we have from you instead? A week before the .23 > > I told him my objections privately earlier. Basically i would > like to see an actually debuggable step-by-step change, not a rip everything > out. Odd, I looked at the patchset fairly closely a number of times, as I was hand-retrofitting the -rc[1-4] versions onto -rc[1-4]-mm kernels, and it looked to *me* like it was a nice set of 20 or so step-by-step changes (bisectable and everything - I got to do that once trying to figure out which one I botched). Was there something in there that I missed? [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 21:46 ` Valdis.Kletnieks @ 2007-07-11 21:54 ` Chris Wright 2007-07-11 22:11 ` Valdis.Kletnieks 2007-07-11 22:12 ` Linus Torvalds 1 sibling, 1 reply; 535+ messages in thread From: Chris Wright @ 2007-07-11 21:54 UTC (permalink / raw) To: Valdis.Kletnieks Cc: Andi Kleen, Ingo Molnar, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Linus Torvalds, Chris Wright * Valdis.Kletnieks@vt.edu (Valdis.Kletnieks@vt.edu) wrote: > On Wed, 11 Jul 2007 23:16:38 +0200, Andi Kleen said: > (Note - I'm just a usually-confused crash test dummy here...) > > > Well I spent a lot of time making the x86-64 timing code work > > well on a variety of machines; working around a wide variety > > of hardware and platform bugs. I obviously don't agree on your description > > of its maintenance state. > > I'm seeing a bit of a disconnect here. If you spent all that time making it > work, how come the guys who developed the patch are saying you didn't provide > any feedback about the patchset? I think Andi's referring to the existing x86_64 code, which gets replaced by the patchset in question. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 21:54 ` Chris Wright @ 2007-07-11 22:11 ` Valdis.Kletnieks 2007-07-11 22:20 ` Chris Wright 2007-07-11 22:33 ` Linus Torvalds 0 siblings, 2 replies; 535+ messages in thread From: Valdis.Kletnieks @ 2007-07-11 22:11 UTC (permalink / raw) To: Chris Wright Cc: Andi Kleen, Ingo Molnar, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Linus Torvalds [-- Attachment #1: Type: text/plain, Size: 1348 bytes --] On Wed, 11 Jul 2007 14:54:12 PDT, Chris Wright said: > * Valdis.Kletnieks@vt.edu (Valdis.Kletnieks@vt.edu) wrote: > > On Wed, 11 Jul 2007 23:16:38 +0200, Andi Kleen said: > > (Note - I'm just a usually-confused crash test dummy here...) > > > > > Well I spent a lot of time making the x86-64 timing code work > > > well on a variety of machines; working around a wide variety > > > of hardware and platform bugs. I obviously don't agree on your description > > > of its maintenance state. > > > > I'm seeing a bit of a disconnect here. If you spent all that time making it > > work, how come the guys who developed the patch are saying you didn't provide > > any feedback about the patchset? > > I think Andi's referring to the existing x86_64 code, which gets > replaced by the patchset in question. <Takes a closer look at the patches> D'Oh! :) Yeah, the -rc4 version I'm looking at is like a dozen 1-3K patches setting up and cleaning up, and then one monster 65K patch doing the clockevents conversion, then another 6 or 8 small ones. Yeah, that one big patch really doesn't look separable to me. But as I said, I'm just a crash test dummy here. :) Andrew - how do you feel about keeping this in the -mm tree until Linus, Andi, Ingo, and Thomas get on the same page (which may be around the 2.6.24 merge window, by my guesstimate)? [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 22:11 ` Valdis.Kletnieks @ 2007-07-11 22:20 ` Chris Wright 2007-07-11 22:33 ` Linus Torvalds 1 sibling, 0 replies; 535+ messages in thread From: Chris Wright @ 2007-07-11 22:20 UTC (permalink / raw) To: Valdis.Kletnieks Cc: Chris Wright, Andi Kleen, Ingo Molnar, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Linus Torvalds * Valdis.Kletnieks@vt.edu (Valdis.Kletnieks@vt.edu) wrote: > Andrew - how do you feel about keeping this in the -mm tree until Linus, > Andi, Ingo, and Thomas get on the same page (which may be around the 2.6.24 > merge window, by my guesstimate)? Well, that's supposed to be Andi's tree and aggregated by Andrew into -mm. But keeping it in -mm isn't the hard part. It's getting enough testing to convince Linus it's safe, since there's no simple way to enable clockevents in a slow manner. IOW, keeping it in -mm just postpones the issue. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 22:11 ` Valdis.Kletnieks 2007-07-11 22:20 ` Chris Wright @ 2007-07-11 22:33 ` Linus Torvalds 1 sibling, 0 replies; 535+ messages in thread From: Linus Torvalds @ 2007-07-11 22:33 UTC (permalink / raw) To: Valdis.Kletnieks Cc: Chris Wright, Andi Kleen, Ingo Molnar, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven On Wed, 11 Jul 2007, Valdis.Kletnieks@vt.edu wrote: > > <Takes a closer look at the patches> D'Oh! :) Yeah, the -rc4 version I'm > looking at is like a dozen 1-3K patches setting up and cleaning up, and then > one monster 65K patch doing the clockevents conversion, then another 6 or 8 > small ones. > > Yeah, that one big patch really doesn't look separable to me. I think it should be. That big patch really does do a *lot* more than just the "clockevents conversion". It does all the hpet clock setup changes etc that are about the hardware, and have *nothing* to do with actually changing the interfaces. For example, look at the hpet.c part of that patch. Totally independent cleanups of everything else. Or look at the changes to __setup_APIC_LVTT(). Same thing. All the actual hardware interface changes are *totally* independent of the software interface changes, and a lot of them are just cleanups. But those hardware interface changes are easily the things that can break, where some cleanup results in register writes being done in a different order or something, and so if there's a bug there (and it's not visible on most setups), now you cannot tell where the bug is. Another example: setup_APIC_timer() used to wait for a timer interrupt trigger to happen on the i8259 timer (or HPET). That code just got removed (or maybe it got moved so subtly that I just don't see it). What has that got to do with switching from the old timer interface to the new one? NOTHING. So those kinds of changes that change hardware access functions should have been done separately. 
Maybe there's a machine where that early synchronization was necessary for some subtle timing reason. If so, removing it sounds like a bug, no? Wouldn't it have been nice to see that removal as a separate patch that was independent of the interface switch-over? I'd be a *lot* happier with switching over interfaces if I thought that the low-level hardware drivers didn't change at the same time. But they *do* change, afaik. Linus ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 21:46 ` Valdis.Kletnieks 2007-07-11 21:54 ` Chris Wright @ 2007-07-11 22:12 ` Linus Torvalds 1 sibling, 0 replies; 535+ messages in thread From: Linus Torvalds @ 2007-07-11 22:12 UTC (permalink / raw) To: Valdis.Kletnieks Cc: Andi Kleen, Ingo Molnar, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Chris Wright On Wed, 11 Jul 2007, Valdis.Kletnieks@vt.edu wrote: > > Odd, I looked at the patchset fairly closely a number of times, as I was > hand-retrofitting the -rc[1-4] versions onto -rc[1-4]-mm kernels, and it looked > to *me* like it was a nice set of 20 or so step-by-step changes (bisectable > and everything - I got to do that once trying to figure out which one I botched). > Was there something in there that I missed? The patch-set itself actually looks fine, as far as I'm concerned. But it does seem to have that "enable everything in one go" problem. I'd much rather see one time source at a time being converted, and enabled then and there, so that when people report problems and do a bisection, if it was HPET that broke, you get the commit that changed HPET. As it is, looking at that set, it *looks* like you'd get the "ok, now enable it all" as the commit that breaks, which tells you hardly anything, since the commit that _shows_ the behaviour has absolutely nothing to do with the code that actually causes it. But yeah, the patch series per se doesn't look bad. If it wasn't for me being burnt by the last big switch-over for timers, I probably wouldn't mind it at all, personally. Linus ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 21:16 ` Andi Kleen 2007-07-11 21:46 ` Valdis.Kletnieks @ 2007-07-11 21:46 ` Thomas Gleixner 2007-07-11 21:52 ` Chris Wright 2007-07-11 22:18 ` Andi Kleen 2007-07-11 21:46 ` Andrea Arcangeli 2 siblings, 2 replies; 535+ messages in thread From: Thomas Gleixner @ 2007-07-11 21:46 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Andrew Morton, linux-kernel, Arjan van de Ven, Linus Torvalds, Chris Wright Andi, On Wed, 2007-07-11 at 23:16 +0200, Andi Kleen wrote: > > What contribution do we have from you instead? A week before the .23 > > I told him my objections privately earlier. Basically i would > like to see an actually debuggable step-by-step change, not a rip everything > out. You promised privately to do a thorough review as well, which I'm still waiting for since months. > If that isn't possible it needs very careful review which just hasn't > happened yet. But I'm not convinced even step by step is not possible > here. There is no step by step thing. You convert an arch to clock events or you convert it not. > I thought it was clear that rip everything out is rarely a good idea > in Linux land? That's really not something I should need to harp on > repeatedly. If you have technical objections, put them on the table. Point by point. All I heard so far from you are platitudes, which are not worth the electrons to transport them. tglx ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 21:46 ` Thomas Gleixner @ 2007-07-11 21:52 ` Chris Wright 2007-07-11 22:18 ` Andi Kleen 1 sibling, 0 replies; 535+ messages in thread From: Chris Wright @ 2007-07-11 21:52 UTC (permalink / raw) To: Thomas Gleixner Cc: Andi Kleen, Ingo Molnar, Andrew Morton, linux-kernel, Arjan van de Ven, Linus Torvalds, Chris Wright * Thomas Gleixner (tglx@linutronix.de) wrote: > > If that isn't possible it needs very careful review which just hasn't > > happened yet. But I'm not convinced even step by step is not possible > > here. > > There is no step by step thing. You convert an arch to clock events or > you convert it not. Indeed, about the only thing that can be done is to take a slower approach to converging the arch-specific implementations (hpet, pit, etc). thanks, -chris ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 21:46 ` Thomas Gleixner 2007-07-11 21:52 ` Chris Wright @ 2007-07-11 22:18 ` Andi Kleen 1 sibling, 0 replies; 535+ messages in thread From: Andi Kleen @ 2007-07-11 22:18 UTC (permalink / raw) To: Thomas Gleixner Cc: Andi Kleen, Ingo Molnar, Andrew Morton, linux-kernel, Arjan van de Ven, Linus Torvalds, Chris Wright On Wed, Jul 11, 2007 at 11:46:41PM +0200, Thomas Gleixner wrote: > Andi, > > On Wed, 2007-07-11 at 23:16 +0200, Andi Kleen wrote: > > > What contribution do we have from you instead? A week before the .23 > > > > I told him my objections privately earlier. Basically i would > > like to see an actually debuggable step-by-step change, not a rip everything > > out. > > You promised privately to do a thorough review as well, which I'm still > waiting for since months. I did some reviewing, but never the big write up and feedback. That was my fault, sorry. -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 21:16 ` Andi Kleen 2007-07-11 21:46 ` Valdis.Kletnieks 2007-07-11 21:46 ` Thomas Gleixner @ 2007-07-11 21:46 ` Andrea Arcangeli 2007-07-11 22:09 ` Linus Torvalds 2 siblings, 1 reply; 535+ messages in thread From: Andrea Arcangeli @ 2007-07-11 21:46 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Linus Torvalds, Chris Wright Hi Andi, On Wed, Jul 11, 2007 at 11:16:38PM +0200, Andi Kleen wrote: > I thought it was clear that rip everything out is rarely a good idea > in Linux land? That's really not something I should need to harp on > repeatedly. I'm going to change topic big time because your sentence above perfectly applies to the O(1) scheduler too. It's not like process schedulers are sacred and there shall be only one, while I/O schedulers and packet schedulers are profane and there can be many of them. FWIW IMHO the right way would have been to make the new scheduler pluggable and switchable at runtime, too bad the old one was ripped out instead. The difficulty of making the scheduler pluggable isn't really enormous, there have been patches floating around to achieve it, some of which I even dealt with myself once. The only positive side of being forced to CFS I can imagine, is that more testing will make it more stable and more tuned more quickly. But I'm fairly certain Ingo's good enough to achieve that without it, perhaps with a few more weeks. Personally I very much like the unfairness of O(1), I'm afraid CFS will overschedule under a certain number of workloads in its attempt to provide completely fair queueing at all costs, and it won't deal with the X server as nicely as O(1), but I may well be wrong. The only thing I'm more sure about is that the computational complexity is higher, and that reason alone is a good technical reason to provide both and let the java folks stick with O(1) if they want. 
^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 21:46 ` Andrea Arcangeli @ 2007-07-11 22:09 ` Linus Torvalds 2007-07-12 15:36 ` Oleg Verych 2007-07-13 2:23 ` Roman Zippel 0 siblings, 2 replies; 535+ messages in thread From: Linus Torvalds @ 2007-07-11 22:09 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andi Kleen, Ingo Molnar, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Chris Wright On Wed, 11 Jul 2007, Andrea Arcangeli wrote: > > I'm going to change topic big time because your sentence above > perfectly applies to the O(1) scheduler too. I disagree to a large degree. We almost never have problems with code you can "think about". Sure, bugs happen, but code that everybody runs the same generally doesn't break. So a CPU scheduler doesn't worry me all that much. CPU schedulers are "easy". What worries me is interfaces to hardware that we know looks different for different people. That means that any testing that one person has done doesn't necessarily translate to anything at *all* on another persons machine. The timer problems we had when merging the stuff in 2.6.21 just scarred me. I'd _really_ hate to have to go through that again. And no, the "gradual" thing where the patch that actually *enables* something isn't very gradual at all, so that's the absolutely worst kind of thing, because then people can "git bisect" to the point where it got enabled and tell us that's where things broke, but that doesn't actually say anything at all about the patch that actually implements the new behaviour. So the "enable" kind of patch is actually the worst of the lot, when it comes to hardware. When it comes to pure software algorithms, and things like schedulers, you'll still obviously have timing issues and tuning, but generally things *work*, which makes it a lot easier to debug and describe. Linus ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 22:09 ` Linus Torvalds @ 2007-07-12 15:36 ` Oleg Verych 2007-07-13 2:23 ` Roman Zippel 1 sibling, 0 replies; 535+ messages in thread From: Oleg Verych @ 2007-07-12 15:36 UTC (permalink / raw) To: linux-kernel * Linus Torvalds "Wed, 11 Jul 2007 15:09:28 -0700 (PDT)" > > On Wed, 11 Jul 2007, Andrea Arcangeli wrote: >> >> I'm going to change topic big time because your sentence above >> perfectly applies to the O(1) scheduler too. > > I disagree to a large degree. > > We almost never have problems with code you can "think about". > > Sure, bugs happen, but code that everybody runs the same generally doesn't > break. So a CPU scheduler doesn't worry me all that much. CPU schedulers > are "easy". > > What worries me is interfaces to hardware that we know looks different for > different people. That means that any testing that one person has done > doesn't necessarily translate to anything at *all* on another persons > machine. > > The timer problems we had when merging the stuff in 2.6.21 just scarred > me. I'd _really_ hate to have to go through that again. And no, the > "gradual" thing where the patch that actually *enables* something isn't > very gradual at all, so that's the absolutely worst kind of thing, because > then people can "git bisect" to the point where it got enabled and tell us > that's where things broke, but that doesn't actually say anything at all > about the patch that actually implements the new behaviour. > > So the "enable" kind of patch is actually the worst of the lot, when it > comes to hardware. > > When it comes to pure software algorithms, and things like schedulers, > you'll still obviously have timing issues and tuning, but generally things > *work*, which makes it a lot easier to debug and describe. > > Linus Seconded (obviously). -- -o--=O`C #oo'L O <___=E M ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 22:09 ` Linus Torvalds 2007-07-12 15:36 ` Oleg Verych @ 2007-07-13 2:23 ` Roman Zippel 2007-07-13 4:40 ` Andrew Morton 2007-07-13 4:47 ` Mike Galbraith 1 sibling, 2 replies; 535+ messages in thread From: Roman Zippel @ 2007-07-13 2:23 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Andi Kleen, Ingo Molnar, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Chris Wright Hi, On Wed, 11 Jul 2007, Linus Torvalds wrote: > Sure, bugs happen, but code that everybody runs the same generally doesn't > break. So a CPU scheduler doesn't worry me all that much. CPU schedulers > are "easy". A little more advance warning wouldn't have hurt though. The new scheduler does _a_lot_ of heavy 64 bit calculations without any attempt to scale that down a little... One can blame me now for not having it brought up earlier, but discussions with Ingo are not something I'm looking forward to. :( bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-13 2:23 ` Roman Zippel @ 2007-07-13 4:40 ` Andrew Morton 2007-07-13 4:47 ` Mike Galbraith 1 sibling, 0 replies; 535+ messages in thread From: Andrew Morton @ 2007-07-13 4:40 UTC (permalink / raw) To: Roman Zippel Cc: Linus Torvalds, Andrea Arcangeli, Andi Kleen, Ingo Molnar, linux-kernel, Thomas Gleixner, Arjan van de Ven, Chris Wright On Fri, 13 Jul 2007 04:23:43 +0200 (CEST) Roman Zippel <zippel@linux-m68k.org> wrote: > Hi, > > On Wed, 11 Jul 2007, Linus Torvalds wrote: > > > Sure, bugs happen, but code that everybody runs the same generally doesn't > > break. So a CPU scheduler doesn't worry me all that much. CPU schedulers > > are "easy". > > A little more advance warning wouldn't have hurt though. > The new scheduler does _a_lot_ of heavy 64 bit calculations without any > attempt to scale that down a little... > One can blame me now for not having it brought up earlier, but discussions > with Ingo are not something I'm looking forward to. :( > I brought that up a couple of weeks ago, got handwaved at and gave up. It still isn't obvious to me that all that arith needs to be 64-bit on 32-bit machines, or even on 64-bit. 4e9 is a big number. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-13 2:23 ` Roman Zippel 2007-07-13 4:40 ` Andrew Morton @ 2007-07-13 4:47 ` Mike Galbraith 2007-07-13 17:23 ` Roman Zippel 1 sibling, 1 reply; 535+ messages in thread From: Mike Galbraith @ 2007-07-13 4:47 UTC (permalink / raw) To: Roman Zippel Cc: Linus Torvalds, Andrea Arcangeli, Andi Kleen, Ingo Molnar, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Chris Wright On Fri, 2007-07-13 at 04:23 +0200, Roman Zippel wrote: > Hi, Hi, > The new scheduler does _a_lot_ of heavy 64 bit calculations without any > attempt to scale that down a little... See prio_to_weight[], prio_to_wmult[] and sysctl_sched_stat_granularity. Perhaps more can be done, but "without any attempt..." isn't accurate. -Mike ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-13 4:47 ` Mike Galbraith @ 2007-07-13 17:23 ` Roman Zippel 2007-07-13 19:43 ` [PATCH] CFS: Fix missing digit off in wmult table Thomas Gleixner 2007-07-14 5:04 ` x86 status was Re: -mm merge plans for 2.6.23 Mike Galbraith 0 siblings, 2 replies; 535+ messages in thread From: Roman Zippel @ 2007-07-13 17:23 UTC (permalink / raw) To: Mike Galbraith Cc: Linus Torvalds, Andrea Arcangeli, Andi Kleen, Ingo Molnar, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Chris Wright Hi, On Fri, 13 Jul 2007, Mike Galbraith wrote: > > The new scheduler does _a_lot_ of heavy 64 bit calculations without any > > attempt to scale that down a little... > > See prio_to_weight[], prio_to_wmult[] and sysctl_sched_stat_granularity. > Perhaps more can be done, but "without any attempt..." isn't accurate. Calculating these values at runtime would have been completely insane, the alternative would be a crummy approximation, so using a lookup table is actually a good thing. That's not the problem. BTW could someone please verify the prio_to_wmult table, especially [16] and [21] look a little off, like a digit was cut off. While I'm at this, the 10% scaling there looks a little much (unless there are other changes I haven't looked at yet), the old code used more like 5%. This would mean a prio -20 task would get 98.86% cpu time compared to a prio 0 task, that was previously about the difference between -20 and 19 (and it would have previously gotten only 88.89%), now a prio -20 task would get 99.98% cpu time compared to a prio 19 task. The individual levels are unfortunately not that easily comparable, but at the overall scale the change looks IMHO a little drastic. bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* [PATCH] CFS: Fix missing digit off in wmult table 2007-07-13 17:23 ` Roman Zippel @ 2007-07-13 19:43 ` Thomas Gleixner 2007-07-16 6:18 ` James Bruce 2007-07-14 5:04 ` x86 status was Re: -mm merge plans for 2.6.23 Mike Galbraith 1 sibling, 1 reply; 535+ messages in thread From: Thomas Gleixner @ 2007-07-13 19:43 UTC (permalink / raw) To: Roman Zippel Cc: Mike Galbraith, Linus Torvalds, Andrea Arcangeli, Andi Kleen, Ingo Molnar, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Roman Zippel noticed inconsistency of the wmult table. wmult[16] has a missing digit. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> diff --git a/kernel/sched.c b/kernel/sched.c index 0559665..3332bbb 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -750,7 +750,7 @@ static const u32 prio_to_wmult[40] = { 48356, 60446, 75558, 94446, 118058, 147573, 184467, 230589, 288233, 360285, 450347, 562979, 703746, 879575, 1099582, 1374389, - 717986, 2147483, 2684354, 3355443, 4194304, + 1717986, 2147483, 2684354, 3355443, 4194304, 5244160, 6557201, 8196502, 10250518, 12782640, 16025997, 19976592, 24970740, 31350126, 39045157, 49367440, 61356675, 76695844, 95443717, 119304647, ^ permalink raw reply related [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-13 19:43 ` [PATCH] CFS: Fix missing digit off in wmult table Thomas Gleixner @ 2007-07-16 6:18 ` James Bruce 2007-07-16 7:06 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: James Bruce @ 2007-07-16 6:18 UTC (permalink / raw) To: linux-kernel Cc: Roman Zippel, Mike Galbraith, Linus Torvalds, Andrea Arcangeli, Andi Kleen, Ingo Molnar, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Thomas Gleixner wrote: > Roman Zippel noticed inconsistency of the wmult table. > wmult[16] has a missing digit. [snip] While we're at it, isn't the comment above the wmult table incorrect? The multiplier is 1.25, meaning a 25% change per nice level, not 10%. - Jim ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 6:18 ` James Bruce @ 2007-07-16 7:06 ` Ingo Molnar 2007-07-16 7:41 ` Ingo Molnar 2007-07-16 10:18 ` Roman Zippel 0 siblings, 2 replies; 535+ messages in thread From: Ingo Molnar @ 2007-07-16 7:06 UTC (permalink / raw) To: James Bruce Cc: Thomas Gleixner, Roman Zippel, Mike Galbraith, Linus Torvalds, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * James Bruce <bruce@andrew.cmu.edu> wrote: > While we're at it, isn't the comment above the wmult table incorrect? > The multiplier is 1.25, meaning a 25% change per nice level, not 10%. yes, the weight multiplier 1.25, but the actual difference in CPU utilization, when running two CPU intense tasks, is ~10%: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 8246 mingo 20 0 1576 244 196 R 55 0.0 0:11.96 loop 8247 mingo 21 1 1576 244 196 R 45 0.0 0:10.52 loop so the first task 'wins' +10% CPU utilization (relative to the 50% it had before), the second task 'loses' -10% CPU utilization (relative to the 50% it had before). so what the comment says is true: * The "10% effect" is relative and cumulative: from _any_ nice level, * if you go up 1 level, it's -10% CPU usage, if you go down 1 level * it's +10% CPU usage. for there to be a ~+10% change in CPU utilization for a task that races against another CPU-intense task there needs to be a ~25% change in the weight. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 7:06 ` Ingo Molnar @ 2007-07-16 7:41 ` Ingo Molnar 2007-07-16 15:02 ` James Bruce 2007-07-16 10:18 ` Roman Zippel 1 sibling, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-16 7:41 UTC (permalink / raw) To: James Bruce Cc: Thomas Gleixner, Roman Zippel, Mike Galbraith, Linus Torvalds, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Ingo Molnar <mingo@elte.hu> wrote: > * James Bruce <bruce@andrew.cmu.edu> wrote: > > > While we're at it, isn't the comment above the wmult table incorrect? > > The multiplier is 1.25, meaning a 25% change per nice level, not 10%. > > yes, the weight multiplier 1.25, but the actual difference in CPU > utilization, when running two CPU intense tasks, is ~10%: > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 8246 mingo 20 0 1576 244 196 R 55 0.0 0:11.96 loop > 8247 mingo 21 1 1576 244 196 R 45 0.0 0:10.52 loop > > so the first task 'wins' +10% CPU utilization (relative to the 50% it > had before), the second task 'loses' -10% CPU utilization (relative to > the 50% it had before). > > so what the comment says is true: > > * The "10% effect" is relative and cumulative: from _any_ nice level, > * if you go up 1 level, it's -10% CPU usage, if you go down 1 level > * it's +10% CPU usage. > > for there to be a ~+10% change in CPU utilization for a task that > races against another CPU-intense task there needs to be a ~25% change > in the weight. in any case more documentation is justified, so i've added some clarification to the comments - see the patch below. Ingo ------------------------> Subject: sched: improve weight-array comments From: Ingo Molnar <mingo@elte.hu> improve the comments around the wmult array (which controls the weight of niced tasks). Clarify that to achieve a 10% difference in CPU utilization, a weight multiplier of 1.25 has to be used. 
Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -736,7 +736,9 @@ static void update_curr_load(struct rq * * * The "10% effect" is relative and cumulative: from _any_ nice level, * if you go up 1 level, it's -10% CPU usage, if you go down 1 level - * it's +10% CPU usage. + * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25. + * If a task goes up by ~10% and another task goes down by ~10% then + * the relative distance between them is ~25%.) */ static const int prio_to_weight[40] = { /* -20 */ 88818, 71054, 56843, 45475, 36380, 29104, 23283, 18626, 14901, 11921, ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 7:41 ` Ingo Molnar @ 2007-07-16 15:02 ` James Bruce 0 siblings, 0 replies; 535+ messages in thread From: James Bruce @ 2007-07-16 15:02 UTC (permalink / raw) To: Ingo Molnar Cc: Thomas Gleixner, Roman Zippel, Mike Galbraith, Linus Torvalds, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Ingo Molnar wrote: > * Ingo Molnar <mingo@elte.hu> wrote: >> * James Bruce <bruce@andrew.cmu.edu> wrote: >>> While we're at it, isn't the comment above the wmult table incorrect? >>> The multiplier is 1.25, meaning a 25% change per nice level, not 10%. >> yes, the weight multiplier 1.25, but the actual difference in CPU >> utilization, when running two CPU intense tasks, is ~10%: >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 8246 mingo 20 0 1576 244 196 R 55 0.0 0:11.96 loop >> 8247 mingo 21 1 1576 244 196 R 45 0.0 0:10.52 loop >> >> so the first task 'wins' +10% CPU utilization (relative to the 50% it >> had before), the second task 'loses' -10% CPU utilization (relative to >> the 50% it had before). >> >> so what the comment says is true: >> >> * The "10% effect" is relative and cumulative: from _any_ nice level, >> * if you go up 1 level, it's -10% CPU usage, if you go down 1 level >> * it's +10% CPU usage. >> >> for there to be a ~+10% change in CPU utilization for a task that >> races against another CPU-intense task there needs to be a ~25% change >> in the weight. > > in any case more documentation is justified, so i've added some > clarification to the comments - see the patch below. Ah ok so it's 10% of the original CPU usage, not relative to a task's share from before. While I guess I still think in terms of relative CPU share, your comments now make sense to me. Thanks for the clarification. 
- Jim > ------------------------> > Subject: sched: improve weight-array comments > From: Ingo Molnar <mingo@elte.hu> > > improve the comments around the wmult array (which controls the weight > of niced tasks). Clarify that to achieve a 10% difference in CPU > utilization, a weight multiplier of 1.25 has to be used. > > Signed-off-by: Ingo Molnar <mingo@elte.hu> > --- > kernel/sched.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > Index: linux/kernel/sched.c > =================================================================== > --- linux.orig/kernel/sched.c > +++ linux/kernel/sched.c > @@ -736,7 +736,9 @@ static void update_curr_load(struct rq * > * > * The "10% effect" is relative and cumulative: from _any_ nice level, > * if you go up 1 level, it's -10% CPU usage, if you go down 1 level > - * it's +10% CPU usage. > + * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25. > + * If a task goes up by ~10% and another task goes down by ~10% then > + * the relative distance between them is ~25%.) > */ > static const int prio_to_weight[40] = { > /* -20 */ 88818, 71054, 56843, 45475, 36380, 29104, 23283, 18626, 14901, 11921, ^ permalink raw reply [flat|nested] 535+ messages in thread
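The arithmetic behind the comment can be checked with a short sketch. This is Python for illustration, not kernel code; the weight values are copied from the prio_to_weight excerpt in the patch above.

```python
# Sketch (not kernel code): why a 1.25 weight multiplier per nice level
# gives the "10% effect" described in the comment.

# First entries of prio_to_weight, copied from the patch excerpt above
# (nice -20 .. -16).
weights = [88818, 71054, 56843, 45475, 36380]

# Adjacent entries differ by a factor of ~1.25.
ratios = [a / b for a, b in zip(weights, weights[1:])]
print([round(r, 4) for r in ratios])

# Two CPU hogs one nice level apart share the CPU in proportion to
# their weights w and w/1.25, i.e. roughly a 55%/45% split, which is
# what the quoted top(1) output shows (55 vs 45).
share_low_nice = 1.25 / (1.25 + 1.0)   # lower nice value, higher weight
share_high_nice = 1.0 / (1.25 + 1.0)
print(round(share_low_nice, 4), round(share_high_nice, 4))
```

Each task moving ~10% of the total away from a 50/50 split is exactly the ~25% relative weight distance the comment describes.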
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 7:06 ` Ingo Molnar 2007-07-16 7:41 ` Ingo Molnar @ 2007-07-16 10:18 ` Roman Zippel 2007-07-16 11:20 ` Ingo Molnar 1 sibling, 1 reply; 535+ messages in thread From: Roman Zippel @ 2007-07-16 10:18 UTC (permalink / raw) To: Ingo Molnar Cc: James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: > yes, the weight multiplier 1.25, but the actual difference in CPU > utilization, when running two CPU intense tasks, is ~10%: > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 8246 mingo 20 0 1576 244 196 R 55 0.0 0:11.96 loop > 8247 mingo 21 1 1576 244 196 R 45 0.0 0:10.52 loop > > so the first task 'wins' +10% CPU utilization (relative to the 50% it > had before), the second task 'loses' -10% CPU utilization (relative to > the 50% it had before). As soon as you add another loop the difference changes again, while it's always correct to say it gets 25% more cpu time (which I still think is a little too much). bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 10:18 ` Roman Zippel @ 2007-07-16 11:20 ` Ingo Molnar 2007-07-16 11:58 ` Roman Zippel 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-16 11:20 UTC (permalink / raw) To: Roman Zippel Cc: James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Roman Zippel <zippel@linux-m68k.org> wrote: > > yes, the weight multiplier 1.25, but the actual difference in CPU > > utilization, when running two CPU intense tasks, is ~10%: > > > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > > 8246 mingo 20 0 1576 244 196 R 55 0.0 0:11.96 loop > > 8247 mingo 21 1 1576 244 196 R 45 0.0 0:10.52 loop > > > > so the first task 'wins' +10% CPU utilization (relative to the 50% > > it had before), the second task 'loses' -10% CPU utilization > > (relative to the 50% it had before). > > As soon as you add another loop the difference changes again, while > it's always correct to say it gets 25% more cpu time [...] yep, and i'll add the relative effect to the comment too. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 11:20 ` Ingo Molnar @ 2007-07-16 11:58 ` Roman Zippel 2007-07-16 12:12 ` Ingo Molnar 2007-07-16 17:47 ` Linus Torvalds 0 siblings, 2 replies; 535+ messages in thread From: Roman Zippel @ 2007-07-16 11:58 UTC (permalink / raw) To: Ingo Molnar Cc: James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: > > As soon as you add another loop the difference changes again, while > > it's always correct to say it gets 25% more cpu time [...] > > yep, and i'll add the relative effect to the comment too. Why did you cut off the rest of the sentence? To illustrate the problem a little differently: a task with a nice level -20 got around 700% more cpu time (or 8 times more), now it gets 8500% more cpu time (or 86.7 times more). You don't think that change to the nice levels is a little drastic? bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
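Roman's figures fall out of the weight multiplier; a sketch, assuming a flat 1.25 factor per nice level (the kernel's integer weight table only approximates this):

```python
# Sketch: where the "8 times" vs "86.7 times" figures come from.

# Old scheduler, per the discussion: nice -20 vs nice 0 was roughly
# 1:8, i.e. "around 700% more cpu time".
old_ratio = 8.0

# CFS: 20 nice levels at a factor of ~1.25 each.
cfs_ratio = 1.25 ** 20
print(round(cfs_ratio, 1))            # ~86.7
print(round((cfs_ratio - 1) * 100))   # ~8570, i.e. the "8500% more" above
```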
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 11:58 ` Roman Zippel @ 2007-07-16 12:12 ` Ingo Molnar 2007-07-16 12:42 ` Roman Zippel 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-16 12:12 UTC (permalink / raw) To: Roman Zippel Cc: James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Roman Zippel <zippel@linux-m68k.org> wrote: > On Mon, 16 Jul 2007, Ingo Molnar wrote: > > > > As soon as you add another loop the difference changes again, > > > while it's always correct to say it gets 25% more cpu time [...] > > > > yep, and i'll add the relative effect to the comment too. > > Why did you cut off the rest of the sentence? (no need to become hostile, i answered that portion of your sentence separately, which was logically detached from the other portion of your sentence. I marked the cut with the '[...]' sign. ) > To illustrate the problem a little differently: a task with a nice level > -20 got around 700% more cpu time (or 8 times more), now it gets 8500% > more cpu time (or 86.7 times more). You don't think that change to the > nice levels is a little drastic? This was discussed on lkml in detail, see the CFS threads. It has been a common request for nice levels to be more logical (i.e. to make them universal and to detach them from HZ) and for them to be more effective as well. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 12:12 ` Ingo Molnar @ 2007-07-16 12:42 ` Roman Zippel 2007-07-16 13:40 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Roman Zippel @ 2007-07-16 12:42 UTC (permalink / raw) To: Ingo Molnar Cc: James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: > > > > As soon as you add another loop the difference changes again, > > > > while it's always correct to say it gets 25% more cpu time [...] > > > > > > yep, and i'll add the relative effect to the comment too. > > > > Why did you cut off the rest of the sentence? > > (no need to become hostile, i answered that portion of your sentence > separately, which was logically detached from the other portion of your > sentence. I marked the cut with the '[...]' sign. ) Could you please stop with these accusations? Could you please point me to the mail with the separate answer? > > To illustrate the problem a little differently: a task with a nice level > > -20 got around 700% more cpu time (or 8 times more), now it gets 8500% > > more cpu time (or 86.7 times more). You don't think that change to the > > nice levels is a little drastic? > > This was discussed on lkml in detail, see the CFS threads. Which are quite big, so I skipped most of it; a more precise pointer would be appreciated. > It has been a > common request for nice levels to be more logical (i.e. to make them universal and to detach them from HZ) and for them to be more effective > as well. Huh? What has this to do with HZ? The scheduler used ticks internally, but it's irrelevant to what the user sees via the nice levels. So the question still stands whether this change may be a little drastic, as you changed the nice levels of _all_ users, not just those who were previously interested in CFS. 
bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 12:42 ` Roman Zippel @ 2007-07-16 13:40 ` Ingo Molnar 2007-07-16 14:01 ` Roman Zippel 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-16 13:40 UTC (permalink / raw) To: Roman Zippel Cc: James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Roman Zippel <zippel@linux-m68k.org> wrote: > > It has been a common request for nice levels to be more logical > > (i.e. to make them universal and to detach them from HZ) and for > > them to be more effective as well. > > Huh? What has this to do with HZ? The scheduler used ticks internally, > but it's irrelevant to what the user sees via the nice levels. [...] unfortunately you are wrong again - there are various HZ related artifacts in the nice level support code of the old scheduler. v2.6.22, CONFIG_HZ=100, nice +19 task against a nice-0 CPU-intense task: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2446 mingo 25 0 1576 244 196 R 90.9 0.0 0:32.79 loop 2448 mingo 39 19 1580 248 196 R 9.1 0.0 0:02.94 loop v2.6.22, CONFIG_HZ=250, nice +19 task against a nice-0 CPU-intense task: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2358 mingo 25 0 1576 248 196 R 96.1 0.0 0:31.97 loop_silent 2363 mingo 39 19 1576 244 196 R 3.9 0.0 0:01.24 loop_silent v2.6.22, CONFIG_HZ=300, nice +19 task against a nice-0 CPU-intense task: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2332 mingo 25 0 1580 248 196 R 95.1 0.0 0:11.84 loop_silent 2335 mingo 39 19 1576 244 196 R 3.1 0.0 0:00.39 loop_silent to sum it up: a nice +19 task (the most commonly used nice level in practice) gets 9.1%, 3.9%, 3.1% of CPU time on the old scheduler, depending on the value of HZ. This is quite inconsistent and illogical. 
this HZ dependency of nice levels existed for many years, and the new scheduler solves that inconsistency - every nice level will get the same amount of time, regardless of HZ. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
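The HZ dependence measured above can be modelled with a small sketch. This is NOT the v2.6.22 code: it just assumes the old scheduler gave nice +19 a nominal ~5 ms timeslice rounded to whole ticks (never below one tick) against a ~100 ms nice-0 slice; both numbers are assumptions for illustration.

```python
# Sketch of the HZ artifact (assumed timeslice model, not kernel code).
import math

def old_nice19_share(hz, nice19_ms=5.0, nice0_ms=100.0):
    tick_ms = 1000.0 / hz
    # round the nice +19 slice down to whole ticks, minimum one tick
    slice19 = max(tick_ms, math.floor(nice19_ms / tick_ms) * tick_ms)
    return slice19 / (slice19 + nice0_ms)

for hz in (100, 250, 300):
    print(hz, round(old_nice19_share(hz) * 100, 1))
# yields roughly 9.1, 3.8, 3.2 -- close to the 9.1/3.9/3.1 measured above
```

Under this model the nice +19 share is dominated by tick granularity, which is exactly why it drifts as HZ changes.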
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 13:40 ` Ingo Molnar @ 2007-07-16 14:01 ` Roman Zippel 2007-07-16 20:31 ` Matt Mackall 0 siblings, 1 reply; 535+ messages in thread From: Roman Zippel @ 2007-07-16 14:01 UTC (permalink / raw) To: Ingo Molnar Cc: James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: > to sum it up: a nice +19 task (the most commonly used nice level in > practice) gets 9.1%, 3.9%, 3.1% of CPU time on the old scheduler, > depending on the value of HZ. This is quite inconsistent and illogical. You're correct that you can find artifacts in the extreme cases; it's subjective whether this is a serious problem. It's nice that these artifacts are gone, but that still doesn't explain why this ratio had to be increased that much, from around 1:10 to 1:69. bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 14:01 ` Roman Zippel @ 2007-07-16 20:31 ` Matt Mackall 2007-07-16 21:18 ` Ingo Molnar 2007-07-16 21:25 ` Roman Zippel 0 siblings, 2 replies; 535+ messages in thread From: Matt Mackall @ 2007-07-16 20:31 UTC (permalink / raw) To: Roman Zippel Cc: Ingo Molnar, James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright On Mon, Jul 16, 2007 at 04:01:17PM +0200, Roman Zippel wrote: > Hi, > > On Mon, 16 Jul 2007, Ingo Molnar wrote: > > > to sum it up: a nice +19 task (the most commonly used nice level in > > practice) gets 9.1%, 3.9%, 3.1% of CPU time on the old scheduler, > > depending on the value of HZ. This is quite inconsistent and illogical. > > You're correct that you can find artifacts in the extreme cases; it's > subjective whether this is a serious problem. > It's nice that these artifacts are gone, but that still doesn't explain > why this ratio had to be increased that much, from around 1:10 to 1:69. More dynamic range is better? If you actually want a task to get 20x the CPU time of another, the older scheduler doesn't really allow it. Getting 1/69th of a modern CPU is still a fair number of cycles. Never mind 1/69th of a machine with > 64 cores. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 20:31 ` Matt Mackall @ 2007-07-16 21:18 ` Ingo Molnar 2007-07-16 22:13 ` Roman Zippel 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-16 21:18 UTC (permalink / raw) To: Matt Mackall Cc: Roman Zippel, James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Matt Mackall <mpm@selenic.com> wrote: > More dynamic range is better? If you actually want a task to get 20x > the CPU time of another, the older scheduler doesn't really allow it. > > Getting 1/69th of a modern CPU is still a fair number of cycles. > Never mind 1/69th of a machine with > 64 cores. yeah. furthermore, nice -20 is only admin-selectable. Here are the current CPU-use values for positive nice levels: nice 0: 100.00% nice 1: 80.00% nice 2: 64.10% nice 3: 51.28% nice 4: 40.98% nice 5: 32.78% nice 6: 26.24% nice 7: 21.00% nice 8: 16.77% nice 9: 13.42% nice 10: 10.74% nice 11: 8.59% nice 12: 6.87% nice 13: 5.50% nice 14: 4.39% nice 15: 3.51% nice 16: 2.81% nice 17: 2.25% nice 18: 1.80% nice 19: 1.44% here's the CPU utilization table for negative nice levels (relative to a nice -20 task): nice 0: 1.15% nice -1: 1.44% nice -2: 1.80% nice -3: 2.25% nice -4: 2.81% nice -5: 3.51% nice -6: 4.39% nice -7: 5.50% nice -8: 6.87% nice -9: 8.59% nice -10: 10.74% nice -11: 13.42% nice -12: 16.77% nice -13: 21.00% nice -14: 26.24% nice -15: 32.78% nice -16: 40.98% nice -17: 51.28% nice -18: 64.10% nice -19: 80.00% nice -20: 100.00% these are pretty sane, and symmetric around the origin. Nice -20 is the odd one out, because there is no nice +20. But its value is still logical: it's the mirror image of an imaginary nice +20. and note that even on the old scheduler, nice-0 was "3200% more powerful" than nice +19 (with CONFIG_HZ=300), and nice -19 was only 700% more powerful than nice-0. 
So not only was it inconsistent (and i can create scary numbers too ;), it gave the admin-controlled negative nice levels less of a punch than the user-controlled nice +19. A number of people complained about that, and CFS addresses this. in fact i like it that nice -20 has a slightly bigger punch than it used to have before: it might remove the need to run audio apps (and other multimedia apps) under SCHED_FIFO. (SCHED_FIFO is unprotected against lockups, while under CFS a nice 0 task is still starvation protected against a nice -20 task.) furthermore, there is a quality-of-implementation issue as well; look at the definition of the nice system call: asmlinkage long sys_nice(int increment) the "increment" is relative. So nice(1) has the same behavioral effect under CFS, regardless of which nice level you start out from. Under the old scheduler, the result depended on which nice level you started out from. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
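Both tables above follow from the single rule "one nice level is a 1.25 weight factor"; a sketch (the kernel uses precomputed integer weights, so its entries differ slightly, e.g. 64.10% vs the ideal 64.00%):

```python
# Sketch: regenerate the CPU-use tables from a flat 1.25 factor.

def relative_cpu(nice, reference_nice=0):
    """CPU usage of a `nice` hog relative to a `reference_nice` hog,
    assuming each nice level scales the weight by exactly 1.25."""
    return 1.25 ** (reference_nice - nice) * 100

print(round(relative_cpu(1), 2))        # 80.0   ("nice 1: 80.00%")
print(round(relative_cpu(19), 2))       # ~1.44  ("nice 19: 1.44%")
print(round(relative_cpu(0, -20), 2))   # ~1.15  ("nice 0: 1.15%" vs -20)
```

The relative-increment property of sys_nice() follows from the same rule: the weight ratio between levels n and n+k is 1.25**k no matter what n is.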
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 21:18 ` Ingo Molnar @ 2007-07-16 22:13 ` Roman Zippel 2007-07-16 22:29 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Roman Zippel @ 2007-07-16 22:13 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Mon, 16 Jul 2007, Ingo Molnar wrote: > and note that even on the old scheduler, nice-0 was "3200% more > powerful" than nice +19 (with CONFIG_HZ=300), How did you get that value? At any HZ the ratio should be around 1:10 (+- rounding error). > in fact i like it that nice -20 has a slightly bigger punch than it used > to have before: "Slightly bigger"??? You're joking, right? The user levels especially are doing something completely different now, which may break user expectations. While the user couldn't expect anything precise, it's still a big difference whether a process at nice 5 gets 75% of the time or only 30%. bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 22:13 ` Roman Zippel @ 2007-07-16 22:29 ` Ingo Molnar 2007-07-17 0:02 ` Roman Zippel 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-16 22:29 UTC (permalink / raw) To: Roman Zippel Cc: Matt Mackall, James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Roman Zippel <zippel@linux-m68k.org> wrote: > Hi, > > On Mon, 16 Jul 2007, Ingo Molnar wrote: > > > and note that even on the old scheduler, nice-0 was "3200% more > > powerful" than nice +19 (with CONFIG_HZ=300), > > How did you get that value? At any HZ the ratio should be around 1:10 > (+- rounding error). you are wrong again. I sent you the numbers earlier today already: | PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND | 2332 mingo 25 0 1580 248 196 R 95.1 0.0 0:11.84 loop | 2335 mingo 39 19 1576 244 196 R 3.1 0.0 0:00.39 loop 95.1% is roughly 30.7 times 3.1%, a ratio of 1:30.67. You again deny above that this is the case, and there's nothing i can do about your denial of facts - that is your own private problem. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 22:29 ` Ingo Molnar @ 2007-07-17 0:02 ` Roman Zippel 2007-07-17 3:20 ` Roman Zippel 0 siblings, 1 reply; 535+ messages in thread From: Roman Zippel @ 2007-07-17 0:02 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Tue, 17 Jul 2007, Ingo Molnar wrote: > * Roman Zippel <zippel@linux-m68k.org> wrote: > > > Hi, > > > > On Mon, 16 Jul 2007, Ingo Molnar wrote: > > > > > and note that even on the old scheduler, nice-0 was "3200% more > > > powerful" than nice +19 (with CONFIG_HZ=300), > > > > How did you get that value? At any HZ the ratio should be around 1:10 > > (+- rounding error). > > you are wrong again. I sent you the numbers earlier today already: > > | PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > | 2332 mingo 25 0 1580 248 196 R 95.1 0.0 0:11.84 loop > | 2335 mingo 39 19 1576 244 196 R 3.1 0.0 0:00.39 loop > > 95.1% is roughly 30.7 times 3.1%, a ratio of 1:30.67. You again deny > above that this is the case, and there's nothing i can do about your > denial of facts - that is your own private problem. Ingo, how am I supposed to react to this? I'm asking a simple question and I get this? I'm at a serious loss how to deal with you. :-( The above is based on theoretical values: for a 300HZ kernel these two processes should get 30 and 3 ticks. Should there be any rounding error or off-by-one error, so that the processes get one tick less than they should or one tick is accounted to the wrong process, my theoretical value is still within the possible error range and doesn't contradict your practical values. Playing around with some other nice levels confirms the theory that something is a little off, so I'm quite correct in saying that the ratio _should_ be 1:10. OTOH you are the one who is wrong about me (again). 
:-( bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-17 0:02 ` Roman Zippel @ 2007-07-17 3:20 ` Roman Zippel 2007-07-17 8:02 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Roman Zippel @ 2007-07-17 3:20 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Tue, 17 Jul 2007, I wrote: > Playing around with some other nice levels confirms the theory that > something is a little off, so I'm quite correct in saying that the ratio > _should_ be 1:10. Rechecking everything, there was actually a small error in my test program, so the ratio should be about 1:20. Sorry about that mistake. Nice level 19 shows the largest artifacts, as that level only gets a single tick, so the ratio is often 1:HZ/10 (except for 1000HZ where it's 5:100). Nevertheless it's still true that in general the nice levels were independent of HZ (that's all I wanted to say a couple of mails ago). Ingo, you can now start gloating, but contrary to you I have no problems with admitting mistakes and apologizing for them. The point is just that I react better to factual arguments than to flames (and I think it's not just me), so I'm pretty sure I'm still correct about this: > OTOH you are the one who is wrong about me (again). :-( bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-17 3:20 ` Roman Zippel @ 2007-07-17 8:02 ` Ingo Molnar 2007-07-17 14:06 ` Roman Zippel 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-17 8:02 UTC (permalink / raw) To: Roman Zippel Cc: Matt Mackall, James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Roman Zippel <zippel@linux-m68k.org> wrote: > Nice level 19 shows the largest artifacts, as that level only gets a > single tick, so the ratio is often 1:HZ/10 (except for 1000HZ where > it's 5:100). [...] Roman, please do me a favor, and ask me the following question: " Ingo, you've been maintaining the scheduler for years. In fact you wrote the old nice code we are talking about here. You changed it a number of times since then. So you really know what's going on here. Why does the old nice code behave like that for nice +19 levels? " I've been waiting for that obvious question, and i _might_ be able to answer it, but somehow it never occurred to you ;-) Thanks, Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-17 8:02 ` Ingo Molnar @ 2007-07-17 14:06 ` Roman Zippel 2007-07-18 10:40 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Roman Zippel @ 2007-07-17 14:06 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Tue, 17 Jul 2007, Ingo Molnar wrote: > Roman, please do me a favor, and ask me the following question: > > " Ingo, you've been maintaining the scheduler for years. In fact you > wrote the old nice code we are talking about here. You changed it a > number of times since then. So you really know what's going on here. > Why does the old nice code behave like that for nice +19 levels? " > > I've been waiting for that obvious question, and i _might_ be able to > answer it, but somehow it never occurred to you ;-) Thanks, Do you have any idea how insulting and arrogant this is? Let me translate for you how this came across: "O Ingo, who art our god of the scheduler. You have blessed the paths I walked in. You kept me from sinning numerous times. Your wisdom is infinite. Guide me on the journey that layeth ahead of me into this world knowledge of Your truth." (I apologize in advance if I should have hurt anyone's religious feelings.) It's obvious that you have more experience with the scheduler code, but does that make you infallible? Does that give you the right to act like a jerk? I do make mistakes, I try to learn from them and life goes on, I have no problem with that, but what I do have a problem with is if someone is abusing this to his own advantage. I have to be extremely careful what I say to you, because you jump on the first small mistake and I have to bear your insults like "there's nothing i can do about your denial of facts - that is your own private problem." 
I have no problems with facts; I'm only trying very hard to ignore your arrogant behaviour... If you have something to contribute to this discussion which might clear things up, then just say it, but I'm not going to beg for it. bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-17 14:06 ` Roman Zippel @ 2007-07-18 10:40 ` Ingo Molnar 2007-07-18 12:40 ` Roman Zippel 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-18 10:40 UTC (permalink / raw) To: Roman Zippel Cc: Matt Mackall, James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Roman Zippel <zippel@linux-m68k.org> wrote: > On Tue, 17 Jul 2007, Ingo Molnar wrote: > > > Roman, please do me a favor, and ask me the following question: > > > > " Ingo, you've been maintaining the scheduler for years. In fact you > > wrote the old nice code we are talking about here. You changed it a > > number of times since then. So you really know what's going on here. > > Why does the old nice code behave like that for nice +19 levels? " > > > > I've been waiting for that obvious question, and i _might_ be able > > to answer it, but somehow it never occurred to you ;-) Thanks, [...] > It's obvious that you have more experience with the scheduler code, > but does that make you infallible? [...] Roman, it is really not about 'experience', and yes, we all make frequent mistakes. it's about the plain fact that i happened to write _both_ the old and the new code you were talking about all along. In this discussion about nice levels you were (very) aggressively asserting things that were untrue, you were suggesting that i don't understand the code, instead of simply asking me why the code was written in such a way and what the motivation behind it was. I'd be glad to attempt to answer such a friendly question, if you are interested in asking it and if you are interested in my answer. Thanks, Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-18 10:40 ` Ingo Molnar @ 2007-07-18 12:40 ` Roman Zippel 2007-07-18 16:17 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Roman Zippel @ 2007-07-18 12:40 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Wed, 18 Jul 2007, Ingo Molnar wrote: > > > Roman, please do me a favor, and ask me the following question: > > > > > > [insult deleted] > In this discussion about nice levels you were (very) aggressively asserting things that were untrue, Instead of simply asserting things, how about you provide some examples? So far I have made a single mistake: mixing up nice levels 18 and 19. If you would point me to such examples, I could learn how to tone it down a little, since the nice levels are not the only issue I have with the new scheduler; the heavy stuff is still to come. The problem here is that there is too much burnt ground, so I can't just present raw ideas, which get flamed by you; I have to be sufficiently confident they are valid, which you might then interpret as "aggressive assertion". > you were suggesting that i don't understand the code, Again, please point me to examples, so I at least have a chance to clear things up, since it was never my intention to make such a suggestion, but this gives me no chance to defend myself. OTOH I can tell you exactly how you continuously insult me, e.g. by suggesting I ask "stupid questions" or that I'm in "denial of facts". Don't make such suggestions if you have no idea how insulting they are. Especially the one deleted insult above where you have the impertinence to quote it, such tone is more appropriate between lord and inferior, where the latter have to make a request and the former "might" grant it. _Never_ make me beg. :-( bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-18 12:40 ` Roman Zippel @ 2007-07-18 16:17 ` Ingo Molnar 2007-07-20 13:38 ` Roman Zippel 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-18 16:17 UTC (permalink / raw) To: Roman Zippel Cc: Matt Mackall, James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Roman Zippel <zippel@linux-m68k.org> wrote: > Don't make such suggestions if you have no idea how insulting they > are. Especially the one deleted insult above where you have the > impertinence to quote it, such tone is more appropriate between lord > and inferior, where the latter have to make a request and the former > "might" grant it. [...] uhm, [and the uninterested reader might want to skip to the next mail ;-)], i'm really confused about your reply. Do you really mean this: > > Roman, please do me a favor, and ask me the following question: > > > > " Ingo, you've been maintaining the scheduler for years. In fact you > > wrote the old nice code we are talking about here. You changed it a > > number of times since then. So you really know what's going on here. > > Why does the old nice code behave like that for nice +19 levels? " > > > > I've been waiting for that obvious question, and i _might_ be able > > to answer it, but somehow it never occurred to you ;-) Thanks, the ";-)" emoticon (and its contents) clearly signals this as a sarcastic, tongue-in-cheek remark. To make it even clearer, please re-read it with the <sarcastic> tag added as well for clarity: > > <sarcastic> > > > > Roman, please do me a favor, and ask me the following question: > > > > " Ingo, you've been maintaining the scheduler for years. In fact you > > wrote the old nice code we are talking about here. You changed it a > > number of times since then. So you really know what's going on here. > > Why does the old nice code behave like that for nice +19 levels?
" > > > > I've been waiting for that obvious question, and i _might_ be able > > to answer it, but somehow it never occured to you ;-) Thanks, > > > > </sarcastic> ok? (If you didnt see/read it as sarcastic straight away then my apologies for insulting you!) The "_might_ be able to answer" bit is of course sarcastic too, and contrary to your (i have to say, pretty absurd) suggestion i did not suggest that i "might be _willing_ to answer" - which would be quite arrogant indeed and which i never said or suggested. To make it even clearer: i'm definitely able to answer questions about code i wrote originally and which i just changed, were you to show genuine interest in hearing my opinion :-) Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-18 16:17 ` Ingo Molnar @ 2007-07-20 13:38 ` Roman Zippel 0 siblings, 0 replies; 535+ messages in thread From: Roman Zippel @ 2007-07-20 13:38 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Wed, 18 Jul 2007, Ingo Molnar wrote: > > > [more rude insults deleted] > > > I've been waiting for that obvious question, and i _might_ be able > > > to answer it, but somehow it never occurred to you ;-) Thanks, > > the ";-)" emoticon (and its contents) clearly signals this as a > sarcastic, tongue-in-cheek remark. To take another example of why this is still insulting and inappropriate: this is behaviour I would characterize as school bullying. A bully attacks someone obviously weaker than himself and, for example, takes something away and then continues with "If you ask nicely I'll give it back to you.", often accompanied by laughter to signal he's enjoying himself and the power he has, but for the other person it's anything but funny. Maybe you don't know what it feels like, but I do, and I can't find anything funny, sarcastic or whatever about this, no matter how many smileys or other tags you add there. If the communication is already as troubled as this, such "humor" is really the worst thing you can do and I find it rather sad that you can't realize this yourself. > ok? (If you didn't see/read it as sarcastic straight away then my > apologies for insulting you!) Sorry, that is too little too late. You've apologized before and you continued to make fun of me personally to the point of spreading wrong information about me, which you could have very easily verified yourself, if you only wanted to. What I want from you is that you treat me with respect and keep your "sarcasm" to yourself. 
I told you very clearly how I think about you requoting this crap and yet you repeat it again _twice_, so on the one hand I get this apology attempt and on the other hand you continue to kick me in the crotch? How do you think I'm supposed to feel about this? It's also always interesting what you don't respond to. I asked you for examples which would prove the (rather strong) assertions you made about me; what does it tell me if you can't back up your statements? bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 20:31 ` Matt Mackall 2007-07-16 21:18 ` Ingo Molnar @ 2007-07-16 21:25 ` Roman Zippel 2007-07-17 7:53 ` Ingo Molnar 1 sibling, 1 reply; 535+ messages in thread From: Roman Zippel @ 2007-07-16 21:25 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Mon, 16 Jul 2007, Matt Mackall wrote: > > It's nice that these artifacts are gone, but that still doesn't explain > > why this ratio had to be increased that much from around 1:10 to 1:69. > > More dynamic range is better? If you actually want a task to get 20x > the CPU time of another, the older scheduler doesn't really allow it. You can already have that, the complete range from level 19 to -20 was about 1:80. There is also such a thing as too much range: I tried it with top at 19, and as soon as something runs at -20 it's practically dead, because it now gets only 1/5900 of the cpu time. bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
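The 1/5900 figure quoted above can be checked directly from CFS's weight table. A quick sketch; the two weight values are the prio_to_weight[] entries from kernel/sched.c of this period (nice +19 -> 15, nice -20 -> 88761), everything else here is illustrative:

```python
# Nice +19 and nice -20 weights from CFS's prio_to_weight[] table.
w_plus19, w_minus20 = 15, 88761

# Weight ratio between the -20 task and the +19 task:
ratio = w_minus20 / w_plus19
# CPU share of the +19 task when both tasks are runnable:
share_pct = 100 * w_plus19 / (w_plus19 + w_minus20)

print(round(ratio))          # ~5917, i.e. roughly "1/5900 of cpu time"
print(round(share_pct, 3))   # ~0.017% of the CPU
```

So the "practically dead" description above is not hyperbole: against a single nice -20 task, the nice +19 task gets well under a tenth of a percent of the CPU.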
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 21:25 ` Roman Zippel @ 2007-07-17 7:53 ` Ingo Molnar 2007-07-17 15:12 ` Roman Zippel 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-17 7:53 UTC (permalink / raw) To: Roman Zippel Cc: Matt Mackall, James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Roman Zippel <zippel@linux-m68k.org> wrote: > > > It's nice that these artifacts are gone, but that still doesn't > > > explain why this ratio had to be increased that much from around > > > 1:10 to 1:69. > > > > More dynamic range is better? If you actually want a task to get 20x > > the CPU time of another, the older scheduler doesn't really allow > > it. > > You can already have that, the complete range from level 19 to -20 was > about 1:80. But that is irrelevant: all tasks start out at nice 0, and what matters is the dynamic range around 0. So the dynamic range has been made uniform in the positive from 1:10...1:20...1:30 to 1:69 for nice +19, and from 1:8 to 1:69 on the negative side (with 1:86 at nice -20). If you look at the negative nice levels alone it's a substantial increase, but if you compare it with the positive nice levels you'll see that similar kinds of dynamic ranges were already present in the old scheduler, and you'll see why we've done it. Negative nice levels are admin-controlled, the increase in the negative levels is not a big issue and people actually like the increased dynamic range and the consistency. The positive range _might_ be a bigger issue but there we were largely inconsistent anyway, and again, people like the increased dynamic range. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-17 7:53 ` Ingo Molnar @ 2007-07-17 15:12 ` Roman Zippel 0 siblings, 0 replies; 535+ messages in thread From: Roman Zippel @ 2007-07-17 15:12 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, James Bruce, Thomas Gleixner, Mike Galbraith, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Tue, 17 Jul 2007, Ingo Molnar wrote: > * Roman Zippel <zippel@linux-m68k.org> wrote: > > > > > It's nice that these artifacts are gone, but that still doesn't > > > > explain why this ratio had to be increased that much from around > > > > 1:10 to 1:69. > > > > > > More dynamic range is better? If you actually want a task to get 20x > > > the CPU time of another, the older scheduler doesn't really allow > > > it. > > > > You can already have that, the complete range from level 19 to -20 was > > about 1:80. > > But that is irrelevant: all tasks start out at nice 0, and what matters > is the dynamic range around 0. > > So the dynamic range has been made uniform in the positive from > 1:10...1:20...1:30 to 1:69 for nice +19, and from 1:8 to 1:69 on the > negative side (with 1:86 at nice -20). If you look at the negative nice levels > alone it's a substantial increase but if you compare it with positive > nice levels you'll see that similar kinds of dynamic ranges were already present > in the old scheduler and you'll see why we've done it.
So let's look at them: for (i=0;i<20;i++) print i, " : ", (20-i)*5, " : ", 100*1.25^-i, " : ", e(l(2)*(-i/5))*100, "\n"; 0 : 100 : 100 : 100.00000000000000000000 1 : 95 : 80.00000000000000000000 : 87.05505632961241391300 2 : 90 : 64.00000000000000000000 : 75.78582832551990411700 3 : 85 : 51.20000000000000000000 : 65.97539553864471296900 4 : 80 : 40.96000000000000000000 : 57.43491774985175034000 5 : 75 : 32.76800000000000000000 : 50.00000000000000000000 6 : 70 : 26.21440000000000000000 : 43.52752816480620695700 7 : 65 : 20.97152000000000000000 : 37.89291416275995205900 8 : 60 : 16.77721600000000000000 : 32.98769776932235648400 9 : 55 : 13.42177280000000000000 : 28.71745887492587517000 10 : 50 : 10.73741824000000000000 : 25.00000000000000000000 11 : 45 : 8.58993459200000000000 : 21.76376408240310347800 12 : 40 : 6.87194767360000000000 : 18.94645708137997602900 13 : 35 : 5.49755813888000000000 : 16.49384888466117824200 14 : 30 : 4.39804651110400000000 : 14.35872943746293758500 15 : 25 : 3.51843720888320000000 : 12.50000000000000000000 16 : 20 : 2.81474976710656000000 : 10.88188204120155173900 17 : 15 : 2.25179981368524800000 : 9.47322854068998801400 18 : 10 : 1.80143985094819840000 : 8.24692444233058912100 19 : 5 : 1.44115188075855872000 : 7.17936471873146879200 (nice level : old % : new % : my suggested %) Your levels diverge very quickly from what they used to be (up to a factor of 7); it's also not really easy to remember what the individual levels mean. I at least try to keep them somewhat in the range they used to be (and the difference is limited to a factor of about 2); also, every 5 levels the amount of cpu time is halved, which is very easy to remember. If you need more dynamic range, is there a law that prevents us from going beyond 19?
For example: for (i=20;i<=30;i++) print i, " : ", (20-i)*5, " : ", 100*1.25^-i, " : ", e(l(2)*(-i/5))*100, "\n"; 20 : 0 : 1.15292150460684697600 : 6.25000000000000000000 21 : -5 : .92233720368547758000 : 5.44094102060077586900 22 : -10 : .73786976294838206400 : 4.73661427034499400700 23 : -15 : .59029581035870565100 : 4.12346222116529456000 24 : -20 : .47223664828696452100 : 3.58968235936573439600 25 : -25 : .37778931862957161700 : 3.12500000000000000000 26 : -30 : .30223145490365729300 : 2.72047051030038793400 27 : -35 : .24178516392292583400 : 2.36830713517249700300 28 : -40 : .19342813113834066700 : 2.06173111058264728000 29 : -45 : .15474250491067253400 : 1.79484117968286719800 30 : -50 : .12379400392853802700 : 1.56250000000000000000 setpriority() accepts such values without error. bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
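For reference, the two percentage columns from the tables above (labelled "new %" and "my suggested %") can be regenerated with a short script. This is just a sketch of the two formulas from the bc loops, not kernel code; the function names are illustrative:

```python
def cfs_pct(nice):
    """CFS's mapping: each nice level multiplies the share by 1/1.25."""
    return 100 * 1.25 ** -nice

def suggested_pct(nice):
    """The suggested mapping: the share halves every 5 nice levels."""
    return 100 * 2 ** (-nice / 5)

# Same values as the tables above, including the extended range past 19:
for nice in (1, 5, 19, 25, 30):
    print(nice, cfs_pct(nice), suggested_pct(nice))
```

At nice 19 the two mappings give roughly 1.44% versus 7.18%, which is the "factor of up to seven" divergence complained about earlier in the thread.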
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 11:58 ` Roman Zippel 2007-07-16 12:12 ` Ingo Molnar @ 2007-07-16 17:47 ` Linus Torvalds 2007-07-16 18:12 ` Roman Zippel 2007-07-18 10:27 ` Peter Zijlstra 1 sibling, 2 replies; 535+ messages in thread From: Linus Torvalds @ 2007-07-16 17:47 UTC (permalink / raw) To: Roman Zippel Cc: Ingo Molnar, James Bruce, Thomas Gleixner, Mike Galbraith, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright On Mon, 16 Jul 2007, Roman Zippel wrote: > > To illustrate the problem a little differently: a task with nice level -20 > got around 700% more cpu time (or 8 times more), now it gets 8500% more > cpu time (or 86.7 times more). Ingo, that _does_ sound excessive. How about trying a much less aggressive nice-level (and preferably linear, not exponential)? Linus ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 17:47 ` Linus Torvalds @ 2007-07-16 18:12 ` Roman Zippel 2007-07-18 10:27 ` Peter Zijlstra 1 sibling, 0 replies; 535+ messages in thread From: Roman Zippel @ 2007-07-16 18:12 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, James Bruce, Thomas Gleixner, Mike Galbraith, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Mon, 16 Jul 2007, Linus Torvalds wrote: > How about trying a much less aggressive nice-level (and preferably linear, > not exponential)? I think the exponential increase isn't the problem. The old code did approximate something like this rather crudely, with the result that there was a big gap between level 0 and -1. Something like this: echo 'for (i=-20;i<=20;i++) print i, " : ", 1024*e(l(2)*(-i/20*3)), "\n";' | bc -l would produce a range similar to the old code. Replacing the factor 3 with 4 would IMO be a more reasonable increase and would have the advantage for the user that it's easier to understand: every 5 levels, the time a process gets is doubled. bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
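The bc one-liner above boils down to a weight function weight(nice) = 1024 * 2^(-nice*k/20). A small sketch of the two variants discussed (names are illustrative): with k=3 the weights approximate the old scheduler's range (nice -20 weighs 8x nice 0, matching the "8 times more" figure quoted earlier in the thread), and with k=4 every 5 levels exactly halve the weight.

```python
def weight(nice, k):
    """Weight curve from the bc one-liner: 1024 * 2**(-nice*k/20)."""
    return 1024 * 2 ** (-nice * k / 20)

print(weight(-20, 3) / weight(0, 3))   # 8.0: old-style -20 vs 0 ratio
print(weight(0, 4) / weight(5, 4))     # 2.0: k=4 halves every 5 levels
```

The k=4 variant is the easy-to-remember rule Roman argues for: moving 5 nice levels always doubles or halves a task's share.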
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-16 17:47 ` Linus Torvalds 2007-07-16 18:12 ` Roman Zippel @ 2007-07-18 10:27 ` Peter Zijlstra 2007-07-18 12:45 ` Roman Zippel 1 sibling, 1 reply; 535+ messages in thread From: Peter Zijlstra @ 2007-07-18 10:27 UTC (permalink / raw) To: Linus Torvalds Cc: Roman Zippel, Ingo Molnar, James Bruce, Thomas Gleixner, Mike Galbraith, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright On Mon, 2007-07-16 at 10:47 -0700, Linus Torvalds wrote: > > On Mon, 16 Jul 2007, Roman Zippel wrote: > > > > To illustrate the problem a little differently: a task with nice level -20 > > got around 700% more cpu time (or 8 times more), now it gets 8500% more > > cpu time (or 86.7 times more). > > Ingo, that _does_ sound excessive. > > How about trying a much less aggressive nice-level (and preferably linear, > not exponential)? I actually like the extra range, it allows for a much softer punch of background tasks even on somewhat slower boxen. I've been testing CFS on my 1200 MHz lappy for some time and a strongly niced kbuild leaves a very usable system. The old scheduler would leave the thing rather jumpy. And while CFS fully fixes the jumpiness, I just did a nice +13 (which should be equivalent to the old scheduler's nice +19 for my HZ) and did a nice +19 kbuild and I can definitely feel the difference between them. Early CFS versions had a pretty aggressive nice range (0.1% for +19), and that has been toned down based on feedback. The current levels seem to work well, at least on my boxen. - Peter ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-18 10:27 ` Peter Zijlstra @ 2007-07-18 12:45 ` Roman Zippel 2007-07-18 12:52 ` Peter Zijlstra 0 siblings, 1 reply; 535+ messages in thread From: Roman Zippel @ 2007-07-18 12:45 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Ingo Molnar, James Bruce, Thomas Gleixner, Mike Galbraith, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Wed, 18 Jul 2007, Peter Zijlstra wrote: > I actually like the extra range, it allows for a much softer punch of > background tasks even on somewhat slower boxen. The extra range is not really a problem, in http://www.ussg.iu.edu/hypermail/linux/kernel/0707.2/0850.html I suggested how we can have both. bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-18 12:45 ` Roman Zippel @ 2007-07-18 12:52 ` Peter Zijlstra 2007-07-18 12:59 ` Ingo Molnar ` (2 more replies) 0 siblings, 3 replies; 535+ messages in thread From: Peter Zijlstra @ 2007-07-18 12:52 UTC (permalink / raw) To: Roman Zippel Cc: Linus Torvalds, Ingo Molnar, James Bruce, Thomas Gleixner, Mike Galbraith, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright On Wed, 2007-07-18 at 14:45 +0200, Roman Zippel wrote: > Hi, > > On Wed, 18 Jul 2007, Peter Zijlstra wrote: > > > I actually like the extra range, it allows for a much softer punch of > > background tasks even on somewhat slower boxen. > > The extra range is not really a problem, in > > http://www.ussg.iu.edu/hypermail/linux/kernel/0707.2/0850.html > > I suggested how we can have both. By breaking the UNIX model of nice levels. Not an option in my book. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-18 12:52 ` Peter Zijlstra @ 2007-07-18 12:59 ` Ingo Molnar 2007-07-18 13:07 ` Roman Zippel 2007-07-18 13:26 ` Roman Zippel 2 siblings, 0 replies; 535+ messages in thread From: Ingo Molnar @ 2007-07-18 12:59 UTC (permalink / raw) To: Peter Zijlstra Cc: Roman Zippel, Linus Torvalds, James Bruce, Thomas Gleixner, Mike Galbraith, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, 2007-07-18 at 14:45 +0200, Roman Zippel wrote: > > Hi, > > > > On Wed, 18 Jul 2007, Peter Zijlstra wrote: > > > > > I actually like the extra range, it allows for a much softer punch of > > > background tasks even on somewhat slower boxen. > > > > The extra range is not really a problem, in > > > > http://www.ussg.iu.edu/hypermail/linux/kernel/0707.2/0850.html > > > > I suggested how we can have both. > > By breaking the UNIX model of nice levels. Not an option in my book. yeah, that's pretty much out of question. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-18 12:52 ` Peter Zijlstra 2007-07-18 12:59 ` Ingo Molnar @ 2007-07-18 13:07 ` Roman Zippel 2007-07-18 13:27 ` Peter Zijlstra 2007-07-18 13:48 ` Ingo Molnar 2007-07-18 13:26 ` Roman Zippel 2 siblings, 2 replies; 535+ messages in thread From: Roman Zippel @ 2007-07-18 13:07 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Ingo Molnar, James Bruce, Thomas Gleixner, Mike Galbraith, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Wed, 18 Jul 2007, Peter Zijlstra wrote: > By breaking the UNIX model of nice levels. Not an option in my book. Breaking user expectations of nice levels is? bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-18 13:07 ` Roman Zippel @ 2007-07-18 13:27 ` Peter Zijlstra 2007-07-18 13:58 ` Roman Zippel 2007-07-18 13:48 ` Ingo Molnar 1 sibling, 1 reply; 535+ messages in thread From: Peter Zijlstra @ 2007-07-18 13:27 UTC (permalink / raw) To: Roman Zippel Cc: Linus Torvalds, Ingo Molnar, James Bruce, Thomas Gleixner, Mike Galbraith, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright On Wed, 2007-07-18 at 15:07 +0200, Roman Zippel wrote: > Hi, > > On Wed, 18 Jul 2007, Peter Zijlstra wrote: > > > By breaking the UNIX model of nice levels. Not an option in my book. > > Breaking user expectations of nice levels is? http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html specifically: "3.239 Nice Value A number used as advice to the system to alter process scheduling. Numerically smaller values give a process additional preference when scheduling a process to run. Numerically larger values reduce the preference and make a process less likely to run. Typically, a process with a smaller nice value runs to completion more quickly than an equivalent process with a higher nice value. The symbol {NZERO} specifies the default nice value of the system." The only expectation is that a process with a lower nice level gets more time. Any other expectation is a bug. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-18 13:27 ` Peter Zijlstra @ 2007-07-18 13:58 ` Roman Zippel 0 siblings, 0 replies; 535+ messages in thread From: Roman Zippel @ 2007-07-18 13:58 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Ingo Molnar, James Bruce, Thomas Gleixner, Mike Galbraith, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Wed, 18 Jul 2007, Peter Zijlstra wrote: > The only expectation is that a process with a lower nice level gets more > time. Any other expectation is a bug. Yes, users are buggy, they expect a lot of stupid things... Is this really reason enough to break this? What exactly is the damage if setpriority() accepts a few more levels? bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-18 13:07 ` Roman Zippel 2007-07-18 13:27 ` Peter Zijlstra @ 2007-07-18 13:48 ` Ingo Molnar 2007-07-18 14:14 ` Roman Zippel 1 sibling, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-18 13:48 UTC (permalink / raw) To: Roman Zippel Cc: Peter Zijlstra, Linus Torvalds, James Bruce, Thomas Gleixner, Mike Galbraith, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Roman Zippel <zippel@linux-m68k.org> wrote: > > By breaking the UNIX model of nice levels. Not an option in my book. > > Breaking user expectations of nice levels is? _changing_ it is an option within reason, and we've done it a couple of times already in the past, and even within CFS (as Peter correctly observed) we've been through a couple of iterations already. And as i mentioned it before, the outer edge of nice levels (+19, by far the most commonly used nice level) was inconsistent to begin with: 3%, 5%, 9% of nice-0, depending on HZ. So changing that to a consistent (and user-requested) 1.5% is a much smaller change than you seem to make it out to be. CFS itself is a far larger "change of expectations" than this tweak to nice levels. So by your standard we could never change the scheduler. (which your ultimate argument might be after all =B-) Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-18 13:48 ` Ingo Molnar @ 2007-07-18 14:14 ` Roman Zippel 2007-07-18 16:02 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Roman Zippel @ 2007-07-18 14:14 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Linus Torvalds, James Bruce, Thomas Gleixner, Mike Galbraith, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Wed, 18 Jul 2007, Ingo Molnar wrote: > _changing_ it is an option within reason, and we've done it a couple of > times already in the past, and even within CFS (as Peter correctly > observed) we've been through a couple of iterations already. And as i > mentioned it before, the outer edge of nice levels (+19, by far the most > commonly used nice level) was inconsistent to begin with: 3%, 5%, 9% of > nice-0, depending on HZ. Why do you constantly stress level 19? Yes, that one is special, all other positive levels were already relatively consistent. > So changing that to a consistent (and > user-requested) How old is CFS and how many users did it have so far? How many users does the old scheduler have, who will be exposed to the new one soon? > 1.5% is a much smaller change than you seem to make it > out to be. The percentage levels are off by a factor of up to _seven_; sorry, I fail to see how you can characterize this as "small". > So by your standard we could never change the > scheduler. (which your ultimate argument might be after all =B-) Careful, you make assertions about me for which you have absolutely no basis; adding a smiley doesn't make this any funnier. bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-18 14:14 ` Roman Zippel @ 2007-07-18 16:02 ` Ingo Molnar 2007-07-20 15:03 ` Roman Zippel 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-18 16:02 UTC (permalink / raw) To: Roman Zippel Cc: Peter Zijlstra, Linus Torvalds, James Bruce, Thomas Gleixner, Mike Galbraith, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Roman Zippel <zippel@linux-m68k.org> wrote: > > _changing_ it is an option within reason, and we've done it a couple > > of times already in the past, and even within CFS (as Peter > > correctly observed) we've been through a couple of iterations > > already. And as i mentioned it before, the outer edge of nice levels > > (+19, by far the most commonly used nice level) was inconsistent to > > begin with: 3%, 5%, 9% of nice-0, depending on HZ. > > Why do you constantly stress level 19? Yes, that one is special, all > other positive levels were already relatively consistent. i constantly stress it for the reason i mentioned a good number of times: because it's by far the most commonly used (and complained about) nice level. =B-) but because you are asking, i'm glad to give you some first-hand historic background about Linux nice levels (in case you are interested) and the motivations behind their old and new implementations: nice levels were always so weak under Linux (just read Peter's report) that people continuously bugged me about making nice +19 tasks use up much less CPU time. Unfortunately that was not that easy to implement (otherwise we'd have done it long ago) because nice level support was historically coupled to timeslice length, and timeslice units were driven by the HZ tick, so the smallest timeslice was 1/HZ. 
In the O(1) scheduler (about 4 years ago) i changed negative nice levels to be much stronger than they were before in 2.4 (and people were happy about that change), and i also intentionally calibrated the linear timeslice rule so that nice +19 level would be _exactly_ 1 jiffy. To better understand it, the timeslice graph went like this (cheesy ASCII art alert!): A \ | [timeslice length] \ | \ | \ | \ | \|___100msecs |^ . _ | ^ . _ | ^ . _ -*----------------------------------*-----> [nice level] -20 | +19 | | so that if someone wants to really renice tasks, +19 would give a much bigger hit than the normal linear rule would do. (The solution of changing the ABI to extend priorities was discarded early on.) This approach worked to some degree for some time, but later on with HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which we felt to be a bit excessive. Excessive _not_ because it's too small of a CPU utilization, but because it causes too frequent (once per millisec) rescheduling. (and would thus thrash the cache, etc. Remember, this was 4-5 years ago when hardware was weaker and caches were smaller, and people were running number crunching apps at nice +19.) So for HZ=1000 i changed nice +19 to 5msecs, because that felt like the right minimal granularity - and this translates to 5% CPU utilization. But the fundamental HZ-sensitive property for nice+19 still remained, and i never got a single complaint about nice +19 being too _weak_ in terms of CPU utilization, i only got complaints about it (still) being way too _strong_. To sum it up: i always wanted to make nice levels more consistent, but within the constraints of HZ and jiffies and their nasty design level coupling to timeslices and granularity it was not really viable.
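The 3%/5%/9% figures mentioned earlier in the thread drop out of this history. A back-of-the-envelope sketch (illustrative only, not the literal O(1) scheduler code): give nice 0 a 100ms timeslice, clamp nice +19 to the tick-driven minimum described above, and compute the +19 task's share against a single nice-0 task:

```python
def plus19_share(hz, nice0_ms=100.0, min_jiffies=1):
    """Share (in %) of a nice +19 task running against one nice 0 task."""
    slice19_ms = min_jiffies * 1000.0 / hz
    return 100 * slice19_ms / (slice19_ms + nice0_ms)

print(round(plus19_share(100)))                   # HZ=100: 1 jiffy = 10ms -> ~9%
print(round(plus19_share(300)))                   # HZ=300: ~3.3ms         -> ~3%
print(round(plus19_share(1000, min_jiffies=5)))   # HZ=1000: 5ms minimum   -> ~5%
```

Under these assumptions the +19 share really does wander between roughly 3% and 9% purely as a function of the kernel's HZ setting, which is the inconsistency CFS's HZ-independent weights remove.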
The second (less frequent but still periodically occurring) complaint about Linux's nice level support was its asymmetry around the origin (which you can see demonstrated in the picture above), or more accurately: the fact that nice level behavior depended on the _absolute_ nice level as well, while the nice API itself is fundamentally "relative": int nice(int inc); asmlinkage long sys_nice(int increment) (the first one is the glibc API, the second one is the syscall API.) Note that the 'inc' is relative to the current nice level. Tools like bash's "nice" command mirror this relative API. With the old scheduler, if you for example started a niced task with +1 and another task with +2, the CPU split between the two tasks would depend on the nice level of the parent shell - if it was at nice -10 the CPU split was different than if it was at +5 or +10. A third complaint against Linux's nice level support was that negative nice levels were not 'punchy enough', so lots of people had to resort to running audio (and other multimedia) apps under RT priorities such as SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation proof, and a buggy SCHED_FIFO app can also lock up the system for good. CFS addresses all three types of complaints: To address the first complaint (of nice levels being not "punchy" enough), i decoupled the scheduler from 'time slice' and HZ concepts (and made granularity a separate concept from nice levels) and thus CFS was able to implement better and more consistent nice +19 support: now in CFS nice +19 tasks get a HZ-independent 1.5%, instead of the variable 3%-5%-9% range they got in the old scheduler. To address the second complaint (of nice levels not being consistent), i made nice(1) have the same CPU utilization effect on tasks, regardless of their absolute nice levels. So on CFS, running a nice +10 and a nice +11 task has the same CPU utilization "split" between them as running a nice -5 and a nice -4 task.
(one will get 55% of the CPU, the other 45%.) That is why I changed nice levels to be "multiplicative" (or exponential) - that way it does not matter which nice level you start out from, the 'relative result' will always be the same. The third complaint (of negative nice levels not being "punchy" enough and forcing audio apps to run under the more dangerous SCHED_FIFO scheduling policy) is addressed by CFS almost automatically: stronger negative nice levels are an automatic side-effect of the recalibrated dynamic range of nice levels. Hope this helps, Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
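The multiplicative property described above can be sketched with a few entries of CFS's prio_to_weight[] table (weight values as in kernel/sched.c of this period; the helper function is illustrative):

```python
# Selected prio_to_weight[] entries; each nice step differs by ~1.25x.
weight = {-20: 88761, -5: 3121, -4: 2501, 0: 1024, 1: 820, 19: 15}

def split(a, b):
    """CPU split (in %) between two runnable tasks with weights a and b."""
    total = a + b
    return 100 * a / total, 100 * b / total

# nice 0 vs +1 and nice -5 vs -4 give (almost) the same ~55/45 split:
print(split(weight[0], weight[1]))    # ~ (55.5, 44.5)
print(split(weight[-5], weight[-4]))  # ~ (55.5, 44.5)
# and the recalibrated range makes nice -20 ~86.7x the weight of nice 0:
print(weight[-20] / weight[0])
```

Because every step is the same multiplicative factor, the split between two tasks one nice level apart is identical no matter where on the scale they sit, which is exactly the "relative API" behavior the mail argues for.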
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-18 16:02 ` Ingo Molnar @ 2007-07-20 15:03 ` Roman Zippel 0 siblings, 0 replies; 535+ messages in thread From: Roman Zippel @ 2007-07-20 15:03 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Linus Torvalds, James Bruce, Thomas Gleixner, Mike Galbraith, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Wed, 18 Jul 2007, Ingo Molnar wrote: > > Why do you constantly stress level 19? Yes, that one is special, all > > other positive levels were already relatively consistent. > > i constantly stress it for the reason i mentioned a good number of > times: because it's by far the most commonly used (and complained about) > nice level. =B-) How do you know that? Does most complained about make it most commonly used? > but because you are asking, i'm glad to give you some first-hand > historic background about Linux nice levels (in case you are interested) > and the motivations behind their old and new implementations: I guess I should be thankful now? I'm curious why you post this now, after I "asked" about this. Most of the information is either rather generic or not specific enough for the problem at hand. If you had posted this information earlier, it would have been far more valuable as it could have been a nice basis for a discussion. But posting it this late, I can't shake the feeling you're more interested in "teaching" me. > nice levels were always so weak under Linux (just read Peter's report) -ENOLINK > Hope this helps, Not completely. For negative nice levels you mentioned audio apps, but these aren't really interested in a fair share, they would use the higher percentage only to guarantee they get the amount of time they need independent of the current load. I think they would be better served with e.g. a deadline scheduler, which guarantees them an absolute time share, not a relative one.
On the other end, with positive levels, I remember more requests for something closer to idle scheduling, where a process only runs when nothing else is running. So assuming we had scheduling classes for the above use cases, what other reasons are left for such extreme nice levels? My proposed nice levels have otherwise the same properties as yours (e.g. being consistent). There is one property you haven't commented on at all yet. My proposed levels give the average user a far better idea of what they actually mean, i.e. that every 5 levels a process gets double/half the cpu time. This is IMO a considerable advantage. bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-18 12:52 ` Peter Zijlstra 2007-07-18 12:59 ` Ingo Molnar 2007-07-18 13:07 ` Roman Zippel @ 2007-07-18 13:26 ` Roman Zippel 2007-07-18 13:31 ` Peter Zijlstra 2 siblings, 1 reply; 535+ messages in thread From: Roman Zippel @ 2007-07-18 13:26 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Ingo Molnar, James Bruce, Thomas Gleixner, Mike Galbraith, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Hi, On Wed, 18 Jul 2007, Peter Zijlstra wrote: > By breaking the UNIX model of nice levels. Not an option in my book. BTW what is the "UNIX model of nice levels"? SUS specifies the limit via NZERO, which is defined as "Minimum Acceptable Value: 20", I can't find any information that it must be 20. bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: [PATCH] CFS: Fix missing digit off in wmult table 2007-07-18 13:26 ` Roman Zippel @ 2007-07-18 13:31 ` Peter Zijlstra 0 siblings, 0 replies; 535+ messages in thread From: Peter Zijlstra @ 2007-07-18 13:31 UTC (permalink / raw) To: Roman Zippel Cc: Linus Torvalds, Ingo Molnar, James Bruce, Thomas Gleixner, Mike Galbraith, Andrea Arcangeli, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright On Wed, 2007-07-18 at 15:26 +0200, Roman Zippel wrote: > Hi, > > On Wed, 18 Jul 2007, Peter Zijlstra wrote: > > > By breaking the UNIX model of nice levels. Not an option in my book. > > BTW what is the "UNIX model of nice levels"? > > SUS specifies the limit via NZERO, which is defined as "Minimum Acceptable > Value: 20", I can't find any information that it must be 20. I have never encountered a UNIX where it is anything other than 20. Convention (alas not specification) does dictate 20. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-13 17:23 ` Roman Zippel 2007-07-13 19:43 ` [PATCH] CFS: Fix missing digit off in wmult table Thomas Gleixner @ 2007-07-14 5:04 ` Mike Galbraith 2007-08-01 3:41 ` CFS review Roman Zippel 1 sibling, 1 reply; 535+ messages in thread From: Mike Galbraith @ 2007-07-14 5:04 UTC (permalink / raw) To: Roman Zippel Cc: Linus Torvalds, Andrea Arcangeli, Andi Kleen, Ingo Molnar, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Chris Wright On Fri, 2007-07-13 at 19:23 +0200, Roman Zippel wrote: > Hi, > > On Fri, 13 Jul 2007, Mike Galbraith wrote: > > > > The new scheduler does _a_lot_ of heavy 64 bit calculations without any > > > attempt to scale that down a little... > > > > See prio_to_weight[], prio_to_wmult[] and sysctl_sched_stat_granularity. > > Perhaps more can be done, but "without any attempt..." isn't accurate. > > Calculating these values at runtime would have been completely insane, the > alternative would be a crummy approximation, so using a lookup table is > actually a good thing. That's not the problem. I meant see usage. -Mike ^ permalink raw reply [flat|nested] 535+ messages in thread
* CFS review 2007-07-14 5:04 ` x86 status was Re: -mm merge plans for 2.6.23 Mike Galbraith @ 2007-08-01 3:41 ` Roman Zippel 2007-08-01 7:12 ` Ingo Molnar ` (6 more replies) 0 siblings, 7 replies; 535+ messages in thread From: Roman Zippel @ 2007-08-01 3:41 UTC (permalink / raw) To: Mike Galbraith; +Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, linux-kernel [-- Attachment #1: Type: TEXT/PLAIN, Size: 11264 bytes --] Hi, On Sat, 14 Jul 2007, Mike Galbraith wrote: > > On Fri, 13 Jul 2007, Mike Galbraith wrote: > > > > > > The new scheduler does _a_lot_ of heavy 64 bit calculations without any > > > > attempt to scale that down a little... > > > > > > See prio_to_weight[], prio_to_wmult[] and sysctl_sched_stat_granularity. > > > Perhaps more can be done, but "without any attempt..." isn't accurate. > > > > Calculating these values at runtime would have been completely insane, the > > alternative would be a crummy approximation, so using a lookup table is > > actually a good thing. That's not the problem. > > I meant see usage.

I more meant serious attempts. At this point I'm not that much interested in a few localized optimizations; what I'm interested in is how this can be optimized at the design level (e.g. how arch information can be used to simplify things). So I spent quite a bit of time looking through cfs and experimenting with some ideas. I want to put the main focus on the performance aspect, but there are a few other issues as well.

But first something else (especially for Ingo): I tried to be very careful with any claims made in this mail, but this of course doesn't exclude the possibility of errors, in which case I'd appreciate any corrections. Any explanations given in this mail don't imply that anyone needs such explanations; they're there to keep things in context, so that interested readers have a chance to follow even if they don't have the complete background information.
Any suggestions made don't imply that they have to be implemented like this; they are more of an incentive for further discussion, and I'm always interested in better solutions.

A first indication that something may not be quite right is the increase in code size:

2.6.22:
   text    data     bss     dec     hex filename
  10150      24    3344   13518    34ce kernel/sched.o
recent git:
   text    data     bss     dec     hex filename
  14724     228    2020   16972    424c kernel/sched.o

That's i386 without stats/debug. A lot of the new code is in regularly executed regions and it's often not exactly trivial code, as cfs added lots of heavy 64bit calculations. With the increased text comes increased runtime memory usage, e.g. task_struct grew so that only 5 of them instead of 6 now fit into 8KB.

sched-design-CFS.txt doesn't really go into any serious detail, so the EEVDF paper was more helpful, and after playing with the ideas a little I noticed that the whole idea of fair scheduling can be explained somewhat more simply; I'm a little surprised not to find it mentioned anywhere. A different view on this is that the runtime of a task is simply normalized, and the virtual time (or fair_clock) is the weighted average of these normalized runtimes. The advantage of normalization is that it makes things comparable: once the normalized time values are equal, each task has got its fair share. This is more obvious in the EEVDF paper; cfs makes it a bit more complicated, as it uses the virtual time to calculate the eligible runtime, but it doesn't maintain a per-process virtual time (fair_key is not quite the same).

Here we get to the first problem: cfs is not overly accurate at maintaining a precise balance. First, there are a lot of rounding errors due to the constant conversion between normalized and non-normalized values, and the higher the update frequency, the bigger the error. The effect of this can be seen by running:

	while (1)
		sched_yield();

and watching the sched_debug output, where the underrun counter goes crazy.
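[Editorial note: the normalization idea Roman describes can be sketched in a few lines of C. This is a toy model written for this summary, not kernel code, and the names `toy_task`, `toy_account`, `toy_pick` and the fixed-point shift are made up: each task's raw runtime is divided by its weight, and the scheduler always runs the task whose normalized time is smallest, so CPU shares converge to the weight ratio without ever converting back to real time.]

```c
#include <assert.h>

#define TOY_SHIFT 16  /* fixed-point fraction bits, an arbitrary choice */

struct toy_task {
	unsigned int weight;       /* scheduling weight (from nice level) */
	unsigned long long norm;   /* normalized time: runtime / weight */
};

/* Account 'delta' units of raw runtime to task t in normalized form. */
static void toy_account(struct toy_task *t, unsigned long long delta)
{
	t->norm += (delta << TOY_SHIFT) / t->weight;
}

/* Pick the task with the smallest normalized time: it is the one
 * furthest behind its fair share. Ties go to the lowest index. */
static int toy_pick(const struct toy_task *t, int n)
{
	int i, best = 0;

	for (i = 1; i < n; i++)
		if (t[i].norm < t[best].norm)
			best = i;
	return best;
}
```

Running this loop tick by tick, a task with twice the weight receives twice the CPU time, because its normalized time advances half as fast per tick.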
cfs thus needs the limiting to keep this misbehaviour under control. The problem here is that it's not that difficult to hit one of the many limits, which may change the behaviour and makes it hard to predict how cfs will behave in different situations.

The next issue is scheduler granularity; here I don't quite understand why the actual running time has no influence at all, which makes it difficult to predict how much cpu time a process will get at a time (even the comments only refer to the vmstat output). What is basically used instead is the normalized time since the task was enqueued; in practice it's a bit more complicated, as fair_key is not entirely a normalized time value. If the wait_runtime value is positive, higher prioritized tasks are given even more priority than they already get from their larger wait_runtime value. The problem is that this triggers underruns, so lower priority tasks get even less time.

Another issue is the sleep bonus given to sleeping tasks. The problem here is that this can be exploited: if a job is spread over a few threads, they can get more time relative to other tasks, e.g.
in this example there are three tasks that run only for about 1ms every 3ms, but they get far more time than they should fairly have gotten:

 4544 roman     20   0  1796  520  432 S 32.1  0.4   0:21.08 lt
 4545 roman     20   0  1796  344  256 R 32.1  0.3   0:21.07 lt
 4546 roman     20   0  1796  344  256 R 31.7  0.3   0:21.07 lt
 4547 roman     20   0  1532  272  216 R  3.3  0.2   0:01.94 l

The debug output for this is also interesting:

task  PID        tree-key    delta  waiting switches prio    sum-exec    sum-wait   sum-sleep wait-overrun wait-underrun
------------------------------------------------------------------------------------------------------------------------
  lt 4544 42958653977764 -2800118  2797765    11753  120 11449444657   201615584 23750292211         9600             0
  lt 4545 42958653767067 -3010815  2759284    11747  120 11442604925   202910051 23746575676         9613             0
   l 4547 42958653330276 -3447606  1332892    32333  120  1035284848   -47628857           0            0         14247

Practically this means a few interactive tasks can steal quite a lot of time from other tasks which are trying to get some work done. This may be fine in desktop environments, but I'm not sure it can be that easily generalized. It can make cfs quite unfair if waiting outside the runqueue has more advantages than waiting in the runqueue.

Finally there is still the issue with the nice levels, where I still think the massive increase overcompensates for adequate scheduling classes, e.g. to give audio apps a fixed amount of time instead of a relative portion.

Overall, while cfs has a number of good ideas, I don't think it was quite ready for a stable release. Maybe it's possible to fix this before the release, but I certainly would have preferred less time pressure. It's not just the small inaccuracies, it's also the significantly increased complexity.
In this regard I find the claim that cfs "has no heuristics whatsoever" interesting; for that to be true I would expect a little more accuracy. That wouldn't be a big problem if cfs were a really fast and compact scheduler, but it's quite hard to argue that cfs has been an improvement in this particular area.

Anyway, before Ingo starts accusing me of "negativism", let's look at some possibilities for how this could be improved. To reduce the inaccuracies it's better to avoid conversion between normalized and real time values; the first example program pretty much shows just that and demonstrates the very core of a scheduler. It maintains per-task normalized times and uses only those to make the scheduling decision (it doesn't even need an explicit wait_runtime value). The nice thing about this is how consistently it gives out time shares - unless there is a higher priority task, a task will get the maximum share, which makes the scheduling behaviour quite a bit more predictable.

The second example program is more complete (e.g. it demonstrates adding/removing tasks) but is based on the same basic idea. First the floating point values are converted to fixed point values. To maintain accuracy one has to take overflows into account. cfs currently avoids overflows by throwing away (possibly important) information, but that adds checks all over the place instead of dealing with them within the design, so I consider this overflow avoidance a failed strategy - it doesn't make anything simpler and creates problems elsewhere. Of course the example program has its own limits, but in this case I can define them, so that within them the scheduler will work correctly. The most important limits are:

	max normalized task time delta = max inverse weight * max cpu share
	max normalized average time delta = max active tasks * max inverse weight * cpu share base

The first limit is used for comparing individual tasks and the second one is used for maintaining the average.
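[Editorial note: the "deal with overflows within the design" approach used by the second example program rests on a classic trick, sketched below for clarity (the helper name `time_before` is illustrative, chosen after the kernel's jiffies macros). As long as the true distance between two counters is bounded by half the type's range - which is exactly what the stated limits guarantee - an unsigned subtraction followed by a signed cast orders them correctly even when the counter has wrapped around.]

```c
#include <assert.h>

/* Returns nonzero if timestamp a is before b. Correct even across
 * counter wraparound, provided the real distance between a and b is
 * less than half the range of unsigned int. */
static int time_before(unsigned int a, unsigned int b)
{
	return (int)(a - b) < 0;
}
```

This is why the example program compares `(int)(task[i].time_norm - task[j].time_norm)` rather than the values directly: no explicit overflow checks are needed anywhere.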
With these limits it's possible to choose the appropriate data types that can hold these maximum values; then I don't really have to worry about overflows and I know the scheduler is accurate without the need for a lot of extra checks.

The second example also adds a normalized average, but contrary to current cfs it's not central to the design. An average is not needed to give every task its fair share, but it can be used to make better scheduling decisions to control _when_ a task gets its share. In this example I provided two possibilities where it's used to schedule new tasks: the first divides time into slices and sets a new task to the start of that slice; the second gives the task a full time share relative to the current average, but approximating the average (by looking just at the min/max time) should work as well. The average here is not a weighted average; a weighted average is a little more complex to maintain accurately and has issues regarding overflows, so I'm using a simple average, which is sufficient especially since it's no longer a primary part of the scheduler. BTW, the above unfairness caused by sleeping tasks can be easily avoided in this model by simply making sure that normalized time never goes backward.

The accuracy of this model makes it possible to further optimize the code (it's really a key element, that's why I'm stressing it so much; OTOH it's rather difficult to further optimize current cfs without risking making it worse). For example, the regular updates aren't really necessary; they can also be done only when necessary (i.e. when scheduling, where an update is needed anyway for precise accounting), and the next schedule time can easily be precalculated instead. OTOH the regular updates allow for very cheap incremental updates; especially if one already knows that the scheduler clock has only limited resolution (because it's based on jiffies), it becomes possible to use mostly 32bit values.
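[Editorial note: the "normalized time never goes backward" fix mentioned above can be sketched as a simple clamp applied on wakeup. This is a hypothetical helper written for this summary, not code from the example programs: a sleeper's stale normalized time is raised to the current queue minimum but never lowered, so sleeping can at most forgive lag, never bank up credit against the tasks that kept running.]

```c
#include <assert.h>

/* On wakeup, clamp a sleeper's normalized time so it never moves
 * backward relative to the running tasks: the task resumes at its own
 * normalized time or at the current queue minimum, whichever is later. */
static unsigned long long wakeup_norm(unsigned long long task_norm,
				      unsigned long long queue_min)
{
	return task_norm > queue_min ? task_norm : queue_min;
}
```

With this rule, the lt-style exploit above stops working: a thread that sleeps 2ms out of every 3ms rejoins at the queue minimum instead of keeping an ever-growing claim on the CPU.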
I hope the code example helps to further improve the scheduler. I'm quite aware that it doesn't implement everything, but this just means some of the cfs design decisions need more explanation. I'm not really that much interested in the scheduler itself; I only want a small and fast scheduler, and those are areas where cfs is no real improvement right now. cfs practically obliterated the efforts I put into the ntp code to keep the regular updates both cheap and highly precise...

bye, Roman

[-- Attachment #2: Type: TEXT/x-csrc, Size: 1247 bytes --]

#include <stdio.h>

int weight[10] = { 20, 20, 20, 50, 20, 20, 20 };
double time[10];
double ntime[10];

#define MIN_S 1
#define MAX_S 10
#define SLICE(x) (MAX_S / (double)weight[x])
#define MINSLICE(x) (MIN_S / (double)weight[x])

int main(void)
{
	int i, j, l, w;
	double s, t, min, max;

	for (i = 0; i < 10; i++)
		ntime[i] = time[i] = 0;
	j = 0;
	l = 0;
	s = 0;
	while (1) {
		j = l ? 0 : 1;
		for (i = 0; i < 10; i++) {
			if (!weight[i] || i == l)
				continue;
			if (ntime[i] + MINSLICE(i) < ntime[j] + MINSLICE(j))
				j = i;
		}
		if (ntime[l] >= ntime[j] + SLICE(j) ||
		    (ntime[l] >= ntime[j] && ntime[l] >= s + SLICE(l))) {
			l = j;
			s = ntime[l];
		}
		time[l] += MIN_S;
		ntime[l] += MIN_S / (double)weight[l];
		printf("%u", l);
		for (i = 0, w = 0, t = 0; i < 10; i++) {
			if (!weight[i])
				continue;
			w += weight[i];
			t += ntime[i] * weight[i];
			printf("\t%3u/%u:%5g/%-7g", i, weight[i], time[i], ntime[i]);
		}
		t /= w;
		min = max = t;
		for (i = 0; i < 10; i++) {
			if (!weight[i])
				continue;
			if (ntime[i] < min)
				min = ntime[i];
			if (ntime[i] > max)
				max = ntime[i];
		}
		printf("\t| %g (%g)\n", t, max - min);
	}
}

[-- Attachment #3: Type: TEXT/x-csrc, Size: 4966 bytes --]

#include <stdio.h>
#include <stdlib.h>

struct task {
	unsigned int weight, weight_inv;
	int active;
	unsigned int time, time_avg;
	int time_norm, avg_fract;
} task[10] = {
	{ .weight = 10 },
	{ .weight = 40 },
	{ .weight = 80 },
};

#define MIN_S 100
#define MAX_S 1000
#define SLICE(x) (MAX_S * task[x].weight_inv)
#define MINSLICE(x) (MIN_S * task[x].weight_inv)

#define WEIGTH0 40
#define WEIGTH0_INV ((1 << 16) / WEIGTH0)

unsigned int time_avg, time_norm_sum;
int avg_fract, weight_sum_inv;

static void normalize_avg(int i)
{
	if (!weight_sum_inv)
		return;
	/* assume the common case of 0/1 first, then fallback */
	if (task[i].avg_fract < 0 || task[i].avg_fract >= WEIGTH0_INV * MAX_S) {
		task[i].time_avg++;
		task[i].avg_fract -= WEIGTH0_INV * MAX_S;
		if (task[i].avg_fract < 0 || task[i].avg_fract >= WEIGTH0_INV * MAX_S) {
			task[i].time_avg += task[i].avg_fract / (WEIGTH0_INV * MAX_S);
			task[i].avg_fract %= WEIGTH0_INV * MAX_S;
		}
	}
	if (avg_fract < 0 || avg_fract >= weight_sum_inv) {
		time_avg++;
		avg_fract -= weight_sum_inv;
		if (avg_fract < 0 || avg_fract >= weight_sum_inv) {
			time_avg += avg_fract / weight_sum_inv;
			avg_fract %= weight_sum_inv;
		}
	}
}

int main(void)
{
	int i, j, l, task_cnt;
	unsigned int s;
	unsigned int time_sum, time_sum2;

	task_cnt = time_avg = 0;
	for (i = 0; i < 10; i++) {
		if (!task[i].weight)
			continue;
		task[i].active = 1;
		task_cnt++;
		task[i].weight_inv = (1 << 16) / task[i].weight;
	}
	weight_sum_inv = task_cnt * WEIGTH0_INV * MAX_S;
	printf("w: %u,%u\n", WEIGTH0_INV * MAX_S, weight_sum_inv);
	time_norm_sum = avg_fract = 0;
	l = 0;
	s = 0;
	while (1) {
		j = -1;
		for (i = 0; i < 10; i++) {
			if (i == l)
				continue;
			if (!task[i].active && task[i].weight) {
				if (!(rand() % 30)) {
					normalize_avg(i);
					task[i].active = 1;
					if (!task_cnt)
						goto done;
#if 1
					if ((int)(task[i].time_avg - time_avg) < 0) {
						task[i].time_norm -= (int)(task[i].time_avg - time_avg) * WEIGTH0_INV * MAX_S + task[i].avg_fract;
						task[i].time_avg = time_avg;
						task[i].avg_fract = 0;
					}
#else
					unsigned int new_time_avg = time_avg;
					int new_avg_fract = avg_fract / task_cnt - task[i].weight_inv * MAX_S;
					while (new_avg_fract < 0) {
						new_time_avg--;
						new_avg_fract += WEIGTH0_INV * MAX_S;
					}
					if ((int)(task[i].time_avg - new_time_avg) < 0 ||
					    ((int)(task[i].time_avg - new_time_avg) == 0 && task[i].avg_fract < new_avg_fract)) {
						task[i].time_norm += (int)(new_time_avg - task[i].time_avg) * WEIGTH0_INV * MAX_S + new_avg_fract - task[i].avg_fract;
						task[i].time_avg = new_time_avg;
						task[i].avg_fract = new_avg_fract;
					}
#endif
done:
					task_cnt++;
					weight_sum_inv += WEIGTH0_INV * MAX_S;
					avg_fract += (int)(task[i].time_avg - time_avg) * WEIGTH0_INV * MAX_S + task[i].avg_fract;
					time_norm_sum += task[i].time_norm;
				}
			}
			if (!task[i].active)
				continue;
			if (j < 0 || (int)(task[i].time_norm + MINSLICE(i) - (task[j].time_norm + MINSLICE(j))) < 0)
				j = i;
		}
		if (!task[l].active) {
			if (j < 0)
				continue;
			goto do_switch;
		}
		if (!(rand() % 100)) {
			task[l].active = 0;
			task_cnt--;
			weight_sum_inv -= WEIGTH0_INV * MAX_S;
			avg_fract -= (int)(task[l].time_avg - time_avg) * WEIGTH0_INV * MAX_S + task[l].avg_fract;
			time_norm_sum -= task[l].time_norm;
			if (j < 0)
				continue;
			goto do_switch;
		}
		if (j >= 0 && ((int)(task[l].time_norm - (task[j].time_norm + SLICE(j))) >= 0 ||
			       ((int)(task[l].time_norm - task[j].time_norm) >= 0 &&
				(int)(task[l].time_norm - (s + SLICE(l))) >= 0))) {
			int prev_time_avg;
do_switch:
			prev_time_avg = time_avg;
			normalize_avg(l);
			if (prev_time_avg < time_avg)
				printf("-\n");
			l = j;
			s = task[l].time_norm;
		}
		task[l].time += MIN_S;
		task[l].time_norm += MINSLICE(l);
		task[l].avg_fract += MINSLICE(l);
		time_norm_sum += MINSLICE(l);
		avg_fract += MINSLICE(l);
		printf("%u", l);
		time_sum = time_sum2 = 0;
		for (i = 0; i < 10; i++) {
			if (!task[i].active) {
				if (task[i].weight)
					printf("\t%3u/%u: -\t", i, task[i].weight);
				continue;
			}
			time_sum += task[i].time_norm;
			time_sum2 += task[i].time_avg * WEIGTH0_INV * MAX_S + task[i].avg_fract;
			printf("\t%3u/%u:%5u/%-7g/%-7g", i, task[i].weight, task[i].time,
			       (double)task[i].time_norm / (1 << 16),
			       task[i].time_avg + (double)task[i].avg_fract / (WEIGTH0_INV * MAX_S));
		}
		if (time_sum != time_norm_sum)
			abort();
		if (time_sum2 != time_avg * weight_sum_inv + avg_fract)
			abort();
		if (time_sum != time_sum2)
			abort();
		printf("\t| %g/%g\n", (double)time_norm_sum / task_cnt / (1 << 16),
		       time_avg + (double)(int)avg_fract / weight_sum_inv);
	}
}
^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 3:41 ` CFS review Roman Zippel @ 2007-08-01 7:12 ` Ingo Molnar 2007-08-01 7:26 ` Mike Galbraith 2007-08-01 13:19 ` Roman Zippel 2007-08-01 11:22 ` Ingo Molnar ` (5 subsequent siblings) 6 siblings, 2 replies; 535+ messages in thread From: Ingo Molnar @ 2007-08-01 7:12 UTC (permalink / raw) To: Roman Zippel; +Cc: Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel Roman, Thanks for the testing and the feedback, it's much appreciated! :-) On what platform did you do your tests, and what .config did you use (and could you please send me your .config)? Please also send me the output of this script: http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh (if the output is too large send it to me privately, or bzip2 -9 it.) Could you also please send the source code for the "l.c" and "lt.c" apps you used for your testing so i can have a look. Thanks! Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 7:12 ` Ingo Molnar @ 2007-08-01 7:26 ` Mike Galbraith 2007-08-01 7:30 ` Ingo Molnar 2007-08-01 13:19 ` Roman Zippel 1 sibling, 1 reply; 535+ messages in thread From: Mike Galbraith @ 2007-08-01 7:26 UTC (permalink / raw) To: Ingo Molnar; +Cc: Roman Zippel, Linus Torvalds, Andrew Morton, linux-kernel On Wed, 2007-08-01 at 09:12 +0200, Ingo Molnar wrote: > Roman, > > Thanks for the testing and the feedback, it's much appreciated! :-) On > what platform did you do your tests, and what .config did you use (and > could you please send me your .config)? > > Please also send me the output of this script: > > http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh > > (if the output is too large send it to me privately, or bzip2 -9 it.) > > Could you also please send the source code for the "l.c" and "lt.c" apps > you used for your testing so i can have a look. Thanks! I haven't been able to reproduce this with any combination of features, and massive_intr tweaked to his work/sleep cycle. I notice he's collecting stats though, and they look funky. Recompiling. -Mike ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 7:26 ` Mike Galbraith @ 2007-08-01 7:30 ` Ingo Molnar 2007-08-01 7:36 ` Mike Galbraith 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-01 7:30 UTC (permalink / raw) To: Mike Galbraith; +Cc: Roman Zippel, Linus Torvalds, Andrew Morton, linux-kernel * Mike Galbraith <efault@gmx.de> wrote: > > Thanks for the testing and the feedback, it's much appreciated! :-) > > On what platform did you do your tests, and what .config did you use > > (and could you please send me your .config)? > > > > Please also send me the output of this script: > > > > http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh > > > > (if the output is too large send it to me privately, or bzip2 -9 > > it.) > > > > Could you also please send the source code for the "l.c" and "lt.c" > > apps you used for your testing so i can have a look. Thanks! > > I haven't been able to reproduce this with any combination of > features, and massive_intr tweaked to his work/sleep cycle. I notice > he's collecting stats though, and they look funky. Recompiling.

yeah, the posted numbers look most weird, but there's a complete lack of any identification of test environment - so we'll need some more word from Roman. Perhaps this was run on some really old box that does not have a high-accuracy sched_clock()? The patch below should simulate that scenario on 32-bit x86.

	Ingo

Index: linux/arch/i386/kernel/tsc.c
===================================================================
--- linux.orig/arch/i386/kernel/tsc.c
+++ linux/arch/i386/kernel/tsc.c
@@ -110,7 +110,7 @@ unsigned long long native_sched_clock(vo
 	 * very important for it to be as fast as the platform
 	 * can achive it. )
 	 */
-	if (unlikely(!tsc_enabled && !tsc_unstable))
+//	if (unlikely(!tsc_enabled && !tsc_unstable))
 		/* No locking but a rare wrong value is not a big deal: */
 		return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 7:30 ` Ingo Molnar @ 2007-08-01 7:36 ` Mike Galbraith 2007-08-01 8:49 ` Mike Galbraith 0 siblings, 1 reply; 535+ messages in thread From: Mike Galbraith @ 2007-08-01 7:36 UTC (permalink / raw) To: Ingo Molnar; +Cc: Roman Zippel, Linus Torvalds, Andrew Morton, linux-kernel On Wed, 2007-08-01 at 09:30 +0200, Ingo Molnar wrote: > * Mike Galbraith <efault@gmx.de> wrote: > > I haven't been able to reproduce this with any combination of > > features, and massive_intr tweaked to his work/sleep cycle. I notice > > he's collecting stats though, and they look funky. Recompiling. > > yeah, the posted numbers look most weird, but there's a complete lack of > any identification of test environment - so we'll need some more word > >from Roman. Perhaps this was run on some really old box that does not > have a high-accuracy sched_clock()? The patch below should simulate that > scenario on 32-bit x86. > > Ingo > > Index: linux/arch/i386/kernel/tsc.c > =================================================================== > --- linux.orig/arch/i386/kernel/tsc.c > +++ linux/arch/i386/kernel/tsc.c > @@ -110,7 +110,7 @@ unsigned long long native_sched_clock(vo > * very important for it to be as fast as the platform > * can achive it. ) > */ > - if (unlikely(!tsc_enabled && !tsc_unstable)) > +// if (unlikely(!tsc_enabled && !tsc_unstable)) > /* No locking but a rare wrong value is not a big deal: */ > return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ); > Ah, thanks. I noticed that clocksource= went away. I'll test with stats, with and without jiffies resolution. -Mike ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 7:36 ` Mike Galbraith @ 2007-08-01 8:49 ` Mike Galbraith 0 siblings, 0 replies; 535+ messages in thread From: Mike Galbraith @ 2007-08-01 8:49 UTC (permalink / raw) To: Ingo Molnar; +Cc: Roman Zippel, Linus Torvalds, Andrew Morton, linux-kernel On Wed, 2007-08-01 at 09:36 +0200, Mike Galbraith wrote: > On Wed, 2007-08-01 at 09:30 +0200, Ingo Molnar wrote: > > > yeah, the posted numbers look most weird, but there's a complete lack of > > any identification of test environment - so we'll need some more word > > >from Roman. Perhaps this was run on some really old box that does not > > have a high-accuracy sched_clock()? The patch below should simulate that > > scenario on 32-bit x86. > > Ah, thanks. I noticed that clocksource= went away. I'll test with > stats, with and without jiffies resolution.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
 6465 root      20   0  1432  356  296 R   30  0.0   1:02.55 1 chew
 6462 root      20   0  1576  216  140 R   23  0.0   0:50.29 1 massive_intr_x
 6463 root      20   0  1576  216  140 R   23  0.0   0:50.23 1 massive_intr_x
 6464 root      20   0  1576  216  140 R   23  0.0   0:50.28 1 massive_intr_x

Well, jiffies resolution clock did upset fairness a bit with a right at jiffies resolution burn time, but not nearly as bad as on Roman's box, and not in favor of the sleepers.
With the longer burn time of stock massive_intr.c (8ms burn, 1ms sleep), lower resolution clock didn't upset it.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
 6511 root      20   0  1572  220  140 R   25  0.0   1:00.11 1 massive_intr
 6512 root      20   0  1572  220  140 R   25  0.0   1:00.14 1 massive_intr
 6514 root      20   0  1432  356  296 R   25  0.0   1:00.31 1 chew
 6513 root      20   0  1572  220  140 R   24  0.0   1:00.14 1 massive_intr

	-Mike

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 7:12 ` Ingo Molnar 2007-08-01 7:26 ` Mike Galbraith @ 2007-08-01 13:19 ` Roman Zippel 2007-08-01 15:07 ` Ingo Molnar 1 sibling, 1 reply; 535+ messages in thread From: Roman Zippel @ 2007-08-01 13:19 UTC (permalink / raw) To: Ingo Molnar; +Cc: Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel Hi, On Wed, 1 Aug 2007, Ingo Molnar wrote: > Please also send me the output of this script: > > http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh Sent privately. > Could you also please send the source code for the "l.c" and "lt.c" apps > you used for your testing so i can have a look. Thanks!

l.c is a simple busy loop (well, with the option to start many of them). This is lt.c; what it does is run a bit less than a jiffie, so it needs a low resolution clock to trigger the problem:

#include <stdio.h>
#include <stdlib.h>	/* for atoi(); missing in the original posting */
#include <signal.h>
#include <time.h>
#include <sys/time.h>

#define NSEC 1000000000
#define USEC 1000000
#define PERIOD (NSEC/1000)

int i;

void worker(int sig)
{
	struct timeval tv;
	long long t0, t;

	gettimeofday(&tv, 0);
	//printf("%u,%lu\n", i, tv.tv_usec);
	t0 = (long long)tv.tv_sec * 1000000 + tv.tv_usec + PERIOD / 1000 - 50;
	do {
		gettimeofday(&tv, 0);
		t = (long long)tv.tv_sec * 1000000 + tv.tv_usec;
	} while (t < t0);
}

int main(int ac, char **av)
{
	int cnt;
	timer_t timer;
	struct itimerspec its;
	struct sigaction sa;

	cnt = i = atoi(av[1]);
	sa.sa_handler = worker;
	sa.sa_flags = 0;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGALRM, &sa, 0);
	clock_gettime(CLOCK_MONOTONIC, &its.it_value);
	its.it_interval.tv_sec = 0;
	its.it_interval.tv_nsec = PERIOD * cnt;
	while (--i > 0 && fork() > 0)
		;
	its.it_value.tv_nsec += i * PERIOD;
	if (its.it_value.tv_nsec > NSEC) {
		its.it_value.tv_sec++;
		its.it_value.tv_nsec -= NSEC;
	}
	timer_create(CLOCK_MONOTONIC, 0, &timer);
	timer_settime(timer, TIMER_ABSTIME, &its, 0);
	printf("%u,%lu\n", i, its.it_interval.tv_nsec);
	while (1)
		pause();
	return 0;
}

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 13:19 ` Roman Zippel @ 2007-08-01 15:07 ` Ingo Molnar 2007-08-01 17:10 ` Andi Kleen 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-01 15:07 UTC (permalink / raw) To: Roman Zippel; +Cc: Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel * Roman Zippel <zippel@linux-m68k.org> wrote: > > Please also send me the output of this script: > > > > http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh > > Send privately. thanks. Just to make sure, while you said that your TSC was off on that laptop, the bootup log of yours suggests a working TSC: Time: tsc clocksource has been installed. and still your fl.c testcases produces the top output that you've reported in your first mail? If so then this could be a regression. Or did you turn off the tsc manually via notsc? (or was it with a different .config or on a different machine)? Please help us figure this out exactly, we dont want a real regression go unnoticed. If you can reproduce that problem with a working TSC then please generate a second cfs-debug-info.sh snapshot _while_ your fl+l workload is running and send that to me (i'll reply back to it publicly). Thanks, Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 15:07 ` Ingo Molnar @ 2007-08-01 17:10 ` Andi Kleen 2007-08-01 16:27 ` Linus Torvalds 0 siblings, 1 reply; 535+ messages in thread From: Andi Kleen @ 2007-08-01 17:10 UTC (permalink / raw) To: Ingo Molnar Cc: Roman Zippel, Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel Ingo Molnar <mingo@elte.hu> writes: > thanks. Just to make sure, while you said that your TSC was off on that > laptop, the bootup log of yours suggests a working TSC: > > Time: tsc clocksource has been installed. Standard kernels often disable the TSC later after running a bit with it (e.g. on any cpufreq change without p state invariant TSC) -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 17:10 ` Andi Kleen @ 2007-08-01 16:27 ` Linus Torvalds 2007-08-01 17:48 ` Andi Kleen 2007-08-01 17:50 ` Ingo Molnar 0 siblings, 2 replies; 535+ messages in thread From: Linus Torvalds @ 2007-08-01 16:27 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Roman Zippel, Mike Galbraith, Andrew Morton, linux-kernel On Wed, 1 Aug 2007, Andi Kleen wrote: > Ingo Molnar <mingo@elte.hu> writes: > > > thanks. Just to make sure, while you said that your TSC was off on that > > laptop, the bootup log of yours suggests a working TSC: > > > > Time: tsc clocksource has been installed. > > Standard kernels often disable the TSC later after running a bit > with it (e.g. on any cpufreq change without p state invariant TSC)

I assume that what Roman hit was that he had explicitly disabled the TSC because of TSC instability with the "notsc" kernel command line. Which disables it *entirely*.

That *used* to be the right thing to do, since the gettimeofday() logic originally didn't know about TSC instability, and it just resulted in somewhat flaky timekeeping. These days, of course, we should notice it on our own, and just switch away from the TSC as a reliable clock-source, but still allow it to be used for the cases where absolute accuracy is not a big issue.

So I suspect that Roman - by virtue of being an old-timer - ends up having a workaround for an old problem that isn't needed, and that in turn ends up meaning that his scheduler clock also ends up using the really not very good timer tick..

		Linus

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 16:27 ` Linus Torvalds @ 2007-08-01 17:48 ` Andi Kleen 2007-08-01 17:50 ` Ingo Molnar 1 sibling, 0 replies; 535+ messages in thread From: Andi Kleen @ 2007-08-01 17:48 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Ingo Molnar, Roman Zippel, Mike Galbraith, Andrew Morton, linux-kernel > I assume that what Roman hit was that he had explicitly disabled the TSC > because of TSC instability with the "notsc" kernel command line. Which > disabled is *entirely*. It might just have been cpufreq. That nearly hits everybody with cpufreq unless you have a pstate invariant TSC; and that's pretty much always the case on older laptops. It used to not be that drastic, but since i386 switched to the generic clock frame work it is like that :/ > These days, of course, we should notice it on our own, and just switch > away from the TSC as a reliable clock-source, but still allow it to be > used for the cases where absolute accuracy is not a big issue. The rewritten sched_clock() i still have queued does just that. I planned to submit it for .23, but then during later in deepth testing on my machine park I found a show stopper that I couldn't fix on time. Hopefully for .24 -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 16:27 ` Linus Torvalds 2007-08-01 17:48 ` Andi Kleen @ 2007-08-01 17:50 ` Ingo Molnar 2007-08-01 18:01 ` Roman Zippel 1 sibling, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-01 17:50 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Roman Zippel, Mike Galbraith, Andrew Morton, linux-kernel * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, 1 Aug 2007, Andi Kleen wrote: > > > Ingo Molnar <mingo@elte.hu> writes: > > > > > thanks. Just to make sure, while you said that your TSC was off on that > > > laptop, the bootup log of yours suggests a working TSC: > > > > > > Time: tsc clocksource has been installed. > > > > Standard kernels often disable the TSC later after running a bit > > with it (e.g. on any cpufreq change without p state invariant TSC) > > I assume that what Roman hit was that he had explicitly disabled the > TSC because of TSC instability with the "notsc" kernel command line. > Which disabled is *entirely*. but that does not appear to be the case, the debug info i got from Roman includes the following boot options: Kernel command line: auto BOOT_IMAGE=2.6.23-rc1-git9 ro root=306 there's no "notsc" option there. Andi's theory cannot be true either, Roman's debug info also shows this /proc/<PID>/sched data: clock-delta : 95 that means that sched_clock() is in high-res mode, the TSC is alive and kicking and a sched_clock() call took 95 nanoseconds. Roman, could you please help us with this mystery? Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 17:50 ` Ingo Molnar @ 2007-08-01 18:01 ` Roman Zippel 2007-08-01 19:05 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Roman Zippel @ 2007-08-01 18:01 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

Hi,

On Wed, 1 Aug 2007, Ingo Molnar wrote:

> Andi's theory cannot be true either, Roman's debug info also shows this
> /proc/<PID>/sched data:
>
>   clock-delta : 95
>
> that means that sched_clock() is in high-res mode, the TSC is alive and
> kicking and a sched_clock() call took 95 nanoseconds.
>
> Roman, could you please help us with this mystery?

Actually, Andi is right. What I sent you was generated directly after boot, as I had to reboot for the right kernel; a little later this appeared:

Aug  1 14:54:30 spit kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
Aug  1 15:09:56 spit kernel: Clocksource tsc unstable (delta = 656747233 ns)
Aug  1 15:09:56 spit kernel: Time: pit clocksource has been installed.

bye, Roman

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 18:01 ` Roman Zippel @ 2007-08-01 19:05 ` Ingo Molnar 2007-08-09 23:14 ` Roman Zippel 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-01 19:05 UTC (permalink / raw) To: Roman Zippel Cc: Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

* Roman Zippel <zippel@linux-m68k.org> wrote:

> Hi,
>
> On Wed, 1 Aug 2007, Ingo Molnar wrote:
>
> > Andi's theory cannot be true either, Roman's debug info also shows this
> > /proc/<PID>/sched data:
> >
> >   clock-delta : 95
> >
> > that means that sched_clock() is in high-res mode, the TSC is alive and
> > kicking and a sched_clock() call took 95 nanoseconds.
> >
> > Roman, could you please help us with this mystery?
>
> Actually, Andi is right. What I sent you was generated directly after
> boot, as I had to reboot for the right kernel, so a little later
> appeared this:
>
> Aug  1 14:54:30 spit kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
> Aug  1 15:09:56 spit kernel: Clocksource tsc unstable (delta = 656747233 ns)
> Aug  1 15:09:56 spit kernel: Time: pit clocksource has been installed.

just to make sure: how does the 'top' output of the l + "lt 3" testcase look now on your laptop? Yesterday it was this:

 4544 roman     20   0  1796  520  432 S 32.1  0.4   0:21.08 lt
 4545 roman     20   0  1796  344  256 R 32.1  0.3   0:21.07 lt
 4546 roman     20   0  1796  344  256 R 31.7  0.3   0:21.07 lt
 4547 roman     20   0  1532  272  216 R  3.3  0.2   0:01.94 l

and I'm still wondering how that output was possible.

	Ingo

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 19:05 ` Ingo Molnar @ 2007-08-09 23:14 ` Roman Zippel 2007-08-10 5:49 ` Ingo Molnar 2007-08-10 7:23 ` Mike Galbraith 0 siblings, 2 replies; 535+ messages in thread From: Roman Zippel @ 2007-08-09 23:14 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

Hi,

On Wed, 1 Aug 2007, Ingo Molnar wrote:

> just to make sure: how does the 'top' output of the l + "lt 3" testcase
> look now on your laptop? Yesterday it was this:
>
>  4544 roman     20   0  1796  520  432 S 32.1  0.4   0:21.08 lt
>  4545 roman     20   0  1796  344  256 R 32.1  0.3   0:21.07 lt
>  4546 roman     20   0  1796  344  256 R 31.7  0.3   0:21.07 lt
>  4547 roman     20   0  1532  272  216 R  3.3  0.2   0:01.94 l
>
> and I'm still wondering how that output was possible.

I disabled the jiffies logic and the result is still the same, so this problem isn't related to resolution at all.

I traced it a little, and what's happening is that the busy loop really gets only a little time: it only runs in between the timer tasks. When the timer task is woken up, __enqueue_sleeper() updates sleeper_bonus, and a little later, when the busy loop is preempted, __update_curr() is called one last time and is fully hit by the sleeper_bonus. So the timer tasks use less time than they actually get and thus produce overflows; the busy loop, OTOH, is punished and underflows.

So it seems my initial suspicion was right and this logic is dodgy; what is it actually supposed to do? Why is some random task accounted with the sleeper_bonus?

bye, Roman

PS: Can I still expect an answer about all the other stuff?

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-09 23:14 ` Roman Zippel @ 2007-08-10 5:49 ` Ingo Molnar 2007-08-10 13:52 ` Roman Zippel 1 sibling, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-10 5:49 UTC (permalink / raw) To: Roman Zippel Cc: Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

* Roman Zippel <zippel@linux-m68k.org> wrote:

> >  4544 roman     20   0  1796  520  432 S 32.1  0.4   0:21.08 lt
> >  4545 roman     20   0  1796  344  256 R 32.1  0.3   0:21.07 lt
> >  4546 roman     20   0  1796  344  256 R 31.7  0.3   0:21.07 lt
> >  4547 roman     20   0  1532  272  216 R  3.3  0.2   0:01.94 l
> >
> > and I'm still wondering how that output was possible.
>
> I disabled the jiffies logic and the result is still the same, so this
> problem isn't related to resolution at all.

how did you disable the jiffies logic?

Also, could you please send me the output of cfs-debug-info.sh:

  http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

captured _while_ the above workload is running. This is the third time I've asked for that :-)

To establish that the basic sched_clock() behavior is sound on that box, could you please also run this tool:

  http://people.redhat.com/mingo/cfs-scheduler/tools/tsc-dump.c

please run it both while the system is idle, and while there's a CPU hog running:

  while :; do :; done &

and send me that output too? (it's 2x 60 lines only)

	Thanks!

	Ingo

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-10 5:49 ` Ingo Molnar @ 2007-08-10 13:52 ` Roman Zippel 2007-08-10 14:18 ` Ingo Molnar ` (2 more replies) 0 siblings, 3 replies; 535+ messages in thread From: Roman Zippel @ 2007-08-10 13:52 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

Hi,

On Fri, 10 Aug 2007, Ingo Molnar wrote:

> > I disabled the jiffies logic and the result is still the same, so this
> > problem isn't related to resolution at all.
>
> how did you disable the jiffies logic?

I commented it out.

> Also, could you please send me the output of cfs-debug-info.sh:
>
>   http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
>
> captured _while_ the above workload is running. This is the third time
> I've asked for that :-)

Is there any reason to believe my analysis is wrong? So far you haven't answered a single question about the CFS design...

Anyway, I'll give you something better: the raw trace data for 2ms:

1186747669.274790012: update_curr 0xc7fb06f0,479587,319708,21288884188,159880,7360532
1186747669.274790375: dequeue_entity 0xc7fb06f0,21280402988,159880
1186747669.274792580: sched 2848,2846,0xc7432cb0,-7520413
1186747669.274820987: update_curr 0xc7432ce0,29302,-130577,21288913490,1,-7680293
1186747669.274821269: dequeue_entity 0xc7432ce0,21296077409,1
1186747669.274821930: enqueue_entity 0xc7432ce0,21296593783,1
1186747669.274826979: update_curr 0xc7432ce0,5707,5707,21288919197,1,-7680294
1186747669.274827724: enqueue_entity 0xc7432180,21280919197,639451
1186747669.274829948: update_curr 0xc7432ce0,1553,-318172,21288920750,319726,-8000000
1186747669.274831878: sched 2846,2847,0xc7432150,8000000
1186747669.275789883: update_curr 0xc7432180,479797,319935,21289400547,159864,7360339
1186747669.275790295: dequeue_entity 0xc7432180,21280919197,159864
1186747669.275792439: sched 2847,2846,0xc7432cb0,-7520203
1186747669.275820819: update_curr 0xc7432ce0,29238,-130625,21289429785,1,-7680067
1186747669.275821109: dequeue_entity 0xc7432ce0,21296593783,1
1186747669.275821763: enqueue_entity 0xc7432ce0,21297109852,1
1186747669.275826887: update_curr 0xc7432ce0,5772,5772,21289435557,1,-7680068
1186747669.275827652: enqueue_entity 0xc7fb0ca0,21281435557,639881
1186747669.275829826: update_curr 0xc7432ce0,1549,-318391,21289437106,319941,-8000000
1186747669.275831584: sched 2846,2849,0xc7fb0c70,8000000

About the values:

update_curr: sched_entity, delta_fair, delta_mine, fair_clock, sleeper_bonus, wait_runtime (final values at the end of __update_curr)
{en,de}queue_entity: sched_entity, fair_key, sleeper_bonus (at the start of __enqueue_entity/__dequeue_entity)
sched: prev_pid, pid, current, wait_runtime (at the end of scheduling; note that current has a small structure offset to sched_entity)

It starts with a timer task going to sleep; the busy loop runs for a few usec until the timer tick, and the next timer task is woken up (sleeper_bonus is increased). Before the tasks are switched, the current task is updated and punished with the sleeper_bonus.

These tests were done without the recent updates, but they don't seem to change the basic logic. AFAICT the change to __update_curr() only makes it more unpredictable which task is punished with the sleeper_bonus.

So again, what is this logic _supposed_ to do?

bye, Roman

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-10 13:52 ` Roman Zippel @ 2007-08-10 14:18 ` Ingo Molnar 2007-08-10 16:47 ` Mike Galbraith 2007-08-10 16:54 ` Michael Chang 2 siblings, 0 replies; 535+ messages in thread From: Ingo Molnar @ 2007-08-10 14:18 UTC (permalink / raw) To: Roman Zippel Cc: Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

* Roman Zippel <zippel@linux-m68k.org> wrote:

> > Also, could you please send me the output of cfs-debug-info.sh:
> >
> >   http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
> >
> > captured _while_ the above workload is running. This is the third time
> > I've asked for that :-)
>
> Is there any reason to believe my analysis is wrong?

please first give me the debug data captured with the script above (while the workload is running) - so that I can see the full picture of what's happening.

	Thanks,

	Ingo

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-10 13:52 ` Roman Zippel 2007-08-10 14:18 ` Ingo Molnar @ 2007-08-10 16:47 ` Mike Galbraith 2007-08-10 17:19 ` Roman Zippel 2007-08-10 16:54 ` Michael Chang 2 siblings, 1 reply; 535+ messages in thread From: Mike Galbraith @ 2007-08-10 16:47 UTC (permalink / raw) To: Roman Zippel Cc: Ingo Molnar, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel

I guess I'm going to have to give up on trying to reproduce this... my 3GHz P4 is just not getting there from here. Last attempt: compiled UP, HZ=1000 dynticks, full preempt and highres timers, fwiw.

 6392 root      20   0  1696  332  248 R 25.5  0.0   3:00.14 0 lt
 6393 root      20   0  1696  332  248 R 24.9  0.0   3:00.15 0 lt
 6391 root      20   0  1696  488  404 R 24.7  0.0   3:00.20 0 lt
 6394 root      20   0  2888 1232 1028 R 24.5  0.1   2:58.58 0 sh

	-Mike

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-10 16:47 ` Mike Galbraith @ 2007-08-10 17:19 ` Roman Zippel 0 siblings, 0 replies; 535+ messages in thread From: Roman Zippel @ 2007-08-10 17:19 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel

Hi,

On Fri, 10 Aug 2007, Mike Galbraith wrote:

> I guess I'm going to have to give up on trying to reproduce this... my
> 3GHz P4 is just not getting there from here. Last attempt: compiled UP,
> HZ=1000 dynticks, full preempt and highres timers, fwiw.
>
>  6392 root      20   0  1696  332  248 R 25.5  0.0   3:00.14 0 lt
>  6393 root      20   0  1696  332  248 R 24.9  0.0   3:00.15 0 lt
>  6391 root      20   0  1696  488  404 R 24.7  0.0   3:00.20 0 lt
>  6394 root      20   0  2888 1232 1028 R 24.5  0.1   2:58.58 0 sh

Except for UP and HZ=1000, everything else is pretty much turned off. If you use a very recent kernel, the problem may not be visible like this anymore. It may be a bit easier to reproduce if you change the end time t0 in lt.c a little. Also try to start the busy loop first.

bye, Roman

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-10 13:52 ` Roman Zippel 2007-08-10 14:18 ` Ingo Molnar 2007-08-10 16:47 ` Mike Galbraith @ 2007-08-10 16:54 ` Michael Chang 2007-08-10 17:25 ` Roman Zippel 2 siblings, 1 reply; 535+ messages in thread From: Michael Chang @ 2007-08-10 16:54 UTC (permalink / raw) To: Roman Zippel Cc: Ingo Molnar, Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

On 8/10/07, Roman Zippel <zippel@linux-m68k.org> wrote:

> Is there any reason to believe my analysis is wrong?

Not yet, but if you give Ingo what he wants (as opposed to what you're giving him) it'll be easier for him to work out what's going wrong, and perhaps "fix" the problem to boot.

(The script gives info about CPU characteristics, interrupts, modules, etc. -- you know, all those "unknown" variables.)

And perhaps send a patch showing what parts you commented out, too, so one can tell if anything got broken (unintentionally).

-- Michael Chang

Please avoid sending me Word or PowerPoint attachments. Send me ODT, RTF, or HTML instead. See http://www.gnu.org/philosophy/no-word-attachments.html Thank you.

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-10 16:54 ` Michael Chang @ 2007-08-10 17:25 ` Roman Zippel 2007-08-10 19:44 ` Ingo Molnar 2007-08-10 19:47 ` Willy Tarreau 0 siblings, 2 replies; 535+ messages in thread From: Roman Zippel @ 2007-08-10 17:25 UTC (permalink / raw) To: Michael Chang Cc: Ingo Molnar, Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

Hi,

On Fri, 10 Aug 2007, Michael Chang wrote:

> On 8/10/07, Roman Zippel <zippel@linux-m68k.org> wrote:
> > Is there any reason to believe my analysis is wrong?
>
> Not yet, but if you give Ingo what he wants (as opposed to what you're
> giving him) it'll be easier for him to work out what's going wrong, and
> perhaps "fix" the problem to boot.
>
> (The script gives info about CPU characteristics, interrupts,
> modules, etc. -- you know, all those "unknown" variables.)

He already has most of this information, and the trace shows _exactly_ what's going on. All this information should be more than enough to allow an initial judgement of whether my analysis is correct. Also, none of this information is needed to explain the CFS logic a little more, which I'm still waiting for...

bye, Roman

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-10 17:25 ` Roman Zippel @ 2007-08-10 19:44 ` Ingo Molnar 2007-08-10 19:47 ` Willy Tarreau 1 sibling, 0 replies; 535+ messages in thread From: Ingo Molnar @ 2007-08-10 19:44 UTC (permalink / raw) To: Roman Zippel Cc: Michael Chang, Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

* Roman Zippel <zippel@linux-m68k.org> wrote:

> > Not yet, but if you give Ingo what he wants (as opposed to what
> > you're giving him) it'll be easier for him to answer what's going
> > wrong, and perhaps "fix" the problem to boot.
> >
> > (The script gives info about CPU characteristics, interrupts,
> > modules, etc. -- you know, all those "unknown" variables.)
>
> He already has most of this information and the trace shows _exactly_
> what's going on. [...]

I'll need the other bits of information too to have a complete picture about what's going on while your test is running - to maximize the chances of my being able to fix it.

I'm a bit perplexed (and a bit worried) about this - you've spent _far_ more effort to _not send_ that script output (captured while the workload is running) than it would have taken to do it :-/ If you'd like me to fix bugs then please just send it (in private mail if you want) - or give me an ssh login to that box - whichever variant you prefer.

	Thanks,

	Ingo

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-10 17:25 ` Roman Zippel 2007-08-10 19:44 ` Ingo Molnar @ 2007-08-10 19:47 ` Willy Tarreau 2007-08-10 21:15 ` Roman Zippel 1 sibling, 1 reply; 535+ messages in thread From: Willy Tarreau @ 2007-08-10 19:47 UTC (permalink / raw) To: Roman Zippel Cc: Michael Chang, Ingo Molnar, Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

On Fri, Aug 10, 2007 at 07:25:57PM +0200, Roman Zippel wrote:

> Hi,
>
> On Fri, 10 Aug 2007, Michael Chang wrote:
>
> > On 8/10/07, Roman Zippel <zippel@linux-m68k.org> wrote:
> > > Is there any reason to believe my analysis is wrong?
> >
> > Not yet, but if you give Ingo what he wants (as opposed to what you're
> > giving him) it'll be easier for him to answer what's going wrong, and
> > perhaps "fix" the problem to boot.
> >
> > (The script gives info about CPU characteristics, interrupts,
> > modules, etc. -- you know, all those "unknown" variables.)
>
> He already has most of this information and the trace shows _exactly_
> what's going on. All this information should be more than enough to allow
> an initial judgement whether my analysis is correct.
> Also none of this information is needed to explain the CFS logic a little
> more, which I'm still waiting for...

Roman,

fortunately, not all bug reporters are like you. It's amazing how long you can resist sending a simple bug report to a developer! Maybe you consider that you need to fix the bug yourself after you understand the code, but if you systematically refuse to return the small pieces of information Ingo asks you for, we will have to wait for some more cooperative users to be hit by the same bug when 2.6.23 is released, which is stupid.

I thought you could at least understand that a developer who is used to reading traces from the same tool every day will be far faster at decoding one of those than at figuring out what your self-made dump means. 
It's for the exact same reason that I ask for pcap files when people send me outputs of tcpdump without the information I *need*. If you definitely do not want to cooperate, stop asking for a personal explanation, and go figure out by yourself how the code works.

BTW, in the trace you "kindly offered" in exchange for the cfs-debug-info dump, you show several useful variables, but nothing says where they are captured. And as you can see, they're changing. That's a fantastic trace for a developer, really...

Please try to be a little bit more transparent if you really want the bugs fixed, and don't behave as if you wanted this bug to survive till -final.

Thanks, Willy

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-10 19:47 ` Willy Tarreau @ 2007-08-10 21:15 ` Roman Zippel 2007-08-10 21:36 ` Ingo Molnar 2007-08-11 5:15 ` Willy Tarreau 0 siblings, 2 replies; 535+ messages in thread From: Roman Zippel @ 2007-08-10 21:15 UTC (permalink / raw) To: Willy Tarreau Cc: Michael Chang, Ingo Molnar, Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

Hi,

On Fri, 10 Aug 2007, Willy Tarreau wrote:

> fortunately, not all bug reporters are like you. It's amazing how long
> you can resist sending a simple bug report to a developer!

I'm more amazed at how long Ingo can resist providing some explanations (not just about this problem). It's not like I haven't given him anything: he already has the test programs, and he already knows the system configuration. Well, I've sent him the stuff now...

> Maybe you
> consider that you need to fix the bug yourself after you understand
> the code,

Fixing the bug requires some knowledge of what the code is intended to do.

> Please try to be a little bit more transparent if you really want the
> bugs fixed, and don't behave as if you wanted this bug to survive
> till -final.

Could you please ask Ingo the same? I'm simply trying to get some transparency into the CFS design. Without further information it's difficult to tell whether something is supposed to work this way or whether it's a bug. In this case it's quite possible that, due to a recent change, my testcase doesn't work anymore. Should I consider the problem fixed, or did it just go into hiding? Without more information it's difficult to verify this independently.

bye, Roman

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-10 21:15 ` Roman Zippel @ 2007-08-10 21:36 ` Ingo Molnar 2007-08-10 22:50 ` Roman Zippel 2007-08-11 0:30 ` Ingo Molnar 1 sibling, 2 replies; 535+ messages in thread From: Ingo Molnar @ 2007-08-10 21:36 UTC (permalink / raw) To: Roman Zippel Cc: Willy Tarreau, Michael Chang, Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

* Roman Zippel <zippel@linux-m68k.org> wrote:

> Well, I've sent him the stuff now...

received it - thanks a lot, looking at it!

> It's not like I haven't given him anything, he already has the test
> programs, he already knows the system configuration.

one more small thing: could you please send your exact .config (Mike asked for that too, as did I on two prior occasions). Sometimes unexpected little details in the .config make a difference; we are not asking you that because we are second-guessing you in any way. The reason is simple: I frequently boot _the very .config that others use_, and see surprising reproducibility of bugs that I couldn't trigger before. It's standard procedure to just pick up the .config of others to eliminate a whole bunch of degrees of freedom for a bug to hide behind - and your "it's a pretty standard config" description doesn't really achieve that. It probably won't make a real difference, but it's really easy for you to send and it's still very useful when one tries to eliminate possibilities and when one wants to concentrate on the remaining possibilities alone.

	Thanks again,

	Ingo

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-10 21:36 ` Ingo Molnar @ 2007-08-10 22:50 ` Roman Zippel 2007-08-11 5:28 ` Willy Tarreau 2007-08-11 0:30 ` Ingo Molnar 1 sibling, 1 reply; 535+ messages in thread From: Roman Zippel @ 2007-08-10 22:50 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Michael Chang, Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

Hi,

On Fri, 10 Aug 2007, Ingo Molnar wrote:

> achieve that. It probably won't make a real difference, but it's really
> easy for you to send and it's still very useful when one tries to
> eliminate possibilities and when one wants to concentrate on the
> remaining possibilities alone.

The thing I'm afraid of about CFS is its possible unpredictability, which would make it hard to reproduce problems, and we may end up with users with unexplainable weird problems. That's the main reason I'm trying so hard to push for a design discussion.

Just to give an idea, here are two more examples of irregular behaviour, which are hopefully easier to reproduce.

1. Two simple busy loops, one of them reniced to 15. According to my calculations the reniced task should get about 3.4% (1/(1.25^15+1)), but I get this:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4433 roman     20   0  1532  300  244 R 99.2  0.2   5:05.51 l
 4434 roman     35  15  1532   72   16 R  0.7  0.1   0:10.62 l

OTOH, up to nice level 12 I get what I expect.

2. 
If I start 20 busy loops, initially I see in top that every task gets 5% and time increments equally (as it should):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4492 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4491 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4490 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4489 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4488 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4487 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4486 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4485 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4484 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4483 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4482 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4481 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4480 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4479 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4478 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4477 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4476 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4475 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4474 roman     20   0  1532   68   16 R  5.0  0.1   0:02.86 l
 4473 roman     20   0  1532  296  244 R  5.0  0.2   0:02.86 l

But if I renice all of them to -15, the time every task gets is rather random:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4492 roman      5 -15  1532   68   16 R  1.0  0.1   0:07.95 l
 4491 roman      5 -15  1532   68   16 R  4.3  0.1   0:07.62 l
 4490 roman      5 -15  1532   68   16 R  3.3  0.1   0:07.50 l
 4489 roman      5 -15  1532   68   16 R  7.6  0.1   0:07.80 l
 4488 roman      5 -15  1532   68   16 R  9.6  0.1   0:08.31 l
 4487 roman      5 -15  1532   68   16 R  3.3  0.1   0:07.59 l
 4486 roman      5 -15  1532   68   16 R  6.6  0.1   0:07.08 l
 4485 roman      5 -15  1532   68   16 R 10.0  0.1   0:07.31 l
 4484 roman      5 -15  1532   68   16 R  8.0  0.1   0:07.30 l
 4483 roman      5 -15  1532   68   16 R  7.0  0.1   0:07.34 l
 4482 roman      5 -15  1532   68   16 R  1.0  0.1   0:05.84 l
 4481 roman      5 -15  1532   68   16 R  1.0  0.1   0:07.16 l
 4480 roman      5 -15  1532   68   16 R  3.3  0.1   0:07.00 l
 4479 roman      5 -15  1532   68   16 R  1.0  0.1   0:06.66 l
 4478 roman      5 -15  1532   68   16 R  8.6  0.1   0:06.96 l
 4477 roman      5 -15  1532   68   16 R  8.6  0.1   0:07.63 l
 4476 roman      5 -15  1532   68   16 R  9.6  0.1   0:07.38 l
 4475 roman      5 -15  1532   68   16 R  1.3  0.1   0:07.09 l
 4474 roman      5 -15  1532   68   16 R  2.3  0.1   0:07.97 l
 4473 roman      5 -15  1532  296  244 R  1.0  0.2   0:07.73 l

bye, Roman

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-10 22:50 ` Roman Zippel @ 2007-08-11 5:28 ` Willy Tarreau 2007-08-12 5:17 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Willy Tarreau @ 2007-08-11 5:28 UTC (permalink / raw) To: Roman Zippel Cc: Ingo Molnar, Michael Chang, Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

On Sat, Aug 11, 2007 at 12:50:08AM +0200, Roman Zippel wrote:

> Hi,
>
> On Fri, 10 Aug 2007, Ingo Molnar wrote:
>
> > achieve that. It probably won't make a real difference, but it's really
> > easy for you to send and it's still very useful when one tries to
> > eliminate possibilities and when one wants to concentrate on the
> > remaining possibilities alone.
>
> The thing I'm afraid of about CFS is its possible unpredictability, which
> would make it hard to reproduce problems and we may end up with users with
> unexplainable weird problems. That's the main reason I'm trying so hard to
> push for a design discussion.

You may be interested in looking at the very early CFS versions. The design was much more naive and understandable. After that, a lot of tricks were added to take into account a lot of uses and corner cases, which may not help in understanding it globally.

> Just to give an idea here are two more examples of irregular behaviour,
> which are hopefully easier to reproduce.
>
> 1. Two simple busy loops, one of them reniced to 15. According to my
> calculations the reniced task should get about 3.4% (1/(1.25^15+1)), but I
> get this:
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  4433 roman     20   0  1532  300  244 R 99.2  0.2   5:05.51 l
>  4434 roman     35  15  1532   72   16 R  0.7  0.1   0:10.62 l

Could this be caused by typos in some tables, like the one you found in wmult?

> OTOH, up to nice level 12 I get what I expect.
>
> 2. If I start 20 busy loops, initially I see in top that every task gets
> 5% and time increments equally (as it should):

(...)

> But if I renice all of them to -15, the time every task gets is rather
> random:
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  4492 roman      5 -15  1532   68   16 R  1.0  0.1   0:07.95 l
>  4491 roman      5 -15  1532   68   16 R  4.3  0.1   0:07.62 l
>  4490 roman      5 -15  1532   68   16 R  3.3  0.1   0:07.50 l
>  4489 roman      5 -15  1532   68   16 R  7.6  0.1   0:07.80 l
>  4488 roman      5 -15  1532   68   16 R  9.6  0.1   0:08.31 l
>  4487 roman      5 -15  1532   68   16 R  3.3  0.1   0:07.59 l
>  4486 roman      5 -15  1532   68   16 R  6.6  0.1   0:07.08 l
>  4485 roman      5 -15  1532   68   16 R 10.0  0.1   0:07.31 l
>  4484 roman      5 -15  1532   68   16 R  8.0  0.1   0:07.30 l
>  4483 roman      5 -15  1532   68   16 R  7.0  0.1   0:07.34 l
>  4482 roman      5 -15  1532   68   16 R  1.0  0.1   0:05.84 l
>  4481 roman      5 -15  1532   68   16 R  1.0  0.1   0:07.16 l
>  4480 roman      5 -15  1532   68   16 R  3.3  0.1   0:07.00 l
>  4479 roman      5 -15  1532   68   16 R  1.0  0.1   0:06.66 l
>  4478 roman      5 -15  1532   68   16 R  8.6  0.1   0:06.96 l
>  4477 roman      5 -15  1532   68   16 R  8.6  0.1   0:07.63 l
>  4476 roman      5 -15  1532   68   16 R  9.6  0.1   0:07.38 l
>  4475 roman      5 -15  1532   68   16 R  1.3  0.1   0:07.09 l
>  4474 roman      5 -15  1532   68   16 R  2.3  0.1   0:07.97 l
>  4473 roman      5 -15  1532  296  244 R  1.0  0.2   0:07.73 l

Do you see this only at -15, or starting with -15 and below?

Willy

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-11 5:28 ` Willy Tarreau @ 2007-08-12 5:17 ` Ingo Molnar 0 siblings, 0 replies; 535+ messages in thread From: Ingo Molnar @ 2007-08-12 5:17 UTC (permalink / raw) To: Willy Tarreau Cc: Roman Zippel, Michael Chang, Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

* Willy Tarreau <w@1wt.eu> wrote:

> > 1. Two simple busy loops, one of them reniced to 15. According to
> > my calculations the reniced task should get about 3.4%
> > (1/(1.25^15+1)), but I get this:
> >
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >  4433 roman     20   0  1532  300  244 R 99.2  0.2   5:05.51 l
> >  4434 roman     35  15  1532   72   16 R  0.7  0.1   0:10.62 l
>
> Could this be caused by typos in some tables, like the one you found
> in wmult?

note that the typo was not in the weight table but in the inverse weight table, which didn't really affect CPU utilization (that's why we didn't notice the typo sooner). Regarding the above problem with nice +15 being beefier than intended, I'd suggest re-testing with a doubled /proc/sys/kernel/sched_runtime_limit value, or with:

  echo 30 > /proc/sys/kernel/sched_features

(which turns off sleeper fairness)

> >  4477 roman      5 -15  1532   68   16 R  8.6  0.1   0:07.63 l
> >  4476 roman      5 -15  1532   68   16 R  9.6  0.1   0:07.38 l
> >  4475 roman      5 -15  1532   68   16 R  1.3  0.1   0:07.09 l
> >  4474 roman      5 -15  1532   68   16 R  2.3  0.1   0:07.97 l
> >  4473 roman      5 -15  1532  296  244 R  1.0  0.2   0:07.73 l
>
> Do you see this only at -15, or starting with -15 and below?

I think this was scheduling jitter caused by the larger granularity of negatively reniced tasks. This got improved recently; with latest -git I get:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3108 root       5 -15  1576  248  196 R  5.0  0.0   0:07.26 loop_silent
 3109 root       5 -15  1576  248  196 R  5.0  0.0   0:07.26 loop_silent
 3110 root       5 -15  1576  248  196 R  5.0  0.0   0:07.26 loop_silent
 3111 root       5 -15  1576  244  196 R  5.0  0.0   0:07.26 loop_silent
 3112 root       5 -15  1576  248  196 R  5.0  0.0   0:07.26 loop_silent
 3113 root       5 -15  1576  248  196 R  5.0  0.0   0:07.26 loop_silent

that's picture-perfect CPU time distribution. But, and that's fair to say, I never ran such an artificial workload of 20x nice -15 infinite loops (!) before, and boy does interactivity suck (as expected) ;)

	Ingo

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-10 21:36 ` Ingo Molnar 2007-08-10 22:50 ` Roman Zippel @ 2007-08-11 0:30 ` Ingo Molnar 2007-08-20 22:19 ` Roman Zippel 1 sibling, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-11 0:30 UTC (permalink / raw) To: Roman Zippel Cc: Willy Tarreau, Michael Chang, Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel * Ingo Molnar <mingo@elte.hu> wrote: > * Roman Zippel <zippel@linux-m68k.org> wrote: > > > Well, I've sent him the stuff now... > > received it - thanks alot, looking at it! everything looks good in your debug output and the TSC dump data, except for the wait_runtime values, they are quite out of balance - and that balance cannot be explained with jiffies granularity or with any sort of sched_clock() artifact. So this clearly looks like a CFS regression that should be fixed. the only relevant thing that comes to mind at the moment is that last week Peter noticed a buggy aspect of sleeper bonuses (in that we do not rate-limit their output, hence we 'waste' them instead of redistributing them), and i've got the small patch below in my queue to fix that - could you give it a try? this is just a blind stab into the dark - i couldnt see any real impact from that patch in various workloads (and it's not upstream yet), so it might not make a big difference. The trace you did (could you send the source for that?) seems to implicate sleeper bonuses though. if this patch doesnt help, could you check the general theory whether it's related to sleeper-fairness, via turning it off: echo 30 > /proc/sys/kernel/sched_features does the bug go away if you do that? If sleeper bonuses are showing too many artifacts then we could turn it off for final .23. 
	Ingo

--------------------->
Subject: sched: fix sleeper bonus
From: Ingo Molnar <mingo@elte.hu>

Peter Zijlstra noticed that the sleeper bonus deduction code was not properly rate-limited: a task that scheduled more frequently would get a disproportionately large deduction. So limit the deduction to delta_exec and limit production to runtime_limit.

Not-Yet-Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -75,7 +75,7 @@ enum {

 unsigned int sysctl_sched_features __read_mostly =
 		SCHED_FEAT_FAIR_SLEEPERS	*1 |
-		SCHED_FEAT_SLEEPER_AVG		*1 |
+		SCHED_FEAT_SLEEPER_AVG		*0 |
 		SCHED_FEAT_SLEEPER_LOAD_AVG	*1 |
 		SCHED_FEAT_PRECISE_CPU_LOAD	*1 |
 		SCHED_FEAT_START_DEBIT		*1 |
@@ -304,11 +304,9 @@ __update_curr(struct cfs_rq *cfs_rq, str
 	delta_mine = calc_delta_mine(delta_exec, curr->load.weight, lw);

 	if (cfs_rq->sleeper_bonus > sysctl_sched_granularity) {
-		delta = calc_delta_mine(cfs_rq->sleeper_bonus,
-					curr->load.weight, lw);
-		if (unlikely(delta > cfs_rq->sleeper_bonus))
-			delta = cfs_rq->sleeper_bonus;
-
+		delta = min(cfs_rq->sleeper_bonus, (u64)delta_exec);
+		delta = calc_delta_mine(delta, curr->load.weight, lw);
+		delta = min((u64)delta, cfs_rq->sleeper_bonus);
 		cfs_rq->sleeper_bonus -= delta;
 		delta_mine -= delta;
 	}
@@ -521,6 +519,8 @@ static void __enqueue_sleeper(struct cfs
 	 * Track the amount of bonus we've given to sleepers:
 	 */
 	cfs_rq->sleeper_bonus += delta_fair;
+	if (unlikely(cfs_rq->sleeper_bonus > sysctl_sched_runtime_limit))
+		cfs_rq->sleeper_bonus = sysctl_sched_runtime_limit;

 	schedstat_add(cfs_rq, wait_runtime, se->wait_runtime);
 }

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-11 0:30 ` Ingo Molnar @ 2007-08-20 22:19 ` Roman Zippel 2007-08-21 7:33 ` Mike Galbraith 0 siblings, 1 reply; 535+ messages in thread
From: Roman Zippel @ 2007-08-20 22:19 UTC (permalink / raw)
To: Ingo Molnar
Cc: Willy Tarreau, Michael Chang, Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

Hi,

On Sat, 11 Aug 2007, Ingo Molnar wrote:

> the only relevant thing that comes to mind at the moment is that last
> week Peter noticed a buggy aspect of sleeper bonuses (in that we do not
> rate-limit their output, hence we 'waste' them instead of redistributing
> them), and i've got the small patch below in my queue to fix that -
> could you give it a try?

It doesn't make much of a difference. OTOH if I disable the sleeper code completely in __update_curr(), I get this:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3139 roman     20   0  1796  344  256 R 21.7  0.3   0:02.68 lt
 3138 roman     20   0  1796  344  256 R 21.7  0.3   0:02.68 lt
 3137 roman     20   0  1796  520  432 R 21.7  0.4   0:02.68 lt
 3136 roman     20   0  1532  268  216 R 34.5  0.2   0:06.82 l

Disabling this code completely via sched_features makes only a minor difference:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3139 roman     20   0  1796  344  256 R 20.4  0.3   0:09.94 lt
 3138 roman     20   0  1796  344  256 R 20.4  0.3   0:09.94 lt
 3137 roman     20   0  1796  520  432 R 20.4  0.4   0:09.94 lt
 3136 roman     20   0  1532  268  216 R 39.1  0.2   0:19.20 l

> this is just a blind stab into the dark - i couldnt see any real impact
> from that patch in various workloads (and it's not upstream yet), so it
> might not make a big difference.

Can we please skip to the point where you explain the intention in a little more detail? If I had to guess, this is supposed to keep the runtime in balance; in that case it would be better to use wait_runtime to adjust fair_clock, from where it would be evenly distributed to all tasks (but this would have to be done during enqueue and dequeue).
OTOH this would then also have consequences for the wait queue, as fair_clock is used to calculate fair_key. IMHO the current wait_runtime should have some influence in calculating the sleeper bonus, so that wait_runtime doesn't constantly overflow for tasks which only run occasionally.

bye, Roman

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-20 22:19 ` Roman Zippel @ 2007-08-21 7:33 ` Mike Galbraith 2007-08-21 8:35 ` Ingo Molnar 2007-08-21 11:54 ` Roman Zippel 0 siblings, 2 replies; 535+ messages in thread
From: Mike Galbraith @ 2007-08-21 7:33 UTC (permalink / raw)
To: Roman Zippel
Cc: Ingo Molnar, Willy Tarreau, Michael Chang, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel

On Tue, 2007-08-21 at 00:19 +0200, Roman Zippel wrote:
> Hi,
>
> On Sat, 11 Aug 2007, Ingo Molnar wrote:
>
> > the only relevant thing that comes to mind at the moment is that last
> > week Peter noticed a buggy aspect of sleeper bonuses (in that we do not
> > rate-limit their output, hence we 'waste' them instead of redistributing
> > them), and i've got the small patch below in my queue to fix that -
> > could you give it a try?
>
> It doesn't make much of a difference.

I thought this was history. With your config, I was finally able to reproduce the anomaly (only with your proggy though), and Ingo's patch does indeed fix it here.

Freshly reproduced anomaly and patch verification, running 2.6.23-rc3 with your config, both with and without Ingo's patch reverted:

 6561 root      20   0  1696  492  404 S 32.0  0.0   0:30.83 0 lt
 6562 root      20   0  1696  336  248 R 32.0  0.0   0:30.79 0 lt
 6563 root      20   0  1696  336  248 R 32.0  0.0   0:30.80 0 lt
 6564 root      20   0  2888 1236 1028 R  4.6  0.1   0:05.26 0 sh

 6507 root      20   0  2888 1236 1028 R 25.8  0.1   0:30.75 0 sh
 6504 root      20   0  1696  492  404 R 24.4  0.0   0:29.26 0 lt
 6505 root      20   0  1696  336  248 R 24.4  0.0   0:29.26 0 lt
 6506 root      20   0  1696  336  248 R 24.4  0.0   0:29.25 0 lt

	-Mike

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-21 7:33 ` Mike Galbraith @ 2007-08-21 8:35 ` Ingo Molnar 2007-08-21 11:54 ` Roman Zippel 1 sibling, 0 replies; 535+ messages in thread
From: Ingo Molnar @ 2007-08-21 8:35 UTC (permalink / raw)
To: Mike Galbraith
Cc: Roman Zippel, Willy Tarreau, Michael Chang, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel

* Mike Galbraith <efault@gmx.de> wrote:

> > It doesn't make much of a difference.
>
> I thought this was history. With your config, I was finally able to
> reproduce the anomaly (only with your proggy though), and Ingo's patch
> does indeed fix it here.
>
> Freshly reproduced anomaly and patch verification, running 2.6.23-rc3
> with your config, both with and without Ingo's patch reverted:
>
>  6561 root      20   0  1696  492  404 S 32.0  0.0   0:30.83 0 lt
>  6562 root      20   0  1696  336  248 R 32.0  0.0   0:30.79 0 lt
>  6563 root      20   0  1696  336  248 R 32.0  0.0   0:30.80 0 lt
>  6564 root      20   0  2888 1236 1028 R  4.6  0.1   0:05.26 0 sh
>
>  6507 root      20   0  2888 1236 1028 R 25.8  0.1   0:30.75 0 sh
>  6504 root      20   0  1696  492  404 R 24.4  0.0   0:29.26 0 lt
>  6505 root      20   0  1696  336  248 R 24.4  0.0   0:29.26 0 lt
>  6506 root      20   0  1696  336  248 R 24.4  0.0   0:29.25 0 lt

Oh, great! I'm glad we didn't discard this as a pure sched_clock resolution artifact.

Roman, a quick & easy request: please send the usual cfs-debug-info.sh output captured while your testcase is running. (Preferably try .23-rc3 or later as Mike did, which has the most recent scheduler code; it includes the patch I sent to you already.)

I'll reply to your sleeper-fairness questions separately, but in any case we need to figure out what's happening on your box - if you can still reproduce it with .23-rc3.

	Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-21 7:33 ` Mike Galbraith 2007-08-21 8:35 ` Ingo Molnar @ 2007-08-21 11:54 ` Roman Zippel 1 sibling, 0 replies; 535+ messages in thread
From: Roman Zippel @ 2007-08-21 11:54 UTC (permalink / raw)
To: Mike Galbraith
Cc: Ingo Molnar, Willy Tarreau, Michael Chang, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel

Hi,

On Tue, 21 Aug 2007, Mike Galbraith wrote:

> I thought this was history. With your config, I was finally able to
> reproduce the anomaly (only with your proggy though), and Ingo's patch
> does indeed fix it here.
>
> Freshly reproduced anomaly and patch verification, running 2.6.23-rc3
> with your config, both with and without Ingo's patch reverted:

I did update to 2.6.23-rc3-git1 first, but I ended up reverting the patch, as I didn't notice it had been applied already. Sorry about that. With this patch the underflows are gone, but there are still the overflows, so the questions from the last mail still remain.

bye, Roman

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-10 21:15 ` Roman Zippel 2007-08-10 21:36 ` Ingo Molnar @ 2007-08-11 5:15 ` Willy Tarreau 1 sibling, 0 replies; 535+ messages in thread
From: Willy Tarreau @ 2007-08-11 5:15 UTC (permalink / raw)
To: Roman Zippel
Cc: Michael Chang, Ingo Molnar, Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

On Fri, Aug 10, 2007 at 11:15:55PM +0200, Roman Zippel wrote:
> Hi,
>
> On Fri, 10 Aug 2007, Willy Tarreau wrote:
>
> > fortunately all bug reporters are not like you. It's amazing how long
> > you can resist sending a simple bug report to a developer!
>
> I'm more amazed how long Ingo can resist providing some explanations (not
> just about this problem).

It's a matter of time balance. It takes a short time to send the output of a script, and a very long time to explain how things work. I often encounter the same situation with haproxy. People ask me to explain in detail how this or that would apply to their context, and it's often easier for me to provide them with a 5-line patch adding the feature they need than to spend half an hour explaining why and how it would behave badly.

> It's not like I haven't given him anything, he already has the test
> programs, he already knows the system configuration.
> Well, I've sent him the stuff now...

Fine, thanks.

> > Maybe you
> > consider that you need to fix the bug by yourself after you understand
> > the code,
>
> Fixing the bug requires some knowledge of what the code is intended to do.
>
> > Please try to be a little bit more transparent if you really want the
> > bugs fixed, and don't behave as if you wanted this bug to survive
> > till -final.
>
> Could you please ask Ingo the same? I'm simply trying to get some
> transparency into the CFS design. Without further information it's
> difficult to tell whether something is supposed to work this way or is
> a bug.

I know that Ingo tends to reply to a question with another question.
But as I said, imagine if he had to explain the same things to each person who asks for them. I think a more constructive approach would be to point out what is missing/unclear/inexact in the doc, so that he can add some paragraphs for you and everyone else. If you need this information to debug, most likely other people will need it too.

> In this case it's quite possible that due to a recent change my testcase
> doesn't work anymore. Should I consider the problem fixed or did it just
> go into hiding? Without more information it's difficult to verify this
> independently.

Generally, problems that appear only on one person's side and then suddenly disappear are either caused by some random buggy patch left in the tree (not your case it seems), or by an obscure bug in the feature being tested, which will resurface from time to time as long as it's not identified.

Willy

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-09 23:14 ` Roman Zippel 2007-08-10 5:49 ` Ingo Molnar @ 2007-08-10 7:23 ` Mike Galbraith 1 sibling, 0 replies; 535+ messages in thread
From: Mike Galbraith @ 2007-08-10 7:23 UTC (permalink / raw)
To: Roman Zippel
Cc: Ingo Molnar, Linus Torvalds, Andi Kleen, Andrew Morton, linux-kernel

On Fri, 2007-08-10 at 01:14 +0200, Roman Zippel wrote:
> Hi,

Greetings,

> On Wed, 1 Aug 2007, Ingo Molnar wrote:
>
> > just to make sure, how does 'top' output of the l + "lt 3" testcase look
> > like now on your laptop? Yesterday it was this:
> >
> >  4544 roman     20   0  1796  520  432 S 32.1  0.4   0:21.08 lt
> >  4545 roman     20   0  1796  344  256 R 32.1  0.3   0:21.07 lt
> >  4546 roman     20   0  1796  344  256 R 31.7  0.3   0:21.07 lt
> >  4547 roman     20   0  1532  272  216 R  3.3  0.2   0:01.94 l
> >
> > and i'm still wondering how that output was possible.
>
> I disabled the jiffies logic and the result is still the same, so this
> problem isn't related to resolution at all.
> I traced it a little, and what's happening is that the busy loop really
> gets only a little time; it only runs in between the timer tasks. When a
> timer task is woken up, __enqueue_sleeper() updates sleeper_bonus, and a
> little later, when the busy loop is preempted, __update_curr() is called
> a last time and is fully hit by the sleeper_bonus. So the timer tasks use
> less time than they actually get and thus produce overflows; the busy
> loop OTOH is punished and underflows.

I still can't reproduce this here. Can you please send your .config, so I can try again with a config as close to yours as possible?

	-Mike

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 3:41 ` CFS review Roman Zippel 2007-08-01 7:12 ` Ingo Molnar @ 2007-08-01 11:22 ` Ingo Molnar 2007-08-01 12:21 ` Roman Zippel 2007-08-03 3:04 ` Matt Mackall 2007-08-01 11:37 ` Ingo Molnar ` (4 subsequent siblings) 6 siblings, 2 replies; 535+ messages in thread
From: Ingo Molnar @ 2007-08-01 11:22 UTC (permalink / raw)
To: Roman Zippel; +Cc: Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel

* Roman Zippel <zippel@linux-m68k.org> wrote:

> [...] e.g. in this example there are three tasks that run only for
> about 1ms every 3ms, but they get far more time than they should have
> gotten fairly:
>
>  4544 roman     20   0  1796  520  432 S 32.1  0.4   0:21.08 lt
>  4545 roman     20   0  1796  344  256 R 32.1  0.3   0:21.07 lt
>  4546 roman     20   0  1796  344  256 R 31.7  0.3   0:21.07 lt
>  4547 roman     20   0  1532  272  216 R  3.3  0.2   0:01.94 l

Mike and I have managed to reproduce similarly looking 'top' output, but it takes some effort: we had to deliberately run a non-TSC sched_clock(), CONFIG_HZ=100, !CONFIG_NO_HZ and !CONFIG_HIGH_RES_TIMERS.

In that case 'top' accounting symptoms similar to the above are not due to the scheduler starvation you suspected, but due to the effect of a low-resolution scheduler clock and a timer/scheduler tick tightly coupled to it. I tried the very same workload on 2.6.22 (with the same .config) and saw similarly anomalous 'top' output. (Not only can one create really anomalous CPU usage, one can completely hide tasks from 'top' output.)

If your test-box has a high-resolution sched_clock() [easily possible] then please send us the lt.c and l.c code so that we can have a look.

	Ingo

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 11:22 ` Ingo Molnar @ 2007-08-01 12:21 ` Roman Zippel 2007-08-01 12:23 ` Ingo Molnar 2007-08-01 13:59 ` Ingo Molnar 1 sibling, 2 replies; 535+ messages in thread
From: Roman Zippel @ 2007-08-01 12:21 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel

Hi,

On Wed, 1 Aug 2007, Ingo Molnar wrote:

> > [...] e.g. in this example there are three tasks that run only for
> > about 1ms every 3ms, but they get far more time than they should have
> > gotten fairly:
> >
> >  4544 roman     20   0  1796  520  432 S 32.1  0.4   0:21.08 lt
> >  4545 roman     20   0  1796  344  256 R 32.1  0.3   0:21.07 lt
> >  4546 roman     20   0  1796  344  256 R 31.7  0.3   0:21.07 lt
> >  4547 roman     20   0  1532  272  216 R  3.3  0.2   0:01.94 l
>
> Mike and me have managed to reproduce similarly looking 'top' output,
> but it takes some effort: we had to deliberately run a non-TSC
> sched_clock(), CONFIG_HZ=100, !CONFIG_NO_HZ and !CONFIG_HIGH_RES_TIMERS.

I used my old laptop for these tests, where the TSC is indeed disabled due to instability. Otherwise the kernel was configured with CONFIG_HZ=1000.

> in that case 'top' accounting symptoms similar to the above are not due
> to the scheduler starvation you suspected, but due to the effect of a
> low-resolution scheduler clock and a tightly coupled timer/scheduler
> tick to it.

Well, it magnifies the rounding problems in CFS. I mainly wanted to test the behaviour of CFS a little, and I thought I had seen a patch which enabled the use of the TSC in these cases, so I didn't check sched_clock(). Anyway, I want to point out that this wasn't the main focus of what I wrote.

bye, Roman

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 12:21 ` Roman Zippel @ 2007-08-01 12:23 ` Ingo Molnar 2007-08-01 13:59 ` Ingo Molnar 1 sibling, 0 replies; 535+ messages in thread From: Ingo Molnar @ 2007-08-01 12:23 UTC (permalink / raw) To: Roman Zippel; +Cc: Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel * Roman Zippel <zippel@linux-m68k.org> wrote: > On Wed, 1 Aug 2007, Ingo Molnar wrote: > > > > [...] e.g. in this example there are three tasks that run only for > > > about 1ms every 3ms, but they get far more time than should have > > > gotten fairly: > > > > > > 4544 roman 20 0 1796 520 432 S 32.1 0.4 0:21.08 lt > > > 4545 roman 20 0 1796 344 256 R 32.1 0.3 0:21.07 lt > > > 4546 roman 20 0 1796 344 256 R 31.7 0.3 0:21.07 lt > > > 4547 roman 20 0 1532 272 216 R 3.3 0.2 0:01.94 l > > > > Mike and me have managed to reproduce similarly looking 'top' output, > > but it takes some effort: we had to deliberately run a non-TSC > > sched_clock(), CONFIG_HZ=100, !CONFIG_NO_HZ and !CONFIG_HIGH_RES_TIMERS. > > I used my old laptop for these tests, where tsc is indeed disabled due > to instability. Otherwise the kernel was configured with > CONFIG_HZ=1000. please send all the debug info and source code we asked for - thanks! Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 12:21 ` Roman Zippel 2007-08-01 12:23 ` Ingo Molnar @ 2007-08-01 13:59 ` Ingo Molnar 2007-08-01 14:04 ` Arjan van de Ven 2007-08-01 15:44 ` Roman Zippel 1 sibling, 2 replies; 535+ messages in thread
From: Ingo Molnar @ 2007-08-01 13:59 UTC (permalink / raw)
To: Roman Zippel; +Cc: Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel

* Roman Zippel <zippel@linux-m68k.org> wrote:

> > in that case 'top' accounting symptoms similar to the above are not
> > due to the scheduler starvation you suspected, but due to the effect of
> > a low-resolution scheduler clock and a tightly coupled
> > timer/scheduler tick to it.
>
> Well, it magnifies the rounding problems in CFS.

Why do you say that? 2.6.22 behaves similarly with a low-res sched_clock(). This has nothing to do with 'rounding problems'!

I tried your fl.c and if sched_clock() is high-resolution it's scheduled _perfectly_ by CFS:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5906 mingo     20   0  1576  244  196 R 71.2  0.0   0:30.11 l
 5909 mingo     20   0  1844  344  260 S  9.6  0.0   0:04.02 lt
 5907 mingo     20   0  1844  508  424 S  9.5  0.0   0:04.01 lt
 5908 mingo     20   0  1844  344  260 S  9.5  0.0   0:04.02 lt

If sched_clock() is low-resolution then indeed the 'lt' tasks will "hide":

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2366 mingo     20   0  1576  248  196 R 99.9  0.0   0:07.95 loop_silent
    1 root      20   0  2132  636  548 S  0.0  0.0   0:04.64 init

But that's nothing new. CFS cannot conjure up time measurement methods that do not exist. If you have a low-res clock and you create an app that syncs precisely to the tick of that clock via timers that run off that exact tick, then there's nothing the scheduler can do about it. It is false to characterise this as 'sleeper starvation' or 'rounding error' like you did. No amount of rounding logic can create a high-resolution clock out of thin air.

	Ingo

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 13:59 ` Ingo Molnar @ 2007-08-01 14:04 ` Arjan van de Ven 2007-08-01 15:44 ` Roman Zippel 1 sibling, 0 replies; 535+ messages in thread From: Arjan van de Ven @ 2007-08-01 14:04 UTC (permalink / raw) To: Ingo Molnar Cc: Roman Zippel, Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel > but that's nothing new. CFS cannot conjure up time measurement methods > that do not exist. If you have a low-res clock and if you create an app > that syncs precisely to the tick of that clock via timers that run off > that exact tick then there's nothing the scheduler can do about it. It > is false to charachterise this as 'sleeper starvation' or 'rounding > error' like you did. No amount of rounding logic can create a > high-resolution clock out of thin air. CFS is only as fair as your clock is good. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 13:59 ` Ingo Molnar 2007-08-01 14:04 ` Arjan van de Ven @ 2007-08-01 15:44 ` Roman Zippel 2007-08-01 17:41 ` Ingo Molnar 1 sibling, 1 reply; 535+ messages in thread
From: Roman Zippel @ 2007-08-01 15:44 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel

Hi,

On Wed, 1 Aug 2007, Ingo Molnar wrote:

> > > in that case 'top' accounting symptoms similar to the above are not
> > > due to the scheduler starvation you suspected, but due to the effect of
> > > a low-resolution scheduler clock and a tightly coupled
> > > timer/scheduler tick to it.
> >
> > Well, it magnifies the rounding problems in CFS.
>
> why do you say that? 2.6.22 behaves similarly with a low-res
> sched_clock(). This has nothing to do with 'rounding problems'!
>
> i tried your fl.c and if sched_clock() is high-resolution it's scheduled
> _perfectly_ by CFS:
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5906 mingo     20   0  1576  244  196 R 71.2  0.0   0:30.11 l
>  5909 mingo     20   0  1844  344  260 S  9.6  0.0   0:04.02 lt
>  5907 mingo     20   0  1844  508  424 S  9.5  0.0   0:04.01 lt
>  5908 mingo     20   0  1844  344  260 S  9.5  0.0   0:04.02 lt
>
> if sched_clock() is low-resolution then indeed the 'lt' tasks will
> "hide":
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2366 mingo     20   0  1576  248  196 R 99.9  0.0   0:07.95 loop_silent
>     1 root      20   0  2132  636  548 S  0.0  0.0   0:04.64 init
>
> but that's nothing new. CFS cannot conjure up time measurement methods
> that do not exist. If you have a low-res clock and if you create an app
> that syncs precisely to the tick of that clock via timers that run off
> that exact tick then there's nothing the scheduler can do about it. It
> is false to characterise this as 'sleeper starvation' or 'rounding
> error' like you did. No amount of rounding logic can create a
> high-resolution clock out of thin air.

Please calm down. You apparently already get worked up about one of the secondary problems.
I didn't say 'sleeper starvation' or 'rounding error'; these are your words and your perception of what I said. sched_clock() can have a low resolution, which can be a problem for the scheduler - this is all this program demonstrates. If and how this problem should be solved is a completely different issue, about which I haven't said anything yet, and since it's not that important right now I'll leave it at that.

bye, Roman

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 15:44 ` Roman Zippel @ 2007-08-01 17:41 ` Ingo Molnar 2007-08-01 18:14 ` Roman Zippel 0 siblings, 1 reply; 535+ messages in thread
From: Ingo Molnar @ 2007-08-01 17:41 UTC (permalink / raw)
To: Roman Zippel; +Cc: Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel

* Roman Zippel <zippel@linux-m68k.org> wrote:

> > i tried your fl.c and if sched_clock() is high-resolution it's scheduled
> > _perfectly_ by CFS:
> >
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >  5906 mingo     20   0  1576  244  196 R 71.2  0.0   0:30.11 l
> >  5909 mingo     20   0  1844  344  260 S  9.6  0.0   0:04.02 lt
> >  5907 mingo     20   0  1844  508  424 S  9.5  0.0   0:04.01 lt
> >  5908 mingo     20   0  1844  344  260 S  9.5  0.0   0:04.02 lt
> >
> > if sched_clock() is low-resolution then indeed the 'lt' tasks will
> > "hide":
> >
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >  2366 mingo     20   0  1576  248  196 R 99.9  0.0   0:07.95 loop_silent
> >     1 root      20   0  2132  636  548 S  0.0  0.0   0:04.64 init
> >
> > but that's nothing new. CFS cannot conjure up time measurement
> > methods that do not exist. If you have a low-res clock and if you
> > create an app that syncs precisely to the tick of that clock via
> > timers that run off that exact tick then there's nothing the
> > scheduler can do about it. It is false to charachterise this as
> > 'sleeper starvation' or 'rounding error' like you did. No amount of
> > rounding logic can create a high-resolution clock out of thin air.
>
> [...] I didn't say 'sleeper starvation' or 'rounding error', these are
> your words and it's your perception of what I said.

Oh dear :-) It was indeed my perception that yesterday you said:

| A problem here is that this can be exploited, if a job is spread over
|                           ^^^^^^^^^^^^^^^^^^^^^
| a few threads, they can get more time relativ to other tasks, e.g. in
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| this example there are three tasks that run only for about 1ms every
| 3ms, but they get far more time than should have gotten fairly:
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|  4544 roman     20   0  1796  520  432 S 32.1  0.4   0:21.08 lt
|  4545 roman     20   0  1796  344  256 R 32.1  0.3   0:21.07 lt
|  4546 roman     20   0  1796  344  256 R 31.7  0.3   0:21.07 lt
|  4547 roman     20   0  1532  272  216 R  3.3  0.2   0:01.94 l

  [ http://lkml.org/lkml/2007/7/31/668 ]

(The underlined portion, in other words, is called 'starvation'.)

And again today, I clearly perceived you to say:

| > in that case 'top' accounting symptoms similar to the above are not
| > due to the scheduler starvation you suspected, but due the effect of
| > a low-resolution scheduler clock and a tightly coupled
| > timer/scheduler tick to it.
|
| Well, it magnifies the rounding problems in CFS.

  [ http://lkml.org/lkml/2007/8/1/153 ]

But you are right, that must be my perception alone; you couldn't possibly have said any of that =B-)

Or are you perhaps one of those who claims that saying something analogous to sleeper starvation does not equal talking about 'sleeper starvation', and that saying something about 'rounding problems in CFS' in no way means you were talking about rounding errors? :-)

	Ingo

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 17:41 ` Ingo Molnar @ 2007-08-01 18:14 ` Roman Zippel 0 siblings, 0 replies; 535+ messages in thread
From: Roman Zippel @ 2007-08-01 18:14 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel

Hi,

On Wed, 1 Aug 2007, Ingo Molnar wrote:

> > [...] I didn't say 'sleeper starvation' or 'rounding error', these are
> > your words and it's your perception of what I said.
>
> Oh dear :-) It was indeed my perception that yesterday you said:

*sigh* And here you go off again, nitpicking on a minor issue just to prove your point... When I wrote the earlier stuff I hadn't realized it was resolution-related, so things have to be put into proper context, and you make it a little easy for yourself by equating them. Yippee, you found another small error I made - can we drop this now? Please?

bye, Roman

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 11:22 ` Ingo Molnar 2007-08-01 12:21 ` Roman Zippel @ 2007-08-03 3:04 ` Matt Mackall 2007-08-03 3:57 ` Arjan van de Ven 1 sibling, 1 reply; 535+ messages in thread From: Matt Mackall @ 2007-08-03 3:04 UTC (permalink / raw) To: Ingo Molnar Cc: Roman Zippel, Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel On Wed, Aug 01, 2007 at 01:22:29PM +0200, Ingo Molnar wrote: > > * Roman Zippel <zippel@linux-m68k.org> wrote: > > > [...] e.g. in this example there are three tasks that run only for > > about 1ms every 3ms, but they get far more time than should have > > gotten fairly: > > > > 4544 roman 20 0 1796 520 432 S 32.1 0.4 0:21.08 lt > > 4545 roman 20 0 1796 344 256 R 32.1 0.3 0:21.07 lt > > 4546 roman 20 0 1796 344 256 R 31.7 0.3 0:21.07 lt > > 4547 roman 20 0 1532 272 216 R 3.3 0.2 0:01.94 l > > Mike and me have managed to reproduce similarly looking 'top' output, > but it takes some effort: we had to deliberately run a non-TSC > sched_clock(), CONFIG_HZ=100, !CONFIG_NO_HZ and !CONFIG_HIGH_RES_TIMERS. ..which is pretty much the state of play for lots of non-x86 hardware. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-03 3:04 ` Matt Mackall @ 2007-08-03 3:57 ` Arjan van de Ven 2007-08-03 4:18 ` Willy Tarreau 2007-08-03 4:38 ` Matt Mackall 0 siblings, 2 replies; 535+ messages in thread From: Arjan van de Ven @ 2007-08-03 3:57 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Roman Zippel, Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel On Thu, 2007-08-02 at 22:04 -0500, Matt Mackall wrote: > On Wed, Aug 01, 2007 at 01:22:29PM +0200, Ingo Molnar wrote: > > > > * Roman Zippel <zippel@linux-m68k.org> wrote: > > > > > [...] e.g. in this example there are three tasks that run only for > > > about 1ms every 3ms, but they get far more time than should have > > > gotten fairly: > > > > > > 4544 roman 20 0 1796 520 432 S 32.1 0.4 0:21.08 lt > > > 4545 roman 20 0 1796 344 256 R 32.1 0.3 0:21.07 lt > > > 4546 roman 20 0 1796 344 256 R 31.7 0.3 0:21.07 lt > > > 4547 roman 20 0 1532 272 216 R 3.3 0.2 0:01.94 l > > > > Mike and me have managed to reproduce similarly looking 'top' output, > > but it takes some effort: we had to deliberately run a non-TSC > > sched_clock(), CONFIG_HZ=100, !CONFIG_NO_HZ and !CONFIG_HIGH_RES_TIMERS. > > ..which is pretty much the state of play for lots of non-x86 hardware. question is if it's significantly worse than before. With a 100 or 1000Hz timer, you can't expect perfect fairness just due to the extremely rough measurement of time spent... ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-03 3:57 ` Arjan van de Ven @ 2007-08-03 4:18 ` Willy Tarreau 2007-08-03 4:31 ` Arjan van de Ven 2007-08-03 4:38 ` Matt Mackall 1 sibling, 1 reply; 535+ messages in thread From: Willy Tarreau @ 2007-08-03 4:18 UTC (permalink / raw) To: Arjan van de Ven Cc: Matt Mackall, Ingo Molnar, Roman Zippel, Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel On Thu, Aug 02, 2007 at 08:57:47PM -0700, Arjan van de Ven wrote: > On Thu, 2007-08-02 at 22:04 -0500, Matt Mackall wrote: > > On Wed, Aug 01, 2007 at 01:22:29PM +0200, Ingo Molnar wrote: > > > > > > * Roman Zippel <zippel@linux-m68k.org> wrote: > > > > > > > [...] e.g. in this example there are three tasks that run only for > > > > about 1ms every 3ms, but they get far more time than should have > > > > gotten fairly: > > > > > > > > 4544 roman 20 0 1796 520 432 S 32.1 0.4 0:21.08 lt > > > > 4545 roman 20 0 1796 344 256 R 32.1 0.3 0:21.07 lt > > > > 4546 roman 20 0 1796 344 256 R 31.7 0.3 0:21.07 lt > > > > 4547 roman 20 0 1532 272 216 R 3.3 0.2 0:01.94 l > > > > > > Mike and me have managed to reproduce similarly looking 'top' output, > > > but it takes some effort: we had to deliberately run a non-TSC > > > sched_clock(), CONFIG_HZ=100, !CONFIG_NO_HZ and !CONFIG_HIGH_RES_TIMERS. > > > > ..which is pretty much the state of play for lots of non-x86 hardware. > > question is if it's significantly worse than before. With a 100 or > 1000Hz timer, you can't expect perfect fairness just due to the > extremely rough measurement of time spent... Well, at least we're able to *measure* that task 'l' used 3.3% and that tasks 'lt' used 32%. If we're able to measure it, then that's already fine enough to be able to adjust future timeslices credits. Granted it may be rough for small periods (a few jiffies), but it should be fair for larger periods. Or at least it should *report* some fair distribution. Willy ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-03 4:18 ` Willy Tarreau @ 2007-08-03 4:31 ` Arjan van de Ven 2007-08-03 4:53 ` Willy Tarreau 0 siblings, 1 reply; 535+ messages in thread From: Arjan van de Ven @ 2007-08-03 4:31 UTC (permalink / raw) To: Willy Tarreau Cc: Matt Mackall, Ingo Molnar, Roman Zippel, Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel On Fri, 2007-08-03 at 06:18 +0200, Willy Tarreau wrote: > On Thu, Aug 02, 2007 at 08:57:47PM -0700, Arjan van de Ven wrote: > > On Thu, 2007-08-02 at 22:04 -0500, Matt Mackall wrote: > > > On Wed, Aug 01, 2007 at 01:22:29PM +0200, Ingo Molnar wrote: > > > > > > > > * Roman Zippel <zippel@linux-m68k.org> wrote: > > > > > > > > > [...] e.g. in this example there are three tasks that run only for > > > > > about 1ms every 3ms, but they get far more time than should have > > > > > gotten fairly: > > > > > > > > > > 4544 roman 20 0 1796 520 432 S 32.1 0.4 0:21.08 lt > > > > > 4545 roman 20 0 1796 344 256 R 32.1 0.3 0:21.07 lt > > > > > 4546 roman 20 0 1796 344 256 R 31.7 0.3 0:21.07 lt > > > > > 4547 roman 20 0 1532 272 216 R 3.3 0.2 0:01.94 l > > > > > > > > Mike and me have managed to reproduce similarly looking 'top' output, > > > > but it takes some effort: we had to deliberately run a non-TSC > > > > sched_clock(), CONFIG_HZ=100, !CONFIG_NO_HZ and !CONFIG_HIGH_RES_TIMERS. > > > > > > ..which is pretty much the state of play for lots of non-x86 hardware. > > > > question is if it's significantly worse than before. With a 100 or > > 1000Hz timer, you can't expect perfect fairness just due to the > > extremely rough measurement of time spent... > > Well, at least we're able to *measure* that task 'l' used 3.3% and > that tasks 'lt' used 32%. It's not measured if you use jiffies level stuff. It's at best sampled! > If we're able to measure it, then that's > already fine enough to be able to adjust future timeslices credits. > Granted it may be rough for small periods (a few jiffies), but it > should be fair for larger periods. 
Or at least it should *report* it, but the testcase here uses a LOT shorter time than jiffies... not "a few jiffies". -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-03 4:31 ` Arjan van de Ven @ 2007-08-03 4:53 ` Willy Tarreau 0 siblings, 0 replies; 535+ messages in thread From: Willy Tarreau @ 2007-08-03 4:53 UTC (permalink / raw) To: Arjan van de Ven Cc: Matt Mackall, Ingo Molnar, Roman Zippel, Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel On Thu, Aug 02, 2007 at 09:31:19PM -0700, Arjan van de Ven wrote: > On Fri, 2007-08-03 at 06:18 +0200, Willy Tarreau wrote: > > On Thu, Aug 02, 2007 at 08:57:47PM -0700, Arjan van de Ven wrote: > > > On Thu, 2007-08-02 at 22:04 -0500, Matt Mackall wrote: > > > > On Wed, Aug 01, 2007 at 01:22:29PM +0200, Ingo Molnar wrote: > > > > > > > > > > * Roman Zippel <zippel@linux-m68k.org> wrote: > > > > > > > > > > > [...] e.g. in this example there are three tasks that run only for > > > > > > about 1ms every 3ms, but they get far more time than should have > > > > > > gotten fairly: > > > > > > > > > > > > 4544 roman 20 0 1796 520 432 S 32.1 0.4 0:21.08 lt > > > > > > 4545 roman 20 0 1796 344 256 R 32.1 0.3 0:21.07 lt > > > > > > 4546 roman 20 0 1796 344 256 R 31.7 0.3 0:21.07 lt > > > > > > 4547 roman 20 0 1532 272 216 R 3.3 0.2 0:01.94 l > > > > > > > > > > Mike and me have managed to reproduce similarly looking 'top' output, > > > > > but it takes some effort: we had to deliberately run a non-TSC > > > > > sched_clock(), CONFIG_HZ=100, !CONFIG_NO_HZ and !CONFIG_HIGH_RES_TIMERS. > > > > > > > > ..which is pretty much the state of play for lots of non-x86 hardware. > > > > > > question is if it's significantly worse than before. With a 100 or > > > 1000Hz timer, you can't expect perfect fairness just due to the > > > extremely rough measurement of time spent... > > > > Well, at least we're able to *measure* that task 'l' used 3.3% and > > that tasks 'lt' used 32%. > > It's not measured if you use jiffies level stuff. It's at best sampled! But if we rely on the same sampling method, at least we will report something consistent with what happens. 
And sampling is often the correct method to get finer resolution on a macroscopic scale. I mean, we're telling users that we include the "completely fair scheduler" in 2.6.23, a scheduler which will ensure that all tasks get a fair share of CPU time. A user starts top and sees 33%+32%+32%+3% for 4 tasks while he would have expected to see 25%+25%+25%+25%. You can try to explain to users that it's the fairest distribution, but they will have a hard time believing it, especially when they measure the time spent on CPU with the "time" command. OK, this is all sampling, but we should try to avoid relying on different sources of data for computation and reporting. Time and Top should report something close to 4*25% for comparable tasks. And if not, because of some sampling problem, maybe the scheduler cannot be that fair in some situations, but either the scheduler should make use of the same sampling that time and top use, or top and time should rely on the scheduler's view. I'll try to quickly hack up a program which makes use of rdtsc from userspace to precisely measure user-space time, and disable TSC use from the kernel to see how the values diverge. Regards, Willy ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-03 3:57 ` Arjan van de Ven 2007-08-03 4:18 ` Willy Tarreau @ 2007-08-03 4:38 ` Matt Mackall 2007-08-03 8:44 ` Ingo Molnar 2007-08-03 9:29 ` Andi Kleen 1 sibling, 2 replies; 535+ messages in thread From: Matt Mackall @ 2007-08-03 4:38 UTC (permalink / raw) To: Arjan van de Ven Cc: Ingo Molnar, Roman Zippel, Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel On Thu, Aug 02, 2007 at 08:57:47PM -0700, Arjan van de Ven wrote: > On Thu, 2007-08-02 at 22:04 -0500, Matt Mackall wrote: > > On Wed, Aug 01, 2007 at 01:22:29PM +0200, Ingo Molnar wrote: > > > > > > * Roman Zippel <zippel@linux-m68k.org> wrote: > > > > > > > [...] e.g. in this example there are three tasks that run only for > > > > about 1ms every 3ms, but they get far more time than should have > > > > gotten fairly: > > > > > > > > 4544 roman 20 0 1796 520 432 S 32.1 0.4 0:21.08 lt > > > > 4545 roman 20 0 1796 344 256 R 32.1 0.3 0:21.07 lt > > > > 4546 roman 20 0 1796 344 256 R 31.7 0.3 0:21.07 lt > > > > 4547 roman 20 0 1532 272 216 R 3.3 0.2 0:01.94 l > > > > > > Mike and me have managed to reproduce similarly looking 'top' output, > > > but it takes some effort: we had to deliberately run a non-TSC > > > sched_clock(), CONFIG_HZ=100, !CONFIG_NO_HZ and !CONFIG_HIGH_RES_TIMERS. > > > > ..which is pretty much the state of play for lots of non-x86 hardware. > > question is if it's significantly worse than before. With a 100 or > 1000Hz timer, you can't expect perfect fairness just due to the > extremely rough measurement of time spent... Indeed. I'm just pointing out that not having TSC, fast HZ, no-HZ mode, or high-res timers should not be treated as an unusual circumstance. That's a PC-centric view. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-03 4:38 ` Matt Mackall @ 2007-08-03 8:44 ` Ingo Molnar 2007-08-03 9:29 ` Andi Kleen 1 sibling, 0 replies; 535+ messages in thread From: Ingo Molnar @ 2007-08-03 8:44 UTC (permalink / raw) To: Matt Mackall Cc: Arjan van de Ven, Roman Zippel, Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel * Matt Mackall <mpm@selenic.com> wrote: > > question is if it's significantly worse than before. With a 100 or > > 1000Hz timer, you can't expect perfect fairness just due to the > > extremely rough measurement of time spent... > > Indeed. I'm just pointing out that not having TSC, fast HZ, no-HZ > mode, or high-res timers should not be treated as an unusual > circumstance. That's a PC-centric view. actually, you dont need high-res or fast HZ or TSC to reduce those timer artifacts: all you need is _two_ (low-res, slow) hw clocks. Most platforms do have that (even the really really cheap ones), but arches do not set up the scheduler tick on one of them and the timer tick on the other, and skew the periodic-timer programming setup a bit (by nature of physics they are usually already skewed a bit) so that the scheduler tick and timer tick are not coupled. This whole thing is not a big deal on embedded anyway. (you dont get students logging in to the toaster or to the fridge to run timer exploits, do you? :-) Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-03 4:38 ` Matt Mackall 2007-08-03 8:44 ` Ingo Molnar @ 2007-08-03 9:29 ` Andi Kleen 1 sibling, 0 replies; 535+ messages in thread From: Andi Kleen @ 2007-08-03 9:29 UTC (permalink / raw) To: Matt Mackall Cc: Arjan van de Ven, Ingo Molnar, Roman Zippel, Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel Matt Mackall <mpm@selenic.com> writes: > > Indeed. I'm just pointing out that not having TSC, fast HZ, no-HZ > mode, or high-res timers should not be treated as an unusual > circumstance. That's a PC-centric view. The question is if it would be that hard to add TSC equivalent sched_clock() support to more systems. At least a lot of CPUs I have ever looked at had some kind of fast clock available. Perhaps it's more laziness of the developers or cut'n'paste that these are not as widely used as they should be? -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 3:41 ` CFS review Roman Zippel 2007-08-01 7:12 ` Ingo Molnar 2007-08-01 11:22 ` Ingo Molnar @ 2007-08-01 11:37 ` Ingo Molnar 2007-08-01 12:27 ` Roman Zippel 2007-08-01 13:20 ` Andi Kleen ` (3 subsequent siblings) 6 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-01 11:37 UTC (permalink / raw) To: Roman Zippel; +Cc: Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel * Roman Zippel <zippel@linux-m68k.org> wrote: > [...] the increase in code size: > > 2.6.22: > text data bss dec hex filename > 10150 24 3344 13518 34ce kernel/sched.o > > recent git: > text data bss dec hex filename > 14724 228 2020 16972 424c kernel/sched.o > > That's i386 without stats/debug. [...] that's without CONFIG_SMP, right? :-) On SMP they are about net break even: text data bss dec hex filename 26535 4173 24 30732 780c kernel/sched.o-2.6.22 28378 2574 16 30968 78f8 kernel/sched.o-2.6.23-git (plus a further ~1.5K per CPU data reduction which is not visible here) btw., here's the general change in size of a generic vmlinux from .22 to .23-git, using the same .config: text data bss dec hex filename 5256628 520760 1331200 7108588 6c77ec vmlinux.22 5306918 535844 1327104 7169866 6d674a vmlinux.23-git +50K. (this was on UP) In any case, there's still some debugging code in the scheduler (beyond SCHED_DEBUG), i'll work some more on reducing it. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 11:37 ` Ingo Molnar @ 2007-08-01 12:27 ` Roman Zippel 0 siblings, 0 replies; 535+ messages in thread From: Roman Zippel @ 2007-08-01 12:27 UTC (permalink / raw) To: Ingo Molnar; +Cc: Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel Hi, On Wed, 1 Aug 2007, Ingo Molnar wrote: > * Roman Zippel <zippel@linux-m68k.org> wrote: > > > [...] the increase in code size: > > > > 2.6.22: > > text data bss dec hex filename > > 10150 24 3344 13518 34ce kernel/sched.o > > > > recent git: > > text data bss dec hex filename > > 14724 228 2020 16972 424c kernel/sched.o > > > > That's i386 without stats/debug. [...] > > that's without CONFIG_SMP, right? :-) On SMP they are about net break > even: > > text data bss dec hex filename > 26535 4173 24 30732 780c kernel/sched.o-2.6.22 > 28378 2574 16 30968 78f8 kernel/sched.o-2.6.23-git That's still quite an increase in some rather important code paths and it's not just the code size, but also code complexity which is important - a major point I tried to address in my review. > (plus a further ~1.5K per CPU data reduction which is not visible here) That's why I mentioned the increased runtime memory usage... bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 3:41 ` CFS review Roman Zippel ` (2 preceding siblings ...) 2007-08-01 11:37 ` Ingo Molnar @ 2007-08-01 13:20 ` Andi Kleen 2007-08-01 13:33 ` Roman Zippel 2007-08-01 14:40 ` Ingo Molnar ` (2 subsequent siblings) 6 siblings, 1 reply; 535+ messages in thread From: Andi Kleen @ 2007-08-01 13:20 UTC (permalink / raw) To: Roman Zippel Cc: Mike Galbraith, Linus Torvalds, Ingo Molnar, Andrew Morton, linux-kernel Roman Zippel <zippel@linux-m68k.org> writes: > especially if one already knows that > scheduler clock has only limited resolution (because it's based on > jiffies), it becomes possible to use mostly 32bit values. jiffies based sched_clock should be soon very rare. It's probably not worth optimizing for it. -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 13:20 ` Andi Kleen @ 2007-08-01 13:33 ` Roman Zippel 2007-08-01 14:36 ` Ingo Molnar 2007-08-02 2:17 ` Linus Torvalds 0 siblings, 2 replies; 535+ messages in thread From: Roman Zippel @ 2007-08-01 13:33 UTC (permalink / raw) To: Andi Kleen Cc: Mike Galbraith, Linus Torvalds, Ingo Molnar, Andrew Morton, linux-kernel Hi, On Wed, 1 Aug 2007, Andi Kleen wrote: > > especially if one already knows that > > scheduler clock has only limited resolution (because it's based on > > jiffies), it becomes possible to use mostly 32bit values. > > jiffies based sched_clock should be soon very rare. It's probably > not worth optimizing for it. I'm not so sure about that. sched_clock() has to be fast, so many archs may want to continue to use jiffies. As soon as one does that one can also save a lot of computational overhead by using 32bit instead of 64bit. The question is then how easy that is possible. bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
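[Editorial note: for reference, a jiffies-based sched_clock() in the spirit of the kernel's generic fallback looks roughly like this. A standalone sketch with invented names, not the actual kernel code:]

```c
#include <stdint.h>

#define HZ 100                          /* assumed tick rate */

/* Stand-in for the kernel's jiffies counter. */
static uint64_t jiffies_counter;

/* Sketch of a jiffies-based sched_clock(): it can only ever return
 * multiples of the tick period, so with HZ=100 its resolution is a
 * coarse 10,000,000 ns - the situation Roman describes.  Anything that
 * runs entirely between two ticks is invisible to it. */
static uint64_t toy_sched_clock(void)
{
        return jiffies_counter * (1000000000ULL / HZ);
}
```

Since every value such a clock returns fits the tick granularity, the extra low bits of a 64-bit nanosecond timestamp carry no information here, which is Roman's argument for mostly 32-bit math on these archs.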
* Re: CFS review 2007-08-01 13:33 ` Roman Zippel @ 2007-08-01 14:36 ` Ingo Molnar 2007-08-01 16:11 ` Andi Kleen 2007-08-02 2:17 ` Linus Torvalds 1 sibling, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-01 14:36 UTC (permalink / raw) To: Roman Zippel Cc: Andi Kleen, Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel * Roman Zippel <zippel@linux-m68k.org> wrote: > > jiffies based sched_clock should be soon very rare. It's probably > > not worth optimizing for it. > > I'm not so sure about that. sched_clock() has to be fast, so many > archs may want to continue to use jiffies. [...] i think Andi was talking about the vast majority of the systems out there. For example, check out the arch demography of current Fedora installs (according to the Smolt opt-in UUID based user metrics): http://smolt.fedoraproject.org/ i686: 74743 x86_64: 18599 i386: 1208 ppc: 527 ppc64: 396 sparc64: 14 --------------- Total: 95488 even pure i386 (kernels, not systems) is only 1.2% of all installs. By the time the CFS kernel gets into a distro (a few months at minimum, typically a year) this percentage will go down further. And embedded doesnt really care about task-statistics corner cases [ (it likely doesnt have 'top' installed - likely doesnt even have /proc mounted or even built in ;-) ]. of course CFS should not do _worse_ stats than what we had before, and should not break or massively misbehave. Also, anything sane we can do for low-resolution arches we should do (and we already do quite a bit - the whole wmult stuff is to avoid expensive divisions) - and i regularly booted CFS with a low-resolution clock to make sure it works. So i'm not trying to duck anything, we've just got to keep our design priorities right :-) Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 14:36 ` Ingo Molnar @ 2007-08-01 16:11 ` Andi Kleen 0 siblings, 0 replies; 535+ messages in thread From: Andi Kleen @ 2007-08-01 16:11 UTC (permalink / raw) To: Ingo Molnar Cc: Roman Zippel, Andi Kleen, Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel On Wed, Aug 01, 2007 at 04:36:24PM +0200, Ingo Molnar wrote: > > * Roman Zippel <zippel@linux-m68k.org> wrote: > > > > jiffies based sched_clock should be soon very rare. It's probably > > > not worth optimizing for it. > > > > I'm not so sure about that. sched_clock() has to be fast, so many > > archs may want to continue to use jiffies. [...] > > i think Andi was talking about the vast majority of the systems out > there. For example, check out the arch demography of current Fedora > installs (according to the Smolt opt-in UUID based user metrics): I meant that in many cases where the TSC is considered unreliable today it'll be possible to use it anyways at least for sched_clock() (and possibly even gtod()) The exception would be system which really have none, but there should be very few of those. -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 13:33 ` Roman Zippel 2007-08-01 14:36 ` Ingo Molnar @ 2007-08-02 2:17 ` Linus Torvalds 2007-08-02 4:57 ` Willy Tarreau ` (3 more replies) 1 sibling, 4 replies; 535+ messages in thread From: Linus Torvalds @ 2007-08-02 2:17 UTC (permalink / raw) To: Roman Zippel Cc: Andi Kleen, Mike Galbraith, Ingo Molnar, Andrew Morton, linux-kernel On Wed, 1 Aug 2007, Roman Zippel wrote: > > I'm not so sure about that. sched_clock() has to be fast, so many archs > may want to continue to use jiffies. As soon as one does that one can also > save a lot of computational overhead by using 32bit instead of 64bit. > The question is then how easy that is possible. I have to say, it would be interesting to try to use 32-bit arithmetic. I also think it's likely a mistake to do a nanosecond resolution. That's part of what forces us to 64 bits, and it's just not even an *interesting* resolution. It would be better, I suspect, to make the scheduler clock totally distinct from the other clock sources (many architectures have per-cpu cycle counters), and *not* try to even necessarily force it to be a "time-based" one. So I think it would be entirely appropriate to - do something that *approximates* microseconds. Using microseconds instead of nanoseconds would likely allow us to do 32-bit arithmetic in more areas, without any real overflow. And quite frankly, even on fast CPU's, the scheduler is almost certainly not going to be able to take any advantage of the nanosecond resolution. Just about anything takes a microsecond - including IO. I don't think nanoseconds are worth the ten extra bits they need, if we could do microseconds in 32 bits. And the "approximates" thing would be about the fact that we don't actually care about "absolute" microseconds as much as something that is in the "roughly a microsecond" area. 
So if we say "it doesn't have to be microseconds, but it should be within a factor of two of a µs", we could avoid all the expensive divisions (even if they turn into multiplications with reciprocals), and just let people *shift* the CPU counter instead. In fact, we could just say that we don't even care about CPU counters that shift frequency - so what? It gets a bit further off the "ideal microsecond", but the scheduler just cares about _relative_ times between tasks (and that the total latency is within some reasonable value), it doesn't really care about absolute time. Hmm? It would still be true that something that is purely based on timer ticks will always be liable to have rounding errors that will inevitably mean that you don't get good fairness - tuning threads to run less than a timer tick at a time would effectively "hide" them from the scheduler accounting. However, in the end, I think that's pretty much unavoidable. We should make sure that things mostly *work* for that situation, but I think it's ok to say that the *quality* of the fairness will obviously suffer (and possibly a lot in the extreme cases). Linus ^ permalink raw reply [flat|nested] 535+ messages in thread
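[Editorial note: a minimal sketch of the shift-instead-of-divide idea Linus describes, with a hypothetical frequency; pick at boot the power of two nearest the cycles-per-microsecond ratio and convert with a single shift:]

```c
#include <stdint.h>

/* Hypothetical example: a 3.2 GHz cycle counter ticks 3200 times per
 * microsecond; the nearest power of two is 4096, i.e. a shift of 12.
 * The result is within a factor of two of a true microsecond, which is
 * enough for the scheduler's _relative_ arithmetic. */
static const unsigned int cyc_to_approx_us_shift = 12;

static inline uint32_t approx_us(uint64_t cycles)
{
        /* One shift, no division; truncating to 32 bits is fine since
         * only differences between nearby timestamps are ever used. */
        return (uint32_t)(cycles >> cyc_to_approx_us_shift);
}
```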
* Re: CFS review 2007-08-02 2:17 ` Linus Torvalds @ 2007-08-02 4:57 ` Willy Tarreau 2007-08-02 10:43 ` Andi Kleen 2007-08-02 16:09 ` Ingo Molnar ` (2 subsequent siblings) 3 siblings, 1 reply; 535+ messages in thread From: Willy Tarreau @ 2007-08-02 4:57 UTC (permalink / raw) To: Linus Torvalds Cc: Roman Zippel, Andi Kleen, Mike Galbraith, Ingo Molnar, Andrew Morton, linux-kernel On Wed, Aug 01, 2007 at 07:17:51PM -0700, Linus Torvalds wrote: > > > On Wed, 1 Aug 2007, Roman Zippel wrote: > > > > I'm not so sure about that. sched_clock() has to be fast, so many archs > > may want to continue to use jiffies. As soon as one does that one can also > > save a lot of computational overhead by using 32bit instead of 64bit. > > The question is then how easy that is possible. > > I have to say, it would be interesting to try to use 32-bit arithmetic. > > I also think it's likely a mistake to do a nanosecond resolution. That's > part of what forces us to 64 bits, and it's just not even an *interesting* > resolution. I would add that I have been bothered by the 64-bit arithmetics when trying to see what could be improved in the code. In fact, it's very hard to optimize anything when you have arithmetics on integers larger than the CPU's, and gcc is known not to emit very good code in this situation (I remember it could not play with register renaming, etc...). However, I understand why Ingo chose to use 64 bits. It has the advantage that the numbers never wrap within 584 years. I'm well aware that it's very difficult to keep tasks ordered according to a key which can wrap. But if we consider that we don't need to be more precise than the return value from gettimeofday() that all applications use, we see that a bunch of microseconds is enough. 32 bits at the microsecond level wraps around every hour. We may accept to recompute all keys every hour. It's not that dramatic. The problem is how to detect that we will need to. 
I remember a trick used by Tim Schmielau in his jiffies64 patch for 2.4. He kept a copy of the highest bit of the lower word in the lowest bit of the higher word, and considered that the lower one could not wrap before we could check it. I liked this approach, which could be translated here into something like the following: Have all keys use 32-bit resolution, and monitor the 32nd bit. All tasks must have the same value in this bit, otherwise we consider that their keys have wrapped. The "current" value of this bit is copied somewhere. When we walk the tree and find a task with a key which does not have its 32nd bit equal to the current value, it means that this key has wrapped, so we have to use this information in our arithmetic. When all keys have their 32nd bit different from the "current" value, then we switch this value to reflect the new 32nd bit, and everything is in sync again. The only requirement is that no key wraps around before the "current" value is switched. This implies that no two tasks could have keys more than 31 bits apart (35 minutes), which seems reasonable. If we can recompute all tasks' keys when all of them have wrapped, then we do not have to store the "current" bit value anymore, and consider that it is always zero instead (I don't know if the code permits this). It is possible that using the 32nd bit to detect the wrapping may force us to perform some computations on 33 bits. If this is the case, then it would be fine if we reduced the range to 31 bits, with all tasks at most 30 bits apart (17 minutes). Also, I remember that the key is signed. I've never experimented with the tricks above on signed values, but we might be able to define something like this for the higher bits: 00 = positive, no wrap 01 = positive, wrapped 10 = negative, wrapped 11 = negative, no wrap I have no code to show, I just wanted to expose this idea. 
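[Editorial note: a rough sketch of the generation-bit scheme described above, with invented names; nothing here comes from an actual patch in the thread:]

```c
#include <stdint.h>

/* Bit 31 of every key doubles as a generation flag.  cur_gen records
 * the value that unwrapped keys are expected to carry; a key whose top
 * bit differs has wrapped.  This stays correct only while no two live
 * keys are more than 2^31 units apart (~35 minutes at 1 us). */
static uint32_t cur_gen;                /* 0 or 1 */

static inline int key_wrapped(uint32_t key)
{
        return (key >> 31) != cur_gen;
}

/* Widen keys for comparison: wrapped keys are logically later, so give
 * them an extra high bit.  Once every live key reports wrapped, flip
 * cur_gen and all keys become "unwrapped" again, no recomputation. */
static inline uint64_t effective_key(uint32_t key)
{
        return key_wrapped(key) ? (1ULL << 32) | key : (uint64_t)key;
}
```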
I know that if Ingo likes it, he will beat everyone at implementing it ;-) > Linus Willy ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-02 4:57 ` Willy Tarreau @ 2007-08-02 10:43 ` Andi Kleen 2007-08-02 10:07 ` Willy Tarreau 0 siblings, 1 reply; 535+ messages in thread From: Andi Kleen @ 2007-08-02 10:43 UTC (permalink / raw) To: Willy Tarreau Cc: Linus Torvalds, Roman Zippel, Andi Kleen, Mike Galbraith, Ingo Molnar, Andrew Morton, linux-kernel Willy Tarreau <w@1wt.eu> writes: >(I remember it could not play with register renaming, etc...). This has changed in recent gccs. It doesn't force register pairs anymore. Granted, the code is still not that good, but some of the worst sins are gone > However, I understand why Ingo chose to use 64 bits. It has the advantage > that the numbers never wrap within 584 years. I'm well aware that it's > very difficult to keep tasks ordered according to a key which can wrap. If you define an appropriate window and use some macros for the comparisons it shouldn't be a big problem. > But if we consider that we don't need to be more precise than the return > value from gettimeofday() that all applications use, gettimeofday() has too strict requirements, that make it unnecessarily slow for this task. > every hour. We may accept to recompute all keys every hour. You don't need to recompute keys; just use careful comparisons using subtractions and a window. TCPs have done that for decades. > Have all keys use 32-bit resolution, and monitor the 32nd bit. All tasks If you're worried about wrapping in one hour why is wrapping in two hours not a problem? I have one request though. If anybody adds anything complicated for this please make it optional so that 64bit platforms are not burdened by it. -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
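[Editorial note: the "subtraction and a window" comparison Andi refers to is the same trick as the kernel's time_after() macros and TCP sequence-number arithmetic; a standalone sketch:]

```c
#include <stdint.h>

/* Wrap-safe ordering for 32-bit keys: subtract and test the sign.
 * Correct as long as the two values are less than 2^31 units apart,
 * i.e. exactly the windowed assumption described above. */
static inline int key_after(uint32_t a, uint32_t b)
{
        return (int32_t)(b - a) < 0;    /* true if a is later than b */
}
```

With this, keys never need recomputation on wrap: 0x00000001 correctly sorts after 0xffffffff.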
* Re: CFS review 2007-08-02 10:43 ` Andi Kleen @ 2007-08-02 10:07 ` Willy Tarreau 0 siblings, 0 replies; 535+ messages in thread From: Willy Tarreau @ 2007-08-02 10:07 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Roman Zippel, Mike Galbraith, Ingo Molnar, Andrew Morton, linux-kernel On Thu, Aug 02, 2007 at 12:43:29PM +0200, Andi Kleen wrote: > > However, I understand why Ingo chose to use 64 bits. It has the advantage > > that the numbers never wrap within 584 years. I'm well aware that it's > > very difficult to keep tasks ordered according to a key which can wrap. > > If you define an appropriate window and use some macros for the comparisons > it shouldn't be a big problem. It should not, but the hardest thing is often to keep values sorted within a window. When you store your values in a tree according to a key, it's not always easy to find the first one relative to a sliding offset. > > But if we consider that we don't need to be more precise than the return > > value from gettimeofday() that all applications use, > > gettimeofday() has too strict requirements, that make it unnecessarily slow > for this task. maybe we would use (TSC >> (x bits)) when TSC is available, or jiffies in other cases. I believe that people not interested in high time accuracies are not much interested in perfect fairness either. However, we should do our best to avoid any form of starvation, otherwise we jump back to the problem with the current scheduler. > > every hour. We may accept to recompute all keys every hour. > > You don't need to recompute keys; just use careful comparisons using > subtractions and a window. TCPs have done that for decades. > > > Have all keys use 32-bit resolution, and monitor the 32nd bit. All tasks > > If you're worried about wrapping in one hour why is wrapping in two > hours not a problem? I've not said I was worried about that (or maybe I explained myself poorly). Even if it was once a minute, it would not bother me. 
I wanted to explain that if the solution requires doing it that often, it should be acceptable. > I have one request though. If anybody adds anything complicated > for this please make it optional so that 64bit platforms are not > burdened by it. Fair enough. But the solution is not there yet. Maybe once it's there, it will be better than the current design for all platforms and the only difference will be the accuracy due to optional use of the TSC. > -Andi Willy ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-02 2:17 ` Linus Torvalds 2007-08-02 4:57 ` Willy Tarreau @ 2007-08-02 16:09 ` Ingo Molnar 2007-08-02 22:38 ` Roman Zippel 2007-08-02 19:16 ` Daniel Phillips 2007-08-02 23:23 ` Roman Zippel 3 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-02 16:09 UTC (permalink / raw) To: Linus Torvalds Cc: Roman Zippel, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel * Linus Torvalds <torvalds@linux-foundation.org> wrote: > It would be better, I suspect, to make the scheduler clock totally > distinct from the other clock sources (many architectures have per-cpu > cycle counters), and *not* try to even necessarily force it to be a > "time-based" one. yeah. Note that i largely detached sched_clock() from the GTOD clocksources already in CFS, so part of this is already implemented and the intention is clear. For example, when the following happens: Marking TSC unstable due to: possible TSC halt in C2. Clocksource tsc unstable (delta = -71630388 ns) sched_clock() does _not_ stop using the TSC. It is very careful with the TSC value, it checks against wraps, jumping, etc. (the whole rq_clock() wrapper around sched_clock()), but still tries to use the highest resolution time source possible, even if that time source is not good enough for GTOD's purposes anymore. So the scheduler clock is already largely detached from other clocksources in the system. > It would still be true that something that is purely based on timer > ticks will always be liable to have rounding errors that will > inevitably mean that you don't get good fairness - tuning threads to > run less than a timer tick at a time would effectively "hide" them > from the scheduler accounting. However, in the end, I think that's > pretty much unavoidable. Note that there is a relatively easy way of reducing the effects of such intentional coupling: turn on CONFIG_HIGH_RES_TIMERS. 
That decouples the scheduler tick from the jiffy tick and works against such 'exploits' - _even_ if the scheduler clock is otherwise low resolution. Also enable CONFIG_NO_HZ and the whole thing (of when the scheduler tick kicks in) becomes very hard to predict. [ So while in a low-res clock situation scheduling will always be less precise, with hres-timers and dynticks we have a natural 'random sampler' mechanism so that no task can couple to the scheduler tick - accidentally or even intentionally. The only 'unavoidable coupling' scenario is when the hardware has only a single, low-resolution time sampling method. (that is pretty rare though, even in the ultra-embedded space. If a box has two independent hw clocks, even if they are low resolution, the timer tick can be decoupled from the scheduler tick.) ] > I have to say, it would be interesting to try to use 32-bit > arithmetic. yeah. We tried to do as much of that as possible, please read on below for (many) more details. There's no short summary i'm afraid :-/ Most importantly, CFS _already_ includes a number of measures that act against too frequent math. So even though you can see 64-bit math code in it, it's only rarely called if your clock has a low resolution - and that happens all automatically! (see below the details of this buffered delta math) I have not seen Roman notice and mention any of these important details (perhaps because he was concentrating on finding faults in CFS - which a reviewer should do), but those measures are still very important for a complete, balanced picture, especially if one focuses on overhead on small boxes where the clock is low-resolution. As Peter has said it in his detailed review of Roman's suggested algorithm, our main focus is on keeping total complexity down - and we are (of course) fundamentally open to changing the math behind CFS, we ourselves tweaked it numerous times, it's not cast into stone in any way, shape or form. 
> I also think it's likely a mistake to do a nanosecond resolution. > That's part of what forces us to 64 bits, and it's just not even an > *interesting* resolution. yes, very much not interesting, and we did not pick nanoseconds because we find anything "interesting" in that timescale. Firstly, before i go into the thinking behind nanoseconds, microseconds indeed have advantages too, so the choice was not easy, see the arguments in favor of microseconds below at: [*]. There are two fundamental reasons for nanoseconds: 1) they _automatically_ act as a 'carry-over' for fractional math and thus reduce rounding errors. As Roman has noticed we dont care much about math rounding errors in the algorithm: _exactly_ because we *dont have to* care about rounding errors: we've got extra 10 "uninteresting" bits in the time variables to offset the effects of rounding errors and to carry over fractionals. ( Sidenote: in fact we had some simple extra anti-rounding-error code in CFS and removed it because it made no measurable difference. So even in the current structure there's additional design reserves that we could tap before having to go to another math model. All we need is a testcase that demonstrates rounding errors. Roman's testcase was _not_ a demonstration of math rounding errors, it was about clock granularity! ) 2) i tried microseconds resolution once (it's easy) but on fast hw it already showed visible accounting/rounding artifacts in high-frequency scheduling workloads, which, if hw gets just a little bit faster, will become pain. ( Sidenote: if a workload is rescheduling once every few microseconds, then it very much matters to the balance of things whether there's a fundamental +- 1 microsecond skew on who gets accounted what. 
In fact the placement of the sched_clock() call within schedule() is already visible in practice in some testcases: whether the runqueue-spinlock acquire spinning time is accounted to the 'previous' or the 'next' task - despite that time being sub-microsecond on average. Going to microseconds makes this too coarse. )

I.e. microseconds are at the limit today on fast hardware, and nanoseconds give us an automatic buffer against rounding errors.

On _slow_ hardware, with a low-resolution clock, I very much agree that we should not do too frequent math, and there are already four independent measures we took in CFS to keep the math overhead down.

Firstly, CFS fundamentally "buffers the math" via deltas, _everywhere_ in the fastpath:

	if (unlikely(curr->delta_exec > sysctl_sched_stat_granularity)) {
		__update_curr(cfs_rq, curr, now);
		curr->delta_exec = 0;
	}

I.e. we only call the math routines once enough delta has accumulated. The beauty is that this "math buffering" works _automatically_ if a low-resolution sched_clock() is present, because with a low-resolution clock a delta is only observed when a tick happens. I.e. in a high-frequency scheduling workload (the only place where scheduler overhead would be noticeable) all the CFS math is in a rare _slowpath_, which gets triggered only every 10 msecs or so (if HZ=1000)!

I.e. we didn't have to go down the (very painful) path of ktime_t split-model math and we didn't have to introduce a variable-precision model for "scheduler time" either, because the delta math itself automatically buffers on slow boxes.

Secondly, note that all the key fastpath variables are already 32-bit (on 32-bit platforms):

	long wait_runtime;
	unsigned long delta_fair_run;
	unsigned long delta_fair_sleep;
	unsigned long delta_exec;

The _timestamps_ are still 64-bit, but most of the actual math goes on in 32-bit delta variables. That's one of the advantages of doing deltas instead of absolute values.
Thirdly, if even this amount of buffering is not enough for an architecture, CFS also includes the sched_stat_granularity_ns tunable, which allows further reduction of the sampling frequency (and of how often we have to do the math) - so if the math overhead is a problem, an architecture can set it.

Fourthly, in CFS there is a _single_ 64-bit division, and even for that division the actual values passed in are typically in 32-bit range. Hence we've introduced the following optimization:

	static u64 div64_likely32(u64 divident, unsigned long divisor)
	{
	#if BITS_PER_LONG == 32
		if (likely(divident <= 0xffffffffULL))
			return (u32)divident / divisor;
		do_div(divident, divisor);
		return divident;
	#else
		return divident / divisor;
	#endif
	}

Which, if the divident is in 32-bit range, does a 32-bit division.

About the math-related rounding errors mentioned by Roman (not to be confused with clock-granularity rounding): in our analysis and in our experience, rounding errors of the math were never an issue in CFS, due to the extra buffering that nanosecs give - I tweaked it a bit around CFSv10 but could not demonstrate any measurable effect. That's one of the advantages of working with nanoseconds: the fundamental time unit includes about 10 "low bits" that can carry over much of the error and reduce rounding artifacts. And even those math rounding errors, we believe, center around 0.

In Roman's variant of CFS's algorithm the variables are 32-bit, but the error is rolled forward in separate fract_* (fractional) 32-bit variables, so we still have 32+32==64 bits of stuff to handle. So we think that in the end such a 32+32 scheme would be more complex (and anyone who took a look at fs2.c would, I think, agree - it took Peter a day to decipher the math!) - but we'll be happy to be surprised with patches, of course :-)

	Ingo

[*] the main advantage of microseconds would be that we could use "u32" throughout to carry around the "now" variable (timestamp).
That property of microseconds _is_ tempting and would reduce the task_struct footprint as well. But if we did that it would have ripple effects: we'd have to resort to (pretty complex and non-obvious) math to act against rounding errors. We'd either have to carry the rounding error with us in separate 32-bit variables (in essence creating a 32+32-bit, i.e. 64-bit, variable), or we'd have to shift the microseconds up by, say, 10 binary positions - in essence bringing us back into the same nanoseconds range.

And then there's the wraparound problem of microseconds - 72 hours is not _that_ unrealistic to trigger intentionally, so we have to do _something_ about it to inhibit infinite starvation.

We experimented with all this, and the overwhelming conclusion (so far) was that trying to reduce timestamps back to 32 bits is just not worth it. _Deltas_ should be, can be and _are_ 32-bit values already, even in the nanosecond model. So all the buffering and delta logic gives us most of the 32-bit advantages already, without the disadvantages of microseconds.

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-02 16:09 ` Ingo Molnar @ 2007-08-02 22:38 ` Roman Zippel 0 siblings, 0 replies; 535+ messages in thread From: Roman Zippel @ 2007-08-02 22:38 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Andi Kleen, Mike Galbraith, Andrew Morton, linux-kernel

Hi,

On Thu, 2 Aug 2007, Ingo Molnar wrote:

> Most importantly, CFS _already_ includes a number of measures that act
> against too frequent math. So even though you can see 64-bit math code
> in it, it's only rarely called if your clock has a low resolution - and
> that happens all automatically! (see below the details of this buffered
> delta math)
>
> I have not seen Roman notice and mention any of these important details
> (perhaps because he was concentrating on finding faults in CFS - which a
> reviewer should do), but those measures are still very important for a
> complete, balanced picture, especially if one focuses on overhead on
> small boxes where the clock is low-resolution.
>
> As Peter has said it in his detailed review of Roman's suggested
> algorithm, our main focus is on keeping total complexity down - and we
> are (of course) fundamentally open to changing the math behind CFS, we
> ourselves tweaked it numerous times, it's not cast into stone in any
> way, shape or form.

You're comparing apples with oranges; I explicitly said: "At this point I'm not that much interested in a few localized optimizations, what I'm interested in is how this can be optimized at the design level"

IMO it's very important to keep computational and algorithmic complexity separate. I want to concentrate on the latter, so unless you can _prove_ that a similar set of optimizations is impossible within my example, I'm going to ignore them for now. CFS has already gone through several versions of optimization and tuning; expecting the same from my design prototype is a little confusing...
I want to analyze the foundation CFS is based on; in the review I mentioned a number of other issues and design-related questions. If you need more time, that's fine, but I'd appreciate more background information related to that, and not that you only jump on the more trivial issues.

> In Roman's variant of CFS's algorithm the variables are 32-bit, but the
> error is rolled forward in separate fract_* (fractional) 32-bit
> variables, so we still have 32+32==64 bit of stuff to handle. So we
> think that in the end such a 32+32 scheme would be more complex (and
> anyone who took a look at fs2.c would i think agree - it took Peter a
> day to decypher the math!)

Come on, Ingo, you can do better than that; I did mention in my review some of the requirements for the data types. I'm amazed how you can get to that judgement so quickly - could you please substantiate it a little more? I admit that the lack of source comments is an open invitation for further questions, and Peter did exactly this and his comments were great - I'm hoping for more like that. You OTOH jump to conclusions based on a partial understanding of what I'm actually trying to do.

Ingo, how about you provide some of the mathematical proof CFS is based on? Can you prove that the rounding errors are irrelevant? Can you prove that all the limit checks can have no adverse effect? I tried that and I'm not entirely convinced, but maybe it's just me, so I'd love to see someone else's attempt at this. A major goal of my design is to be able to define the limits within which the scheduler is working correctly, so I know which information is relevant and what can be approximated.

bye, Roman

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-02 2:17 ` Linus Torvalds 2007-08-02 4:57 ` Willy Tarreau 2007-08-02 16:09 ` Ingo Molnar @ 2007-08-02 19:16 ` Daniel Phillips 2007-08-02 23:23 ` Roman Zippel 3 siblings, 0 replies; 535+ messages in thread From: Daniel Phillips @ 2007-08-02 19:16 UTC (permalink / raw) To: Linus Torvalds Cc: Roman Zippel, Andi Kleen, Mike Galbraith, Ingo Molnar, Andrew Morton, linux-kernel Hi Linus, On Wednesday 01 August 2007 19:17, Linus Torvalds wrote: > And the "approximates" thing would be about the fact that we don't > actually care about "absolute" microseconds as much as something > that is in the "roughly a microsecond" area. So if we say "it doesn't > have to be microseconds, but it should be within a factor of two of a > ms", we could avoid all the expensive divisions (even if they turn > into multiplications with reciprocals), and just let people *shift* > the CPU counter instead. On that theme, expressing the subsecond part of high precision time in decimal instead of left-aligned binary always was an insane idea. Applications end up with silly numbers of multiplies and divides (likely as not incorrect) whereas they would often just need a simple shift as you say, if the tv struct had been defined sanely from the start. As a bonus, whenever precision gets bumped up, the new bits appear on the right in formerly zero locations on the right, meaning little if any code needs to change. What we have in the incumbent libc timeofday scheme is the moral equivalent of BCD. Of course libc is unlikely ever to repent, but we can at least put off converting into the awkward decimal format until the last possible instant. In other words, I do not see why xtime is expressed as a tv instead of simple 32.32 fixed point. Perhaps somebody can elucidate me? Regards, Daniel ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-02 2:17 ` Linus Torvalds ` (2 preceding siblings ...) 2007-08-02 19:16 ` Daniel Phillips @ 2007-08-02 23:23 ` Roman Zippel 3 siblings, 0 replies; 535+ messages in thread From: Roman Zippel @ 2007-08-02 23:23 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Mike Galbraith, Ingo Molnar, Andrew Morton, linux-kernel

Hi,

On Wed, 1 Aug 2007, Linus Torvalds wrote:

> So I think it would be entirely appropriate to
>
>  - do something that *approximates* microseconds.
>
> Using microseconds instead of nanoseconds would likely allow us to do
> 32-bit arithmetic in more areas, without any real overflow.

The basic problem is that one needs a number of bits (at least 16) for normalization, which limits the time range one can work with. This means that 32 bits leave room for only about 1 millisecond resolution; the remainder could maybe be saved and reused later. So AFAICT using micro- or nanosecond resolution doesn't make much computational difference.

bye, Roman

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 3:41 ` CFS review Roman Zippel ` (3 preceding siblings ...) 2007-08-01 13:20 ` Andi Kleen @ 2007-08-01 14:40 ` Ingo Molnar 2007-08-01 14:49 ` Peter Zijlstra 2007-08-02 15:46 ` Ingo Molnar 6 siblings, 0 replies; 535+ messages in thread From: Ingo Molnar @ 2007-08-01 14:40 UTC (permalink / raw) To: Roman Zippel; +Cc: Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel * Roman Zippel <zippel@linux-m68k.org> wrote: > while (1) > sched_yield(); sched_yield() is being reworked at the moment. But in general we want apps to move away to sane locking constructs ASAP. There's some movement in the 3D space at least. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 3:41 ` CFS review Roman Zippel ` (4 preceding siblings ...) 2007-08-01 14:40 ` Ingo Molnar @ 2007-08-01 14:49 ` Peter Zijlstra 2007-08-02 17:36 ` Roman Zippel 2007-08-02 15:46 ` Ingo Molnar 6 siblings, 1 reply; 535+ messages in thread From: Peter Zijlstra @ 2007-08-01 14:49 UTC (permalink / raw) To: Roman Zippel Cc: Mike Galbraith, Linus Torvalds, Ingo Molnar, Andrew Morton, linux-kernel

Hi Roman,

Took me most of today trying to figure out WTH you did in fs2.c; more math and fundamental explanations would have been good. So please bear with me as I try to recap this thing. (No, your code was very much _not_ obvious; a few comments and broken-out functions would have made a world of difference.)

So, for each task we keep normalised time

	normalised time := time/weight

using Bresenham's algorithm we can do this perfectly (up until a renice - where you'd get errors)

	avg_frac += weight_inv

	weight_inv = X / weight

	avg = avg_frac / weight0_inv

	weight0_inv = X / weight0

	avg = avg_frac / (X / weight0)
	    = (X / weight) / (X / weight0)
	    = X / weight * weight0 / X
	    = weight0 / weight

So avg ends up being in units of [weight0/weight].

Then, in order to allow sleeping, we need to have a global clock to sync with. It's this global clock that gave me headaches to reconstruct. We're looking for a time like this:

	rq_time := sum(time)/sum(weight)

And you commented that the /sum(weight) part is where CFS obtained its accumulating rounding error? (I'm inclined to believe the error will statistically be 0, but I'll readily accept otherwise if you can show a practical 'exploit')

It's not obvious how to do this using modulo logic like Bresenham, because that would involve using a gcm of all possible weights. What you ended up with is quite interesting, if correct.
sum_avg_frac += weight_inv_{i} however by virtue of the scheduler minimising: avg_{i} - avg_{j} | i != j this gets a factor of: weight_{i}/sum_{j}^{N}(weight_{j}) ( seems correct, needs more analysis though, this is very much a statistical step based on the previous constraint. this might very well introduce some errors ) resulting in: sum_avg_frac += sum_{i}^{N}(weight_inv_{i} * weight_{i}/sum_{j}^{N}(weight_{j})) weight_inv = X / weight sum_avg = sum_avg_frac / sum(weight0_inv) = sum_avg_frac / N*weight0_inv weight0_inv = X / weight0 sum_avg = sum_avg_frac / N*weight0_inv = sum_{i}^{N}(weight_inv_{i} * weight_{i}/sum_{j}^{N}(weight_{j})) / N*weight0_inv = sum_{i}^{N}(X/weight_{i} * weight_{i}/sum_{j}^{N}(weight_{j})) / N*(X/weight0) = N*X / sum_{j}(weight_{j}) * weight0/N*X = weight0 / sum_{j}(weight_{j}) Exactly the unit we were looking for [weight0/sum(weight)] ( the extra weight0 matching the one we had in the per task normalised time ) I'm not sure all this is less complex than CFS, I'd be inclined to say it is more so. Also, I think you have an accumulating error on wakeup where you sync with the global clock but fully discard the fraction. Anyway, as said a more detailed explanation and certainly a proof of your math would be nice. Is this something along the lines of what you intended to convey? If so, in the future please use more understandable language, we were taught math for a reason :-) Regards, Peter ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 14:49 ` Peter Zijlstra @ 2007-08-02 17:36 ` Roman Zippel 0 siblings, 0 replies; 535+ messages in thread From: Roman Zippel @ 2007-08-02 17:36 UTC (permalink / raw) To: Peter Zijlstra Cc: Mike Galbraith, Linus Torvalds, Ingo Molnar, Andrew Morton, linux-kernel Hi, On Wed, 1 Aug 2007, Peter Zijlstra wrote: > Took me most of today trying to figure out WTH you did in fs2.c, more > math and fundamental explanations would have been good. So please bear > with me as I try to recap this thing. (No, your code was very much _not_ > obvious, a few comments and broken out functions would have made a world > of a difference) Thanks for the effort though. :) I know I'm not the best explaining these things, so I really appreciate the questions, so I know what to concentrate on. > So, for each task we keep normalised time > > normalised time := time/weight > > using Bresenham's algorithm we can do this prefectly (up until a renice > - where you'd get errors) > > avg_frac += weight_inv > > weight_inv = X / weight > > avg = avg_frac / weight0_inv > > weight0_inv = X / weight0 > > avg = avg_frac / (X / weight0) > = (X / weight) / (X / weight0) > = X / weight * weight0 / X > = weight0 / weight > > > So avg ends up being in units of [weight0/weight]. > > Then, in order to allow sleeping, we need to have a global clock to sync > with. Its this global clock that gave me headaches to reconstruct. > > We're looking for a time like this: > > rq_time := sum(time)/sum(weight) > > And you commented that the /sum(weight) part is where CFS obtained its > accumulating rounding error? (I'm inclined to believe the error will > statistically be 0, but I'll readily accept otherwise if you can show a > practical 'exploit') > > Its not obvious how to do this using modulo logic like Bresenham because > that would involve using a gcm of all possible weights. I think I've sent you off into the wrong direction somehow. Sorry. 
:)

Let's ignore the average for a second; normalized time is maintained as:

	normalized time := time * (2^16 / weight)

The important point is that I keep the value in the full resolution of 2^-16 vsec units (vsec for virtual second, or sec/weight, where every task gets weight seconds for every virtual second; to keep things simpler I also omit the nano prefix from the units for a moment). Compared to that, CFS maintains a global normalized value in 1 vsec units. Since I don't round the value down, I avoid the accumulating error; this means that

	time_norm += time_delta1 * (2^16 / weight)
	time_norm += time_delta2 * (2^16 / weight)

is the same as

	time_norm += (time_delta1 + time_delta2) * (2^16 / weight)

CFS for example does this:

	delta_mine = calc_delta_mine(delta_exec, curr->load.weight, lw);

in the above terms this means

	time = time_delta * weight * (2^16 / weight_sum) / 2^16

The last shift now rounds the value down, and if one does that 1000 times per second, the resolution of the value that is finally accounted to wait_runtime is reduced accordingly.

The other rounding problem stems from the fact that this term

	x * prio_to_weight[i] * prio_to_wmult[i] / 2^32

doesn't produce x for most values in those tables (the same applies to the weight sum), so if we have chains where the values are converted from one scale to the other, a rounding error is produced. In CFS this happens because wait_runtime is maintained in nanoseconds while fair_clock is a normalized value.

The problem here isn't that these errors might have statistical relevance - they are usually completely overshadowed by measurement errors anyway. The problem is that these errors exist at all; this means they have to be compensated somehow, so that they don't accumulate over time and become significant. This also has to be seen in the context of the overflow checks. All this adds a number of variables to the system, which considerably increases complexity and makes a thorough analysis quite challenging.
So to get back to the average: if you look for

	rq_time := sum(time)/sum(weight)

you won't find it like this; this basically produces a weighted average, and I agree it can't really be maintained via the modulo logic (at least AFAICT), so I'm using a simple average instead. So if we have:

	time_norm = time/weight

we can write your rq_time like this:

	weighted_avg = sum_{i}^{N}(time_norm_{i}*weight_{i})/sum_{i}^{N}(weight_{i})

this is the formula for a weighted average, so we can approximate the value using a simple average instead:

	avg = sum_{i}^{N}(time_norm_{i})/N

This sum is now what I maintain at runtime incrementally:

	time_{i} = sum_{j}^{S}(time_{j})

	time_norm_{i} = time_{i}/weight_{i}
	              = sum_{j}^{S}(time_{j})/weight_{i}
	              = sum_{j}^{S}(time_{j}/weight_{i})

If I add this up and add weight0 I get:

	avg*N*weight0 = sum_{i}^{N}(time_norm_{i})*weight0

and now I also have the needed modulo factors. The average could probably be further simplified by using a different approximation. The question is how perfect this average has to be. The average is only used to control when a task gets its share; currently higher-priority tasks are already given a bit more preference, or a sleep bonus is added.

In CFS the average already isn't perfect due to the above rounding problems, otherwise the sum of all updated wait_runtime values would be 0, and if a task with a wait_runtime value different from 0 is added, fair_clock would have to change too to keep the balance. So unless someone has a brilliant idea, I guess we have to settle for an approximation of a perfect scheduler. :)

My approach differs in that I at least maintain an accurate fair share and approximate the scheduling decision based on it. This has IMO the advantage that the scheduler function can be easily exchanged: one can do it the quick and dirty way, or one can put in the extra effort to get it closer to perfection. Either way every task will get its fair share.
The scheduling function I used is rather simple: if (j >= 0 && ((int)(task[l].time_norm - (task[j].time_norm + SLICE(j))) >= 0 || ((int)(task[l].time_norm - task[j].time_norm) >= 0 && (int)(task[l].time_norm - (s + SLICE(l))) >= 0))) { So a new task is selected if there is a higher priority task or if the task has used up its share (unless the other task has lower priority and already got its share). It would be interesting to use a dynamic (per task) time slice here, which should make it possible to control the burstiness that has been mentioned. > I'm not sure all this is less complex than CFS, I'd be inclined to say > it is more so. The complexity is different, IMO the basic complexity to maintain the fair share is less and I can add arbitrary complexity to improve scheduling on top of it. Most of it comes now from maintaining the accurate average, it depends now on how the scheduling is finally done what other optimizations are possible. > Also, I think you have an accumulating error on wakeup where you sync > with the global clock but fully discard the fraction. I hope not, otherwise the checks at the end should have triggered. :) The per task requirement is time_norm = time_avg * weight0 + avg_fract I don't simply discard it, it's accounted to time_norm and then synced to the global average. bye, Roman ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-01 3:41 ` CFS review Roman Zippel ` (5 preceding siblings ...) 2007-08-01 14:49 ` Peter Zijlstra @ 2007-08-02 15:46 ` Ingo Molnar 6 siblings, 0 replies; 535+ messages in thread From: Ingo Molnar @ 2007-08-02 15:46 UTC (permalink / raw) To: Roman Zippel; +Cc: Mike Galbraith, Linus Torvalds, Andrew Morton, linux-kernel * Roman Zippel <zippel@linux-m68k.org> wrote: > [...] With the increased text comes increased runtime memory usage, > e.g. task_struct increased so that only 5 of them instead 6 fit now > into 8KB. yeah, thanks for the reminder, this is on my todo list. As i suspect you noticed it too, much of the task_struct size increase is not fundamental and not related to 64-bit math at all - it's simply debug and instrumentation overhead. Look at the following table (i386, nodebug): size ---- pre-CFS 1328 CFS 1472 CFS+patch 1376 the very small patch below gets rid of 96 bytes. And that's only the beginning. Ingo --------------------------------------------------> --- include/linux/sched.h | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -905,23 +905,28 @@ struct sched_entity { struct rb_node run_node; unsigned int on_rq; + u64 exec_start; + u64 sum_exec_runtime; u64 wait_start_fair; + u64 sleep_start_fair; + +#ifdef CONFIG_SCHEDSTATS u64 wait_start; - u64 exec_start; + u64 wait_max; + s64 sum_wait_runtime; + u64 sleep_start; - u64 sleep_start_fair; - u64 block_start; u64 sleep_max; + s64 sum_sleep_runtime; + + u64 block_start; u64 block_max; u64 exec_max; - u64 wait_max; - u64 last_ran; - u64 sum_exec_runtime; - s64 sum_wait_runtime; - s64 sum_sleep_runtime; unsigned long wait_runtime_overruns; unsigned long wait_runtime_underruns; +#endif + #ifdef CONFIG_FAIR_GROUP_SCHED struct sched_entity *parent; /* rq on which this entity is (to be) 
queued: */ ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 17:42 ` Ingo Molnar 2007-07-11 21:02 ` Randy Dunlap 2007-07-11 21:16 ` Andi Kleen @ 2007-07-11 21:42 ` Linus Torvalds 2007-07-11 22:04 ` Thomas Gleixner 2007-07-11 23:19 ` Ingo Molnar 2 siblings, 2 replies; 535+ messages in thread From: Linus Torvalds @ 2007-07-11 21:42 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Chris Wright

On Wed, 11 Jul 2007, Ingo Molnar wrote:
>
> What you just did here is a slap in the face to a lot of contributors
> who worked hard on this code :(

Ingo, I'm sorry to say so, but your answer just convinced me that you're wrong, and we MUST NOT take that code.

That was *exactly* the same thing you talked about when I refused to take the original timer changes into 2.6.20. You were talking about how lots of people had worked really hard, and how it was really tested. And it damn well was NOT really tested, and 2.6.21 ended up being a horribly painful experience (one of the more painful kernel releases in recent times), and we ended up having to fix a *lot* of stuff. And you admitted you were wrong at the time.

Now you do the *exact* same thing.

Here's a big clue: it doesn't matter one _whit_ how much face-slapping you get, or how much effort some programmers have put into the code. It's untested. And no, we are *not* going to do another "rip everything out, and replace it with new code" again.

Over my dead body.

We're going to do this thing gradually, or not at all. And if somebody feels slighted by the face-slap, and thinks he has already done enough, and isn't interested in doing it gradually, then good riddance. The "not at all" seems like a good idea, and maybe we can re-visit this in a year or two.

I'm not going to have another 2.6.21 on my hands.

		Linus

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 21:42 ` x86 status was Re: -mm merge plans for 2.6.23 Linus Torvalds @ 2007-07-11 22:04 ` Thomas Gleixner 2007-07-11 22:20 ` Linus Torvalds 2007-07-11 23:19 ` Ingo Molnar 1 sibling, 1 reply; 535+ messages in thread From: Thomas Gleixner @ 2007-07-11 22:04 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Linus, On Wed, 2007-07-11 at 14:42 -0700, Linus Torvalds wrote: > Here's a big clue: it doesn't matter one _whit_ how much face-slapping you > get, or how much effort some programmers have put into the code. It's > untested. And no, we are *not* going to do another "rip everything out, > and replace it with new code" again. > > Over my dead body. > > We're going to do this thing gradually, or not at all. Can you please shed some light on me, how exactly you switch an architecture gradually to clock events. You simply can not convert PIT today and the HPET next week followed by the local APIC in three month. At least not to my knowledge. > And if somebody feels slighted by the face-slap, and thinks he has already > done enough, and isn't interested in doing it gradually, then good > riddance. The "not at all" seems like a good idea, and maybe we can > re-visit this in a year or two. I have no problem to brew this for some more time. I got not repulsed by the 2.6.20 decision, but I have no clue how to communicate with a black hole. tglx ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 22:04 ` Thomas Gleixner @ 2007-07-11 22:20 ` Linus Torvalds 2007-07-11 22:50 ` Thomas Gleixner 2007-07-11 22:51 ` Chris Wright 0 siblings, 2 replies; 535+ messages in thread From: Linus Torvalds @ 2007-07-11 22:20 UTC (permalink / raw) To: Thomas Gleixner Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright On Thu, 12 Jul 2007, Thomas Gleixner wrote: > > Can you please shed some light on me, how exactly you switch an > architecture gradually to clock events. For example, we can make sure that the code in question that actually touches the hardware stays exactly the same, and then just move the interfaces around - and basically guarantee that _zero_ hardware-specific issues pop up when you switch over, for example. That way there is a gradual change-over. The other approach (which would be nice _too_) is to actually try to convert one clock source at a time. Why is that not an option? Linus ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 22:20 ` Linus Torvalds @ 2007-07-11 22:50 ` Thomas Gleixner 2007-07-11 23:03 ` Chris Wright ` (2 more replies) 2007-07-11 22:51 ` Chris Wright 1 sibling, 3 replies; 535+ messages in thread From: Thomas Gleixner @ 2007-07-11 22:50 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Linus, On Wed, 2007-07-11 at 15:20 -0700, Linus Torvalds wrote: > For example, we can make sure that the code in question that actually > touches the hardware stays exactly the same, and then just move the > interfaces around - and basically guarantee that _zero_ hardware-specific > issues pop up when you switch over, for example. > > That way there is a gradual change-over. Ok, I can try to split this down further. > The other approach (which would be nice _too_) is to actually try to > convert one clock source at a time. Why is that not an option? We need to give control to the clock events core code once we convert one clock event device. Having two competing subsystems controlling different devices (e.g. PIT and APIC) is not really desirable. The HPET change, which is the larger part of the conversion set simply because we now share the code with i386, might be split out by disabling HPET in the first step, doing the PIT / APIC conversion and then the HPET one in a separate step. Thanks, tglx ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 22:50 ` Thomas Gleixner @ 2007-07-11 23:03 ` Chris Wright 2007-07-11 23:07 ` Linus Torvalds 2007-07-12 20:38 ` Matt Mackall 2 siblings, 0 replies; 535+ messages in thread From: Chris Wright @ 2007-07-11 23:03 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Thomas Gleixner (tglx@linutronix.de) wrote: > The HPET change, which is the larger part of the conversion set simply > because we now share the code with i386, might be split out by disabling > HPET in the first step, doing the PIT / APIC conversion and then the > HPET one in a separate step. The timer specific changes (i.e. the merges between arches) can be done more slowly, but the setup above is basically where I started, and it was already broken on one of my test boxes. Anyway, I'll help you however I can, because it's important to me to get this merged. thanks, -chris ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 22:50 ` Thomas Gleixner 2007-07-11 23:03 ` Chris Wright @ 2007-07-11 23:07 ` Linus Torvalds 2007-07-11 23:29 ` Thomas Gleixner 2007-07-11 23:36 ` Andi Kleen 2007-07-12 20:38 ` Matt Mackall 2 siblings, 2 replies; 535+ messages in thread From: Linus Torvalds @ 2007-07-11 23:07 UTC (permalink / raw) To: Thomas Gleixner Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright

On Thu, 12 Jul 2007, Thomas Gleixner wrote:
>
> The HPET change, which is the larger part of the conversion set simply
> because we now share the code with i386, might be split out by disabling
> HPET in the first step, doing the PIT / APIC conversion and then the
> HPET one in a separate step.

But that misses the point. It means that the commit that actually *changes* the code never actually gets tested on its own.

Why not just fix up the HPET code so that it can be shared *first*, without the other conversion? Really - what's so wrong with the hpet.c changes in the *absence* of conversion to clockevents? Those changes seem to be totally independent - just abstracting out the "hpet_get_virt_address()" stuff etc. None of that has anything to do with clockevents, as far as I can see.

In other words, you now change an i386-only file, and maybe it breaks subtly on i386 as a result. Wouldn't it be nicer to see that breakage as a separate event?

Then, the x86-64 clockevents code will switch over entirely, but now it switches over to something we can say has gotten testing, and we know the switch-over won't break any 32-bit code, because the switch-over literally didn't change anything at all for that case.

See? THAT is what I mean by "gradual". Bugs happen, but if we can make _independent_ bugs show up in _independent_ commits, that will make it much easier to figure out what happened.

The same is true of a lot of the APIC timer code.
Sure, that patch has the actual conversion in it, and you don't have the cross-architecture issues, but more than 50% of the patch seems to be just cleanup that is independent of the actual switch-over, no? Again, if it was done as a "one patch for cleanup, and another patch that actually switches the higher-level interfaces around", then the two mostly independent issues (of "hardware access/initialization" vs "higher-level changes in how it got called") get done as two independent commits. And no, I really probably wouldn't ask for this, but 2.6.21 showed *exactly* this problem. Trivial debugging helps like "git bisect" didn't help at all, because all the problems started when the new code was "activated", not when it was actually brought in. Linus ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 23:07 ` Linus Torvalds @ 2007-07-11 23:29 ` Thomas Gleixner 2007-07-11 23:36 ` Andi Kleen 1 sibling, 0 replies; 535+ messages in thread From: Thomas Gleixner @ 2007-07-11 23:29 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Linus, On Wed, 2007-07-11 at 16:07 -0700, Linus Torvalds wrote: > Why not just fix up the HPET code so that it can be shared *first*. > Without the other conversion? Really - What's so wrong with the hpet.c > changes in the *absence* of conversion to clockevents? Those changes seem > to be totally independent - just abstracting out the > "hpet_get_virt_address()" stuff etc. > > None of that has anything to do with clockevents, as far as I can see. > > In other words, you now change an i386-only file, and maybe it breaks > subtly on i386 as a result. Wouldn't it be nicer to see that breakage as a > separate event? Sure, I meant to do the HPET changes to i386 separately as a preparatory patch. Sharing HPET before the conversion is nasty at best (it involves a ton of ifdeffery at least). > Then, the x86-64 clockevents code will switch over entirely, but now it > switches over to something we can say has gotten testing, and we know the > switch-over won't break any 32-bit code, because the switch-over literally > didn't change anything at all for that case. Well, we know that it works on i386, but once we turn on the x64 switch we have not tested the shared code for x64 yet. I'm trying to find a practicable compromise between the big bang patch and the theoretical gradual optimum. > The same is true of a lot of the APIC timer code. Sure, that patch has the > actual conversion in it, and you don't have the cross-architecture issues, > but more than 50% of the patch seems to be just cleanup that is > independent of the actual switch-over, no? I said before that I'm going to split them further. 
tglx ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 23:07 ` Linus Torvalds 2007-07-11 23:29 ` Thomas Gleixner @ 2007-07-11 23:36 ` Andi Kleen 2007-07-11 23:48 ` Thomas Gleixner 2007-07-11 23:58 ` Ingo Molnar 1 sibling, 2 replies; 535+ messages in thread From: Andi Kleen @ 2007-07-11 23:36 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright > The same is true of a lot of the APIC timer code. Sure, that patch has the > actual conversion in it, and you don't have the cross-architecture issues, > but more than 50% of the patch seems to be just cleanup that is > independent of the actual switch-over, no? I don't think it's that much cleanup. One of my goals for x86-64 was always to have it support modern x86 only; this means in particular having most of the old bug workarounds removed. With the APIC timer merging a lot of that crap gets back in. I would prefer to keep APIC code separate. -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 23:36 ` Andi Kleen @ 2007-07-11 23:48 ` Thomas Gleixner 2007-07-11 23:58 ` Ingo Molnar 1 sibling, 0 replies; 535+ messages in thread From: Thomas Gleixner @ 2007-07-11 23:48 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright Andi, On Thu, 2007-07-12 at 01:36 +0200, Andi Kleen wrote: > > The same is true of a lot of the APIC timer code. Sure, that patch has the > > actual conversion in it, and you don't have the cross-architecture issues, > > but more than 50% of the patch seems to be just cleanup that is > > independent of the actual switch-over, no? > > I don't think it's that much cleanup. One of my goals for x86-64 was always > to have it support modern x86 only; this means in particular having most of the > old bug workarounds removed. With the APIC timer merging a lot of that crap > gets back in. > > I would prefer to keep APIC code separate. Care to look at the patch? It _IS_ separate. Only HPET and PIT got shared. tglx ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 23:36 ` Andi Kleen 2007-07-11 23:48 ` Thomas Gleixner @ 2007-07-11 23:58 ` Ingo Molnar 2007-07-12 0:07 ` Andi Kleen 1 sibling, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-11 23:58 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Thomas Gleixner, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Andi Kleen <andi@firstfloor.org> wrote: > > The same is true of a lot of the APIC timer code. Sure, that patch > > has the actual conversion in it, and you don't have the > > cross-architecture issues, but more than 50% of the patch seems to > > be just cleanup that is independent of the actual switch-over, no? > > I don't think it's that much cleanup. One of my goals for x86-64 was > always to have it support modern x86 only; this means in particular > having most of the old bug workarounds removed. With the APIC timer merging a > lot of that crap gets back in. i dont think "clean, modern x86 code" will ever happen - x86_64 has and is going to have the exact same type of crap. And i'll say a weird thing now: that is a _blessing_. Why? Because this crap in question originates from the _diversity_ of the platform, and that is a much larger asset than the cost of the quirks can ever be! What you suggest does not end up in "clean 64-bit code", it ends up in "a bit less crappy 64-bit code", plus a lot of unnecessary duplication of effort and duplication of code - which easily introduces more crap total than it gets rid of ... The x86 architecture isnt fully analogous to a random piece of device hardware that evolves. It is more of a collector of random pieces of hardware that evolve independently, and as such it will always be exposed to human messups in a factorized way. "The pristine, clean architecture" is a utopia, and it will never arrive as long as humans design hardware. 
Under your scheme we'll end up with two sets of code which share some of the workarounds and dont share some others. No, in fact we _already_ ended up with two sets of code that are crappy in different ways. We had countless cases of bugs fixed in i386 but not fixed in x86_64. (and vice versa) Sharing code for similar hardware is almost always good. I think the PowerPC experience (although it is not a fully equivalent case) about them merging their 32-bit and 64-bit architectures was an overwhelmingly positive move, and x86 could learn a thing or two from that. The only way to fight crappy hardware is to map it, to understand it and to design as cleanly in the presence of it as possible. Having two sets of code for the same thing hardly serves that purpose. In fact, having _more_ crappy hardware _forces_ us to do a cleaner design (up to a pain threshold). Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 23:58 ` Ingo Molnar @ 2007-07-12 0:07 ` Andi Kleen 2007-07-12 0:15 ` Chris Wright 2007-07-12 0:18 ` Ingo Molnar 0 siblings, 2 replies; 535+ messages in thread From: Andi Kleen @ 2007-07-12 0:07 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Linus Torvalds, Thomas Gleixner, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright > i dont think "clean, modern x86 code" will ever happen - x86_64 has and > is going to have the exact same type of crap. And i'll say a weird thing Yes, but it will be new crap, but no old crap anymore. If you always pile the new crap on the old crap at some point the whole thing might fall over. 64bit was intended as a fresh start. Admittedly we're getting more and more workarounds too and sometimes when I want to remove cruft I find out it is still needed on some 64bit boxes (e.g. see my repeated attempts to clean up the irq 0 routing), but it's still much better than i386. > I think the PowerPC experience (although it is not a fully equivalent > case) about them merging their 32-bit and 64-bit architectures was an > overwhelmingly positive move, and x86 could learn a thing or two from > that. The equivalent to the powerpc way would be essentially to port i386 into the x86-64 code base and leave the really old hardware only in arch/i386. I've considered doing it, but it would be an awful lot of work and to tempt distributions to actually use the new port would require going back quite a long time. And at least immediately it would end up with three cases to do things instead of two like currently. -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-12 0:07 ` Andi Kleen @ 2007-07-12 0:15 ` Chris Wright 2007-07-12 0:18 ` Ingo Molnar 1 sibling, 0 replies; 535+ messages in thread From: Chris Wright @ 2007-07-12 0:15 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Linus Torvalds, Thomas Gleixner, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Andi Kleen (andi@firstfloor.org) wrote: > The equivalent to the powerpc way would be essentially to port i386 > into the x86-64 code base and leave the really old hardware only > in arch/i386. I've considered doing it, but it would be an awful > lot of work and to tempt distributions to actually use the new > port would require going back quite a long time. And at least > immediately it would end up with three cases to do things instead > of two like currently. Well that's just silly. The right way will never create 3 ways, but always keep the limit to the existing 2 where the differences aren't worth reconciling, and 1 for anything that is common. It will be a fair amount of work, so any constructive input you have upfront would be helpful. thanks, -chris ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-12 0:07 ` Andi Kleen 2007-07-12 0:15 ` Chris Wright @ 2007-07-12 0:18 ` Ingo Molnar 2007-07-12 0:37 ` Andi Kleen 1 sibling, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-12 0:18 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Thomas Gleixner, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Andi Kleen <andi@firstfloor.org> wrote: > > i dont think "clean, modern x86 code" will ever happen - x86_64 has > > and is going to have the exact same type of crap. And i'll say a > > weird thing > > Yes, but it will be new crap, but no old crap anymore. > > If you always pile the new crap on the old crap at some point the > whole thing might fall over. 64bit was intended as a fresh start. I think there's no such thing as a fresh start for a diverse architecture - the ia64 failure has proven that. x86_64 CPUs still do A20 emulation today (!). We still have people running industrial boards on real i386 DX CPUs, with the latest upstream kernel. 15 years ago an i386 DX was already quite obsolete. 32-bit is not going to go away in our lifetime, and we'll want to support it in a first-grade way. We better realize that prospect and have it right before our eyes in a single tree wherever it makes sense to share code - i'm certainly not talking about sharing mtrr/centaur.c or k8.c. (and i'm not necessarily suggesting to share io_apic.c either - although it's certainly borderline.) Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-12 0:18 ` Ingo Molnar @ 2007-07-12 0:37 ` Andi Kleen 0 siblings, 0 replies; 535+ messages in thread From: Andi Kleen @ 2007-07-12 0:37 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Linus Torvalds, Thomas Gleixner, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright On Thu, Jul 12, 2007 at 02:18:07AM +0200, Ingo Molnar wrote: > > > i dont think "clean, modern x86 code" will ever happen - x86_64 has > > > and is going to have the exact same type of crap. And i'll say a > > > weird thing > > > > Yes, but it will be new crap, but no old crap anymore. > > > > If you always pile the new crap on the old crap at some point the > > whole thing might fall over. 64bit was intended as a fresh start. > > I think there's no such thing as a fresh start for a diverse > architecture - the ia64 failure has proven that. x86_64 CPUs still do > A20 emulation today (!). x86-64 doesn't care about a lot of x86 baggage and a lot of things have even been obsoleted in the platform. In practice the backwards compatibility on x86 isn't that great either. For example a significant number of new systems don't even work correctly in PIC mode anymore. > We still have people running industrial boards > on real i386 DX CPUs, with the latest upstream kernel. 15 years ago an Yes, but those for example would be perfectly happy with an arch/i386 with all APIC and SMP code stripped out. Only the few people who still run dual P5s might not, but those could continue using old kernels. But eventually I think that would be the right clean way: arch/i386: a stripped-down port for truly old systems like the embedded 386 up to 586 or early 686. No SMP or APIC. arch/x86 supporting 32bit and 64bit for reasonably modern systems. 
NUMAQ/Voyager/P5-SMP/visual workstation gone [frankly the user base of those is too small to justify the code impact]. It's just quite ugly to get there, and when you think it through, the actual advantages of such a setup are likely not enough to justify the significant work to make it work. Also I wouldn't have any idea how to regression test significant changes to arch/i386 aimed at old systems. e.g. I don't think the powerpc people actually tried to still support really old systems where it is hard to do regression tests anymore, only really supported platforms. So while such a setup would be quite nice, the practical problems of getting there are nasty. Also I must admit I prefer hacking on new code instead. -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 22:50 ` Thomas Gleixner 2007-07-11 23:03 ` Chris Wright 2007-07-11 23:07 ` Linus Torvalds @ 2007-07-12 20:38 ` Matt Mackall 2 siblings, 0 replies; 535+ messages in thread From: Matt Mackall @ 2007-07-12 20:38 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright On Thu, Jul 12, 2007 at 12:50:19AM +0200, Thomas Gleixner wrote: > Linus, > > On Wed, 2007-07-11 at 15:20 -0700, Linus Torvalds wrote: > > For example, we can make sure that the code in question that actually > > touches the hardware stays exactly the same, and then just move the > > interfaces around - and basically guarantee that _zero_ hardware-specific > > issues pop up when you switch over, for example. > > > > That way there is a gradual change-over. > > Ok, I can try to split this down further. > > > The other approach (which would be nice _too_) is to actually try to > > convert one clock source at a time. Why is that not an option? > > We need to give control to the clock events core code once we convert > one clock event device. Having two competing subsystems controlling > different devices (e.g. PIT and APIC) is not really desirable. Can't you take the entire legacy clock system and wrap it as a single legacy clock source? Then you take bits out of the old system and put them as independent sources in the new system? When the legacy clock system is empty, you remove the legacy clock source. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 22:20 ` Linus Torvalds 2007-07-11 22:50 ` Thomas Gleixner @ 2007-07-11 22:51 ` Chris Wright 2007-07-11 22:58 ` Linus Torvalds 1 sibling, 1 reply; 535+ messages in thread From: Chris Wright @ 2007-07-11 22:51 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven, Chris Wright * Linus Torvalds (torvalds@linux-foundation.org) wrote: > For example, we can make sure that the code in question that actually > touches the hardware stays exactly the same, and then just move the > interfaces around - and basically guarantee that _zero_ hardware-specific > issues pop up when you switch over, for example. That's not quite right. Leaving the code unchanged caused breakage already. The PIT is damn stupid and can be sensitive to how quickly it's programmed. So code that enable/disable didn't change, but frequency with which it is called did and broke some random boxes. > The other approach (which would be nice _too_) is to actually try to > convert one clock source at a time. Why is that not an option? It was that way for x86_64, that's the first thing I fixed (since it was done by fully disabling all other timers but the one converted ;-) ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 22:51 ` Chris Wright @ 2007-07-11 22:58 ` Linus Torvalds 2007-07-12 2:53 ` Arjan van de Ven 0 siblings, 1 reply; 535+ messages in thread From: Linus Torvalds @ 2007-07-11 22:58 UTC (permalink / raw) To: Chris Wright Cc: Thomas Gleixner, Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel, Arjan van de Ven On Wed, 11 Jul 2007, Chris Wright wrote: > > That's not quite right. Leaving the code unchanged caused breakage > already. The PIT is damn stupid and can be sensitive to how quickly it's > programmed. So code that enable/disable didn't change, but frequency > with which it is called did and broke some random boxes. Sure. We cannot avoid *all* problems. Bugs happen. But at least we could try to make sure that there aren't totally unnecessary changes in that switch-over patch. Which there definitely were, as far as I can tell. Linus ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 22:58 ` Linus Torvalds @ 2007-07-12 2:53 ` Arjan van de Ven 0 siblings, 0 replies; 535+ messages in thread From: Arjan van de Ven @ 2007-07-12 2:53 UTC (permalink / raw) To: Linus Torvalds Cc: Chris Wright, Thomas Gleixner, Ingo Molnar, Andi Kleen, Andrew Morton, linux-kernel On Wed, 2007-07-11 at 15:58 -0700, Linus Torvalds wrote: > > On Wed, 11 Jul 2007, Chris Wright wrote: > > > > That's not quite right. Leaving the code unchanged caused breakage > > already. The PIT is damn stupid and can be sensitive to how quickly it's > > programmed. So code that enable/disable didn't change, but frequency > > with which it is called did and broke some random boxes. > > Sure. We cannot avoid *all* problems. Bugs happen. > > But at least we could try to make sure that there aren't totally > unnecessary changes in that switch-over patch. Which there definitely > were, as far as I can tell. one note is that the "talk differently to hardware" thing is in part already tested with the 32 bit tickless code; a lot of people (80% ?) are still using the 32 bit OS on their 64 bit machines, and the 32 bit code already talks in the "new way" to this hardware.... (and since Fedora 7 already ships tickless for 32 bit there are quite a lot of people using that in practice, in addition to the kernel.org kernel users) I would expect just about all the hardware interaction issues to have popped up already because of this "run 32 bit on 64 bit hardware" thing. -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 21:42 ` x86 status was Re: -mm merge plans for 2.6.23 Linus Torvalds 2007-07-11 22:04 ` Thomas Gleixner @ 2007-07-11 23:19 ` Ingo Molnar 2007-07-11 23:45 ` Linus Torvalds 1 sibling, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-07-11 23:19 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Chris Wright * Linus Torvalds <torvalds@linux-foundation.org> wrote: > That was *exactly* the same thing you talked about when I refused to > take the original timer changes into 2.6.20. You were talking about > how lots of people had worked really hard, and how it was really > tested. yes - i was (way too!) upset about it, and your reasoning for the rejection was hard (on us) but fair: you wanted a quiet 2.6.20, and you felt fundamentally uneasy about the patches. > And it damn well was NOT really tested, and 2.6.21 ended up being a > horribly painful experience (one of the more painful kernel releases > in recent times), and we ended up having to fix a *lot* of stuff. yes. We had 12 -hrt/dynticks merge related regressions between 2.6.21-rc1 and -final, and 4 after final. 
Here's a quick post-mortem.

12 fixes after -rc1:

 [PATCH] i386: Fix bogus return value in hpet_next_event()
 [PATCH] clockevents: remove bad designed sysfs support for now
 [PATCH] clocksource: Fix thinko in watchdog selection
 [PATCH] dynticks: fix hrtimer rounding error in next_timer_interrupt
 [PATCH] i386: add command line option "local_apic_timer_c2_ok"
 [PATCH] i386: disable local apic timer via command line or dmi quirk
 [PATCH] i386: clockevents fix breakage on Geode/Cyrix PIT
 [PATCH] i386: trust the PM-Timer calibration of the local APIC timer
 [PATCH] clockevents: Fix suspend/resume to disk hangs
 [PATCH] highres: do not run the TIMER_SOFTIRQ after switching to highres mode
 [PATCH] hrtimer: prevent overrun DoS in hrtimer_forward()
 [PATCH] Save/restore periodic tick information over suspend/resume implementations

4 fixes after -final:

 2.6.21.1: -
 2.6.21.2: [PATCH] clocksource: fix resume logic
 2.6.21.3: -
 2.6.21.4: -
 2.6.21.5: [PATCH] NOHZ: Rate limit the local softirq pending warning output
           [PATCH] Ignore bogus ACPI info for offline CPUs
           [PATCH] i386: HPET, check if the counter works
 2.6.21.6: -

it's all pretty quiet today on the dynticks regressions front. (there are no open regressions in either the upstream i386 code or in the devel patches we are aware of. Forced-HPET in -mm, which is not part of this queue in question [but which is done for dynticks], has one open regression.) The majority of the above bugs were in the infrastructure code. (the worst was the generic resume/suspend one fixed in 2.6.21.2) And sadly, a fair number of the infrastructure bugs were introduced during the frenetic clockevents/dynticks rewrites/redesigns we did between .20 and .21. That was a royally stupid mistake for us to do - instead of patiently waiting for the bugs to be shaken out we destabilized the infrastructure. (it was a "lets make this thing so nice that it's impossible to reject" instinctive gut reaction.) 
In the 'weird arch bugs' category, out of the 6 i386 breakages listed above, 'i386 legacy systems' was/is by far the worst offender: 4-5 were on such old (not 64-bit-capable) systems. (this is not really a surprise) While x86_64 certainly has weird crap hardware too, it probably is an order of magnitude fewer than i386 - just due to the sheer volume, time and diversity difference. (On the other hand if there's crap then it will be debugged/tested slower than on 32-bit, which offsets that advantage.) The most prominent bugs were the ones that were in the infrastructure - they affected many machines. (But i'd expect the infrastructure to be pretty robust by now.) The x86_64 hrt/dynticks code makes the x86_64 PIT driver (and hpet too) shared between the two architectures - which is perhaps another difference to the original i386 clockevents merge. We also integrated _all_ feedback we got, and we had the capacity and capability to fix whatever other feedback comes back - it just never came ... until today. But i fully agree with you that the cleanups should be done separately - it's just so hard to actually hack on the old hpet code (and to understand it to begin with) without first cleaning it up a bit so that it does not cause permanent brain damage ;) Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 23:19 ` Ingo Molnar @ 2007-07-11 23:45 ` Linus Torvalds 0 siblings, 0 replies; 535+ messages in thread From: Linus Torvalds @ 2007-07-11 23:45 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Andrew Morton, linux-kernel, Thomas Gleixner, Arjan van de Ven, Chris Wright On Thu, 12 Jul 2007, Ingo Molnar wrote: > > We also integrated _all_ feedback we got, and we had the capacity and > capability to fix whatever other feedback comes back - it just never > came ... until today. One thing I'll happily talk about is that while 2.6.21 was painful, you and Thomas in particular were both very responsible about the thing, so no, I'm not at all complaining or worried about it in that sense! I just really _really_ wish we could have two fairly stable releases in a row. I think 2.6.22 has the potential to be a pretty good setup, and I'd really like to avoid having another 2.6.21 immediately afterwards. So I'm not worried about integration and getting fixes when things break per se, but I *am* worried that this is an area where we've traditionally had lots of unexpected problems. And hey, maybe this time there will be none. I just still smart from the last time, so I'd prefer it to go more smoothly this time around. Linus ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 12:43 ` x86 status was " Andi Kleen 2007-07-11 17:33 ` Jesse Barnes 2007-07-11 17:42 ` Ingo Molnar @ 2007-07-11 18:14 ` Jeremy Fitzhardinge 2007-07-12 19:33 ` Christoph Lameter 3 siblings, 0 replies; 535+ messages in thread From: Jeremy Fitzhardinge @ 2007-07-11 18:14 UTC (permalink / raw) To: Andi Kleen Cc: Andrew Morton, linux-kernel, tglx, Tim Hockin, jesse.barnes, Adrian Bunk, dave young Andi Kleen wrote: > More review: > > >> xen-fix-x86-config-dependencies.patch >> xen-suppress-abs-symbol-warnings-for-unused-reloc-pointers.patch >> xen-cant-support-numa-yet.patch >> The first and third of these are just simple Kconfig updates, and the middle one just updates the list of symbols which shouldn't be warned about in CONFIG_RELOCATABLE's absolute symbol check. They're completely harmless but may prevent someone from generating an unbuildable .config. >> x86-fix-iounmaps-use-of-vm_structs-size-field.patch >> This appears to fix a real bug; the only question is whether x86-64 needs the same treatment. I'm not sure if the original bug reporter (dave young) has confirmed it fixed his problem. J ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-11 12:43 ` x86 status was " Andi Kleen ` (2 preceding siblings ...) 2007-07-11 18:14 ` Jeremy Fitzhardinge @ 2007-07-12 19:33 ` Christoph Lameter 2007-07-12 20:38 ` Andi Kleen 3 siblings, 1 reply; 535+ messages in thread From: Christoph Lameter @ 2007-07-12 19:33 UTC (permalink / raw) To: Andi Kleen Cc: Andrew Morton, linux-kernel, tglx, jeremy, Tim Hockin, jesse.barnes On Wed, 11 Jul 2007, Andi Kleen wrote: > These all need re-review: > > > i386-add-support-for-picopower-irq-router.patch > > make-arch-i386-kernel-setupcremapped_pgdat_init-static.patch > > arch-i386-kernel-i8253c-should-include-asm-timerh.patch > > make-arch-i386-kernel-io_apicctimer_irq_works-static-again.patch > > quicklist-support-for-x86_64.patch ^^^ That patch was supposed to be merged for 2.6.22 (you told me you forgot to merge it) and has been for a long time in mm. Does it now need to be rereviewed for 2.6.23? The other pieces of the quicklist patch for core and other arches were merged for 2.6.22. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: x86 status was Re: -mm merge plans for 2.6.23 2007-07-12 19:33 ` Christoph Lameter @ 2007-07-12 20:38 ` Andi Kleen 0 siblings, 0 replies; 535+ messages in thread From: Andi Kleen @ 2007-07-12 20:38 UTC (permalink / raw) To: Christoph Lameter Cc: Andi Kleen, Andrew Morton, linux-kernel, tglx, jeremy, Tim Hockin, jesse.barnes On Thu, Jul 12, 2007 at 12:33:43PM -0700, Christoph Lameter wrote: > On Wed, 11 Jul 2007, Andi Kleen wrote: > > > These all need re-review: > > > > > i386-add-support-for-picopower-irq-router.patch > > > make-arch-i386-kernel-setupcremapped_pgdat_init-static.patch > > > arch-i386-kernel-i8253c-should-include-asm-timerh.patch > > > make-arch-i386-kernel-io_apicctimer_irq_works-static-again.patch > > > > > quicklist-support-for-x86_64.patch > > ^^^ That patch was supposed to be merged for 2.6.22 (you told me you > forgot to merge it) and has been for a long time in mm. Does it now > need to be rereviewed for 2.6.23? The other pieces of the quicklist patch > for core and other arches were merged for 2.6.22. It's just on the normal re-review list. But it'll likely go in. -Andi ^ permalink raw reply [flat|nested] 535+ messages in thread
* generic clockevents/ (hr)time(r) patches was Re: -mm merge plans for 2.6.23 2007-07-10 8:31 -mm merge plans for 2.6.23 Andrew Morton ` (22 preceding siblings ...) 2007-07-11 12:43 ` x86 status was " Andi Kleen @ 2007-07-11 23:03 ` Thomas Gleixner 2007-07-11 23:57 ` Andrew Morton 2007-07-11 23:59 ` Andi Kleen 2007-07-12 0:54 ` fault vs invalidate race (Re: -mm merge plans for 2.6.23) Nick Piggin ` (2 subsequent siblings) 26 siblings, 2 replies; 535+ messages in thread From: Thomas Gleixner @ 2007-07-11 23:03 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andi Kleen Andrew, Linus, On Tue, 2007-07-10 at 01:31 -0700, Andrew Morton wrote: > When replying, please rewrite the subject suitably and try to Cc: the > appropriate developer(s). > i386-hpet-check-if-the-counter-works.patch > nohz-fix-nohz-x86-dyntick-idle-handling.patch > acpi-move-timer-broadcast-and-pmtimer-access-before-c3-arbiter-shutdown.patch > clockevents-fix-typo-in-acpi_pmc.patch > timekeeping-fixup-shadow-variable-argument.patch > timerc-cleanup-recently-introduced-whitespace-damage.patch > clockevents-remove-prototypes-of-removed-functions.patch > clockevents-fix-resume-logic.patch > clockevents-fix-device-replacement.patch > tick-management-spread-timer-interrupt.patch > highres-improve-debug-output.patch > hrtimer-speedup-hrtimer_enqueue.patch > pcspkr-use-the-global-pit-lock.patch > ntp-move-the-cmos-update-code-into-ntpc.patch > i386-pit-stop-only-when-in-periodic-or-oneshot-mode.patch > i386-remove-volatile-in-apicc.patch > i386-hpet-assumes-boot-cpu-is-0.patch > i386-move-pit-function-declarations-and-constants-to-correct-header-file.patch These got sent to Andi as well, but the patches are independent of the x86_64 conversion. These are bugfixes (nohz-fix-nohz-x86-dyntick-idle-handling.patch) and general improvements of the core code and the existing i386 code. Can we please merge the above now ? I can resend them or setup a git repo if you want. 
Andi, any objections against the above i386 fixlets ? Thanks, tglx ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: generic clockevents/ (hr)time(r) patches was Re: -mm merge plans for 2.6.23
  2007-07-11 23:03 ` generic clockevents/ (hr)time(r) patches " Thomas Gleixner
@ 2007-07-11 23:57   ` Andrew Morton
  2007-07-12  0:04     ` Thomas Gleixner
  2007-07-11 23:59   ` Andi Kleen
  1 sibling, 1 reply; 535+ messages in thread
From: Andrew Morton @ 2007-07-11 23:57 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andi Kleen

On Thu, 12 Jul 2007 01:03:28 +0200 Thomas Gleixner <tglx@linutronix.de> wrote:

> Andrew, Linus,
>
> On Tue, 2007-07-10 at 01:31 -0700, Andrew Morton wrote:
> > When replying, please rewrite the subject suitably and try to Cc: the
> > appropriate developer(s).
>
> > i386-hpet-check-if-the-counter-works.patch
> > nohz-fix-nohz-x86-dyntick-idle-handling.patch
> > acpi-move-timer-broadcast-and-pmtimer-access-before-c3-arbiter-shutdown.patch
> > clockevents-fix-typo-in-acpi_pmc.patch
> > timekeeping-fixup-shadow-variable-argument.patch
> > timerc-cleanup-recently-introduced-whitespace-damage.patch
> > clockevents-remove-prototypes-of-removed-functions.patch
> > clockevents-fix-resume-logic.patch
> > clockevents-fix-device-replacement.patch
> > tick-management-spread-timer-interrupt.patch
> > highres-improve-debug-output.patch
> > hrtimer-speedup-hrtimer_enqueue.patch
> > pcspkr-use-the-global-pit-lock.patch
> > ntp-move-the-cmos-update-code-into-ntpc.patch
> > i386-pit-stop-only-when-in-periodic-or-oneshot-mode.patch
> > i386-remove-volatile-in-apicc.patch
> > i386-hpet-assumes-boot-cpu-is-0.patch
> > i386-move-pit-function-declarations-and-constants-to-correct-header-file.patch
>
> These got sent to Andi as well, but the patches are independent of the
> x86_64 conversion.
>
> These are bugfixes (nohz-fix-nohz-x86-dyntick-idle-handling.patch) and
> general improvements of the core code and the existing i386 code.
>
> Can we please merge the above now ?
>
> I can resend them or setup a git repo if you want.
>
> Andi, any objections against the above i386 fixlets ?

They all look pretty innocuous to me.

Could you please take a second look, decide if any of them should also be
in 2.6.22.x and let me know?

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: generic clockevents/ (hr)time(r) patches was Re: -mm merge plans for 2.6.23
  2007-07-11 23:57 ` Andrew Morton
@ 2007-07-12  0:04   ` Thomas Gleixner
  2007-07-12  0:17     ` [stable] " Chris Wright
  2007-07-12  0:43     ` Andi Kleen
  0 siblings, 2 replies; 535+ messages in thread
From: Thomas Gleixner @ 2007-07-12 0:04 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andi Kleen, Stable Team

Andrew,

On Wed, 2007-07-11 at 16:57 -0700, Andrew Morton wrote:
> They all look pretty innocuous to me.
>
> Could you please take a second look, decide if any of them should also be
> in 2.6.22.x and let me know?

i386-hpet-check-if-the-counter-works.patch
pcspkr-use-the-global-pit-lock.patch

are the only candidates.

	tglx

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: [stable] generic clockevents/ (hr)time(r) patches was Re: -mm merge plans for 2.6.23
  2007-07-12  0:04 ` Thomas Gleixner
@ 2007-07-12  0:17   ` Chris Wright
  2007-07-12  0:43   ` Andi Kleen
  1 sibling, 0 replies; 535+ messages in thread
From: Chris Wright @ 2007-07-12 0:17 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Andrew Morton, Stable Team, Ingo Molnar, Linus Torvalds,
	linux-kernel, Andi Kleen

* Thomas Gleixner (tglx@linutronix.de) wrote:
> Andrew,
>
> On Wed, 2007-07-11 at 16:57 -0700, Andrew Morton wrote:
> > They all look pretty innocuous to me.
> >
> > Could you please take a second look, decide if any of them should also be
> > in 2.6.22.x and let me know?
>
> i386-hpet-check-if-the-counter-works.patch
> pcspkr-use-the-global-pit-lock.patch
>
> are the only candidates.

Yup, these have come through -stable a few times; it'd be great to get
them upstream, and into .22.y.

thanks,
-chris

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: generic clockevents/ (hr)time(r) patches was Re: -mm merge plans for 2.6.23
  2007-07-12  0:04 ` Thomas Gleixner
  2007-07-12  0:17   ` [stable] " Chris Wright
@ 2007-07-12  0:43   ` Andi Kleen
  2007-07-12  0:46     ` [stable] " Chris Wright
  1 sibling, 1 reply; 535+ messages in thread
From: Andi Kleen @ 2007-07-12 0:43 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Andrew Morton, linux-kernel, Linus Torvalds, Ingo Molnar, Stable Team

On Thursday 12 July 2007 02:04, Thomas Gleixner wrote:
> Andrew,
>
> On Wed, 2007-07-11 at 16:57 -0700, Andrew Morton wrote:
> > They all look pretty innocuous to me.
> >
> > Could you please take a second look, decide if any of them should also be
> > in 2.6.22.x and let me know?
>
> i386-hpet-check-if-the-counter-works.patch
> pcspkr-use-the-global-pit-lock.patch

Ok by me, although I suspect a lot of the cases where the hpet one was
needed got resolved with the PCI HPET resource fix. But it's still
safer to check.

However I don't think patches should go into stable before they hit
Linus' tree.

-Andi

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: [stable] generic clockevents/ (hr)time(r) patches was Re: -mm merge plans for 2.6.23
  2007-07-12  0:43 ` Andi Kleen
@ 2007-07-12  0:46   ` Chris Wright
  0 siblings, 0 replies; 535+ messages in thread
From: Chris Wright @ 2007-07-12 0:46 UTC (permalink / raw)
To: Andi Kleen
Cc: Thomas Gleixner, Andrew Morton, Ingo Molnar, Linus Torvalds,
	linux-kernel, Stable Team

* Andi Kleen (ak@suse.de) wrote:
> Ok by me, although I suspect a lot of the cases where the hpet one
> was needed got resolved with the PCI HPET resource fix. But it's still
> safer to check.
>
> However I don't think patches should go into stable before they
> hit Linus' tree.

Agreed, we're just waiting ;-)

thanks,
-chris

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: generic clockevents/ (hr)time(r) patches was Re: -mm merge plans for 2.6.23
  2007-07-11 23:03 ` generic clockevents/ (hr)time(r) patches " Thomas Gleixner
  2007-07-11 23:57   ` Andrew Morton
@ 2007-07-11 23:59   ` Andi Kleen
  2007-07-12  0:33     ` Andrew Morton
  1 sibling, 1 reply; 535+ messages in thread
From: Andi Kleen @ 2007-07-11 23:59 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: Andrew Morton, linux-kernel, Linus Torvalds, Ingo Molnar

> Andi, any objections against the above i386 fixlets ?

No, they are fine for me.

-Andi

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: generic clockevents/ (hr)time(r) patches was Re: -mm merge plans for 2.6.23
  2007-07-11 23:59 ` Andi Kleen
@ 2007-07-12  0:33   ` Andrew Morton
  0 siblings, 0 replies; 535+ messages in thread
From: Andrew Morton @ 2007-07-12 0:33 UTC (permalink / raw)
To: Andi Kleen; +Cc: Thomas Gleixner, linux-kernel, Linus Torvalds, Ingo Molnar

On Thu, 12 Jul 2007 01:59:23 +0200 Andi Kleen <ak@suse.de> wrote:

> > Andi, any objections against the above i386 fixlets ?
>
> No, they are fine for me.

OK, I queued them up for an akpm->linus transfer. Which will of course
be abandoned if an akpm->andi or andi->linus merge happens in the next
week or so.

^ permalink raw reply	[flat|nested] 535+ messages in thread
* fault vs invalidate race (Re: -mm merge plans for 2.6.23)
  2007-07-10  8:31 -mm merge plans for 2.6.23 Andrew Morton
                   ` (23 preceding siblings ...)
  2007-07-11 23:03 ` generic clockevents/ (hr)time(r) patches " Thomas Gleixner
@ 2007-07-12  0:54 ` Nick Piggin
  2007-07-12  2:31   ` block_page_mkwrite? (Re: fault vs invalidate race (Re: -mm merge plans for 2.6.23)) David Chinner
  2007-07-13  9:46 ` -mm merge plans for 2.6.23 Jan Engelhardt
  2007-07-17  8:55 ` unprivileged mounts (was: Re: -mm merge plans for 2.6.23) Andrew Morton
  26 siblings, 1 reply; 535+ messages in thread
From: Nick Piggin @ 2007-07-12 0:54 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Linux Memory Management

Andrew Morton wrote:
> mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> mm-merge-nopfn-into-fault.patch
> convert-hugetlbfs-to-use-vm_ops-fault.patch
> mm-remove-legacy-cruft.patch
> mm-debug-check-for-the-fault-vs-invalidate-race.patch
> mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
> invalidate_mapping_pages-add-cond_resched.patch
> ocfs2-release-page-lock-before-calling-page_mkwrite.patch
> document-page_mkwrite-locking.patch
>
> The fault-vs-invalidate race fix. I have belatedly learned that these need
> more work, so their state is uncertain.

The more work may turn out being too much for you (although it is nothing
exactly tricky that would introduce subtle bugs, it is a fair amount of
churn). However, in that case we can still merge these two:

mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
mm-fix-clear_page_dirty_for_io-vs-fault-race.patch

Which fix real bugs that need fixing (and will at least help to get some
of my patches off your hands).

--
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 535+ messages in thread
* block_page_mkwrite? (Re: fault vs invalidate race (Re: -mm merge plans for 2.6.23))
  2007-07-12  0:54 ` fault vs invalidate race (Re: -mm merge plans for 2.6.23) Nick Piggin
@ 2007-07-12  2:31   ` David Chinner
  2007-07-12  2:42     ` Nick Piggin
  0 siblings, 1 reply; 535+ messages in thread
From: David Chinner @ 2007-07-12 2:31 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, linux-kernel, Linux Memory Management,
	linux-fsdevel, xfs-oss

On Thu, Jul 12, 2007 at 10:54:57AM +1000, Nick Piggin wrote:
> Andrew Morton wrote:
> > The fault-vs-invalidate race fix. I have belatedly learned that these
> > need more work, so their state is uncertain.
>
> The more work may turn out being too much for you (although it is nothing
> exactly tricky that would introduce subtle bugs, it is a fair amount of
> churn).

OK, so does that mean we can finally get the block_page_mkwrite patches
merged? i.e.:

http://marc.info/?l=linux-kernel&m=117426058311032&w=2
http://marc.info/?l=linux-kernel&m=117426070111136&w=2

I've got up-to-date versions of them ready to go and they've been
consistently tested thanks to the XFSQA test I wrote for the bug that
it fixes. I've been holding them out-of-tree for months now because
->fault was supposed to supersede this interface.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: block_page_mkwrite? (Re: fault vs invalidate race (Re: -mm merge plans for 2.6.23))
  2007-07-12  2:31 ` block_page_mkwrite? (Re: fault vs invalidate race (Re: -mm merge plans for 2.6.23)) David Chinner
@ 2007-07-12  2:42   ` Nick Piggin
  0 siblings, 0 replies; 535+ messages in thread
From: Nick Piggin @ 2007-07-12 2:42 UTC (permalink / raw)
To: David Chinner
Cc: Andrew Morton, linux-kernel, Linux Memory Management,
	linux-fsdevel, xfs-oss

David Chinner wrote:
> On Thu, Jul 12, 2007 at 10:54:57AM +1000, Nick Piggin wrote:
>
>> Andrew Morton wrote:
>>
>>> The fault-vs-invalidate race fix. I have belatedly learned that these
>>> need more work, so their state is uncertain.
>>
>> The more work may turn out being too much for you (although it is nothing
>> exactly tricky that would introduce subtle bugs, it is a fair amount of
>> churn).
>
> OK, so does that mean we can finally get the block_page_mkwrite
> patches merged?
>
> i.e.:
>
> http://marc.info/?l=linux-kernel&m=117426058311032&w=2
> http://marc.info/?l=linux-kernel&m=117426070111136&w=2
>
> I've got up-to-date versions of them ready to go and they've been
> consistently tested thanks to the XFSQA test I wrote for the bug
> that it fixes. I've been holding them out-of-tree for months now
> because ->fault was supposed to supersede this interface.....

Yeah, as I've said, don't hold them back because of me. They are
relatively simple enough that I don't see why they couldn't be merged
in this window.

--
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23
  2007-07-10  8:31 -mm merge plans for 2.6.23 Andrew Morton
                   ` (24 preceding siblings ...)
  2007-07-12  0:54 ` fault vs invalidate race (Re: -mm merge plans for 2.6.23) Nick Piggin
@ 2007-07-13  9:46 ` Jan Engelhardt
  2007-07-13 23:09   ` Tilman Schmidt
  2007-07-17  8:55 ` unprivileged mounts (was: Re: -mm merge plans for 2.6.23) Andrew Morton
  26 siblings, 1 reply; 535+ messages in thread
From: Jan Engelhardt @ 2007-07-13 9:46 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux Kernel Mailing List, tilman

On Jul 10 2007 01:31, Andrew Morton wrote:

> use-menuconfig-objects-isdn-config_isdn_i4l.patch
>
> tilman didn't like it - might drop

Or replace it by his suggested patch ( http://lkml.org/lkml/2007/5/31/222 ).

Jan
--

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23
  2007-07-13  9:46 ` -mm merge plans for 2.6.23 Jan Engelhardt
@ 2007-07-13 23:09   ` Tilman Schmidt
  2007-07-14 10:02     ` Jan Engelhardt
  0 siblings, 1 reply; 535+ messages in thread
From: Tilman Schmidt @ 2007-07-13 23:09 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Andrew Morton, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 786 bytes --]

On 13.07.2007 11:46, Jan Engelhardt wrote:
> On Jul 10 2007 01:31, Andrew Morton wrote:
>
>> use-menuconfig-objects-isdn-config_isdn_i4l.patch
>>
>> tilman didn't like it - might drop
>
> Or replace by his suggestion patch ( http://lkml.org/lkml/2007/5/31/222 )

That posting was just a change proposal for the drivers/isdn/Kconfig
part, not a complete replacement for the entire patch. If you'd care to
reissue that patch with the modification I proposed, I'll gladly ack it.
Alternatively I can also send a full replacement patch if you prefer.

Regards,
Tilman

--
Tilman Schmidt                          E-Mail: tilman@imap.cc
Bonn, Germany
This message consists of 100% recycled bits.
Unopened, best before: (see reverse)

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 253 bytes --]

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: -mm merge plans for 2.6.23
  2007-07-13 23:09 ` Tilman Schmidt
@ 2007-07-14 10:02   ` Jan Engelhardt
       [not found]     ` <20070715131144.3467DFC040@xenon.ts.pxnet.com>
  0 siblings, 1 reply; 535+ messages in thread
From: Jan Engelhardt @ 2007-07-14 10:02 UTC (permalink / raw)
To: Tilman Schmidt; +Cc: Andrew Morton, Linux Kernel Mailing List

On Jul 14 2007 01:09, Tilman Schmidt wrote:
> On 13.07.2007 11:46, Jan Engelhardt wrote:
>> On Jul 10 2007 01:31, Andrew Morton wrote:
>>
>>> use-menuconfig-objects-isdn-config_isdn_i4l.patch
>>>
>>> tilman didn't like it - might drop
>>
>> Or replace by his suggestion patch ( http://lkml.org/lkml/2007/5/31/222 )
>
> That posting was just a change proposal for the drivers/isdn/Kconfig
> part, not a complete replacement for the entire patch. If you'd care to
> reissue that patch with the modification I proposed, I'll gladly ack it.
> Alternatively I can also send a full replacement patch if you prefer.

Since I did not really see much of a difference between our two
approaches, I'd be grateful if you could send a full replacement in the
hopes that I see the global picture.

Thanks,

Jan
--

^ permalink raw reply	[flat|nested] 535+ messages in thread
[parent not found: <20070715131144.3467DFC040@xenon.ts.pxnet.com>]
* Re: [PATCH] Use menuconfig objects - CONFIG_ISDN_I4L [v2]
       [not found] ` <20070715131144.3467DFC040@xenon.ts.pxnet.com>
@ 2007-07-18 18:18   ` Jan Engelhardt
  2007-07-18 18:22   ` [more PATCHes] Use menuconfig objects - CONFIG_ISDN_I4L Jan Engelhardt
  1 sibling, 0 replies; 535+ messages in thread
From: Jan Engelhardt @ 2007-07-18 18:18 UTC (permalink / raw)
To: Tilman Schmidt; +Cc: Andrew Morton, Karsten Keil, Linux Kernel Mailing List

> Remove a menu statement and several dependencies from the Kconfig files in
> the drivers/isdn tree as they have become unnecessary by the transformation
> of CONFIG_ISDN from "menu, config" into "menuconfig".
> (Modified version of a patch originally proposed by Jan Engelhardt.)
>
> Signed-off-by: Tilman Schmidt <tilman@imap.cc>
> ---
>
> This is my alternative proposal for
> use-menuconfig-objects-isdn-config_isdn_i4l.patch (patch 2 of 6 in Jan's
> "Use menuconfig objects 4 - ISDN" series). It must go between patch 1 and
> 3 of that series because they touch some of the same files.

This looks good to me. (Applies on top of today's git.)

Thanks,

Jan
--

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: [more PATCHes] Use menuconfig objects - CONFIG_ISDN_I4L
       [not found] ` <20070715131144.3467DFC040@xenon.ts.pxnet.com>
  2007-07-18 18:18   ` [PATCH] Use menuconfig objects - CONFIG_ISDN_I4L [v2] Jan Engelhardt
@ 2007-07-18 18:22   ` Jan Engelhardt
  2007-07-18 18:23     ` [patch 1/2] Use menuconfig objects - ISDN Jan Engelhardt
                        ` (2 more replies)
  1 sibling, 3 replies; 535+ messages in thread
From: Jan Engelhardt @ 2007-07-18 18:22 UTC (permalink / raw)
To: Tilman Schmidt; +Cc: Andrew Morton, Karsten Keil, Linux Kernel Mailing List

Hi,

here are two more changes I propose for the isdn submenu(s). They go on
top of Tilman's patch; each of the two following patches is independent
of the other.

Opinions please :)

Jan
--

^ permalink raw reply	[flat|nested] 535+ messages in thread
* [patch 1/2] Use menuconfig objects - ISDN
  2007-07-18 18:22 ` [more PATCHes] Use menuconfig objects - CONFIG_ISDN_I4L Jan Engelhardt
@ 2007-07-18 18:23   ` Jan Engelhardt
  2007-07-18 18:23   ` [patch 2/2] Use menuconfig objects - ISDN/Gigaset Jan Engelhardt
  2007-07-22  0:32   ` [more PATCHes] Use menuconfig objects - CONFIG_ISDN_I4L Tilman Schmidt
  2 siblings, 0 replies; 535+ messages in thread
From: Jan Engelhardt @ 2007-07-18 18:23 UTC (permalink / raw)
To: Tilman Schmidt; +Cc: Andrew Morton, Karsten Keil, Linux Kernel Mailing List

Unclutter the ISDN menu a tiny bit by moving ISDN4Linux and the CAPI 2.0
layers into their own menu.

Signed-off-by: Jan Engelhardt <jengelh@gmx.de>

---
 drivers/isdn/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.23/drivers/isdn/Kconfig
===================================================================
--- linux-2.6.23.orig/drivers/isdn/Kconfig
+++ linux-2.6.23/drivers/isdn/Kconfig
@@ -21,7 +21,7 @@ menuconfig ISDN
 
 if ISDN
 
-config ISDN_I4L
+menuconfig ISDN_I4L
 	tristate "Old ISDN4Linux (deprecated)"
 	---help---
 	  This driver allows you to use an ISDN adapter for networking
@@ -43,7 +43,7 @@ if ISDN_I4L
 source "drivers/isdn/i4l/Kconfig"
 endif
 
-config ISDN_CAPI
+menuconfig ISDN_CAPI
 	tristate "CAPI 2.0 subsystem"
 	help
 	  This provides the CAPI (Common ISDN Application Programming

^ permalink raw reply	[flat|nested] 535+ messages in thread
* [patch 2/2] Use menuconfig objects - ISDN/Gigaset
  2007-07-18 18:22 ` [more PATCHes] Use menuconfig objects - CONFIG_ISDN_I4L Jan Engelhardt
  2007-07-18 18:23   ` [patch 1/2] Use menuconfig objects - ISDN Jan Engelhardt
@ 2007-07-18 18:23   ` Jan Engelhardt
  2007-07-22  0:32   ` [more PATCHes] Use menuconfig objects - CONFIG_ISDN_I4L Tilman Schmidt
  2 siblings, 0 replies; 535+ messages in thread
From: Jan Engelhardt @ 2007-07-18 18:23 UTC (permalink / raw)
To: Tilman Schmidt; +Cc: Andrew Morton, Karsten Keil, Linux Kernel Mailing List

Change Kconfig objects from "menu, config" into "menuconfig" so that the
user can disable the whole feature without having to enter the menu first.

Signed-off-by: Jan Engelhardt <jengelh@gmx.de>

---
 drivers/isdn/gigaset/Kconfig | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

Index: linux-2.6.23/drivers/isdn/gigaset/Kconfig
===================================================================
--- linux-2.6.23.orig/drivers/isdn/gigaset/Kconfig
+++ linux-2.6.23/drivers/isdn/gigaset/Kconfig
@@ -1,6 +1,4 @@
-menu "Siemens Gigaset"
-
-config ISDN_DRV_GIGASET
+menuconfig ISDN_DRV_GIGASET
 	tristate "Siemens Gigaset support (isdn)"
 	select CRC_CCITT
 	select BITREVERSE
@@ -53,6 +51,4 @@ config GIGASET_UNDOCREQ
 	  features like configuration mode of M105, say yes.
 	  If you care about your device, say no.
 
-endif
-
-endmenu
+endif # ISDN_DRV_GIGASET != n

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: [more PATCHes] Use menuconfig objects - CONFIG_ISDN_I4L
  2007-07-18 18:22 ` [more PATCHes] Use menuconfig objects - CONFIG_ISDN_I4L Jan Engelhardt
  2007-07-18 18:23   ` [patch 1/2] Use menuconfig objects - ISDN Jan Engelhardt
  2007-07-18 18:23   ` [patch 2/2] Use menuconfig objects - ISDN/Gigaset Jan Engelhardt
@ 2007-07-22  0:32   ` Tilman Schmidt
  2 siblings, 0 replies; 535+ messages in thread
From: Tilman Schmidt @ 2007-07-22 0:32 UTC (permalink / raw)
To: Jan Engelhardt
Cc: Tilman Schmidt, Andrew Morton, Karsten Keil,
	Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 511 bytes --]

Hi,

sorry for the late reply.

On 18.07.2007 20:22, Jan Engelhardt wrote:
> here are two more changes I propose for the isdn submenu(s).
> They go on top of Tilman's patch; each of the two following patches is
> independent of another.
> Opinions please :)

These are fine by me.

Thanks
Tilman

--
Tilman Schmidt                          E-Mail: tilman@imap.cc
Bonn, Germany
This message consists of 100% recycled bits.
Unopened, best before: (see reverse)

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 253 bytes --]

^ permalink raw reply	[flat|nested] 535+ messages in thread
* unprivileged mounts (was: Re: -mm merge plans for 2.6.23)
  2007-07-10  8:31 -mm merge plans for 2.6.23 Andrew Morton
                   ` (25 preceding siblings ...)
  2007-07-13  9:46 ` -mm merge plans for 2.6.23 Jan Engelhardt
@ 2007-07-17  8:55 ` Andrew Morton
  26 siblings, 0 replies; 535+ messages in thread
From: Andrew Morton @ 2007-07-17 8:55 UTC (permalink / raw)
To: linux-kernel; +Cc: Christoph Hellwig, Al Viro, Miklos Szeredi

On Tue, 10 Jul 2007 01:31:52 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> unprivileged-mounts-add-user-mounts-to-the-kernel.patch
> unprivileged-mounts-allow-unprivileged-umount.patch
> unprivileged-mounts-account-user-mounts.patch
> unprivileged-mounts-propagate-error-values-from-clone_mnt.patch
> unprivileged-mounts-allow-unprivileged-bind-mounts.patch
> unprivileged-mounts-put-declaration-of-put_filesystem-in-fsh.patch
> unprivileged-mounts-allow-unprivileged-mounts.patch
> unprivileged-mounts-allow-unprivileged-fuse-mounts.patch
> unprivileged-mounts-propagation-inherit-owner-from-parent.patch
> unprivileged-mounts-add-no-submounts-flag.patch
>
> Don't know. Need to ping suitable developers over this work.

ping.

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review
@ 2007-08-11 10:44 Al Boldi
2007-08-12 4:17 ` Ingo Molnar
0 siblings, 1 reply; 535+ messages in thread
From: Al Boldi @ 2007-08-11 10:44 UTC (permalink / raw)
To: linux-kernel
Roman Zippel wrote:
> On Fri, 10 Aug 2007, Ingo Molnar wrote:
> > achieve that. It probably wont make a real difference, but it's really
> > easy for you to send and it's still very useful when one tries to
> > eliminate possibilities and when one wants to concentrate on the
> > remaining possibilities alone.
>
> The thing I'm afraid about CFS is its possible unpredictability, which
> would make it hard to reproduce problems and we may end up with users with
> unexplainable weird problems. That's the main reason I'm trying so hard to
> push for a design discussion.
>
> Just to give an idea here are two more examples of irregular behaviour,
> which are hopefully easier to reproduce.
>
> 1. Two simple busy loops, one of them is reniced to 15, according to my
> calculations the reniced task should get about 3.4% (1/(1.25^15+1)), but I
> get this:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4433 roman 20 0 1532 300 244 R 99.2 0.2 5:05.51 l
> 4434 roman 35 15 1532 72 16 R 0.7 0.1 0:10.62 l
>
> OTOH upto nice level 12 I get what I expect.
>
> 2. If I start 20 busy loops, initially I see in top that every task gets
> 5% and time increments equally (as it should):
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4492 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4491 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4490 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4489 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4488 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4487 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4486 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4485 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4484 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4483 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4482 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4481 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4480 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4479 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4478 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4477 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4476 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4475 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4474 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
> 4473 roman 20 0 1532 296 244 R 5.0 0.2 0:02.86 l
>
> But if I renice all of them to -15, the time every task gets is rather
> random:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4492 roman 5 -15 1532 68 16 R 1.0 0.1 0:07.95 l
> 4491 roman 5 -15 1532 68 16 R 4.3 0.1 0:07.62 l
> 4490 roman 5 -15 1532 68 16 R 3.3 0.1 0:07.50 l
> 4489 roman 5 -15 1532 68 16 R 7.6 0.1 0:07.80 l
> 4488 roman 5 -15 1532 68 16 R 9.6 0.1 0:08.31 l
> 4487 roman 5 -15 1532 68 16 R 3.3 0.1 0:07.59 l
> 4486 roman 5 -15 1532 68 16 R 6.6 0.1 0:07.08 l
> 4485 roman 5 -15 1532 68 16 R 10.0 0.1 0:07.31 l
> 4484 roman 5 -15 1532 68 16 R 8.0 0.1 0:07.30 l
> 4483 roman 5 -15 1532 68 16 R 7.0 0.1 0:07.34 l
> 4482 roman 5 -15 1532 68 16 R 1.0 0.1 0:05.84 l
> 4481 roman 5 -15 1532 68 16 R 1.0 0.1 0:07.16 l
> 4480 roman 5 -15 1532 68 16 R 3.3 0.1 0:07.00 l
> 4479 roman 5 -15 1532 68 16 R 1.0 0.1 0:06.66 l
> 4478 roman 5 -15 1532 68 16 R 8.6 0.1 0:06.96 l
> 4477 roman 5 -15 1532 68 16 R 8.6 0.1 0:07.63 l
> 4476 roman 5 -15 1532 68 16 R 9.6 0.1 0:07.38 l
> 4475 roman 5 -15 1532 68 16 R 1.3 0.1 0:07.09 l
> 4474 roman 5 -15 1532 68 16 R 2.3 0.1 0:07.97 l
> 4473 roman 5 -15 1532 296 244 R 1.0 0.2 0:07.73 l
That's because granularity increases when decreasing nice, and results in
larger timeslices, which affects smoothness negatively. chew.c easily shows
this problem with 2 background cpu-hogs at the same nice-level.
pid 908, prio 0, out for 8 ms, ran for 4 ms, load 37%
pid 908, prio 0, out for 8 ms, ran for 4 ms, load 37%
pid 908, prio 0, out for 8 ms, ran for 2 ms, load 26%
pid 908, prio 0, out for 8 ms, ran for 4 ms, load 38%
pid 908, prio 0, out for 2 ms, ran for 1 ms, load 47%
pid 908, prio -5, out for 23 ms, ran for 3 ms, load 14%
pid 908, prio -5, out for 17 ms, ran for 9 ms, load 35%
pid 908, prio -5, out for 18 ms, ran for 6 ms, load 27%
pid 908, prio -5, out for 20 ms, ran for 10 ms, load 34%
pid 908, prio -5, out for 9 ms, ran for 3 ms, load 30%
pid 908, prio -10, out for 69 ms, ran for 8 ms, load 11%
pid 908, prio -10, out for 35 ms, ran for 19 ms, load 36%
pid 908, prio -10, out for 58 ms, ran for 20 ms, load 26%
pid 908, prio -10, out for 34 ms, ran for 17 ms, load 34%
pid 908, prio -10, out for 58 ms, ran for 23 ms, load 28%
pid 908, prio -15, out for 164 ms, ran for 20 ms, load 11%
pid 908, prio -15, out for 21 ms, ran for 11 ms, load 36%
pid 908, prio -15, out for 21 ms, ran for 12 ms, load 37%
pid 908, prio -15, out for 115 ms, ran for 14 ms, load 11%
pid 908, prio -15, out for 27 ms, ran for 22 ms, load 45%
pid 908, prio -15, out for 125 ms, ran for 33 ms, load 21%
pid 908, prio -15, out for 54 ms, ran for 16 ms, load 22%
pid 908, prio -15, out for 34 ms, ran for 33 ms, load 49%
pid 908, prio -15, out for 94 ms, ran for 15 ms, load 14%
pid 908, prio -15, out for 29 ms, ran for 21 ms, load 42%
pid 908, prio -15, out for 108 ms, ran for 20 ms, load 15%
pid 908, prio -15, out for 44 ms, ran for 20 ms, load 31%
pid 908, prio -15, out for 34 ms, ran for 110 ms, load 76%
pid 908, prio -15, out for 132 ms, ran for 21 ms, load 14%
pid 908, prio -15, out for 42 ms, ran for 39 ms, load 48%
pid 908, prio -15, out for 57 ms, ran for 124 ms, load 68%
pid 908, prio -15, out for 44 ms, ran for 17 ms, load 28%
It looks like the larger the granularity, the more unpredictable it gets,
which probably means that this unpredictability exists even at smaller
granularity but is only exposed with larger ones.
Thanks!
--
Al
^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review
  2007-08-11 10:44 CFS review Al Boldi
@ 2007-08-12  4:17 ` Ingo Molnar
  2007-08-12 15:27   ` Al Boldi
  0 siblings, 1 reply; 535+ messages in thread
From: Ingo Molnar @ 2007-08-12 4:17 UTC (permalink / raw)
To: Al Boldi
Cc: Peter Zijlstra, Mike Galbraith, Roman Zippel, Linus Torvalds,
	Andrew Morton, linux-kernel

* Al Boldi <a1426z@gawab.com> wrote:

> That's because granularity increases when decreasing nice, and results
> in larger timeslices, which affects smoothness negatively. chew.c
> easily shows this problem with 2 background cpu-hogs at the same
> nice-level.
>
> pid 908, prio 0, out for 8 ms, ran for 4 ms, load 37%
> pid 908, prio 0, out for 8 ms, ran for 4 ms, load 37%
> pid 908, prio 0, out for 8 ms, ran for 2 ms, load 26%
> pid 908, prio 0, out for 8 ms, ran for 4 ms, load 38%
> pid 908, prio 0, out for 2 ms, ran for 1 ms, load 47%
>
> pid 908, prio -5, out for 23 ms, ran for 3 ms, load 14%
> pid 908, prio -5, out for 17 ms, ran for 9 ms, load 35%

yeah. Incidentally, i refined this last week and those nice-level
granularity changes went into the upstream scheduler code a few days
ago:

    commit 7cff8cf61cac15fa29a1ca802826d2bcbca66152
    Author: Ingo Molnar <mingo@elte.hu>
    Date:   Thu Aug 9 11:16:52 2007 +0200

        sched: refine negative nice level granularity

        refine the granularity of negative nice level tasks: let them
        reschedule more often to offset the effect of them consuming
        their wait_runtime proportionately slower. (This makes nice-0
        task scheduling smoother in the presence of negatively
        reniced tasks.)

        Signed-off-by: Ingo Molnar <mingo@elte.hu>

so could you please re-check chew jitter behavior with the latest
kernel? (i've attached the standalone patch below, it will apply
cleanly to rc2 too.)

when testing this, you might also want to try chew-max:

    http://redhat.com/~mingo/cfs-scheduler/tools/chew-max.c

i added a few trivial enhancements to chew.c: it tracks the maximum
latency, latency fluctuations (noise of scheduling) and allows it to
be run for a fixed amount of time.

NOTE: if you run chew from any indirect terminal (xterm, ssh, etc.)
it's best to capture/report chew numbers like this:

    ./chew-max 60 > chew.log

otherwise the indirect scheduling activities of the chew printout will
disturb the numbers.

> It looks like the larger the granularity, the more unpredictable it
> gets, which probably means that this unpredictability exists even at
> smaller granularity but is only exposed with larger ones.

this should only affect non-default nice levels. Note that 99.9%+ of
all userspace Linux CPU time is spent on default nice level 0, and
that is what controls the design. So the approach was always to first
get nice-0 right, and then to adjust the non-default nice level
behavior too, carefully, without hurting nice-0 - to refine all the
workloads where people (have to) use positive or negative nice levels.
In any case, please keep re-testing this so that we can adjust it.

	Ingo

--------------------->
commit 7cff8cf61cac15fa29a1ca802826d2bcbca66152
Author: Ingo Molnar <mingo@elte.hu>
Date:   Thu Aug 9 11:16:52 2007 +0200

    sched: refine negative nice level granularity

    refine the granularity of negative nice level tasks: let them
    reschedule more often to offset the effect of them consuming their
    wait_runtime proportionately slower. (This makes nice-0 task
    scheduling smoother in the presence of negatively reniced tasks.)

    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 7a632c5..e91db32 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -222,21 +222,25 @@ niced_granularity(struct sched_entity *curr, unsigned long granularity)
 {
 	u64 tmp;
 
+	if (likely(curr->load.weight == NICE_0_LOAD))
+		return granularity;
 	/*
-	 * Negative nice levels get the same granularity as nice-0:
+	 * Positive nice levels get the same granularity as nice-0:
 	 */
-	if (likely(curr->load.weight >= NICE_0_LOAD))
-		return granularity;
+	if (likely(curr->load.weight < NICE_0_LOAD)) {
+		tmp = curr->load.weight * (u64)granularity;
+		return (long) (tmp >> NICE_0_SHIFT);
+	}
 	/*
-	 * Positive nice level tasks get linearly finer
+	 * Negative nice level tasks get linearly finer
 	 * granularity:
 	 */
-	tmp = curr->load.weight * (u64)granularity;
 
 	/*
 	 * It will always fit into 'long':
 	 */
-	return (long) (tmp >> NICE_0_SHIFT);
+	tmp = curr->load.inv_weight * (u64)granularity;
+	return (long) (tmp >> WMULT_SHIFT);
 }
 
 static inline void

^ permalink raw reply related	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-12 4:17 ` Ingo Molnar @ 2007-08-12 15:27 ` Al Boldi 2007-08-12 15:52 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Al Boldi @ 2007-08-12 15:27 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Mike Galbraith, Roman Zippel, Linus Torvalds, Andrew Morton, linux-kernel Ingo Molnar wrote: > * Al Boldi <a1426z@gawab.com> wrote: > > That's because granularity increases when decreasing nice, and results > > in larger timeslices, which affects smoothness negatively. chew.c > > easily shows this problem with 2 background cpu-hogs at the same > > nice-level. > > > > pid 908, prio 0, out for 8 ms, ran for 4 ms, load 37% > > pid 908, prio 0, out for 8 ms, ran for 4 ms, load 37% > > pid 908, prio 0, out for 8 ms, ran for 2 ms, load 26% > > pid 908, prio 0, out for 8 ms, ran for 4 ms, load 38% > > pid 908, prio 0, out for 2 ms, ran for 1 ms, load 47% > > > > pid 908, prio -5, out for 23 ms, ran for 3 ms, load 14% > > pid 908, prio -5, out for 17 ms, ran for 9 ms, load 35% > > yeah. Incidentally, i refined this last week and those nice-level > granularity changes went into the upstream scheduler code a few days > ago: > > commit 7cff8cf61cac15fa29a1ca802826d2bcbca66152 > Author: Ingo Molnar <mingo@elte.hu> > Date: Thu Aug 9 11:16:52 2007 +0200 > > sched: refine negative nice level granularity > > refine the granularity of negative nice level tasks: let them > reschedule more often to offset the effect of them consuming > their wait_runtime proportionately slower. (This makes nice-0 > task scheduling smoother in the presence of negatively > reniced tasks.) > > Signed-off-by: Ingo Molnar <mingo@elte.hu> > > so could you please re-check chew jitter behavior with the latest > kernel? (i've attached the standalone patch below, it will apply cleanly > to rc2 too.) That fixes it, but by reducing granularity ctx is up 4-fold. 
Mind you, it does have an enormous effect on responsiveness, as negative nice with small granularity can't hijack the system any more. > when testing this, you might also want to try chew-max: > > http://redhat.com/~mingo/cfs-scheduler/tools/chew-max.c > > i added a few trivial enhancements to chew.c: it tracks the maximum > latency, latency fluctuations (noise of scheduling) and allows it to be > run for a fixed amount of time. Looks great. Thanks! > NOTE: if you run chew from any indirect terminal (xterm, ssh, etc.) it's > best to capture/report chew numbers like this: > > ./chew-max 60 > chew.log > > otherwise the indirect scheduling activities of the chew printout will > disturb the numbers. Correct; that's why I always boot into /bin/sh to get a clean run. But redirecting output to a file is also a good idea, provided that this file lives on something like tmpfs, otherwise you'll get flush out jitter. > > It looks like the larger the granularity, the more unpredictable it > > gets, which probably means that this unpredictability exists even at > > smaller granularity but is only exposed with larger ones. > > this should only affect non-default nice levels. Note that 99.9%+ of all > userspace Linux CPU time is spent on default nice level 0, and that is > what controls the design. So the approach was always to first get nice-0 > right, and then to adjust the non-default nice level behavior too, > carefully, without hurting nice-0 - to refine all the workloads where > people (have to) use positive or negative nice levels. In any case, > please keep re-testing this so that we can adjust it. The thing is, this unpredictability seems to exist even at nice level 0, but the smaller granularity covers it all up. It occasionally exhibits itself as hick-ups during transient heavy workload flux. But it's not easily reproducible. Thanks! -- Al ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-12 15:27 ` Al Boldi @ 2007-08-12 15:52 ` Ingo Molnar 2007-08-12 19:43 ` Al Boldi 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-12 15:52 UTC (permalink / raw) To: Al Boldi Cc: Peter Zijlstra, Mike Galbraith, Roman Zippel, Linus Torvalds, Andrew Morton, linux-kernel * Al Boldi <a1426z@gawab.com> wrote: > > so could you please re-check chew jitter behavior with the latest > > kernel? (i've attached the standalone patch below, it will apply > > cleanly to rc2 too.) > > That fixes it, but by reducing granularity ctx is up 4-fold. ok, great! (the context-switch rate is obviously up.) > Mind you, it does have an enormous effect on responsiveness, as > negative nice with small granularity can't hijack the system any more. ok. i'm glad you like the result :-) This makes reniced X (or any reniced app) more usable. > The thing is, this unpredictability seems to exist even at nice level > 0, but the smaller granularity covers it all up. It occasionally > exhibits itself as hick-ups during transient heavy workload flux. But > it's not easily reproducible. In general, "hickups" can be due to many, many reasons. If a task got indeed delayed by scheduling jitter that is provable, even if the behavior is hard to reproduce, by enabling CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y in your kernel. First clear all the stats: for N in /proc/*/task/*/sched; do echo 0 > $N; done then wait for the 'hickup' to happen, and once it happens capture the system state (after the hickup) via this script: http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh and tell me which specific task exhibited that 'hickup' and send me the debug output. Also, could you try the patch below as well? 
Thanks, Ingo --------------------------------> Subject: sched: fix sleeper bonus From: Ingo Molnar <mingo@elte.hu> Peter Ziljstra noticed that the sleeper bonus deduction code was not properly rate-limited: a task that scheduled more frequently would get a disproportionately large deduction. So limit the deduction to delta_exec. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_fair.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) Index: linux/kernel/sched_fair.c =================================================================== --- linux.orig/kernel/sched_fair.c +++ linux/kernel/sched_fair.c @@ -75,7 +75,7 @@ enum { unsigned int sysctl_sched_features __read_mostly = SCHED_FEAT_FAIR_SLEEPERS *1 | - SCHED_FEAT_SLEEPER_AVG *1 | + SCHED_FEAT_SLEEPER_AVG *0 | SCHED_FEAT_SLEEPER_LOAD_AVG *1 | SCHED_FEAT_PRECISE_CPU_LOAD *1 | SCHED_FEAT_START_DEBIT *1 | @@ -304,11 +304,9 @@ __update_curr(struct cfs_rq *cfs_rq, str delta_mine = calc_delta_mine(delta_exec, curr->load.weight, lw); if (cfs_rq->sleeper_bonus > sysctl_sched_granularity) { - delta = calc_delta_mine(cfs_rq->sleeper_bonus, - curr->load.weight, lw); - if (unlikely(delta > cfs_rq->sleeper_bonus)) - delta = cfs_rq->sleeper_bonus; - + delta = min(cfs_rq->sleeper_bonus, (u64)delta_exec); + delta = calc_delta_mine(delta, curr->load.weight, lw); + delta = min((u64)delta, cfs_rq->sleeper_bonus); cfs_rq->sleeper_bonus -= delta; delta_mine -= delta; } @@ -521,6 +519,8 @@ static void __enqueue_sleeper(struct cfs * Track the amount of bonus we've given to sleepers: */ cfs_rq->sleeper_bonus += delta_fair; + if (unlikely(cfs_rq->sleeper_bonus > sysctl_sched_runtime_limit)) + cfs_rq->sleeper_bonus = sysctl_sched_runtime_limit; schedstat_add(cfs_rq, wait_runtime, se->wait_runtime); } ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-12 15:52 ` Ingo Molnar @ 2007-08-12 19:43 ` Al Boldi 2007-08-21 10:58 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Al Boldi @ 2007-08-12 19:43 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Mike Galbraith, Roman Zippel, Linus Torvalds, Andrew Morton, linux-kernel Ingo Molnar wrote: > * Al Boldi <a1426z@gawab.com> wrote: > > The thing is, this unpredictability seems to exist even at nice level > > 0, but the smaller granularity covers it all up. It occasionally > > exhibits itself as hick-ups during transient heavy workload flux. But > > it's not easily reproducible. > > In general, "hickups" can be due to many, many reasons. If a task got > indeed delayed by scheduling jitter that is provable, even if the > behavior is hard to reproduce, by enabling CONFIG_SCHED_DEBUG=y and > CONFIG_SCHEDSTATS=y in your kernel. First clear all the stats: > > for N in /proc/*/task/*/sched; do echo 0 > $N; done > > then wait for the 'hickup' to happen, and once it happens capture the > system state (after the hickup) via this script: > > http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh > > and tell me which specific task exhibited that 'hickup' and send me the > debug output. Ok. > Also, could you try the patch below as well? Thanks, Looks ok, but I'm not sure which workload this is supposed to improve. There is one workload that still isn't performing well; it's a web-server workload that spawns 1K+ client procs. It can be emulated by using this: for i in `seq 1 to 3333`; do ping 10.1 -A > /dev/null & done The problem is that consecutive runs don't give consistent results and sometimes stalls. You may want to try that. Thanks! -- Al ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-12 19:43 ` Al Boldi @ 2007-08-21 10:58 ` Ingo Molnar 2007-08-21 22:27 ` Al Boldi 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-21 10:58 UTC (permalink / raw) To: Al Boldi Cc: Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel * Al Boldi <a1426z@gawab.com> wrote: > There is one workload that still isn't performing well; it's a > web-server workload that spawns 1K+ client procs. It can be emulated > by using this: > > for i in `seq 1 to 3333`; do ping 10.1 -A > /dev/null & done on bash i did this as: for ((i=0; i<3333; i++)); do ping 10.1 -A > /dev/null & done and this quickly creates a monster-runqueue with tons of ping tasks pending. (i replaced 10.1 with the IP of another box on the same LAN as the testbox) Is this what should happen? > The problem is that consecutive runs don't give consistent results and > sometimes stalls. You may want to try that. well, there's a natural saturation point after a few hundred tasks (depending on your CPU's speed), at which point there's no idle time left. From that point on things get slower progressively (and the ability of the shell to start new ping tasks is impacted as well), but that's expected on an overloaded system, isnt it? Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-21 10:58 ` Ingo Molnar @ 2007-08-21 22:27 ` Al Boldi 2007-08-24 13:45 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Al Boldi @ 2007-08-21 22:27 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel Ingo Molnar wrote: > * Al Boldi <a1426z@gawab.com> wrote: > > There is one workload that still isn't performing well; it's a > > web-server workload that spawns 1K+ client procs. It can be emulated > > by using this: > > > > for i in `seq 1 to 3333`; do ping 10.1 -A > /dev/null & done > > on bash i did this as: > > for ((i=0; i<3333; i++)); do ping 10.1 -A > /dev/null & done > > and this quickly creates a monster-runqueue with tons of ping tasks > pending. (i replaced 10.1 with the IP of another box on the same LAN as > the testbox) Is this what should happen? Yes, sometimes they start pending and sometimes they run immediately. > > The problem is that consecutive runs don't give consistent results and > > sometimes stalls. You may want to try that. > > well, there's a natural saturation point after a few hundred tasks > (depending on your CPU's speed), at which point there's no idle time > left. From that point on things get slower progressively (and the > ability of the shell to start new ping tasks is impacted as well), but > that's expected on an overloaded system, isnt it? Of course, things should get slower with higher load, but it should be consistent without stalls. To see this problem, make sure you boot into /bin/sh with the normal VGA console (ie. not fb-console). 
Then try each loop a few times to show different behaviour; loops like: # for ((i=0; i<3333; i++)); do ping 10.1 -A > /dev/null & done # for ((i=0; i<3333; i++)); do nice -99 ping 10.1 -A > /dev/null & done # { for ((i=0; i<3333; i++)); do ping 10.1 -A > /dev/null & done } > /dev/null 2>&1 Especially the last one sometimes causes a complete console lock-up, while the other two sometimes stall then surge periodically. BTW, I am also wondering how one might test threading behaviour wrt to startup and sync-on-exit with parent thread. This may not show any problems with small number of threads, but how does it scale with 1K+? Thanks! -- Al ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-21 22:27 ` Al Boldi @ 2007-08-24 13:45 ` Ingo Molnar 2007-08-25 22:27 ` Al Boldi 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-24 13:45 UTC (permalink / raw) To: Al Boldi Cc: Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel * Al Boldi <a1426z@gawab.com> wrote: > > > The problem is that consecutive runs don't give consistent results > > > and sometimes stalls. You may want to try that. > > > > well, there's a natural saturation point after a few hundred tasks > > (depending on your CPU's speed), at which point there's no idle time > > left. From that point on things get slower progressively (and the > > ability of the shell to start new ping tasks is impacted as well), > > but that's expected on an overloaded system, isnt it? > > Of course, things should get slower with higher load, but it should be > consistent without stalls. > > To see this problem, make sure you boot into /bin/sh with the normal > VGA console (ie. not fb-console). Then try each loop a few times to > show different behaviour; loops like: > > # for ((i=0; i<3333; i++)); do ping 10.1 -A > /dev/null & done > > # for ((i=0; i<3333; i++)); do nice -99 ping 10.1 -A > /dev/null & done > > # { for ((i=0; i<3333; i++)); do > ping 10.1 -A > /dev/null & > done } > /dev/null 2>&1 > > Especially the last one sometimes causes a complete console lock-up, > while the other two sometimes stall then surge periodically. ok. I think i might finally have found the bug causing this. Could you try the fix below, does your webserver thread-startup test work any better? Ingo ---------------------------> Subject: sched: fix startup penalty calculation From: Ingo Molnar <mingo@elte.hu> fix task startup penalty miscalculation: sysctl_sched_granularity is unsigned int and wait_runtime is long so we first have to convert it to long before turning it negative ... 
Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux/kernel/sched_fair.c =================================================================== --- linux.orig/kernel/sched_fair.c +++ linux/kernel/sched_fair.c @@ -1048,7 +1048,7 @@ static void task_new_fair(struct rq *rq, * -granularity/2, so initialize the task with that: */ if (sysctl_sched_features & SCHED_FEAT_START_DEBIT) - p->se.wait_runtime = -(sysctl_sched_granularity / 2); + p->se.wait_runtime = -((long)sysctl_sched_granularity / 2); __enqueue_entity(cfs_rq, se); } ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-24 13:45 ` Ingo Molnar @ 2007-08-25 22:27 ` Al Boldi 2007-08-25 23:15 ` Ingo Molnar 2007-08-29 3:37 ` Bill Davidsen 0 siblings, 2 replies; 535+ messages in thread From: Al Boldi @ 2007-08-25 22:27 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel Ingo Molnar wrote: > * Al Boldi <a1426z@gawab.com> wrote: > > > > The problem is that consecutive runs don't give consistent results > > > > and sometimes stalls. You may want to try that. > > > > > > well, there's a natural saturation point after a few hundred tasks > > > (depending on your CPU's speed), at which point there's no idle time > > > left. From that point on things get slower progressively (and the > > > ability of the shell to start new ping tasks is impacted as well), > > > but that's expected on an overloaded system, isnt it? > > > > Of course, things should get slower with higher load, but it should be > > consistent without stalls. > > > > To see this problem, make sure you boot into /bin/sh with the normal > > VGA console (ie. not fb-console). Then try each loop a few times to > > show different behaviour; loops like: > > > > # for ((i=0; i<3333; i++)); do ping 10.1 -A > /dev/null & done > > > > # for ((i=0; i<3333; i++)); do nice -99 ping 10.1 -A > /dev/null & done > > > > # { for ((i=0; i<3333; i++)); do > > ping 10.1 -A > /dev/null & > > done } > /dev/null 2>&1 > > > > Especially the last one sometimes causes a complete console lock-up, > > while the other two sometimes stall then surge periodically. > > ok. I think i might finally have found the bug causing this. Could you > try the fix below, does your webserver thread-startup test work any > better? It seems to help somewhat, but the problem is still visible. Even v20.3 on 2.6.22.5 didn't help. 
It does look related to ia-boosting, so I turned off __update_curr like Roman mentioned, which had an enormous smoothing effect, but then nice levels completely break down and lockup the system. There is another way to show the problem visually under X (vesa-driver), by starting 3 gears simultaneously, which after laying them out side-by-side need some settling time before smoothing out. Without __update_curr it's absolutely smooth from the start. Thanks! -- Al ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-25 22:27 ` Al Boldi @ 2007-08-25 23:15 ` Ingo Molnar 2007-08-26 16:27 ` Al Boldi 2007-08-29 3:42 ` Bill Davidsen 1 sibling, 2 replies; 535+ messages in thread From: Ingo Molnar @ 2007-08-25 23:15 UTC (permalink / raw) To: Al Boldi Cc: Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel * Al Boldi <a1426z@gawab.com> wrote: > > ok. I think i might finally have found the bug causing this. Could > > you try the fix below, does your webserver thread-startup test work > > any better? > > It seems to help somewhat, but the problem is still visible. Even > v20.3 on 2.6.22.5 didn't help. > > It does look related to ia-boosting, so I turned off __update_curr > like Roman mentioned, which had an enormous smoothing effect, but then > nice levels completely break down and lockup the system. you can turn sleeper-fairness off via: echo 28 > /proc/sys/kernel/sched_features another thing to try would be: echo 12 > /proc/sys/kernel/sched_features (that's the new-task penalty turned off.) Another thing to try would be to edit this: if (sysctl_sched_features & SCHED_FEAT_START_DEBIT) p->se.wait_runtime = -(sched_granularity(cfs_rq) / 2); to: if (sysctl_sched_features & SCHED_FEAT_START_DEBIT) p->se.wait_runtime = -sched_granularity(cfs_rq); and could you also check 20.4 on 2.6.22.5 perhaps, or very latest -git? (Peter has experienced smaller spikes with that.) Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-25 23:15 ` Ingo Molnar @ 2007-08-26 16:27 ` Al Boldi 2007-08-26 16:39 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Al Boldi @ 2007-08-26 16:27 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel Ingo Molnar wrote: > * Al Boldi <a1426z@gawab.com> wrote: > > > ok. I think i might finally have found the bug causing this. Could > > > you try the fix below, does your webserver thread-startup test work > > > any better? > > > > It seems to help somewhat, but the problem is still visible. Even > > v20.3 on 2.6.22.5 didn't help. > > > > It does look related to ia-boosting, so I turned off __update_curr > > like Roman mentioned, which had an enormous smoothing effect, but then > > nice levels completely break down and lockup the system. > > you can turn sleeper-fairness off via: > > echo 28 > /proc/sys/kernel/sched_features > > another thing to try would be: > > echo 12 > /proc/sys/kernel/sched_features > > (that's the new-task penalty turned off.) > > Another thing to try would be to edit this: > > if (sysctl_sched_features & SCHED_FEAT_START_DEBIT) > p->se.wait_runtime = -(sched_granularity(cfs_rq) / 2); > > to: > > if (sysctl_sched_features & SCHED_FEAT_START_DEBIT) > p->se.wait_runtime = -sched_granularity(cfs_rq); > > and could you also check 20.4 on 2.6.22.5 perhaps, or very latest -git? > (Peter has experienced smaller spikes with that.) Ok, I tried all your suggestions, but nothing works as smooth as removing __update_curr. Does the problem show on your machine with the 3x gears under X-vesa test? Thanks! -- Al ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-26 16:27 ` Al Boldi @ 2007-08-26 16:39 ` Ingo Molnar 2007-08-27 4:06 ` Al Boldi 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-26 16:39 UTC (permalink / raw) To: Al Boldi Cc: Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel * Al Boldi <a1426z@gawab.com> wrote: > > and could you also check 20.4 on 2.6.22.5 perhaps, or very latest > > -git? (Peter has experienced smaller spikes with that.) > > Ok, I tried all your suggestions, but nothing works as smooth as > removing __update_curr. could you send the exact patch that shows what you did? And could you also please describe it exactly which aspect of the workload you call 'smooth'. Could it be made quantitative somehow? Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-26 16:39 ` Ingo Molnar @ 2007-08-27 4:06 ` Al Boldi 2007-08-27 10:53 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Al Boldi @ 2007-08-27 4:06 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel Ingo Molnar wrote: > * Al Boldi <a1426z@gawab.com> wrote: > > > and could you also check 20.4 on 2.6.22.5 perhaps, or very latest > > > -git? (Peter has experienced smaller spikes with that.) > > > > Ok, I tried all your suggestions, but nothing works as smooth as > > removing __update_curr. > > could you send the exact patch that shows what you did? On 2.6.22.5-v20.3 (not v20.4): 340- curr->delta_exec += delta_exec; 341- 342- if (unlikely(curr->delta_exec > sysctl_sched_stat_granularity)) { 343:// __update_curr(cfs_rq, curr); 344- curr->delta_exec = 0; 345- } 346- curr->exec_start = rq_of(cfs_rq)->clock; > And could you > also please describe it exactly which aspect of the workload you call > 'smooth'. Could it be made quantitative somehow? The 3x gears test shows the startup problem in a really noticeable way. With v20.4 they startup surging and stalling periodically for about 10sec, then they are smooth. With v20.3 + above patch they startup completely smooth. Thanks! -- Al ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-27 4:06 ` Al Boldi @ 2007-08-27 10:53 ` Ingo Molnar 2007-08-27 14:46 ` Al Boldi 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-27 10:53 UTC (permalink / raw) To: Al Boldi Cc: Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel * Al Boldi <a1426z@gawab.com> wrote: > > could you send the exact patch that shows what you did? > > On 2.6.22.5-v20.3 (not v20.4): > > 340- curr->delta_exec += delta_exec; > 341- > 342- if (unlikely(curr->delta_exec > sysctl_sched_stat_granularity)) { > 343:// __update_curr(cfs_rq, curr); > 344- curr->delta_exec = 0; > 345- } > 346- curr->exec_start = rq_of(cfs_rq)->clock; ouch - this produces a really broken scheduler - with this we dont do any run-time accounting (!). Could you try the patch below instead, does this make 3x glxgears smooth again? (if yes, could you send me your Signed-off-by line as well.) Ingo ------------------------> Subject: sched: make the scheduler converge to the ideal latency From: Ingo Molnar <mingo@elte.hu> de-HZ-ification of the granularity defaults unearthed a pre-existing property of CFS: while it correctly converges to the granularity goal, it does not prevent run-time fluctuations in the range of [-gran ... +gran]. With the increase of the granularity due to the removal of HZ dependencies, this becomes visible in chew-max output (with 5 tasks running): out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40 out: 27 . 27. 32 | flu: 0 . 0 | ran: 17 . 13 | per: 44 . 40 out: 27 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 36 . 40 out: 29 . 27. 32 | flu: 2 . 0 | ran: 17 . 13 | per: 46 . 40 out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40 out: 29 . 27. 32 | flu: 0 . 0 | ran: 18 . 13 | per: 47 . 40 out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40 average slice is the ideal 13 msecs and the period is picture-perfect 40 msecs. 
But the 'ran' field fluctuates around 13.33 msecs and there's no mechanism in CFS to keep that from happening: it's a perfectly valid solution that CFS finds. the solution is to add a granularity/preemption rule that knows about the "target latency", which makes tasks that run longer than the ideal latency run a bit less. The simplest approach is to simply decrease the preemption granularity when a task overruns its ideal latency. For this we have to track how much the task executed since its last preemption. ( this adds a new field to task_struct, but we can eliminate that overhead in 2.6.24 by putting all the scheduler timestamps into an anonymous union. ) with this change in place, chew-max output is fluctuation-less all around: out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40 out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40 out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40 out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40 out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40 out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40 this patch has no impact on any fastpath or on any globally observable scheduling property. (unless you have sharp enough eyes to see millisecond-level ruckles in glxgears smoothness :-) Also, with this mechanism in place the formula for adaptive granularity can be simplified down to the obvious "granularity = latency/nr_running" calculation. 
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/sched.h | 1 + kernel/sched_fair.c | 43 ++++++++++++++----------------------------- 2 files changed, 15 insertions(+), 29 deletions(-) Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -904,6 +904,7 @@ struct sched_entity { u64 exec_start; u64 sum_exec_runtime; + u64 prev_sum_exec_runtime; u64 wait_start_fair; u64 sleep_start_fair; Index: linux/kernel/sched_fair.c =================================================================== --- linux.orig/kernel/sched_fair.c +++ linux/kernel/sched_fair.c @@ -225,30 +225,6 @@ static struct sched_entity *__pick_next_ * Calculate the preemption granularity needed to schedule every * runnable task once per sysctl_sched_latency amount of time. * (down to a sensible low limit on granularity) - * - * For example, if there are 2 tasks running and latency is 10 msecs, - * we switch tasks every 5 msecs. If we have 3 tasks running, we have - * to switch tasks every 3.33 msecs to get a 10 msecs observed latency - * for each task. We do finer and finer scheduling up to until we - * reach the minimum granularity value. - * - * To achieve this we use the following dynamic-granularity rule: - * - * gran = lat/nr - lat/nr/nr - * - * This comes out of the following equations: - * - * kA1 + gran = kB1 - * kB2 + gran = kA2 - * kA2 = kA1 - * kB2 = kB1 - d + d/nr - * lat = d * nr - * - * Where 'k' is key, 'A' is task A (waiting), 'B' is task B (running), - * '1' is start of time, '2' is end of time, 'd' is delay between - * 1 and 2 (during which task B was running), 'nr' is number of tasks - * running, 'lat' is the the period of each task. ('lat' is the - * sched_latency that we aim for.) 
*/ static long sched_granularity(struct cfs_rq *cfs_rq) @@ -257,7 +233,7 @@ sched_granularity(struct cfs_rq *cfs_rq) unsigned int nr = cfs_rq->nr_running; if (nr > 1) { - gran = gran/nr - gran/nr/nr; + gran = gran/nr; gran = max(gran, sysctl_sched_min_granularity); } @@ -668,7 +644,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st /* * Preempt the current task with a newly woken task if needed: */ -static void +static int __check_preempt_curr_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, struct sched_entity *curr, unsigned long granularity) { @@ -679,8 +655,11 @@ __check_preempt_curr_fair(struct cfs_rq * preempt the current task unless the best task has * a larger than sched_granularity fairness advantage: */ - if (__delta > niced_granularity(curr, granularity)) + if (__delta > niced_granularity(curr, granularity)) { resched_task(rq_of(cfs_rq)->curr); + return 1; + } + return 0; } static inline void @@ -725,6 +704,7 @@ static void put_prev_entity(struct cfs_r static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) { + unsigned long gran, delta_exec; struct sched_entity *next; /* @@ -741,8 +721,13 @@ static void entity_tick(struct cfs_rq *c if (next == curr) return; - __check_preempt_curr_fair(cfs_rq, next, curr, - sched_granularity(cfs_rq)); + gran = sched_granularity(cfs_rq); + delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime; + if (delta_exec > gran) + gran = 0; + + if (__check_preempt_curr_fair(cfs_rq, next, curr, gran)) + curr->prev_sum_exec_runtime = curr->sum_exec_runtime; } /************************************************** ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-27 10:53 ` Ingo Molnar @ 2007-08-27 14:46 ` Al Boldi 2007-08-27 20:41 ` Ingo Molnar 0 siblings, 1 reply; 535+ messages in thread From: Al Boldi @ 2007-08-27 14:46 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel Ingo Molnar wrote: > * Al Boldi <a1426z@gawab.com> wrote: > > > could you send the exact patch that shows what you did? > > > > On 2.6.22.5-v20.3 (not v20.4): > > > > 340- curr->delta_exec += delta_exec; > > 341- > > 342- if (unlikely(curr->delta_exec > sysctl_sched_stat_granularity)) > > { 343:// __update_curr(cfs_rq, curr); > > 344- curr->delta_exec = 0; > > 345- } > > 346- curr->exec_start = rq_of(cfs_rq)->clock; > > ouch - this produces a really broken scheduler - with this we dont do > any run-time accounting (!). Of course it's broken, and it's not meant as a fix, but this change allows you to see the amount of overhead as well as any miscalculations __update_curr incurs. In terms of overhead, __update_curr incurs ~3x slowdown, and in terms of run-time accounting it exhibits a ~10sec task-startup miscalculation. > Could you try the patch below instead, does this make 3x glxgears smooth > again? (if yes, could you send me your Signed-off-by line as well.) The task-startup stalling is still there for ~10sec. Can you see the problem on your machine? Thanks! -- Al ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review
  2007-08-27 14:46 ` Al Boldi
@ 2007-08-27 20:41 ` Ingo Molnar
  2007-08-28  4:37 ` Al Boldi
  0 siblings, 1 reply; 535+ messages in thread
From: Ingo Molnar @ 2007-08-27 20:41 UTC (permalink / raw)
To: Al Boldi
Cc: Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds,
	linux-kernel

* Al Boldi <a1426z@gawab.com> wrote:

> > Could you try the patch below instead, does this make 3x glxgears
> > smooth again? (if yes, could you send me your Signed-off-by line as
> > well.)
>
> The task-startup stalling is still there for ~10sec.
>
> Can you see the problem on your machine?

nope (i have no framebuffer setup) - but i can see some chew-max
latencies that occur when new tasks are started up. I _think_ it's
probably the same problem as yours.

could you try the patch below (which is the combo patch of my current
queue), ontop of head 50c46637aa? This makes chew-max behave better
during task mass-startup here.

	Ingo

----------------->
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -904,6 +904,7 @@ struct sched_entity {
 	u64	exec_start;
 	u64	sum_exec_runtime;
+	u64	prev_sum_exec_runtime;
 	u64	wait_start_fair;
 	u64	sleep_start_fair;
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1587,6 +1587,7 @@ static void __sched_fork(struct task_str
 	p->se.wait_start_fair = 0;
 	p->se.exec_start = 0;
 	p->se.sum_exec_runtime = 0;
+	p->se.prev_sum_exec_runtime = 0;
 	p->se.delta_exec = 0;
 	p->se.delta_fair_run = 0;
 	p->se.delta_fair_sleep = 0;
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -82,12 +82,12 @@ enum {
 };

 unsigned int sysctl_sched_features __read_mostly =
-	SCHED_FEAT_FAIR_SLEEPERS	*1 |
+	SCHED_FEAT_FAIR_SLEEPERS	*0 |
 	SCHED_FEAT_SLEEPER_AVG		*0 |
 	SCHED_FEAT_SLEEPER_LOAD_AVG	*1 |
 	SCHED_FEAT_PRECISE_CPU_LOAD	*1 |
-	SCHED_FEAT_START_DEBIT		*1 |
-	SCHED_FEAT_SKIP_INITIAL		*0;
+	SCHED_FEAT_START_DEBIT		*0 |
+	SCHED_FEAT_SKIP_INITIAL		*1;

 extern struct sched_class fair_sched_class;

@@ -225,39 +225,15 @@ static struct sched_entity *__pick_next_
 /*
  * Calculate the preemption granularity needed to schedule every
  * runnable task once per sysctl_sched_latency amount of time.
  * (down to a sensible low limit on granularity)
- *
- * For example, if there are 2 tasks running and latency is 10 msecs,
- * we switch tasks every 5 msecs. If we have 3 tasks running, we have
- * to switch tasks every 3.33 msecs to get a 10 msecs observed latency
- * for each task. We do finer and finer scheduling up to until we
- * reach the minimum granularity value.
- *
- * To achieve this we use the following dynamic-granularity rule:
- *
- *	gran = lat/nr - lat/nr/nr
- *
- * This comes out of the following equations:
- *
- *	kA1 + gran = kB1
- *	kB2 + gran = kA2
- *	kA2 = kA1
- *	kB2 = kB1 - d + d/nr
- *	lat = d * nr
- *
- * Where 'k' is key, 'A' is task A (waiting), 'B' is task B (running),
- * '1' is start of time, '2' is end of time, 'd' is delay between
- * 1 and 2 (during which task B was running), 'nr' is number of tasks
- * running, 'lat' is the the period of each task. ('lat' is the
- * sched_latency that we aim for.)
  */
-static long
+static unsigned long
 sched_granularity(struct cfs_rq *cfs_rq)
 {
 	unsigned int gran = sysctl_sched_latency;
 	unsigned int nr = cfs_rq->nr_running;

 	if (nr > 1) {
-		gran = gran/nr - gran/nr/nr;
+		gran = gran/nr;
 		gran = max(gran, sysctl_sched_min_granularity);
 	}

@@ -489,6 +465,9 @@ update_stats_wait_end(struct cfs_rq *cfs
 {
 	unsigned long delta_fair;

+	if (unlikely(!se->wait_start_fair))
+		return;
+
 	delta_fair = (unsigned long)min((u64)(2*sysctl_sched_runtime_limit),
 			(u64)(cfs_rq->fair_clock - se->wait_start_fair));

@@ -668,7 +647,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 /*
  * Preempt the current task with a newly woken task if needed:
  */
-static void
+static int
 __check_preempt_curr_fair(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			  struct sched_entity *curr, unsigned long granularity)
 {
@@ -679,8 +658,11 @@ __check_preempt_curr_fair(struct cfs_rq
 	 * preempt the current task unless the best task has
 	 * a larger than sched_granularity fairness advantage:
 	 */
-	if (__delta > niced_granularity(curr, granularity))
+	if (__delta > niced_granularity(curr, granularity)) {
 		resched_task(rq_of(cfs_rq)->curr);
+		return 1;
+	}
+	return 0;
 }

 static inline void
@@ -725,6 +707,7 @@ static void put_prev_entity(struct cfs_r
 static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
+	unsigned long gran, delta_exec;
 	struct sched_entity *next;

 	/*
@@ -741,8 +724,13 @@ static void entity_tick(struct cfs_rq *c
 	if (next == curr)
 		return;

-	__check_preempt_curr_fair(cfs_rq, next, curr,
-				  sched_granularity(cfs_rq));
+	gran = sched_granularity(cfs_rq);
+	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+	if (delta_exec > gran)
+		gran = 0;
+
+	if (__check_preempt_curr_fair(cfs_rq, next, curr, gran))
+		curr->prev_sum_exec_runtime = curr->sum_exec_runtime;
 }

 /**************************************************
@@ -1080,29 +1068,27 @@ static void task_new_fair(struct rq *rq,
 	sched_info_queued(p);

+	update_curr(cfs_rq);
 	update_stats_enqueue(cfs_rq, se);
-	/*
-	 * Child runs first: we let it run before the parent
-	 * until it reschedules once. We set up the key so that
-	 * it will preempt the parent:
-	 */
-	p->se.fair_key = current->se.fair_key -
-		niced_granularity(&rq->curr->se, sched_granularity(cfs_rq)) - 1;
+
 	/*
 	 * The first wait is dominated by the child-runs-first logic,
 	 * so do not credit it with that waiting time yet:
 	 */
 	if (sysctl_sched_features & SCHED_FEAT_SKIP_INITIAL)
-		p->se.wait_start_fair = 0;
+		se->wait_start_fair = 0;

 	/*
 	 * The statistical average of wait_runtime is about
 	 * -granularity/2, so initialize the task with that:
 	 */
-	if (sysctl_sched_features & SCHED_FEAT_START_DEBIT)
-		p->se.wait_runtime = -(sched_granularity(cfs_rq) / 2);
+	if (sysctl_sched_features & SCHED_FEAT_START_DEBIT) {
+		se->wait_runtime = -(sched_granularity(cfs_rq)/2);
+		schedstat_add(cfs_rq, wait_runtime, se->wait_runtime);
+	}

 	__enqueue_entity(cfs_rq, se);
+	resched_task(current);
 }

 #ifdef CONFIG_FAIR_GROUP_SCHED

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review
  2007-08-27 20:41 ` Ingo Molnar
@ 2007-08-28  4:37 ` Al Boldi
  2007-08-28  5:05 ` Linus Torvalds
  ` (2 more replies)
  0 siblings, 3 replies; 535+ messages in thread
From: Al Boldi @ 2007-08-28 4:37 UTC (permalink / raw)
To: Ingo Molnar
Cc: Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds,
	linux-kernel

Ingo Molnar wrote:
> * Al Boldi <a1426z@gawab.com> wrote:
> > > Could you try the patch below instead, does this make 3x glxgears
> > > smooth again? (if yes, could you send me your Signed-off-by line as
> > > well.)
> >
> > The task-startup stalling is still there for ~10sec.
> >
> > Can you see the problem on your machine?
>
> nope (i have no framebuffer setup)

No need for a framebuffer. All you need is X using the X.org vesa-driver.
Then start gears like this:

  # gears & gears & gears &

Then lay them out side by side to see the periodic stallings for ~10sec.

> - but i can see some chew-max
> latencies that occur when new tasks are started up. I _think_ it's
> probably the same problem as yours.

chew-max is great, but it's almost too accurate: it exposes every
scheduling glitch, so the startup glitch gets buried among the many other
glitches it reports. For example, it fluctuates all over the place using
this:

  # for ((i=0; i<9; i++)); do chew-max 60 > /dev/shm/chew$i.log & done

Also, chew-max locks up when __update_curr is disabled, which means that
the workload of chew-max is different from either the ping-startup loop
or the gears.

You really should try the gears test by any means, as the problem is
really pronounced there.

> could you try the patch below (which is the combo patch of my current
> queue), ontop of head 50c46637aa? This makes chew-max behave better
> during task mass-startup here.

Still no improvement.


Thanks!

--
Al

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-28 4:37 ` Al Boldi @ 2007-08-28 5:05 ` Linus Torvalds 2007-08-28 5:23 ` Al Boldi 2007-08-28 20:46 ` Valdis.Kletnieks 2007-08-28 7:43 ` Xavier Bestel 2007-08-29 4:18 ` Ingo Molnar 2 siblings, 2 replies; 535+ messages in thread From: Linus Torvalds @ 2007-08-28 5:05 UTC (permalink / raw) To: Al Boldi Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Andrew Morton, linux-kernel On Tue, 28 Aug 2007, Al Boldi wrote: > > No need for framebuffer. All you need is X using the X.org vesa-driver. > Then start gears like this: > > # gears & gears & gears & > > Then lay them out side by side to see the periodic stallings for ~10sec. I don't think this is a good test. Why? If you're not using direct rendering, what you have is the X server doing all the rendering, which in turn means that what you are testing is quite possibly not so much about the *kernel* scheduling, but about *X-server* scheduling! I'm sure the kernel scheduler has an impact, but what's more likely to be going on is that you're seeing effects that are indirect, and not necessarily at all even "good". For example, if the X server is the scheduling point, it's entirely possible that it ends up showing effects that are more due to the queueing of the X command stream than due to the scheduler - and that those stalls are simply due to *that*. One thing to try is to run the X connection in synchronous mode, which minimizes queueing issues. I don't know if gears has a flag to turn on synchronous X messaging, though. Many X programs take the "[+-]sync" flag to turn on synchronous mode, iirc. Linus ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-28 5:05 ` Linus Torvalds @ 2007-08-28 5:23 ` Al Boldi 2007-08-28 7:28 ` Mike Galbraith 2007-08-28 16:34 ` Linus Torvalds 2007-08-28 20:46 ` Valdis.Kletnieks 1 sibling, 2 replies; 535+ messages in thread From: Al Boldi @ 2007-08-28 5:23 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Andrew Morton, linux-kernel Linus Torvalds wrote: > On Tue, 28 Aug 2007, Al Boldi wrote: > > No need for framebuffer. All you need is X using the X.org vesa-driver. > > Then start gears like this: > > > > # gears & gears & gears & > > > > Then lay them out side by side to see the periodic stallings for ~10sec. > > I don't think this is a good test. > > Why? > > If you're not using direct rendering, what you have is the X server doing > all the rendering, which in turn means that what you are testing is quite > possibly not so much about the *kernel* scheduling, but about *X-server* > scheduling! > > I'm sure the kernel scheduler has an impact, but what's more likely to be > going on is that you're seeing effects that are indirect, and not > necessarily at all even "good". > > For example, if the X server is the scheduling point, it's entirely > possible that it ends up showing effects that are more due to the queueing > of the X command stream than due to the scheduler - and that those > stalls are simply due to *that*. > > One thing to try is to run the X connection in synchronous mode, which > minimizes queueing issues. I don't know if gears has a flag to turn on > synchronous X messaging, though. Many X programs take the "[+-]sync" flag > to turn on synchronous mode, iirc. I like your analysis, but how do you explain that these stalls vanish when __update_curr is disabled? Thanks! -- Al ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-28 5:23 ` Al Boldi @ 2007-08-28 7:28 ` Mike Galbraith 2007-08-28 7:36 ` Ingo Molnar 2007-08-28 16:34 ` Linus Torvalds 1 sibling, 1 reply; 535+ messages in thread From: Mike Galbraith @ 2007-08-28 7:28 UTC (permalink / raw) To: Al Boldi Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Andrew Morton, linux-kernel On Tue, 2007-08-28 at 08:23 +0300, Al Boldi wrote: > Linus Torvalds wrote: > > On Tue, 28 Aug 2007, Al Boldi wrote: > > > No need for framebuffer. All you need is X using the X.org vesa-driver. > > > Then start gears like this: > > > > > > # gears & gears & gears & > > > > > > Then lay them out side by side to see the periodic stallings for ~10sec. > > > > I don't think this is a good test. > > > > Why? > > > > If you're not using direct rendering, what you have is the X server doing > > all the rendering, which in turn means that what you are testing is quite > > possibly not so much about the *kernel* scheduling, but about *X-server* > > scheduling! > > > > I'm sure the kernel scheduler has an impact, but what's more likely to be > > going on is that you're seeing effects that are indirect, and not > > necessarily at all even "good". > > > > For example, if the X server is the scheduling point, it's entirely > > possible that it ends up showing effects that are more due to the queueing > > of the X command stream than due to the scheduler - and that those > > stalls are simply due to *that*. > > > > One thing to try is to run the X connection in synchronous mode, which > > minimizes queueing issues. I don't know if gears has a flag to turn on > > synchronous X messaging, though. Many X programs take the "[+-]sync" flag > > to turn on synchronous mode, iirc. > > I like your analysis, but how do you explain that these stalls vanish when > __update_curr is disabled? When you disable __update_curr(), you're utterly destroying the scheduler. 
There may well be a scheduler connection, but disabling __update_curr() doesn't tell you anything meaningful. Basically, you're letting all tasks run uninterrupted for just as long as they please (which is why busy loops lock your box solid as a rock). I'd suggest gathering some sched_debug stats or something... shoot, _anything_ but what you did :) -Mike ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-28 7:28 ` Mike Galbraith @ 2007-08-28 7:36 ` Ingo Molnar 0 siblings, 0 replies; 535+ messages in thread From: Ingo Molnar @ 2007-08-28 7:36 UTC (permalink / raw) To: Mike Galbraith Cc: Al Boldi, Linus Torvalds, Peter Zijlstra, Andrew Morton, linux-kernel * Mike Galbraith <efault@gmx.de> wrote: > > I like your analysis, but how do you explain that these stalls > > vanish when __update_curr is disabled? > > When you disable __update_curr(), you're utterly destroying the > scheduler. There may well be a scheduler connection, but disabling > __update_curr() doesn't tell you anything meaningful. Basically, you're > letting all tasks run uninterrupted for just as long as they please > (which is why busy loops lock your box solid as a rock). I'd suggest > gathering some sched_debug stats or something... [...] the output of the following would be nice: http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh captured while the gears are running. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-28 5:23 ` Al Boldi 2007-08-28 7:28 ` Mike Galbraith @ 2007-08-28 16:34 ` Linus Torvalds 2007-08-28 16:44 ` Arjan van de Ven 2007-08-28 16:45 ` Ingo Molnar 1 sibling, 2 replies; 535+ messages in thread From: Linus Torvalds @ 2007-08-28 16:34 UTC (permalink / raw) To: Al Boldi Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Andrew Morton, linux-kernel On Tue, 28 Aug 2007, Al Boldi wrote: > > I like your analysis, but how do you explain that these stalls vanish when > __update_curr is disabled? It's entirely possible that what happens is that the X scheduling is just a slightly unstable system - which effectively would turn a small scheduling difference into a *huge* visible difference. And the "small scheduling difference" might be as simple as "if the process slept for a while, we give it a bit more CPU time". And then you get into some unbalanced setup where the X scheduler makes it sleep even more, because it fills its buffers. Or something. I can easily see two schedulers that are trying to *individually* be "fair", fighting it out in a way where the end result is not very good. I do suspect it's probably a very interesting load, so I hope Ingo looks more at it, but I also suspect it's more than just the kernel scheduler. Linus ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-28 16:34 ` Linus Torvalds @ 2007-08-28 16:44 ` Arjan van de Ven 2007-08-28 16:45 ` Ingo Molnar 1 sibling, 0 replies; 535+ messages in thread From: Arjan van de Ven @ 2007-08-28 16:44 UTC (permalink / raw) To: Linus Torvalds Cc: Al Boldi, Ingo Molnar, Peter Zijlstra, Mike Galbraith, Andrew Morton, linux-kernel On Tue, 28 Aug 2007 09:34:03 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Tue, 28 Aug 2007, Al Boldi wrote: > > > > I like your analysis, but how do you explain that these stalls > > vanish when __update_curr is disabled? > > It's entirely possible that what happens is that the X scheduling is > just a slightly unstable system - which effectively would turn a > small scheduling difference into a *huge* visible difference. one thing that happens if you remove __update_curr is the following pattern (since no apps will get preempted involuntarily) app 1 submits a full frame worth of 3D stuff to X app 1 then sleeps/waits for that to complete X gets to run, has 1 full frame to render, does this X now waits for more input app 2 now gets to run and submits a full frame app 2 then sleeps again X gets to run again to process and complete X goes to sleep app 3 gets to run and submits a full frame app 3 then sleeps X runs X sleeps app 1 gets to submit a frame etc etc so without preemption happening, you can get "perfect" behavior, just because everything is perfectly doing 1 thing at a time cooperatively. once you start doing timeslices and enforcing limits on them, this "perfect pattern" will break down (remember this is all software rendering in the problem being described), and whatever you get won't be as perfect as this. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review
  2007-08-28 16:34 ` Linus Torvalds
  2007-08-28 16:44 ` Arjan van de Ven
@ 2007-08-28 16:45 ` Ingo Molnar
  2007-08-29  4:19 ` Al Boldi
  1 sibling, 1 reply; 535+ messages in thread
From: Ingo Molnar @ 2007-08-28 16:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: Al Boldi, Peter Zijlstra, Mike Galbraith, Andrew Morton,
	linux-kernel

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 28 Aug 2007, Al Boldi wrote:
> >
> > I like your analysis, but how do you explain that these stalls
> > vanish when __update_curr is disabled?
>
> It's entirely possible that what happens is that the X scheduling is
> just a slightly unstable system - which effectively would turn a small
> scheduling difference into a *huge* visible difference.

i think it's because disabling __update_curr() in essence removes the
ability of the scheduler to preempt tasks - that hack in essence results
in a non-scheduler. Hence the gears + X pair of tasks becomes a
synchronous pair of tasks in essence - and thus gears cannot "overload"
X.

Normally gears + X is an asynchronous pair of tasks, with gears (or
xperf, or devel versions of firefox, etc.) not being throttled at all
and thus being able to overload/spam the X server with requests. (And we
generally want to _reward_ asynchronicity and want to allow tasks to
overlap each other and we want each task to go as fast and as parallel
as it can.)

Eventually X's built-in "bad, abusive client" throttling code kicks in,
which, AFAIK, is pretty crude and might lead to such artifacts. But ...
it would be nice for an X person to confirm - and in any case i'll try
Al's workload - i thought i had a reproducer but i barked up the wrong
tree :-) My laptop doesn't run with the vesa driver, so i have no easy
reproducer for now.

( also, it would be nice if Al could try rc4 plus my latest scheduler
  tree as well - just on the odd chance that something got fixed
  meanwhile. In particular Mike's sleeper-bonus-limit fix could be
  related. )

	Ingo

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review
  2007-08-28 16:45 ` Ingo Molnar
@ 2007-08-29  4:19 ` Al Boldi
  2007-08-29  4:53 ` Ingo Molnar
  0 siblings, 1 reply; 535+ messages in thread
From: Al Boldi @ 2007-08-29 4:19 UTC (permalink / raw)
To: Ingo Molnar, Linus Torvalds
Cc: Peter Zijlstra, Mike Galbraith, Andrew Morton, linux-kernel

Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Tue, 28 Aug 2007, Al Boldi wrote:
> > > I like your analysis, but how do you explain that these stalls
> > > vanish when __update_curr is disabled?
> >
> > It's entirely possible that what happens is that the X scheduling is
> > just a slightly unstable system - which effectively would turn a small
> > scheduling difference into a *huge* visible difference.
>
> i think it's because disabling __update_curr() in essence removes the
> ability of scheduler to preempt tasks - that hack in essence results in
> a non-scheduler. Hence the gears + X pair of tasks becomes a synchronous
> pair of tasks in essence - and thus gears cannot "overload" X.

I have narrowed it down a bit to add_wait_runtime.

Patch 2.6.22.5-v20.4 like this:

346-	 * the two values are equal)
347-	 * [Note: delta_mine - delta_exec is negative]:
348-	 */
349://	add_wait_runtime(cfs_rq, curr, delta_mine - delta_exec);
350-}
351-
352-static void update_curr(struct cfs_rq *cfs_rq)

When disabling add_wait_runtime the stalls are gone. With this change the
scheduler is still usable, but it does not constitute a fix.

Now, even with this hack, uneven nice-levels between X and gears cause a
return of the stalls, so make sure both X and gears run on the same
nice-level when testing.

Again, the whole point of this workload is to expose scheduler glitches
regardless of whether X is broken or not, and my hunch is that this
problem looks suspiciously like an ia-boosting bug. What's important to
note is that by adjusting the scheduler we can effect a correction in
behaviour, which suggests that the problem should be fixable.

It's probably a good idea to look further into add_wait_runtime.


Thanks!

--
Al

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review
  2007-08-29  4:19 ` Al Boldi
@ 2007-08-29  4:53 ` Ingo Molnar
  2007-08-29  5:58 ` Al Boldi
  0 siblings, 1 reply; 535+ messages in thread
From: Ingo Molnar @ 2007-08-29 4:53 UTC (permalink / raw)
To: Al Boldi
Cc: Linus Torvalds, Peter Zijlstra, Mike Galbraith, Andrew Morton,
	linux-kernel, Keith Packard

* Al Boldi <a1426z@gawab.com> wrote:

> I have narrowed it down a bit to add_wait_runtime.

the scheduler is a red herring here. Could you "strace -ttt -TTT" one of
the glxgears instances (and send us the cfs-debug-info.sh output, with
CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y as requested before) so
that we can have a closer look?

i reproduced something similar and there the stall is caused by 1+
second select() delays on the X client<->server socket. The scheduler
stats agree with that:

  se.sleep_max             : 2194711437
  se.block_max             :          0
  se.exec_max              :     977446
  se.wait_max              :    1912321

the scheduler itself had a worst-case scheduling delay of 1.9
milliseconds for that glxgears instance (which is perfectly good - in
fact - excellent interactivity) - but the task had a maximum sleep time
of 2.19 seconds. So the 'glitch' was not caused by the scheduler.

	Ingo

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review
  2007-08-29  4:53 ` Ingo Molnar
@ 2007-08-29  5:58 ` Al Boldi
  2007-08-29  6:43 ` Ingo Molnar
  0 siblings, 1 reply; 535+ messages in thread
From: Al Boldi @ 2007-08-29 5:58 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Peter Zijlstra, Mike Galbraith, Andrew Morton,
	linux-kernel, Keith Packard

Ingo Molnar wrote:
> * Al Boldi <a1426z@gawab.com> wrote:
> > I have narrowed it down a bit to add_wait_runtime.
>
> the scheduler is a red herring here. Could you "strace -ttt -TTT" one of
> the glxgears instances (and send us the cfs-debug-info.sh output, with
> CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y as requested before) so
> that we can have a closer look?
>
> i reproduced something similar and there the stall is caused by 1+
> second select() delays on the X client<->server socket. The scheduler
> stats agree with that:
>
>   se.sleep_max             : 2194711437
>   se.block_max             :          0
>   se.exec_max              :     977446
>   se.wait_max              :    1912321
>
> the scheduler itself had a worst-case scheduling delay of 1.9
> milliseconds for that glxgears instance (which is perfectly good - in
> fact - excellent interactivity) - but the task had a maximum sleep time
> of 2.19 seconds. So the 'glitch' was not caused by the scheduler.

2.19sec is probably the time you need to lay them out side by side.

You see, gears sleeps when it is covered by another window, so once you
lay them out it starts running, and that's when they start to stutter
for about 10sec. After that they should run smoothly, because they used
up all the sleep bonus.

If you like, I can send you my straces, but they are kind of big though,
and you need to strace each gear, as stracing itself changes the
workload balance.

Let's first make sure what we are looking for:

1. start  # gears & gears & gears &
2. lay them out side by side, don't worry about sleep times yet
3. now they start stuttering for about 10sec
4. now they run out of sleep bonuses and smooth out

If this is the sequence you get on your machine, then try disabling
add_wait_runtime to see the difference.


Thanks!

--
Al

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review
  2007-08-29  5:58 ` Al Boldi
@ 2007-08-29  6:43 ` Ingo Molnar
  0 siblings, 0 replies; 535+ messages in thread
From: Ingo Molnar @ 2007-08-29 6:43 UTC (permalink / raw)
To: Al Boldi
Cc: Linus Torvalds, Peter Zijlstra, Mike Galbraith, Andrew Morton,
	linux-kernel, Keith Packard

* Al Boldi <a1426z@gawab.com> wrote:

> > se.sleep_max             : 2194711437
> > se.block_max             :          0
> > se.exec_max              :     977446
> > se.wait_max              :    1912321
> >
> > the scheduler itself had a worst-case scheduling delay of 1.9
> > milliseconds for that glxgears instance (which is perfectly good - in
> > fact - excellent interactivity) - but the task had a maximum sleep time
> > of 2.19 seconds. So the 'glitch' was not caused by the scheduler.
>
> 2.19sec is probably the time you need to lay them out side by side.
> [...]

nope, i cleared the stats after i laid the glxgears out, via:

	for N in /proc/*/sched; do echo 0 > $N; done

and i did the strace (which showed a 1+ seconds latency) while the
glxgears was not manipulated in any way.

> [...] You see, gears sleeps when it is covered by another window,
> [...]

none of the gear windows in my test were overlaid...

> [...] so once you lay them out it starts running, and that's when they
> start to stutter for about 10sec. After that they should run
> smoothly, because they used up all the sleep bonus.

that's plain wrong - at least in the test i've reproduced. In any case,
if that were the case then that would be visible in the stats. So please
send me your cfs-debug-info.sh output captured while the test is running
(with a CONFIG_SCHEDSTATS=y and CONFIG_SCHED_DEBUG=y kernel) - you can
download it from:

	http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

for best data, execute this before running it:

	for N in /proc/*/sched; do echo 0 > $N; done

> If you like, I can send you my straces, but they are kind of big
> though, and you need to strace each gear, as stracing itself changes
> the workload balance.

sure, send them along or upload them somewhere - but more importantly,
please send the cfs-debug-info.sh output.

	Ingo

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-28 5:05 ` Linus Torvalds 2007-08-28 5:23 ` Al Boldi @ 2007-08-28 20:46 ` Valdis.Kletnieks 1 sibling, 0 replies; 535+ messages in thread From: Valdis.Kletnieks @ 2007-08-28 20:46 UTC (permalink / raw) To: Linus Torvalds Cc: Al Boldi, Ingo Molnar, Peter Zijlstra, Mike Galbraith, Andrew Morton, linux-kernel [-- Attachment #1: Type: text/plain, Size: 964 bytes --] On Mon, 27 Aug 2007 22:05:37 PDT, Linus Torvalds said: > > > On Tue, 28 Aug 2007, Al Boldi wrote: > > > > No need for framebuffer. All you need is X using the X.org vesa-driver. > > Then start gears like this: > > > > # gears & gears & gears & > > > > Then lay them out side by side to see the periodic stallings for ~10sec. > > I don't think this is a good test. > > Why? > > If you're not using direct rendering, what you have is the X server doing > all the rendering, which in turn means that what you are testing is quite > possibly not so much about the *kernel* scheduling, but about *X-server* > scheduling! I wonder - can people who are doing this as a test please specify whether they're using an older X that has the libX11 or the newer libxcb code? That may have a similar impact as well. (libxcb is pretty new - it landed in Fedora Rawhide just about a month ago, after Fedora 7 shipped. Not sure what other distros have it now...) [-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --] ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review
  2007-08-28  4:37 ` Al Boldi
  2007-08-28  5:05 ` Linus Torvalds
@ 2007-08-28  7:43 ` Xavier Bestel
  2007-08-28  8:02 ` Ingo Molnar
  2007-08-29  4:18 ` Ingo Molnar
  2 siblings, 1 reply; 535+ messages in thread
From: Xavier Bestel @ 2007-08-28 7:43 UTC (permalink / raw)
To: Al Boldi
Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Andrew Morton,
	Linus Torvalds, linux-kernel

On Tue, 2007-08-28 at 07:37 +0300, Al Boldi wrote:
> start gears like this:
>
> # gears & gears & gears &
>
> Then lay them out side by side to see the periodic stallings for
> ~10sec.

Are you sure they are stalled? What you may have is simply gears running
at a multiple of your screen refresh rate, so they only appear stalled.

Plus, as Linus said, you're not really testing the kernel scheduler.
gears is a really bad benchmark; it should die.

	Xav

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review
  2007-08-28  7:43 ` Xavier Bestel
@ 2007-08-28  8:02 ` Ingo Molnar
  2007-08-28 19:19 ` Willy Tarreau
  0 siblings, 1 reply; 535+ messages in thread
From: Ingo Molnar @ 2007-08-28 8:02 UTC (permalink / raw)
To: Xavier Bestel
Cc: Al Boldi, Peter Zijlstra, Mike Galbraith, Andrew Morton,
	Linus Torvalds, linux-kernel

* Xavier Bestel <xavier.bestel@free.fr> wrote:

> Are you sure they are stalled ? What you may have is simple gears
> running at a multiple of your screen refresh rate, so they only appear
> stalled.
>
> Plus, as said Linus, you're not really testing the kernel scheduler.
> gears is really bad benchmark, it should die.

i like glxgears as long as it runs on _real_ 3D hardware, because there
it has minimal interaction with X and so it's an excellent visual test
about consistency of scheduling. You can immediately see (literally)
scheduling hiccups down to a millisecond range (!). In this sense, if
done and interpreted carefully, glxgears gives more feedback than many
audio tests. (audio latency problems are audible, but on most sound hw
it takes quite a bit of latency to produce an xrun.) So basically
glxgears is the "early warning system" that tells us about the potential
for xruns earlier than an xrun would happen for real.

[ of course you can also run all the other tools to get numeric results,
  but glxgears is nice in that it gives immediate visual feedback. ]

but i agree that on a non-accelerated X setup glxgears is not really
meaningful. It can have similar "spam the X server" effects as xperf.

	Ingo

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review
  2007-08-28  8:02 ` Ingo Molnar
@ 2007-08-28 19:19 ` Willy Tarreau
  2007-08-28 19:55 ` Ingo Molnar
  0 siblings, 1 reply; 535+ messages in thread
From: Willy Tarreau @ 2007-08-28 19:19 UTC (permalink / raw)
To: Ingo Molnar
Cc: Xavier Bestel, Al Boldi, Peter Zijlstra, Mike Galbraith,
	Andrew Morton, Linus Torvalds, linux-kernel

On Tue, Aug 28, 2007 at 10:02:18AM +0200, Ingo Molnar wrote:
>
> * Xavier Bestel <xavier.bestel@free.fr> wrote:
>
> > Are you sure they are stalled ? What you may have is simple gears
> > running at a multiple of your screen refresh rate, so they only appear
> > stalled.
> >
> > Plus, as said Linus, you're not really testing the kernel scheduler.
> > gears is really bad benchmark, it should die.
>
> i like glxgears as long as it runs on _real_ 3D hardware, because there
> it has minimal interaction with X and so it's an excellent visual test
> about consistency of scheduling. You can immediately see (literally)
> scheduling hickups down to a millisecond range (!). In this sense, if
> done and interpreted carefully, glxgears gives more feedback than many
> audio tests. (audio latency problems are audible, but on most sound hw
> it takes quite a bit of latency to produce an xrun.) So basically
> glxgears is the "early warning system" that tells us about the potential
> for xruns earlier than an xrun would happen for real.
>
> [ of course you can also run all the other tools to get numeric results,
>   but glxgears is nice in that it gives immediate visual feedback. ]

Al could also test ocbench, which brings visual feedback without
stressing the X server: http://linux.1wt.eu/sched/

I packaged it exactly for this problem and it has already helped. It
only touches X after each loop, so if you run it with a large run time,
X is barely solicited at all.

Willy

^ permalink raw reply	[flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-28 19:19 ` Willy Tarreau @ 2007-08-28 19:55 ` Ingo Molnar 0 siblings, 0 replies; 535+ messages in thread From: Ingo Molnar @ 2007-08-28 19:55 UTC (permalink / raw) To: Willy Tarreau Cc: Xavier Bestel, Al Boldi, Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel * Willy Tarreau <w@1wt.eu> wrote: > On Tue, Aug 28, 2007 at 10:02:18AM +0200, Ingo Molnar wrote: > > > > * Xavier Bestel <xavier.bestel@free.fr> wrote: > > > > > Are you sure they are stalled ? What you may have is simple gears > > > running at a multiple of your screen refresh rate, so they only appear > > > stalled. > > > > > > Plus, as said Linus, you're not really testing the kernel scheduler. > > > gears is really bad benchmark, it should die. > > > > i like glxgears as long as it runs on _real_ 3D hardware, because there > > it has minimal interaction with X and so it's an excellent visual test > > about consistency of scheduling. You can immediately see (literally) > > scheduling hickups down to a millisecond range (!). In this sense, if > > done and interpreted carefully, glxgears gives more feedback than many > > audio tests. (audio latency problems are audible, but on most sound hw > > it takes quite a bit of latency to produce an xrun.) So basically > > glxgears is the "early warning system" that tells us about the potential > > for xruns earlier than an xrun would happen for real. > > > > [ of course you can also run all the other tools to get numeric results, > > but glxgears is nice in that it gives immediate visual feedback. ] > > Al could also test ocbench, which brings visual feedback without > harnessing the X server : http://linux.1wt.eu/sched/ > > I packaged it exactly for this problem and it has already helped. It > uses X after each loop, so if you run it with large run time, X is > nearly not sollicitated. 
yeah, and ocbench is one of my favorite cross-task-fairness tests - i don't release a CFS patch without checking it with ocbench first :-) Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
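Willy's description of ocbench above - competing CPU-bound loops that only touch X once per iteration - suggests a scheduler-fairness probe that needs no display at all. Below is a minimal stand-in in Python (not ocbench itself; the function names and parameters are illustrative): each forked worker burns CPU and records the worst gap between loop completions, and under a fair scheduler the per-worker worst gaps should stay comparable.

```python
# Sketch of an ocbench-style fairness probe. ocbench itself draws via X;
# this stripped-down stand-in only measures loop-completion gaps, so X is
# out of the picture entirely. POSIX-only (uses os.fork).
import os
import time

def worker(loops=50, spin=20000):
    """CPU-bound loop; return the largest gap between loop completions."""
    worst = 0.0
    prev = time.monotonic()
    for _ in range(loops):
        x = 0
        for i in range(spin):          # burn CPU, no syscalls in the loop
            x += i * i
        now = time.monotonic()
        worst = max(worst, now - prev)
        prev = now
    return worst

def run(nproc=4):
    """Fork nproc competitors; collect each child's worst stall via a pipe."""
    pipes = []
    for _ in range(nproc):
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:                   # child: run the loop, report, exit
            os.close(r)
            os.write(w, repr(worker()).encode())
            os._exit(0)
        os.close(w)
        pipes.append(r)
    worsts = []
    for r in pipes:
        worsts.append(float(os.read(r, 64).decode()))
        os.close(r)
        os.wait()
    return worsts

if __name__ == "__main__":
    for i, w in enumerate(run()):
        print("worker %d: worst gap %.1f ms" % (i, w * 1e3))
```

Running a handful of these next to a CPU hog makes unfairness visible as one worker's worst gap ballooning while the others stay flat.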
* Re: CFS review 2007-08-28 4:37 ` Al Boldi 2007-08-28 5:05 ` Linus Torvalds 2007-08-28 7:43 ` Xavier Bestel @ 2007-08-29 4:18 ` Ingo Molnar 2007-08-29 4:29 ` Keith Packard 2007-08-29 4:40 ` Mike Galbraith 2 siblings, 2 replies; 535+ messages in thread From: Ingo Molnar @ 2007-08-29 4:18 UTC (permalink / raw) To: Al Boldi Cc: Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel, Keith Packard * Al Boldi <a1426z@gawab.com> wrote: > No need for framebuffer. All you need is X using the X.org > vesa-driver. Then start gears like this: > > # gears & gears & gears & > > Then lay them out side by side to see the periodic stallings for > ~10sec. i just tried something similar (by adding Option "NoDRI" to xorg.conf) and i'm wondering how it can be smooth on vesa-driver at all. I tested it on a Core2Duo box and software rendering manages to do about 3 frames per second. (although glxgears itself thinks it does ~600 fps) If i start 3 glxgears then they do ~1 frame per second each. This is on Fedora 7 with xorg-x11-server-Xorg-1.3.0.0-9.fc7 and xorg-x11-drv-i810-2.0.0-4.fc7. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-29 4:18 ` Ingo Molnar @ 2007-08-29 4:29 ` Keith Packard 2007-08-29 4:46 ` Ingo Molnar 2007-08-29 4:40 ` Mike Galbraith 1 sibling, 1 reply; 535+ messages in thread From: Keith Packard @ 2007-08-29 4:29 UTC (permalink / raw) To: Ingo Molnar Cc: keith.packard, Al Boldi, Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1117 bytes --] On Wed, 2007-08-29 at 06:18 +0200, Ingo Molnar wrote: > > Then lay them out side by side to see the periodic stallings for > > ~10sec. The X scheduling code isn't really designed to handle software GL well; the requests can be very expensive to execute, and yet are specified as atomic operations (sigh). > i just tried something similar (by adding Option "NoDRI" to xorg.conf) > and i'm wondering how it can be smooth on vesa-driver at all. I tested > it on a Core2Duo box and software rendering manages to do about 3 frames > per second. (although glxgears itself thinks it does ~600 fps) If i > start 3 glxgears then they do ~1 frame per second each. This is on > Fedora 7 with xorg-x11-server-Xorg-1.3.0.0-9.fc7 and > xorg-x11-drv-i810-2.0.0-4.fc7. Are you attempting to measure the visible updates by eye? Or are you using some other metric? In any case, attempting to measure anything using glxgears is a bad idea; it's not representative of *any* real applications. And then using software GL on top of that... What was the question again? -- keith.packard@intel.com [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-29 4:29 ` Keith Packard @ 2007-08-29 4:46 ` Ingo Molnar 2007-08-29 7:57 ` Keith Packard 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-29 4:46 UTC (permalink / raw) To: Keith Packard Cc: Al Boldi, Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel * Keith Packard <keith.packard@intel.com> wrote: > > > Then lay them out side by side to see the periodic stallings for > > > ~10sec. > > The X scheduling code isn't really designed to handle software GL > well; the requests can be very expensive to execute, and yet are > specified as atomic operations (sigh). [...] > Are you attempting to measure the visible updates by eye? Or are you > using some other metric? > > In any case, attempting to measure anything using glxgears is a bad > idea; it's not representative of *any* real applications. And then > using software GL on top of that... > > What was the question again? ok, i finally managed to reproduce the "artifact" myself on an older box. It goes like this: start up X with the vesa driver (or with NoDRI) to force software rendering. Then start up a couple of glxgears instances. Those glxgears instances update in a very "chunky", "stuttering" way - each glxgears instance runs/stops/runs/stops at a rate of about once per second, and this was reported to me as a potential CPU scheduler regression. at a quick glance this is not a CPU scheduler thing: X uses up 99% of CPU time, all the glxgears tasks (i needed 8 parallel instances to see the stalls) are using up the remaining 1% of CPU time. The ordering of the requests from the glxgears tasks is X's choice - and for a pathological overload situation like this we cannot blame X at all for not producing a completely smooth output. (although Xorg could perhaps try to schedule such requests more smoothly, in a more fine-grained way?) 
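The stall Ingo pinpoints in the trace below can also be pulled out mechanically: strace's -ttt/-T output carries an absolute timestamp up front and the syscall duration in a trailing <seconds> field, so a slow select() is simply a line whose trailing duration exceeds a threshold. A rough sketch (the helper name and the 0.5 s threshold are illustrative, not from the thread):

```python
# Scan an `strace -ttt -T` log for syscalls whose duration (the trailing
# "<seconds>" field) exceeds a threshold -- the way the 1.17 s select()
# stands out in the trace below.
import re

DURATION = re.compile(r'^(\d+\.\d+)\s+(\w+)\(.*<(\d+\.\d+)>\s*$')

def find_stalls(lines, threshold=0.5):
    """Return (timestamp, syscall, duration) tuples for slow calls."""
    stalls = []
    for line in lines:
        m = DURATION.match(line.strip())
        if m and float(m.group(3)) >= threshold:
            stalls.append((float(m.group(1)), m.group(2), float(m.group(3))))
    return stalls

sample = [
    '1188356998.440155 select(4, [3], [3], NULL, NULL) = 1 (out [3]) <1.173680>',
    '1188356998.404923 ioctl(3, FIONREAD, [0]) = 0 <0.000007>',
]
print(find_stalls(sample))   # only the 1.17 s select survives the filter
```

Against the full trace this picks out exactly the long select() waits on the X socket, confirming that the glxgears process spends its "stall" blocked on X rather than on the run queue.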
i've attached below a timestamped strace of one of the glxgears instances that shows such a 'stall': 1188356998.440155 select(4, [3], [3], NULL, NULL) = 1 (out [3]) <1.173680> the select (waiting for X) took 1.17 seconds. Ingo --------------------> Process 3644 attached - interrupt to quit 1188356997.810351 select(4, [3], [3], NULL, NULL) = 1 (out [3]) <0.594074> 1188356998.404580 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000115> 1188356998.404769 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356998.404880 gettimeofday({1188356998, 404893}, {4294967176, 0}) = 0 <0.000006> 1188356998.404923 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 1188356998.405054 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.405116 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000073> 1188356998.405221 gettimeofday({1188356998, 405231}, {4294967176, 0}) = 0 <0.000006> 1188356998.405258 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.405394 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.405461 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356998.405582 gettimeofday({1188356998, 405593}, {4294967176, 0}) = 0 <0.000006> 1188356998.405620 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.405656 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000108> 1188356998.405818 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356998.405856 gettimeofday({1188356998, 405866}, {4294967176, 0}) = 0 <0.000006> 1188356998.405993 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 1188356998.406032 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000009> 1188356998.406092 write(3, 
"\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000107> 1188356998.406232 gettimeofday({1188356998, 406242}, {4294967176, 0}) = 0 <0.000006> 1188356998.406269 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 1188356998.406305 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000065> 1188356998.406423 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000118> 1188356998.406573 gettimeofday({1188356998, 406583}, {4294967176, 0}) = 0 <0.000006> 1188356998.406610 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 1188356998.406646 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000125> 1188356998.406824 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356998.406863 gettimeofday({1188356998, 406873}, {4294967176, 0}) = 0 <0.000007> 1188356998.406900 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.407069 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.407131 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.407169 gettimeofday({1188356998, 407179}, {4294967176, 0}) = 0 <0.000007> 1188356998.407206 ioctl(3, FIONREAD, [0]) = 0 <0.000139> 1188356998.407376 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000009> 1188356998.407440 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.407478 gettimeofday({1188356998, 407487}, {4294967176, 0}) = 0 <0.000006> 1188356998.407649 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.407687 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000009> 1188356998.407747 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356998.407785 gettimeofday({1188356998, 407795}, {4294967176, 0}) = 0 <0.000140> 1188356998.407957 ioctl(3, FIONREAD, [0]) = 0 
<0.000007> 1188356998.407993 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000009> 1188356998.408052 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356998.408224 gettimeofday({1188356998, 408236}, {4294967176, 0}) = 0 <0.000007> 1188356998.408263 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 1188356998.408322 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000113> 1188356998.408490 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.408528 gettimeofday({1188356998, 408537}, {4294967176, 0}) = 0 <0.000006> 1188356998.408565 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 1188356998.408735 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000009> 1188356998.408796 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.408834 gettimeofday({1188356998, 408843}, {4294967176, 0}) = 0 <0.000005> 1188356998.408870 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.408905 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000174> 1188356998.409132 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.409170 gettimeofday({1188356998, 409178}, {4294967176, 0}) = 0 <0.000005> 1188356998.409205 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.409240 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.409469 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.409508 gettimeofday({1188356998, 409517}, {4294967176, 0}) = 0 <0.000005> 1188356998.409544 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.409579 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.409638 
write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.409675 gettimeofday({1188356998, 409684}, {4294967176, 0}) = 0 <0.000005> 1188356998.409711 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.409747 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.409805 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000308> 1188356998.410145 gettimeofday({1188356998, 410154}, {4294967176, 0}) = 0 <0.000006> 1188356998.410181 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.410215 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.410274 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.410311 gettimeofday({1188356998, 410320}, {4294967176, 0}) = 0 <0.000006> 1188356998.410347 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.410381 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000260> 1188356998.410697 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356998.410735 gettimeofday({1188356998, 410743}, {4294967176, 0}) = 0 <0.000006> 1188356998.410771 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.410805 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.410864 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.410902 gettimeofday({1188356998, 410910}, {4294967176, 0}) = 0 <0.000005> 1188356998.410937 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.410972 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.411030 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.411068 gettimeofday({1188356998, 411077}, {4294967176, 0}) = 0 <0.000005> 1188356998.411104 ioctl(3, FIONREAD, 
[0]) = 0 <0.000005> 1188356998.411139 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.411198 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.411235 gettimeofday({1188356998, 411244}, {4294967176, 0}) = 0 <0.000005> 1188356998.411271 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.411306 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.411377 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000010> 1188356998.412017 gettimeofday({1188356998, 412027}, {4294967176, 0}) = 0 <0.000006> 1188356998.412055 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.412089 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.412148 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.412185 gettimeofday({1188356998, 412194}, {4294967176, 0}) = 0 <0.000006> 1188356998.412221 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.412255 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.412313 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.412350 gettimeofday({1188356998, 412359}, {4294967176, 0}) = 0 <0.000006> 1188356998.412385 ioctl(3, FIONREAD, [0]) = 0 <0.000009> 1188356998.412776 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000009> 1188356998.412837 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.412874 gettimeofday({1188356998, 412883}, {4294967176, 0}) = 0 <0.000006> 1188356998.412910 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.412944 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 
1188356998.413003 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.413040 gettimeofday({1188356998, 413049}, {4294967176, 0}) = 0 <0.000006> 1188356998.413076 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.413110 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.413169 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.413206 gettimeofday({1188356998, 413214}, {4294967176, 0}) = 0 <0.000006> 1188356998.413241 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.413276 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.413334 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.413371 gettimeofday({1188356998, 413380}, {4294967176, 0}) = 0 <0.000005> 1188356998.413924 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.413961 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.414020 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.414057 gettimeofday({1188356998, 414066}, {4294967176, 0}) = 0 <0.000006> 1188356998.414093 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.414127 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.414186 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.414223 gettimeofday({1188356998, 414232}, {4294967176, 0}) = 0 <0.000006> 1188356998.414259 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.414293 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.414351 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000006> 1188356998.414388 gettimeofday({1188356998, 414397}, {4294967176, 0}) = 0 <0.000082> 1188356998.414503 
ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.414538 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.414601 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.414638 gettimeofday({1188356998, 414647}, {4294967176, 0}) = 0 <0.000005> 1188356998.414674 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.414709 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.414768 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.414818 gettimeofday({1188356998, 414827}, {4294967176, 0}) = 0 <0.000006> 1188356998.414854 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.414889 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.414948 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.414986 gettimeofday({1188356998, 414995}, {4294967176, 0}) = 0 <0.000005> 1188356998.415022 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.415057 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000010> 1188356998.415118 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.415156 gettimeofday({1188356998, 415165}, {4294967176, 0}) = 0 <0.000005> 1188356998.415192 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.415227 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.415287 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.415325 gettimeofday({1188356998, 415334}, {4294967176, 0}) = 0 <0.000006> 1188356998.415362 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.415397 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 
<0.001116> 1188356998.416602 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000040> 1188356998.416707 gettimeofday({1188356998, 416717}, {4294967176, 0}) = 0 <0.000038> 1188356998.416778 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.416813 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.416872 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356998.416911 gettimeofday({1188356998, 416919}, {4294967176, 0}) = 0 <0.000006> 1188356998.416946 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.416981 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.417039 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.417076 gettimeofday({1188356998, 417085}, {4294967176, 0}) = 0 <0.000006> 1188356998.417111 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.417146 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.417204 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.417241 gettimeofday({1188356998, 417250}, {4294967176, 0}) = 0 <0.000006> 1188356998.417277 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.417311 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.417370 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.417953 gettimeofday({1188356998, 417963}, {4294967176, 0}) = 0 <0.000037> 1188356998.418056 ioctl(3, FIONREAD, [0]) = 0 <0.000037> 1188356998.418158 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.418218 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.418255 gettimeofday({1188356998, 418263}, {4294967176, 0}) = 0 <0.000006> 
1188356998.418290 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.418325 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.418384 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000206> 1188356998.418656 gettimeofday({1188356998, 418666}, {4294967176, 0}) = 0 <0.000037> 1188356998.418759 ioctl(3, FIONREAD, [0]) = 0 <0.000038> 1188356998.418829 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.418887 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.418924 gettimeofday({1188356998, 418945}, {4294967176, 0}) = 0 <0.000006> 1188356998.418972 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.419007 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.419065 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.419103 gettimeofday({1188356998, 419111}, {4294967176, 0}) = 0 <0.000006> 1188356998.419138 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.419172 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.419231 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.419268 gettimeofday({1188356998, 419276}, {4294967176, 0}) = 0 <0.000005> 1188356998.419303 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.419338 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.419396 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000481> 1188356998.419944 gettimeofday({1188356998, 419954}, {4294967176, 0}) = 0 <0.000038> 1188356998.420048 ioctl(3, FIONREAD, [0]) = 0 <0.000038> 1188356998.420118 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 
232}], 2) = 240 <0.000008> 1188356998.420176 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.420213 gettimeofday({1188356998, 420222}, {4294967176, 0}) = 0 <0.000005> 1188356998.420249 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.420283 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.420342 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.420379 gettimeofday({1188356998, 420387}, {4294967176, 0}) = 0 <0.000006> 1188356998.420682 ioctl(3, FIONREAD, [0]) = 0 <0.000038> 1188356998.420786 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000041> 1188356998.420880 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.420917 gettimeofday({1188356998, 420926}, {4294967176, 0}) = 0 <0.000006> 1188356998.420952 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.420987 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.421046 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.421083 gettimeofday({1188356998, 421091}, {4294967176, 0}) = 0 <0.000006> 1188356998.421118 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.421152 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000009> 1188356998.421212 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.421249 gettimeofday({1188356998, 421258}, {4294967176, 0}) = 0 <0.000005> 1188356998.421285 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.421319 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.421378 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000009> 1188356998.421881 gettimeofday({1188356998, 421891}, {4294967176, 0}) = 0 
<0.000038> 1188356998.421984 ioctl(3, FIONREAD, [0]) = 0 <0.000038> 1188356998.422054 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.422112 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.422149 gettimeofday({1188356998, 422157}, {4294967176, 0}) = 0 <0.000005> 1188356998.422184 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.422218 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.422277 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.422314 gettimeofday({1188356998, 422323}, {4294967176, 0}) = 0 <0.000006> 1188356998.422350 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.422395 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000320> 1188356998.422804 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000039> 1188356998.422909 gettimeofday({1188356998, 422919}, {4294967176, 0}) = 0 <0.000005> 1188356998.422946 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.422981 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.423039 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.423077 gettimeofday({1188356998, 423085}, {4294967176, 0}) = 0 <0.000006> 1188356998.423112 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.423146 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.423204 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.423241 gettimeofday({1188356998, 423250}, {4294967176, 0}) = 0 <0.000006> 1188356998.423277 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.423312 writev(3, [{"\217\1<\0\1\0\0\0", 8}, 
{"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.423370 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.423831 gettimeofday({1188356998, 423841}, {4294967176, 0}) = 0 <0.000038> 1188356998.423936 ioctl(3, FIONREAD, [0]) = 0 <0.000038> 1188356998.424040 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.424100 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.424137 gettimeofday({1188356998, 424146}, {4294967176, 0}) = 0 <0.000005> 1188356998.424173 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.424207 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.424265 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.424302 gettimeofday({1188356998, 424310}, {4294967176, 0}) = 0 <0.000006> 1188356998.424339 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.424374 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000311> 1188356998.424775 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000040> 1188356998.424880 gettimeofday({1188356998, 424891}, {4294967176, 0}) = 0 <0.000006> 1188356998.424918 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.424952 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.425011 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.425048 gettimeofday({1188356998, 425056}, {4294967176, 0}) = 0 <0.000005> 1188356998.425083 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.425117 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.425175 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 
1188356998.425212 gettimeofday({1188356998, 425221}, {4294967176, 0}) = 0 <0.000005> 1188356998.425248 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.425282 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000009> 1188356998.425341 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356998.425379 gettimeofday({1188356998, 425388}, {4294967176, 0}) = 0 <0.000005> 1188356998.425930 ioctl(3, FIONREAD, [0]) = 0 <0.000039> 1188356998.426036 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.426096 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.426134 gettimeofday({1188356998, 426142}, {4294967176, 0}) = 0 <0.000006> 1188356998.426169 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.426203 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.426274 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.426312 gettimeofday({1188356998, 426320}, {4294967176, 0}) = 0 <0.000006> 1188356998.426347 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.426382 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000304> 1188356998.426775 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000041> 1188356998.426882 gettimeofday({1188356998, 426892}, {4294967176, 0}) = 0 <0.000006> 1188356998.426919 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.426954 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.427013 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.427050 gettimeofday({1188356998, 427059}, {4294967176, 0}) = 0 <0.000006> 1188356998.427086 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.427120 writev(3, 
[{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.427179 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.427216 gettimeofday({1188356998, 427225}, {4294967176, 0}) = 0 <0.000006> 1188356998.427252 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.427286 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.427344 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.427381 gettimeofday({1188356998, 427390}, {4294967176, 0}) = 0 <0.000006> 1188356998.427867 ioctl(3, FIONREAD, [0]) = 0 <0.000039> 1188356998.427972 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000041> 1188356998.428067 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.428104 gettimeofday({1188356998, 428113}, {4294967176, 0}) = 0 <0.000005> 1188356998.428140 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.428175 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.428233 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.428270 gettimeofday({1188356998, 428279}, {4294967176, 0}) = 0 <0.000006> 1188356998.428305 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.428340 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.428401 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000322> 1188356998.428788 gettimeofday({1188356998, 428798}, {4294967176, 0}) = 0 <0.000039> 1188356998.428894 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.428930 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.428989 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 
12 <0.000007> 1188356998.429027 gettimeofday({1188356998, 429035}, {4294967176, 0}) = 0 <0.000006> 1188356998.429062 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.429096 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.429155 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.429192 gettimeofday({1188356998, 429201}, {4294967176, 0}) = 0 <0.000005> 1188356998.429228 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.429262 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.429321 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000006> 1188356998.429357 gettimeofday({1188356998, 429366}, {4294967176, 0}) = 0 <0.000005> 1188356998.429393 ioctl(3, FIONREAD, [0]) = 0 <0.000436> 1188356998.429896 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000040> 1188356998.430023 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.430074 gettimeofday({1188356998, 430083}, {4294967176, 0}) = 0 <0.000005> 1188356998.430110 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.430145 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000009> 1188356998.430204 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.430242 gettimeofday({1188356998, 430250}, {4294967176, 0}) = 0 <0.000006> 1188356998.430277 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.430311 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.430370 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.430723 gettimeofday({1188356998, 430733}, {4294967176, 0}) = 0 <0.000039> 1188356998.430828 ioctl(3, FIONREAD, [0]) = 0 <0.000038> 1188356998.430899 
writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.430958 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.430995 gettimeofday({1188356998, 431004}, {4294967176, 0}) = 0 <0.000005> 1188356998.431031 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.431065 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.431124 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.431161 gettimeofday({1188356998, 431169}, {4294967176, 0}) = 0 <0.000006> 1188356998.431196 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.431230 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.431289 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.431326 gettimeofday({1188356998, 431334}, {4294967176, 0}) = 0 <0.000006> 1188356998.431361 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.431395 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000456> 1188356998.431940 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000039> 1188356998.432046 gettimeofday({1188356998, 432056}, {4294967176, 0}) = 0 <0.000005> 1188356998.432083 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.432117 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.432176 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.432213 gettimeofday({1188356998, 432222}, {4294967176, 0}) = 0 <0.000005> 1188356998.432249 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.432283 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.432342 write(3, "\217\v\3\0\1\0\0\0\2\0 
\2", 12) = 12 <0.000007> 1188356998.432379 gettimeofday({1188356998, 432387}, {4294967176, 0}) = 0 <0.000006> 1188356998.432751 ioctl(3, FIONREAD, [0]) = 0 <0.000039> 1188356998.432855 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.432915 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.432952 gettimeofday({1188356998, 432961}, {4294967176, 0}) = 0 <0.000005> 1188356998.432988 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.433022 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.433081 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.433118 gettimeofday({1188356998, 433127}, {4294967176, 0}) = 0 <0.000006> 1188356998.433154 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.433188 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.433247 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356998.433283 gettimeofday({1188356998, 433292}, {4294967176, 0}) = 0 <0.000005> 1188356998.433330 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.433365 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000441> 1188356998.433896 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000040> 1188356998.434002 gettimeofday({1188356998, 434012}, {4294967176, 0}) = 0 <0.000038> 1188356998.434073 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.434108 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.434167 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.434204 gettimeofday({1188356998, 434213}, {4294967176, 0}) = 0 <0.000005> 1188356998.434240 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 
1188356998.434274 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.434332 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356998.434370 gettimeofday({1188356998, 434379}, {4294967176, 0}) = 0 <0.000006> 1188356998.434408 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.434443 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.434503 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.434540 gettimeofday({1188356998, 434549}, {4294967176, 0}) = 0 <0.000006> 1188356998.434575 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.434610 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.434668 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.434705 gettimeofday({1188356998, 434714}, {4294967176, 0}) = 0 <0.000005> 1188356998.434741 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.434775 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000009> 1188356998.434835 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.434871 gettimeofday({1188356998, 434880}, {4294967176, 0}) = 0 <0.000005> 1188356998.434907 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.434941 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.434999 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.435036 gettimeofday({1188356998, 435045}, {4294967176, 0}) = 0 <0.000006> 1188356998.435071 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.435106 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.435164 write(3, 
"\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.435201 gettimeofday({1188356998, 435209}, {4294967176, 0}) = 0 <0.000006> 1188356998.435237 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.435271 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.435329 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.435365 gettimeofday({1188356998, 435374}, {4294967176, 0}) = 0 <0.000006> 1188356998.435403 ioctl(3, FIONREAD, [0]) = 0 <0.001120> 1188356998.436587 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000041> 1188356998.436715 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000039> 1188356998.436821 gettimeofday({1188356998, 436831}, {4294967176, 0}) = 0 <0.000005> 1188356998.436857 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.436892 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.436951 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.436988 gettimeofday({1188356998, 436996}, {4294967176, 0}) = 0 <0.000006> 1188356998.437023 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.437069 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.437129 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.437166 gettimeofday({1188356998, 437175}, {4294967176, 0}) = 0 <0.000006> 1188356998.437202 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.437237 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.437295 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.437332 gettimeofday({1188356998, 437341}, {4294967176, 0}) = 0 <0.000005> 1188356998.437368 ioctl(3, FIONREAD, [0]) = 0 
<0.000005> 1188356998.437404 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000498> 1188356998.437988 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000040> 1188356998.438094 gettimeofday({1188356998, 438104}, {4294967176, 0}) = 0 <0.000038> 1188356998.438198 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.438234 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.438293 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.438330 gettimeofday({1188356998, 438339}, {4294967176, 0}) = 0 <0.000006> 1188356998.438365 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.438403 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000206> 1188356998.438694 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000040> 1188356998.438800 gettimeofday({1188356998, 438810}, {4294967176, 0}) = 0 <0.000038> 1188356998.438871 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.438906 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000009> 1188356998.438966 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.439003 gettimeofday({1188356998, 439012}, {4294967176, 0}) = 0 <0.000005> 1188356998.439039 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.439073 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356998.439132 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.439169 gettimeofday({1188356998, 439178}, {4294967176, 0}) = 0 <0.000005> 1188356998.439205 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.439239 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.439297 
write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.439334 gettimeofday({1188356998, 439343}, {4294967176, 0}) = 0 <0.000006> 1188356998.439370 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.439406 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.439465 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.439503 gettimeofday({1188356998, 439511}, {4294967176, 0}) = 0 <0.000006> 1188356998.439538 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.439573 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.439631 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.439669 gettimeofday({1188356998, 439677}, {4294967176, 0}) = 0 <0.000005> 1188356998.439704 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.439738 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.439797 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.439834 gettimeofday({1188356998, 439842}, {4294967176, 0}) = 0 <0.000006> 1188356998.439869 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356998.439904 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356998.439974 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356998.440011 gettimeofday({1188356998, 440020}, {4294967176, 0}) = 0 <0.000006> 1188356998.440047 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356998.440081 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = -1 EAGAIN (Resource temporarily unavailable) <0.000007> 1188356998.440155 select(4, [3], [3], NULL, NULL) = 1 (out [3]) <1.173680> 1188356999.614013 writev(3, [{"\217\1<\0\1\0\0\0", 8}, 
{"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000047> 1188356999.614125 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000059> 1188356999.614218 gettimeofday({1188356999, 614228}, {4294967176, 0}) = 0 <0.000006> 1188356999.614326 ioctl(3, FIONREAD, [0]) = 0 <0.000008> 1188356999.614371 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000075> 1188356999.614505 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356999.614611 gettimeofday({1188356999, 614622}, {4294967176, 0}) = 0 <0.000006> 1188356999.614650 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.614770 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.614831 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000074> 1188356999.614938 gettimeofday({1188356999, 614948}, {4294967176, 0}) = 0 <0.000007> 1188356999.614975 ioctl(3, FIONREAD, [0]) = 0 <0.000088> 1188356999.615095 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.615155 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000090> 1188356999.615278 gettimeofday({1188356999, 615287}, {4294967176, 0}) = 0 <0.000006> 1188356999.615315 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 1188356999.615407 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.615565 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356999.615606 gettimeofday({1188356999, 615615}, {4294967176, 0}) = 0 <0.000006> 1188356999.615643 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 1188356999.615798 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.615859 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 
1188356999.615897 gettimeofday({1188356999, 615907}, {4294967176, 0}) = 0 <0.000006> 1188356999.616053 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 1188356999.616090 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.616150 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000109> 1188356999.616292 gettimeofday({1188356999, 616301}, {4294967176, 0}) = 0 <0.000006> 1188356999.616329 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 1188356999.616365 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000126> 1188356999.616548 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356999.616587 gettimeofday({1188356999, 616596}, {4294967176, 0}) = 0 <0.000006> 1188356999.616624 ioctl(3, FIONREAD, [0]) = 0 <0.000124> 1188356999.616779 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000009> 1188356999.616840 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.616878 gettimeofday({1188356999, 616888}, {4294967176, 0}) = 0 <0.000122> 1188356999.617034 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 1188356999.617070 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.617130 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000125> 1188356999.617288 gettimeofday({1188356999, 617297}, {4294967176, 0}) = 0 <0.000006> 1188356999.617348 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 1188356999.617385 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000129> 1188356999.617571 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.617610 gettimeofday({1188356999, 617619}, {4294967176, 0}) = 0 <0.000005> 1188356999.617748 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.617785 writev(3, 
[{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.617844 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.617881 gettimeofday({1188356999, 617890}, {4294967176, 0}) = 0 <0.000005> 1188356999.617917 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.617952 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.618011 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.618305 gettimeofday({1188356999, 618315}, {4294967176, 0}) = 0 <0.000006> 1188356999.618343 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.618378 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000110> 1188356999.618543 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.618581 gettimeofday({1188356999, 618590}, {4294967176, 0}) = 0 <0.000005> 1188356999.618617 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.618652 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.618710 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.618748 gettimeofday({1188356999, 618756}, {4294967176, 0}) = 0 <0.000006> 1188356999.618783 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.618818 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000010> 1188356999.618878 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.618916 gettimeofday({1188356999, 618925}, {4294967176, 0}) = 0 <0.000005> 1188356999.618952 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.618987 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.619046 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 
12 <0.000007> 1188356999.619083 gettimeofday({1188356999, 619092}, {4294967176, 0}) = 0 <0.000005> 1188356999.619119 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.619154 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.619792 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.619831 gettimeofday({1188356999, 619840}, {4294967176, 0}) = 0 <0.000006> 1188356999.619867 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.619902 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.619961 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.619998 gettimeofday({1188356999, 620007}, {4294967176, 0}) = 0 <0.000005> 1188356999.620034 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.620068 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.620127 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.620164 gettimeofday({1188356999, 620173}, {4294967176, 0}) = 0 <0.000005> 1188356999.620200 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.620235 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.620293 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.620330 gettimeofday({1188356999, 620339}, {4294967176, 0}) = 0 <0.000006> 1188356999.620366 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.620959 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.621020 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.621057 gettimeofday({1188356999, 621066}, {4294967176, 0}) = 0 <0.000005> 1188356999.621093 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.621128 
writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.621187 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.621224 gettimeofday({1188356999, 621233}, {4294967176, 0}) = 0 <0.000006> 1188356999.621259 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.621294 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.621352 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.621390 gettimeofday({1188356999, 621401}, {4294967176, 0}) = 0 <0.000008> 1188356999.621429 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.621464 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.621522 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.621559 gettimeofday({1188356999, 621568}, {4294967176, 0}) = 0 <0.000005> 1188356999.621595 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.621630 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.621688 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.621726 gettimeofday({1188356999, 621735}, {4294967176, 0}) = 0 <0.000005> 1188356999.621762 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.621797 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.621856 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.621894 gettimeofday({1188356999, 621903}, {4294967176, 0}) = 0 <0.000006> 1188356999.621930 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.621964 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.622023 write(3, "\217\v\3\0\1\0\0\0\2\0 
\2", 12) = 12 <0.000007> 1188356999.622061 gettimeofday({1188356999, 622070}, {4294967176, 0}) = 0 <0.000006> 1188356999.622097 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.622132 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.622191 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356999.622230 gettimeofday({1188356999, 622239}, {4294967176, 0}) = 0 <0.000006> 1188356999.622267 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.622301 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.622360 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.622400 gettimeofday({1188356999, 622410}, {4294967176, 0}) = 0 <0.001199> 1188356999.623665 ioctl(3, FIONREAD, [0]) = 0 <0.000038> 1188356999.623768 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.623828 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.623866 gettimeofday({1188356999, 623875}, {4294967176, 0}) = 0 <0.000006> 1188356999.623902 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.623936 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.623995 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.624032 gettimeofday({1188356999, 624041}, {4294967176, 0}) = 0 <0.000005> 1188356999.624068 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.624103 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.624174 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.624212 gettimeofday({1188356999, 624221}, {4294967176, 0}) = 0 <0.000006> 1188356999.624248 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 
1188356999.624282 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.624344 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.624381 gettimeofday({1188356999, 624390}, {4294967176, 0}) = 0 <0.000005> 1188356999.624987 ioctl(3, FIONREAD, [0]) = 0 <0.000038> 1188356999.625091 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000041> 1188356999.625185 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.625222 gettimeofday({1188356999, 625231}, {4294967176, 0}) = 0 <0.000006> 1188356999.625258 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.625292 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.625351 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.625388 gettimeofday({1188356999, 625397}, {4294967176, 0}) = 0 <0.000261> 1188356999.625714 ioctl(3, FIONREAD, [0]) = 0 <0.000039> 1188356999.625818 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.625879 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.625916 gettimeofday({1188356999, 625925}, {4294967176, 0}) = 0 <0.000005> 1188356999.625952 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.625987 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.626045 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.626082 gettimeofday({1188356999, 626091}, {4294967176, 0}) = 0 <0.000005> 1188356999.626118 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.626153 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.626211 write(3, 
"\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.626249 gettimeofday({1188356999, 626257}, {4294967176, 0}) = 0 <0.000005> 1188356999.626284 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.626318 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.626377 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000010> 1188356999.626913 gettimeofday({1188356999, 626923}, {4294967176, 0}) = 0 <0.000038> 1188356999.627018 ioctl(3, FIONREAD, [0]) = 0 <0.000038> 1188356999.627088 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.627147 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.627185 gettimeofday({1188356999, 627193}, {4294967176, 0}) = 0 <0.000005> 1188356999.627220 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.627254 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.627313 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.627350 gettimeofday({1188356999, 627359}, {4294967176, 0}) = 0 <0.000005> 1188356999.627386 ioctl(3, FIONREAD, [0]) = 0 <0.000009> 1188356999.627733 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000043> 1188356999.627861 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000006> 1188356999.627901 gettimeofday({1188356999, 627910}, {4294967176, 0}) = 0 <0.000006> 1188356999.627936 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.627971 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000010> 1188356999.628031 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.628081 gettimeofday({1188356999, 628090}, {4294967176, 0}) = 0 <0.000006> 1188356999.628117 ioctl(3, FIONREAD, [0]) = 0 
<0.000005> 1188356999.628152 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.628211 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.628248 gettimeofday({1188356999, 628256}, {4294967176, 0}) = 0 <0.000006> 1188356999.628283 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.628318 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.628376 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000009> 1188356999.628864 gettimeofday({1188356999, 628875}, {4294967176, 0}) = 0 <0.000039> 1188356999.628970 ioctl(3, FIONREAD, [0]) = 0 <0.000039> 1188356999.629040 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.629099 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.629136 gettimeofday({1188356999, 629145}, {4294967176, 0}) = 0 <0.000005> 1188356999.629172 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.629207 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.629265 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.629302 gettimeofday({1188356999, 629311}, {4294967176, 0}) = 0 <0.000005> 1188356999.629338 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.629373 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000323> 1188356999.629784 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000040> 1188356999.629891 gettimeofday({1188356999, 629901}, {4294967176, 0}) = 0 <0.000005> 1188356999.629928 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.629963 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.630022 
write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.630059 gettimeofday({1188356999, 630068}, {4294967176, 0}) = 0 <0.000005> 1188356999.630095 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.630130 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.630188 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.630225 gettimeofday({1188356999, 630234}, {4294967176, 0}) = 0 <0.000005> 1188356999.630261 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.630295 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.630354 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.630391 gettimeofday({1188356999, 630403}, {4294967176, 0}) = 0 <0.000441> 1188356999.630897 ioctl(3, FIONREAD, [0]) = 0 <0.000039> 1188356999.631000 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.631060 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.631097 gettimeofday({1188356999, 631106}, {4294967176, 0}) = 0 <0.000006> 1188356999.631133 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.631167 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.631226 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.631263 gettimeofday({1188356999, 631272}, {4294967176, 0}) = 0 <0.000006> 1188356999.631299 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.631333 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.631392 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000339> 1188356999.631799 gettimeofday({1188356999, 631809}, {4294967176, 0}) = 0 <0.000037> 1188356999.631916 ioctl(3, FIONREAD, 
[0]) = 0 <0.000006> 1188356999.631953 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.632012 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356999.632050 gettimeofday({1188356999, 632058}, {4294967176, 0}) = 0 <0.000006> 1188356999.632085 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.632120 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.632178 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.632215 gettimeofday({1188356999, 632224}, {4294967176, 0}) = 0 <0.000005> 1188356999.632251 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.632285 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.632344 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.632381 gettimeofday({1188356999, 632390}, {4294967176, 0}) = 0 <0.000006> 1188356999.632836 ioctl(3, FIONREAD, [0]) = 0 <0.000039> 1188356999.632939 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000041> 1188356999.633033 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000008> 1188356999.633071 gettimeofday({1188356999, 633079}, {4294967176, 0}) = 0 <0.000006> 1188356999.633106 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.633141 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.633199 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.633237 gettimeofday({1188356999, 633245}, {4294967176, 0}) = 0 <0.000006> 1188356999.633273 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.633307 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 
1188356999.633365 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.633405 gettimeofday({1188356999, 633414}, {4294967176, 0}) = 0 <0.000356> 1188356999.633826 ioctl(3, FIONREAD, [0]) = 0 <0.000038> 1188356999.633930 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.633989 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.634026 gettimeofday({1188356999, 634035}, {4294967176, 0}) = 0 <0.000006> 1188356999.634062 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.634096 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.634155 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.634192 gettimeofday({1188356999, 634200}, {4294967176, 0}) = 0 <0.000005> 1188356999.634227 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.634261 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.634319 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.634356 gettimeofday({1188356999, 634365}, {4294967176, 0}) = 0 <0.000005> 1188356999.634392 ioctl(3, FIONREAD, [0]) = 0 <0.000405> 1188356999.634863 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000041> 1188356999.634990 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.635029 gettimeofday({1188356999, 635038}, {4294967176, 0}) = 0 <0.000005> 1188356999.635065 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.635099 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.635158 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.635195 gettimeofday({1188356999, 635204}, {4294967176, 0}) = 0 <0.000006> 1188356999.635230 
ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.635265 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.635337 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.635374 gettimeofday({1188356999, 635383}, {4294967176, 0}) = 0 <0.000006> 1188356999.635781 ioctl(3, FIONREAD, [0]) = 0 <0.000038> 1188356999.635885 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000041> 1188356999.635980 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.636017 gettimeofday({1188356999, 636026}, {4294967176, 0}) = 0 <0.000006> 1188356999.636053 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.636087 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.636146 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.636183 gettimeofday({1188356999, 636192}, {4294967176, 0}) = 0 <0.000005> 1188356999.636219 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.636253 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.636312 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.636349 gettimeofday({1188356999, 636358}, {4294967176, 0}) = 0 <0.000005> 1188356999.636385 ioctl(3, FIONREAD, [0]) = 0 <0.000009> 1188356999.636828 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000041> 1188356999.636955 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000006> 1188356999.636994 gettimeofday({1188356999, 637003}, {4294967176, 0}) = 0 <0.000006> 1188356999.637030 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.637065 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 
<0.000008> 1188356999.637123 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.637161 gettimeofday({1188356999, 637169}, {4294967176, 0}) = 0 <0.000006> 1188356999.637196 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.637232 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.637291 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.637328 gettimeofday({1188356999, 637337}, {4294967176, 0}) = 0 <0.000006> 1188356999.637364 ioctl(3, FIONREAD, [0]) = 0 <0.000007> 1188356999.637402 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000414> 1188356999.637902 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000040> 1188356999.637976 gettimeofday({1188356999, 637985}, {4294967176, 0}) = 0 <0.000006> 1188356999.638012 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.638047 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.638106 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.638143 gettimeofday({1188356999, 638152}, {4294967176, 0}) = 0 <0.000006> 1188356999.638179 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.638213 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.638272 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.638308 gettimeofday({1188356999, 638317}, {4294967176, 0}) = 0 <0.000005> 1188356999.638344 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.638378 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000396> 1188356999.638863 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000040> 1188356999.638969 gettimeofday({1188356999, 638979}, {4294967176, 0}) = 0 <0.000005> 
1188356999.639006 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.639041 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000009> 1188356999.639112 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.639150 gettimeofday({1188356999, 639159}, {4294967176, 0}) = 0 <0.000006> 1188356999.639186 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.639220 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.639279 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.639316 gettimeofday({1188356999, 639325}, {4294967176, 0}) = 0 <0.000006> 1188356999.639352 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.639386 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000373> 1188356999.639848 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000040> 1188356999.639955 gettimeofday({1188356999, 639965}, {4294967176, 0}) = 0 <0.000005> 1188356999.639992 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.640027 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.640085 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.640123 gettimeofday({1188356999, 640131}, {4294967176, 0}) = 0 <0.000006> 1188356999.640159 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.640193 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.640251 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.640288 gettimeofday({1188356999, 640297}, {4294967176, 0}) = 0 <0.000006> 1188356999.640324 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.640358 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 
232}], 2) = 240 <0.000011> 1188356999.640802 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000041> 1188356999.640910 gettimeofday({1188356999, 640920}, {4294967176, 0}) = 0 <0.000038> 1188356999.640982 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.641017 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.641076 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.641113 gettimeofday({1188356999, 641122}, {4294967176, 0}) = 0 <0.000005> 1188356999.641149 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.641183 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.641242 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.641279 gettimeofday({1188356999, 641287}, {4294967176, 0}) = 0 <0.000006> 1188356999.641315 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.641349 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.641775 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000040> 1188356999.641882 gettimeofday({1188356999, 641892}, {4294967176, 0}) = 0 <0.000038> 1188356999.641954 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.641988 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.642047 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.642085 gettimeofday({1188356999, 642093}, {4294967176, 0}) = 0 <0.000006> 1188356999.642121 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.642155 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.642213 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.642250 gettimeofday({1188356999, 642259}, {4294967176, 0}) = 0 
<0.000005> 1188356999.642286 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.642320 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.642379 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000388> 1188356999.642849 gettimeofday({1188356999, 642859}, {4294967176, 0}) = 0 <0.000038> 1188356999.642953 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.642989 select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout) <0.000008> 1188356999.643030 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.643090 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.643127 gettimeofday({1188356999, 643136}, {4294967176, 0}) = 0 <0.000006> 1188356999.643163 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.643198 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.643256 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.643294 gettimeofday({1188356999, 643303}, {4294967176, 0}) = 0 <0.000006> 1188356999.643330 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.643364 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000374> 1188356999.643827 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000040> 1188356999.643933 gettimeofday({1188356999, 643943}, {4294967176, 0}) = 0 <0.000038> 1188356999.644005 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.644040 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.644099 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.644137 gettimeofday({1188356999, 644145}, {4294967176, 0}) = 0 <0.000006> 1188356999.644172 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.644207 writev(3, 
[{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.644266 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.644303 gettimeofday({1188356999, 644311}, {4294967176, 0}) = 0 <0.000006> 1188356999.644339 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.644373 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000357> 1188356999.644819 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000040> 1188356999.644926 gettimeofday({1188356999, 644936}, {4294967176, 0}) = 0 <0.000005> 1188356999.644963 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.644998 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.645057 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.645095 gettimeofday({1188356999, 645103}, {4294967176, 0}) = 0 <0.000005> 1188356999.645130 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.645165 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.645225 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.645262 gettimeofday({1188356999, 645271}, {4294967176, 0}) = 0 <0.000006> 1188356999.645298 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.645332 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.645391 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000407> 1188356999.645865 gettimeofday({1188356999, 645875}, {4294967176, 0}) = 0 <0.000038> 1188356999.645969 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.646005 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.646065 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 
12 <0.000007> 1188356999.646102 gettimeofday({1188356999, 646111}, {4294967176, 0}) = 0 <0.000006> 1188356999.646138 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.646172 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.646231 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.646281 gettimeofday({1188356999, 646290}, {4294967176, 0}) = 0 <0.000005> 1188356999.646317 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.646352 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.646763 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000040> 1188356999.646870 gettimeofday({1188356999, 646880}, {4294967176, 0}) = 0 <0.000039> 1188356999.646941 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.646976 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.647036 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.647073 gettimeofday({1188356999, 647082}, {4294967176, 0}) = 0 <0.000005> 1188356999.647109 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.647143 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.647202 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.647240 gettimeofday({1188356999, 647248}, {4294967176, 0}) = 0 <0.000006> 1188356999.647276 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.647310 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000009> 1188356999.647372 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.647812 gettimeofday({1188356999, 647823}, {4294967176, 0}) = 0 <0.000039> 1188356999.647917 ioctl(3, FIONREAD, [0]) = 0 <0.000039> 1188356999.647988 
writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.648047 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.648084 gettimeofday({1188356999, 648093}, {4294967176, 0}) = 0 <0.000005> 1188356999.648120 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.648155 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.648213 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.648251 gettimeofday({1188356999, 648260}, {4294967176, 0}) = 0 <0.000006> 1188356999.648287 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.648321 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.648380 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000010> 1188356999.648420 gettimeofday({1188356999, 648429}, {4294967176, 0}) = 0 <0.000005> 1188356999.648456 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.648491 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.648549 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.648586 gettimeofday({1188356999, 648595}, {4294967176, 0}) = 0 <0.000005> 1188356999.648622 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.648656 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.648715 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.648752 gettimeofday({1188356999, 648761}, {4294967176, 0}) = 0 <0.000005> 1188356999.648788 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.648823 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000007> 1188356999.648881 write(3, "\217\v\3\0\1\0\0\0\2\0 
\2", 12) = 12 <0.000008> 1188356999.648919 gettimeofday({1188356999, 648928}, {4294967176, 0}) = 0 <0.000006> 1188356999.648955 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.648990 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.649048 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.649086 gettimeofday({1188356999, 649094}, {4294967176, 0}) = 0 <0.000005> 1188356999.649134 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 1188356999.649170 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.649229 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000007> 1188356999.649266 gettimeofday({1188356999, 649275}, {4294967176, 0}) = 0 <0.000006> 1188356999.649302 ioctl(3, FIONREAD, [0]) = 0 <0.000005> 1188356999.649336 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = 240 <0.000008> 1188356999.649394 write(3, "\217\v\3\0\1\0\0\0\2\0 \2", 12) = 12 <0.000040> 1188356999.650639 gettimeofday({1188356999, 650649}, {4294967176, 0}) = 0 <0.000038> 1188356999.650743 ioctl(3, FIONREAD, [0]) = 0 <0.000038> 1188356999.650846 writev(3, [{"\217\1<\0\1\0\0\0", 8}, {"\10\0\177\0\0A\0\0\4\0\270\0\24\0\272\0\0\0\240A\0\0\200"..., 232}], 2) = -1 EAGAIN (Resource temporarily unavailable) <0.000006> 1188356999.650912 select(4, [3], [3], NULL, NULL <unfinished ...> Process 3644 detached ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review
  2007-08-29  4:46 ` Ingo Molnar
@ 2007-08-29  7:57 ` Keith Packard
  2007-08-29  8:04   ` Ingo Molnar
  0 siblings, 1 reply; 535+ messages in thread
From: Keith Packard @ 2007-08-29 7:57 UTC (permalink / raw)
To: Ingo Molnar
Cc: keith.packard, Al Boldi, Peter Zijlstra, Mike Galbraith,
    Andrew Morton, Linus Torvalds, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2155 bytes --]

On Wed, 2007-08-29 at 06:46 +0200, Ingo Molnar wrote:

> ok, i finally managed to reproduce the "artifact" myself on an older
> box. It goes like this: start up X with the vesa driver (or with NoDRI)
> to force software rendering. Then start up a couple of glxgears
> instances. Those glxgears instances update in a very "chunky",
> "stuttering" way - each glxgears instance runs/stops/runs/stops at a
> rate of about once per second, and this was reported to me as a
> potential CPU scheduler regression.

Hmm. I can't even run two copies of glxgears on software GL code today;
it's broken in every X server I have available. Someone broke it a
while ago, but no-one noticed. However, this shouldn't be GLX related,
as the software rasterizer is no different from any other rendering
code. Testing with my smart-scheduler case (many copies of 'plaid')
shows that at least with git master, things are working as designed.
When GLX is working again, I'll try that as well.

> at a quick glance this is not a CPU scheduler thing: X uses up 99% of
> CPU time, all the glxgears tasks (i needed 8 parallel instances to see
> the stallings) are using up the remaining 1% of CPU time. The ordering
> of the requests from the glxgears tasks is X's choice - and for a
> pathological overload situation like this we cannot blame X at all for
> not producing a completely smooth output. (although Xorg could perhaps
> try to schedule such requests more smoothly, in a more finegrained way?)

It does. It should switch between clients every 20ms; that's why X
spends so much time asking the kernel for the current time.

Make sure the X server isn't running with the smart scheduler disabled;
that will cause precisely the symptoms you're seeing here. In the
normal upstream sources, you'd have to use '-dumbSched' as an X server
command line option.

The old 'scheduler' would run an entire X client's input buffer dry
before looking for requests from another client. Because glxgears
requests are small but time consuming, this can cause very long delays
between client switching.

--
keith.packard@intel.com

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
* Re: CFS review
  2007-08-29  7:57 ` Keith Packard
@ 2007-08-29  8:04   ` Ingo Molnar
  2007-08-29  8:53     ` Al Boldi
  2007-08-29 15:57     ` Keith Packard
  0 siblings, 2 replies; 535+ messages in thread
From: Ingo Molnar @ 2007-08-29 8:04 UTC (permalink / raw)
To: Keith Packard
Cc: Al Boldi, Peter Zijlstra, Mike Galbraith, Andrew Morton,
    Linus Torvalds, linux-kernel

* Keith Packard <keith.packard@intel.com> wrote:

> Make sure the X server isn't running with the smart scheduler
> disabled; that will cause precisely the symptoms you're seeing here.
> In the normal upstream sources, you'd have to use '-dumbSched' as an X
> server command line option.
>
> The old 'scheduler' would run an entire X client's input buffer dry
> before looking for requests from another client. Because glxgears
> requests are small but time consuming, this can cause very long delays
> between client switching.

on the old box where i've reproduced this i've got an ancient X version:

  neptune:~> X -version
  X Window System Version 6.8.2
  Release Date: 9 February 2005
  X Protocol Version 11, Revision 0, Release 6.8.2
  Build Operating System: Linux 2.6.9-22.ELsmp i686 [ELF]

is that old enough to not have the smart X scheduler?

on newer systems i dont see correctly updated glxgears output (probably
the GLX bug you mentioned) so i cannot reproduce the bug.

Al, could you send us your 'X -version' output?

	Ingo
* Re: CFS review
  2007-08-29  8:04   ` Ingo Molnar
@ 2007-08-29  8:53     ` Al Boldi
  2007-08-29 15:57     ` Keith Packard
  1 sibling, 0 replies; 535+ messages in thread
From: Al Boldi @ 2007-08-29 8:53 UTC (permalink / raw)
To: Ingo Molnar, Keith Packard
Cc: Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds,
    linux-kernel

Ingo Molnar wrote:
> * Keith Packard <keith.packard@intel.com> wrote:
> > Make sure the X server isn't running with the smart scheduler
> > disabled; that will cause precisely the symptoms you're seeing here.
> > In the normal upstream sources, you'd have to use '-dumbSched' as an X
> > server command line option.
> >
> > The old 'scheduler' would run an entire X client's input buffer dry
> > before looking for requests from another client. Because glxgears
> > requests are small but time consuming, this can cause very long delays
> > between client switching.
>
> on the old box where i've reproduced this i've got an ancient X version:
>
>   neptune:~> X -version
>   X Window System Version 6.8.2
>   Release Date: 9 February 2005
>   X Protocol Version 11, Revision 0, Release 6.8.2
>   Build Operating System: Linux 2.6.9-22.ELsmp i686 [ELF]
>
> is that old enough to not have the smart X scheduler?
>
> on newer systems i dont see correctly updated glxgears output (probably
> the GLX bug you mentioned) so i cannot reproduce the bug.
>
> Al, could you send us your 'X -version' output?

This is the one I have been talking about:

  XFree86 Version 4.3.0
  Release Date: 27 February 2003
  X Protocol Version 11, Revision 0, Release 6.6
  Build Operating System: Linux 2.4.21-0.13mdksmp i686 [ELF]

I also tried the gears test just now on this:

  X Window System Version 6.8.1
  Release Date: 17 September 2004
  X Protocol Version 11, Revision 0, Release 6.8.1
  Build Operating System: Linux 2.6.9-1.860_ELsmp i686 [ELF]

but it completely locks up. Disabling add_wait_runtime seems to fix it.

Thanks!

--
Al
* Re: CFS review
  2007-08-29  8:04   ` Ingo Molnar
  2007-08-29  8:53     ` Al Boldi
@ 2007-08-29 15:57     ` Keith Packard
  2007-08-29 19:56       ` Rene Herman
  1 sibling, 1 reply; 535+ messages in thread
From: Keith Packard @ 2007-08-29 15:57 UTC (permalink / raw)
To: Ingo Molnar
Cc: keith.packard, Al Boldi, Peter Zijlstra, Mike Galbraith,
    Andrew Morton, Linus Torvalds, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 530 bytes --]

On Wed, 2007-08-29 at 10:04 +0200, Ingo Molnar wrote:

> is that old enough to not have the smart X scheduler?

The smart scheduler went into the server in like 2000. I don't think
you've got any systems that old. XFree86 4.1 or 4.2, I can't remember
which.

> (probably
> the GLX bug you mentioned) so i cannot reproduce the bug.

With X server 1.3, I'm getting consistent crashes with two glxgears
instances running. So, if you're getting any output, it's better than
my situation.

--
keith.packard@intel.com

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
* Re: CFS review
  2007-08-29 15:57     ` Keith Packard
@ 2007-08-29 19:56       ` Rene Herman
  2007-08-30  7:05         ` Rene Herman
  2007-08-30 16:06         ` Chuck Ebbert
  0 siblings, 2 replies; 535+ messages in thread
From: Rene Herman @ 2007-08-29 19:56 UTC (permalink / raw)
To: keith.packard
Cc: Ingo Molnar, Al Boldi, Peter Zijlstra, Mike Galbraith,
    Andrew Morton, Linus Torvalds, linux-kernel

On 08/29/2007 05:57 PM, Keith Packard wrote:

> With X server 1.3, I'm getting consistent crashes with two glxgears
> instances running. So, if you're getting any output, it's better than my
> situation.

Before people focus on software rendering too much -- also with 1.3.0
(and a Matrox Millennium G550 AGP, 32M) glxgears also works decidedly
crummy using hardware rendering. While I can move the glxgears window
itself, the actual spinning wheels stay in the upper-left corner of the
screen and the movement leaves a non-repainting trace on the screen.
Running a second instance of glxgears in addition seems to make both
instances unkillable -- and when I just now forcefully killed X in this
situation (the spinning wheels were covering the upper left corner of
all my desktops) I got the below.

Kernel is 2.6.22.5-cfs-v20.5, schedule() is in the traces (but that may
be expected anyway).
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000010
 printing eip:
c10ff416
*pde = 00000000
Oops: 0000 [#1]
PREEMPT
Modules linked in: nfsd exportfs lockd nfs_acl sunrpc nls_iso8859_1 nls_cp437 vfat fat nls_base
CPU:    0
EIP:    0060:[<c10ff416>]    Not tainted VLI
EFLAGS: 00210246   (2.6.22.5-cfs-v20.5-local #5)
EIP is at mga_dma_buffers+0x189/0x2e3
eax: 00000000   ebx: efd07200   ecx: 00000001   edx: efc32c00
esi: 00000000   edi: c12756cc   ebp: dfea44c0   esp: dddaaec0
ds: 007b   es: 007b   fs: 0000  gs: 0033  ss: 0068
Process glxgears (pid: 1775, ti=dddaa000 task=e9daca60 task.ti=dddaa000)
Stack: efc32c00 00000000 00000004 e4c3bd20 c10fa54b e4c3bd20 efc32c00 00000000
       00000004 00000000 00000000 00000000 00000000 00000001 00010000 bfbdb8bc
       bfbdb8b8 00000000 c10ff28d 00000029 c12756cc dfea44c0 c10f87fc bfbdb844
Call Trace:
 [<c10fa54b>] drm_lock+0x255/0x2de
 [<c10ff28d>] mga_dma_buffers+0x0/0x2e3
 [<c10f87fc>] drm_ioctl+0x142/0x18a
 [<c1005973>] do_IRQ+0x97/0xb0
 [<c10f86ba>] drm_ioctl+0x0/0x18a
 [<c10f86ba>] drm_ioctl+0x0/0x18a
 [<c105b0d7>] do_ioctl+0x87/0x9f
 [<c105b32c>] vfs_ioctl+0x23d/0x250
 [<c11b533e>] schedule+0x2d0/0x2e6
 [<c105b372>] sys_ioctl+0x33/0x4d
 [<c1003d1e>] syscall_call+0x7/0xb
 =======================
Code: 9a 08 03 00 00 8b 73 30 74 14 c7 44 24 04 28 76 1c c1 c7 04 24 49 51 23 c1 e8 b0 74 f1 ff 8b 83 d8 00 00 00 83 3d 1c 47 30 c1 00 <8b> 40 10 8b a8 58 1e 00 00 8b 43 28 8b b8 64 01 00 00 74 32 8b
EIP: [<c10ff416>] mga_dma_buffers+0x189/0x2e3 SS:ESP 0068:dddaaec0

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000010
 printing eip:
c10ff416
*pde = 00000000
Oops: 0000 [#2]
PREEMPT
Modules linked in: nfsd exportfs lockd nfs_acl sunrpc nls_iso8859_1 nls_cp437 vfat fat nls_base
CPU:    0
EIP:    0060:[<c10ff416>]    Not tainted VLI
EFLAGS: 00210246   (2.6.22.5-cfs-v20.5-local #5)
EIP is at mga_dma_buffers+0x189/0x2e3
eax: 00000000   ebx: efd07200   ecx: 00000001   edx: efc32c00
esi: 00000000   edi: c12756cc   ebp: dfea4780   esp: e0552ec0
ds: 007b   es: 007b   fs: 0000  gs: 0033  ss: 0068
Process glxgears (pid: 1776, ti=e0552000 task=c19ec000 task.ti=e0552000)
Stack: efc32c00 00000000 00000003 efc64b40 c10fa54b efc64b40 efc32c00 00000000
       00000003 00000000 00000000 00000000 00000000 00000001 00010000 bf8dbdcc
       bf8dbdc8 00000000 c10ff28d 00000029 c12756cc dfea4780 c10f87fc bf8dbd54
Call Trace:
 [<c10fa54b>] drm_lock+0x255/0x2de
 [<c10ff28d>] mga_dma_buffers+0x0/0x2e3
 [<c10f87fc>] drm_ioctl+0x142/0x18a
 [<c11b53f6>] preempt_schedule+0x4e/0x5a
 [<c10f86ba>] drm_ioctl+0x0/0x18a
 [<c10f86ba>] drm_ioctl+0x0/0x18a
 [<c105b0d7>] do_ioctl+0x87/0x9f
 [<c105b32c>] vfs_ioctl+0x23d/0x250
 [<c11b52a9>] schedule+0x23b/0x2e6
 [<c11b533e>] schedule+0x2d0/0x2e6
 [<c105b372>] sys_ioctl+0x33/0x4d
 [<c1003d1e>] syscall_call+0x7/0xb
 =======================
Code: 9a 08 03 00 00 8b 73 30 74 14 c7 44 24 04 28 76 1c c1 c7 04 24 49 51 23 c1 e8 b0 74 f1 ff 8b 83 d8 00 00 00 83 3d 1c 47 30 c1 00 <8b> 40 10 8b a8 58 1e 00 00 8b 43 28 8b b8 64 01 00 00 74 32 8b
EIP: [<c10ff416>] mga_dma_buffers+0x189/0x2e3 SS:ESP 0068:e0552ec0

[drm:drm_release] *ERROR* Device busy: 2 0

Rene.
* Re: CFS review 2007-08-29 19:56 ` Rene Herman @ 2007-08-30 7:05 ` Rene Herman 2007-08-30 7:20 ` Ingo Molnar 2007-08-31 6:46 ` Tilman Sauerbeck 2007-08-30 16:06 ` Chuck Ebbert 1 sibling, 2 replies; 535+ messages in thread From: Rene Herman @ 2007-08-30 7:05 UTC (permalink / raw) To: keith.packard Cc: Ingo Molnar, Al Boldi, Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel, airlied, dri-devel On 08/29/2007 09:56 PM, Rene Herman wrote: Realised the BUGs may mean the kernel DRM people could want to be in CC... > On 08/29/2007 05:57 PM, Keith Packard wrote: > >> With X server 1.3, I'm getting consistent crashes with two glxgear >> instances running. So, if you're getting any output, it's better than my >> situation. > > Before people focuss on software rendering too much -- also with 1.3.0 > (and a Matrox Millenium G550 AGP, 32M) glxgears also works decidedly > crummy using hardware rendering. While I can move the glxgears window > itself, the actual spinning wheels stay in the upper-left corner of the > screen and the movement leaves a non-repainting trace on the screen. > Running a second instance of glxgears in addition seems to make both > instances unkillable -- and when I just now forcefully killed X in this > situation (the spinning wheels were covering the upper left corner of all > my desktops) I got the below. > > Kernel is 2.6.22.5-cfs-v20.5, schedule() is in the traces (but that may be > expected anyway). 
> > BUG: unable to handle kernel NULL pointer dereference at virtual address > 00000010 > printing eip: > c10ff416 > *pde = 00000000 > Oops: 0000 [#1] > PREEMPT > Modules linked in: nfsd exportfs lockd nfs_acl sunrpc nls_iso8859_1 > nls_cp437 vfat fat nls_base > CPU: 0 > EIP: 0060:[<c10ff416>] Not tainted VLI > EFLAGS: 00210246 (2.6.22.5-cfs-v20.5-local #5) > EIP is at mga_dma_buffers+0x189/0x2e3 > eax: 00000000 ebx: efd07200 ecx: 00000001 edx: efc32c00 > esi: 00000000 edi: c12756cc ebp: dfea44c0 esp: dddaaec0 > ds: 007b es: 007b fs: 0000 gs: 0033 ss: 0068 > Process glxgears (pid: 1775, ti=dddaa000 task=e9daca60 task.ti=dddaa000) > Stack: efc32c00 00000000 00000004 e4c3bd20 c10fa54b e4c3bd20 efc32c00 > 00000000 > 00000004 00000000 00000000 00000000 00000000 00000001 00010000 > bfbdb8bc > bfbdb8b8 00000000 c10ff28d 00000029 c12756cc dfea44c0 c10f87fc > bfbdb844 > Call Trace: > [<c10fa54b>] drm_lock+0x255/0x2de > [<c10ff28d>] mga_dma_buffers+0x0/0x2e3 > [<c10f87fc>] drm_ioctl+0x142/0x18a > [<c1005973>] do_IRQ+0x97/0xb0 > [<c10f86ba>] drm_ioctl+0x0/0x18a > [<c10f86ba>] drm_ioctl+0x0/0x18a > [<c105b0d7>] do_ioctl+0x87/0x9f > [<c105b32c>] vfs_ioctl+0x23d/0x250 > [<c11b533e>] schedule+0x2d0/0x2e6 > [<c105b372>] sys_ioctl+0x33/0x4d > [<c1003d1e>] syscall_call+0x7/0xb > ======================= > Code: 9a 08 03 00 00 8b 73 30 74 14 c7 44 24 04 28 76 1c c1 c7 04 24 49 > 51 23 c1 e8 b0 74 f1 ff 8b 83 d8 00 00 00 83 3d 1c 47 30 c1 00 <8b> 40 > 10 8b a8 58 1e 00 00 8b 43 28 8b b8 64 01 00 00 74 32 8b > EIP: [<c10ff416>] mga_dma_buffers+0x189/0x2e3 SS:ESP 0068:dddaaec0 > BUG: unable to handle kernel NULL pointer dereference at virtual address > 00000010 > printing eip: > c10ff416 > *pde = 00000000 > Oops: 0000 [#2] > PREEMPT > Modules linked in: nfsd exportfs lockd nfs_acl sunrpc nls_iso8859_1 > nls_cp437 vfat fat nls_base > CPU: 0 > EIP: 0060:[<c10ff416>] Not tainted VLI > EFLAGS: 00210246 (2.6.22.5-cfs-v20.5-local #5) > EIP is at mga_dma_buffers+0x189/0x2e3 > eax: 00000000 
ebx: efd07200 ecx: 00000001 edx: efc32c00 > esi: 00000000 edi: c12756cc ebp: dfea4780 esp: e0552ec0 > ds: 007b es: 007b fs: 0000 gs: 0033 ss: 0068 > Process glxgears (pid: 1776, ti=e0552000 task=c19ec000 task.ti=e0552000) > Stack: efc32c00 00000000 00000003 efc64b40 c10fa54b efc64b40 efc32c00 > 00000000 > 00000003 00000000 00000000 00000000 00000000 00000001 00010000 > bf8dbdcc > bf8dbdc8 00000000 c10ff28d 00000029 c12756cc dfea4780 c10f87fc > bf8dbd54 > Call Trace: > [<c10fa54b>] drm_lock+0x255/0x2de > [<c10ff28d>] mga_dma_buffers+0x0/0x2e3 > [<c10f87fc>] drm_ioctl+0x142/0x18a > [<c11b53f6>] preempt_schedule+0x4e/0x5a > [<c10f86ba>] drm_ioctl+0x0/0x18a > [<c10f86ba>] drm_ioctl+0x0/0x18a > [<c105b0d7>] do_ioctl+0x87/0x9f > [<c105b32c>] vfs_ioctl+0x23d/0x250 > [<c11b52a9>] schedule+0x23b/0x2e6 > [<c11b533e>] schedule+0x2d0/0x2e6 > [<c105b372>] sys_ioctl+0x33/0x4d > [<c1003d1e>] syscall_call+0x7/0xb > ======================= > Code: 9a 08 03 00 00 8b 73 30 74 14 c7 44 24 04 28 76 1c c1 c7 04 24 49 > 51 23 c1 e8 b0 74 f1 ff 8b 83 d8 00 00 00 83 3d 1c 47 30 c1 00 <8b> 40 > 10 8b a8 58 1e 00 00 8b 43 28 8b b8 64 01 00 00 74 32 8b > EIP: [<c10ff416>] mga_dma_buffers+0x189/0x2e3 SS:ESP 0068:e0552ec0 > [drm:drm_release] *ERROR* Device busy: 2 0 Rene. ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review
  2007-08-30  7:05 ` Rene Herman
@ 2007-08-30  7:20 ` Ingo Molnar
  2007-08-31  6:46 ` Tilman Sauerbeck
  1 sibling, 0 replies; 535+ messages in thread

From: Ingo Molnar @ 2007-08-30 7:20 UTC (permalink / raw)
To: Rene Herman
Cc: keith.packard, Al Boldi, Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel, airlied, dri-devel

* Rene Herman <rene.herman@gmail.com> wrote:

> Realised the BUGs may mean the kernel DRM people could want to be in CC...

and note that the schedule() call in there is not part of the crash backtrace:

> > Call Trace:
> >  [<c10fa54b>] drm_lock+0x255/0x2de
> >  [<c10ff28d>] mga_dma_buffers+0x0/0x2e3
> >  [<c10f87fc>] drm_ioctl+0x142/0x18a
> >  [<c1005973>] do_IRQ+0x97/0xb0
> >  [<c10f86ba>] drm_ioctl+0x0/0x18a
> >  [<c10f86ba>] drm_ioctl+0x0/0x18a
> >  [<c105b0d7>] do_ioctl+0x87/0x9f
> >  [<c105b32c>] vfs_ioctl+0x23d/0x250
> >  [<c11b533e>] schedule+0x2d0/0x2e6
> >  [<c105b372>] sys_ioctl+0x33/0x4d
> >  [<c1003d1e>] syscall_call+0x7/0xb

it just happened to be on the kernel stack. Nor is the do_IRQ() entry real. Both are frequent functions (and were executed recently), which is why they were still in the stack frame.

	Ingo

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-30 7:05 ` Rene Herman 2007-08-30 7:20 ` Ingo Molnar @ 2007-08-31 6:46 ` Tilman Sauerbeck 1 sibling, 0 replies; 535+ messages in thread From: Tilman Sauerbeck @ 2007-08-31 6:46 UTC (permalink / raw) To: Rene Herman Cc: keith.packard, Al Boldi, Peter Zijlstra, Mike Galbraith, linux-kernel, airlied, Ingo Molnar, dri-devel, Linus Torvalds, Andrew Morton [-- Attachment #1: Type: text/plain, Size: 1638 bytes --] Rene Herman [2007-08-30 09:05]: > On 08/29/2007 09:56 PM, Rene Herman wrote: > > Realised the BUGs may mean the kernel DRM people could want to be in CC... > > > On 08/29/2007 05:57 PM, Keith Packard wrote: > > > >> With X server 1.3, I'm getting consistent crashes with two glxgear > >> instances running. So, if you're getting any output, it's better than my > >> situation. > > > > Before people focuss on software rendering too much -- also with 1.3.0 > > (and a Matrox Millenium G550 AGP, 32M) glxgears also works decidedly > > crummy using hardware rendering. While I can move the glxgears window > > itself, the actual spinning wheels stay in the upper-left corner of the > > screen and the movement leaves a non-repainting trace on the screen. This sounds like you're running an older version of Mesa. The bugfix went into Mesa 6.3 and 7.0. > > Running a second instance of glxgears in addition seems to make both > > instances unkillable -- and when I just now forcefully killed X in this > > situation (the spinning wheels were covering the upper left corner of all > > my desktops) I got the below. Running two instances of glxgears and killing them works for me, too. I'm using xorg-server 1.3.0.0, Mesa 7.0.1 with the latest DRM bits from http://gitweb.freedesktop.org/?p=mesa/drm.git;a=summary I'm not running CFS though, but I guess the oops wasn't related to that. Regards, Tilman -- A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing? A: Top-posting. 
Q: What is the most annoying thing on usenet and in e-mail? [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-29 19:56 ` Rene Herman 2007-08-30 7:05 ` Rene Herman @ 2007-08-30 16:06 ` Chuck Ebbert 2007-08-30 16:48 ` Rene Herman 1 sibling, 1 reply; 535+ messages in thread From: Chuck Ebbert @ 2007-08-30 16:06 UTC (permalink / raw) To: Rene Herman Cc: keith.packard, Ingo Molnar, Al Boldi, Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel, Dave Airlie On 08/29/2007 03:56 PM, Rene Herman wrote: > > Before people focuss on software rendering too much -- also with 1.3.0 (and > a Matrox Millenium G550 AGP, 32M) glxgears also works decidedly crummy > using > hardware rendering. While I can move the glxgears window itself, the actual > spinning wheels stay in the upper-left corner of the screen and the > movement > leaves a non-repainting trace on the screen. Running a second instance of > glxgears in addition seems to make both instances unkillable -- and when > I just now forcefully killed X in this situation (the spinning wheels were > covering the upper left corner of all my desktops) I got the below. > > Kernel is 2.6.22.5-cfs-v20.5, schedule() is in the traces (but that may be > expected anyway). > And this doesn't happen at all with the stock scheduler? (Just confirming, in case you didn't compare.) > BUG: unable to handle kernel NULL pointer dereference at virtual address > 00000010 > printing eip: > c10ff416 > *pde = 00000000 > Oops: 0000 [#1] > PREEMPT Try it without preempt? 
> Modules linked in: nfsd exportfs lockd nfs_acl sunrpc nls_iso8859_1 > nls_cp437 vfat fat nls_base > CPU: 0 > EIP: 0060:[<c10ff416>] Not tainted VLI > EFLAGS: 00210246 (2.6.22.5-cfs-v20.5-local #5) > EIP is at mga_dma_buffers+0x189/0x2e3 > eax: 00000000 ebx: efd07200 ecx: 00000001 edx: efc32c00 > esi: 00000000 edi: c12756cc ebp: dfea44c0 esp: dddaaec0 > ds: 007b es: 007b fs: 0000 gs: 0033 ss: 0068 > Process glxgears (pid: 1775, ti=dddaa000 task=e9daca60 task.ti=dddaa000) > Stack: efc32c00 00000000 00000004 e4c3bd20 c10fa54b e4c3bd20 efc32c00 > 00000000 > 00000004 00000000 00000000 00000000 00000000 00000001 00010000 > bfbdb8bc > bfbdb8b8 00000000 c10ff28d 00000029 c12756cc dfea44c0 c10f87fc > bfbdb844 > Call Trace: > [<c10fa54b>] drm_lock+0x255/0x2de > [<c10ff28d>] mga_dma_buffers+0x0/0x2e3 > [<c10f87fc>] drm_ioctl+0x142/0x18a > [<c1005973>] do_IRQ+0x97/0xb0 > [<c10f86ba>] drm_ioctl+0x0/0x18a > [<c10f86ba>] drm_ioctl+0x0/0x18a > [<c105b0d7>] do_ioctl+0x87/0x9f > [<c105b32c>] vfs_ioctl+0x23d/0x250 > [<c11b533e>] schedule+0x2d0/0x2e6 > [<c105b372>] sys_ioctl+0x33/0x4d > [<c1003d1e>] syscall_call+0x7/0xb > ======================= > Code: 9a 08 03 00 00 8b 73 30 74 14 c7 44 24 04 28 76 1c c1 c7 04 24 49 > 51 23 c1 e8 b0 74 f1 ff 8b 83 d8 00 00 00 83 3d 1c 47 30 c1 00 <8b> 40 > 10 8b a8 58 1e 00 00 8b 43 28 8b b8 64 01 00 00 74 32 8b > EIP: [<c10ff416>] mga_dma_buffers+0x189/0x2e3 SS:ESP 0068:dddaaec0 dev->dev_private->mmio is NULL when trying to access mmio.handle ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-30 16:06 ` Chuck Ebbert @ 2007-08-30 16:48 ` Rene Herman 0 siblings, 0 replies; 535+ messages in thread From: Rene Herman @ 2007-08-30 16:48 UTC (permalink / raw) To: Chuck Ebbert Cc: keith.packard, Ingo Molnar, Al Boldi, Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel, Dave Airlie On 08/30/2007 06:06 PM, Chuck Ebbert wrote: > On 08/29/2007 03:56 PM, Rene Herman wrote: >> Before people focuss on software rendering too much -- also with 1.3.0 >> (and a Matrox Millenium G550 AGP, 32M) glxgears also works decidedly >> crummy using hardware rendering. While I can move the glxgears window >> itself, the actual spinning wheels stay in the upper-left corner of the >> screen and the movement leaves a non-repainting trace on the screen. >> Running a second instance of glxgears in addition seems to make both >> instances unkillable -- and when I just now forcefully killed X in this >> situation (the spinning wheels were covering the upper left corner of >> all my desktops) I got the below. >> >> Kernel is 2.6.22.5-cfs-v20.5, schedule() is in the traces (but that may >> be expected anyway). > And this doesn't happen at all with the stock scheduler? (Just confirming, > in case you didn't compare.) I didn't compare -- it no doubt will. I know the title of this thread is "CFS review" but it turned into Keith Packard noticing glxgears being broken on recent-ish X.org. The start of the thread was about things being broken using _software_ rendering though, so I thought it might be useful to remark/report glxgears also being quite broken using hardware rendering on my setup at least. >> BUG: unable to handle kernel NULL pointer dereference at virtual address >> 00000010 >> printing eip: >> c10ff416 >> *pde = 00000000 >> Oops: 0000 [#1] >> PREEMPT > > Try it without preempt? 
If you're asking in an "I'll go debug the DRM" way, I'll go dig a bit later (please say), but if you are only interested in the thread due to CFS, note that I'm aware it's not likely to have anything to do with CFS.

It's not reproducible for you? (Full description of the bug above.)

Rene.

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-29 4:18 ` Ingo Molnar 2007-08-29 4:29 ` Keith Packard @ 2007-08-29 4:40 ` Mike Galbraith 1 sibling, 0 replies; 535+ messages in thread From: Mike Galbraith @ 2007-08-29 4:40 UTC (permalink / raw) To: Ingo Molnar Cc: Al Boldi, Peter Zijlstra, Andrew Morton, Linus Torvalds, linux-kernel, Keith Packard On Wed, 2007-08-29 at 06:18 +0200, Ingo Molnar wrote: > * Al Boldi <a1426z@gawab.com> wrote: > > > No need for framebuffer. All you need is X using the X.org > > vesa-driver. Then start gears like this: > > > > # gears & gears & gears & > > > > Then lay them out side by side to see the periodic stallings for > > ~10sec. > > i just tried something similar (by adding Option "NoDRI" to xorg.conf) > and i'm wondering how it can be smooth on vesa-driver at all. I tested > it on a Core2Duo box and software rendering manages to do about 3 frames > per second. (although glxgears itself thinks it does ~600 fps) If i > start 3 glxgears then they do ~1 frame per second each. This is on > Fedora 7 with xorg-x11-server-Xorg-1.3.0.0-9.fc7 and > xorg-x11-drv-i810-2.0.0-4.fc7. At least you can run the darn test... the third instance of glxgears here means say bye bye to GUI instantly. -Mike ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review
  2007-08-25 23:15 ` Ingo Molnar
  2007-08-26 16:27 ` Al Boldi
@ 2007-08-29  3:42 ` Bill Davidsen
  1 sibling, 0 replies; 535+ messages in thread

From: Bill Davidsen @ 2007-08-29 3:42 UTC (permalink / raw)
To: Ingo Molnar
Cc: Al Boldi, Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel

Ingo Molnar wrote:
> * Al Boldi <a1426z@gawab.com> wrote:
>
>>> ok. I think i might finally have found the bug causing this. Could
>>> you try the fix below, does your webserver thread-startup test work
>>> any better?
>> It seems to help somewhat, but the problem is still visible. Even
>> v20.3 on 2.6.22.5 didn't help.
>>
>> It does look related to ia-boosting, so I turned off __update_curr
>> like Roman mentioned, which had an enormous smoothing effect, but then
>> nice levels completely break down and lockup the system.
>
> you can turn sleeper-fairness off via:
>
>     echo 28 > /proc/sys/kernel/sched_features
>
> another thing to try would be:
>
>     echo 12 > /proc/sys/kernel/sched_features

14, and drop the granularity to 500000.

>
> (that's the new-task penalty turned off.)
>
> Another thing to try would be to edit this:
>
>     if (sysctl_sched_features & SCHED_FEAT_START_DEBIT)
>         p->se.wait_runtime = -(sched_granularity(cfs_rq) / 2);
>
> to:
>
>     if (sysctl_sched_features & SCHED_FEAT_START_DEBIT)
>         p->se.wait_runtime = -(sched_granularity(cfs_rq));
>
> and could you also check 20.4 on 2.6.22.5 perhaps, or very latest -git?
> (Peter has experienced smaller spikes with that.)
>
>     Ingo

-- 
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot

^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-25 22:27 ` Al Boldi 2007-08-25 23:15 ` Ingo Molnar @ 2007-08-29 3:37 ` Bill Davidsen 2007-08-29 3:45 ` Ingo Molnar 1 sibling, 1 reply; 535+ messages in thread From: Bill Davidsen @ 2007-08-29 3:37 UTC (permalink / raw) To: Al Boldi Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel Al Boldi wrote: > Ingo Molnar wrote: >> * Al Boldi <a1426z@gawab.com> wrote: >>>>> The problem is that consecutive runs don't give consistent results >>>>> and sometimes stalls. You may want to try that. >>>> well, there's a natural saturation point after a few hundred tasks >>>> (depending on your CPU's speed), at which point there's no idle time >>>> left. From that point on things get slower progressively (and the >>>> ability of the shell to start new ping tasks is impacted as well), >>>> but that's expected on an overloaded system, isnt it? >>> Of course, things should get slower with higher load, but it should be >>> consistent without stalls. >>> >>> To see this problem, make sure you boot into /bin/sh with the normal >>> VGA console (ie. not fb-console). Then try each loop a few times to >>> show different behaviour; loops like: >>> >>> # for ((i=0; i<3333; i++)); do ping 10.1 -A > /dev/null & done >>> >>> # for ((i=0; i<3333; i++)); do nice -99 ping 10.1 -A > /dev/null & done >>> >>> # { for ((i=0; i<3333; i++)); do >>> ping 10.1 -A > /dev/null & >>> done } > /dev/null 2>&1 >>> >>> Especially the last one sometimes causes a complete console lock-up, >>> while the other two sometimes stall then surge periodically. >> ok. I think i might finally have found the bug causing this. Could you >> try the fix below, does your webserver thread-startup test work any >> better? > > It seems to help somewhat, but the problem is still visible. Even v20.3 on > 2.6.22.5 didn't help. 
> > It does look related to ia-boosting, so I turned off __update_curr like Roman > mentioned, which had an enormous smoothing effect, but then nice levels > completely break down and lockup the system. > > There is another way to show the problem visually under X (vesa-driver), by > starting 3 gears simultaneously, which after laying them out side-by-side > need some settling time before smoothing out. Without __update_curr it's > absolutely smooth from the start. I posted a LOT of stuff using the glitch1 script, and finally found a set of tuning values which make the test script run smooth. See back posts, I don't have them here. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-29 3:37 ` Bill Davidsen @ 2007-08-29 3:45 ` Ingo Molnar 2007-08-29 13:11 ` Bill Davidsen 0 siblings, 1 reply; 535+ messages in thread From: Ingo Molnar @ 2007-08-29 3:45 UTC (permalink / raw) To: Bill Davidsen Cc: Al Boldi, Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel * Bill Davidsen <davidsen@tmr.com> wrote: > > There is another way to show the problem visually under X > > (vesa-driver), by starting 3 gears simultaneously, which after > > laying them out side-by-side need some settling time before > > smoothing out. Without __update_curr it's absolutely smooth from > > the start. > > I posted a LOT of stuff using the glitch1 script, and finally found a > set of tuning values which make the test script run smooth. See back > posts, I don't have them here. but you have real 3D hw and DRI enabled, correct? In that case X uses up almost no CPU time and glxgears makes most of the processing. That is quite different from the above software-rendering case, where X spends most of the CPU time. Ingo ^ permalink raw reply [flat|nested] 535+ messages in thread
* Re: CFS review 2007-08-29 3:45 ` Ingo Molnar @ 2007-08-29 13:11 ` Bill Davidsen 0 siblings, 0 replies; 535+ messages in thread From: Bill Davidsen @ 2007-08-29 13:11 UTC (permalink / raw) To: Ingo Molnar Cc: Al Boldi, Peter Zijlstra, Mike Galbraith, Andrew Morton, Linus Torvalds, linux-kernel Ingo Molnar wrote: > * Bill Davidsen <davidsen@tmr.com> wrote: > > >>> There is another way to show the problem visually under X >>> (vesa-driver), by starting 3 gears simultaneously, which after >>> laying them out side-by-side need some settling time before >>> smoothing out. Without __update_curr it's absolutely smooth from >>> the start. >>> >> I posted a LOT of stuff using the glitch1 script, and finally found a >> set of tuning values which make the test script run smooth. See back >> posts, I don't have them here. >> > > but you have real 3D hw and DRI enabled, correct? In that case X uses up > almost no CPU time and glxgears makes most of the processing. That is > quite different from the above software-rendering case, where X spends > most of the CPU time. > No, my test machine for that is a compile server, and uses the built-in motherboard graphics which are very limited. This is not in any sense a graphics powerhouse, it is used to build custom kernels and applications, and for testing of kvm and xen, and I grabbed it because it had the only Core2 CPU I could reboot to try new kernel versions and "from cold boot" testing, discovered the graphics smoothness issue by having several windows open on compiles, and developed the glitch1 script as a way to reproduce it. The settings I used, features=14, granularity=500000, work to improve smoothness on other machines for other uses, but they do seem to impact performance for compiles, video processing, etc, so they are not optimal for general use. I regard the existence of these tuning knobs as one of the real strengths of CFS, when you change the tuning it has a visible effect. 
-- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 535+ messages in thread
Peter Zijlstra 2007-07-14 5:04 ` x86 status was Re: -mm merge plans for 2.6.23 Mike Galbraith 2007-08-01 3:41 ` CFS review Roman Zippel 2007-08-01 7:12 ` Ingo Molnar 2007-08-01 7:26 ` Mike Galbraith 2007-08-01 7:30 ` Ingo Molnar 2007-08-01 7:36 ` Mike Galbraith 2007-08-01 8:49 ` Mike Galbraith 2007-08-01 13:19 ` Roman Zippel 2007-08-01 15:07 ` Ingo Molnar 2007-08-01 17:10 ` Andi Kleen 2007-08-01 16:27 ` Linus Torvalds 2007-08-01 17:48 ` Andi Kleen 2007-08-01 17:50 ` Ingo Molnar 2007-08-01 18:01 ` Roman Zippel 2007-08-01 19:05 ` Ingo Molnar 2007-08-09 23:14 ` Roman Zippel 2007-08-10 5:49 ` Ingo Molnar 2007-08-10 13:52 ` Roman Zippel 2007-08-10 14:18 ` Ingo Molnar 2007-08-10 16:47 ` Mike Galbraith 2007-08-10 17:19 ` Roman Zippel 2007-08-10 16:54 ` Michael Chang 2007-08-10 17:25 ` Roman Zippel 2007-08-10 19:44 ` Ingo Molnar 2007-08-10 19:47 ` Willy Tarreau 2007-08-10 21:15 ` Roman Zippel 2007-08-10 21:36 ` Ingo Molnar 2007-08-10 22:50 ` Roman Zippel 2007-08-11 5:28 ` Willy Tarreau 2007-08-12 5:17 ` Ingo Molnar 2007-08-11 0:30 ` Ingo Molnar 2007-08-20 22:19 ` Roman Zippel 2007-08-21 7:33 ` Mike Galbraith 2007-08-21 8:35 ` Ingo Molnar 2007-08-21 11:54 ` Roman Zippel 2007-08-11 5:15 ` Willy Tarreau 2007-08-10 7:23 ` Mike Galbraith 2007-08-01 11:22 ` Ingo Molnar 2007-08-01 12:21 ` Roman Zippel 2007-08-01 12:23 ` Ingo Molnar 2007-08-01 13:59 ` Ingo Molnar 2007-08-01 14:04 ` Arjan van de Ven 2007-08-01 15:44 ` Roman Zippel 2007-08-01 17:41 ` Ingo Molnar 2007-08-01 18:14 ` Roman Zippel 2007-08-03 3:04 ` Matt Mackall 2007-08-03 3:57 ` Arjan van de Ven 2007-08-03 4:18 ` Willy Tarreau 2007-08-03 4:31 ` Arjan van de Ven 2007-08-03 4:53 ` Willy Tarreau 2007-08-03 4:38 ` Matt Mackall 2007-08-03 8:44 ` Ingo Molnar 2007-08-03 9:29 ` Andi Kleen 2007-08-01 11:37 ` Ingo Molnar 2007-08-01 12:27 ` Roman Zippel 2007-08-01 13:20 ` Andi Kleen 2007-08-01 13:33 ` Roman Zippel 2007-08-01 14:36 ` Ingo Molnar 2007-08-01 16:11 ` Andi Kleen 2007-08-02 2:17 ` Linus Torvalds 2007-08-02 4:57 ` Willy 
Tarreau 2007-08-02 10:43 ` Andi Kleen 2007-08-02 10:07 ` Willy Tarreau 2007-08-02 16:09 ` Ingo Molnar 2007-08-02 22:38 ` Roman Zippel 2007-08-02 19:16 ` Daniel Phillips 2007-08-02 23:23 ` Roman Zippel 2007-08-01 14:40 ` Ingo Molnar 2007-08-01 14:49 ` Peter Zijlstra 2007-08-02 17:36 ` Roman Zippel 2007-08-02 15:46 ` Ingo Molnar 2007-07-11 21:42 ` x86 status was Re: -mm merge plans for 2.6.23 Linus Torvalds 2007-07-11 22:04 ` Thomas Gleixner 2007-07-11 22:20 ` Linus Torvalds 2007-07-11 22:50 ` Thomas Gleixner 2007-07-11 23:03 ` Chris Wright 2007-07-11 23:07 ` Linus Torvalds 2007-07-11 23:29 ` Thomas Gleixner 2007-07-11 23:36 ` Andi Kleen 2007-07-11 23:48 ` Thomas Gleixner 2007-07-11 23:58 ` Ingo Molnar 2007-07-12 0:07 ` Andi Kleen 2007-07-12 0:15 ` Chris Wright 2007-07-12 0:18 ` Ingo Molnar 2007-07-12 0:37 ` Andi Kleen 2007-07-12 20:38 ` Matt Mackall 2007-07-11 22:51 ` Chris Wright 2007-07-11 22:58 ` Linus Torvalds 2007-07-12 2:53 ` Arjan van de Ven 2007-07-11 23:19 ` Ingo Molnar 2007-07-11 23:45 ` Linus Torvalds 2007-07-11 18:14 ` Jeremy Fitzhardinge 2007-07-12 19:33 ` Christoph Lameter 2007-07-12 20:38 ` Andi Kleen 2007-07-11 23:03 ` generic clockevents/ (hr)time(r) patches " Thomas Gleixner 2007-07-11 23:57 ` Andrew Morton 2007-07-12 0:04 ` Thomas Gleixner 2007-07-12 0:17 ` [stable] " Chris Wright 2007-07-12 0:43 ` Andi Kleen 2007-07-12 0:46 ` [stable] " Chris Wright 2007-07-11 23:59 ` Andi Kleen 2007-07-12 0:33 ` Andrew Morton 2007-07-12 0:54 ` fault vs invalidate race (Re: -mm merge plans for 2.6.23) Nick Piggin 2007-07-12 2:31 ` block_page_mkwrite? 
(Re: fault vs invalidate race (Re: -mm merge plans for 2.6.23)) David Chinner 2007-07-12 2:42 ` Nick Piggin 2007-07-13 9:46 ` -mm merge plans for 2.6.23 Jan Engelhardt 2007-07-13 23:09 ` Tilman Schmidt 2007-07-14 10:02 ` Jan Engelhardt [not found] ` <20070715131144.3467DFC040@xenon.ts.pxnet.com> 2007-07-18 18:18 ` [PATCH] Use menuconfig objects - CONFIG_ISDN_I4L [v2] Jan Engelhardt 2007-07-18 18:22 ` [more PATCHes] Use menuconfig objects - CONFIG_ISDN_I4L Jan Engelhardt 2007-07-18 18:23 ` [patch 1/2] Use menuconfig objects - ISDN Jan Engelhardt 2007-07-18 18:23 ` [patch 2/2] Use menuconfig objects - ISDN/Gigaset Jan Engelhardt 2007-07-22 0:32 ` [more PATCHes] Use menuconfig objects - CONFIG_ISDN_I4L Tilman Schmidt 2007-07-17 8:55 ` unprivileged mounts (was: Re: -mm merge plans for 2.6.23) Andrew Morton 2007-08-11 10:44 CFS review Al Boldi 2007-08-12 4:17 ` Ingo Molnar 2007-08-12 15:27 ` Al Boldi 2007-08-12 15:52 ` Ingo Molnar 2007-08-12 19:43 ` Al Boldi 2007-08-21 10:58 ` Ingo Molnar 2007-08-21 22:27 ` Al Boldi 2007-08-24 13:45 ` Ingo Molnar 2007-08-25 22:27 ` Al Boldi 2007-08-25 23:15 ` Ingo Molnar 2007-08-26 16:27 ` Al Boldi 2007-08-26 16:39 ` Ingo Molnar 2007-08-27 4:06 ` Al Boldi 2007-08-27 10:53 ` Ingo Molnar 2007-08-27 14:46 ` Al Boldi 2007-08-27 20:41 ` Ingo Molnar 2007-08-28 4:37 ` Al Boldi 2007-08-28 5:05 ` Linus Torvalds 2007-08-28 5:23 ` Al Boldi 2007-08-28 7:28 ` Mike Galbraith 2007-08-28 7:36 ` Ingo Molnar 2007-08-28 16:34 ` Linus Torvalds 2007-08-28 16:44 ` Arjan van de Ven 2007-08-28 16:45 ` Ingo Molnar 2007-08-29 4:19 ` Al Boldi 2007-08-29 4:53 ` Ingo Molnar 2007-08-29 5:58 ` Al Boldi 2007-08-29 6:43 ` Ingo Molnar 2007-08-28 20:46 ` Valdis.Kletnieks 2007-08-28 7:43 ` Xavier Bestel 2007-08-28 8:02 ` Ingo Molnar 2007-08-28 19:19 ` Willy Tarreau 2007-08-28 19:55 ` Ingo Molnar 2007-08-29 4:18 ` Ingo Molnar 2007-08-29 4:29 ` Keith Packard 2007-08-29 4:46 ` Ingo Molnar 2007-08-29 7:57 ` Keith Packard 2007-08-29 8:04 ` Ingo Molnar 2007-08-29 8:53 ` Al 
Boldi 2007-08-29 15:57 ` Keith Packard 2007-08-29 19:56 ` Rene Herman 2007-08-30 7:05 ` Rene Herman 2007-08-30 7:20 ` Ingo Molnar 2007-08-31 6:46 ` Tilman Sauerbeck 2007-08-30 16:06 ` Chuck Ebbert 2007-08-30 16:48 ` Rene Herman 2007-08-29 4:40 ` Mike Galbraith 2007-08-29 3:42 ` Bill Davidsen 2007-08-29 3:37 ` Bill Davidsen 2007-08-29 3:45 ` Ingo Molnar 2007-08-29 13:11 ` Bill Davidsen
This is a public inbox; see mirroring instructions for how to clone and mirror all data and code used for this inbox, as well as URLs for NNTP newsgroup(s).