* linux-next: Tree for Jun 1
@ 2016-06-01 3:11 Stephen Rothwell
  2016-06-02 1:48 ` [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup Sergey Senozhatsky
  0 siblings, 1 reply; 28+ messages in thread
From: Stephen Rothwell @ 2016-06-01 3:11 UTC
  To: linux-next; +Cc: linux-kernel

Hi all,

Changes since 20160531:

My fixes tree contains:

  of: silence warnings due to max() usage

The arm tree gained a conflict against Linus' tree.

Non-merge commits (relative to Linus' tree): 1100
 936 files changed, 38159 insertions(+), 17475 deletions(-)

----------------------------------------------------------------------------

I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ). If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one. You should use "git fetch" and checkout or reset to the new
master.

You can see which trees have been included by looking in the Next/Trees
file in the source. There are also quilt-import.log and merge.log
files in the Next directory. Between each merge, the tree was built
with a ppc64_defconfig for powerpc and an allmodconfig (with
CONFIG_BUILD_DOCSRC=n) for x86_64, a multi_v7_defconfig for arm and a
native build of tools/perf. After the final fixups (if any), I do an
x86_64 modules_install followed by builds for x86_64 allnoconfig,
powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig
(this fails its final link) and pseries_le_defconfig and i386, sparc
and sparc64 defconfig.

Below is a summary of the state of the merge.

I am currently merging 236 trees (counting Linus' and 35 trees of patches
pending for Linus' tree).

Stats about the size of the tree over time can be seen at
http://neuling.org/linux-next-size.html .

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next . If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.

Thanks to Randy Dunlap for doing many randconfig builds. And to Paul
Gortmaker for triage and bug fixes.

-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (367d3fd50566 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux)
Merging fixes/master (b31033aacbd0 of: silence warnings due to max() usage)
Merging kbuild-current/rc-fixes (3d1450d54a4f Makefile: Force gzip and xz on module install)
Merging arc-current/for-curr (49acadff2a0c arc: Get rid of root core-frequency property)
Merging arm-current/fixes (85c42e89f312 ARM: fix PTRACE_SETVFPREGS on SMP systems)
Merging m68k-current/for-linus (9a6462763b17 m68k/mvme16x: Include generic <linux/rtc.h>)
Merging metag-fixes/fixes (0164a711c97b metag: Fix ioremap_wc/ioremap_cached build errors)
Merging powerpc-fixes/fixes (8dd75ccb571f powerpc: Use privileged SPR number for MMCR2)
Merging powerpc-merge-mpe/fixes (bc0195aad0da Linux 4.2-rc2)
Merging sparc/master (7cafc0b8bf13 sparc64: Fix return from trap window fill crashes.)
Merging net/master (d69d16949346 usbnet: smsc95xx: fix link detection for disabled autonegotiation)
Merging ipsec/master (d6af1a31cc72 vti: Add pmtu handling to vti_xmit.)
Merging ipvs/master (f28f20da704d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net)
Merging wireless-drivers/master (de26859dcf36 rtlwifi: Fix scheduling while atomic error from commit 49f86ec21c01)
Merging mac80211/master (6fe04128f158 mac80211: fix fast_tx header alignment)
Merging sound-current/for-linus (0358ccc8ffd8 ALSA: uapi: Add three missing header files to Kbuild file)
Merging pci-current/for-linus (1a695a905c18 Linux 4.7-rc1)
Merging driver-core.current/driver-core-linus (1a695a905c18 Linux 4.7-rc1)
Merging tty.current/tty-linus (1a695a905c18 Linux 4.7-rc1)
Merging usb.current/usb-linus (1a695a905c18 Linux 4.7-rc1)
Merging usb-gadget-fixes/fixes (27a0faafdca5 usb: dwc3: st: Fix USB_DR_MODE_PERIPHERAL configuration.)
Merging usb-serial-fixes/usb-linus (74d2a91aec97 USB: serial: option: add even more ZTE device ids)
Merging usb-chipidea-fixes/ci-for-usb-stable (d144dfea8af7 usb: chipidea: otg: change workqueue ci_otg as freezable)
Merging staging.current/staging-linus (1a695a905c18 Linux 4.7-rc1)
Merging char-misc.current/char-misc-linus (1a695a905c18 Linux 4.7-rc1)
Merging input-current/for-linus (f49cf3b8b4c8 Input: pwm-beeper - fix - scheduling while atomic)
Merging crypto-current/master (ab6a11a7c8ef crypto: ccp - Fix AES XTS error for request sizes above 4096)
Merging ide/master (1993b176a822 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide)
Merging devicetree-current/devicetree/merge (f76502aa9140 of/dynamic: Fix test for PPC_PSERIES)
Merging rr-fixes/fixes (8244062ef1e5 modules: fix longstanding /proc/kallsyms vs module insertion race.)
Merging vfio-fixes/for-linus (089f1c6b2dae vfio/type1: Fix build warning)
Merging kselftest-fixes/fixes (505ce68c6da3 selftest/seccomp: Fix the seccomp(2) signature)
Merging backlight-fixes/for-backlight-fixes (68feaca0b13e backlight: pwm: Handle EPROBE_DEFER while requesting the PWM)
Merging ftrace-fixes/for-next-urgent (6224beb12e19 tracing: Have branch tracer use recursive field of task struct)
Merging mfd-fixes/for-mfd-fixes (1b52e50f2a40 mfd: max77843: Fix max77843_chg_init() return on error)
Merging drm-intel-fixes/for-linux-next-fixes (1a695a905c18 Linux 4.7-rc1)
Merging asm-generic/master (b0da6d44157a asm-generic: Drop renameat syscall from default list)
Merging arc/for-next (776d7f1694a7 arc: axs103_smp: Fix CPU frequency to 100MHz for dual-core)
Merging arm/for-next (6a606d21bc0d Merge branches 'component' and 'fixes' into for-next)
CONFLICT (content): Merge conflict in drivers/gpu/drm/rockchip/rockchip_drm_drv.c
Merging arm-perf/for-next/perf (4ba2578fa7b5 arm64: perf: don't expose CHAIN event in sysfs)
Merging arm-soc/for-next (d6be64b09dd1 Merge branch 'fixes' into for-next)
Merging at91/at91-next (5a0d7c6a48ae Merge branch 'at91-4.7-defconfig' into at91-next)
Merging bcm2835-dt/bcm2835-dt-next (6a93792774fc ARM: bcm2835: dt: Add the ethernet to the device trees)
Merging bcm2835-soc/bcm2835-soc-next (92e963f50fc7 Linux 4.5-rc1)
Merging bcm2835-drivers/bcm2835-drivers-next (92e963f50fc7 Linux 4.5-rc1)
Merging bcm2835-defconfig/bcm2835-defconfig-next (1a695a905c18 Linux 4.7-rc1)
Merging berlin/berlin/for-next (9a7e06833249 Merge branch 'berlin/fixes' into berlin/for-next)
Merging cortex-m/for-next (f719a0d6a854 ARM: efm32: switch to vendor,device compatible strings)
Merging imx-mxs/for-next (63b44471754b Merge branch 'imx/defconfig64' into for-next)
Merging keystone/next (02e15d234006 Merge branch 'for_4.7/kesytone' into next)
Merging mvebu/for-next (01316cded75b Merge branch 'mvebu/defconfig' into mvebu/for-next)
Merging omap/for-next (5c66191b5c76 Merge branch 'omap-for-v4.7/dt' into for-next)
Merging omap-pending/for-next (c20c8f750d9f ARM: OMAP2+: hwmod: fix _idle() hwmod state sanity check sequence)
Merging qcom/for-next (eb8e0105700b firmware: qcom_scm: Make core clock optional)
Merging renesas/next (1df83bd17bee Merge branches 'heads/arm64-dt-for-v4.8', 'heads/dt-for-v4.8', 'heads/soc-for-v4.8' and 'heads/sh-drivers-for-v4.8' into next)
Merging rockchip/for-next (bc64bf4164ed Merge branch 'v4.7-clk/fixes' into for-next)
Merging rpi/for-rpi-next (bc0195aad0da Linux 4.2-rc2)
Merging samsung/for-next (92e963f50fc7 Linux 4.5-rc1)
Merging samsung-krzk/for-next (b68cbd51dbe4 Merge branch 'for-v4.8/dts-exynos5410-odroid-xu' into for-next)
Merging tegra/for-next (5c282bc9d0a3 Merge branch for-4.7/defconfig into for-next)
Merging arm64/for-next/core (e6d9a5254333 arm64: do not enforce strict 16 byte alignment to stack pointer)
Merging blackfin/for-linus (391e74a51ea2 eth: bf609 eth clock: add pclk clock for stmmac driver probe)
CONFLICT (content): Merge conflict in arch/blackfin/mach-common/pm.c
Merging c6x/for-linux-next (ca3060d39ae7 c6x: Use generic clkdev.h header)
Merging cris/for-next (f9f3f864b5e8 cris: Fix section mismatches in architecture startup code)
Merging h8300/h8300-next (8cad489261c5 h8300: switch EARLYCON)
Merging hexagon/linux-next (02cc2ccfe771 Revert "Hexagon: fix signal.c compile error")
Merging ia64/next (787ca32dc704 ia64/unaligned: Silence another GCC warning about an uninitialised variable)
Merging m68k/for-next (9a6462763b17 m68k/mvme16x: Include generic <linux/rtc.h>)
Merging m68knommu/for-next (1a695a905c18 Linux 4.7-rc1)
Merging metag/for-next (592ddeeff8cb metag: Fix typos)
Merging microblaze/next (52e9e6e05617 microblaze: pci: export isa_io_base to fix link errors)
Merging mips/mips-for-linux-next (b02b1fbdd338 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi)
Merging nios2/for-next (9fa78f63a892 nios2: Add order-only DTC dependency to %.dtb target)
Merging parisc-hd/for-next (57f3ea7a3d6e parisc: Fix backtrace on PA-RISC)
Merging powerpc/next (138a076496e6 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux into next)
Merging powerpc-mpe/next (bc0195aad0da Linux 4.2-rc2)
Merging fsl/next (1eef33bec12d powerpc/86xx: Fix PCI interrupt map definition)
Merging mpc5xxx/next (39e69f55f857 powerpc: Introduce the use of the managed version of kzalloc)
Merging s390/features (5e19a42ac6d9 s390/cpuinfo: show dynamic and static cpu mhz)
Merging sparc-next/master (9f935675d41a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input)
Merging tile/master (bdf03e59f8c1 Fix typo)
Merging uml/linux-next (a78ff1112263 um: add extended processor state save/restore support)
Merging unicore32/unicore32 (c83d8b2fc986 unicore32: mm: Add missing parameter to arch_vma_access_permitted)
Merging xtensa/for_next (9da8320bb977 xtensa: add test_kc705_hifi variant)
Merging btrfs/next (c315ef8d9db7 Merge branch 'for-chris-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.7)
Merging btrfs-kdave/for-next (33e66059b7ad Merge branch 'for-next-next-4.7-20160527' into for-next-20160527)
Merging ceph/master (4a3262b17c96 libceph: use %s instead of %pE in dout()s)
Merging cifs/for-next (3bdc426e2497 cifs: dynamic allocation of ntlmssp blob)
Merging configfs/for-next (96c22a329351 configfs: fix CONFIGFS_BIN_ATTR_[RW]O definitions)
Merging ecryptfs/next (933c32fe0e42 ecryptfs: drop null test before destroy functions)
Merging ext3/for_next (b9d8905e4a75 reiserfs: check kstrdup failure)
Merging ext4/dev (12735f881952 ext4: pre-zero allocated blocks for DAX IO)
Merging f2fs/dev (b02b1fbdd338 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi)
Merging fscache/fscache (b00c2ae2ed3c FS-Cache: Don't override netfs's primary_index if registering failed)
Merging fuse/for-next (4441f63ab7e5 fuse: update mailing list in MAINTAINERS)
Merging gfs2/for-next (29567292c0b5 Merge tag 'for-linus-4.7-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip)
Merging jfs/jfs-next (6ed71e9819ac jfs: Coalesce some formats)
Merging nfs/linux-next (1a695a905c18 Linux 4.7-rc1)
Merging nfsd/nfsd-next (9e62f931dd07 rpc: share one xps between all backchannels)
Merging orangefs/for-next (2dcd0af568b0 Linux 4.6)
Merging overlayfs/overlayfs-next (7d43ba76af20 ovl: store ovl_entry in inode->i_private for all inodes)
Merging v9fs/for-next (a333e4bf2556 fs/9p: use fscache mutex rather than spinlock)
Merging ubifs/linux-next (1112018cefc5 ubifs: ubifs_dump_inode: Fix dumping field bulk_read)
Merging xfs/for-next (1a695a905c18 Linux 4.7-rc1)
Merging file-locks/linux-next (5af9c2e19da6 Merge branch 'akpm' (patches from Andrew))
Merging vfs/for-next (1eb82bc8e712 Merge branch 'for-linus' into for-next)
Merging pci/next (1a695a905c18 Linux 4.7-rc1)
Merging hid/for-next (185a9cac5b1e Merge branch 'for-4.6/upstream-fixes' into for-next)
Merging i2c/i2c/for-next (1a695a905c18 Linux 4.7-rc1)
Merging jdelvare-hwmon/master (18c358ac5e32 Documentation/hwmon: Update links in max34440)
Merging dmi/master (c3db05ecf8ac firmware: dmi_scan: Save SMBIOS Type 9 System Slots)
Merging hwmon-staging/hwmon-next (03bd75a88d6c hwmon: (max1668) Fix typo in documentation)
Merging v4l-dvb/master (73dfb701d254 Merge branch 'v4l_for_linus' into to_next)
Merging pm/linux-next (1a695a905c18 Linux 4.7-rc1)
Merging idle/next (f55532a0c0b8 Linux 4.6-rc1)
Merging thermal/next (88ac99063e6e Merge branches 'thermal-core', 'thermal-intel' and 'thermal-soc' into next)
Merging thermal-soc/next (ddc8fdc6e2f0 Merge branch 'work-fixes' into work-next)
CONFLICT (add/add): Merge conflict in drivers/thermal/tango_thermal.c
CONFLICT (content): Merge conflict in drivers/thermal/rockchip_thermal.c
Merging ieee1394/for-next (384fbb96f926 firewire: nosy: Replace timeval with timespec64)
Merging dlm/next (82c7d823cc31 dlm: config: Fix ENOMEM failures in make_cluster())
Merging swiotlb/linux-next (386744425e35 swiotlb: Make linux/swiotlb.h standalone includible)
Merging slave-dma/next (4f0382030b6d Merge branch 'topic/sh' into next)
Merging net-next/master (07b75260ebc2 Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus)
Merging ipsec-next/master (cb866e3298cd xfrm: Increment statistic counter on inner mode error)
Merging ipvs-next/master (698e2a8dca98 ipvs: make drop_entry protection effective for SIP-pe)
Merging wireless-drivers-next/master (52776a700b53 Merge ath-next from ath.git)
Merging bluetooth/master (07b75260ebc2 Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus)
Merging mac80211-next/master (019ae3a91881 cfg80211: Advertise extended capabilities per interface type to userspace)
Merging rdma/for-next (7a226f9c32b0 staging/rdma: Remove the entire rdma subdirectory of staging)
Merging rdma-leon/rdma-next (1a695a905c18 Linux 4.7-rc1)
Merging rdma-leon-test/testing/rdma-next (1a695a905c18 Linux 4.7-rc1)
Merging mtd/master (becc7ae544c6 MAINTAINERS: Add file patterns for mtd device tree bindings)
Merging l2-mtd/master (becc7ae544c6 MAINTAINERS: Add file patterns for mtd device tree bindings)
Merging nand/nand/next (cabfeaa67843 ARM: OMAP2+: Update GPMC and NAND DT binding documentation)
Merging crypto/master (5318c53d5b4b crypto: s5p-sss - Use consistent indentation for variables and members)
Merging drm/drm-next (92181d47ee74 headers_check: don't warn about c++ guards)
Merging drm-panel/drm/panel/for-next (227e4f4079e1 drm/panel: simple: Add support for TPK U.S.A. LLC Fusion 7" and 10.1" panels)
Merging drm-intel/for-linux-next (1800ad255c4f drm/i915: Update GEN6_PMINTRMSK setup with GuC enabled)
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/intel_ringbuffer.h
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/intel_ringbuffer.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/intel_psr.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/intel_pm.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/intel_lrc.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/intel_display.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/i915_gem.c
Merging drm-tegra/drm/tegra/for-next (057eab2013ec MAINTAINERS: Remove Terje Bergström as Tegra DRM maintainer)
Merging drm-misc/topic/drm-misc (b82caafcf230 drm/vc4: Use lockless gem BO free callback)
Merging drm-exynos/exynos-drm/for-next (25364a9e54fb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid)
Merging drm-msm/msm-next (2b669875332f drm/msm: Drop load/unload drm_driver ops)
Merging hdlcd/for-upstream/hdlcd (5c5b8aedaec0 drm: hdlcd: Cleanup the atomic operations)
Merging drm-vc4/drm-vc4-next (efea172891fc drm/vc4: Return -EBUSY if there's already a pending flip event.)
Merging sunxi/sunxi/for-next (30ce0df9ee51 Merge branches 'sunxi/defconfig-for-4.8', 'sunxi/drm-fixes-for-4.7' and 'sunxi/dt-for-4.8' into sunxi/for-next)
Merging kbuild/for-next (0c644e04ad1b Merge branch 'kbuild/misc' into kbuild/for-next)
Merging kconfig/for-next (5bcba792bb30 localmodconfig: Fix whitespace repeat count after "tristate")
Merging regmap/for-next (3f8cd61d24e6 Merge remote-tracking branch 'regmap/topic/maintainers' into regmap-next)
Merging sound/for-next (0358ccc8ffd8 ALSA: uapi: Add three missing header files to Kbuild file)
Merging sound-asoc/for-next (f5fe6c51e8b5 Merge remote-tracking branches 'asoc/topic/samsung', 'asoc/topic/simple', 'asoc/topic/tas571x', 'asoc/topic/tlv320aic31xx' and 'asoc/topic/wm8985' into asoc-next)
Merging modules/modules-next (e2d1248432c4 module: Disable MODULE_FORCE_LOAD when MODULE_SIG_FORCE is enabled)
Merging input/next (48a2b783483b Input: add Raydium I2C touchscreen driver)
Merging block/for-next (661806a31989 Merge branch 'for-4.7/core' into for-next)
Merging lightnvm/for-next (2a65aee4011b lightnvm: reserved space calculation incorrect)
Merging device-mapper/for-next (b8ef07be98b4 dm mpath: add optional "queue_mode" feature)
Merging pcmcia/master (e8e68fd86d22 pcmcia: do not break rsrc_nonstatic when handling anonymous cards)
Merging mmc-uh/next (1a695a905c18 Linux 4.7-rc1)
Merging md/for-next (412575807427 right meaning of PARITY_ENABLE_RMW and PARITY_PREFER_RMW)
Merging mfd/for-mfd-next (b52207ef4ea5 mfd: hi655x: Add MFD driver for hi655x)
Merging backlight/for-backlight-next (60d613d6aef4 backlight: pwm_bl: Free PWM requested by legacy API on error path)
Merging battery/master (4a99fa06a8ca sbs-battery: fix power status when battery charging near dry)
Merging omap_dss2/for-next (ab366b40b851 fbdev: Use IS_ENABLED() instead of checking for built-in or module)
Merging regulator/for-next (500ed8bf3856 Merge remote-tracking branches 'regulator/topic/fixed', 'regulator/topic/headers', 'regulator/topic/max8973' and 'regulator/topic/mt6397' into regulator-next)
Merging security/next (b937190c40de LSM: LoadPin: provide enablement CONFIG)
Merging integrity/next (05d1a717ec04 ima: add support for creating files using the mknodat syscall)
Merging keys/keys-next (75aeddd12f20 MAINTAINERS: Update keyrings record and add asymmetric keys record)
Merging selinux/next (7ea59202db8d selinux: Only apply bounds checking to source types)
Merging tpmdd/next (e8f2f45a4402 tpm: Fix suspend regression)
Merging watchdog/master (1a695a905c18 Linux 4.7-rc1)
Merging iommu/next (6c0b43df74f9 Merge branches 'arm/io-pgtable', 'arm/rockchip', 'arm/omap', 'x86/vt-d', 'ppc/pamu', 'core' and 'x86/amd' into next)
Merging dwmw2-iommu/master (22e2f9fa63b0 iommu/vt-d: Use per-cpu IOVA caching)
Merging vfio/next (f70552809419 vfio_pci: Test for extended capabilities if config space > 256 bytes)
Merging jc_docs/docs-next (9f8036643dd9 doc: self-protection: provide initial details)
Merging trivial/for-next (52bbe141f37f gitignore: fix wording)
Merging audit/next (2b4c7afe79a8 audit: fixup: log on errors from filter user rules)
Merging devicetree/devicetree/next (48a9b733e644 of/irq: Rename "intc_desc" to "of_intc_desc" to fix OF on sh)
Merging dt-rh/for-next (f2c27767af0a devicetree: Add Creative Technology vendor id)
Merging mailbox/mailbox-for-next (c430cf376fee mailbox: Fix devm_ioremap_resource error detection code)
Merging spi/for-next (e767713092de Merge remote-tracking branches 'spi/topic/maintainers', 'spi/topic/orion', 'spi/topic/pxa2xx' and 'spi/topic/rockchip' into spi-next)
Merging tip/auto-latest (65fd15a016cc Merge branch 'perf/urgent')
Merging clockevents/clockevents/next (cee77c2c5b57 clocksource/drivers/tango-xtal: Fix incorrect test)
Merging edac/linux_next (12f0721c5a70 sb_edac: correctly fetch DIMM width on Ivy Bridge and Haswell)
Merging edac-amd/for-next (3f37a36b6282 EDAC, amd64_edac: Drop pci_register_driver() use)
Merging irqchip/irqchip/for-next (a66ce4b7d9d2 Merge branch 'irqchip/mvebu' into irqchip/for-next)
Merging ftrace/for-next (97f8827a8c79 ftracetest: Use proper logic to find process PID)
Merging rcu/rcu/next (0e7e2457e4e4 Merge commit 'dcd36d01fb3f99d1d5df01714f6ccbe3fbbaf81f' into HEAD)
Merging kvm/linux-next (e28e909c36bb Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm)
Merging kvm-arm/next (35a2d58588f0 KVM: arm/arm64: vgic-new: Synchronize changes to active state)
Merging kvm-ppc/kvm-ppc-next (c63517c2e381 KVM: PPC: Book3S: correct width in XER handling)
Merging kvm-ppc-paulus/kvm-ppc-next (b1a4286b8f33 KVM: PPC: Book3S HV: Re-enable XICS fast path for irqfd-generated interrupts)
Merging kvms390/next (60a37709ce60 KVM: s390: Populate mask of non-hypervisor managed facility bits)
Merging xen-tip/linux-next (bdadcaf2a7c1 xen: remove incorrect forward declaration)
Merging percpu/for-next (6710e594f71c percpu: fix synchronization between synchronous map extension and chunk destruction)
Merging workqueues/for-next (f1e89a8f3358 Merge branch 'for-4.6-fixes' into for-next)
Merging drivers-x86/for-next (b740d2e9233c platform/x86: Add PMC Driver for Intel Core SoC)
Merging chrome-platform/for-next (31b764171cb5 Revert "platform/chrome: chromeos_laptop: Add Leon Touch")
Merging hsi/for-next (b32bd7e7d5c1 hsi: use kmemdup)
Merging leds/for-next (a534769305ec leds: core: Fix brightness setting upon hardware blinking enabled)
Merging ipmi/for-next (a1b4e31bfabb IPMI: reserve memio regions separately)
Merging driver-core/driver-core-next (1a695a905c18 Linux 4.7-rc1)
Merging tty/tty-next (1a695a905c18 Linux 4.7-rc1)
Merging usb/usb-next (1a695a905c18 Linux 4.7-rc1)
Merging usb-gadget/next (2a58f9c12bb3 usb: dwc3: gadget: disable automatic calculation of ACK TP NUMP)
Merging usb-serial/usb-next (b923c6c62981 USB: serial: ti_usb_3410_5052: add MOXA UPORT 11x0 support)
Merging usb-chipidea-next/ci-for-usb-next (764763f0a0c8 doc: usb: chipidea: update the doc for OTG FSM)
Merging staging/staging-next (1a695a905c18 Linux 4.7-rc1)
Merging char-misc/char-misc-next (1a695a905c18 Linux 4.7-rc1)
Merging extcon/extcon-next (bd3adefe7ea1 extcon: usb-gpio: add support for ACPI gpio interface)
Merging cgroup/for-next (332d8a2fd141 cgroup: set css->id to -1 during init)
Merging scsi/for-next (787ab6e97024 aacraid: do not activate events on non-SRC adapters)
Merging target-updates/for-next (8f0dfb3d8b11 iscsi-target: Fix early sk_data_ready LOGIN_FLAGS_READY race)
Merging target-merge/for-next-merge (2994a7518317 cxgb4: update Kconfig and Makefile)
Merging libata/for-next (5219d6530ef0 ata: Use IS_ENABLED() instead of checking for built-in or module)
Merging pinctrl/for-next (a02fcf38ade9 Merge branch 'devel' into for-next)
Merging vhost/linux-next (bb991288728e ringtest: pass buf != NULL)
Merging remoteproc/for-next (7a6271a80cae remoteproc/wkup_m3: Use MODULE_DEVICE_TABLE to export alias)
Merging rpmsg/for-next (da5cb422f15d Merge branches 'rpmsg-next' and 'rproc-next' into for-next)
Merging gpio/for-next (63e213fc63c0 Merge branch 'devel' into for-next)
Merging dma-mapping/dma-mapping-next (d770e558e219 Linux 4.2-rc1)
Merging pwm/for-next (18c588786c08 Merge branch 'for-4.7/pwm-atomic' into for-next)
Merging dma-buf/for-next (b02da6f82361 dma-buf: use vma_pages())
Merging userns/for-next (f2ca379642d7 namei: permit linking with CAP_FOWNER in userns)
Merging ktest/for-next (2dcd0af568b0 Linux 4.6)
Merging clk/clk-next (ef56b79b66fa clk: fix critical clock locking)
Merging aio/master (b562e44f507e Linux 4.5)
Merging kselftest/next (6eab37daf0ec tools: testing: define the _GNU_SOURCE macro)
Merging y2038/y2038 (4b277763c5b3 vfs: Add support to document max and min inode times)
Merging luto-misc/next (afd2ff9b7e1b Linux 4.4)
Merging borntraeger/linux-next (b562e44f507e Linux 4.5)
Merging livepatching/for-next (6d9122078097 Merge branch 'for-4.7/core' into for-next)
Merging coresight/next (c568ba901f27 coresight: Handle build path error)
Merging rtc/rtc-next (95df4c078bf3 char/genrtc: remove the rest of the driver)
Merging hwspinlock/for-next (bd5717a4632c hwspinlock: qcom: Correct msb in regmap_field)
Merging nvdimm/libnvdimm-for-next (36092ee8ba69 Merge branch 'for-4.7/dax' into libnvdimm-for-next)
Merging dax-misc/dax-misc (4d9a2c874667 dax: Remove i_mmap_lock protection)
Merging akpm-current/current (602525d7e860 ipc/msg.c: use freezable blocking call)
CONFLICT (content): Merge conflict in sound/soc/qcom/lpass-platform.c
CONFLICT (content): Merge conflict in net/9p/client.c
CONFLICT (content): Merge conflict in fs/binfmt_flat.c
$ git checkout -b akpm remotes/origin/akpm/master
Applying: mm: make optimistic check for swapin readahead fix
Applying: drivers/net/wireless/intel/iwlwifi/dvm/calib.c: simplfy min() expression
Applying: drivers/fpga/Kconfig: fix build failure
Merging akpm/master (200f147a6c83 drivers/fpga/Kconfig: fix build failure)

* [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-01 3:11 linux-next: Tree for Jun 1 Stephen Rothwell
@ 2016-06-02 1:48 ` Sergey Senozhatsky
  2016-06-02 9:21   ` Michal Hocko
  2016-06-02 13:24   ` Vlastimil Babka
  0 siblings, 2 replies; 28+ messages in thread
From: Sergey Senozhatsky @ 2016-06-02 1:48 UTC
  To: Andrew Morton
  Cc: Vlastimil Babka, Michal Hocko, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel

On (06/01/16 13:11), Stephen Rothwell wrote:
> Hi all,
>
> Changes since 20160531:
>
> My fixes tree contains:
>
>   of: silence warnings due to max() usage
>
> The arm tree gained a conflict against Linus' tree.
>
> Non-merge commits (relative to Linus' tree): 1100
>  936 files changed, 38159 insertions(+), 17475 deletions(-)

Hello,

the cc1 process ended up in DN state during kernel -j4 compilation.

...
[ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds.
[ 2856.323055] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
[ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2856.323059] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000
[ 2856.323062] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
[ 2856.323065] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
[ 2856.323067] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
[ 2856.323068] Call Trace:
[ 2856.323074] [<ffffffff81441e33>] schedule+0x83/0x98
[ 2856.323077] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
[ 2856.323080] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
[ 2856.323083] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
[ 2856.323084] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
[ 2856.323086] [<ffffffff81443630>] down_write+0x1f/0x2e
[ 2856.323089] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
[ 2856.323091] [<ffffffff8103702a>] mmput+0x29/0xc5
[ 2856.323093] [<ffffffff8103bbd8>] do_exit+0x34c/0x894
[ 2856.323095] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
[ 2856.323097] [<ffffffff8103c188>] do_group_exit+0x3c/0x98
[ 2856.323099] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
[ 2856.323101] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
[ 2877.322853] INFO: task cc1:4582 blocked for more than 21 seconds.
[ 2877.322858] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
[ 2877.322858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2877.322861] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000
[ 2877.322865] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
[ 2877.322867] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
[ 2877.322867] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
[ 2877.322870] Call Trace:
[ 2877.322875] [<ffffffff81441e33>] schedule+0x83/0x98
[ 2877.322878] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
[ 2877.322881] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
[ 2877.322884] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
[ 2877.322885] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
[ 2877.322887] [<ffffffff81443630>] down_write+0x1f/0x2e
[ 2877.322890] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
[ 2877.322892] [<ffffffff8103702a>] mmput+0x29/0xc5
[ 2877.322894] [<ffffffff8103bbd8>] do_exit+0x34c/0x894
[ 2877.322896] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
[ 2877.322898] [<ffffffff8103c188>] do_group_exit+0x3c/0x98
[ 2877.322900] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
[ 2877.322902] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
...

ps aux | grep cc1
ss 4582 0.0 0.0 0 0 pts/23 DN+ 10:10 0:01 [cc1]

	-ss
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-02 1:48 ` [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup Sergey Senozhatsky @ 2016-06-02 9:21 ` Michal Hocko 2016-06-02 12:08 ` Sergey Senozhatsky 2016-06-03 7:15 ` Sergey Senozhatsky 2016-06-02 13:24 ` Vlastimil Babka 1 sibling, 2 replies; 28+ messages in thread From: Michal Hocko @ 2016-06-02 9:21 UTC (permalink / raw) To: Sergey Senozhatsky Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli [CCing Andrea] On Thu 02-06-16 10:48:35, Sergey Senozhatsky wrote: > On (06/01/16 13:11), Stephen Rothwell wrote: > > Hi all, > > > > Changes since 20160531: > > > > My fixes tree contains: > > > > of: silence warnings due to max() usage > > > > The arm tree gained a conflict against Linus' tree. > > > > Non-merge commits (relative to Linus' tree): 1100 > > 936 files changed, 38159 insertions(+), 17475 deletions(-) > > Hello, > > the cc1 process ended up in DN state during kernel -j4 compilation. > > ... > [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds. > [ 2856.323055] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453 > [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 2856.323059] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000 > [ 2856.323062] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000 > [ 2856.323065] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80 > [ 2856.323067] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00 > [ 2856.323068] Call Trace: > [ 2856.323074] [<ffffffff81441e33>] schedule+0x83/0x98 > [ 2856.323077] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3 > [ 2856.323080] [<ffffffff810a87cf>] ? 
unlock_page+0x2b/0x2d > [ 2856.323083] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30 > [ 2856.323084] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30 > [ 2856.323086] [<ffffffff81443630>] down_write+0x1f/0x2e > [ 2856.323089] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a > [ 2856.323091] [<ffffffff8103702a>] mmput+0x29/0xc5 > [ 2856.323093] [<ffffffff8103bbd8>] do_exit+0x34c/0x894 > [ 2856.323095] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399 > [ 2856.323097] [<ffffffff8103c188>] do_group_exit+0x3c/0x98 > [ 2856.323099] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf > [ 2856.323101] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f down_write in the exit path is certainly not nice. It is hard to tell who is blocking the mmap_sem but it is clear that __khugepaged_exit waits for the khugepaged to release its mmap_sem. Do you hapen to have a trace of khugepaged? Note that the lock holder might be another writer which just hasn't pinned mm_users so khugepaged might be blocked on read lock as well. Or khugepaged might be just stuck somewhere... I am trying to wrap my head around the synchronization here and I suspect it is unnecessarily complex. We should be able to go without down_write in the exit path... The following patch would only workaround the issue you are seeing but I guess it is worth considering this approach. Andrea, does the following look reasonable to you? I haven't tested it and I might be missing some subtle details. The code is really not trivial... --- >From 34416b980cf02280ad76b5603175eda327ce0603 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.com> Date: Thu, 2 Jun 2016 10:38:37 +0200 Subject: [PATCH] khugepaged: simplify khugepaged vs. __mmput __khugepaged_exit is called during the final __mmput and it employs a complex synchronization dances to make sure it doesn't race with the khugepaged which might be scanning this mm at the same time. 
This is all caused by the fact that khugepaged doesn't pin mm_users. Things would simplify considerably if we simply check the mm at khugepaged_scan_mm_slot and if mm_users was already 0 then we know it is dead and we can unhash the mm_slot and move on to another one. This will also guarantee that __khugepaged_exit cannot race with khugepaged and so we can free up the slot if it is still hashed. Signed-off-by: Michal Hocko <mhocko@suse.com> --- mm/huge_memory.c | 40 ++++++++++++++++------------------------ 1 file changed, 16 insertions(+), 24 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index de62bd991827..3dfc62b1a90c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1936,7 +1936,8 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm, static inline int khugepaged_test_exit(struct mm_struct *mm) { - return atomic_read(&mm->mm_users) == 0; + /* the only pin is from khugepaged_scan_mm_slot */ + return atomic_read(&mm->mm_users) <= 1; } int __khugepaged_enter(struct mm_struct *mm) @@ -1948,8 +1949,6 @@ int __khugepaged_enter(struct mm_struct *mm) if (!mm_slot) return -ENOMEM; - /* __khugepaged_exit() must not run from under us */ - VM_BUG_ON_MM(khugepaged_test_exit(mm), mm); if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) { free_mm_slot(mm_slot); return 0; @@ -1999,29 +1998,11 @@ void __khugepaged_exit(struct mm_struct *mm) spin_lock(&khugepaged_mm_lock); mm_slot = get_mm_slot(mm); - if (mm_slot && khugepaged_scan.mm_slot != mm_slot) { - hash_del(&mm_slot->hash); - list_del(&mm_slot->mm_node); - free = 1; - } - spin_unlock(&khugepaged_mm_lock); - - if (free) { + if (mm_slot) { + collect_mm_slot(mm_slot); clear_bit(MMF_VM_HUGEPAGE, &mm->flags); - free_mm_slot(mm_slot); - mmdrop(mm); - } else if (mm_slot) { - /* - * This is required to serialize against - * khugepaged_test_exit() (which is guaranteed to run - * under mmap sem read mode). 
Stop here (after we - * return all pagetables will be destroyed) until - * khugepaged has finished working on the pagetables - * under the mmap_sem. - */ - down_write(&mm->mmap_sem); - up_write(&mm->mmap_sem); } + spin_unlock(&khugepaged_mm_lock); } static void release_pte_page(struct page *page) @@ -2780,6 +2761,16 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, khugepaged_scan.address = 0; khugepaged_scan.mm_slot = mm_slot; } + + /* + * Do not even try to do anything if the current mm is already + * dead. khugepaged_mm_lock will make sure only this or + * __khugepaged_exit does the unhasing. + */ + if (!atomic_inc_not_zero(&mm_slot->mm->mm_users)) { + collect_mm_slot(mm_slot); + return progress; + } spin_unlock(&khugepaged_mm_lock); mm = mm_slot->mm; @@ -2863,6 +2854,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, collect_mm_slot(mm_slot); } + mmput(mm); return progress; } -- 2.8.1 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 28+ messages in thread
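The pin that the patch above takes in khugepaged_scan_mm_slot relies on increment-if-not-zero semantics: a reference is handed out only while mm_users is still positive, so an mm whose last user is gone can never be revived by the scanner. What follows is a minimal userspace sketch of that idea using C11 atomics. The names struct toy_mm, toy_get_not_zero() and toy_put() are hypothetical and exist only for illustration; this is not kernel code, and the kernel's own atomic_inc_not_zero() may well be implemented differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the mm_users counter in struct mm_struct. */
struct toy_mm {
    atomic_int users;
};

/* Take a reference only if the count has not already dropped to zero. */
bool toy_get_not_zero(struct toy_mm *mm)
{
    int old = atomic_load(&mm->users);

    while (old != 0) {
        /* On failure, old is refreshed with the current counter value. */
        if (atomic_compare_exchange_weak(&mm->users, &old, old + 1))
            return true;
    }
    return false; /* the object is already dead, do not touch it */
}

/* Drop a reference; returns true if the caller dropped the last one. */
bool toy_put(struct toy_mm *mm)
{
    int old = atomic_fetch_sub(&mm->users, 1);

    assert(old > 0); /* dropping a reference that was never held is a bug */
    return old == 1;
}
```

In this model a scanner would call toy_get_not_zero() before looking at the object and skip it when the call fails, which mirrors the early return added to khugepaged_scan_mm_slot in the patch.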
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-02 9:21 ` Michal Hocko @ 2016-06-02 12:08 ` Sergey Senozhatsky 2016-06-02 12:21 ` Michal Hocko 2016-06-03 7:15 ` Sergey Senozhatsky 1 sibling, 1 reply; 28+ messages in thread From: Sergey Senozhatsky @ 2016-06-02 12:08 UTC (permalink / raw) To: Michal Hocko Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli Hello Michal, On (06/02/16 11:21), Michal Hocko wrote: [..] > > [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds. > > [ 2856.323055] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453 > > [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > [ 2856.323059] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000 > > [ 2856.323062] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000 > > [ 2856.323065] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80 > > [ 2856.323067] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00 > > [ 2856.323068] Call Trace: > > [ 2856.323074] [<ffffffff81441e33>] schedule+0x83/0x98 > > [ 2856.323077] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3 > > [ 2856.323080] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d > > [ 2856.323083] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30 > > [ 2856.323084] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30 > > [ 2856.323086] [<ffffffff81443630>] down_write+0x1f/0x2e > > [ 2856.323089] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a > > [ 2856.323091] [<ffffffff8103702a>] mmput+0x29/0xc5 > > [ 2856.323093] [<ffffffff8103bbd8>] do_exit+0x34c/0x894 > > [ 2856.323095] [<ffffffff8102f9e0>] ? 
__do_page_fault+0x2f7/0x399 > > [ 2856.323097] [<ffffffff8103c188>] do_group_exit+0x3c/0x98 > > [ 2856.323099] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf > > [ 2856.323101] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f > > down_write in the exit path is certainly not nice. It is hard to tell > who is blocking the mmap_sem but it is clear that __khugepaged_exit > waits for the khugepaged to release its mmap_sem. Do you happen to have a > trace of khugepaged? Note that the lock holder might be another writer > which just hasn't pinned mm_users so khugepaged might be blocked on read > lock as well. Or khugepaged might be just stuck somewhere... sorry, no. this is all I have. the kernel was compiled with almost no debugging functionality enabled (no lockdep, no lock debug, nothing) for zram performance testing purposes. I'll try to reproduce the problem; and give your patch some testing. thanks. -ss ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-02 12:08 ` Sergey Senozhatsky @ 2016-06-02 12:21 ` Michal Hocko 2016-06-03 13:51 ` Andrea Arcangeli 0 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2016-06-02 12:21 UTC (permalink / raw) To: Sergey Senozhatsky Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli On Thu 02-06-16 21:08:57, Sergey Senozhatsky wrote: > Hello Michal, > > On (06/02/16 11:21), Michal Hocko wrote: > [..] > > > [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds. > > > [ 2856.323055] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453 > > > [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > [ 2856.323059] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000 > > > [ 2856.323062] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000 > > > [ 2856.323065] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80 > > > [ 2856.323067] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00 > > > [ 2856.323068] Call Trace: > > > [ 2856.323074] [<ffffffff81441e33>] schedule+0x83/0x98 > > > [ 2856.323077] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3 > > > [ 2856.323080] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d > > > [ 2856.323083] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30 > > > [ 2856.323084] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30 > > > [ 2856.323086] [<ffffffff81443630>] down_write+0x1f/0x2e > > > [ 2856.323089] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a > > > [ 2856.323091] [<ffffffff8103702a>] mmput+0x29/0xc5 > > > [ 2856.323093] [<ffffffff8103bbd8>] do_exit+0x34c/0x894 > > > [ 2856.323095] [<ffffffff8102f9e0>] ? 
__do_page_fault+0x2f7/0x399 > > > [ 2856.323097] [<ffffffff8103c188>] do_group_exit+0x3c/0x98 > > > [ 2856.323099] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf > > > [ 2856.323101] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f > > > > down_write in the exit path is certainly not nice. It is hard to tell > > who is blocking the mmap_sem but it is clear that __khugepaged_exit > > waits for the khugepaged to release its mmap_sem. Do you happen to have a > > trace of khugepaged? Note that the lock holder might be another writer > > which just hasn't pinned mm_users so khugepaged might be blocked on read > > lock as well. Or khugepaged might be just stuck somewhere... > > sorry, no. this is all I have. the kernel was compiled with almost no > debugging functionality enabled (no lockdep, no lock debug, nothing) > for zram performance testing purposes. > > I'll try to reproduce the problem; and give your patch some testing. > thanks. The patch will drop the down_write from the exit path which is, I believe, the right thing to do, so it would paper over an existing problem when khugepaged could get stuck with mmap_sem held for read (if that is really a problem). So reproducing without the patch still makes some sense. Testing with the patch makes some sense as well, but I would like to hear from Andrea whether the approach is good because I am wondering why he hasn't done that before - it feels so much simpler than the current code. Anyway, thanks a lot for testing! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-02 12:21 ` Michal Hocko @ 2016-06-03 13:51 ` Andrea Arcangeli 2016-06-03 14:46 ` Michal Hocko 0 siblings, 1 reply; 28+ messages in thread From: Andrea Arcangeli @ 2016-06-03 13:51 UTC (permalink / raw) To: Michal Hocko Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel On Thu, Jun 02, 2016 at 02:21:10PM +0200, Michal Hocko wrote: > Testing with the patch makes some sense as well, but I would like to > hear from Andrea whether the approach is good because I am wondering why > he hasn't done that before - it feels so much simpler than the current > code. The down_write in the exit path comes from __ksm_exit. If you don't like it there I'd suggest to also remove it from __ksm_exit. This is a proposed cleanup correct? The first thing that I can notice is that khugepaged_test_exit() then can only be called and provide the expected retval, after atomic_inc_not_zero(mm_users). Also note mmget_not_zero() should be used instead. However the code still uses khugepaged_test_exit in __khugepage_enter that won't increase the mm_users, so then the patch relaxes that check too much, albeit only for a debug check not strictly a bug. The cons of this change purely that it'll decrease the responsiveness in releasing the RAM of a killed task a bit. To me the fewer time we hold the mm_users the better and I don't see an obvious runtime improvement coming from this change. It's a bit simpler yes, but the down_write in the exit path is well understood, ksm does the same thing and it's in a slow path (it only happens if the mm that exited is the current one under scan by either ksmd or khugepaged, so normally the down_write is not executed in the exit path and the "mm" is collected right away both as a mm_users and mm_count). 
In short I think it's a tradeoff: pros) removes down_write in a slow path of the mm exit which may simplify the code a bit, cons) it could increase the latency in freeing memory as a result of a task exiting or being killed during the khugepaged scan, for example while the THP is being allocated. While compaction runs to allocate the THP in collapse_huge_page, if the task is killed currently the memory is released right away, without waiting for the allocation to succeed or fail. I don't see a big enough problem with the down_write in a slow path of khugepaged_exit to justify the increased latency in releasing memory. I was very happy with Oleg's patch reducing the mm_users holding of userfaultfd too. That was controlled by userland so it would only be an issue for non-cooperative usage which isn't upstream yet, and it was also much wider than this one would become with the patch applied, but I liked the direction. If you prefer instead to remove the down_write, you probably could move the test_exit before the down_read/write to bail out before taking the lock: you don't need the mmap_sem to do test_exit anymore. The only reason the test_exit would remain in fact is just to reduce the latency of the memory freeing, it then becomes a voluntary preempt cond_resched() to release the memory to make a parallel ;), but unable to let the kernel free the memory while the THP allocation runs. Thanks, Andrea ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 13:51 ` Andrea Arcangeli @ 2016-06-03 14:46 ` Michal Hocko 2016-06-03 15:10 ` Andrea Arcangeli 0 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2016-06-03 14:46 UTC (permalink / raw) To: Andrea Arcangeli Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel On Fri 03-06-16 15:51:54, Andrea Arcangeli wrote: > On Thu, Jun 02, 2016 at 02:21:10PM +0200, Michal Hocko wrote: > > Testing with the patch makes some sense as well, but I would like to > > hear from Andrea whether the approach is good because I am wondering why > > he hasn't done that before - it feels so much simpler than the current > > code. > > The down_write in the exit path comes from __ksm_exit. If you don't > like it there I'd suggest to also remove it from __ksm_exit. I see > This is a proposed cleanup correct? yes this is a cleanup but also a robustness thing, see below. > The first thing that I can notice is that khugepaged_test_exit() then > can only be called and provide the expected retval, after > atomic_inc_not_zero(mm_users). Also note mmget_not_zero() should be > used instead. I didn't get used to mmget_not_zero yet, but true a helper would be better. [...] > To me the fewer time we hold the mm_users the better and I don't see > an obvious runtime improvement coming from this change. It's a bit > simpler yes, but the down_write in the exit path is well understood, > ksm does the same thing and it's in a slow path (it only happens if > the mm that exited is the current one under scan by either ksmd or > khugepaged, so normally the down_write is not executed in the exit > path and the "mm" is collected right away both as a mm_users and > mm_count). OK, I see your point. I wasn't aware that the mmap_sem is dropped before the allocation request. 
Then the original code indeed might get into exit_mmap earlier wrt. the patch. The reason I dislike taking write lock in the __mmput is basically for the same reason you have pointed out. exit_mmap might be delayed for an unbounded amount of time. khugepaged resp. ksmd might be well behaved and release their read lock for costly operations or when they detect the mm is dead but it is hard to guarantee that all potential kernel users/drivers are behaving the same way. It is not really trivial to check whether we have such users (there are 100+ users outside of mm/ as per my quick git grep). The exit path should be as simple as possible with the amount of external dependencies reduced to the bare minimum. > In short I think it's a tradeoff: pros) removes down_write in a slow > path of the mm exit which may simplify the code a bit, cons) it > could increase the latency in freeing memory as a result of a task > exiting or being killed during the khugepaged scan, for example while > the THP is being allocated. While compaction runs to allocate the THP > in collapse_huge_page, if the task is killed currently the memory is > released right away, without waiting for the allocation to succeed or > fail. Are those latencies a real problem? The allocation itself shouldn't really take a long time. > I don't see a big enough problem with the down_write in a slow path of > khugepaged_exit to justify the increased latency in releasing memory. What do you think about the external dependencies mentioned above. Do you think this is a sufficient argument wrt. occasional higher latencies? [...] > If you prefer instead to remove the down_write, you probably could move > the test_exit before the down_read/write to bail out before taking the > lock: you don't need the mmap_sem to do test_exit anymore. 
The only > reason the test_exit would remain in fact is just to reduce the > latency of the memory freeing, it then becomes a voluntary preempt > cond_resched() to release the memory to make a parallel ;), but unable > to let the kernel free the memory while the THP allocation runs. OK, I will think about that as well. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 14:46 ` Michal Hocko @ 2016-06-03 15:10 ` Andrea Arcangeli 2016-06-07 7:34 ` Michal Hocko 2016-06-08 8:19 ` Vlastimil Babka 0 siblings, 2 replies; 28+ messages in thread From: Andrea Arcangeli @ 2016-06-03 15:10 UTC (permalink / raw) To: Michal Hocko Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Hugh Dickins Hello Michal, CC'ed Hugh, On Fri, Jun 03, 2016 at 04:46:00PM +0200, Michal Hocko wrote: > What do you think about the external dependencies mentioned above. Do > you think this is a sufficient argument wrt. occasional higher > latencies? It's a tradeoff and both latencies would be short and uncommon so it's hard to tell. There's also mmput_async for paths that may care about mmput latencies. Exit itself cannot use it, it's mostly for people taking the mm_users pin that may not want to wait for mmput to run. It also shouldn't happen that often, it's a slow path. The whole model inherited from KSM is to deliberately depend only on the mmap_sem + test_exit + mm_count, and never on mm_users, which to me in principle doesn't sound bad. I consider KSM version a "finegrined" implementation but I never thought it would be a problem to wait a bit in exit() in case the slow path hits. I thought it was more of a problem if exit() runs, the parent then start a new task but the memory wasn't freed yet. So I would suggest Hugh to share his view on the down_write/up_write that may temporarily block mmput (until the next test_exit bailout point) vs higher latency in reaching exit_mmap for a real exit(2) that would happen with the proposed change. Thanks! Andrea ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 15:10 ` Andrea Arcangeli @ 2016-06-07 7:34 ` Michal Hocko 2016-06-08 8:19 ` Vlastimil Babka 1 sibling, 0 replies; 28+ messages in thread From: Michal Hocko @ 2016-06-07 7:34 UTC (permalink / raw) To: Andrea Arcangeli Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Hugh Dickins On Fri 03-06-16 17:10:01, Andrea Arcangeli wrote: > Hello Michal, > > CC'ed Hugh, > > On Fri, Jun 03, 2016 at 04:46:00PM +0200, Michal Hocko wrote: > > What do you think about the external dependencies mentioned above. Do > > you think this is a sufficient argument wrt. occasional higher > > latencies? > > It's a tradeoff and both latencies would be short and uncommon so it's > hard to tell. > > There's also mmput_async for paths that may care about mmput > latencies. Exit itself cannot use it, it's mostly for people taking > the mm_users pin that may not want to wait for mmput to run. It also > shouldn't happen that often, it's a slow path. > > The whole model inherited from KSM is to deliberately depend only on > the mmap_sem + test_exit + mm_count, and never on mm_users, which to > me in principle doesn't sound bad. I do agree that this model is quite clever (albeit convoluted). It just assumes that all other mmap_sem users are behaving the same. Now most in-kernel users will do get_task_mm() and then lock mmap_sem, but I haven't checked all of them and it is quite possible that some of those would like to optimize in a similar way and only increment mm_count. I might be too pessimistic about the out of mm code but I would feel much better if the exit path didn't depend on them. Anyway, if the current model sounds better I will definitely not insist on my patch. It is more of an idea for simplification than a fix for anything I have seen happening in the real life. 
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 15:10 ` Andrea Arcangeli 2016-06-07 7:34 ` Michal Hocko @ 2016-06-08 8:19 ` Vlastimil Babka 1 sibling, 0 replies; 28+ messages in thread From: Vlastimil Babka @ 2016-06-08 8:19 UTC (permalink / raw) To: Andrea Arcangeli, Michal Hocko Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Hugh Dickins On 06/03/2016 05:10 PM, Andrea Arcangeli wrote: > Hello Michal, > > CC'ed Hugh, > > On Fri, Jun 03, 2016 at 04:46:00PM +0200, Michal Hocko wrote: >> What do you think about the external dependencies mentioned above. Do >> you think this is a sufficient argument wrt. occasional higher >> latencies? > > It's a tradeoff and both latencies would be short and uncommon so it's > hard to tell. Shouldn't it be possible to do a mmput() before the hugepage allocation, and then again mmget_not_zero()? That way it's no longer a tradeoff? > There's also mmput_async for paths that may care about mmput > latencies. Exit itself cannot use it, it's mostly for people taking > the mm_users pin that may not want to wait for mmput to run. It also > shouldn't happen that often, it's a slow path. > > The whole model inherited from KSM is to deliberately depend only on > the mmap_sem + test_exit + mm_count, and never on mm_users, which to > me in principle doesn't sound bad. I consider KSM version a > "finegrined" implementation but I never thought it would be a problem > to wait a bit in exit() in case the slow path hits. I thought it was > more of a problem if exit() runs, the parent then start a new task but > the memory wasn't freed yet. > > So I would suggest Hugh to share his view on the down_write/up_write > that may temporarily block mmput (until the next test_exit bailout > point) vs higher latency in reaching exit_mmap for a real exit(2) that > would happen with the proposed change. > > Thanks! 
> Andrea ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-02 9:21 ` Michal Hocko 2016-06-02 12:08 ` Sergey Senozhatsky @ 2016-06-03 7:15 ` Sergey Senozhatsky 2016-06-03 7:25 ` Michal Hocko 1 sibling, 1 reply; 28+ messages in thread From: Sergey Senozhatsky @ 2016-06-03 7:15 UTC (permalink / raw) To: Michal Hocko Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli Hello, On (06/02/16 11:21), Michal Hocko wrote: [..] > @@ -2863,6 +2854,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, > > collect_mm_slot(mm_slot); > } > + mmput(mm); > > return progress; > } this possibly sleeping mmput() is called from under the spin_lock(&khugepaged_mm_lock). there is also a trivial build fixup needed (move collect_mm_slot() before __khugepaged_exit()). it's quite hard to trigger the bug (somehow), so I can't follow up with more information as of now. -ss ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 7:15 ` Sergey Senozhatsky @ 2016-06-03 7:25 ` Michal Hocko 2016-06-03 8:43 ` Sergey Senozhatsky 0 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2016-06-03 7:25 UTC (permalink / raw) To: Sergey Senozhatsky Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli On Fri 03-06-16 16:15:51, Sergey Senozhatsky wrote: > Hello, > > On (06/02/16 11:21), Michal Hocko wrote: > [..] > > @@ -2863,6 +2854,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, > > > > collect_mm_slot(mm_slot); > > } > > + mmput(mm); > > > > return progress; > > } > > this possibly sleeping mmput() is called from > under the spin_lock(&khugepaged_mm_lock). You are right. khugepaged_scan_mm_slot returns with the lock held. mmput_async would deal with it. > there is also a trivial build fixup needed > (move collect_mm_slot() before __khugepaged_exit()). will fix that. Thanks! > it's quite hard to trigger the bug (somehow), so I can't > follow up with more information as of now. Thanks anyway! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 7:25 ` Michal Hocko @ 2016-06-03 8:43 ` Sergey Senozhatsky 2016-06-03 9:55 ` Michal Hocko 0 siblings, 1 reply; 28+ messages in thread From: Sergey Senozhatsky @ 2016-06-03 8:43 UTC (permalink / raw) To: Michal Hocko Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli On (06/03/16 09:25), Michal Hocko wrote: > > it's quite hard to trigger the bug (somehow), so I can't > > follow up with more information as of now. either I did something very silly fixing up the patch, or the patch may be causing general protection faults on my system. RIP collect_mm_slot() + 0x42/0x84 khugepaged prepare_to_wait_event maybe_pmd_mkwrite kthread _raw_sin_unlock_irq ret_from_fork kthread_create_on_node collect_mm_slot() + 0x42/0x84 is 0000000000000328 <collect_mm_slot>: 328: 55 push %rbp 329: 48 89 e5 mov %rsp,%rbp 32c: 53 push %rbx 32d: 48 8b 5f 20 mov 0x20(%rdi),%rbx 331: 8b 43 48 mov 0x48(%rbx),%eax 334: ff c8 dec %eax 336: 7f 71 jg 3a9 <collect_mm_slot+0x81> 338: 48 8b 57 08 mov 0x8(%rdi),%rdx 33c: 48 85 d2 test %rdx,%rdx 33f: 74 1e je 35f <collect_mm_slot+0x37> 341: 48 8b 07 mov (%rdi),%rax 344: 48 85 c0 test %rax,%rax 347: 48 89 02 mov %rax,(%rdx) 34a: 74 04 je 350 <collect_mm_slot+0x28> 34c: 48 89 50 08 mov %rdx,0x8(%rax) 350: 48 c7 07 00 00 00 00 movq $0x0,(%rdi) 357: 48 c7 47 08 00 00 00 movq $0x0,0x8(%rdi) 35e: 00 35f: 48 8b 57 10 mov 0x10(%rdi),%rdx 363: 48 8b 47 18 mov 0x18(%rdi),%rax 367: 48 89 fe mov %rdi,%rsi 36a: 48 89 42 08 mov %rax,0x8(%rdx) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 36e: 48 89 10 mov %rdx,(%rax) 371: 48 b8 00 01 00 00 00 movabs $0xdead000000000100,%rax 378: 00 ad de 37b: 48 89 47 10 mov %rax,0x10(%rdi) 37f: 48 b8 00 02 00 00 00 movabs $0xdead000000000200,%rax 386: 00 ad de 389: 48 89 47 18 mov %rax,0x18(%rdi) 38d: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi # 394 
<collect_mm_slot+0x6c> 394: e8 00 00 00 00 callq 399 <collect_mm_slot+0x71> 399: f0 ff 4b 4c lock decl 0x4c(%rbx) 39d: 74 02 je 3a1 <collect_mm_slot+0x79> 39f: eb 08 jmp 3a9 <collect_mm_slot+0x81> 3a1: 48 89 df mov %rbx,%rdi 3a4: e8 00 00 00 00 callq 3a9 <collect_mm_slot+0x81> 3a9: 5b pop %rbx 3aa: 5d pop %rbp 3ab: c3 retq which is list_del(&mm_slot->mm_node), I believe. I attached the patch (just in case). --- mm/huge_memory.c | 87 +++++++++++++++++++++++++------------------------------- 1 file changed, 39 insertions(+), 48 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 292cedd..1c82fa4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1938,7 +1938,8 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm, static inline int khugepaged_test_exit(struct mm_struct *mm) { - return atomic_read(&mm->mm_users) == 0; + /* the only pin is from khugepaged_scan_mm_slot */ + return atomic_read(&mm->mm_users) <= 1; } int __khugepaged_enter(struct mm_struct *mm) @@ -1950,8 +1951,6 @@ int __khugepaged_enter(struct mm_struct *mm) if (!mm_slot) return -ENOMEM; - /* __khugepaged_exit() must not run from under us */ - VM_BUG_ON_MM(khugepaged_test_exit(mm), mm); if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) { free_mm_slot(mm_slot); return 0; @@ -1994,36 +1993,40 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma, return 0; } -void __khugepaged_exit(struct mm_struct *mm) +static void collect_mm_slot(struct mm_slot *mm_slot) { - struct mm_slot *mm_slot; - int free = 0; + struct mm_struct *mm = mm_slot->mm; - spin_lock(&khugepaged_mm_lock); - mm_slot = get_mm_slot(mm); - if (mm_slot && khugepaged_scan.mm_slot != mm_slot) { + VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock)); + + if (khugepaged_test_exit(mm)) { + /* free mm_slot */ hash_del(&mm_slot->hash); list_del(&mm_slot->mm_node); - free = 1; - } - spin_unlock(&khugepaged_mm_lock); - if (free) { - clear_bit(MMF_VM_HUGEPAGE, &mm->flags); - free_mm_slot(mm_slot); - 
mmdrop(mm); - } else if (mm_slot) { /* - * This is required to serialize against - * khugepaged_test_exit() (which is guaranteed to run - * under mmap sem read mode). Stop here (after we - * return all pagetables will be destroyed) until - * khugepaged has finished working on the pagetables - * under the mmap_sem. + * Not strictly needed because the mm exited already. + * + * clear_bit(MMF_VM_HUGEPAGE, &mm->flags); */ - down_write(&mm->mmap_sem); - up_write(&mm->mmap_sem); + + /* khugepaged_mm_lock actually not necessary for the below */ + free_mm_slot(mm_slot); + mmdrop(mm); + } +} + +void __khugepaged_exit(struct mm_struct *mm) +{ + struct mm_slot *mm_slot; + + spin_lock(&khugepaged_mm_lock); + mm_slot = get_mm_slot(mm); + if (mm_slot) { + collect_mm_slot(mm_slot); + clear_bit(MMF_VM_HUGEPAGE, &mm->flags); } + spin_unlock(&khugepaged_mm_lock); } static void release_pte_page(struct page *page) @@ -2738,29 +2741,6 @@ out: return ret; } -static void collect_mm_slot(struct mm_slot *mm_slot) -{ - struct mm_struct *mm = mm_slot->mm; - - VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock)); - - if (khugepaged_test_exit(mm)) { - /* free mm_slot */ - hash_del(&mm_slot->hash); - list_del(&mm_slot->mm_node); - - /* - * Not strictly needed because the mm exited already. - * - * clear_bit(MMF_VM_HUGEPAGE, &mm->flags); - */ - - /* khugepaged_mm_lock actually not necessary for the below */ - free_mm_slot(mm_slot); - mmdrop(mm); - } -} - static unsigned int khugepaged_scan_mm_slot(unsigned int pages, struct page **hpage) __releases(&khugepaged_mm_lock) @@ -2782,6 +2762,16 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, khugepaged_scan.address = 0; khugepaged_scan.mm_slot = mm_slot; } + + /* + * Do not even try to do anything if the current mm is already + * dead. khugepaged_mm_lock will make sure only this or + * __khugepaged_exit does the unhasing. 
+ */ + if (!atomic_inc_not_zero(&mm_slot->mm->mm_users)) { + collect_mm_slot(mm_slot); + return progress; + } spin_unlock(&khugepaged_mm_lock); mm = mm_slot->mm; @@ -2865,6 +2855,7 @@ breakouterloop_mmap_sem: collect_mm_slot(mm_slot); } + mmput_async(mm); return progress; } -- 2.9.0.rc1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 8:43 ` Sergey Senozhatsky @ 2016-06-03 9:55 ` Michal Hocko 2016-06-03 10:05 ` Michal Hocko 0 siblings, 1 reply; 28+ messages in thread From: Michal Hocko @ 2016-06-03 9:55 UTC (permalink / raw) To: Sergey Senozhatsky Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli On Fri 03-06-16 17:43:47, Sergey Senozhatsky wrote: > On (06/03/16 09:25), Michal Hocko wrote: > > > it's quite hard to trigger the bug (somehow), so I can't > > > follow up with more information as of now. > > either I did something very silly fixing up the patch, or the > patch may be causing general protection faults on my system. > > RIP collect_mm_slot() + 0x42/0x84 > khugepaged So is this really collect_mm_slot called directly from khugepaged or is some inlining going on there? > prepare_to_wait_event > maybe_pmd_mkwrite > kthread > _raw_sin_unlock_irq > ret_from_fork > kthread_create_on_node > > collect_mm_slot() + 0x42/0x84 is I guess that the problem is that I have missed that __khugepaged_exit doesn't clear the cached khugepaged_scan.mm_slot. Does the following on top fix that? --- diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6574c62ca4a3..e6f4e6fd587a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2021,6 +2021,8 @@ void __khugepaged_exit(struct mm_struct *mm) spin_lock(&khugepaged_mm_lock); mm_slot = get_mm_slot(mm); if (mm_slot) { + if (khugepaged_scan.mm_slot == mm_slot) + khugepaged_scan.mm_slot = NULL; collect_mm_slot(mm_slot); clear_bit(MMF_VM_HUGEPAGE, &mm->flags); } -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
From: Michal Hocko @ 2016-06-03 10:05 UTC
To: Sergey Senozhatsky
Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli

On Fri 03-06-16 11:55:49, Michal Hocko wrote:
> On Fri 03-06-16 17:43:47, Sergey Senozhatsky wrote:
> > On (06/03/16 09:25), Michal Hocko wrote:
> > > > it's quite hard to trigger the bug (somehow), so I can't
> > > > follow up with more information as of now.
> >
> > either I did something very silly fixing up the patch, or the
> > patch may be causing general protection faults on my system.
> >
> > RIP collect_mm_slot() + 0x42/0x84
> > khugepaged
>
> So is this really collect_mm_slot called directly from khugepaged or is
> some inlining going on there?
>
> > prepare_to_wait_event
> > maybe_pmd_mkwrite
> > kthread
> > _raw_sin_unlock_irq
> > ret_from_fork
> > kthread_create_on_node
> >
> > collect_mm_slot() + 0x42/0x84 is
>
> I guess that the problem is that I have missed that __khugepaged_exit
> doesn't clear the cached khugepaged_scan.mm_slot. Does the following on
> top fixes that?

That wouldn't be sufficient after a closer look. We need to do the same
from khugepaged_scan_mm_slot when atomic_inc_not_zero fails. So I guess
it would be better to stick it into collect_mm_slot.

Thanks for your testing!
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6574c62ca4a3..0432581fb87c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2011,6 +2011,9 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
 		/* khugepaged_mm_lock actually not necessary for the below */
 		free_mm_slot(mm_slot);
 		mmdrop(mm);
+
+		if (khugepaged_scan.mm_slot == mm_slot)
+			khugepaged_scan.mm_slot = NULL;
 	}
 }
 
--
Michal Hocko
SUSE Labs

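The follow-up above moves the "forget the cached slot" step into the function that frees the slot, so every caller gets it. A minimal user-space sketch of that pattern follows. It is not kernel code: `struct slot`, `struct scan_cursor` and `collect_slot()` are hypothetical stand-ins for `struct mm_slot`, `khugepaged_scan` and `collect_mm_slot()`, the liveness test is simplified to `users > 0`, and the cache is cleared before the free (the patch compares after `free_mm_slot()`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mm_slot. */
struct slot {
    int users;              /* models mm_users of the owning mm */
};

/* Hypothetical stand-in for khugepaged_scan. */
struct scan_cursor {
    struct slot *cached;    /* models khugepaged_scan.mm_slot */
};

/*
 * Free the slot once its owner is gone and make sure the scanner's
 * cached pointer cannot be followed afterwards.
 * Returns 1 if the slot was freed, 0 if it is still in use.
 */
static int collect_slot(struct scan_cursor *cur, struct slot *s)
{
    if (s->users > 0)
        return 0;           /* owner still alive, keep the slot */

    /* Drop the cached reference first, then release the memory. */
    if (cur->cached == s)
        cur->cached = NULL;
    free(s);
    return 1;
}
```

Because the reset lives inside the freeing helper, a caller that frees the slot from a different path cannot forget it, which is the point made in the message above.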
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
From: Sergey Senozhatsky @ 2016-06-03 13:38 UTC
To: Michal Hocko
Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli

On (06/03/16 12:05), Michal Hocko wrote:
> > > RIP collect_mm_slot() + 0x42/0x84
> > > khugepaged
> >
> > So is this really collect_mm_slot called directly from khugepaged or is
> > some inlining going on there?

inlining I suppose.

> > > prepare_to_wait_event
> > > maybe_pmd_mkwrite
> > > kthread
> > > _raw_sin_unlock_irq
> > > ret_from_fork
> > > kthread_create_on_node
> > >
> > > collect_mm_slot() + 0x42/0x84 is
> >
> > I guess that the problem is that I have missed that __khugepaged_exit
> > doesn't clear the cached khugepaged_scan.mm_slot. Does the following on
> > top fixes that?
>
> That wouldn't be sufficient after a closer look. We need to do the same
> from khugepaged_scan_mm_slot when atomic_inc_not_zero fails. So I guess
> it would be better to stick it into collect_mm_slot.

Michal, I'll try to test during the weekend (away from the affected box
now), but in the worst case it can as late as next Thursday (gonna travel
next week).

-ss

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
From: Michal Hocko @ 2016-06-03 13:45 UTC
To: Sergey Senozhatsky
Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli

On Fri 03-06-16 22:38:13, Sergey Senozhatsky wrote:
[...]
> Michal, I'll try to test during the weekend (away from the affected box
> now), but in the worst case it can as late as next Thursday (gonna travel
> next week).

No problem. I would really like to hear from Andrea before we give this
a serious try anyway.

Thanks!
--
Michal Hocko
SUSE Labs

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
From: Michal Hocko @ 2016-06-03 13:49 UTC
To: Sergey Senozhatsky, Andrea Arcangeli
Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel

On Fri 03-06-16 15:45:09, Michal Hocko wrote:
> On Fri 03-06-16 22:38:13, Sergey Senozhatsky wrote:
> [...]
> > Michal, I'll try to test during the weekend (away from the affected box
> > now), but in the worst case it can as late as next Thursday (gonna travel
> > next week).
>
> No problem. I would really like to hear from Andrea before we give this
> a serious try anyway.

And just for an easier review, here is what I have right now:
---
From 1fa9428b215cea4a48737fc9650009616a5bcd4e Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 2 Jun 2016 10:38:37 +0200
Subject: [PATCH] khugepaged: simplify khugepaged vs. __mmput

__khugepaged_exit is called during the final __mmput and it employs a
complex synchronization dances to make sure it doesn't race with the
khugepaged which might be scanning this mm at the same time. This is
all caused by the fact that khugepaged doesn't pin mm_users. Things
would simplify considerably if we simply check the mm at
khugepaged_scan_mm_slot and if mm_users was already 0 then we know it
is dead and we can unhash the mm_slot and move on to another one. This
will also guarantee that __khugepaged_exit cannot race with khugepaged
and so we can free up the slot if it is still hashed.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/huge_memory.c | 90 ++++++++++++++++++++++++++------------------------------
 1 file changed, 42 insertions(+), 48 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de62bd991827..0432581fb87c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1936,7 +1936,8 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
 
 static inline int khugepaged_test_exit(struct mm_struct *mm)
 {
-	return atomic_read(&mm->mm_users) == 0;
+	/* the only pin is from khugepaged_scan_mm_slot */
+	return atomic_read(&mm->mm_users) <= 1;
 }
 
 int __khugepaged_enter(struct mm_struct *mm)
@@ -1948,8 +1949,6 @@ int __khugepaged_enter(struct mm_struct *mm)
 	if (!mm_slot)
 		return -ENOMEM;
 
-	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
 		free_mm_slot(mm_slot);
 		return 0;
@@ -1992,36 +1991,43 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 	return 0;
 }
 
-void __khugepaged_exit(struct mm_struct *mm)
+static void collect_mm_slot(struct mm_slot *mm_slot)
 {
-	struct mm_slot *mm_slot;
-	int free = 0;
+	struct mm_struct *mm = mm_slot->mm;
 
-	spin_lock(&khugepaged_mm_lock);
-	mm_slot = get_mm_slot(mm);
-	if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
+	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
+
+	if (khugepaged_test_exit(mm)) {
+		/* free mm_slot */
 		hash_del(&mm_slot->hash);
 		list_del(&mm_slot->mm_node);
-		free = 1;
-	}
-	spin_unlock(&khugepaged_mm_lock);
 
-	if (free) {
-		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
-		free_mm_slot(mm_slot);
-		mmdrop(mm);
-	} else if (mm_slot) {
 		/*
-		 * This is required to serialize against
-		 * khugepaged_test_exit() (which is guaranteed to run
-		 * under mmap sem read mode). Stop here (after we
-		 * return all pagetables will be destroyed) until
-		 * khugepaged has finished working on the pagetables
-		 * under the mmap_sem.
+		 * Not strictly needed because the mm exited already.
+		 *
+		 * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
 		 */
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
+
+		/* khugepaged_mm_lock actually not necessary for the below */
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+
+		if (khugepaged_scan.mm_slot == mm_slot)
+			khugepaged_scan.mm_slot = NULL;
+	}
+}
+
+void __khugepaged_exit(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+
+	spin_lock(&khugepaged_mm_lock);
+	mm_slot = get_mm_slot(mm);
+	if (mm_slot) {
+		collect_mm_slot(mm_slot);
+		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
 	}
+	spin_unlock(&khugepaged_mm_lock);
 }
 
 static void release_pte_page(struct page *page)
@@ -2736,29 +2742,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	return ret;
 }
 
-static void collect_mm_slot(struct mm_slot *mm_slot)
-{
-	struct mm_struct *mm = mm_slot->mm;
-
-	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
-
-	if (khugepaged_test_exit(mm)) {
-		/* free mm_slot */
-		hash_del(&mm_slot->hash);
-		list_del(&mm_slot->mm_node);
-
-		/*
-		 * Not strictly needed because the mm exited already.
-		 *
-		 * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
-		 */
-
-		/* khugepaged_mm_lock actually not necessary for the below */
-		free_mm_slot(mm_slot);
-		mmdrop(mm);
-	}
-}
-
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 					    struct page **hpage)
 	__releases(&khugepaged_mm_lock)
@@ -2780,6 +2763,16 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		khugepaged_scan.address = 0;
 		khugepaged_scan.mm_slot = mm_slot;
 	}
+
+	/*
+	 * Do not even try to do anything if the current mm is already
+	 * dead. khugepaged_mm_lock will make sure only this or
+	 * __khugepaged_exit does the unhasing.
+	 */
+	if (!atomic_inc_not_zero(&mm_slot->mm->mm_users)) {
+		collect_mm_slot(mm_slot);
+		return progress;
+	}
 	spin_unlock(&khugepaged_mm_lock);
 
 	mm = mm_slot->mm;
@@ -2863,6 +2856,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		collect_mm_slot(mm_slot);
 	}
 
+	mmput_async(mm);
 	return progress;
 }
 
--
2.8.1

--
Michal Hocko
SUSE Labs

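The patch above relies on atomic_inc_not_zero() to pin mm_users only while the mm is still alive. For readers unfamiliar with that primitive, here is a minimal user-space sketch of the same semantics using C11 atomics. It is an illustration under assumptions, not the kernel implementation: `inc_not_zero()` and `put_ref()` are hypothetical helper names, and the counter is a plain `atomic_int` rather than the kernel's `atomic_t`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * User-space analogue of atomic_inc_not_zero(): take a reference only
 * if the counter has not already dropped to zero.
 */
static bool inc_not_zero(atomic_int *users)
{
    int old = atomic_load(users);

    while (old != 0) {
        /* On failure, old is refreshed with the current value. */
        if (atomic_compare_exchange_weak(users, &old, old + 1))
            return true;
    }
    return false;           /* object is already dead, do not touch it */
}

/* Hypothetical helper: drop a reference, report whether it was the last. */
static bool put_ref(atomic_int *users)
{
    return atomic_fetch_sub(users, 1) == 1;
}
```

Once the counter has reached zero it can never be revived through `inc_not_zero()`, which is what lets the scanner treat a failed pin as "this mm is dead, unhash its slot and move on".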
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
From: Sergey Senozhatsky @ 2016-06-04 7:51 UTC
To: Michal Hocko
Cc: Sergey Senozhatsky, Andrea Arcangeli, Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel

Hello,

On (06/03/16 15:49), Michal Hocko wrote:
> __khugepaged_exit is called during the final __mmput and it employs a
> complex synchronization dances to make sure it doesn't race with the
> khugepaged which might be scanning this mm at the same time. This is
> all caused by the fact that khugepaged doesn't pin mm_users. Things
> would simplify considerably if we simply check the mm at
> khugepaged_scan_mm_slot and if mm_users was already 0 then we know it
> is dead and we can unhash the mm_slot and move on to another one. This
> will also guarantee that __khugepaged_exit cannot race with khugepaged
> and so we can free up the slot if it is still hashed.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

with this patch and
http://ozlabs.org/~akpm/mmotm/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem-fix-2.patch

I saw no problems during my tests (well, may be didn't test hard
enough).

-ss

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
From: Michal Hocko @ 2016-06-06 8:39 UTC
To: Sergey Senozhatsky
Cc: Sergey Senozhatsky, Andrea Arcangeli, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel

On Sat 04-06-16 16:51:14, Sergey Senozhatsky wrote:
> Hello,
>
> On (06/03/16 15:49), Michal Hocko wrote:
> > __khugepaged_exit is called during the final __mmput and it employs a
> > complex synchronization dances to make sure it doesn't race with the
> > khugepaged which might be scanning this mm at the same time. This is
> > all caused by the fact that khugepaged doesn't pin mm_users. Things
> > would simplify considerably if we simply check the mm at
> > khugepaged_scan_mm_slot and if mm_users was already 0 then we know it
> > is dead and we can unhash the mm_slot and move on to another one. This
> > will also guarantee that __khugepaged_exit cannot race with khugepaged
> > and so we can free up the slot if it is still hashed.
> >
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
>
> with this patch and
> http://ozlabs.org/~akpm/mmotm/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem-fix-2.patch
>
> I saw no problems during my tests (well, may be didn't test hard
> enough).

Thanks for the testing!
--
Michal Hocko
SUSE Labs

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
From: Vlastimil Babka @ 2016-06-02 13:24 UTC
To: Sergey Senozhatsky, Andrew Morton, Ebru Akagunduz
Cc: Michal Hocko, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Rik van Riel, Andrea Arcangeli

[+CC's]

On 06/02/2016 03:48 AM, Sergey Senozhatsky wrote:
> On (06/01/16 13:11), Stephen Rothwell wrote:
>> Hi all,
>>
>> Changes since 20160531:
>>
>> My fixes tree contains:
>>
>> of: silence warnings due to max() usage
>>
>> The arm tree gained a conflict against Linus' tree.
>>
>> Non-merge commits (relative to Linus' tree): 1100
>> 936 files changed, 38159 insertions(+), 17475 deletions(-)
>
> Hello,
>
> the cc1 process ended up in DN state during kernel -j4 compilation.
>
> ...
> [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds.
> [ 2856.323055] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2856.323059] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000
> [ 2856.323062] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> [ 2856.323065] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> [ 2856.323067] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> [ 2856.323068] Call Trace:
> [ 2856.323074] [<ffffffff81441e33>] schedule+0x83/0x98
> [ 2856.323077] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> [ 2856.323080] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> [ 2856.323083] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> [ 2856.323084] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> [ 2856.323086] [<ffffffff81443630>] down_write+0x1f/0x2e
> [ 2856.323089] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> [ 2856.323091] [<ffffffff8103702a>] mmput+0x29/0xc5
> [ 2856.323093] [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> [ 2856.323095] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> [ 2856.323097] [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> [ 2856.323099] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> [ 2856.323101] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
>
> [ 2877.322853] INFO: task cc1:4582 blocked for more than 21 seconds.
> [ 2877.322858] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> [ 2877.322858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2877.322861] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000
> [ 2877.322865] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> [ 2877.322867] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> [ 2877.322867] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> [ 2877.322870] Call Trace:
> [ 2877.322875] [<ffffffff81441e33>] schedule+0x83/0x98
> [ 2877.322878] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> [ 2877.322881] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> [ 2877.322884] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> [ 2877.322885] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> [ 2877.322887] [<ffffffff81443630>] down_write+0x1f/0x2e
> [ 2877.322890] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> [ 2877.322892] [<ffffffff8103702a>] mmput+0x29/0xc5
> [ 2877.322894] [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> [ 2877.322896] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> [ 2877.322898] [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> [ 2877.322900] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> [ 2877.322902] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f

I think it's this patch:

http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem.patch

Some parts of the code in collapse_huge_page() that were under
down_write(mmap_sem) are under down_read() after the patch. But
there's "goto out" which continues via "goto out_up_write" which
does up_write(mmap_sem) so there's an imbalance. One path seems to
go via both up_read() and up_write(). I can imagine this can cause a
stuck down_write() among other things?

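To make the imbalance described above concrete, here is a toy user-space model of a counting reader/writer lock. It is only a sketch under stated assumptions: `struct toy_rwsem`, `TOY_WRITE_BIAS` and the `toy_*` helpers are hypothetical and do not reflect the real rwsem internals; the model merely tracks whether releases match acquires.

```c
#include <assert.h>

/* Arbitrary toy value; it only has to differ from the reader increment. */
#define TOY_WRITE_BIAS 1000L

/* Toy lock state: a count of 0 means nobody holds the lock. */
struct toy_rwsem {
    long count;
};

static void toy_down_read(struct toy_rwsem *s)  { s->count += 1; }
static void toy_up_read(struct toy_rwsem *s)    { s->count -= 1; }
static void toy_down_write(struct toy_rwsem *s) { s->count += TOY_WRITE_BIAS; }
static void toy_up_write(struct toy_rwsem *s)   { s->count -= TOY_WRITE_BIAS; }

/* Error path as described above: up_read() and then also up_write(). */
static long toy_buggy_error_path(void)
{
    struct toy_rwsem sem = { 0 };

    toy_down_read(&sem);
    toy_up_read(&sem);
    toy_up_write(&sem);     /* no matching toy_down_write() */
    return sem.count;
}

/* Balanced variants: every release matches the acquire before it. */
static long toy_balanced_read_path(void)
{
    struct toy_rwsem sem = { 0 };

    toy_down_read(&sem);
    toy_up_read(&sem);
    return sem.count;
}

static long toy_balanced_write_path(void)
{
    struct toy_rwsem sem = { 0 };

    toy_down_write(&sem);
    toy_up_write(&sem);
    return sem.count;
}
```

In the toy model the unbalanced path leaves a non-zero count behind, so the lock never looks free again. That is at least consistent with the suspicion above that a later down_write() could block indefinitely.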
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
From: Ebru Akagunduz @ 2016-06-02 18:58 UTC
To: vbabka, sergey.senozhatsky.work, akpm
Cc: mhocko, kirill.shutemov, sfr, linux-mm, linux-next, linux-kernel, riel, aarcange

On Thu, Jun 02, 2016 at 03:24:05PM +0200, Vlastimil Babka wrote:
> [+CC's]
>
> On 06/02/2016 03:48 AM, Sergey Senozhatsky wrote:
> >On (06/01/16 13:11), Stephen Rothwell wrote:
> >>Hi all,
> >>
> >>Changes since 20160531:
> >>
> >>My fixes tree contains:
> >>
> >> of: silence warnings due to max() usage
> >>
> >>The arm tree gained a conflict against Linus' tree.
> >>
> >>Non-merge commits (relative to Linus' tree): 1100
> >> 936 files changed, 38159 insertions(+), 17475 deletions(-)
> >
> >Hello,
> >
> >the cc1 process ended up in DN state during kernel -j4 compilation.
> >
> >...
> >[ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds.
> >[ 2856.323055] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> >[ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >[ 2856.323059] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000
> >[ 2856.323062] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> >[ 2856.323065] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> >[ 2856.323067] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> >[ 2856.323068] Call Trace:
> >[ 2856.323074] [<ffffffff81441e33>] schedule+0x83/0x98
> >[ 2856.323077] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> >[ 2856.323080] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> >[ 2856.323083] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> >[ 2856.323084] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> >[ 2856.323086] [<ffffffff81443630>] down_write+0x1f/0x2e
> >[ 2856.323089] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> >[ 2856.323091] [<ffffffff8103702a>] mmput+0x29/0xc5
> >[ 2856.323093] [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> >[ 2856.323095] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> >[ 2856.323097] [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> >[ 2856.323099] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> >[ 2856.323101] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
> >
> >[ 2877.322853] INFO: task cc1:4582 blocked for more than 21 seconds.
> >[ 2877.322858] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> >[ 2877.322858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >[ 2877.322861] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000
> >[ 2877.322865] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> >[ 2877.322867] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> >[ 2877.322867] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> >[ 2877.322870] Call Trace:
> >[ 2877.322875] [<ffffffff81441e33>] schedule+0x83/0x98
> >[ 2877.322878] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> >[ 2877.322881] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> >[ 2877.322884] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> >[ 2877.322885] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> >[ 2877.322887] [<ffffffff81443630>] down_write+0x1f/0x2e
> >[ 2877.322890] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> >[ 2877.322892] [<ffffffff8103702a>] mmput+0x29/0xc5
> >[ 2877.322894] [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> >[ 2877.322896] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> >[ 2877.322898] [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> >[ 2877.322900] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> >[ 2877.322902] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
>
> I think it's this patch:
>
> http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem.patch
>
> Some parts of the code in collapse_huge_page() that were under
> down_write(mmap_sem) are under down_read() after the patch. But
> there's "goto out" which continues via "goto out_up_write" which
> does up_write(mmap_sem) so there's an imbalance. One path seems to
> go via both up_read() and up_write(). I can imagine this can cause a
> stuck down_write() among other things?
Recently, I realized the same imbalance, it is an obvious
inconsistency. I don't know, this issue can be related with
mine. I'll send a fix patch.

Kind regards.

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
From: Sergey Senozhatsky @ 2016-06-03 1:00 UTC
To: Ebru Akagunduz
Cc: Vlastimil Babka, sergey.senozhatsky.work, Andrew Morton, Michal Hocko, Kirill A. Shutemov, Stephen Rothwell, Andrea Arcangeli, Rik van Riel, linux-mm, linux-next, linux-kernel

On (06/02/16 21:58), Ebru Akagunduz wrote:
[..]
> > I think it's this patch:
> >
> > http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem.patch
> >
> > Some parts of the code in collapse_huge_page() that were under
> > down_write(mmap_sem) are under down_read() after the patch. But
> > there's "goto out" which continues via "goto out_up_write" which
> > does up_write(mmap_sem) so there's an imbalance. One path seems to
> > go via both up_read() and up_write(). I can imagine this can cause a
> > stuck down_write() among other things?
> Recently, I realized the same imbalance, it is an obvious
> inconsistency. I don't know, this issue can be related with
> mine. I'll send a fix patch.

a good find by Vlastimil.

Ebru, can you also re-visit __collapse_huge_page_swapin()? it's called
from collapse_huge_page() under the down_read(&mm->mmap_sem), is there
any reason to do the nested down_read(&mm->mmap_sem)?

collapse_huge_page()
...
    down_read(&mm->mmap_sem);
    result = hugepage_vma_revalidate(mm, vma, address);
    if (result)
        goto out;

    pmd = mm_find_pmd(mm, address);
    if (!pmd) {
        result = SCAN_PMD_NULL;
        goto out;
    }

    if (allocstall == curr_allocstall && swap != 0) {
        if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) {
        {
        :   if (ret & VM_FAULT_RETRY) {
        :       down_read(&mm->mmap_sem);
        :       ^^^^^^^^^
        :       if (hugepage_vma_revalidate(mm, vma, address))
        :           return false;
        :   }
        }

            up_read(&mm->mmap_sem);
            goto out;
        }
    }

    up_read(&mm->mmap_sem);

so if __collapse_huge_page_swapin() retruns true we have:
- down_read() twice, up_read() once?

the locking rules here are a bit confusing. (I didn't have my morning coffee yet).

-ss

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
From: Sergey Senozhatsky @ 2016-06-03 1:29 UTC
To: Sergey Senozhatsky
Cc: Ebru Akagunduz, Vlastimil Babka, Andrew Morton, Michal Hocko, Kirill A. Shutemov, Stephen Rothwell, Andrea Arcangeli, Rik van Riel, linux-mm, linux-next, linux-kernel

On (06/03/16 10:00), Sergey Senozhatsky wrote:
> a good find by Vlastimil.
>
> Ebru, can you also re-visit __collapse_huge_page_swapin()? it's called
> from collapse_huge_page() under the down_read(&mm->mmap_sem), is there
> any reason to do the nested down_read(&mm->mmap_sem)?
>
> collapse_huge_page()
> ...
>     down_read(&mm->mmap_sem);
>     result = hugepage_vma_revalidate(mm, vma, address);
>     if (result)
>         goto out;
>
>     pmd = mm_find_pmd(mm, address);
>     if (!pmd) {
>         result = SCAN_PMD_NULL;
>         goto out;
>     }
>
>     if (allocstall == curr_allocstall && swap != 0) {
>         if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) {
>         {
>         :   if (ret & VM_FAULT_RETRY) {
>         :       down_read(&mm->mmap_sem);
>         :       ^^^^^^^^^

oh... it's in a loop

    for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
         pte++, _address += PAGE_SIZE) {
        ret = do_swap_page()
        if (ret & VM_FAULT_RETRY) {
            down_read(&mm->mmap_sem);
            ^^^^^^^^^
            ...
        }
    }

so there can be multiple sem->count++ in __collapse_huge_page_swapin(),
and you don't know how many sem->count-- you need to do later? is this
correct or am I hallucinating?

-ss

>         :       if (hugepage_vma_revalidate(mm, vma, address))
>         :           return false;
>         :   }
>         }
>
>             up_read(&mm->mmap_sem);
>             goto out;
>         }
>     }
>
>     up_read(&mm->mmap_sem);
>
>
>
> so if __collapse_huge_page_swapin() retruns true we have:
> - down_read() twice, up_read() once?
>
> the locking rules here are a bit confusing. (I didn't have my morning coffee yet).
>
> -ss
>

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
From: Sergey Senozhatsky @ 2016-06-03 4:14 UTC
To: Sergey Senozhatsky
Cc: Ebru Akagunduz, Vlastimil Babka, Andrew Morton, Michal Hocko, Kirill A. Shutemov, Stephen Rothwell, Andrea Arcangeli, Rik van Riel, linux-mm, linux-next, linux-kernel

On (06/03/16 10:29), Sergey Senozhatsky wrote:
> >     if (allocstall == curr_allocstall && swap != 0) {
> >         if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) {
> >         {
> >         :   if (ret & VM_FAULT_RETRY) {
> >         :       down_read(&mm->mmap_sem);
> >         :       ^^^^^^^^^
>
> oh... it's in a loop
>
>     for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
>          pte++, _address += PAGE_SIZE) {
>         ret = do_swap_page()
>         if (ret & VM_FAULT_RETRY) {
>             down_read(&mm->mmap_sem);
>             ^^^^^^^^^
>             ...
>         }
>     }
>
> so there can be multiple sem->count++ in __collapse_huge_page_swapin(),
> and you don't know how many sem->count-- you need to do later? is this
> correct or am I hallucinating?

No, I was wrong, sorry for the noise.

it's getting unlocked in

__collapse_huge_page_swapin()
    do_swap_page()
        lock_page_or_retry()
            if (flags & FAULT_FLAG_ALLOW_RETRY)
                up_read(&mm->mmap_sem);
            return VM_FAULT_RETRY

-ss

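The conclusion above is that each retry drops the read lock inside the callee before the caller takes it again, so the loop stays balanced. A toy user-space model of that bookkeeping is sketched below. The names `read_depth`, `toy_swap_page()` and `toy_swapin()` are hypothetical; this is not kernel code and only counts acquires against releases.

```c
#include <assert.h>
#include <stdbool.h>

/* Models how many times the read lock is currently held by this path. */
static int read_depth;

/*
 * Models do_swap_page(): when a retry is needed, the lock has already
 * been dropped by the time the caller sees the retry result.
 */
static bool toy_swap_page(bool need_retry)
{
    if (need_retry) {
        read_depth--;       /* the up_read() done on the retry path */
        return true;        /* stands in for VM_FAULT_RETRY */
    }
    return false;
}

/* Models the swap-in loop: re-take the lock after every retry. */
static void toy_swapin(const bool *retry, int n)
{
    for (int i = 0; i < n; i++) {
        if (toy_swap_page(retry[i]))
            read_depth++;   /* the down_read() in the loop body */
    }
}
```

However many iterations ask for a retry, the depth after the loop equals the depth before it, so the single up_read() in the caller is enough.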
* [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page
From: Ebru Akagunduz @ 2016-06-03 12:28 UTC
To: akpm
Cc: vbabka, sergey.senozhatsky.work, mhocko, kirill.shutemov, sfr, linux-mm, linux-next, linux-kernel, riel, aarcange, Ebru Akagunduz

After creating revalidate vma function, locking inconsistency occured
due to directing the code path to wrong label. This patch directs
to correct label and fix the inconsistency.

Related commit that caused inconsistency:
http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0

Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
---
 mm/huge_memory.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 292cedd..8043d91 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2493,13 +2493,18 @@ static void collapse_huge_page(struct mm_struct *mm,
 	curr_allocstall = sum_vm_event(ALLOCSTALL);
 	down_read(&mm->mmap_sem);
 	result = hugepage_vma_revalidate(mm, vma, address);
-	if (result)
-		goto out;
+	if (result) {
+		mem_cgroup_cancel_charge(new_page, memcg, true);
+		up_read(&mm->mmap_sem);
+		goto out_nolock;
+	}
 
 	pmd = mm_find_pmd(mm, address);
 	if (!pmd) {
 		result = SCAN_PMD_NULL;
-		goto out;
+		mem_cgroup_cancel_charge(new_page, memcg, true);
+		up_read(&mm->mmap_sem);
+		goto out_nolock;
 	}
 
 	/*
@@ -2513,8 +2518,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 		 * label out. Continuing to collapse causes inconsistency.
 		 */
 		if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) {
+			mem_cgroup_cancel_charge(new_page, memcg, true);
 			up_read(&mm->mmap_sem);
-			goto out;
+			goto out_nolock;
 		}
 	}
 
--
1.9.1

* Re: [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page
From: Vlastimil Babka @ 2016-06-06 13:05 UTC
To: Ebru Akagunduz, akpm
Cc: sergey.senozhatsky.work, mhocko, kirill.shutemov, sfr, linux-mm, linux-next, linux-kernel, riel, aarcange

On 06/03/2016 02:28 PM, Ebru Akagunduz wrote:
> After creating revalidate vma function, locking inconsistency occured
> due to directing the code path to wrong label. This patch directs
> to correct label and fix the inconsistency.
>
> Related commit that caused inconsistency:
> http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0
>
> Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>

I think this does fix the inconsistency, thanks.

But looking at collapse_huge_page() as of latest -next, I wonder if there's
another problem:

    pmd = mm_find_pmd(mm, address);
    ...
    up_read(&mm->mmap_sem);
    down_write(&mm->mmap_sem);
    hugepage_vma_revalidate(mm, address);
    ...
    pte = pte_offset_map(pmd, address);

What guarantees that 'pmd' is still valid?

Vlastimil

* Re: [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page
From: Sergey Senozhatsky @ 2016-06-09 3:51 UTC
To: Vlastimil Babka
Cc: Ebru Akagunduz, akpm, sergey.senozhatsky.work, mhocko, kirill.shutemov, sfr, linux-mm, linux-next, linux-kernel, riel, aarcange

On (06/06/16 15:05), Vlastimil Babka wrote:
[..]
> I think this does fix the inconsistency, thanks.
>
> But looking at collapse_huge_page() as of latest -next, I wonder if there's
> another problem:
>
>     pmd = mm_find_pmd(mm, address);
>     ...
>     up_read(&mm->mmap_sem);
>     down_write(&mm->mmap_sem);
>     hugepage_vma_revalidate(mm, address);
>     ...
>     pte = pte_offset_map(pmd, address);
>
> What guarantees that 'pmd' is still valid?

the same question applied to __collapse_huge_page_swapin(), I think.

__collapse_huge_page_swapin(pmd)
    pte = pte_offset_map(pmd, address);
    do_swap_page(mm, vma, _address, pte, pmd...)
    up_read(&mm->mmap_sem);
    down_read(&mm->mmap_sem);
    pte = pte_offset_map(pmd, _address);

-ss

Thread overview: 28+ messages
2016-06-01 3:11 linux-next: Tree for Jun 1 Stephen Rothwell
2016-06-02 1:48 ` [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup Sergey Senozhatsky
2016-06-02 9:21 ` Michal Hocko
2016-06-02 12:08 ` Sergey Senozhatsky
2016-06-02 12:21 ` Michal Hocko
2016-06-03 13:51 ` Andrea Arcangeli
2016-06-03 14:46 ` Michal Hocko
2016-06-03 15:10 ` Andrea Arcangeli
2016-06-07 7:34 ` Michal Hocko
2016-06-08 8:19 ` Vlastimil Babka
2016-06-03 7:15 ` Sergey Senozhatsky
2016-06-03 7:25 ` Michal Hocko
2016-06-03 8:43 ` Sergey Senozhatsky
2016-06-03 9:55 ` Michal Hocko
2016-06-03 10:05 ` Michal Hocko
2016-06-03 13:38 ` Sergey Senozhatsky
2016-06-03 13:45 ` Michal Hocko
2016-06-03 13:49 ` Michal Hocko
2016-06-04 7:51 ` Sergey Senozhatsky
2016-06-06 8:39 ` Michal Hocko
2016-06-02 13:24 ` Vlastimil Babka
2016-06-02 18:58 ` Ebru Akagunduz
2016-06-03 1:00 ` Sergey Senozhatsky
2016-06-03 1:29 ` Sergey Senozhatsky
2016-06-03 4:14 ` Sergey Senozhatsky
2016-06-03 12:28 ` [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page Ebru Akagunduz
2016-06-06 13:05 ` Vlastimil Babka
2016-06-09 3:51 ` Sergey Senozhatsky