linux-next.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* linux-next: Tree for Jun 1
@ 2016-06-01  3:11 Stephen Rothwell
  2016-06-02  1:48 ` [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup Sergey Senozhatsky
  0 siblings, 1 reply; 28+ messages in thread
From: Stephen Rothwell @ 2016-06-01  3:11 UTC (permalink / raw)
  To: linux-next; +Cc: linux-kernel

Hi all,

Changes since 20160531:

My fixes tree contains:

  of: silence warnings due to max() usage

The arm tree gained a conflict against Linus' tree.

Non-merge commits (relative to Linus' tree): 1100
 936 files changed, 38159 insertions(+), 17475 deletions(-)

----------------------------------------------------------------------------

I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ).  If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one.  You should use "git fetch" and checkout or reset to the new
master.

You can see which trees have been included by looking in the Next/Trees
file in the source.  There are also quilt-import.log and merge.log
files in the Next directory.  Between each merge, the tree was built
with a ppc64_defconfig for powerpc and an allmodconfig (with
CONFIG_BUILD_DOCSRC=n) for x86_64, a multi_v7_defconfig for arm and a
native build of tools/perf. After the final fixups (if any), I do an
x86_64 modules_install followed by builds for x86_64 allnoconfig,
powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig
(this fails its final link) and pseries_le_defconfig and i386, sparc
and sparc64 defconfig.

Below is a summary of the state of the merge.

I am currently merging 236 trees (counting Linus' and 35 trees of patches
pending for Linus' tree).

Stats about the size of the tree over time can be seen at
http://neuling.org/linux-next-size.html .

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next .  If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.

Thanks to Randy Dunlap for doing many randconfig builds.  And to Paul
Gortmaker for triage and bug fixes.

-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (367d3fd50566 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux)
Merging fixes/master (b31033aacbd0 of: silence warnings due to max() usage)
Merging kbuild-current/rc-fixes (3d1450d54a4f Makefile: Force gzip and xz on module install)
Merging arc-current/for-curr (49acadff2a0c arc: Get rid of root core-frequency property)
Merging arm-current/fixes (85c42e89f312 ARM: fix PTRACE_SETVFPREGS on SMP systems)
Merging m68k-current/for-linus (9a6462763b17 m68k/mvme16x: Include generic <linux/rtc.h>)
Merging metag-fixes/fixes (0164a711c97b metag: Fix ioremap_wc/ioremap_cached build errors)
Merging powerpc-fixes/fixes (8dd75ccb571f powerpc: Use privileged SPR number for MMCR2)
Merging powerpc-merge-mpe/fixes (bc0195aad0da Linux 4.2-rc2)
Merging sparc/master (7cafc0b8bf13 sparc64: Fix return from trap window fill crashes.)
Merging net/master (d69d16949346 usbnet: smsc95xx: fix link detection for disabled autonegotiation)
Merging ipsec/master (d6af1a31cc72 vti: Add pmtu handling to vti_xmit.)
Merging ipvs/master (f28f20da704d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net)
Merging wireless-drivers/master (de26859dcf36 rtlwifi: Fix scheduling while atomic error from commit 49f86ec21c01)
Merging mac80211/master (6fe04128f158 mac80211: fix fast_tx header alignment)
Merging sound-current/for-linus (0358ccc8ffd8 ALSA: uapi: Add three missing header files to Kbuild file)
Merging pci-current/for-linus (1a695a905c18 Linux 4.7-rc1)
Merging driver-core.current/driver-core-linus (1a695a905c18 Linux 4.7-rc1)
Merging tty.current/tty-linus (1a695a905c18 Linux 4.7-rc1)
Merging usb.current/usb-linus (1a695a905c18 Linux 4.7-rc1)
Merging usb-gadget-fixes/fixes (27a0faafdca5 usb: dwc3: st: Fix USB_DR_MODE_PERIPHERAL configuration.)
Merging usb-serial-fixes/usb-linus (74d2a91aec97 USB: serial: option: add even more ZTE device ids)
Merging usb-chipidea-fixes/ci-for-usb-stable (d144dfea8af7 usb: chipidea: otg: change workqueue ci_otg as freezable)
Merging staging.current/staging-linus (1a695a905c18 Linux 4.7-rc1)
Merging char-misc.current/char-misc-linus (1a695a905c18 Linux 4.7-rc1)
Merging input-current/for-linus (f49cf3b8b4c8 Input: pwm-beeper - fix - scheduling while atomic)
Merging crypto-current/master (ab6a11a7c8ef crypto: ccp - Fix AES XTS error for request sizes above 4096)
Merging ide/master (1993b176a822 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide)
Merging devicetree-current/devicetree/merge (f76502aa9140 of/dynamic: Fix test for PPC_PSERIES)
Merging rr-fixes/fixes (8244062ef1e5 modules: fix longstanding /proc/kallsyms vs module insertion race.)
Merging vfio-fixes/for-linus (089f1c6b2dae vfio/type1: Fix build warning)
Merging kselftest-fixes/fixes (505ce68c6da3 selftest/seccomp: Fix the seccomp(2) signature)
Merging backlight-fixes/for-backlight-fixes (68feaca0b13e backlight: pwm: Handle EPROBE_DEFER while requesting the PWM)
Merging ftrace-fixes/for-next-urgent (6224beb12e19 tracing: Have branch tracer use recursive field of task struct)
Merging mfd-fixes/for-mfd-fixes (1b52e50f2a40 mfd: max77843: Fix max77843_chg_init() return on error)
Merging drm-intel-fixes/for-linux-next-fixes (1a695a905c18 Linux 4.7-rc1)
Merging asm-generic/master (b0da6d44157a asm-generic: Drop renameat syscall from default list)
Merging arc/for-next (776d7f1694a7 arc: axs103_smp: Fix CPU frequency to 100MHz for dual-core)
Merging arm/for-next (6a606d21bc0d Merge branches 'component' and 'fixes' into for-next)
CONFLICT (content): Merge conflict in drivers/gpu/drm/rockchip/rockchip_drm_drv.c
Merging arm-perf/for-next/perf (4ba2578fa7b5 arm64: perf: don't expose CHAIN event in sysfs)
Merging arm-soc/for-next (d6be64b09dd1 Merge branch 'fixes' into for-next)
Merging at91/at91-next (5a0d7c6a48ae Merge branch 'at91-4.7-defconfig' into at91-next)
Merging bcm2835-dt/bcm2835-dt-next (6a93792774fc ARM: bcm2835: dt: Add the ethernet to the device trees)
Merging bcm2835-soc/bcm2835-soc-next (92e963f50fc7 Linux 4.5-rc1)
Merging bcm2835-drivers/bcm2835-drivers-next (92e963f50fc7 Linux 4.5-rc1)
Merging bcm2835-defconfig/bcm2835-defconfig-next (1a695a905c18 Linux 4.7-rc1)
Merging berlin/berlin/for-next (9a7e06833249 Merge branch 'berlin/fixes' into berlin/for-next)
Merging cortex-m/for-next (f719a0d6a854 ARM: efm32: switch to vendor,device compatible strings)
Merging imx-mxs/for-next (63b44471754b Merge branch 'imx/defconfig64' into for-next)
Merging keystone/next (02e15d234006 Merge branch 'for_4.7/kesytone' into next)
Merging mvebu/for-next (01316cded75b Merge branch 'mvebu/defconfig' into mvebu/for-next)
Merging omap/for-next (5c66191b5c76 Merge branch 'omap-for-v4.7/dt' into for-next)
Merging omap-pending/for-next (c20c8f750d9f ARM: OMAP2+: hwmod: fix _idle() hwmod state sanity check sequence)
Merging qcom/for-next (eb8e0105700b firmware: qcom_scm: Make core clock optional)
Merging renesas/next (1df83bd17bee Merge branches 'heads/arm64-dt-for-v4.8', 'heads/dt-for-v4.8', 'heads/soc-for-v4.8' and 'heads/sh-drivers-for-v4.8' into next)
Merging rockchip/for-next (bc64bf4164ed Merge branch 'v4.7-clk/fixes' into for-next)
Merging rpi/for-rpi-next (bc0195aad0da Linux 4.2-rc2)
Merging samsung/for-next (92e963f50fc7 Linux 4.5-rc1)
Merging samsung-krzk/for-next (b68cbd51dbe4 Merge branch 'for-v4.8/dts-exynos5410-odroid-xu' into for-next)
Merging tegra/for-next (5c282bc9d0a3 Merge branch for-4.7/defconfig into for-next)
Merging arm64/for-next/core (e6d9a5254333 arm64: do not enforce strict 16 byte alignment to stack pointer)
Merging blackfin/for-linus (391e74a51ea2 eth: bf609 eth clock: add pclk clock for stmmac driver probe)
CONFLICT (content): Merge conflict in arch/blackfin/mach-common/pm.c
Merging c6x/for-linux-next (ca3060d39ae7 c6x: Use generic clkdev.h header)
Merging cris/for-next (f9f3f864b5e8 cris: Fix section mismatches in architecture startup code)
Merging h8300/h8300-next (8cad489261c5 h8300: switch EARLYCON)
Merging hexagon/linux-next (02cc2ccfe771 Revert "Hexagon: fix signal.c compile error")
Merging ia64/next (787ca32dc704 ia64/unaligned: Silence another GCC warning about an uninitialised variable)
Merging m68k/for-next (9a6462763b17 m68k/mvme16x: Include generic <linux/rtc.h>)
Merging m68knommu/for-next (1a695a905c18 Linux 4.7-rc1)
Merging metag/for-next (592ddeeff8cb metag: Fix typos)
Merging microblaze/next (52e9e6e05617 microblaze: pci: export isa_io_base to fix link errors)
Merging mips/mips-for-linux-next (b02b1fbdd338 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi)
Merging nios2/for-next (9fa78f63a892 nios2: Add order-only DTC dependency to %.dtb target)
Merging parisc-hd/for-next (57f3ea7a3d6e parisc: Fix backtrace on PA-RISC)
Merging powerpc/next (138a076496e6 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux into next)
Merging powerpc-mpe/next (bc0195aad0da Linux 4.2-rc2)
Merging fsl/next (1eef33bec12d powerpc/86xx: Fix PCI interrupt map definition)
Merging mpc5xxx/next (39e69f55f857 powerpc: Introduce the use of the managed version of kzalloc)
Merging s390/features (5e19a42ac6d9 s390/cpuinfo: show dynamic and static cpu mhz)
Merging sparc-next/master (9f935675d41a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input)
Merging tile/master (bdf03e59f8c1 Fix typo)
Merging uml/linux-next (a78ff1112263 um: add extended processor state save/restore support)
Merging unicore32/unicore32 (c83d8b2fc986 unicore32: mm: Add missing parameter to arch_vma_access_permitted)
Merging xtensa/for_next (9da8320bb977 xtensa: add test_kc705_hifi variant)
Merging btrfs/next (c315ef8d9db7 Merge branch 'for-chris-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.7)
Merging btrfs-kdave/for-next (33e66059b7ad Merge branch 'for-next-next-4.7-20160527' into for-next-20160527)
Merging ceph/master (4a3262b17c96 libceph: use %s instead of %pE in dout()s)
Merging cifs/for-next (3bdc426e2497 cifs: dynamic allocation of ntlmssp blob)
Merging configfs/for-next (96c22a329351 configfs: fix CONFIGFS_BIN_ATTR_[RW]O definitions)
Merging ecryptfs/next (933c32fe0e42 ecryptfs: drop null test before destroy functions)
Merging ext3/for_next (b9d8905e4a75 reiserfs: check kstrdup failure)
Merging ext4/dev (12735f881952 ext4: pre-zero allocated blocks for DAX IO)
Merging f2fs/dev (b02b1fbdd338 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi)
Merging fscache/fscache (b00c2ae2ed3c FS-Cache: Don't override netfs's primary_index if registering failed)
Merging fuse/for-next (4441f63ab7e5 fuse: update mailing list in MAINTAINERS)
Merging gfs2/for-next (29567292c0b5 Merge tag 'for-linus-4.7-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip)
Merging jfs/jfs-next (6ed71e9819ac jfs: Coalesce some formats)
Merging nfs/linux-next (1a695a905c18 Linux 4.7-rc1)
Merging nfsd/nfsd-next (9e62f931dd07 rpc: share one xps between all backchannels)
Merging orangefs/for-next (2dcd0af568b0 Linux 4.6)
Merging overlayfs/overlayfs-next (7d43ba76af20 ovl: store ovl_entry in inode->i_private for all inodes)
Merging v9fs/for-next (a333e4bf2556 fs/9p: use fscache mutex rather than spinlock)
Merging ubifs/linux-next (1112018cefc5 ubifs: ubifs_dump_inode: Fix dumping field bulk_read)
Merging xfs/for-next (1a695a905c18 Linux 4.7-rc1)
Merging file-locks/linux-next (5af9c2e19da6 Merge branch 'akpm' (patches from Andrew))
Merging vfs/for-next (1eb82bc8e712 Merge branch 'for-linus' into for-next)
Merging pci/next (1a695a905c18 Linux 4.7-rc1)
Merging hid/for-next (185a9cac5b1e Merge branch 'for-4.6/upstream-fixes' into for-next)
Merging i2c/i2c/for-next (1a695a905c18 Linux 4.7-rc1)
Merging jdelvare-hwmon/master (18c358ac5e32 Documentation/hwmon: Update links in max34440)
Merging dmi/master (c3db05ecf8ac firmware: dmi_scan: Save SMBIOS Type 9 System Slots)
Merging hwmon-staging/hwmon-next (03bd75a88d6c hwmon: (max1668) Fix typo in documentation)
Merging v4l-dvb/master (73dfb701d254 Merge branch 'v4l_for_linus' into to_next)
Merging pm/linux-next (1a695a905c18 Linux 4.7-rc1)
Merging idle/next (f55532a0c0b8 Linux 4.6-rc1)
Merging thermal/next (88ac99063e6e Merge branches 'thermal-core', 'thermal-intel' and 'thermal-soc' into next)
Merging thermal-soc/next (ddc8fdc6e2f0 Merge branch 'work-fixes' into work-next)
CONFLICT (add/add): Merge conflict in drivers/thermal/tango_thermal.c
CONFLICT (content): Merge conflict in drivers/thermal/rockchip_thermal.c
Merging ieee1394/for-next (384fbb96f926 firewire: nosy: Replace timeval with timespec64)
Merging dlm/next (82c7d823cc31 dlm: config: Fix ENOMEM failures in make_cluster())
Merging swiotlb/linux-next (386744425e35 swiotlb: Make linux/swiotlb.h standalone includible)
Merging slave-dma/next (4f0382030b6d Merge branch 'topic/sh' into next)
Merging net-next/master (07b75260ebc2 Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus)
Merging ipsec-next/master (cb866e3298cd xfrm: Increment statistic counter on inner mode error)
Merging ipvs-next/master (698e2a8dca98 ipvs: make drop_entry protection effective for SIP-pe)
Merging wireless-drivers-next/master (52776a700b53 Merge ath-next from ath.git)
Merging bluetooth/master (07b75260ebc2 Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus)
Merging mac80211-next/master (019ae3a91881 cfg80211: Advertise extended capabilities per interface type to userspace)
Merging rdma/for-next (7a226f9c32b0 staging/rdma: Remove the entire rdma subdirectory of staging)
Merging rdma-leon/rdma-next (1a695a905c18 Linux 4.7-rc1)
Merging rdma-leon-test/testing/rdma-next (1a695a905c18 Linux 4.7-rc1)
Merging mtd/master (becc7ae544c6 MAINTAINERS: Add file patterns for mtd device tree bindings)
Merging l2-mtd/master (becc7ae544c6 MAINTAINERS: Add file patterns for mtd device tree bindings)
Merging nand/nand/next (cabfeaa67843 ARM: OMAP2+: Update GPMC and NAND DT binding documentation)
Merging crypto/master (5318c53d5b4b crypto: s5p-sss - Use consistent indentation for variables and members)
Merging drm/drm-next (92181d47ee74 headers_check: don't warn about c++ guards)
Merging drm-panel/drm/panel/for-next (227e4f4079e1 drm/panel: simple: Add support for TPK U.S.A. LLC Fusion 7" and 10.1" panels)
Merging drm-intel/for-linux-next (1800ad255c4f drm/i915: Update GEN6_PMINTRMSK setup with GuC enabled)
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/intel_ringbuffer.h
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/intel_ringbuffer.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/intel_psr.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/intel_pm.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/intel_lrc.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/intel_display.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/i915_gem.c
Merging drm-tegra/drm/tegra/for-next (057eab2013ec MAINTAINERS: Remove Terje Bergström as Tegra DRM maintainer)
Merging drm-misc/topic/drm-misc (b82caafcf230 drm/vc4: Use lockless gem BO free callback)
Merging drm-exynos/exynos-drm/for-next (25364a9e54fb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid)
Merging drm-msm/msm-next (2b669875332f drm/msm: Drop load/unload drm_driver ops)
Merging hdlcd/for-upstream/hdlcd (5c5b8aedaec0 drm: hdlcd: Cleanup the atomic operations)
Merging drm-vc4/drm-vc4-next (efea172891fc drm/vc4: Return -EBUSY if there's already a pending flip event.)
Merging sunxi/sunxi/for-next (30ce0df9ee51 Merge branches 'sunxi/defconfig-for-4.8', 'sunxi/drm-fixes-for-4.7' and 'sunxi/dt-for-4.8' into sunxi/for-next)
Merging kbuild/for-next (0c644e04ad1b Merge branch 'kbuild/misc' into kbuild/for-next)
Merging kconfig/for-next (5bcba792bb30 localmodconfig: Fix whitespace repeat count after "tristate")
Merging regmap/for-next (3f8cd61d24e6 Merge remote-tracking branch 'regmap/topic/maintainers' into regmap-next)
Merging sound/for-next (0358ccc8ffd8 ALSA: uapi: Add three missing header files to Kbuild file)
Merging sound-asoc/for-next (f5fe6c51e8b5 Merge remote-tracking branches 'asoc/topic/samsung', 'asoc/topic/simple', 'asoc/topic/tas571x', 'asoc/topic/tlv320aic31xx' and 'asoc/topic/wm8985' into asoc-next)
Merging modules/modules-next (e2d1248432c4 module: Disable MODULE_FORCE_LOAD when MODULE_SIG_FORCE is enabled)
Merging input/next (48a2b783483b Input: add Raydium I2C touchscreen driver)
Merging block/for-next (661806a31989 Merge branch 'for-4.7/core' into for-next)
Merging lightnvm/for-next (2a65aee4011b lightnvm: reserved space calculation incorrect)
Merging device-mapper/for-next (b8ef07be98b4 dm mpath: add optional "queue_mode" feature)
Merging pcmcia/master (e8e68fd86d22 pcmcia: do not break rsrc_nonstatic when handling anonymous cards)
Merging mmc-uh/next (1a695a905c18 Linux 4.7-rc1)
Merging md/for-next (412575807427 right meaning of PARITY_ENABLE_RMW and PARITY_PREFER_RMW)
Merging mfd/for-mfd-next (b52207ef4ea5 mfd: hi655x: Add MFD driver for hi655x)
Merging backlight/for-backlight-next (60d613d6aef4 backlight: pwm_bl: Free PWM requested by legacy API on error path)
Merging battery/master (4a99fa06a8ca sbs-battery: fix power status when battery charging near dry)
Merging omap_dss2/for-next (ab366b40b851 fbdev: Use IS_ENABLED() instead of checking for built-in or module)
Merging regulator/for-next (500ed8bf3856 Merge remote-tracking branches 'regulator/topic/fixed', 'regulator/topic/headers', 'regulator/topic/max8973' and 'regulator/topic/mt6397' into regulator-next)
Merging security/next (b937190c40de LSM: LoadPin: provide enablement CONFIG)
Merging integrity/next (05d1a717ec04 ima: add support for creating files using the mknodat syscall)
Merging keys/keys-next (75aeddd12f20 MAINTAINERS: Update keyrings record and add asymmetric keys record)
Merging selinux/next (7ea59202db8d selinux: Only apply bounds checking to source types)
Merging tpmdd/next (e8f2f45a4402 tpm: Fix suspend regression)
Merging watchdog/master (1a695a905c18 Linux 4.7-rc1)
Merging iommu/next (6c0b43df74f9 Merge branches 'arm/io-pgtable', 'arm/rockchip', 'arm/omap', 'x86/vt-d', 'ppc/pamu', 'core' and 'x86/amd' into next)
Merging dwmw2-iommu/master (22e2f9fa63b0 iommu/vt-d: Use per-cpu IOVA caching)
Merging vfio/next (f70552809419 vfio_pci: Test for extended capabilities if config space > 256 bytes)
Merging jc_docs/docs-next (9f8036643dd9 doc: self-protection: provide initial details)
Merging trivial/for-next (52bbe141f37f gitignore: fix wording)
Merging audit/next (2b4c7afe79a8 audit: fixup: log on errors from filter user rules)
Merging devicetree/devicetree/next (48a9b733e644 of/irq: Rename "intc_desc" to "of_intc_desc" to fix OF on sh)
Merging dt-rh/for-next (f2c27767af0a devicetree: Add Creative Technology vendor id)
Merging mailbox/mailbox-for-next (c430cf376fee mailbox: Fix devm_ioremap_resource error detection code)
Merging spi/for-next (e767713092de Merge remote-tracking branches 'spi/topic/maintainers', 'spi/topic/orion', 'spi/topic/pxa2xx' and 'spi/topic/rockchip' into spi-next)
Merging tip/auto-latest (65fd15a016cc Merge branch 'perf/urgent')
Merging clockevents/clockevents/next (cee77c2c5b57 clocksource/drivers/tango-xtal: Fix incorrect test)
Merging edac/linux_next (12f0721c5a70 sb_edac: correctly fetch DIMM width on Ivy Bridge and Haswell)
Merging edac-amd/for-next (3f37a36b6282 EDAC, amd64_edac: Drop pci_register_driver() use)
Merging irqchip/irqchip/for-next (a66ce4b7d9d2 Merge branch 'irqchip/mvebu' into irqchip/for-next)
Merging ftrace/for-next (97f8827a8c79 ftracetest: Use proper logic to find process PID)
Merging rcu/rcu/next (0e7e2457e4e4 Merge commit 'dcd36d01fb3f99d1d5df01714f6ccbe3fbbaf81f' into HEAD)
Merging kvm/linux-next (e28e909c36bb Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm)
Merging kvm-arm/next (35a2d58588f0 KVM: arm/arm64: vgic-new: Synchronize changes to active state)
Merging kvm-ppc/kvm-ppc-next (c63517c2e381 KVM: PPC: Book3S: correct width in XER handling)
Merging kvm-ppc-paulus/kvm-ppc-next (b1a4286b8f33 KVM: PPC: Book3S HV: Re-enable XICS fast path for irqfd-generated interrupts)
Merging kvms390/next (60a37709ce60 KVM: s390: Populate mask of non-hypervisor managed facility bits)
Merging xen-tip/linux-next (bdadcaf2a7c1 xen: remove incorrect forward declaration)
Merging percpu/for-next (6710e594f71c percpu: fix synchronization between synchronous map extension and chunk destruction)
Merging workqueues/for-next (f1e89a8f3358 Merge branch 'for-4.6-fixes' into for-next)
Merging drivers-x86/for-next (b740d2e9233c platform/x86: Add PMC Driver for Intel Core SoC)
Merging chrome-platform/for-next (31b764171cb5 Revert "platform/chrome: chromeos_laptop: Add Leon Touch")
Merging hsi/for-next (b32bd7e7d5c1 hsi: use kmemdup)
Merging leds/for-next (a534769305ec leds: core: Fix brightness setting upon hardware blinking enabled)
Merging ipmi/for-next (a1b4e31bfabb IPMI: reserve memio regions separately)
Merging driver-core/driver-core-next (1a695a905c18 Linux 4.7-rc1)
Merging tty/tty-next (1a695a905c18 Linux 4.7-rc1)
Merging usb/usb-next (1a695a905c18 Linux 4.7-rc1)
Merging usb-gadget/next (2a58f9c12bb3 usb: dwc3: gadget: disable automatic calculation of ACK TP NUMP)
Merging usb-serial/usb-next (b923c6c62981 USB: serial: ti_usb_3410_5052: add MOXA UPORT 11x0 support)
Merging usb-chipidea-next/ci-for-usb-next (764763f0a0c8 doc: usb: chipidea: update the doc for OTG FSM)
Merging staging/staging-next (1a695a905c18 Linux 4.7-rc1)
Merging char-misc/char-misc-next (1a695a905c18 Linux 4.7-rc1)
Merging extcon/extcon-next (bd3adefe7ea1 extcon: usb-gpio: add support for ACPI gpio interface)
Merging cgroup/for-next (332d8a2fd141 cgroup: set css->id to -1 during init)
Merging scsi/for-next (787ab6e97024 aacraid: do not activate events on non-SRC adapters)
Merging target-updates/for-next (8f0dfb3d8b11 iscsi-target: Fix early sk_data_ready LOGIN_FLAGS_READY race)
Merging target-merge/for-next-merge (2994a7518317 cxgb4: update Kconfig and Makefile)
Merging libata/for-next (5219d6530ef0 ata: Use IS_ENABLED() instead of checking for built-in or module)
Merging pinctrl/for-next (a02fcf38ade9 Merge branch 'devel' into for-next)
Merging vhost/linux-next (bb991288728e ringtest: pass buf != NULL)
Merging remoteproc/for-next (7a6271a80cae remoteproc/wkup_m3: Use MODULE_DEVICE_TABLE to export alias)
Merging rpmsg/for-next (da5cb422f15d Merge branches 'rpmsg-next' and 'rproc-next' into for-next)
Merging gpio/for-next (63e213fc63c0 Merge branch 'devel' into for-next)
Merging dma-mapping/dma-mapping-next (d770e558e219 Linux 4.2-rc1)
Merging pwm/for-next (18c588786c08 Merge branch 'for-4.7/pwm-atomic' into for-next)
Merging dma-buf/for-next (b02da6f82361 dma-buf: use vma_pages())
Merging userns/for-next (f2ca379642d7 namei: permit linking with CAP_FOWNER in userns)
Merging ktest/for-next (2dcd0af568b0 Linux 4.6)
Merging clk/clk-next (ef56b79b66fa clk: fix critical clock locking)
Merging aio/master (b562e44f507e Linux 4.5)
Merging kselftest/next (6eab37daf0ec tools: testing: define the _GNU_SOURCE macro)
Merging y2038/y2038 (4b277763c5b3 vfs: Add support to document max and min inode times)
Merging luto-misc/next (afd2ff9b7e1b Linux 4.4)
Merging borntraeger/linux-next (b562e44f507e Linux 4.5)
Merging livepatching/for-next (6d9122078097 Merge branch 'for-4.7/core' into for-next)
Merging coresight/next (c568ba901f27 coresight: Handle build path error)
Merging rtc/rtc-next (95df4c078bf3 char/genrtc: remove the rest of the driver)
Merging hwspinlock/for-next (bd5717a4632c hwspinlock: qcom: Correct msb in regmap_field)
Merging nvdimm/libnvdimm-for-next (36092ee8ba69 Merge branch 'for-4.7/dax' into libnvdimm-for-next)
Merging dax-misc/dax-misc (4d9a2c874667 dax: Remove i_mmap_lock protection)
Merging akpm-current/current (602525d7e860 ipc/msg.c: use freezable blocking call)
CONFLICT (content): Merge conflict in sound/soc/qcom/lpass-platform.c
CONFLICT (content): Merge conflict in net/9p/client.c
CONFLICT (content): Merge conflict in fs/binfmt_flat.c
$ git checkout -b akpm remotes/origin/akpm/master
Applying: mm: make optimistic check for swapin readahead fix
Applying: drivers/net/wireless/intel/iwlwifi/dvm/calib.c: simplfy min() expression
Applying: drivers/fpga/Kconfig: fix build failure
Merging akpm/master (200f147a6c83 drivers/fpga/Kconfig: fix build failure)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-01  3:11 linux-next: Tree for Jun 1 Stephen Rothwell
@ 2016-06-02  1:48 ` Sergey Senozhatsky
  2016-06-02  9:21   ` Michal Hocko
  2016-06-02 13:24   ` Vlastimil Babka
  0 siblings, 2 replies; 28+ messages in thread
From: Sergey Senozhatsky @ 2016-06-02  1:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Michal Hocko, Kirill A. Shutemov,
	Stephen Rothwell, linux-mm, linux-next, linux-kernel

On (06/01/16 13:11), Stephen Rothwell wrote:
> Hi all,
> 
> Changes since 20160531:
> 
> My fixes tree contains:
> 
>   of: silence warnings due to max() usage
> 
> The arm tree gained a conflict against Linus' tree.
> 
> Non-merge commits (relative to Linus' tree): 1100
>  936 files changed, 38159 insertions(+), 17475 deletions(-)

Hello,

the cc1 process ended up in DN state during kernel -j4 compilation.

...
[ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds.
[ 2856.323055]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
[ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2856.323059] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
[ 2856.323062]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
[ 2856.323065]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
[ 2856.323067]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
[ 2856.323068] Call Trace:
[ 2856.323074]  [<ffffffff81441e33>] schedule+0x83/0x98
[ 2856.323077]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
[ 2856.323080]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
[ 2856.323083]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
[ 2856.323084]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
[ 2856.323086]  [<ffffffff81443630>] down_write+0x1f/0x2e
[ 2856.323089]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
[ 2856.323091]  [<ffffffff8103702a>] mmput+0x29/0xc5
[ 2856.323093]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
[ 2856.323095]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
[ 2856.323097]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
[ 2856.323099]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
[ 2856.323101]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f

[ 2877.322853] INFO: task cc1:4582 blocked for more than 21 seconds.
[ 2877.322858]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
[ 2877.322858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2877.322861] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
[ 2877.322865]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
[ 2877.322867]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
[ 2877.322867]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
[ 2877.322870] Call Trace:
[ 2877.322875]  [<ffffffff81441e33>] schedule+0x83/0x98
[ 2877.322878]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
[ 2877.322881]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
[ 2877.322884]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
[ 2877.322885]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
[ 2877.322887]  [<ffffffff81443630>] down_write+0x1f/0x2e
[ 2877.322890]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
[ 2877.322892]  [<ffffffff8103702a>] mmput+0x29/0xc5
[ 2877.322894]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
[ 2877.322896]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
[ 2877.322898]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
[ 2877.322900]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
[ 2877.322902]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
...

ps aux | grep cc1
ss        4582  0.0  0.0      0     0 pts/23   DN+  10:10   0:01 [cc1]

	-ss

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02  1:48 ` [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup Sergey Senozhatsky
@ 2016-06-02  9:21   ` Michal Hocko
  2016-06-02 12:08     ` Sergey Senozhatsky
  2016-06-03  7:15     ` Sergey Senozhatsky
  2016-06-02 13:24   ` Vlastimil Babka
  1 sibling, 2 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-02  9:21 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Stephen Rothwell, linux-mm, linux-next, linux-kernel,
	Andrea Arcangeli

[CCing Andrea]

On Thu 02-06-16 10:48:35, Sergey Senozhatsky wrote:
> On (06/01/16 13:11), Stephen Rothwell wrote:
> > Hi all,
> > 
> > Changes since 20160531:
> > 
> > My fixes tree contains:
> > 
> >   of: silence warnings due to max() usage
> > 
> > The arm tree gained a conflict against Linus' tree.
> > 
> > Non-merge commits (relative to Linus' tree): 1100
> >  936 files changed, 38159 insertions(+), 17475 deletions(-)
> 
> Hello,
> 
> the cc1 process ended up in DN state during kernel -j4 compilation.
> 
> ...
> [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds.
> [ 2856.323055]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2856.323059] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
> [ 2856.323062]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> [ 2856.323065]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> [ 2856.323067]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> [ 2856.323068] Call Trace:
> [ 2856.323074]  [<ffffffff81441e33>] schedule+0x83/0x98
> [ 2856.323077]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> [ 2856.323080]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> [ 2856.323083]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> [ 2856.323084]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> [ 2856.323086]  [<ffffffff81443630>] down_write+0x1f/0x2e
> [ 2856.323089]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> [ 2856.323091]  [<ffffffff8103702a>] mmput+0x29/0xc5
> [ 2856.323093]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> [ 2856.323095]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> [ 2856.323097]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> [ 2856.323099]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> [ 2856.323101]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f

down_write in the exit path is certainly not nice. It is hard to tell
who is blocking the mmap_sem but it is clear that __khugepaged_exit
waits for the khugepaged to release its mmap_sem. Do you hapen to have a
trace of khugepaged? Note that the lock holder might be another writer
which just hasn't pinned mm_users so khugepaged might be blocked on read
lock as well. Or khugepaged might be just stuck somewhere...

I am trying to wrap my head around the synchronization here and I
suspect it is unnecessarily complex. We should be able to go without
down_write in the exit path... The following patch would only workaround
the issue you are seeing but I guess it is worth considering this
approach.

Andrea, does the following look reasonable to you? I haven't tested it
and I might be missing some subtle details. The code is really not
trivial...
---
>From 34416b980cf02280ad76b5603175eda327ce0603 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 2 Jun 2016 10:38:37 +0200
Subject: [PATCH] khugepaged: simplify khugepaged vs. __mmput

__khugepaged_exit is called during the final __mmput and it employs a
complex synchronization dances to make sure it doesn't race with the
khugepaged which might be scanning this mm at the same time. This is
all caused by the fact that khugepaged doesn't pin mm_users. Things
would simplify considerably if we simply check the mm at
khugepaged_scan_mm_slot and if mm_users was already 0 then we know it
is dead and we can unhash the mm_slot and move on to another one. This
will also guarantee that __khugepaged_exit cannot race with khugepaged
and so we can free up the slot if it is still hashed.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/huge_memory.c | 40 ++++++++++++++++------------------------
 1 file changed, 16 insertions(+), 24 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de62bd991827..3dfc62b1a90c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1936,7 +1936,8 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
 
 static inline int khugepaged_test_exit(struct mm_struct *mm)
 {
-	return atomic_read(&mm->mm_users) == 0;
+	/* the only pin is from khugepaged_scan_mm_slot */
+	return atomic_read(&mm->mm_users) <= 1;
 }
 
 int __khugepaged_enter(struct mm_struct *mm)
@@ -1948,8 +1949,6 @@ int __khugepaged_enter(struct mm_struct *mm)
 	if (!mm_slot)
 		return -ENOMEM;
 
-	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
 		free_mm_slot(mm_slot);
 		return 0;
@@ -1999,29 +1998,11 @@ void __khugepaged_exit(struct mm_struct *mm)
 
 	spin_lock(&khugepaged_mm_lock);
 	mm_slot = get_mm_slot(mm);
-	if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
-		hash_del(&mm_slot->hash);
-		list_del(&mm_slot->mm_node);
-		free = 1;
-	}
-	spin_unlock(&khugepaged_mm_lock);
-
-	if (free) {
+	if (mm_slot) {
+		collect_mm_slot(mm_slot);
 		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
-		free_mm_slot(mm_slot);
-		mmdrop(mm);
-	} else if (mm_slot) {
-		/*
-		 * This is required to serialize against
-		 * khugepaged_test_exit() (which is guaranteed to run
-		 * under mmap sem read mode). Stop here (after we
-		 * return all pagetables will be destroyed) until
-		 * khugepaged has finished working on the pagetables
-		 * under the mmap_sem.
-		 */
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
 	}
+	spin_unlock(&khugepaged_mm_lock);
 }
 
 static void release_pte_page(struct page *page)
@@ -2780,6 +2761,16 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		khugepaged_scan.address = 0;
 		khugepaged_scan.mm_slot = mm_slot;
 	}
+
+	/*
+	 * Do not even try to do anything if the current mm is already
+	 * dead. khugepaged_mm_lock will make sure only this or
+	 * __khugepaged_exit does the unhasing.
+	 */
+	if (!atomic_inc_not_zero(&mm_slot->mm->mm_users)) {
+		collect_mm_slot(mm_slot);
+		return progress;
+	}
 	spin_unlock(&khugepaged_mm_lock);
 
 	mm = mm_slot->mm;
@@ -2863,6 +2854,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 
 		collect_mm_slot(mm_slot);
 	}
+	mmput(mm);
 
 	return progress;
 }
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02  9:21   ` Michal Hocko
@ 2016-06-02 12:08     ` Sergey Senozhatsky
  2016-06-02 12:21       ` Michal Hocko
  2016-06-03  7:15     ` Sergey Senozhatsky
  1 sibling, 1 reply; 28+ messages in thread
From: Sergey Senozhatsky @ 2016-06-02 12:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel, Andrea Arcangeli

Hello Michal,

On (06/02/16 11:21), Michal Hocko wrote:
[..]
> > [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds.
> > [ 2856.323055]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> > [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 2856.323059] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
> > [ 2856.323062]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> > [ 2856.323065]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> > [ 2856.323067]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> > [ 2856.323068] Call Trace:
> > [ 2856.323074]  [<ffffffff81441e33>] schedule+0x83/0x98
> > [ 2856.323077]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> > [ 2856.323080]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> > [ 2856.323083]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> > [ 2856.323084]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> > [ 2856.323086]  [<ffffffff81443630>] down_write+0x1f/0x2e
> > [ 2856.323089]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> > [ 2856.323091]  [<ffffffff8103702a>] mmput+0x29/0xc5
> > [ 2856.323093]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> > [ 2856.323095]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> > [ 2856.323097]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> > [ 2856.323099]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> > [ 2856.323101]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
> 
> down_write in the exit path is certainly not nice. It is hard to tell
> who is blocking the mmap_sem but it is clear that __khugepaged_exit
> waits for the khugepaged to release its mmap_sem. Do you hapen to have a
> trace of khugepaged? Note that the lock holder might be another writer
> which just hasn't pinned mm_users so khugepaged might be blocked on read
> lock as well. Or khugepaged might be just stuck somewhere...

sorry, no. this is all I have. the kernel was compiled with almost no
debugging functionality enabled (no lockdep, no lock debug, nothing)
for zram performance testing purposes.

I'll try to reproduce the problem; and give your patch some testing.
thanks.

	-ss

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02 12:08     ` Sergey Senozhatsky
@ 2016-06-02 12:21       ` Michal Hocko
  2016-06-03 13:51         ` Andrea Arcangeli
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-06-02 12:21 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel, Andrea Arcangeli

On Thu 02-06-16 21:08:57, Sergey Senozhatsky wrote:
> Hello Michal,
> 
> On (06/02/16 11:21), Michal Hocko wrote:
> [..]
> > > [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds.
> > > [ 2856.323055]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> > > [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [ 2856.323059] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
> > > [ 2856.323062]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> > > [ 2856.323065]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> > > [ 2856.323067]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> > > [ 2856.323068] Call Trace:
> > > [ 2856.323074]  [<ffffffff81441e33>] schedule+0x83/0x98
> > > [ 2856.323077]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> > > [ 2856.323080]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> > > [ 2856.323083]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> > > [ 2856.323084]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> > > [ 2856.323086]  [<ffffffff81443630>] down_write+0x1f/0x2e
> > > [ 2856.323089]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> > > [ 2856.323091]  [<ffffffff8103702a>] mmput+0x29/0xc5
> > > [ 2856.323093]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> > > [ 2856.323095]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> > > [ 2856.323097]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> > > [ 2856.323099]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> > > [ 2856.323101]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
> > 
> > down_write in the exit path is certainly not nice. It is hard to tell
> > who is blocking the mmap_sem but it is clear that __khugepaged_exit
> > waits for the khugepaged to release its mmap_sem. Do you hapen to have a
> > trace of khugepaged? Note that the lock holder might be another writer
> > which just hasn't pinned mm_users so khugepaged might be blocked on read
> > lock as well. Or khugepaged might be just stuck somewhere...
> 
> sorry, no. this is all I have. the kernel was compiled with almost no
> debugging functionality enabled (no lockdep, no lock debug, nothing)
> for zram performance testing purposes.
> 
> I'll try to reproduce the problem; and give your patch some testing.
> thanks.

The patch will drop the down_write from the exit path which is, I
believe the right thing to do, so it would paper over an existing
problem when khugepaged could get stuck with mmap_sem held for read (if
that is really a problem). So reproducing without the patch still makes
some sense.

Testing with the patch makes some sense as well, but I would like to
hear from Andrea whether the approach is good because I am wondering why
he hasn't done that before - it feels so much simpler than the current
code.

Anyway, thanks a lot for testing!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02  1:48 ` [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup Sergey Senozhatsky
  2016-06-02  9:21   ` Michal Hocko
@ 2016-06-02 13:24   ` Vlastimil Babka
  2016-06-02 18:58     ` Ebru Akagunduz
  2016-06-03 12:28     ` [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page Ebru Akagunduz
  1 sibling, 2 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-06-02 13:24 UTC (permalink / raw)
  To: Sergey Senozhatsky, Andrew Morton, Ebru Akagunduz
  Cc: Michal Hocko, Kirill A. Shutemov, Stephen Rothwell, linux-mm,
	linux-next, linux-kernel, Rik van Riel, Andrea Arcangeli

[+CC's]

On 06/02/2016 03:48 AM, Sergey Senozhatsky wrote:
> On (06/01/16 13:11), Stephen Rothwell wrote:
>> Hi all,
>>
>> Changes since 20160531:
>>
>> My fixes tree contains:
>>
>>   of: silence warnings due to max() usage
>>
>> The arm tree gained a conflict against Linus' tree.
>>
>> Non-merge commits (relative to Linus' tree): 1100
>>  936 files changed, 38159 insertions(+), 17475 deletions(-)
>
> Hello,
>
> the cc1 process ended up in DN state during kernel -j4 compilation.
>
> ...
> [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds.
> [ 2856.323055]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2856.323059] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
> [ 2856.323062]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> [ 2856.323065]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> [ 2856.323067]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> [ 2856.323068] Call Trace:
> [ 2856.323074]  [<ffffffff81441e33>] schedule+0x83/0x98
> [ 2856.323077]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> [ 2856.323080]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> [ 2856.323083]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> [ 2856.323084]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> [ 2856.323086]  [<ffffffff81443630>] down_write+0x1f/0x2e
> [ 2856.323089]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> [ 2856.323091]  [<ffffffff8103702a>] mmput+0x29/0xc5
> [ 2856.323093]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> [ 2856.323095]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> [ 2856.323097]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> [ 2856.323099]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> [ 2856.323101]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
>
> [ 2877.322853] INFO: task cc1:4582 blocked for more than 21 seconds.
> [ 2877.322858]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> [ 2877.322858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2877.322861] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
> [ 2877.322865]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> [ 2877.322867]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> [ 2877.322867]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> [ 2877.322870] Call Trace:
> [ 2877.322875]  [<ffffffff81441e33>] schedule+0x83/0x98
> [ 2877.322878]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> [ 2877.322881]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> [ 2877.322884]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> [ 2877.322885]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> [ 2877.322887]  [<ffffffff81443630>] down_write+0x1f/0x2e
> [ 2877.322890]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> [ 2877.322892]  [<ffffffff8103702a>] mmput+0x29/0xc5
> [ 2877.322894]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> [ 2877.322896]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> [ 2877.322898]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> [ 2877.322900]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> [ 2877.322902]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f

I think it's this patch:

http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem.patch

Some parts of the code in collapse_huge_page() that were under 
down_write(mmap_sem) are under down_read() after the patch. But there's 
"goto out" which continues via "goto out_up_write" which does 
up_write(mmap_sem) so there's an imbalance. One path seems to go via 
both up_read() and up_write(). I can imagine this can cause a stuck 
down_write() among other things?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02 13:24   ` Vlastimil Babka
@ 2016-06-02 18:58     ` Ebru Akagunduz
  2016-06-03  1:00       ` Sergey Senozhatsky
  2016-06-03 12:28     ` [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page Ebru Akagunduz
  1 sibling, 1 reply; 28+ messages in thread
From: Ebru Akagunduz @ 2016-06-02 18:58 UTC (permalink / raw)
  To: vbabka, sergey.senozhatsky.work, akpm
  Cc: mhocko, kirill.shutemov, sfr, linux-mm, linux-next, linux-kernel,
	riel, aarcange

On Thu, Jun 02, 2016 at 03:24:05PM +0200, Vlastimil Babka wrote:
> [+CC's]
> 
> On 06/02/2016 03:48 AM, Sergey Senozhatsky wrote:
> >On (06/01/16 13:11), Stephen Rothwell wrote:
> >>Hi all,
> >>
> >>Changes since 20160531:
> >>
> >>My fixes tree contains:
> >>
> >>  of: silence warnings due to max() usage
> >>
> >>The arm tree gained a conflict against Linus' tree.
> >>
> >>Non-merge commits (relative to Linus' tree): 1100
> >> 936 files changed, 38159 insertions(+), 17475 deletions(-)
> >
> >Hello,
> >
> >the cc1 process ended up in DN state during kernel -j4 compilation.
> >
> >...
> >[ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds.
> >[ 2856.323055]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> >[ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >[ 2856.323059] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
> >[ 2856.323062]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> >[ 2856.323065]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> >[ 2856.323067]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> >[ 2856.323068] Call Trace:
> >[ 2856.323074]  [<ffffffff81441e33>] schedule+0x83/0x98
> >[ 2856.323077]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> >[ 2856.323080]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> >[ 2856.323083]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> >[ 2856.323084]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> >[ 2856.323086]  [<ffffffff81443630>] down_write+0x1f/0x2e
> >[ 2856.323089]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> >[ 2856.323091]  [<ffffffff8103702a>] mmput+0x29/0xc5
> >[ 2856.323093]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> >[ 2856.323095]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> >[ 2856.323097]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> >[ 2856.323099]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> >[ 2856.323101]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
> >
> >[ 2877.322853] INFO: task cc1:4582 blocked for more than 21 seconds.
> >[ 2877.322858]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> >[ 2877.322858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >[ 2877.322861] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
> >[ 2877.322865]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> >[ 2877.322867]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> >[ 2877.322867]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> >[ 2877.322870] Call Trace:
> >[ 2877.322875]  [<ffffffff81441e33>] schedule+0x83/0x98
> >[ 2877.322878]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> >[ 2877.322881]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> >[ 2877.322884]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> >[ 2877.322885]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> >[ 2877.322887]  [<ffffffff81443630>] down_write+0x1f/0x2e
> >[ 2877.322890]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> >[ 2877.322892]  [<ffffffff8103702a>] mmput+0x29/0xc5
> >[ 2877.322894]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> >[ 2877.322896]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> >[ 2877.322898]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> >[ 2877.322900]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> >[ 2877.322902]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
> 
> I think it's this patch:
> 
> http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem.patch
> 
> Some parts of the code in collapse_huge_page() that were under
> down_write(mmap_sem) are under down_read() after the patch. But
> there's "goto out" which continues via "goto out_up_write" which
> does up_write(mmap_sem) so there's an imbalance. One path seems to
> go via both up_read() and up_write(). I can imagine this can cause a
> stuck down_write() among other things?
Recently, I realized the same imbalance, it is an obvious
inconsistency. I don't know, this issue can be related with
mine. I'll send a fix patch.

Kind regards.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02 18:58     ` Ebru Akagunduz
@ 2016-06-03  1:00       ` Sergey Senozhatsky
  2016-06-03  1:29         ` Sergey Senozhatsky
  0 siblings, 1 reply; 28+ messages in thread
From: Sergey Senozhatsky @ 2016-06-03  1:00 UTC (permalink / raw)
  To: Ebru Akagunduz
  Cc: Vlastimil Babka, sergey.senozhatsky.work, Andrew Morton,
	Michal Hocko, Kirill A. Shutemov, Stephen Rothwell,
	Andrea Arcangeli, Rik van Riel, linux-mm, linux-next,
	linux-kernel

On (06/02/16 21:58), Ebru Akagunduz wrote:
[..]
> > I think it's this patch:
> > 
> > http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem.patch
> > 
> > Some parts of the code in collapse_huge_page() that were under
> > down_write(mmap_sem) are under down_read() after the patch. But
> > there's "goto out" which continues via "goto out_up_write" which
> > does up_write(mmap_sem) so there's an imbalance. One path seems to
> > go via both up_read() and up_write(). I can imagine this can cause a
> > stuck down_write() among other things?
> Recently, I realized the same imbalance, it is an obvious
> inconsistency. I don't know, this issue can be related with
> mine. I'll send a fix patch.

a good find by Vlastimil.

Ebru, can you also re-visit __collapse_huge_page_swapin()? it's called
from collapse_huge_page() under the down_read(&mm->mmap_sem), is there
any reason to do the nested down_read(&mm->mmap_sem)?

collapse_huge_page()
...
	down_read(&mm->mmap_sem);
	result = hugepage_vma_revalidate(mm, vma, address);
	if (result)
		goto out;

	pmd = mm_find_pmd(mm, address);
	if (!pmd) {
		result = SCAN_PMD_NULL;
		goto out;
	}

	if (allocstall == curr_allocstall && swap != 0) {
		if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) {
			{
			:	if (ret & VM_FAULT_RETRY) {
			:		down_read(&mm->mmap_sem);
			:		^^^^^^^^^
			:		if (hugepage_vma_revalidate(mm, vma, address))
			:			return false;
			:	}
			}

			up_read(&mm->mmap_sem);
			goto out;
		}
	}

	up_read(&mm->mmap_sem);



so if __collapse_huge_page_swapin() retruns true we have:
	- down_read() twice, up_read() once?

the locking rules here are a bit confusing. (I didn't have my morning coffee yet).

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03  1:00       ` Sergey Senozhatsky
@ 2016-06-03  1:29         ` Sergey Senozhatsky
  2016-06-03  4:14           ` Sergey Senozhatsky
  0 siblings, 1 reply; 28+ messages in thread
From: Sergey Senozhatsky @ 2016-06-03  1:29 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Ebru Akagunduz, Vlastimil Babka, Andrew Morton, Michal Hocko,
	Kirill A. Shutemov, Stephen Rothwell, Andrea Arcangeli,
	Rik van Riel, linux-mm, linux-next, linux-kernel

On (06/03/16 10:00), Sergey Senozhatsky wrote:
> a good find by Vlastimil.
> 
> Ebru, can you also re-visit __collapse_huge_page_swapin()? it's called
> from collapse_huge_page() under the down_read(&mm->mmap_sem), is there
> any reason to do the nested down_read(&mm->mmap_sem)?
> 
> collapse_huge_page()
> ...
> 	down_read(&mm->mmap_sem);
> 	result = hugepage_vma_revalidate(mm, vma, address);
> 	if (result)
> 		goto out;
> 
> 	pmd = mm_find_pmd(mm, address);
> 	if (!pmd) {
> 		result = SCAN_PMD_NULL;
> 		goto out;
> 	}
> 
> 	if (allocstall == curr_allocstall && swap != 0) {
> 		if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) {
> 			{
> 			:	if (ret & VM_FAULT_RETRY) {
> 			:		down_read(&mm->mmap_sem);
> 			:		^^^^^^^^^

oh... it's in a loop

		for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
						pte++, _address += PAGE_SIZE) {
			ret = do_swap_page()
			if (ret & VM_FAULT_RETRY) {
				down_read(&mm->mmap_sem);
				^^^^^^^^^
				...
			}
		}

so there can be multiple sem->count++ in __collapse_huge_page_swapin(),
and you don't know how many sem->count-- you need to do later? is this
correct or am I hallucinating?

	-ss

> 			:		if (hugepage_vma_revalidate(mm, vma, address))
> 			:			return false;
> 			:	}
> 			}
> 
> 			up_read(&mm->mmap_sem);
> 			goto out;
> 		}
> 	}
> 
> 	up_read(&mm->mmap_sem);
> 
> 
> 
> so if __collapse_huge_page_swapin() retruns true we have:
> 	- down_read() twice, up_read() once?
> 
> the locking rules here are a bit confusing. (I didn't have my morning coffee yet).
> 
> 	-ss
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03  1:29         ` Sergey Senozhatsky
@ 2016-06-03  4:14           ` Sergey Senozhatsky
  0 siblings, 0 replies; 28+ messages in thread
From: Sergey Senozhatsky @ 2016-06-03  4:14 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Ebru Akagunduz, Vlastimil Babka, Andrew Morton, Michal Hocko,
	Kirill A. Shutemov, Stephen Rothwell, Andrea Arcangeli,
	Rik van Riel, linux-mm, linux-next, linux-kernel

On (06/03/16 10:29), Sergey Senozhatsky wrote:
> > 	if (allocstall == curr_allocstall && swap != 0) {
> > 		if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) {
> > 			{
> > 			:	if (ret & VM_FAULT_RETRY) {
> > 			:		down_read(&mm->mmap_sem);
> > 			:		^^^^^^^^^
> 
> oh... it's in a loop
> 
> 		for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
> 						pte++, _address += PAGE_SIZE) {
> 			ret = do_swap_page()
> 			if (ret & VM_FAULT_RETRY) {
> 				down_read(&mm->mmap_sem);
> 				^^^^^^^^^
> 				...
> 			}
> 		}
> 
> so there can be multiple sem->count++ in __collapse_huge_page_swapin(),
> and you don't know how many sem->count-- you need to do later? is this
> correct or am I hallucinating?

No, I was wrong, sorry for the noise.

it's getting unlocked in

__collapse_huge_page_swapin()
	do_swap_page()
		lock_page_or_retry()
			if (flags & FAULT_FLAG_ALLOW_RETRY)
				up_read(&mm->mmap_sem);
	return VM_FAULT_RETRY

	-ss

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02  9:21   ` Michal Hocko
  2016-06-02 12:08     ` Sergey Senozhatsky
@ 2016-06-03  7:15     ` Sergey Senozhatsky
  2016-06-03  7:25       ` Michal Hocko
  1 sibling, 1 reply; 28+ messages in thread
From: Sergey Senozhatsky @ 2016-06-03  7:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel, Andrea Arcangeli

Hello,

On (06/02/16 11:21), Michal Hocko wrote:
[..]
> @@ -2863,6 +2854,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
>  
>  		collect_mm_slot(mm_slot);
>  	}
> +	mmput(mm);
>  
>  	return progress;
>  }

this possibly sleeping mmput() is called from
under the spin_lock(&khugepaged_mm_lock).

there is also a trivial build fixup needed
(move collect_mm_slot() before __khugepaged_exit()).


it's quite hard to trigger the bug (somehow), so I can't
follow up with more information as of now.

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03  7:15     ` Sergey Senozhatsky
@ 2016-06-03  7:25       ` Michal Hocko
  2016-06-03  8:43         ` Sergey Senozhatsky
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-06-03  7:25 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Stephen Rothwell, linux-mm, linux-next, linux-kernel,
	Andrea Arcangeli

On Fri 03-06-16 16:15:51, Sergey Senozhatsky wrote:
> Hello,
> 
> On (06/02/16 11:21), Michal Hocko wrote:
> [..]
> > @@ -2863,6 +2854,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> >  
> >  		collect_mm_slot(mm_slot);
> >  	}
> > +	mmput(mm);
> >  
> >  	return progress;
> >  }
> 
> this possibly sleeping mmput() is called from
> under the spin_lock(&khugepaged_mm_lock).

You are right. khugepaged_scan_mm_slot returns with the lock held.
mmput_async would deal with it.
 
> there is also a trivial build fixup needed
> (move collect_mm_slot() before __khugepaged_exit()).

will fix that. Thanks!

> it's quite hard to trigger the bug (somehow), so I can't
> follow up with more information as of now.

Thanks anyway!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03  7:25       ` Michal Hocko
@ 2016-06-03  8:43         ` Sergey Senozhatsky
  2016-06-03  9:55           ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Sergey Senozhatsky @ 2016-06-03  8:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel, Andrea Arcangeli

On (06/03/16 09:25), Michal Hocko wrote:
> > it's quite hard to trigger the bug (somehow), so I can't
> > follow up with more information as of now.

either I did something very silly fixing up the patch, or the
patch may be causing general protection faults on my system.

RIP collect_mm_slot() + 0x42/0x84
	khugepaged
	prepare_to_wait_event
	maybe_pmd_mkwrite
	kthread
	_raw_sin_unlock_irq
	ret_from_fork
	kthread_create_on_node

collect_mm_slot() + 0x42/0x84 is

0000000000000328 <collect_mm_slot>:
     328:       55                      push   %rbp
     329:       48 89 e5                mov    %rsp,%rbp
     32c:       53                      push   %rbx
     32d:       48 8b 5f 20             mov    0x20(%rdi),%rbx
     331:       8b 43 48                mov    0x48(%rbx),%eax
     334:       ff c8                   dec    %eax
     336:       7f 71                   jg     3a9 <collect_mm_slot+0x81>
     338:       48 8b 57 08             mov    0x8(%rdi),%rdx
     33c:       48 85 d2                test   %rdx,%rdx
     33f:       74 1e                   je     35f <collect_mm_slot+0x37>
     341:       48 8b 07                mov    (%rdi),%rax
     344:       48 85 c0                test   %rax,%rax
     347:       48 89 02                mov    %rax,(%rdx)
     34a:       74 04                   je     350 <collect_mm_slot+0x28>
     34c:       48 89 50 08             mov    %rdx,0x8(%rax)
     350:       48 c7 07 00 00 00 00    movq   $0x0,(%rdi)
     357:       48 c7 47 08 00 00 00    movq   $0x0,0x8(%rdi)
     35e:       00 
     35f:       48 8b 57 10             mov    0x10(%rdi),%rdx
     363:       48 8b 47 18             mov    0x18(%rdi),%rax
     367:       48 89 fe                mov    %rdi,%rsi
     36a:       48 89 42 08             mov    %rax,0x8(%rdx)
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     36e:       48 89 10                mov    %rdx,(%rax)
     371:       48 b8 00 01 00 00 00    movabs $0xdead000000000100,%rax
     378:       00 ad de 
     37b:       48 89 47 10             mov    %rax,0x10(%rdi)
     37f:       48 b8 00 02 00 00 00    movabs $0xdead000000000200,%rax
     386:       00 ad de 
     389:       48 89 47 18             mov    %rax,0x18(%rdi)
     38d:       48 8b 3d 00 00 00 00    mov    0x0(%rip),%rdi        # 394 <collect_mm_slot+0x6c>
     394:       e8 00 00 00 00          callq  399 <collect_mm_slot+0x71>
     399:       f0 ff 4b 4c             lock decl 0x4c(%rbx)
     39d:       74 02                   je     3a1 <collect_mm_slot+0x79>
     39f:       eb 08                   jmp    3a9 <collect_mm_slot+0x81>
     3a1:       48 89 df                mov    %rbx,%rdi
     3a4:       e8 00 00 00 00          callq  3a9 <collect_mm_slot+0x81>
     3a9:       5b                      pop    %rbx
     3aa:       5d                      pop    %rbp
     3ab:       c3                      retq

which is list_del(&mm_slot->mm_node), I believe.


I attached the patch (just in case).

---
 mm/huge_memory.c | 87 +++++++++++++++++++++++++-------------------------------
 1 file changed, 39 insertions(+), 48 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 292cedd..1c82fa4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1938,7 +1938,8 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
 
 static inline int khugepaged_test_exit(struct mm_struct *mm)
 {
-	return atomic_read(&mm->mm_users) == 0;
+	/* the only pin is from khugepaged_scan_mm_slot */
+	return atomic_read(&mm->mm_users) <= 1;
 }
 
 int __khugepaged_enter(struct mm_struct *mm)
@@ -1950,8 +1951,6 @@ int __khugepaged_enter(struct mm_struct *mm)
 	if (!mm_slot)
 		return -ENOMEM;
 
-	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
 		free_mm_slot(mm_slot);
 		return 0;
@@ -1994,36 +1993,40 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 	return 0;
 }
 
-void __khugepaged_exit(struct mm_struct *mm)
+static void collect_mm_slot(struct mm_slot *mm_slot)
 {
-	struct mm_slot *mm_slot;
-	int free = 0;
+	struct mm_struct *mm = mm_slot->mm;
 
-	spin_lock(&khugepaged_mm_lock);
-	mm_slot = get_mm_slot(mm);
-	if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
+	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
+
+	if (khugepaged_test_exit(mm)) {
+		/* free mm_slot */
 		hash_del(&mm_slot->hash);
 		list_del(&mm_slot->mm_node);
-		free = 1;
-	}
-	spin_unlock(&khugepaged_mm_lock);
 
-	if (free) {
-		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
-		free_mm_slot(mm_slot);
-		mmdrop(mm);
-	} else if (mm_slot) {
 		/*
-		 * This is required to serialize against
-		 * khugepaged_test_exit() (which is guaranteed to run
-		 * under mmap sem read mode). Stop here (after we
-		 * return all pagetables will be destroyed) until
-		 * khugepaged has finished working on the pagetables
-		 * under the mmap_sem.
+		 * Not strictly needed because the mm exited already.
+		 *
+		 * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
 		 */
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
+
+		/* khugepaged_mm_lock actually not necessary for the below */
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+	}
+}
+
+void __khugepaged_exit(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+
+	spin_lock(&khugepaged_mm_lock);
+	mm_slot = get_mm_slot(mm);
+	if (mm_slot) {
+		collect_mm_slot(mm_slot);
+		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
 	}
+	spin_unlock(&khugepaged_mm_lock);
 }
 
 static void release_pte_page(struct page *page)
@@ -2738,29 +2741,6 @@ out:
 	return ret;
 }
 
-static void collect_mm_slot(struct mm_slot *mm_slot)
-{
-	struct mm_struct *mm = mm_slot->mm;
-
-	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
-
-	if (khugepaged_test_exit(mm)) {
-		/* free mm_slot */
-		hash_del(&mm_slot->hash);
-		list_del(&mm_slot->mm_node);
-
-		/*
-		 * Not strictly needed because the mm exited already.
-		 *
-		 * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
-		 */
-
-		/* khugepaged_mm_lock actually not necessary for the below */
-		free_mm_slot(mm_slot);
-		mmdrop(mm);
-	}
-}
-
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 					    struct page **hpage)
 	__releases(&khugepaged_mm_lock)
@@ -2782,6 +2762,16 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		khugepaged_scan.address = 0;
 		khugepaged_scan.mm_slot = mm_slot;
 	}
+
+	/*
+	 * Do not even try to do anything if the current mm is already
+	 * dead. khugepaged_mm_lock will make sure only this or
+	 * __khugepaged_exit does the unhasing.
+	 */
+	if (!atomic_inc_not_zero(&mm_slot->mm->mm_users)) {
+		collect_mm_slot(mm_slot);
+		return progress;
+	}
 	spin_unlock(&khugepaged_mm_lock);
 
 	mm = mm_slot->mm;
@@ -2865,6 +2855,7 @@ breakouterloop_mmap_sem:
 
 		collect_mm_slot(mm_slot);
 	}
+	mmput_async(mm);
 
 	return progress;
 }
-- 
2.9.0.rc1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03  8:43         ` Sergey Senozhatsky
@ 2016-06-03  9:55           ` Michal Hocko
  2016-06-03 10:05             ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-06-03  9:55 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Stephen Rothwell, linux-mm, linux-next, linux-kernel,
	Andrea Arcangeli

On Fri 03-06-16 17:43:47, Sergey Senozhatsky wrote:
> On (06/03/16 09:25), Michal Hocko wrote:
> > > it's quite hard to trigger the bug (somehow), so I can't
> > > follow up with more information as of now.
> 
> either I did something very silly fixing up the patch, or the
> patch may be causing general protection faults on my system.
> 
> RIP collect_mm_slot() + 0x42/0x84
> 	khugepaged

So is this really collect_mm_slot called directly from khugepaged or is
some inlining going on there?

> 	prepare_to_wait_event
> 	maybe_pmd_mkwrite
> 	kthread
> 	_raw_sin_unlock_irq
> 	ret_from_fork
> 	kthread_create_on_node
> 
> collect_mm_slot() + 0x42/0x84 is

I guess that the problem is that I have missed that __khugepaged_exit
doesn't clear the cached khugepaged_scan.mm_slot. Does the following on
top fixes that?
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6574c62ca4a3..e6f4e6fd587a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2021,6 +2021,8 @@ void __khugepaged_exit(struct mm_struct *mm)
 	spin_lock(&khugepaged_mm_lock);
 	mm_slot = get_mm_slot(mm);
 	if (mm_slot) {
+		if (khugepaged_scan.mm_slot == mm_slot)
+			khugepaged_scan.mm_slot = NULL;
 		collect_mm_slot(mm_slot);
 		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
 	}
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03  9:55           ` Michal Hocko
@ 2016-06-03 10:05             ` Michal Hocko
  2016-06-03 13:38               ` Sergey Senozhatsky
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-06-03 10:05 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Stephen Rothwell, linux-mm, linux-next, linux-kernel,
	Andrea Arcangeli

On Fri 03-06-16 11:55:49, Michal Hocko wrote:
> On Fri 03-06-16 17:43:47, Sergey Senozhatsky wrote:
> > On (06/03/16 09:25), Michal Hocko wrote:
> > > > it's quite hard to trigger the bug (somehow), so I can't
> > > > follow up with more information as of now.
> > 
> > either I did something very silly fixing up the patch, or the
> > patch may be causing general protection faults on my system.
> > 
> > RIP collect_mm_slot() + 0x42/0x84
> > 	khugepaged
> 
> So is this really collect_mm_slot called directly from khugepaged or is
> some inlining going on there?
> 
> > 	prepare_to_wait_event
> > 	maybe_pmd_mkwrite
> > 	kthread
> > 	_raw_sin_unlock_irq
> > 	ret_from_fork
> > 	kthread_create_on_node
> > 
> > collect_mm_slot() + 0x42/0x84 is
> 
> I guess that the problem is that I have missed that __khugepaged_exit
> doesn't clear the cached khugepaged_scan.mm_slot. Does the following on
> top fixes that?

That wouldn't be sufficient after a closer look. We need to do the same
from khugepaged_scan_mm_slot when atomic_inc_not_zero fails. So I guess
it would be better to stick it into collect_mm_slot.

Thanks for your testing!
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6574c62ca4a3..0432581fb87c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2011,6 +2011,9 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
 		/* khugepaged_mm_lock actually not necessary for the below */
 		free_mm_slot(mm_slot);
 		mmdrop(mm);
+
+		if (khugepaged_scan.mm_slot == mm_slot)
+			khugepaged_scan.mm_slot = NULL;
 	}
 }
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page
  2016-06-02 13:24   ` Vlastimil Babka
  2016-06-02 18:58     ` Ebru Akagunduz
@ 2016-06-03 12:28     ` Ebru Akagunduz
  2016-06-06 13:05       ` Vlastimil Babka
  1 sibling, 1 reply; 28+ messages in thread
From: Ebru Akagunduz @ 2016-06-03 12:28 UTC (permalink / raw)
  To: akpm
  Cc: vbabka, sergey.senozhatsky.work, mhocko, kirill.shutemov, sfr,
	linux-mm, linux-next, linux-kernel, riel, aarcange,
	Ebru Akagunduz

After creating revalidate vma function, locking inconsistency occured
due to directing the code path to wrong label. This patch directs
to correct label and fix the inconsistency.

Related commit that caused inconsistency:
http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0

Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
---
 mm/huge_memory.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 292cedd..8043d91 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2493,13 +2493,18 @@ static void collapse_huge_page(struct mm_struct *mm,
 	curr_allocstall = sum_vm_event(ALLOCSTALL);
 	down_read(&mm->mmap_sem);
 	result = hugepage_vma_revalidate(mm, vma, address);
-	if (result)
-		goto out;
+	if (result) {
+		mem_cgroup_cancel_charge(new_page, memcg, true);
+		up_read(&mm->mmap_sem);
+		goto out_nolock;
+	}
 
 	pmd = mm_find_pmd(mm, address);
 	if (!pmd) {
 		result = SCAN_PMD_NULL;
-		goto out;
+		mem_cgroup_cancel_charge(new_page, memcg, true);
+		up_read(&mm->mmap_sem);
+		goto out_nolock;
 	}
 
 	/*
@@ -2513,8 +2518,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 		 * label out. Continuing to collapse causes inconsistency.
 		 */
 		if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) {
+			mem_cgroup_cancel_charge(new_page, memcg, true);
 			up_read(&mm->mmap_sem);
-			goto out;
+			goto out_nolock;
 		}
 	}
 
-- 
1.9.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 10:05             ` Michal Hocko
@ 2016-06-03 13:38               ` Sergey Senozhatsky
  2016-06-03 13:45                 ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Sergey Senozhatsky @ 2016-06-03 13:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel, Andrea Arcangeli

On (06/03/16 12:05), Michal Hocko wrote:
> > > RIP collect_mm_slot() + 0x42/0x84
> > > 	khugepaged
> > 
> > So is this really collect_mm_slot called directly from khugepaged or is
> > some inlining going on there?

inlining I suppose.

> > > 	prepare_to_wait_event
> > > 	maybe_pmd_mkwrite
> > > 	kthread
> > > 	_raw_sin_unlock_irq
> > > 	ret_from_fork
> > > 	kthread_create_on_node
> > > 
> > > collect_mm_slot() + 0x42/0x84 is
> > 
> > I guess that the problem is that I have missed that __khugepaged_exit
> > doesn't clear the cached khugepaged_scan.mm_slot. Does the following on
> > top fixes that?
> 
> That wouldn't be sufficient after a closer look. We need to do the same
> from khugepaged_scan_mm_slot when atomic_inc_not_zero fails. So I guess
> it would be better to stick it into collect_mm_slot.

Michal, I'll try to test during the weekend (away from the affected box
now), but in the worst case it can as late as next Thursday (gonna travel
next week).

	-ss

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 13:38               ` Sergey Senozhatsky
@ 2016-06-03 13:45                 ` Michal Hocko
  2016-06-03 13:49                   ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-06-03 13:45 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel, Andrea Arcangeli

On Fri 03-06-16 22:38:13, Sergey Senozhatsky wrote:
[...]
> Michal, I'll try to test during the weekend (away from the affected box
> now), but in the worst case it can as late as next Thursday (gonna travel
> next week).

No problem. I would really like to hear from Andrea before we give this
a serious try anyway.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 13:45                 ` Michal Hocko
@ 2016-06-03 13:49                   ` Michal Hocko
  2016-06-04  7:51                     ` Sergey Senozhatsky
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-06-03 13:49 UTC (permalink / raw)
  To: Sergey Senozhatsky, Andrea Arcangeli
  Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel

On Fri 03-06-16 15:45:09, Michal Hocko wrote:
> On Fri 03-06-16 22:38:13, Sergey Senozhatsky wrote:
> [...]
> > Michal, I'll try to test during the weekend (away from the affected box
> > now), but in the worst case it can as late as next Thursday (gonna travel
> > next week).
> 
> No problem. I would really like to hear from Andrea before we give this
> a serious try anyway.

And just for an easier review, here is what I have right now:
---
>From 1fa9428b215cea4a48737fc9650009616a5bcd4e Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 2 Jun 2016 10:38:37 +0200
Subject: [PATCH] khugepaged: simplify khugepaged vs. __mmput

__khugepaged_exit is called during the final __mmput and it employs a
complex synchronization dances to make sure it doesn't race with the
khugepaged which might be scanning this mm at the same time. This is
all caused by the fact that khugepaged doesn't pin mm_users. Things
would simplify considerably if we simply check the mm at
khugepaged_scan_mm_slot and if mm_users was already 0 then we know it
is dead and we can unhash the mm_slot and move on to another one. This
will also guarantee that __khugepaged_exit cannot race with khugepaged
and so we can free up the slot if it is still hashed.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/huge_memory.c | 90 ++++++++++++++++++++++++++------------------------------
 1 file changed, 42 insertions(+), 48 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de62bd991827..0432581fb87c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1936,7 +1936,8 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
 
 static inline int khugepaged_test_exit(struct mm_struct *mm)
 {
-	return atomic_read(&mm->mm_users) == 0;
+	/* the only pin is from khugepaged_scan_mm_slot */
+	return atomic_read(&mm->mm_users) <= 1;
 }
 
 int __khugepaged_enter(struct mm_struct *mm)
@@ -1948,8 +1949,6 @@ int __khugepaged_enter(struct mm_struct *mm)
 	if (!mm_slot)
 		return -ENOMEM;
 
-	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
 		free_mm_slot(mm_slot);
 		return 0;
@@ -1992,36 +1991,43 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 	return 0;
 }
 
-void __khugepaged_exit(struct mm_struct *mm)
+static void collect_mm_slot(struct mm_slot *mm_slot)
 {
-	struct mm_slot *mm_slot;
-	int free = 0;
+	struct mm_struct *mm = mm_slot->mm;
 
-	spin_lock(&khugepaged_mm_lock);
-	mm_slot = get_mm_slot(mm);
-	if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
+	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
+
+	if (khugepaged_test_exit(mm)) {
+		/* free mm_slot */
 		hash_del(&mm_slot->hash);
 		list_del(&mm_slot->mm_node);
-		free = 1;
-	}
-	spin_unlock(&khugepaged_mm_lock);
 
-	if (free) {
-		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
-		free_mm_slot(mm_slot);
-		mmdrop(mm);
-	} else if (mm_slot) {
 		/*
-		 * This is required to serialize against
-		 * khugepaged_test_exit() (which is guaranteed to run
-		 * under mmap sem read mode). Stop here (after we
-		 * return all pagetables will be destroyed) until
-		 * khugepaged has finished working on the pagetables
-		 * under the mmap_sem.
+		 * Not strictly needed because the mm exited already.
+		 *
+		 * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
 		 */
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
+
+		/* khugepaged_mm_lock actually not necessary for the below */
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+
+		if (khugepaged_scan.mm_slot == mm_slot)
+			khugepaged_scan.mm_slot = NULL;
+	}
+}
+
+void __khugepaged_exit(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+
+	spin_lock(&khugepaged_mm_lock);
+	mm_slot = get_mm_slot(mm);
+	if (mm_slot) {
+		collect_mm_slot(mm_slot);
+		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
 	}
+	spin_unlock(&khugepaged_mm_lock);
 }
 
 static void release_pte_page(struct page *page)
@@ -2736,29 +2742,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	return ret;
 }
 
-static void collect_mm_slot(struct mm_slot *mm_slot)
-{
-	struct mm_struct *mm = mm_slot->mm;
-
-	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
-
-	if (khugepaged_test_exit(mm)) {
-		/* free mm_slot */
-		hash_del(&mm_slot->hash);
-		list_del(&mm_slot->mm_node);
-
-		/*
-		 * Not strictly needed because the mm exited already.
-		 *
-		 * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
-		 */
-
-		/* khugepaged_mm_lock actually not necessary for the below */
-		free_mm_slot(mm_slot);
-		mmdrop(mm);
-	}
-}
-
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 					    struct page **hpage)
 	__releases(&khugepaged_mm_lock)
@@ -2780,6 +2763,16 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		khugepaged_scan.address = 0;
 		khugepaged_scan.mm_slot = mm_slot;
 	}
+
+	/*
+	 * Do not even try to do anything if the current mm is already
+	 * dead. khugepaged_mm_lock will make sure only this or
+	 * __khugepaged_exit does the unhasing.
+	 */
+	if (!atomic_inc_not_zero(&mm_slot->mm->mm_users)) {
+		collect_mm_slot(mm_slot);
+		return progress;
+	}
 	spin_unlock(&khugepaged_mm_lock);
 
 	mm = mm_slot->mm;
@@ -2863,6 +2856,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 
 		collect_mm_slot(mm_slot);
 	}
+	mmput_async(mm);
 
 	return progress;
 }
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02 12:21       ` Michal Hocko
@ 2016-06-03 13:51         ` Andrea Arcangeli
  2016-06-03 14:46           ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Andrea Arcangeli @ 2016-06-03 13:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm,
	linux-next, linux-kernel

On Thu, Jun 02, 2016 at 02:21:10PM +0200, Michal Hocko wrote:
> Testing with the patch makes some sense as well, but I would like to
> hear from Andrea whether the approach is good because I am wondering why
> he hasn't done that before - it feels so much simpler than the current
> code.

The down_write in the exit path comes from __ksm_exit. If you don't
like it there I'd suggest to also remove it from __ksm_exit.

This is a proposed cleanup correct?

The first thing that I can notice is that khugepaged_test_exit() then
can only be called and provide the expected retval, after
atomic_inc_not_zero(mm_users). Also note mmget_not_zero() should be
used instead.

However the code still uses khugepaged_test_exit in __khugepage_enter
that won't increase the mm_users, so then the patch relaxes that check
too much, albeit only for a debug check not strictly a bug.

The cons of this change purely that it'll decrease the responsiveness
in releasing the RAM of a killed task a bit.

To me the fewer time we hold the mm_users the better and I don't see
an obvious runtime improvement coming from this change. It's a bit
simpler yes, but the down_write in the exit path is well understood,
ksm does the same thing and it's in a slow path (it only happens if
the mm that exited is the current one under scan by either ksmd or
khugepaged, so normally the down_write is not executed in the exit
path and the "mm" is collected right away both as a mm_users and
mm_count).

In short I think it's a tradeoff: pros) removes down_write in a slow
path of the the mm exit which may simplify the code a bit, cons) it
could increase the latency in freeing memory as result of a task
exiting or being killed during the khugepaged scan, for example while
the THP is being allocated. While compaction runs to allocate the THP
in collapse_huge_page, if the task is killed currently the memory is
released right away, without waiting for the allocation to succeed or
fail.

I don't see a big enough problem with the down_write in a slow path of
khugepaged_exit to justify the increased latency in releasing memory.

I was very happy by Oleg's patch reducing the mm_users holding of
userfaultfd too. That was controlled by userland so it would only be
an issue for non-cooperative usage which isn't upstream yet, and it
was also much wider than this one would become with the patch applied,
but I liked the direction.

If prefer instead to remove the down_write, you probably could move
the test_exit before the down_read/write to bail out before taking the
lock: you don't need the mmap_sem to do test_exit anymore. The only
reason the text_exit would remain in fact is just to reduce the
latency of the memory freeing, it then becomes a voluntary preempt
cond_resched() to release the memory to make a parallel ;), but unable
to let the kernel free the memory while the THP allocation runs.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 13:51         ` Andrea Arcangeli
@ 2016-06-03 14:46           ` Michal Hocko
  2016-06-03 15:10             ` Andrea Arcangeli
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-06-03 14:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm,
	linux-next, linux-kernel

On Fri 03-06-16 15:51:54, Andrea Arcangeli wrote:
> On Thu, Jun 02, 2016 at 02:21:10PM +0200, Michal Hocko wrote:
> > Testing with the patch makes some sense as well, but I would like to
> > hear from Andrea whether the approach is good because I am wondering why
> > he hasn't done that before - it feels so much simpler than the current
> > code.
> 
> The down_write in the exit path comes from __ksm_exit. If you don't
> like it there I'd suggest to also remove it from __ksm_exit.

I see

> This is a proposed cleanup correct?

yes this is a cleanup but also a robustness thing, see below.
 
> The first thing that I can notice is that khugepaged_test_exit() then
> can only be called and provide the expected retval, after
> atomic_inc_not_zero(mm_users). Also note mmget_not_zero() should be
> used instead.

I didn't get used to mmget_not_zero yet, but true a helper would be
better.

[...]

> To me the fewer time we hold the mm_users the better and I don't see
> an obvious runtime improvement coming from this change. It's a bit
> simpler yes, but the down_write in the exit path is well understood,
> ksm does the same thing and it's in a slow path (it only happens if
> the mm that exited is the current one under scan by either ksmd or
> khugepaged, so normally the down_write is not executed in the exit
> path and the "mm" is collected right away both as a mm_users and
> mm_count).

OK, I see your point. I wasn't aware that the mmap_sem is dropped
before the allocation request. Then the original code indeed might
get into exit_mmap earlier wrt. to the patch.

The reason I dislike taking write lock in the __mmput is basically
for the same reason you have pointed out. exit_mmap might be delayed
for an unbounded amount of time. khugepaged resp. ksmd might be well
behaved and release their read lock for costly operations or when they
detect the mm is dead but it is hard to guarantee that all potential
kernel users/drivers are behaving the same way. It is not really trivial
to check whether we have such users (there are 100+ users outside of mm/
as per my quick git grep).

The exit path should be as simple as possible with the amount of
external dependencies reduced to the bare minimum.

> In short I think it's a tradeoff: pros) removes down_write in a slow
> path of the the mm exit which may simplify the code a bit, cons) it
> could increase the latency in freeing memory as result of a task
> exiting or being killed during the khugepaged scan, for example while
> the THP is being allocated. While compaction runs to allocate the THP
> in collapse_huge_page, if the task is killed currently the memory is
> released right away, without waiting for the allocation to succeed or
> fail.

Are those latencies a real problem. The allocation itself shouldn't
really take a long time.

> I don't see a big enough problem with the down_write in a slow path of
> khugepaged_exit to justify the increased latency in releasing memory.

What do you think about the external dependencies mentioned above. Do
you think this is a sufficient argument wrt. occasional higher
latencies?
[...]

> If prefer instead to remove the down_write, you probably could move
> the test_exit before the down_read/write to bail out before taking the
> lock: you don't need the mmap_sem to do test_exit anymore. The only
> reason the text_exit would remain in fact is just to reduce the
> latency of the memory freeing, it then becomes a voluntary preempt
> cond_resched() to release the memory to make a parallel ;), but unable
> to let the kernel free the memory while the THP allocation runs.

OK, I will think about that as well.

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 14:46           ` Michal Hocko
@ 2016-06-03 15:10             ` Andrea Arcangeli
  2016-06-07  7:34               ` Michal Hocko
  2016-06-08  8:19               ` Vlastimil Babka
  0 siblings, 2 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2016-06-03 15:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm,
	linux-next, linux-kernel, Hugh Dickins

Hello Michal,

CC'ed Hugh,

On Fri, Jun 03, 2016 at 04:46:00PM +0200, Michal Hocko wrote:
> What do you think about the external dependencies mentioned above. Do
> you think this is a sufficient argument wrt. occasional higher
> latencies?

It's a tradeoff and both latencies would be short and uncommon so it's
hard to tell.

There's also mmput_async for paths that may care about mmput
latencies. Exit itself cannot use it, it's mostly for people taking
the mm_users pin that may not want to wait for mmput to run. It also
shouldn't happen that often, it's a slow path.

The whole model inherited from KSM is to deliberately depend only on
the mmap_sem + test_exit + mm_count, and never on mm_users, which to
me in principle doesn't sound bad. I consider KSM version a
"finegrined" implementation but I never thought it would be a problem
to wait a bit in exit() in case the slow path hits. I thought it was
more of a problem if exit() runs, the parent then start a new task but
the memory wasn't freed yet.

So I would suggest Hugh to share his view on the down_write/up_write
that may temporarily block mmput (until the next test_exit bailout
point) vs higher latency in reaching exit_mmap for a real exit(2) that
would happen with the proposed change.

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 13:49                   ` Michal Hocko
@ 2016-06-04  7:51                     ` Sergey Senozhatsky
  2016-06-06  8:39                       ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Sergey Senozhatsky @ 2016-06-04  7:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Andrea Arcangeli, Sergey Senozhatsky,
	Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Stephen Rothwell, linux-mm, linux-next, linux-kernel

Hello,

On (06/03/16 15:49), Michal Hocko wrote:
> __khugepaged_exit is called during the final __mmput and it employs a
> complex synchronization dances to make sure it doesn't race with the
> khugepaged which might be scanning this mm at the same time. This is
> all caused by the fact that khugepaged doesn't pin mm_users. Things
> would simplify considerably if we simply check the mm at
> khugepaged_scan_mm_slot and if mm_users was already 0 then we know it
> is dead and we can unhash the mm_slot and move on to another one. This
> will also guarantee that __khugepaged_exit cannot race with khugepaged
> and so we can free up the slot if it is still hashed.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

with this patch and
http://ozlabs.org/~akpm/mmotm/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem-fix-2.patch

I saw no problems during my tests (well, may be didn't test hard
enough).

	-ss

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-04  7:51                     ` Sergey Senozhatsky
@ 2016-06-06  8:39                       ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-06  8:39 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Andrea Arcangeli, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm,
	linux-next, linux-kernel

On Sat 04-06-16 16:51:14, Sergey Senozhatsky wrote:
> Hello,
> 
> On (06/03/16 15:49), Michal Hocko wrote:
> > __khugepaged_exit is called during the final __mmput and it employs a
> > complex synchronization dances to make sure it doesn't race with the
> > khugepaged which might be scanning this mm at the same time. This is
> > all caused by the fact that khugepaged doesn't pin mm_users. Things
> > would simplify considerably if we simply check the mm at
> > khugepaged_scan_mm_slot and if mm_users was already 0 then we know it
> > is dead and we can unhash the mm_slot and move on to another one. This
> > will also guarantee that __khugepaged_exit cannot race with khugepaged
> > and so we can free up the slot if it is still hashed.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> with this patch and
> http://ozlabs.org/~akpm/mmotm/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem-fix-2.patch
> 
> I saw no problems during my tests (well, may be didn't test hard
> enough).

Thanks for the testing!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page
  2016-06-03 12:28     ` [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page Ebru Akagunduz
@ 2016-06-06 13:05       ` Vlastimil Babka
  2016-06-09  3:51         ` Sergey Senozhatsky
  0 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2016-06-06 13:05 UTC (permalink / raw)
  To: Ebru Akagunduz, akpm
  Cc: sergey.senozhatsky.work, mhocko, kirill.shutemov, sfr, linux-mm,
	linux-next, linux-kernel, riel, aarcange

On 06/03/2016 02:28 PM, Ebru Akagunduz wrote:
> After creating revalidate vma function, locking inconsistency occured
> due to directing the code path to wrong label. This patch directs
> to correct label and fix the inconsistency.
>
> Related commit that caused inconsistency:
> http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0
>
> Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>

I think this does fix the inconsistency, thanks.

But looking at collapse_huge_page() as of latest -next, I wonder if 
there's another problem:

pmd = mm_find_pmd(mm, address);
...
up_read(&mm->mmap_sem);
down_write(&mm->mmap_sem);
hugepage_vma_revalidate(mm, address);
...
pte = pte_offset_map(pmd, address);

What guarantees that 'pmd' is still valid?

Vlastimil

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 15:10             ` Andrea Arcangeli
@ 2016-06-07  7:34               ` Michal Hocko
  2016-06-08  8:19               ` Vlastimil Babka
  1 sibling, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2016-06-07  7:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm,
	linux-next, linux-kernel, Hugh Dickins

On Fri 03-06-16 17:10:01, Andrea Arcangeli wrote:
> Hello Michal,
> 
> CC'ed Hugh,
> 
> On Fri, Jun 03, 2016 at 04:46:00PM +0200, Michal Hocko wrote:
> > What do you think about the external dependencies mentioned above. Do
> > you think this is a sufficient argument wrt. occasional higher
> > latencies?
> 
> It's a tradeoff and both latencies would be short and uncommon so it's
> hard to tell.
> 
> There's also mmput_async for paths that may care about mmput
> latencies. Exit itself cannot use it, it's mostly for people taking
> the mm_users pin that may not want to wait for mmput to run. It also
> shouldn't happen that often, it's a slow path.
> 
> The whole model inherited from KSM is to deliberately depend only on
> the mmap_sem + test_exit + mm_count, and never on mm_users, which to
> me in principle doesn't sound bad.

I do agree that this model is quite clever (albeit convoluted). It just
assumes that all other mmap_sem users are behaving the same. Now most
in-kernel users will do get_task_mm() and then lock mmap_sem, but I
haven't checked all of them and it is quite possible that some of those
would like to optimize in a similar way and only increment mm_count.
I might be too pessimistic about the out of mm code but I would feel
much better if the exit path didn't depend on them.

Anyway, if the current model sounds better I will definitely not insist
on my patch. It is more of an idea for simplification than a fix for
anything I have seen happening in the real life.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 15:10             ` Andrea Arcangeli
  2016-06-07  7:34               ` Michal Hocko
@ 2016-06-08  8:19               ` Vlastimil Babka
  1 sibling, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-06-08  8:19 UTC (permalink / raw)
  To: Andrea Arcangeli, Michal Hocko
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel, Hugh Dickins

On 06/03/2016 05:10 PM, Andrea Arcangeli wrote:
> Hello Michal,
>
> CC'ed Hugh,
>
> On Fri, Jun 03, 2016 at 04:46:00PM +0200, Michal Hocko wrote:
>> What do you think about the external dependencies mentioned above. Do
>> you think this is a sufficient argument wrt. occasional higher
>> latencies?
>
> It's a tradeoff and both latencies would be short and uncommon so it's
> hard to tell.

Shouldn't it be possible to do a mmput() before the hugepage allocation, 
and then again mmget_not_zero()? That way it's no longer a tradeoff?

> There's also mmput_async for paths that may care about mmput
> latencies. Exit itself cannot use it, it's mostly for people taking
> the mm_users pin that may not want to wait for mmput to run. It also
> shouldn't happen that often, it's a slow path.
>
> The whole model inherited from KSM is to deliberately depend only on
> the mmap_sem + test_exit + mm_count, and never on mm_users, which to
> me in principle doesn't sound bad. I consider KSM version a
> "finegrined" implementation but I never thought it would be a problem
> to wait a bit in exit() in case the slow path hits. I thought it was
> more of a problem if exit() runs, the parent then start a new task but
> the memory wasn't freed yet.
>
> So I would suggest Hugh to share his view on the down_write/up_write
> that may temporarily block mmput (until the next test_exit bailout
> point) vs higher latency in reaching exit_mmap for a real exit(2) that
> would happen with the proposed change.
>
> Thanks!
> Andrea
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page
  2016-06-06 13:05       ` Vlastimil Babka
@ 2016-06-09  3:51         ` Sergey Senozhatsky
  0 siblings, 0 replies; 28+ messages in thread
From: Sergey Senozhatsky @ 2016-06-09  3:51 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Ebru Akagunduz, akpm, sergey.senozhatsky.work, mhocko,
	kirill.shutemov, sfr, linux-mm, linux-next, linux-kernel, riel,
	aarcange

On (06/06/16 15:05), Vlastimil Babka wrote:
[..]
> I think this does fix the inconsistency, thanks.
> 
> But looking at collapse_huge_page() as of latest -next, I wonder if there's
> another problem:
> 
> pmd = mm_find_pmd(mm, address);
> ...
> up_read(&mm->mmap_sem);
> down_write(&mm->mmap_sem);
> hugepage_vma_revalidate(mm, address);
> ...
> pte = pte_offset_map(pmd, address);
> 
> What guarantees that 'pmd' is still valid?

the same question applied to __collapse_huge_page_swapin(), I think.

__collapse_huge_page_swapin(pmd)
	pte = pte_offset_map(pmd, address);
	do_swap_page(mm, vma, _address, pte, pmd...)
		up_read(&mm->mmap_sem);
	down_read(&mm->mmap_sem);
	pte = pte_offset_map(pmd, _address);

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2016-06-09  3:51 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-01  3:11 linux-next: Tree for Jun 1 Stephen Rothwell
2016-06-02  1:48 ` [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup Sergey Senozhatsky
2016-06-02  9:21   ` Michal Hocko
2016-06-02 12:08     ` Sergey Senozhatsky
2016-06-02 12:21       ` Michal Hocko
2016-06-03 13:51         ` Andrea Arcangeli
2016-06-03 14:46           ` Michal Hocko
2016-06-03 15:10             ` Andrea Arcangeli
2016-06-07  7:34               ` Michal Hocko
2016-06-08  8:19               ` Vlastimil Babka
2016-06-03  7:15     ` Sergey Senozhatsky
2016-06-03  7:25       ` Michal Hocko
2016-06-03  8:43         ` Sergey Senozhatsky
2016-06-03  9:55           ` Michal Hocko
2016-06-03 10:05             ` Michal Hocko
2016-06-03 13:38               ` Sergey Senozhatsky
2016-06-03 13:45                 ` Michal Hocko
2016-06-03 13:49                   ` Michal Hocko
2016-06-04  7:51                     ` Sergey Senozhatsky
2016-06-06  8:39                       ` Michal Hocko
2016-06-02 13:24   ` Vlastimil Babka
2016-06-02 18:58     ` Ebru Akagunduz
2016-06-03  1:00       ` Sergey Senozhatsky
2016-06-03  1:29         ` Sergey Senozhatsky
2016-06-03  4:14           ` Sergey Senozhatsky
2016-06-03 12:28     ` [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page Ebru Akagunduz
2016-06-06 13:05       ` Vlastimil Babka
2016-06-09  3:51         ` Sergey Senozhatsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).