* Linux 4.9-rc6
From: Linus Torvalds @ 2016-11-20 22:05 UTC
To: Linux Kernel Mailing List

We're getting further in the rc series, and while things have stayed
pretty calm, I'm not sure if we're quite there yet. There's a few
outstanding issues that just shouldn't be issues at rc6 time, so we'll
just have to see. This may be one of those releases that have an rc8,
which considering the size of 4.9 is perhaps not that unusual.

That said, nothing particular is bothering me all that much, but we've
had some of the VMALLOC_STACK fixups continue to trickle in, so I
worry that we're not quite done there yet. And let's see what
Thorsten's regression list looks like next week. So no decision yet,
it could still go either way.

The fact that rc6 is bigger than rc5 was is not a particularly great
sign, though. But most of that seems to be just the usual timing
fluctuation: rc6 had networking updates, rc5 didn't, for example.
There are also some rdma updates etc that stand out. Nothing that
looks particularly worrisome.

Aside from the aforementioned networking and rdma, there's gpu fixes,
some tooling and build fixes, and various arch updates (x86, powerpc,
arm, xtensa). And misc fixes all over (i2c, sound, fuse, kvm..)

Go forth and test,

              Linus

---

Aaron Lu (1):
      mremap: fix race between mremap() and page cleanning

Abhi Das (1):
      fix iov_iter_advance() for ITER_PIPE

Adam Ford (2):
      ARM: dts: omap3: Fix memory node in Torpedo board
      ARM: omap3: Add missing memory node in SOM-LV

Alex Deucher (1):
      drm/amdgpu/powerplay: drop a redundant NULL check

Alex Hemme (1):
      i2c: i2c-mux-pca954x: fix deselect enabling for device-tree

Alexander Duyck (1):
      fib_trie: Correct /proc/net/route off by one error

Alexei Starovoitov (1):
      ftrace: Ignore FTRACE_FL_DISABLED while walking dyn_ftrace records

Allan Chou (1):
      Net Driver: Add Cypress GX3 VID=04b4 PID=3610.

Andreas Gruenbacher (1):
      xattr: Fix setting security xattrs on sockfs

Andrew Donnellan (1):
      powerpc/oops: Fix missing pr_cont()s in instruction dump

Andy Gospodarek (1):
      bgmac: stop clearing DMA receive control register right after it is set

Aneesh Kumar K.V (1):
      powerpc/mm: Fix missing update of HID register on secondary CPUs

Arkadi Sharshevsky (1):
      mlxsw: spectrum_router: Correctly dump neighbour activity

Arnd Bergmann (4):
      brcmfmac: avoid maybe-uninitialized warning in brcmf_cfg80211_start_ap
      netfilter: ip_vs_sync: fix bogus maybe-uninitialized warning
      vxlan: hide unused local variable
      crypto: caam - fix type mismatch warning

Axl-zhang (1):
      dmaengine: sun6i: fix the uninitialized value for v_lli

Azhar Shaikh (1):
      mfd: intel-lpss: Do not put device in reset state on suspend

Baoquan He (2):
      Revert "bnx2: Reset device during driver initialization"
      bnx2: Wait for in-flight DMA to complete at probe stage

Bart Van Assche (1):
      nvmet-rdma: Fix possible NULL deref when handling rdma cm events

Baruch Siach (1):
      net: bpqether.h: remove if_ether.h guard

Benjamin Herrenschmidt (1):
      powerpc/64: Fix setting of AIL in hypervisor mode

Benjamin Poirier (1):
      bna: Add synchronization for tx ring.

Bert Kenward (1):
      sfc: clear napi_hash state when copying channels

Bibby Hsieh (3):
      drm/mediatek: fix a typo of OD_CFG to OD_RELAYMODE
      drm/mediatek: set vblank_disable_allowed to true
      drm/mediatek: clear IRQ status before enable OVL interrupt

Borislav Petkov (2):
      x86/efi: Fix EFI memmap pointer size warning
      kbuild: Steal gcc's pie from the very beginning

Chris Metcalf (1):
      tile: handle __ro_after_init like parisc does

Chris Wilson (1):
      drm/i915: Mark CPU cache as dirty when used for rendering

Christoph Hellwig (1):
      nvme-rdma: reject non-connect commands before the queue is live

Christophe JAILLET (2):
      drm/sun4i: Fix error handling
      drm/sun4i: Propagate error to the caller

Christophe Jaillet (1):
      net/mlx5: Simplify a test

Colin Ian King (3):
      ARM: OMAP2+: PRM: initialize en_uart4_mask and grpsel_uart4_mask
      net: ethernet: ixp4xx_eth: fix spelling mistake in debug message
      ps3_gelic: fix spelling mistake in debug message

Cédric Le Goater (1):
      ipmi/bt-bmc: change compatible node to 'aspeed, ast2400-ibt-bmc'

Dan Carpenter (1):
      ntb_perf: potential info leak in debugfs

Daniel Borkmann (2):
      bpf: fix htab map destruction when extra reserve is in use
      bpf: fix map not being uncharged during map creation failure

Daniel Jurgens (2):
      IB/mlx5: Use cache line size to select CQE stride
      IB/mlx4: Check gid_index return value

Dasaratharaman Chandramouli (1):
      IB/hfi1: Fix ECN processing in prescan_rxq

Dave Airlie (2):
      Revert "drm/mediatek: fix a typo of OD_CFG to OD_RELAYMODE"
      Revert "drm/mediatek: set vblank_disable_allowed to true"

Dave Gerlach (1):
      ARM: AM43XX: Select OMAP_INTERCONNECT in Kconfig

Dave Jiang (1):
      ntb: ntb_hw_intel: init peer_addr in struct intel_ntb_dev

David Ahern (4):
      net: tcp: check skb is non-NULL for exact match on lookups
      net: icmp6_send should use dst dev to determine L3 domain
      net: icmp_route_lookup should use rt dev to determine L3 domain
      net: tcp response should set oif only if it is L3 master

Dennis Dalessandro (3):
      IB/rdmavt: rdmavt can handle non aligned page maps
      IB/hfi1: Remove leftover snoop references
      IB/hfi1: Remove incorrect IS_ERR check

Dongli Zhang (2):
      xen-netfront: do not cast grant table reference to signed short
      xen-netfront: cast grant table reference first to type int

Easwar Hariharan (2):
      IB/hfi1: Clean up unused argument
      IB/hfi1: Delete unused lock

Eli Cohen (2):
      IB/mlx5: Fix fatal error dispatching
      IB/mlx5: Fix NULL pointer dereference on debug print

Eli Cooper (2):
      ip6_tunnel: Clear IP6CB in ip6tunnel_xmit()
      ip6_udp_tunnel: remove unused IPCB related codes

Eric Biggers (2):
      fscrypto: don't use on-stack buffer for filename encryption
      fscrypto: don't use on-stack buffer for key derivation

Eric Dumazet (12):
      net: clear sk_err_soft in sk_clone_lock()
      net: mangle zero checksum in skb_checksum_help()
      tcp: fix potential memory corruption
      tcp: fix return value for partial writes
      dccp: do not release listeners too soon
      dccp: do not send reset to already closed sockets
      dccp: fix out of bound access in dccp_v4_err()
      netlink: netlink_diag_dump() runs without locks
      ipv6: dccp: fix out of bound access in dccp_v6_err()
      ipv6: dccp: add missing bind_conflict to dccp_ipv6_mapped
      net: __skb_flow_dissect() must cap its return value
      tcp: take care of truncations done by sk_filter()

Eugeniy Paltsev (1):
      drm/arcpgu: Accommodate adv7511 switch to DRM bridge

Fabian Mewes (1):
      Documentation: networking: dsa: Update tagging protocols

Fabio Estevam (1):
      ARM: dts: imx53-qsb: Fix regulator constraints

Florian Fainelli (1):
      net: stmmac: Fix lack of link transition for fixed PHYs

Florian Westphal (5):
      netfilter: conntrack: avoid excess memory allocation
      dctcp: avoid bogus doubling of cwnd after loss
      netfilter: connmark: ignore skbs with magic untracked conntrack objects
      netfilter: conntrack: fix CT target for UNSPEC helpers
      netfilter: conntrack: refine gc worker heuristics

Gao Feng (1):
      driver: macvlan: Destroy new macvlan port if macvlan_common_newlink failed.

Gregory CLEMENT (1):
      arm64: dts: marvell: Fix typo in label name on Armada 37xx

Guenter Roeck (1):
      r8152: Fix error path in open function

Guilherme G. Piccoli (1):
      ehea: fix operation state report

H. Nikolaus Schaller (4):
      dts: omap5: board-common: add phandle to reference Palmas gpadc
      dts: omap5: board-common: enable twl6040 headset jack detection
      ASoC: omap-abe-twl6040: fix typo in bindings documentation
      ARM: dts: omap5: board-common: fix wrong SMPS6 (VDD-DDR3) voltage

Haim Dreyfuss (1):
      iwlwifi: mvm: comply with fw_restart mod param on suspend

Hariprasad Shenai (1):
      cxgb4: correct device ID of T6 adapter

Heikki Krogerus (1):
      mfd: intel_soc_pmic_bxtwc: Fix usbc interrupt

Herbert Xu (1):
      crypto: algif_hash - Fix NULL hash crash with shash

Hoan Tran (1):
      mailbox: PCC: Fix lockdep warning when request PCC channel

Hugh Dickins (1):
      powerpc: Fix exception vector build with 2.23 era binutils

Hui Wang (1):
      ALSA: hda - add a new condition to check if it is thinkpad

Huy Nguyen (1):
      net/mlx5: Fix invalid pointer reference when prof_sel parameter is invalid

Icenowy Zheng (1):
      ARM: dts: sun8i: fix the pinmux for UART1

Ido Schimmel (2):
      mlxsw: spectrum: Fix incorrect reuse of MID entries
      mlxsw: spectrum_router: Flush FIB tables during fini

Ignacio Alvarado (1):
      KVM: Disable irq while unregistering user notifier

Ira Weiny (1):
      IB/hfi1: Fix rnr_timer addition

Isaac Boukris (1):
      unix: escape all null bytes in abstract unix domain socket

Iyappan Subramanian (2):
      drivers: net: xgene: fix: Disable coalescing on v1 hardware
      drivers: net: xgene: fix: Coalescing values for v2 hardware

Jakub Pawlak (2):
      IB/hfi1: Fix integrity check flags default values
      IB/hfi1: Fix status error code for unsupported packets

Jakub Sitnicki (1):
      ipv6: Don't use ufo handling on later transformed packets

Jarkko Nikula (1):
      mfd: lpss: Fix Intel Kaby Lake PCH-H properties

Javier Martinez Canillas (1):
      rtc: asm9260: fix module autoload

Jianxin Xiong (2):
      IB/hfi1: Fix a potential memory leak in hfi1_create_ctxts()
      IB/hfi1: Prevent hardware counter names from being cut off

Jiri Pirko (2):
      mlxsw: spectrum_router: Fix handling of neighbour structure
      mlxsw: spectrum_router: Ignore FIB notification events for non-init namespaces

Johan Hovold (5):
      phy: fix device reference leaks
      net: ethernet: ti: cpsw: fix device and of_node leaks
      net: ethernet: ti: davinci_emac: fix device reference leak
      net: hns: fix device reference leaks
      mfd: core: Fix device reference leak in mfd_clone_cell

Johannes Berg (1):
      iwlwifi: pcie: mark command queue lock with separate lockdep class

John Allen (1):
      ibmvnic: Start completion queue negotiation at server-provided optimum values

John W. Linville (1):
      netfilter: nf_tables: fix type mismatch with error return from nft_parse_u32_check

Jonathan Liu (1):
      drm/sun4i: rgb: Enable panel after controller

Junzhi Zhao (3):
      drm/mediatek: do mtk_hdmi_send_infoframe after HDMI clock enable
      drm/mediatek: enhance the HDMI driving current
      drm/mediatek: modify the factor to make the pll_rate set in the 1G-2G range

Jérémy Lefaure (1):
      dmaengine: mmp_tdma: add missing select GENERIC_ALLOCATOR in Kconfig

Kan Liang (1):
      perf/x86/intel/uncore: Add more Intel uncore IMC PCI IDs for SkyLake

Keith Busch (1):
      nvme/pci: Don't free queues on error

Keno Fischer (1):
      gpio: Remove GPIO_DEVRES option

Krzysztof Blaszkowski (2):
      IB/hfi1: Return ENODEV for unsupported PCI device ids.
      IB/hfi1: Relocate rcvhdrcnt module parameter check.

LABBE Corentin (1):
      rtc: cmos: remove all __exit_p annotations

Lance Richardson (2):
      ipv4: allow local fragmentation in ip_finish_output_gso()
      ipv4: update comment to document GSO fragmentation cases.

Leon Romanovsky (1):
      IB/core: Set routable RoCE gid type for ipv4/ipv6 networks

Linus Torvalds (3):
      Revert "printk: make reading the kernel log flush pending lines"
      ASoC: lpass-platform: fix uninitialized variable
      Linux 4.9-rc6

Linus Walleij (5):
      video: ARM CLCD: fix Vexpress regression
      i2c: mux: fix up dependencies
      gpio: do not double-check direction on sleeping chips
      gpio: tc3589x: fix up .get_direction()
      mfd: stmpe: Fix RESET regression on STMPE2401

Liping Zhang (6):
      netfilter: nft_dynset: fix panic if NFT_SET_HASH is not enabled
      netfilter: nf_tables: fix *leak* when expr clone fail
      netfilter: nf_tables: fix race when create new element in dynset
      netfilter: nf_tables: destroy the set if fail to add transaction
      netfilter: nft_dup: do not use sreg_dev if the user doesn't specify it
      netfilter: nf_tables: fix oops when inserting an element into a verdict map

Loic Pallardy (1):
      ARM: dts: STiH410-b2260: Fix typo in spi0 chipselect definition

Lokesh Vutla (1):
      rtc: omap: Fix selecting external osc

Luca Coelho (4):
      iwlwifi: mvm: use ssize_t for len in iwl_debugfs_mem_read()
      iwlwifi: mvm: fix d3_test with unified D0/D3 images
      iwlwifi: pcie: fix SPLC structure parsing
      iwlwifi: mvm: fix netdetect starting/stopping for unified images

Lukas Resch (1):
      can: sja1000: plx_pci: Add support for Moxa CAN devices

Lukas Wunner (1):
      x86/platform/intel-mid: Retrofit pci_platform_pm_ops ->get_state hook

Lv Zheng (1):
      tools/power/acpi: Remove direct kernel source include reference

Maciej Żenczykowski (1):
      net-ipv6: on device mtu change do not add mtu to mtu-less routes

Majd Dibbiny (1):
      IB/mlx5: Fix memory leak in query device

Maor Gottlieb (1):
      IB/mlx5: Validate requested RQT size

Marcelo Ricardo Leitner (1):
      sctp: assign assoc_id earlier in __sctp_connect

Marcin Wojtas (2):
      arm64: dts: marvell: fix clocksource for CP110 slave SPI0
      arm64: dts: marvell: add unique identifiers for Armada A8k SPI controllers

Marek Szyprowski (1):
      ARM: 8628/1: dma-mapping: preallocate DMA-debug hash tables in core_initcall

Mario Kleiner (1):
      drm/amdgpu: Attach exclusive fence to prime exported bo's. (v5)

Mark Bloch (3):
      IB/cm: Mark stale CM id's whenever the mad agent was unregistered
      IB/core: Add missing check for addr_resolve callback return value
      IB/core: Avoid unsigned int overflow in sg_alloc_table

Mark Lord (1):
      r8152: Fix broken RX checksums.

Martin KaFai Lau (2):
      bpf: Fix bpf_redirect to an ipip/ip6tnl dev
      bpf: Add test for bpf_redirect to ipip/ip6tnl

Matan Barak (1):
      IB/mlx4: Fix create CQ error flow

Mathias Krause (1):
      rtnl: reset calcit fptr in rtnl_unregister()

Matt Fleming (1):
      x86/efi: Prevent mixed mode boot corruption with CONFIG_VMAP_STACK=y

Mauro Carvalho Chehab (1):
      gp8psk-fe: add missing MODULE_foo() macros

Max Filippov (2):
      xtensa: clean up printk usage for boot/crash logging
      xtensa: wire up new pkey_{mprotect,alloc,free} syscalls

Maxime Ripard (1):
      drm/sun4i: rgb: Remove the bridge enable/disable functions

Michael Chan (2):
      bnxt_en: Fix ring arithmetic in bnxt_setup_tc().
      bnxt_en: Fix VF virtual link state.

Michael Ellerman (3):
      powerpc/oops: Fix missing pr_cont()s in show_stack()
      powerpc/oops: Fix missing pr_cont()s in print_msr_bits() et. al.
      powerpc/oops: Fix missing pr_cont()s in show_regs()

Michael Neuling (1):
      powerpc/mm/radix: Invalidate ERAT on tlbiel for POWER9 DD1

Michael S. Tsirkin (1):
      virtio-net: drop legacy features in virtio 1 mode

Mike Frysinger (1):
      Revert "include/uapi/linux/atm_zatm.h: include linux/time.h"

Mike Marshall (1):
      orangefs: add .owner to debugfs file_operations

Miklos Szeredi (2):
      fuse: fix root dentry initialization
      fuse: fix fuse_write_end() if zero bytes were copied

Mintz, Yuval (2):
      qede: Fix statistics' strings for Tx/Rx queues
      qede: Correctly map aggregation replacement pages

Monk Liu (1):
      drm/amdgpu:fix vpost_needed routine

Moshe Lazer (1):
      IB/mlx5: Resolve soft lock on massive reg MRs

Namhyung Kim (5):
      perf hist browser: Fix hierarchy column counts
      perf hists browser: Fix indentation of folded sign on --hierarchy
      perf hists browser: Show folded sign properly on --hierarchy
      perf hists browser: Fix column indentation on --hierarchy
      perf hists: Fix column length on --hierarchy

Nicholas Mc Guire (2):
      ntb_transport: make DMA_OUT_RESOURCE_TO HZ independent
      ntb: make DMA_OUT_RESOURCE_TO HZ independent

Nicholas Piggin (4):
      kbuild: prevent lib-ksyms.o rebuilds
      kbuild: modversions for EXPORT_SYMBOL() for asm
      kbuild: be more careful about matching preprocessed asm ___EXPORT_SYMBOL
      powerpc/64s: Fix system reset interrupt winkle wakeups

Nicolae Rosia (1):
      ARM: OMAP2+: avoid NULL pointer dereference

Nicolas Pitre (1):
      ARM: 8624/1: proc-v7m.S: fix init section name

Oliver Hartkopp (1):
      can: bcm: fix warning in bcm_connect/proc_register

Or Gerlitz (3):
      net/mlx5e: Disallow changing name-space for VF representors
      net/mlx5e: Handle matching on vlan priority for offloaded TC rules
      net/mlx5: E-Switch, Set the actions for offloaded rules properly

Paolo Bonzini (5):
      KVM: x86: do not go through vcpu in __get_kvmclock_ns
      kvm: kvmclock: let KVM_GET_CLOCK return whether the master clock is in use
      KVM: async_pf: avoid recursive flushing of work items
      KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr
      kvm: x86: merge kvm_arch_set_irq and kvm_arch_set_irq_inatomic

Pavel Machek (1):
      MAINTAINERS: Add LED subsystem co-maintainer

Peter Rosin (1):
      i2c: Documentation: i2c-topology: fix minor whitespace nit

Phil Reid (2):
      gpio: pca953x: Fix corruption of other gpios in set_multiple.
      gpio: pca953x: Move memcpy into mutex lock for set multiple

Rafael J. Wysocki (1):
      Revert "ACPICA: FADT support cleanup"

Rafał Miłecki (1):
      net: bgmac: fix reversed checks for clock control flag

Ram Amrani (2):
      qed: configure ll2 RoCE v1/v2 flavor correctly
      qed: Correct rdma params configuration

Russell King (3):
      net: mv643xx_eth: ensure coalesce settings survive read-modify-write
      ARM: fix backtrace
      ARM: Fix XIP kernels

Saeed Mahameed (3):
      MAINTAINERS: Update MELLANOX MLX5 core VPI driver maintainers
      net/mlx5e: Fix XDP error path of mlx5e_open_channel()
      net/mlx5e: Re-arrange XDP SQ/CQ creation

Sagi Grimberg (3):
      nvmet: Don't queue fatal error work if csts.cfs is set
      nvmet-rdma: don't forget to delete a queue from the list of connection failed
      nvmet-rdma: drain the queue-pair just before freeing it

Sara Sharon (1):
      iwlwifi: mvm: wake the wait queue when the RX sync counter is zero

Scott Mayhew (1):
      sunrpc: svc_age_temp_xprts_now should not call setsockopt non-tcp transports

Sebastian Andrzej Siewior (3):
      kbuild: add -fno-PIE
      scripts/has-stack-protector: add -fno-PIE
      x86/kexec: add -fno-PIE

Soheil Hassas Yeganeh (1):
      sock: fix sendmmsg for partial sendmsg

Stefan Agner (3):
      drm/fsl-dcu: do not update when modifying irq registers
      drm/fsl-dcu: update all registers on flush
      drm/fsl-dcu: disable planes before disabling CRTC

Stephen Suryaputra Lin (1):
      ipv4: use new_gw for redirect neigh lookup

Steve Wise (3):
      nvme-rdma: stop and free io queues on connect failure
      iw_cxgb4: set *bad_wr for post_send/post_recv errors
      iw_cxgb4: invalidate the mr when posting a read_w_inv wr

Steven Rostedt (Red Hat) (1):
      ftrace: Add more checks for FTRACE_FL_DISABLED in processing ip records

Sven Ebenfeld (1):
      crypto: caam - do not register AES-XTS mode on LP units

Tadeusz Struk (2):
      IB/hfi1: Remove redundant sysfs irq affinity entry
      IB/hfi1: Fix an Oops on pci device force remove

Takashi Iwai (2):
      ALSA: hda - Fix mic regression by ASRock mobo fixup
      ALSA: usb-audio: Fix use-after-free of usb_device at disconnect

Tariq Toukan (2):
      Revert "net/mlx4_en: Fix panic during reboot"
      IB/uverbs: Fix leak of XRC target QPs

Tero Kristo (1):
      rtc: omap: prevent disabling of clock/module during suspend

Theodore Ts'o (1):
      ext4: sanity check the block and cluster size at mount time

Thomas Falcon (2):
      ibmvnic: Unmap ibmvnic_statistics structure
      ibmvnic: Fix size of debugfs name buffer

Thomas Gleixner (2):
      genirq: Use irq type from irqdata instead of irqdesc
      x86/cpu: Deal with broken firmware (VMWare/XEN)

Timur Tabi (3):
      net: qcom/emac: use correct value for SGMII_LN_UCDR_SO_GAIN_MODE0
      net: qcom/emac: configure the external phy to allow pause frames
      net: qcom/emac: enable flow control if requested

Tony Lindgren (5):
      ARM: OMAP3: Fix formatting of features printed
      dmaengine: cppi41: Fix list not empty warning on module removal
      dmaengine: cppi41: Fix unpaired pm runtime when only a USB hub is connected
      dmaengine: cpp41: Fix handling of error path
      dmaengine: cppi41: More PM runtime fixes

Ulrich Weber (1):
      netfilter: nf_conntrack_sip: extend request line validation

Ville Syrjälä (4):
      rtc: cmos: Don't enable interrupts in the middle of the interrupt handler
      drm/i915: Grab the rotation from the passed plane state for VLV sprites
      drm/i915: Refresh that status of MST capable connectors in ->detect()
      drm/i915: Assume non-DP++ port if dvo_port is HDMI and there's no AUX ch specified in the VBT

WANG Cong (4):
      inet: fix sleeping inside inet_wait_for_connect()
      genetlink: fix a memory leak on error path
      taskstats: fix the length of cgroupstats_cmd_get_policy
      ipvs: use IPVS_CMD_ATTR_MAX for family.maxattr

Wei Huang (2):
      arm64: KVM: pmu: Fix AArch32 cycle counter access
      KVM: arm64: Fix the issues when guest PMCCFILTR is configured

Wei Yongjun (4):
      dmaengine: edma: Fix error return code in edma_alloc_chan_resources()
      ntb_pingpong: Fix db_init parameter description
      NTB: ntb_hw_intel: Fix typo in module parameter descriptions
      i2c: digicolor: use clk_disable_unprepare instead of clk_unprepare

Wolfram Sang (1):
      i2c: mux: demux-pinctrl: make drivers with no pinctrl work again

Xin Long (5):
      ipv6: add mtu lock check in __ip6_rt_update_pmtu
      sctp: hold transport instead of assoc in sctp_diag
      sctp: return back transport in __sctp_rcv_init_lookup
      sctp: hold transport instead of assoc when lookup assoc in rx path
      sctp: change sk state only when it has assocs in sctp_shutdown

Yazen Ghannam (1):
      x86/cpu/AMD: Fix cpu_llc_id for AMD Fam17h systems

Yonatan Cohen (4):
      IB/rxe: Fix kernel panic in UDP tunnel with GRO and RX checksum
      IB/rxe: Fix handling of erroneous WR
      IB/rxe: Clear queue buffer when modifying QP to reset
      IB/rxe: Update qp state for user query

Yotam Gigi (1):
      mlxsw: spectrum: Fix refcount bug on span entries

* Re: Linux 4.9-rc6
From: Eric Dumazet @ 2016-11-20 22:27 UTC
To: Linus Torvalds; +Cc: Linux Kernel Mailing List

On Sun, 2016-11-20 at 14:05 -0800, Linus Torvalds wrote:

> That said, nothing particular is bothering me all that much, but we've
> had some of the VMALLOC_STACK fixups continue to trickle in, so I
> worry that we're not quite done there yet. And let's see what
> Thorsten's regression list looks like next week. So no decision yet,
> it could still go either way.

Hosts with ~100,000 threads have an issue with /proc/vmallocinfo

It can take about 800 usec to skip over ~100,000 struct vmap_area
in s_start(), while holding vmap_area_lock spinlock, and therefore
blocking fork()/pthread_create().

I presume we can not switch to the rbtree (vmap_area_root)
for /proc/vmallocinfo, because this file is seek-able, right ?

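The cost Eric describes comes from finding the resume position by walking the list from its head on every read. A minimal userspace sketch of that pattern follows. It is an illustration only, not the kernel's s_start(): `struct area`, `start_at()` and `total_skips()` are hypothetical names, and the spinlock held during the walk is left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct vmap_area: only the list linkage matters. */
struct area {
	struct area *next;
};

/*
 * Mimic a seq_file-style start routine that finds entry number `pos` by
 * walking the list from its head. Returns the entry (NULL if pos is past
 * the end) and adds the number of nodes stepped over to *steps.
 */
static struct area *start_at(struct area *head, unsigned long pos,
			     unsigned long *steps)
{
	struct area *a = head;

	while (pos > 0 && a) {
		a = a->next;
		pos--;
		(*steps)++;
	}
	return a;
}

/*
 * Total nodes skipped while a reader drains n entries, assuming each
 * read() emits per_read entries (per_read must be > 0) and the next
 * read() restarts the walk from the head.
 */
static unsigned long total_skips(struct area *head, unsigned long n,
				 unsigned long per_read)
{
	unsigned long steps = 0, pos;

	for (pos = 0; pos < n; pos += per_read)
		start_at(head, pos, &steps);
	return steps;
}
```

Under these assumptions the total work grows roughly with n * n / per_read, and in the real file each of those walks happens with the lock held.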
* Re: Linux 4.9-rc6
From: Linus Torvalds @ 2016-11-20 23:27 UTC
To: Eric Dumazet, Al Viro; +Cc: Linux Kernel Mailing List

On Sun, Nov 20, 2016 at 2:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Hosts with ~100,000 threads have an issue with /proc/vmallocinfo
>
> It can take about 800 usec to skip over ~100,000 struct vmap_area
> in s_start(), while holding vmap_area_lock spinlock, and therefore
> blocking fork()/pthread_create().
>
> I presume we can not switch to the rbtree (vmap_area_root)
> for /proc/vmallocinfo, because this file is seek-able, right ?

Well, the good news is that the file is root-only anyway, which means
that at least it won't have the issue that a lot of other /proc files
have had - namely being opened by random user programs or libraries.

Which means that the users of it are likely fairly limited.

Which in turn means that we can probably afford to play more games
with it. Including, for example, possibly marking it non-seekable.

Or even just limit the maximum entries we are willing to walk.

Or we could decide that that file shouldn't be a seq_file at all, use
the old "one page buffer" approach that was so common for /proc files,
and make the position encode the vmalloc address in it (make the lower
PAGE_MASK bits be the offset in the line), and then we *could* just
look things up using the btree method.

Al, do you have any clever ideas?

              Linus

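The position-encoding idea in the last suggestion can be sketched as follows. This is only an illustration under stated assumptions: 4 KiB pages, page-aligned area start addresses, output lines shorter than one page, and a sorted array standing in for the tree lookup. All names (`pos_encode()`, `find_area_at_or_after()` and so on) are hypothetical. A real implementation would also have to cope with the file offset being a signed type while kernel addresses have the top bit set, for example by storing an offset from the start of the vmalloc range.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT  12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE   ((uint64_t)1 << SKETCH_PAGE_SHIFT)
#define SKETCH_OFFSET_MASK (SKETCH_PAGE_SIZE - 1)

/* Pack a page-aligned area address and a byte offset within its line. */
static uint64_t pos_encode(uint64_t area_addr, uint64_t line_off)
{
	return (area_addr & ~SKETCH_OFFSET_MASK) |
	       (line_off & SKETCH_OFFSET_MASK);
}

/* Recover the area address from an encoded position. */
static uint64_t pos_area(uint64_t pos)
{
	return pos & ~SKETCH_OFFSET_MASK;
}

/* Recover the offset inside the area's output line. */
static uint64_t pos_line_off(uint64_t pos)
{
	return pos & SKETCH_OFFSET_MASK;
}

/*
 * Stand-in for an ordered-tree lookup: index of the first area whose
 * start is >= addr in a sorted array, or n if there is none. A stale
 * position whose area has gone away still resolves to the next area.
 */
static long find_area_at_or_after(const uint64_t *starts, long n,
				  uint64_t addr)
{
	long lo = 0, hi = n;

	while (lo < hi) {
		long mid = lo + (hi - lo) / 2;

		if (starts[mid] < addr)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}
```

With the address in the position, resuming a read is a logarithmic lookup instead of a walk from the head of the list, and it does not depend on how many entries were added or removed in between.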
* Re: Linux 4.9-rc6
From: Al Viro @ 2016-11-21 1:35 UTC
To: Linus Torvalds; +Cc: Eric Dumazet, Linux Kernel Mailing List

On Sun, Nov 20, 2016 at 03:27:07PM -0800, Linus Torvalds wrote:
> On Sun, Nov 20, 2016 at 2:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Hosts with ~100,000 threads have an issue with /proc/vmallocinfo
> >
> > It can take about 800 usec to skip over ~100,000 struct vmap_area
> > in s_start(), while holding vmap_area_lock spinlock, and therefore
> > blocking fork()/pthread_create().
> >
> > I presume we can not switch to the rbtree (vmap_area_root)
> > for /proc/vmallocinfo, because this file is seek-able, right ?
>
> Well, the good news is that the file is root-only anyway, which means
> that at least it won't have the issue that a lot of other /proc files
> have had - namely being opened by random user programs or libraries.
>
> Which means that the users of it are likely fairly limited.
>
> Which in turn means that we can probably afford to play more games
> with it. Including, for example, possibly marking it non-seekable.
>
> Or even just limit the maximum entries we are willing to walk.
>
> Or we could decide that that file shouldn't be a seq_file at all, use
> the old "one page buffer" approach that was so common for /proc files,
> and make the position encode the vmalloc address in it (make the lower
> PAGE_MASK bits be the offset in the line), and then we *could* just
> look things up using the btree method.
>
> Al, do you have any clever ideas?

Umm...  One possibility would be something like fs/namespace.c:m_start() -
if nothing has changed since the last time, just use a cached pointer.
That has sped the damn thing (/proc/mounts et.al.) big way, but it's
dependent upon having an event count updated whenever we change the
mount tree - doing the same for vmap_area list might or might not be
a good idea.  /proc/mounts and friends get ->poll() on that as well;
that probably would _not_ be a good idea in this case.

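The cached-pointer approach Al mentions can be sketched in userspace like this. It is loosely modelled on the idea only and is not the code in fs/namespace.c; `struct tracked_list`, `struct cursor` and the functions below are hypothetical, and locking is omitted.

```c
#include <stddef.h>

struct node {
	struct node *next;
};

/* A list plus an event count that every modification must bump. */
struct tracked_list {
	struct node *head;
	unsigned long event;
};

/* Per-reader cache of where the previous read stopped. */
struct cursor {
	unsigned long cached_event;
	unsigned long cached_pos;
	struct node *cached_node;
	int valid;
};

/* Insert at the head and record that the list changed. */
static void list_push(struct tracked_list *l, struct node *n)
{
	n->next = l->head;
	l->head = n;
	l->event++;
}

/*
 * Find entry number pos. If the list is unchanged since the cursor was
 * filled and the reader resumes where it stopped, reuse the cached
 * pointer; otherwise walk from the head, counting steps in *steps.
 */
static struct node *cursor_start(struct tracked_list *l, struct cursor *c,
				 unsigned long pos, unsigned long *steps)
{
	struct node *n;
	unsigned long i;

	if (c->valid && c->cached_event == l->event && c->cached_pos == pos)
		return c->cached_node;

	n = l->head;
	for (i = 0; i < pos && n; i++) {
		n = n->next;
		(*steps)++;
	}
	c->cached_event = l->event;
	c->cached_pos = pos;
	c->cached_node = n;
	c->valid = 1;
	return n;
}

/* Advance the cursor by one entry, keeping the cache in step. */
static struct node *cursor_next(struct cursor *c)
{
	if (c->cached_node) {
		c->cached_node = c->cached_node->next;
		c->cached_pos++;
	}
	return c->cached_node;
}
```

The fast path only helps while the event count stays the same. On a host that creates and destroys threads constantly, the vmap_area list changes often, which may be part of why Al hedges on whether the same trick is a good fit here.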
* Re: Linux 4.9-rc6
From: Eric Dumazet @ 2016-11-21 4:59 UTC
To: Al Viro; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Mon, 2016-11-21 at 01:35 +0000, Al Viro wrote:
>
> Umm...  One possibility would be something like fs/namespace.c:m_start() -
> if nothing has changed since the last time, just use a cached pointer.
> That has sped the damn thing (/proc/mounts et.al.) big way, but it's
> dependent upon having an event count updated whenever we change the
> mount tree - doing the same for vmap_area list might or might not be
> a good idea.  /proc/mounts and friends get ->poll() on that as well;
> that probably would _not_ be a good idea in this case.

Yes, a generation number could help in some cases.

Another potential issue with CONFIG_VMAP_STACK is that we make no
attempt to allocate 4 consecutive pages.

Even if we have plenty of memory, 4 calls to alloc_page() are likely to
give us 4 pages in completely different locations.

Here I printed the hugepage number of the 4 pages for some stacks :

0xffffc9001a07c000-0xffffc9001a081000   20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfcac Hfeba Hfec0 Hfc9d  N0=4
0xffffc9001a084000-0xffffc9001a089000   20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfc79 Hfc79 Hfc79 Hfc83  N0=4
0xffffc9001a08c000-0xffffc9001a091000   20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfc9b Hfe91 Hfebe Hfca2  N0=4
0xffffc9001a094000-0xffffc9001a099000   20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfcaa Hfcaa Hfca6 Hfebc  N0=4
0xffffc9001a09c000-0xffffc9001a0a1000   20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfe9b Hfe90 Hff09 Hfefb  N0=4
0xffffc9001a0a4000-0xffffc9001a0a9000   20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfe94 Hfe62 Hfea0 Hfe7b  N0=4
0xffffc9001a0ac000-0xffffc9001a0b1000   20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfe78 Hff05 Hff05 Hfc74  N0=4
0xffffc9001a0b4000-0xffffc9001a0b9000   20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfc9b Hfc9b Hfe83 Hf782  N0=4
0xffffc9001a0bc000-0xffffc9001a0c1000   20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfe78 Hfe78 Hfc7f Hfc7f  N0=4
0xffffc9001a0c4000-0xffffc9001a0c9000   20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfebe Hfebe Hfe82 Hfe85  N0=4
0xffffc9001a0cc000-0xffffc9001a0d1000   20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfc6b Hfe62 Hfe62 Hfcaa  N0=4
0xffffc9001a0d4000-0xffffc9001a0d9000   20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfebd Hfebd Hfc92 Hfc92  N0=4

This is a vmalloc() generic issue that is worth fixing now ?

Note this RFC might conflict with NUMA interleave policy.

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f2481cb4e6b2..0123e97debb9 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1602,9 +1602,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 				 pgprot_t prot, int node)
 {
 	struct page **pages;
-	unsigned int nr_pages, array_size, i;
+	unsigned int nr_pages, array_size, i, j;
 	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
 	const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
+	const gfp_t multi_alloc_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_NORETRY;
 
 	nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
 	array_size = (nr_pages * sizeof(struct page *));
@@ -1624,20 +1625,34 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 		return NULL;
 	}
 
-	for (i = 0; i < area->nr_pages; i++) {
-		struct page *page;
-
-		if (node == NUMA_NO_NODE)
-			page = alloc_page(alloc_mask);
-		else
-			page = alloc_pages_node(node, alloc_mask, 0);
+	for (i = 0; i < area->nr_pages;) {
+		struct page *page = NULL;
+		unsigned int chunk_order = min(ilog2(area->nr_pages - i), MAX_ORDER - 1);
+
+		while (chunk_order && !page) {
+			if (node == NUMA_NO_NODE)
+				page = alloc_pages(multi_alloc_mask, chunk_order);
+			else
+				page = alloc_pages_node(node, multi_alloc_mask, chunk_order);
+			if (page)
+				split_page(page, chunk_order);
+			else
+				chunk_order--;
+		}
+		if (!page) {
+			if (node == NUMA_NO_NODE)
+				page = alloc_pages(alloc_mask, 0);
+			else
+				page = alloc_pages_node(node, alloc_mask, 0);
+		}
 
 		if (unlikely(!page)) {
 			/* Successfully allocated i pages, free them in __vunmap() */
 			area->nr_pages = i;
 			goto fail;
 		}
-		area->pages[i] = page;
+		for (j = 0; j < (1 << chunk_order); j++)
+			area->pages[i++] = page++;
 		if (gfpflags_allow_blocking(gfp_mask))
 			cond_resched();
 	}

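To see which chunk sizes the loop in the RFC above would try, the order selection can be replayed in userspace. This is a sketch under assumptions, not kernel code: `plan_chunks()` and `ilog2_u()` are hypothetical helpers, `SKETCH_MAX_ORDER` is an assumed value, and allocation success is modelled by a simple rule (any order up to `max_ok_order` succeeds, anything larger fails, order 0 always succeeds).

```c
#include <stddef.h>

#define SKETCH_MAX_ORDER 11	/* assumption, not taken from the thread */

/* floor(log2(v)) for v > 0; userspace stand-in for the kernel's ilog2(). */
static unsigned int ilog2_u(unsigned int v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * Record in orders[] the order of each chunk used to cover nr_pages,
 * starting from the largest power of two that fits in what remains and
 * falling back to smaller orders when the modelled allocation fails.
 * Returns the number of chunks written (at most cap).
 */
static size_t plan_chunks(unsigned int nr_pages, unsigned int max_ok_order,
			  unsigned int *orders, size_t cap)
{
	size_t n = 0;
	unsigned int i = 0;

	while (i < nr_pages && n < cap) {
		unsigned int order = ilog2_u(nr_pages - i);

		if (order > SKETCH_MAX_ORDER - 1)
			order = SKETCH_MAX_ORDER - 1;
		/* Modelled failure: retry with the next smaller order. */
		while (order > max_ok_order)
			order--;
		orders[n++] = order;
		i += 1u << order;
	}
	return n;
}
```

For a 4-page stack this yields a single order-2 chunk when high-order pages are available, and four order-0 chunks when they are not, which matches the fallback behaviour the patch aims for.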
* Re: Linux 4.9-rc6 2016-11-21 4:59 ` Eric Dumazet @ 2016-11-21 8:34 ` David Rientjes 2016-11-21 13:32 ` Eric Dumazet 0 siblings, 1 reply; 12+ messages in thread From: David Rientjes @ 2016-11-21 8:34 UTC (permalink / raw) To: Eric Dumazet; +Cc: Al Viro, Linus Torvalds, Linux Kernel Mailing List On Sun, 20 Nov 2016, Eric Dumazet wrote: > Another potential issue with CONFIG_VMAP_STACK is that we make no > attempt to allocate 4 consecutive pages. > > Even if we have plenty of memory, 4 calls to alloc_page() are likely to > give us 4 pages in completely different locations. > > Here I printed the hugepage number of the 4 pages for some stacks : > > > 0xffffc9001a07c000-0xffffc9001a081000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfcac Hfeba Hfec0 Hfc9d N0=4 > 0xffffc9001a084000-0xffffc9001a089000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfc79 Hfc79 Hfc79 Hfc83 N0=4 > 0xffffc9001a08c000-0xffffc9001a091000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfc9b Hfe91 Hfebe Hfca2 N0=4 > 0xffffc9001a094000-0xffffc9001a099000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfcaa Hfcaa Hfca6 Hfebc N0=4 > 0xffffc9001a09c000-0xffffc9001a0a1000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfe9b Hfe90 Hff09 Hfefb N0=4 > 0xffffc9001a0a4000-0xffffc9001a0a9000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfe94 Hfe62 Hfea0 Hfe7b N0=4 > 0xffffc9001a0ac000-0xffffc9001a0b1000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfe78 Hff05 Hff05 Hfc74 N0=4 > 0xffffc9001a0b4000-0xffffc9001a0b9000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfc9b Hfc9b Hfe83 Hf782 N0=4 > 0xffffc9001a0bc000-0xffffc9001a0c1000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfe78 Hfe78 Hfc7f Hfc7f N0=4 > 0xffffc9001a0c4000-0xffffc9001a0c9000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfebe Hfebe Hfe82 Hfe85 N0=4 > 0xffffc9001a0cc000-0xffffc9001a0d1000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfc6b Hfe62 Hfe62 Hfcaa N0=4 > 0xffffc9001a0d4000-0xffffc9001a0d9000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfebd Hfebd Hfc92 Hfc92 N0=4 > > This is a 
vmalloc() generic issue that is worth fixing now ? > > Note this RFC might conflict with NUMA interleave policy. > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index f2481cb4e6b2..0123e97debb9 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -1602,9 +1602,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, > pgprot_t prot, int node) > { > struct page **pages; > - unsigned int nr_pages, array_size, i; > + unsigned int nr_pages, array_size, i, j; > const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO; > const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN; > + const gfp_t multi_alloc_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_NORETRY; > > nr_pages = get_vm_area_size(area) >> PAGE_SHIFT; > array_size = (nr_pages * sizeof(struct page *)); I think multi_alloc_mask wants to use alloc_mask rather than gfp_mask before clearing the bit, otherwise the failed high-order allocations with no chance to reclaim will spew page allocation failure warnings. Using __GFP_NORETRY here would be a no-op, but it depends on the implementation so no problems setting it. 
> @@ -1624,20 +1625,34 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, > return NULL; > } > > - for (i = 0; i < area->nr_pages; i++) { > - struct page *page; > - > - if (node == NUMA_NO_NODE) > - page = alloc_page(alloc_mask); > - else > - page = alloc_pages_node(node, alloc_mask, 0); > + for (i = 0; i < area->nr_pages;) { > + struct page *page = NULL; > + unsigned int chunk_order = min(ilog2(area->nr_pages - i), MAX_ORDER - 1); > + > + while (chunk_order && !page) { > + if (node == NUMA_NO_NODE) > + page = alloc_pages(multi_alloc_mask, chunk_order); > + else > + page = alloc_pages_node(node, multi_alloc_mask, chunk_order); > + if (page) > + split_page(page, chunk_order); > + else > + chunk_order--; > + } > + if (!page) { > + if (node == NUMA_NO_NODE) > + page = alloc_pages(alloc_mask, 0); > + else > + page = alloc_pages_node(node, alloc_mask, 0); > + } > > if (unlikely(!page)) { > /* Successfully allocated i pages, free them in __vunmap() */ > area->nr_pages = i; > goto fail; > } > - area->pages[i] = page; > + for (j = 0; j < (1 << chunk_order); j++) > + area->pages[i++] = page++; > if (gfpflags_allow_blocking(gfp_mask)) > cond_resched(); > } > > > ^ permalink raw reply [flat|nested] 12+ messages in thread
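The mask remark above can be illustrated with a small userspace sketch. This is not kernel code: the DEMO_* bit values are made up for illustration (the real flag definitions live in the kernel's gfp headers and differ), and the function names are hypothetical. Only the bit arithmetic mirrors the RFC. It shows that deriving multi_alloc_mask from gfp_mask loses __GFP_NOWARN, while deriving it from alloc_mask keeps it.

```c
#include <assert.h>

/* Hypothetical bit values, chosen only for this illustration. */
typedef unsigned int demo_gfp_t;
#define DEMO_GFP_DIRECT_RECLAIM 0x01u
#define DEMO_GFP_NOWARN         0x02u
#define DEMO_GFP_NORETRY        0x04u
#define DEMO_GFP_IO             0x08u

/* As posted in the first RFC: built from the caller's gfp_mask. */
static demo_gfp_t demo_multi_mask_from_gfp(demo_gfp_t gfp_mask)
{
	return (gfp_mask & ~DEMO_GFP_DIRECT_RECLAIM) | DEMO_GFP_NORETRY;
}

/* As suggested in the review: built from alloc_mask, which has NOWARN. */
static demo_gfp_t demo_multi_mask_from_alloc(demo_gfp_t gfp_mask)
{
	demo_gfp_t alloc_mask = gfp_mask | DEMO_GFP_NOWARN;

	return (alloc_mask & ~DEMO_GFP_DIRECT_RECLAIM) | DEMO_GFP_NORETRY;
}

/* Small self-check showing the difference between the two variants. */
static void demo_mask_self_check(void)
{
	demo_gfp_t caller = DEMO_GFP_IO | DEMO_GFP_DIRECT_RECLAIM;

	/* Both variants drop direct reclaim and set NORETRY. */
	assert(!(demo_multi_mask_from_gfp(caller) & DEMO_GFP_DIRECT_RECLAIM));
	assert(!(demo_multi_mask_from_alloc(caller) & DEMO_GFP_DIRECT_RECLAIM));
	/* Only the alloc_mask-based variant suppresses failure warnings. */
	assert(!(demo_multi_mask_from_gfp(caller) & DEMO_GFP_NOWARN));
	assert(demo_multi_mask_from_alloc(caller) & DEMO_GFP_NOWARN);
}
```

Calling demo_mask_self_check() from a test program exercises both variants. If the caller already passes the NOWARN bit itself, the two variants give the same result.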
* Re: Linux 4.9-rc6 2016-11-21 8:34 ` David Rientjes @ 2016-11-21 13:32 ` Eric Dumazet 2016-11-21 13:51 ` Eric Dumazet 0 siblings, 1 reply; 12+ messages in thread From: Eric Dumazet @ 2016-11-21 13:32 UTC (permalink / raw) To: David Rientjes; +Cc: Al Viro, Linus Torvalds, Linux Kernel Mailing List On Mon, 2016-11-21 at 00:34 -0800, David Rientjes wrote: > On Sun, 20 Nov 2016, Eric Dumazet wrote: > > > Another potential issue with CONFIG_VMAP_STACK is that we make no > > attempt to allocate 4 consecutive pages. > > > > Even if we have plenty of memory, 4 calls to alloc_page() are likely to > > give us 4 pages in completely different locations. > > > > Here I printed the hugepage number of the 4 pages for some stacks : > > > > > > 0xffffc9001a07c000-0xffffc9001a081000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfcac Hfeba Hfec0 Hfc9d N0=4 > > 0xffffc9001a084000-0xffffc9001a089000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfc79 Hfc79 Hfc79 Hfc83 N0=4 > > 0xffffc9001a08c000-0xffffc9001a091000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfc9b Hfe91 Hfebe Hfca2 N0=4 > > 0xffffc9001a094000-0xffffc9001a099000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfcaa Hfcaa Hfca6 Hfebc N0=4 > > 0xffffc9001a09c000-0xffffc9001a0a1000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfe9b Hfe90 Hff09 Hfefb N0=4 > > 0xffffc9001a0a4000-0xffffc9001a0a9000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfe94 Hfe62 Hfea0 Hfe7b N0=4 > > 0xffffc9001a0ac000-0xffffc9001a0b1000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfe78 Hff05 Hff05 Hfc74 N0=4 > > 0xffffc9001a0b4000-0xffffc9001a0b9000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfc9b Hfc9b Hfe83 Hf782 N0=4 > > 0xffffc9001a0bc000-0xffffc9001a0c1000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfe78 Hfe78 Hfc7f Hfc7f N0=4 > > 0xffffc9001a0c4000-0xffffc9001a0c9000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfebe Hfebe Hfe82 Hfe85 N0=4 > > 0xffffc9001a0cc000-0xffffc9001a0d1000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfc6b Hfe62 Hfe62 Hfcaa N0=4 > > 
0xffffc9001a0d4000-0xffffc9001a0d9000 20480 _do_fork+0xe1/0x360 pages=4 vmalloc Hfebd Hfebd Hfc92 Hfc92 N0=4 > > > > This is a vmalloc() generic issue that is worth fixing now ? > > > > Note this RFC might conflict with NUMA interleave policy. > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > index f2481cb4e6b2..0123e97debb9 100644 > > --- a/mm/vmalloc.c > > +++ b/mm/vmalloc.c > > @@ -1602,9 +1602,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, > > pgprot_t prot, int node) > > { > > struct page **pages; > > - unsigned int nr_pages, array_size, i; > > + unsigned int nr_pages, array_size, i, j; > > const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO; > > const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN; > > + const gfp_t multi_alloc_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_NORETRY; > > > > nr_pages = get_vm_area_size(area) >> PAGE_SHIFT; > > array_size = (nr_pages * sizeof(struct page *)); > > I think multi_alloc_mask wants to use alloc_mask rather than gfp_mask > before clearing the bit, otherwise the failed high-order allocations with > no chance to reclaim will spew page allocation failure warnings. Using > __GFP_NORETRY here would be a no-op, but it depends on the implementation > so no problems setting it. Oh, this was definitely my intent of course, thanks for noticing this typo ;) ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Linux 4.9-rc6 2016-11-21 13:32 ` Eric Dumazet @ 2016-11-21 13:51 ` Eric Dumazet 2016-11-21 16:49 ` Eric Dumazet 2016-12-04 10:43 ` Thorsten Leemhuis 0 siblings, 2 replies; 12+ messages in thread From: Eric Dumazet @ 2016-11-21 13:51 UTC (permalink / raw) To: David Rientjes; +Cc: Al Viro, Linus Torvalds, Linux Kernel Mailing List On Mon, 2016-11-21 at 05:32 -0800, Eric Dumazet wrote: > > Oh, this was definitely my intent of course, thanks for noticing this > typo ;) V2 is fixing this, and brings back NUMA spreading, (eg alloc_large_system_hash() done at boot time ) lpaa24:~# grep alloc_large /proc/vmallocinfo 0xffffc90000009000-0xffffc9000000c000 12288 alloc_large_system_hash+0x178/0x238 pages=2 vmalloc N0=1 N1=1 0xffffc9000000c000-0xffffc9000000f000 12288 alloc_large_system_hash+0x178/0x238 pages=2 vmalloc N0=1 N1=1 0xffffc9000001e000-0xffffc9000009f000 528384 alloc_large_system_hash+0x178/0x238 pages=128 vmalloc N0=64 N1=64 0xffffc9000009f000-0xffffc900000e0000 266240 alloc_large_system_hash+0x178/0x238 pages=64 vmalloc N0=32 N1=32 0xffffc900001d3000-0xffffc900101d4000 268439552 alloc_large_system_hash+0x178/0x238 pages=65536 vmalloc vpages N0=32768 N1=32768 0xffffc900101d4000-0xffffc900181d5000 134221824 alloc_large_system_hash+0x178/0x238 pages=32768 vmalloc vpages N0=16384 N1=16384 0xffffc900181d5000-0xffffc900185d6000 4198400 alloc_large_system_hash+0x178/0x238 pages=1024 vmalloc vpages N0=512 N1=512 0xffffc900185d6000-0xffffc900189d7000 4198400 alloc_large_system_hash+0x178/0x238 pages=1024 vmalloc vpages N0=512 N1=512 0xffffc9001b271000-0xffffc9001b672000 4198400 alloc_large_system_hash+0x178/0x238 pages=1024 vmalloc vpages N0=512 N1=512 0xffffc9001b672000-0xffffc9001b675000 12288 alloc_large_system_hash+0x178/0x238 pages=2 vmalloc N0=1 N1=1 0xffffc9001b675000-0xffffc9001b776000 1052672 alloc_large_system_hash+0x178/0x238 pages=256 vmalloc N0=128 N1=128 0xffffc9001b776000-0xffffc9001b977000 2101248 alloc_large_system_hash+0x178/0x238 pages=512 vmalloc 
N0=256 N1=256 0xffffc9001b977000-0xffffc9001bb78000 2101248 alloc_large_system_hash+0x178/0x238 pages=512 vmalloc N0=256 N1=256 0xffffc9001c075000-0xffffc9001c176000 1052672 alloc_large_system_hash+0x178/0x238 pages=256 vmalloc N0=128 N1=128 mm/vmalloc.c | 47 +++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 39 insertions(+), 8 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index f2481cb4e6b2..f4b9c9238f86 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -21,6 +21,7 @@ #include <linux/debugobjects.h> #include <linux/kallsyms.h> #include <linux/list.h> +#include <linux/mempolicy.h> #include <linux/notifier.h> #include <linux/rbtree.h> #include <linux/radix-tree.h> @@ -1602,9 +1603,11 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, pgprot_t prot, int node) { struct page **pages; - unsigned int nr_pages, array_size, i; + unsigned int nr_pages, array_size, i, j; const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO; const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN; + const gfp_t multi_alloc_mask = (alloc_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_NORETRY; + int max_node_order = MAX_ORDER - 1; nr_pages = get_vm_area_size(area) >> PAGE_SHIFT; array_size = (nr_pages * sizeof(struct page *)); @@ -1624,20 +1627,48 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, return NULL; } - for (i = 0; i < area->nr_pages; i++) { - struct page *page; + if (IS_ENABLED(CONFIG_NUMA) && nr_online_nodes > 1) { + struct mempolicy *policy = current->mempolicy; + int pages_per_node; - if (node == NUMA_NO_NODE) - page = alloc_page(alloc_mask); - else - page = alloc_pages_node(node, alloc_mask, 0); + if (policy && policy->mode == MPOL_INTERLEAVE) { + pages_per_node = DIV_ROUND_UP(nr_pages, + nr_online_nodes); + max_node_order = min(max_node_order, + ilog2(pages_per_node)); + } + } + + for (i = 0; i < area->nr_pages;) { + unsigned int chunk_order = min(ilog2(area->nr_pages - i), + max_node_order); + struct page 
*page = NULL; + + while (chunk_order) { + if (node == NUMA_NO_NODE) + page = alloc_pages(multi_alloc_mask, chunk_order); + else + page = alloc_pages_node(node, multi_alloc_mask, chunk_order); + if (page) { + split_page(page, chunk_order); + break; + } + chunk_order--; + } + if (!page) { + if (node == NUMA_NO_NODE) + page = alloc_pages(alloc_mask, 0); + else + page = alloc_pages_node(node, alloc_mask, 0); + } if (unlikely(!page)) { /* Successfully allocated i pages, free them in __vunmap() */ area->nr_pages = i; goto fail; } - area->pages[i] = page; + for (j = 0; j < (1U << chunk_order); j++) + area->pages[i++] = page++; if (gfpflags_allow_blocking(gfp_mask)) cond_resched(); } ^ permalink raw reply related [flat|nested] 12+ messages in thread
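The chunking logic of the V2 patch above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: all demo_* names are hypothetical, DEMO_MAX_ORDER is an assumed value, the page allocator is replaced by a single "largest order that currently succeeds" parameter, and order-0 allocations are assumed never to fail (the real patch handles that failure with "goto fail").

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_ORDER 11 /* assumption, stands in for the kernel's MAX_ORDER */

/* floor(log2(n)) for n > 0, mirroring what ilog2() returns. */
static unsigned int demo_ilog2(unsigned int n)
{
	unsigned int r = 0;

	assert(n > 0);
	while (n >>= 1)
		r++;
	return r;
}

static unsigned int demo_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Order cap as in V2: with an interleave policy on several nodes, no
 * chunk may exceed one node's share of the area, so pages still spread.
 */
static unsigned int demo_max_node_order(unsigned int nr_pages,
					unsigned int nr_nodes, int interleave)
{
	unsigned int max_order = DEMO_MAX_ORDER - 1;

	if (interleave && nr_nodes > 1) {
		unsigned int pages_per_node = (nr_pages + nr_nodes - 1) / nr_nodes;

		max_order = demo_min(max_order, demo_ilog2(pages_per_node));
	}
	return max_order;
}

/*
 * Record the order chosen for each chunk. largest_free_order is a
 * stand-in for the allocator: any request above it "fails" and the
 * loop retries one order lower, down to order 0.
 * Returns the number of chunks written to orders[].
 */
static size_t demo_plan_chunks(unsigned int nr_pages, unsigned int max_order,
			       unsigned int largest_free_order,
			       unsigned int *orders, size_t cap)
{
	size_t n = 0;
	unsigned int i = 0;

	while (i < nr_pages && n < cap) {
		unsigned int order = demo_min(demo_ilog2(nr_pages - i), max_order);

		while (order > largest_free_order)
			order--;
		orders[n++] = order;
		i += 1u << order;
	}
	return n;
}
```

For a 4-page stack with no fragmentation this yields a single order-2 chunk; if only order-1 blocks are available it yields two order-1 chunks. With interleave over 2 nodes, a 2-page area is capped at order 0, which matches the "pages=2 ... N0=1 N1=1" lines in the vmallocinfo output above.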
* Re: Linux 4.9-rc6 2016-11-21 13:51 ` Eric Dumazet @ 2016-11-21 16:49 ` Eric Dumazet 2016-12-04 10:43 ` Thorsten Leemhuis 1 sibling, 0 replies; 12+ messages in thread From: Eric Dumazet @ 2016-11-21 16:49 UTC (permalink / raw) To: David Rientjes; +Cc: Al Viro, Linus Torvalds, Linux Kernel Mailing List On Mon, 2016-11-21 at 05:51 -0800, Eric Dumazet wrote: > + while (chunk_order) { > + if (node == NUMA_NO_NODE) > + page = alloc_pages(multi_alloc_mask, chunk_order); > + else > + page = alloc_pages_node(node, multi_alloc_mask, chunk_order); > + if (page) { > + split_page(page, chunk_order); > + break; > + } > + chunk_order--; > + } We could also remember the page order with set_page_private() and speed up show_numa_info(). I wonder if we could avoid the split_page() and speed up vfree(). ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Linux 4.9-rc6 2016-11-21 13:51 ` Eric Dumazet 2016-11-21 16:49 ` Eric Dumazet @ 2016-12-04 10:43 ` Thorsten Leemhuis [not found] ` <CA+55aFzPiZW4FfWbvM-+AFraa0fkUHv4C1Y9SCzHdXEcUSPqdg@mail.gmail.com> 1 sibling, 1 reply; 12+ messages in thread From: Thorsten Leemhuis @ 2016-12-04 10:43 UTC (permalink / raw) To: Eric Dumazet, David Rientjes Cc: Al Viro, Linus Torvalds, Linux Kernel Mailing List Lo! On 21.11.2016 14:51, Eric Dumazet wrote: > On Mon, 2016-11-21 at 05:32 -0800, Eric Dumazet wrote: >> Oh, this was definitely my intent of course, thanks for noticing this >> typo ;) > V2 is fixing this, and brings back NUMA spreading, > (eg alloc_large_system_hash() done at boot time ) What's the status of the patch below? From the discussion it looks a lot like it was developed to fix a regression in 4.9, but the patch afaics has hit neither mainline nor linux-next yet. That's why I'm inclined to add it to this week's regression report. Ciao, Thorsten > lpaa24:~# grep alloc_large /proc/vmallocinfo > 0xffffc90000009000-0xffffc9000000c000 12288 alloc_large_system_hash+0x178/0x238 pages=2 vmalloc N0=1 N1=1 > 0xffffc9000000c000-0xffffc9000000f000 12288 alloc_large_system_hash+0x178/0x238 pages=2 vmalloc N0=1 N1=1 > 0xffffc9000001e000-0xffffc9000009f000 528384 alloc_large_system_hash+0x178/0x238 pages=128 vmalloc N0=64 N1=64 > 0xffffc9000009f000-0xffffc900000e0000 266240 alloc_large_system_hash+0x178/0x238 pages=64 vmalloc N0=32 N1=32 > 0xffffc900001d3000-0xffffc900101d4000 268439552 alloc_large_system_hash+0x178/0x238 pages=65536 vmalloc vpages N0=32768 N1=32768 > 0xffffc900101d4000-0xffffc900181d5000 134221824 alloc_large_system_hash+0x178/0x238 pages=32768 vmalloc vpages N0=16384 N1=16384 > 0xffffc900181d5000-0xffffc900185d6000 4198400 alloc_large_system_hash+0x178/0x238 pages=1024 vmalloc vpages N0=512 N1=512 > 0xffffc900185d6000-0xffffc900189d7000 4198400 alloc_large_system_hash+0x178/0x238 pages=1024 vmalloc vpages N0=512 N1=512 > 0xffffc9001b271000-0xffffc9001b672000 
4198400 alloc_large_system_hash+0x178/0x238 pages=1024 vmalloc vpages N0=512 N1=512 > 0xffffc9001b672000-0xffffc9001b675000 12288 alloc_large_system_hash+0x178/0x238 pages=2 vmalloc N0=1 N1=1 > 0xffffc9001b675000-0xffffc9001b776000 1052672 alloc_large_system_hash+0x178/0x238 pages=256 vmalloc N0=128 N1=128 > 0xffffc9001b776000-0xffffc9001b977000 2101248 alloc_large_system_hash+0x178/0x238 pages=512 vmalloc N0=256 N1=256 > 0xffffc9001b977000-0xffffc9001bb78000 2101248 alloc_large_system_hash+0x178/0x238 pages=512 vmalloc N0=256 N1=256 > 0xffffc9001c075000-0xffffc9001c176000 1052672 alloc_large_system_hash+0x178/0x238 pages=256 vmalloc N0=128 N1=128 > > > mm/vmalloc.c | 47 +++++++++++++++++++++++++++++++++++++++-------- > 1 file changed, 39 insertions(+), 8 deletions(-) > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index f2481cb4e6b2..f4b9c9238f86 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -21,6 +21,7 @@ > #include <linux/debugobjects.h> > #include <linux/kallsyms.h> > #include <linux/list.h> > +#include <linux/mempolicy.h> > #include <linux/notifier.h> > #include <linux/rbtree.h> > #include <linux/radix-tree.h> > @@ -1602,9 +1603,11 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, > pgprot_t prot, int node) > { > struct page **pages; > - unsigned int nr_pages, array_size, i; > + unsigned int nr_pages, array_size, i, j; > const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO; > const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN; > + const gfp_t multi_alloc_mask = (alloc_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_NORETRY; > + int max_node_order = MAX_ORDER - 1; > > nr_pages = get_vm_area_size(area) >> PAGE_SHIFT; > array_size = (nr_pages * sizeof(struct page *)); > @@ -1624,20 +1627,48 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, > return NULL; > } > > - for (i = 0; i < area->nr_pages; i++) { > - struct page *page; > + if (IS_ENABLED(CONFIG_NUMA) && nr_online_nodes > 1) { > + struct mempolicy 
*policy = current->mempolicy; > + int pages_per_node; > > - if (node == NUMA_NO_NODE) > - page = alloc_page(alloc_mask); > - else > - page = alloc_pages_node(node, alloc_mask, 0); > + if (policy && policy->mode == MPOL_INTERLEAVE) { > + pages_per_node = DIV_ROUND_UP(nr_pages, > + nr_online_nodes); > + max_node_order = min(max_node_order, > + ilog2(pages_per_node)); > + } > + } > + > + for (i = 0; i < area->nr_pages;) { > + unsigned int chunk_order = min(ilog2(area->nr_pages - i), > + max_node_order); > + struct page *page = NULL; > + > + while (chunk_order) { > + if (node == NUMA_NO_NODE) > + page = alloc_pages(multi_alloc_mask, chunk_order); > + else > + page = alloc_pages_node(node, multi_alloc_mask, chunk_order); > + if (page) { > + split_page(page, chunk_order); > + break; > + } > + chunk_order--; > + } > + if (!page) { > + if (node == NUMA_NO_NODE) > + page = alloc_pages(alloc_mask, 0); > + else > + page = alloc_pages_node(node, alloc_mask, 0); > + } > > if (unlikely(!page)) { > /* Successfully allocated i pages, free them in __vunmap() */ > area->nr_pages = i; > goto fail; > } > - area->pages[i] = page; > + for (j = 0; j < (1U << chunk_order); j++) > + area->pages[i++] = page++; > if (gfpflags_allow_blocking(gfp_mask)) > cond_resched(); > } > > ^ permalink raw reply [flat|nested] 12+ messages in thread
[parent not found: <CA+55aFzPiZW4FfWbvM-+AFraa0fkUHv4C1Y9SCzHdXEcUSPqdg@mail.gmail.com>]
* Re: Linux 4.9-rc6 [not found] ` <CA+55aFzPiZW4FfWbvM-+AFraa0fkUHv4C1Y9SCzHdXEcUSPqdg@mail.gmail.com> @ 2016-12-04 17:17 ` Eric Dumazet 2016-12-21 15:30 ` Eric Dumazet 0 siblings, 1 reply; 12+ messages in thread From: Eric Dumazet @ 2016-12-04 17:17 UTC (permalink / raw) To: Linus Torvalds Cc: Thorsten Leemhuis, Linux Kernel Mailing List, Al Viro, David Rientjes On Sun, 2016-12-04 at 03:10 -0800, Linus Torvalds wrote: > > > On Dec 4, 2016 02:43, "Thorsten Leemhuis" <regressions@leemhuis.info> > wrote: > > > What's the status of the patch below? From the discussion it looks a > lot like > it was developed to fix a regression in 4.9, but the patch > afaics has > hit neither mainline nor linux-next yet. > > > It's not a regression as far as I can tell. It's a small optimization. > Maybe. > > > It's not going into 4.9, is not even clear it's worth it later either, > unless somebody had numbers (which I haven't seen) > Right, the patch was not in any way ready for 4.9 ;) I'll try to complete this for next cycle. Thanks. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Linux 4.9-rc6 2016-12-04 17:17 ` Eric Dumazet @ 2016-12-21 15:30 ` Eric Dumazet 0 siblings, 0 replies; 12+ messages in thread From: Eric Dumazet @ 2016-12-21 15:30 UTC (permalink / raw) To: Linus Torvalds Cc: Thorsten Leemhuis, Linux Kernel Mailing List, Al Viro, David Rientjes, Hugh Dickins On Sun, 2016-12-04 at 09:17 -0800, Eric Dumazet wrote: > On Sun, 2016-12-04 at 03:10 -0800, Linus Torvalds wrote: > > > > > > On Dec 4, 2016 02:43, "Thorsten Leemhuis" <regressions@leemhuis.info> > > wrote: > > > > > > What's the status of the patch below? From the discussion it looks a > > lot like > > it was developed to fix a regression in 4.9, but the patch > > afaics has > > hit neither mainline nor linux-next yet. > > > > > > It's not a regression as far as I can tell. It's a small optimization. > > Maybe. > > > > > > It's not going into 4.9, is not even clear it's worth it later either, > > unless somebody had numbers (which I haven't seen) > > > Right, the patch was not in any way ready for 4.9 ;) > > I'll try to complete this for next cycle. I now have a hacky patch that also adds PMD alignment for large allocations and supports hugepages (this last part depends on CONFIG_HAVE_ARCH_HUGE_VMAP at the moment, so x86/arm64 so far). Toshi Kani added pmd_set_huge() in commit e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings"); I am not sure why vmalloc() was not considered (or I might have missed it completely). It seems to provide a gain of about 25 cycles per random access for large tables on my x86 lab hosts (I did a test with a program having 10 million fds). Allocations of 2 MB or more (pages >= 512), like the Dentry cache, Inode-cache, TCP established hash table, or large alloc_fdmem() ones, might benefit from this. 
lpaa23:~# grep large /proc/vmallocinfo 0xffffc90000009000-0xffffc9000000c000 12288 alloc_large_system_hash+0x189/0x253 pages=2 vmalloc N0=1 N1=1 0xffffc9000000c000-0xffffc9000000f000 12288 alloc_large_system_hash+0x189/0x253 pages=2 vmalloc N0=1 N1=1 0xffffc9000001e000-0xffffc9000009f000 528384 alloc_large_system_hash+0x189/0x253 pages=128 vmalloc N0=64 N1=64 0xffffc9000009f000-0xffffc900000e0000 266240 alloc_large_system_hash+0x189/0x253 pages=64 vmalloc N0=32 N1=32 0xffffc900001d9000-0xffffc900001dc000 12288 alloc_large_system_hash+0x189/0x253 pages=2 vmalloc N0=1 N1=1 0xffffc90000200000-0xffffc90010201000 268439552 alloc_large_system_hash+0x189/0x253 pages=65536 vmalloc vpages N0=32768 N1=32768 0xffffc90010400000-0xffffc90018401000 134221824 alloc_large_system_hash+0x189/0x253 pages=32768 vmalloc vpages N0=16384 N1=16384 0xffffc90018600000-0xffffc90018a01000 4198400 alloc_large_system_hash+0x189/0x253 pages=1024 vmalloc vpages N0=512 N1=512 0xffffc90018c00000-0xffffc90019001000 4198400 alloc_large_system_hash+0x189/0x253 pages=1024 vmalloc vpages N0=512 N1=512 0xffffc9001b249000-0xffffc9001b34a000 1052672 alloc_large_system_hash+0x189/0x253 pages=256 vmalloc N0=128 N1=128 0xffffc9001b400000-0xffffc9001b801000 4198400 alloc_large_system_hash+0x189/0x253 pages=1024 vmalloc vpages N0=512 N1=512 0xffffc9001ba00000-0xffffc9001bc01000 2101248 alloc_large_system_hash+0x189/0x253 pages=512 vmalloc N0=256 N1=256 0xffffc9001bc01000-0xffffc9001bd02000 1052672 alloc_large_system_hash+0x189/0x253 pages=256 vmalloc N0=128 N1=128 0xffffc9001be00000-0xffffc9001c001000 2101248 alloc_large_system_hash+0x189/0x253 pages=512 vmalloc N0=256 N1=256 I won't be able to split this patch into 3 parts before January 6th, after my vacation. I am showing the WIP in case anyone is interested in seeing it. 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index a5584384eabc..055b027ee659 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -21,6 +21,7 @@ #include <linux/debugobjects.h> #include <linux/kallsyms.h> #include <linux/list.h> +#include <linux/mempolicy.h> #include <linux/notifier.h> #include <linux/rbtree.h> #include <linux/radix-tree.h> @@ -154,6 +155,18 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr, return -ENOMEM; do { next = pmd_addr_end(addr, end); +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP + if (next - addr == PMD_SIZE) { + struct page *page = pages[*nr]; + + if (compound_order(page) == PMD_SHIFT - PAGE_SHIFT) { + if (pmd_set_huge(pmd, page_to_phys(page), prot)) { + (*nr) += 1 << (PMD_SHIFT - PAGE_SHIFT); + continue; + } + } + } +#endif if (vmap_pte_range(pmd, addr, next, prot, pages, nr)) return -ENOMEM; } while (pmd++, addr = next, addr != end); @@ -1349,7 +1362,8 @@ static struct vm_struct *__get_vm_area_node(unsigned long size, if (flags & VM_IOREMAP) align = 1ul << clamp_t(int, get_count_order_long(size), PAGE_SHIFT, IOREMAP_MAX_ORDER); - + else if (size >= PMD_SIZE) + align = PMD_SIZE; area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node); if (unlikely(!area)) return NULL; @@ -1482,11 +1496,14 @@ static void __vunmap(const void *addr, int deallocate_pages) if (deallocate_pages) { int i; - for (i = 0; i < area->nr_pages; i++) { + for (i = 0; i < area->nr_pages;) { struct page *page = area->pages[i]; + unsigned int order; BUG_ON(!page); - __free_pages(page, 0); + order = compound_order(page); + __free_pages(page, order); + i += 1 << order; } kvfree(area->pages); @@ -1613,16 +1630,39 @@ EXPORT_SYMBOL(vmap); static void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask, pgprot_t prot, int node, const void *caller); + +static int vmalloc_max_order(int node, int nr_pages) +{ + int max_node_order = min(PMD_SHIFT - PAGE_SHIFT, MAX_ORDER - 1); + +#if defined(CONFIG_NUMA) + if (nr_online_nodes > 1 && node == NUMA_NO_NODE) { 
+ struct mempolicy *pol = current->mempolicy; + int pages_per_node, nr_nodes; + + if (pol && pol->mode == MPOL_INTERLEAVE) { + nr_nodes = nodes_weight(pol->v.nodes); + pages_per_node = DIV_ROUND_UP(nr_pages, nr_nodes); + max_node_order = min(max_node_order, + ilog2(pages_per_node)); + } + } +#endif + return max_node_order; +} + static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, pgprot_t prot, int node) { struct page **pages; - unsigned int nr_pages, array_size, i; + unsigned int nr_pages, array_size, i, j; const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO; const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN; + int max_node_order; nr_pages = get_vm_area_size(area) >> PAGE_SHIFT; array_size = (nr_pages * sizeof(struct page *)); + max_node_order = vmalloc_max_order(node, nr_pages); area->nr_pages = nr_pages; /* Please note that the recursion is strictly bounded. */ @@ -1639,20 +1679,31 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, return NULL; } - for (i = 0; i < area->nr_pages; i++) { - struct page *page; - if (node == NUMA_NO_NODE) - page = alloc_page(alloc_mask); - else - page = alloc_pages_node(node, alloc_mask, 0); + for (i = 0; i < area->nr_pages;) { + int order = min(ilog2(area->nr_pages - i), max_node_order); + struct page *page; - if (unlikely(!page)) { - /* Successfully allocated i pages, free them in __vunmap() */ - area->nr_pages = i; - goto fail; + for (;;) { + gfp_t gfp = alloc_mask; + + if (order > 0) + gfp = (gfp & ~__GFP_DIRECT_RECLAIM) | + __GFP_NORETRY | __GFP_COMP; + if (node == NUMA_NO_NODE) + page = alloc_pages(gfp, order); + else + page = alloc_pages_node(node, gfp, order); + if (page) + break; + if (unlikely(--order < 0)) { + /* Successfully allocated i pages, free them in __vunmap() */ + area->nr_pages = i; + goto fail; + } } - area->pages[i] = page; + for (j = 0; j < (1U << order); j++) + area->pages[i++] = page++; if (gfpflags_allow_blocking(gfp_mask)) cond_resched(); } @@ 
-2619,9 +2670,13 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v) memset(counters, 0, nr_node_ids * sizeof(unsigned int)); - for (nr = 0; nr < v->nr_pages; nr++) - counters[page_to_nid(v->pages[nr])]++; + for (nr = 0; nr < v->nr_pages;) { + struct page *page = v->pages[nr]; + int npages = 1 << compound_order(page); + counters[page_to_nid(page)] += npages; + nr += npages; + } for_each_node_state(nr, N_HIGH_MEMORY) if (counters[nr]) seq_printf(m, " N%u=%u", nr, counters[nr]); ^ permalink raw reply related [flat|nested] 12+ messages in thread
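The WIP patch above changes the loops in __vunmap() and show_numa_info() to step over the pages[] array one chunk at a time instead of one page at a time. A userspace sketch of that walk follows. It is an illustration only: struct demo_page and all demo_* names are hypothetical, the order field stands in for compound_order() on a head page, and the shift values assume x86-64 with 4 KB base pages and 2 MB PMD mappings.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SHIFT 12 /* assumption: 4 KB base pages */
#define DEMO_PMD_SHIFT  21 /* assumption: 2 MB PMD mappings */

/* Hypothetical stand-in for struct page. */
struct demo_page {
	int nid;            /* NUMA node the page lives on */
	unsigned int order; /* chunk order, meaningful on the head page only */
};

/* Order of a chunk that fills exactly one PMD entry. */
static unsigned int demo_pmd_order(void)
{
	return DEMO_PMD_SHIFT - DEMO_PAGE_SHIFT;
}

/*
 * Per-node page counts, visiting only the head page of each chunk and
 * skipping 1 << order entries per step, as in the reworked loops.
 * Returns how many entries were actually inspected.
 */
static size_t demo_count_per_node(struct demo_page **pages, size_t nr_pages,
				  unsigned int *counters, size_t nr_nodes)
{
	size_t nr = 0, visits = 0;

	while (nr < nr_pages) {
		const struct demo_page *page = pages[nr];
		size_t npages = (size_t)1 << page->order;

		assert(page->nid >= 0 && (size_t)page->nid < nr_nodes);
		counters[page->nid] += (unsigned int)npages;
		nr += npages;
		visits++;
	}
	return visits;
}
```

With these assumed shifts, demo_pmd_order() is 9, i.e. 512 pages or 2 MB per chunk, which is where the "pages >= 512" threshold mentioned earlier in this message comes from. A fully huge-page-backed 65536-page table would then need 128 visits instead of 65536.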