* Linux 3.19-rc5 @ 2015-01-18 9:17 Linus Torvalds 2015-01-19 18:02 ` Bruno Prémont 0 siblings, 1 reply; 31+ messages in thread From: Linus Torvalds @ 2015-01-18 9:17 UTC (permalink / raw) To: Linux Kernel Mailing List Another week, another -rc. Fairly normal release, although I'd wish that by rc5 we'd have calmed down even further. But no, with some of the driver tree merges in particular, this is actually larger than rc4 was. That said, it's not like there is anything particularly scary in here. The arm64 vm bug that I mentioned as pending in the rc4 notes got fixed within a day of that previous rc release, and the rest looks pretty standard. Mostly drivers (networking, usb, scsi target, block layer, mmc, tty etc), but also arch updates (arm, x86, s390 and some tiny powerpc fixes), some filesystem updates (fuse and nfs), tracing fixes, and some perf tooling fixes. Shortlog with the details appended. Go forth and test. Linus --- Abhilash Kesavan (2): drivers: bus: check cci device tree node status ARM: dts: disable CCI on exynos5420 based arndale-octa Adrian Hunter (6): mmc: sdhci: Tuning should not change max_blk_count mmc: sdhci: Add out_unlock to sdhci_execute_tuning mmc: sdhci: Simplify use of tuning timer mmc: sdhci: Disable re-tuning for HS400 mmc: sdhci-acpi: Add ACPI HID INT344D mmc: sdhci-pci: Add support for Intel SPT Alan Stern (2): USB: EHCI: fix initialization bug in iso_stream_schedule() USB: EHCI: adjust error return code Alexander Stein (1): ARM: at91/dt: sam9263: Add missing clocks to lcdc node Alexander Usyskin (1): mei: clean reset bit before reset Alexandre Belloni (1): net/at91_ether: prepare and unprepare clock Alexey Brodkin (1): perf tools: Fix statfs.f_type data type mismatch build error with uclibc Alexey Khoroshilov (1): usb/kaweth: use GFP_ATOMIC under spin_lock in usb_start_wait_urb() Amit Virdi (2): usb: dwc3: gadget: Fix TRB preparation during SG usb: dwc3: gadget: Stop TRB preparation after limit is reached Andreas Faerber (1): ARM: exynos_defconfig: Enable LM90 driver Andrey Skvortsov (1): selftests/vm: fix link error for transhuge-stress test Andy Grover (1): iscsi-target: Fix typos in enum cmd_flags_table Andy Shevchenko (1): dmaengine: dw: balance PM runtime calls Ani Sinha (1): update ip-sysctl.txt documentation (v2) Anjali Singhai (2): i40e: Fix Rx checksum error counter i40e: Fix bug with TCP over IPv6 over VXLAN Anson Huang (1): Thermal: imx: add clk disable/enable for suspend/resume Anton Blanchard (1): powernv: Fix OPAL tracepoint code Arik Nemtsov (1): iwlwifi: pcie: correctly define 7265-D cfg Arnaldo Carvalho de Melo (1): tools: Remove bitops/hweight usage of bits in tools/perf Arnd Bergmann (2): usb: phy: mv-usb: fix usb_phy build errors bridge: only provide proxy ARP when CONFIG_INET is enabled Arseny Solokha (1): OHCI: add a quirk for ULi M5237 blocking on reset Axel Lin (2): gpio: dln2: Fix gpio output value in dln2_gpio_direction_output() gpio: grgpio: Avoid potential NULL pointer dereference B Viswanath (1): net: Corrected the comment describing the ndo operations to reflect the actual prototype for couple of operations Benjamin Poirier (1): netdevice: Add missing parentheses in macro Bo Shen (4): usb: gadget: udc: atmel: change setting for DMA usb: gadget: udc: atmel: fix possible IN hang issue ARM: at91/dt: sama5d4: fix the timer reg length ARM: at91: sama5d3: dt: correct the sound route Boris Brezillon (1): clk: at91: keep slow clk enabled to prevent system hang Boris Ostrovsky (2): x86/xen: Remove unnecessary BUG_ON(preemptible()) in xen_setup_timer() x86/xen: Free bootmem in free_p2m_page() during early boot Catalin Marinas (1): arm64: partially revert "ARM: 8167/1: extend the reserved memory for initrd to be page aligned" Chanwoo Choi (1): serial: samsung: Add the support for Exynos5433 SoC Chen Gang (1): s390/timex: fix get_tod_clock_ext() inline assembly Christian Borntraeger (3): s390/vtime: Get rid of redundant WARN_ON s390/kernel: use stnsm 255 instead of stosm 0 kernel: Change ASSIGN_ONCE(val, x) to WRITE_ONCE(x, val) Christoph Hellwig (1): scsi: ->queue_rq can't sleep Christoph Jaeger (1): packet: bail out of packet_snd() if L2 header creation fails Chuck Lever (1): NFS: Ignore transport protocol when detecting server trunking Colin Ian King (1): fbdev/broadsheetfb: fix memory leak Dan Carpenter (3): ipvs: uninitialized data with IP_VS_IPV6 phy: miphy28lp: unlock on error in miphy28lp_init() usb: gadget: gadgetfs: fix an oops in ep_write() Darrick J. Wong (1): uas: disable UAS on Apricorn SATA dongles David Drysdale (1): selftests/exec: allow shell return code of 126 David Peterson (1): USB: cp210x: add IDs for CEL USB sticks and MeshWorks devices David Spinadel (2): iwlwifi: mvm: add a flag to enable match found notification iwlwifi: mvm: scan dwell time corrections David Vrabel (3): x86/xen: don't count how many PFNs are identity mapped x86/xen: add extra memory for remapped frames during setup xen-netfront: use different locks for Rx and Tx stats Doug Anderson (1): ARM: dts: rockchip: bump sd card pin drive strength up on rk3288-evb Eddie Kovsky (1): staging: vt6655: fix sparse warning: argument type Eduardo Valentin (1): MAINTAINERS: update ti-soc-thermal status Emmanuel Grumbach (2): iwlwifi: 7000: fix reported firmware name for 7265D iwlwifi: bump firmware API for mvm devices to 12 Eric Dumazet (2): net: dnet: fix dnet_poll() alx: fix alx_poll() Eyal Shapira (2): iwlwifi: mvm: fix Rx with both chains iwlwifi: mvm: fix out of bounds access to tid_to_mac80211_ac Fabien Proriol (1): iio: iio: Fix iio_channel_read return if channel havn't info Fabio Estevam (3): ARM: dts: imx25: Fix the SPI1 clocks ARM: imx6sx: Set PLL2 as parent of QSPI clocks ARM: dts: imx51-babbage: Fix ULPI PHY reset modelling Felipe Balbi (2): usb: musb: debugfs: cope with blackfin's oddities usb: musb: blackfin: fix build break Gary Bisson (1): ARM: clk-imx6q: fix video divider for rev T0 1.0 Geert Uytterhoeven (5): ARM: shmobile: r8a7740: Instantiate GIC from C board code in legacy builds thermal: of: Remove bogus type qualifier for of_thermal_get_trip_points() ARM: shmobile: sh73a0 legacy: Set .control_parent for all irqpin instances m68k: Wire up execveat thermal: rcar: Spelling/grammar: s/drier use .../driver uses ...s/ Giel van Schijndel (1): isdn: fix NUL (\0 or \x00) specification in string Hans de Goede (7): phy-sun4i-usb: Change disconnect threshold value for sun6i xhci: Add broken-streams quirk for Fresco Logic FL1000G xhci controllers uas: Add US_FL_NO_ATA_1X for Seagate devices with usb-id 0bc2:a013 uas: Add US_FL_NO_REPORT_OPCODES for JMicron JMS566 with usb-id 0bc2:a013 uas: Do not blacklist ASM1153 disk enclosures uas: Add US_FL_NO_ATA_1X for 2 more Seagate disk enclosures simplefb: Fix build failure on Sparc Harald Freudenberger (1): s390/zcrypt: kernel oops at insmod of the z90crypt device driver Hariprasad Shenai (2): cxgb4vf: Initialize mdio_addr before using it cxgb4vf: Fix queue allocation for 40G adapter Heikki Krogerus (1): usb: dwc3: pci: add support for Intel Sunrise Point PCH Heiko Carstens (1): s390: wire up execveat syscall Heiko Stuebner (2): clk: rockchip: fix rk3066 pll lock bit location clk: rockchip: fix rk3288 cpuclk core dividers Heiko Stübner (2): ARM: rockchip: disable jtag/sdmmc autoswitching on rk3288 clk: rockchip: fix deadlock possibility in cpuclk Hubert Feurstein (1): net: fec: fix NULL pointer dereference in fec_enet_timeout_work Ian Munsie (1): cxl: Fix issues when unmapping contexts James Bottomley (1): serial: fix parisc boot hang Jan Beulich (1): x86/xen: properly retrieve NMI reason Jan Willeke (1): s390/uprobes: fix user space PER events Javi Merino (1): Documentation: thermal: document of_cpufreq_cooling_register() Javier Martinez Canillas (1): ARM: exynos_defconfig: Enable options for display panel support Jean-Francois Remy (1): neighbour: fix base_reachable_time(_ms) not effective immediatly when changed Jens Axboe (3): block: wake up waiters when a queue is marked dying blk-mq: get rid of ->cmd_size in the hardware queue blk-mq: Add helper to abort requeued requests Jeremiah Mahler (2): usb: serial: silence all non-critical read errors usb: serial: handle -ENODEV quietly in generic_submit_read_urb Jesse Brandeburg (1): i40e: fix un-necessary Tx hangs Jiri Pirko (1): team: avoid possible underflow of count_pending value for notify_peers and mcast_rejoin Jisheng Zhang (4): ARM: dts: berlin: fix io clk and add missing core clk for BG2Q sdhci2 host ARM: dts: berlin: add broken-cd and set bus width for eMMC in Marvell DMP DT ARM: dts: berlin: correct BG2Q's SM GPIO location. clk: berlin: bg2q: remove non-exist "smemc" gate clock Johan Hovold (3): USB: keyspan: fix null-deref at probe USB: console: fix uninitialised ldisc semaphore USB: console: fix potential use after free Johannes Thumshirn (1): mcb: mcb-pci: Only remap the 1st 0x200 bytes of BAR 0 John W. Linville (1): usb: gadget: udc: avoid dereference before NULL check in ep_queue Jon Paul Maloy (1): tipc: fix bug in broadcast retransmit code Juergen Gross (4): xen: correct error for building p2m list on 32 bits xen: correct race in alloc_p2m_pmd() xen: use correct type for physical addresses xen: check for zero sized area when invalidating memory Julia Lawall (1): usb: gadget: fix misspelling of current function in string Julien CHAUVEAU (1): clk: rockchip: add CLK_IGNORE_UNUSED flag to fix rk3066/rk3188 USB Host Kan Liang (1): perf/x86/intel: Fix bug for "cycles:p" and "cycles:pp" on SLM Keith Busch (14): blk-mq: Exit queue on alloc failure blk-mq: Export freeze/unfreeze functions NVMe: Fix double free irq blk-mq: Wake tasks entering queue on dying blk-mq: Export if requests were started blk-mq: Let drivers cancel requeue_work blk-mq: Allow requests to never expire blk-mq: End unstarted requests on a dying queue NVMe: Start all requests NVMe: Reference count admin queue usage NVMe: Admin queue removal handling NVMe: Command abort handling fixes NVMe: Start and stop h/w queues on reset NVMe: Fix locking on abort handling Kevin Hao (1): Revert "clk: ppc-corenet: Fix Section mismatch warning" Krzysztof Kozlowski (1): mmc: sdhci: Fix sleep in atomic after inserting SD card Larry Finger (1): rtlwifi: Fix error when accessing unmapped memory in skb Lars-Peter Clausen (1): iio: ad799x: Fix ad7991/ad7995/ad7999 config setup Lee Duncan (1): target: Allow Write Exclusive non-reservation holders to READ Lennart Sorensen (3): ARM: omap5/dra7xx: Fix frequency typos ARM: dra7xx: Fix counter frequency drift for AM572x errata i856 ARM: omap5/dra7xx: Enable booting secondary CPU in HYP mode Linus Torvalds (1): Linux 3.19-rc5 Linus Walleij (1): ARM: nomadik: fix up leftover device tree pins Louis Langholtz (1): kernel: avoid overflow in cmp_range Malcolm Priestley (2): staging: vt6655: vnt_tx_packet Fix corrupted tx packets. staging: vt6655: Fix loss of distant/weak access points on channel change. Marc Zyngier (2): arm64: KVM: Fix TLB invalidation by IPA/VMID arm64: KVM: Fix HCR setting for 32bit guests Mario Schuknecht (1): usb: gadget: gadgetfs: Free memory allocated by memdup_user() Martin Schwidefsky (1): s390/mm: avoid using pmd_to_page for !USE_SPLIT_PMD_PTLOCKS Mathias Nyman (1): xhci: Check if slot is already in default state before moving it there Maxime Ripard (1): usb: phy: Fix deferred probing Michael Ellerman (1): powerpc: Work around gcc bug in current_thread_info() Michael Holzheu (2): s390/bpf: Fix ALU_NEG (A = -A) s390/bpf: Fix JMP_JGE_X (A > X) and JMP_JGT_X (A >= X) Mike Krinkin (1): staging: vt6655: fix sparse warnings: incorrect argument type Miklos Szeredi (2): fuse: fix LOOKUP vs INIT compat handling fuse: add memory barrier to INIT Ming Lei (1): block: fix checking return value of blk_mq_init_queue Mugunthan V N (2): ARM: dts: dra7-evm: fix qspi device tree partition size drivers: net: cpsw: fix multicast flush in dual emac mode Namhyung Kim (4): perf probe: Propagate error code when write(2) failed perf tools: Fix building error in x86_64 when dwarf unwind is on perf machine: Fix __machine__findnew_thread() error path perf tools: Fix segfault for symbol annotation on TUI NeilBrown (1): locks: fix NULL-deref in generic_delete_lease Nicholas Bellinger (4): vhost-scsi: Add missing virtio-scsi -> TCM attribute conversion Documentation/target: Update fabric_ops to latest code target: Drop arbitrary maximum I/O size limit target: Drop left-over fabric_max_sectors attribute Nilesh Javali (1): MAINTAINERS: Update maintainer list for qla4xxx Nishanth Menon (2): MAINTAINERS: Add linux-omap to list of reviewers for TI Thermal ARM: omap2plus_defconfig: use CONFIG_CPUFREQ_DT Nobuhiro Iwamatsu (2): sh-eth: Set fdr_value of R-Car SoCs sh_eth: Fix access to TRSCER register Octavian Purdila (2): gpio: dln2: fix issue when an IRQ is unmasked then enabled gpio: dln2: use bus_sync_unlock instead of scheduling work Pablo Neira Ayuso (4): netfilter: conntrack: fix race between confirmation and flush netfilter: nfnetlink: validate nfnetlink header from batch netfilter: nfnetlink: relax strict multicast group check from netlink_bind netfilter: nf_tables: fix flush ruleset chain dependencies Peter Chen (2): usb: gadget: f_uac1: access freed memory at f_audio_free_inst Revert "usb: chipidea: remove duplicate dev_set_drvdata for host_start" Peter Hurley (2): tty: Prevent hw state corruption in exclusive mode reopen Revert "tty: Fix pty master poll() after slave closes v2" Philipp Zabel (1): ARM: dts: imx6qdl: Fix CODA960 interrupt order Prashant Sreedharan (3): tg3: tg3_timer() should grab tp->lock before checking for tp->irq_sync tg3: tg3_reset_task() needs to use rtnl_lock to synchronize tg3: Release tp->lock before invoking synchronize_irq() Preston Fick (1): USB: cp210x: fix ID for production CEL MeshConnect USB Stick Rasmus Villemoes (2): usb: musb: Fix a few off-by-one lengths kernfs: Fix kernfs_name_compare Reinhard Speyerer (1): USB: qcserial/option: make AT URCs work for Sierra Wireless MC73xx Robert Baldyga (1): usb: dwc2: gadget: kill requests with 'force' in s3c_hsotg_udc_stop() Romain Perier (1): clk: rockchip: Fix clock gate for rk3188 hclk_emem_peri Sagi Grimberg (1): MAINTAINERS: Add entry for iSER target driver Sebastian Andrzej Siewior (1): usb: musb: stuff leak of struct usb_hcd Sergej Pupykin (1): tty: Add support for the WCH384 4S multi-IO card Simon Guinot (1): leds: netxbig: fix oops at probe time Songjun Wu (1): usb: gadget: udc: atmel: fix possible oops when unloading module Stanimir Varbanov (1): clk: fix possible null pointer dereference Stefan Agner (1): net: fec: fix MDIO bus assignement for dual fec SoC's Stephane Eranian (1): perf/rapl: Fix sysfs_show() initialization for RAPL PMU Steven Rostedt (Red Hat) (5): ftrace: Fix updating of filters for shared global_ops filters ftrace: Check both notrace and filter for old hash ftrace/jprobes/x86: Fix conflict between jprobes and function graph tracing tracing: Remove extra call to init_ftrace_syscalls() tracing: Fix enabling of syscall events on the command line Sukadev Bhattiprolu (1): perf tools powerpc: Use dwfl_report_elf() instead of offline. Thierry Reding (1): usb: phy: Restore deferred probing path Thomas Falcon (1): MAINTAINERS: add me as ibmveth maintainer Thomas Graf (1): openvswitch: packet messages need their own probe attribtue Thomas Petazzoni (1): mmc: sdhci-pxav3: do the mbus window configuration after enabling clocks Tim Kryger (1): mmc: sdhci: Set SDHCI_POWER_ON with external vmmc Tomas Winkler (1): mei: add ABI documentation for fw_status exported through sysfs Tony Lindgren (3): usb: musb: Fix randconfig build issues for Kconfig options ARM: OMAP2+: Fix n900 board name for legacy user space ARM: dts: Revert disabling of smc91x for n900 Trond Myklebust (5): LOCKD: Fix a race when initialising nlmsvc_timeout NFSv4.1: Fix client id trunking on Linux NFSv4: Cache the NFSv4/v4.1 client owner_id in the struct nfs_client NFSv4/v4.1: Verify the client owner id during trunking detection NFSv4: Remove incorrect check in can_open_delegated() Tyler Baker (1): reset: sunxi: fix spinlock initialization Vasu Dev (1): i40e: adds FCoE configure option Vignesh R (1): phy: phy-ti-pipe3: fix inconsistent enumeration of PCIe gen2 cards Vince Hsu (1): usb: host: ehci-tegra: request deferred probe when failing to get phy Vineet Gupta (2): perf tools: Elide strlcpy warning with uclibc perf tools: Avoid build splat for syscall numbers with uclibc Vinod Koul (1): MAINTAINERS: dmaengine: fix the header file for dmaengine Vitaly Kuznetsov (1): x86/xen: avoid freeing static 'name' when kasprintf() fails Vivek Gautam (1): arm: dts: Use pmu_system_controller phandle for dp phy Wang Nan (1): perf test: Fix dwarf unwind using libunwind. Wenyou Yang (1): ARM: at91: board-dt-sama5: add phy_fixup to override NAND_Tree Will Deacon (2): arm64: compat: wire up compat_sys_execveat mm: mmu_gather: use tlb->end != 0 only for TLB invalidation Xiubo Li (1): ARM: ls1021a: dtsi: add 'big-endian' property for scfg node Yoshihiro Shimoda (2): thermal: rcar: fix ENR register value thermal: rcar: change type of ctemp in rcar_thermal_update_temp() Zhang Rui (3): ACPI/int340x_thermal: enumerate INT340X devices even if they're not in _ART/_TRT ACPI/int340x_thermal: enumerate INT3401 for Intel SoC DTS thermal driver int340x_thermal/processor_thermal_device: return failure when dann frazier (1): tools: testing: selftests: mq_perf_tests: Fix infinite loop on ARM leroy christophe (1): netfilter: nf_tables: fix port natting in little endian archs ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-18 9:17 Linux 3.19-rc5 Linus Torvalds @ 2015-01-19 18:02 ` Bruno Prémont 2015-01-20 6:24 ` Linus Torvalds 0 siblings, 1 reply; 31+ messages in thread From: Bruno Prémont @ 2015-01-19 18:02 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Sun, 18 January 2015 Linus Torvalds <torvalds@linux-foundation.org> wrote: > Another week, another -rc. > > Fairly normal release, although I'd wish that by rc5 we'd have calmed > down even further. But no, with some of the driver tree merges in > particular, this is actually larger than rc4 was. > > That said, it's not like there is anything particularly scary in here. > > The arm64 vm bug that I mentioned as pending in the rc4 notes got > fixed within a day of that previous rc release, and the rest looks > pretty standard. Mostly drivers (networking, usb, scsi target, block > layer, mmc, tty etc), but also arch updates (arm, x86, s390 and some > tiny powerpc fixes), some filesystem updates (fuse and nfs), tracing > fixes, and some perf tooling fixes. > > Shortlog with the details appended. > > Go forth and test. No idea yet which rc is the offender (nor exact patch), but on my not so recent UP laptop with a pccard slot I have 2 pccardd kernel threads converting my laptop into a heater. lspci for affected nodes: 02:06.0 CardBus bridge [0607]: O2 Micro, Inc. OZ711EC1 SmartCardBus Controller [1217:7113] (rev 20) 02:06.1 CardBus bridge [0607]: O2 Micro, Inc. OZ711EC1 SmartCardBus Controller [1217:7113] (rev 20) Very basics I have, before I attempt any bisection: cat /proc/$(pidof pccardd)/stack [<c143ec95>] pccardd+0x1a5/0x3e0 [<c10543a0>] kthread+0xa0/0xc0 [<c16cd840>] ret_from_kernel_thread+0x20/0x30 [<ffffffff>] 0xffffffff Doing objdump on drivers/pcmcia/cs.o in which pccardd() is defined it seems pccardd is stuck in try_to_freeze_unsafe(). Does this kind of behavior ring a bell? I can unbind the two yenta_cardbus devices from yenta_cardbus to stop the CPU usage. Possibly important config option: # CONFIG_SMP is not set Objdump below: static int pccardd(void *__skt) { ca0: 55 push %ebp ca1: 89 e5 mov %esp,%ebp ca3: 57 push %edi ca4: 56 push %esi ca5: 53 push %ebx ca6: 89 c3 mov %eax,%ebx ca8: 83 ec 24 sub $0x24,%esp DECLARE_PER_CPU(struct task_struct *, current_task); static __always_inline struct task_struct *get_current(void) { return this_cpu_read_stable(current_task); cab: a1 00 00 00 00 mov 0x0,%eax struct pcmcia_socket *skt = __skt; int ret; skt->thread = current; cb0: 89 83 e4 00 00 00 mov %eax,0xe4(%ebx) skt->socket = dead_socket; cb6: a1 00 00 00 00 mov 0x0,%eax skt->ops->init(skt); cbb: 8b 93 cc 00 00 00 mov 0xcc(%ebx),%edx { struct pcmcia_socket *skt = __skt; int ret; skt->thread = current; skt->socket = dead_socket; cc1: 89 43 04 mov %eax,0x4(%ebx) cc4: a1 04 00 00 00 mov 0x4,%eax cc9: 89 43 08 mov %eax,0x8(%ebx) ccc: a1 08 00 00 00 mov 0x8,%eax cd1: 89 43 0c mov %eax,0xc(%ebx) skt->ops->init(skt); cd4: 89 d8 mov %ebx,%eax cd6: ff 12 call *(%edx) skt->ops->set_socket(skt, &skt->socket); cd8: 8b 8b cc 00 00 00 mov 0xcc(%ebx),%ecx cde: 8d 53 04 lea 0x4(%ebx),%edx ce1: 89 d8 mov %ebx,%eax ce3: ff 51 0c call *0xc(%ecx) /* register with the device core */ ret = device_register(&skt->dev); ce6: 8d 83 2c 01 00 00 lea 0x12c(%ebx),%eax cec: 89 45 e0 mov %eax,-0x20(%ebp) cef: e8 fc ff ff ff call cf0 <pccardd+0x50> if (ret) { cf4: 85 c0 test %eax,%eax cf6: 0f 85 d4 02 00 00 jne fd0 <pccardd+0x330> "PCMCIA: unable to register socket\n"); skt->thread = NULL; complete(&skt->thread_done); return 0; } ret = pccard_sysfs_add_socket(&skt->dev); cfc: 8b 45 e0 mov -0x20(%ebp),%eax cff: e8 fc ff ff ff call d00 <pccardd+0x60> if (ret) d04: 85 c0 test %eax,%eax "PCMCIA: unable to register socket\n"); skt->thread = NULL; complete(&skt->thread_done); return 0; } ret = pccard_sysfs_add_socket(&skt->dev); d06: 89 45 e4 mov %eax,-0x1c(%ebp) if (ret) d09: 0f 85 01 03 00 00 jne 1010 <pccardd+0x370> dev_warn(&skt->dev, "err %d adding socket attributes\n", ret); complete(&skt->thread_done); d0f: 8d 83 e8 00 00 00 lea 0xe8(%ebx),%eax d15: e8 fc ff ff ff call d16 <pccardd+0x76> /* wait for userspace to catch up */ msleep(250); d1a: b8 fa 00 00 00 mov $0xfa,%eax d1f: e8 fc ff ff ff call d20 <pccardd+0x80> set_freezable(); d24: e8 fc ff ff ff call d25 <pccardd+0x85> d29: 8d 83 fc 00 00 00 lea 0xfc(%ebx),%eax d2f: 8b 3d 00 00 00 00 mov 0x0,%edi d35: 89 45 e8 mov %eax,-0x18(%ebp) d38: 89 7d dc mov %edi,-0x24(%ebp) d3b: 90 nop d3c: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi for (;;) { unsigned long flags; unsigned int events; unsigned int sysfs_events; set_current_state(TASK_INTERRUPTIBLE); d40: be 40 0d 00 00 mov $0xd40,%esi d45: 89 b7 8c 0c 00 00 mov %esi,0xc8c(%edi) d4b: c7 07 01 00 00 00 movl $0x1,(%edi) /* * "=rm" is safe here, because "pop" adjusts the stack before * it evaluates its effective address -- this is part of the * documented behavior of the "pop" instruction. */ asm volatile("# __raw_save_flags\n\t" d51: 9c pushf d52: 58 pop %eax :"memory", "cc"); } static inline void native_irq_disable(void) { asm volatile("cli": : :"memory"); d53: fa cli * The various preempt_count add/sub methods */ static __always_inline void __preempt_count_add(int val) { raw_cpu_add_4(__preempt_count, val); d54: ff 05 00 00 00 00 incl 0x0 spin_lock_irqsave(&skt->thread_lock, flags); events = skt->thread_events; d5a: 8b 8b f4 00 00 00 mov 0xf4(%ebx),%ecx skt->thread_events = 0; d60: 31 d2 xor %edx,%edx sysfs_events = skt->sysfs_events; d62: 8b b3 f8 00 00 00 mov 0xf8(%ebx),%esi set_current_state(TASK_INTERRUPTIBLE); spin_lock_irqsave(&skt->thread_lock, flags); events = skt->thread_events; skt->thread_events = 0; d68: 89 93 f4 00 00 00 mov %edx,0xf4(%ebx) unsigned int sysfs_events; set_current_state(TASK_INTERRUPTIBLE); spin_lock_irqsave(&skt->thread_lock, flags); events = skt->thread_events; d6e: 89 4d ec mov %ecx,-0x14(%ebp) skt->thread_events = 0; sysfs_events = skt->sysfs_events; skt->sysfs_events = 0; d71: 31 c9 xor %ecx,%ecx d73: 89 8b f8 00 00 00 mov %ecx,0xf8(%ebx) return flags; } static inline void native_restore_fl(unsigned long flags) { asm volatile("push %0 ; popf" d79: 50 push %eax d7a: 9d popf spin_unlock_irqrestore(&skt->thread_lock, flags); mutex_lock(&skt->skt_mutex); d7b: 8b 45 e8 mov -0x18(%ebp),%eax } static __always_inline void __preempt_count_sub(int val) { raw_cpu_add_4(__preempt_count, -val); d7e: ff 0d 00 00 00 00 decl 0x0 d84: e8 fc ff ff ff call d85 <pccardd+0xe5> if (events & SS_DETECT) d89: f6 45 ec 80 testb $0x80,-0x14(%ebp) d8d: 0f 85 f5 00 00 00 jne e88 <pccardd+0x1e8> socket_detect_change(skt); if (sysfs_events) { d93: 85 f6 test %esi,%esi d95: 0f 84 85 00 00 00 je e20 <pccardd+0x180> if (sysfs_events & PCMCIA_UEVENT_EJECT) d9b: f7 c6 01 00 00 00 test $0x1,%esi da1: 0f 85 21 01 00 00 jne ec8 <pccardd+0x228> socket_remove(skt); if (sysfs_events & PCMCIA_UEVENT_INSERT) da7: f7 c6 02 00 00 00 test $0x2,%esi dad: 0f 85 28 01 00 00 jne edb <pccardd+0x23b> socket_insert(skt); if ((sysfs_events & PCMCIA_UEVENT_SUSPEND) && db3: f7 c6 04 00 00 00 test $0x4,%esi db9: 74 25 je de0 <pccardd+0x140> dbb: f6 43 11 80 testb $0x80,0x11(%ebx) dbf: 75 1f jne de0 <pccardd+0x140> !(skt->state & SOCKET_CARDBUS)) { if (skt->callback) dc1: 8b 93 14 01 00 00 mov 0x114(%ebx),%edx dc7: 85 d2 test %edx,%edx dc9: 0f 84 e1 01 00 00 je fb0 <pccardd+0x310> ret = skt->callback->suspend(skt); dcf: 89 d8 mov %ebx,%eax dd1: ff 52 14 call *0x14(%edx) else ret = 0; if (!ret) { dd4: 85 c0 test %eax,%eax if (sysfs_events & PCMCIA_UEVENT_INSERT) socket_insert(skt); if ((sysfs_events & PCMCIA_UEVENT_SUSPEND) && !(skt->state & SOCKET_CARDBUS)) { if (skt->callback) ret = skt->callback->suspend(skt); dd6: 89 45 e4 mov %eax,-0x1c(%ebp) else ret = 0; if (!ret) { dd9: 0f 84 d1 01 00 00 je fb0 <pccardd+0x310> ddf: 90 nop socket_suspend(skt); msleep(100); } } if ((sysfs_events & PCMCIA_UEVENT_RESUME) && de0: f7 c6 08 00 00 00 test $0x8,%esi de6: 74 0c je df4 <pccardd+0x154> !(skt->state & SOCKET_CARDBUS)) { de8: 8b 43 10 mov 0x10(%ebx),%eax if (!ret) { socket_suspend(skt); msleep(100); } } if ((sysfs_events & PCMCIA_UEVENT_RESUME) && deb: f6 c4 80 test $0x80,%ah dee: 0f 84 24 01 00 00 je f18 <pccardd+0x278> !(skt->state & SOCKET_CARDBUS)) { ret = socket_resume(skt); if (!ret && skt->callback) skt->callback->resume(skt); } if ((sysfs_events & PCMCIA_UEVENT_REQUERY) && df4: f7 c6 10 00 00 00 test $0x10,%esi dfa: 74 24 je e20 <pccardd+0x180> dfc: f6 43 11 80 testb $0x80,0x11(%ebx) e00: 75 1e jne e20 <pccardd+0x180> !(skt->state & SOCKET_CARDBUS)) { if (!ret && skt->callback) e02: 8b 4d e4 mov -0x1c(%ebp),%ecx e05: 85 c9 test %ecx,%ecx e07: 75 17 jne e20 <pccardd+0x180> e09: 8b 93 14 01 00 00 mov 0x114(%ebx),%edx e0f: 85 d2 test %edx,%edx e11: 74 0d je e20 <pccardd+0x180> skt->callback->requery(skt); e13: 89 d8 mov %ebx,%eax e15: ff 52 0c call *0xc(%edx) e18: 90 nop e19: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi } } mutex_unlock(&skt->skt_mutex); e20: 8b 45 e8 mov -0x18(%ebp),%eax e23: e8 fc ff ff ff call e24 <pccardd+0x184> if (events || sysfs_events) e28: 8b 45 ec mov -0x14(%ebp),%eax e2b: 09 f0 or %esi,%eax e2d: 0f 85 0d ff ff ff jne d40 <pccardd+0xa0> continue; if (kthread_should_stop()) e33: e8 fc ff ff ff call e34 <pccardd+0x194> e38: 84 c0 test %al,%al e3a: 0f 85 30 01 00 00 jne f70 <pccardd+0x2d0> break; schedule(); e40: e8 fc ff ff ff call e41 <pccardd+0x1a1> * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION * If try_to_freeze causes a lockdep warning it means the caller may deadlock */ static inline bool try_to_freeze_unsafe(void) { might_sleep(); e45: 31 c9 xor %ecx,%ecx Stack as obtained points to here. e47: ba 38 00 00 00 mov $0x38,%edx e4c: b8 3c 01 00 00 mov $0x13c,%eax e51: e8 fc ff ff ff call e52 <pccardd+0x1b2> e56: e8 fc ff ff ff call e57 <pccardd+0x1b7> * * Atomically reads the value of @v. */ static inline int atomic_read(const atomic_t *v) { return ACCESS_ONCE((v)->counter); e5b: a1 00 00 00 00 mov 0x0,%eax /* * Check if there is a request to freeze a process */ static inline bool freezing(struct task_struct *p) { if (likely(!atomic_read(&system_freezing_cnt))) e60: 85 c0 test %eax,%eax e62: 0f 84 d8 fe ff ff je d40 <pccardd+0xa0> return false; return freezing_slow_path(p); e68: 8b 45 dc mov -0x24(%ebp),%eax e6b: e8 fc ff ff ff call e6c <pccardd+0x1cc> * If try_to_freeze causes a lockdep warning it means the caller may deadlock */ ... ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-19 18:02 ` Bruno Prémont @ 2015-01-20 6:24 ` Linus Torvalds 2015-01-21 20:37 ` Bruno Prémont 0 siblings, 1 reply; 31+ messages in thread From: Linus Torvalds @ 2015-01-20 6:24 UTC (permalink / raw) To: Bruno Prémont; +Cc: Linux Kernel Mailing List On Tue, Jan 20, 2015 at 6:02 AM, Bruno Prémont <bonbons@linux-vserver.org> wrote: > > No idea yet which rc is the offender (nor exact patch), but on my not > so recent UP laptop with a pccard slot I have 2 pccardd kernel threads > converting my laptop into a heater. > > lspci for affected nodes: > 02:06.0 CardBus bridge [0607]: O2 Micro, Inc. OZ711EC1 SmartCardBus Controller [1217:7113] (rev 20) > 02:06.1 CardBus bridge [0607]: O2 Micro, Inc. OZ711EC1 SmartCardBus Controller [1217:7113] (rev 20) > > Very basics I have, before I attempt any bisection: Hmm. I'm not seeing anything recent changing anything in this area, so I suspect that unless somebody else steps up and says "Ahh, that sounds like xyz", your bisection is the best option. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-20 6:24 ` Linus Torvalds @ 2015-01-21 20:37 ` Bruno Prémont 2015-01-21 21:37 ` Bruno Prémont 0 siblings, 1 reply; 31+ messages in thread From: Bruno Prémont @ 2015-01-21 20:37 UTC (permalink / raw) To: Linus Torvalds, Peter Zijlstra; +Cc: Linux Kernel Mailing List On Tue, 20 January 2015 Linus Torvalds wrote: > On Tue, Jan 20, 2015 at 6:02 AM, Bruno Prémont wrote: > > > > No idea yet which rc is the offender (nor exact patch), but on my not > > so recent UP laptop with a pccard slot I have 2 pccardd kernel threads > > converting my laptop into a heater. > > > > lspci for affected nodes: > > 02:06.0 CardBus bridge [0607]: O2 Micro, Inc. OZ711EC1 SmartCardBus Controller [1217:7113] (rev 20) > > 02:06.1 CardBus bridge [0607]: O2 Micro, Inc. OZ711EC1 SmartCardBus Controller [1217:7113] (rev 20) > > > > Very basics I have, before I attempt any bisection: > > Hmm. I'm not seeing anything recent changing anything in this area, so > I suspect that unless somebody else steps up and says "Ahh, that > sounds like xyz", your bisection is the best option. I've done most of the bisection and ended up in the scheduler changes. CCing Peter. A few iterations from the end kernel is warning about (might?)sleeping during locking or so (scrolling too fast to read end making access to console hard). My current bisection log is: git bisect start # bad: [117af36e3bceb4fceb1c63489d6f3d94ed259a4c] Apple-GMUX: inhibit VGA arbitration changes git bisect bad 117af36e3bceb4fceb1c63489d6f3d94ed259a4c # good: [b2776bf7149bddd1f4161f14f79520f17fc1d71d] Linux 3.18 git bisect good b2776bf7149bddd1f4161f14f79520f17fc1d71d # bad: [c0222ac086669a631814bbf857f8c8023452a4d7] Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus git bisect bad c0222ac086669a631814bbf857f8c8023452a4d7 # bad: [2183a58803c2bbd87c2d0057eed6779ec4718d4d] Merge tag 'media/v3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media git bisect bad 2183a58803c2bbd87c2d0057eed6779ec4718d4d # good: [6da314122ddc11936c6f054753bbb956a499d020] Merge tag 'dt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc git bisect good 6da314122ddc11936c6f054753bbb956a499d020 # bad: [cbfe0de303a55ed96d8831c2d5f56f8131cd6612] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs git bisect bad cbfe0de303a55ed96d8831c2d5f56f8131cd6612 # bad: [9e66645d72d3c395da92b0f8855c787f4b5f0e89] Merge branch 'irq-irqdomain-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect bad 9e66645d72d3c395da92b0f8855c787f4b5f0e89 # good: [5706ffd045c3810912c4982357d7daa721af3464] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect good 5706ffd045c3810912c4982357d7daa721af3464 # bad: [a157508c9790ccd1c8b5c6a828d6ba85bbe95aaa] Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect bad a157508c9790ccd1c8b5c6a828d6ba85bbe95aaa # bad: [acb32132ec0433c03bed750f3e9508dc29db0328] sched/deadline: Add deadline rq status print git bisect bad acb32132ec0433c03bed750f3e9508dc29db0328 # good: [1029a2b52c09e479fd7b07275812ad97868c0fb0] sched, exit: Deal with nested sleeps git bisect good 1029a2b52c09e479fd7b07275812ad97868c0fb0 That means, the remaining commits would be: commit acb32132ec0433c03bed750f3e9508dc29db0328 Author: Wanpeng Li <wanpeng.li@linux.intel.com> Date: Fri Oct 31 06:39:33 2014 +0800 sched/deadline: Add deadline rq status print ... commit e23738a7300a7591a57a22f47b813fd1b53ec404 Author: Peter Zijlstra <peterz@infradead.org> Date: Wed Sep 24 10:18:50 2014 +0200 sched, inotify: Deal with nested sleeps It looks like pccardd might be doing the wrong thing for locking/sleeping/waiting. Trying to complete bisection. Bruno ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-21 20:37 ` Bruno Prémont @ 2015-01-21 21:37 ` Bruno Prémont 2015-01-21 22:12 ` Davidlohr Bueso 0 siblings, 1 reply; 31+ messages in thread From: Bruno Prémont @ 2015-01-21 21:37 UTC (permalink / raw) To: Linus Torvalds, Peter Zijlstra Cc: Linux Kernel Mailing List, tglx, ilya.dryomov, umgwanakikbuti, oleg On Wed, 21 January 2015 Bruno Prémont wrote: > On Tue, 20 January 2015 Linus Torvalds wrote: > > On Tue, Jan 20, 2015 at 6:02 AM, Bruno Prémont wrote: > > > > > > No idea yet which rc is the offender (nor exact patch), but on my not > > > so recent UP laptop with a pccard slot I have 2 pccardd kernel threads > > > converting my laptop into a heater. > > > > > > lspci for affected nodes: > > > 02:06.0 CardBus bridge [0607]: O2 Micro, Inc. OZ711EC1 SmartCardBus Controller [1217:7113] (rev 20) > > > 02:06.1 CardBus bridge [0607]: O2 Micro, Inc. OZ711EC1 SmartCardBus Controller [1217:7113] (rev 20) > > > > > > Very basics I have, before I attempt any bisection: > > > > Hmm. I'm not seeing anything recent changing anything in this area, so > > I suspect that unless somebody else steps up and says "Ahh, that > > sounds like xyz", your bisection is the best option. Bisecting to the end did point me at (the warning traces produced in great quantities might not be the very same issue as the abusive CPU usage, but certainly look very related): [CCing people on CC for the patch] commit 8eb23b9f35aae413140d3fda766a98092c21e9b0 Author: Peter Zijlstra <peterz@infradead.org> Date: Wed Sep 24 10:18:55 2014 +0200 sched: Debug nested sleeps Validate we call might_sleep() with TASK_RUNNING, which catches places where we nest blocking primitives, eg. mutex usage in a wait loop. Since all blocking is arranged through task_struct::state, nesting this will cause the inner primitive to set TASK_RUNNING and the outer will thus not block. Another observed problem is calling a blocking function from schedule()->sched_submit_work()->blk_schedule_flush_plug() which will then destroy the task state for the actual __schedule() call that comes after it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: tglx@linutronix.de Cc: ilya.dryomov@inktank.com Cc: umgwanakikbuti@gmail.com Cc: oleg@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140924082242.591637616@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Which does produce the following trace (hand-copied most important parts of it): Warning: CPU 0 PID: 68 at kernel/sched/core.c:7311 __might_sleep+0x143/0x170 do not call blocking ops when !TASK_RUNNING; state=1 set at [<c1436390>] pccardd+0xa0/0x3e0 ... Call trace: ... __might_sleep+0x143/0x170 ? pccardd+0xa0/0x3e0 ? pccardd+0xa0/0x3e0 mutex_lock+0x17/0x2a pccardd+0xe9/0x3e0 ? pcmcia_socket_uevent+0x30/0x30 pccardd() is located in drivers/pcmcia/cs.c and seems to be of the structure Peter's patch wants to warn about. Bruno ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-21 21:37 ` Bruno Prémont @ 2015-01-21 22:12 ` Davidlohr Bueso 2015-01-21 22:54 ` Peter Hurley 0 siblings, 1 reply; 31+ messages in thread From: Davidlohr Bueso @ 2015-01-21 22:12 UTC (permalink / raw) To: Bruno Prémont Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List, tglx, ilya.dryomov, umgwanakikbuti, oleg On Wed, 2015-01-21 at 22:37 +0100, Bruno Prémont wrote: > On Wed, 21 January 2015 Bruno Prémont wrote: > > On Tue, 20 January 2015 Linus Torvalds wrote: > > > On Tue, Jan 20, 2015 at 6:02 AM, Bruno Prémont wrote: > > > > > > > > No idea yet which rc is the offender (nor exact patch), but on my not > > > > so recent UP laptop with a pccard slot I have 2 pccardd kernel threads > > > > converting my laptop into a heater. > > > > > > > > lspci for affected nodes: > > > > 02:06.0 CardBus bridge [0607]: O2 Micro, Inc. OZ711EC1 SmartCardBus Controller [1217:7113] (rev 20) > > > > 02:06.1 CardBus bridge [0607]: O2 Micro, Inc. OZ711EC1 SmartCardBus Controller [1217:7113] (rev 20) > > > > > > > > Very basics I have, before I attempt any bisection: > > > > > > Hmm. I'm not seeing anything recent changing anything in this area, so > > > I suspect that unless somebody else steps up and says "Ahh, that > > > sounds like xyz", your bisection is the best option. > > Bisecting to the end did point me at (the warning traces produced in great > quantities might not be the very same issue as the abusive CPU usage, but > certainly look very related): > [CCing people on CC for the patch] > > commit 8eb23b9f35aae413140d3fda766a98092c21e9b0 > Author: Peter Zijlstra <peterz@infradead.org> > Date: Wed Sep 24 10:18:55 2014 +0200 > > sched: Debug nested sleeps > > Validate we call might_sleep() with TASK_RUNNING, which catches places > where we nest blocking primitives, eg. mutex usage in a wait loop. > > Since all blocking is arranged through task_struct::state, nesting > this will cause the inner primitive to set TASK_RUNNING and the outer > will thus not block. > > Another observed problem is calling a blocking function from > schedule()->sched_submit_work()->blk_schedule_flush_plug() which will > then destroy the task state for the actual __schedule() call that > comes after it. > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > Cc: tglx@linutronix.de > Cc: ilya.dryomov@inktank.com > Cc: umgwanakikbuti@gmail.com > Cc: oleg@redhat.com > Cc: Linus Torvalds <torvalds@linux-foundation.org> > Link: http://lkml.kernel.org/r/20140924082242.591637616@infradead.org > Signed-off-by: Ingo Molnar <mingo@kernel.org> > > Which does produce the following trace (hand-copied most important parts of it): > Warning: CPU 0 PID: 68 at kernel/sched/core.c:7311 __might_sleep+0x143/0x170 > do not call blocking ops when !TASK_RUNNING; state=1 set at [<c1436390>] pccardd+0xa0/0x3e0 > ... > Call trace: > ... > __might_sleep+0x143/0x170 > ? pccardd+0xa0/0x3e0 > ? pccardd+0xa0/0x3e0 > mutex_lock+0x17/0x2a > pccardd+0xe9/0x3e0 > ? pcmcia_socket_uevent+0x30/0x30 > > pccardd() is located in drivers/pcmcia/cs.c and seems to be of the structure > Peter's patch wants to warn about. Yeah setting current to interruptable so early in the game is bogus. It should be set after unlocking the skt_mutex. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-21 22:12 ` Davidlohr Bueso @ 2015-01-21 22:54 ` Peter Hurley 2015-01-30 1:22 ` Rafael J. Wysocki 0 siblings, 1 reply; 31+ messages in thread From: Peter Hurley @ 2015-01-21 22:54 UTC (permalink / raw) To: Davidlohr Bueso, Bruno Prémont, Peter Zijlstra Cc: Linus Torvalds, Linux Kernel Mailing List, tglx, ilya.dryomov, umgwanakikbuti, oleg On 01/21/2015 05:12 PM, Davidlohr Bueso wrote: > On Wed, 2015-01-21 at 22:37 +0100, Bruno Prémont wrote: >> On Wed, 21 January 2015 Bruno Prémont wrote: >>> On Tue, 20 January 2015 Linus Torvalds wrote: >>>> On Tue, Jan 20, 2015 at 6:02 AM, Bruno Prémont wrote: >>>>> >>>>> No idea yet which rc is the offender (nor exact patch), but on my not >>>>> so recent UP laptop with a pccard slot I have 2 pccardd kernel threads >>>>> converting my laptop into a heater. [...] >> Bisecting to the end did point me at (the warning traces produced in great >> quantities might not be the very same issue as the abusive CPU usage, but >> certainly look very related): >> [CCing people on CC for the patch] >> >> commit 8eb23b9f35aae413140d3fda766a98092c21e9b0 >> Author: Peter Zijlstra <peterz@infradead.org> >> Date: Wed Sep 24 10:18:55 2014 +0200 [...] >> Which does produce the following trace (hand-copied most important parts of it): >> Warning: CPU 0 PID: 68 at kernel/sched/core.c:7311 __might_sleep+0x143/0x170 >> do not call blocking ops when !TASK_RUNNING; state=1 set at [<c1436390>] pccardd+0xa0/0x3e0 >> ... >> Call trace: >> ... >> __might_sleep+0x143/0x170 >> ? pccardd+0xa0/0x3e0 >> ? pccardd+0xa0/0x3e0 >> mutex_lock+0x17/0x2a >> pccardd+0xe9/0x3e0 >> ? pcmcia_socket_uevent+0x30/0x30 >> >> pccardd() is located in drivers/pcmcia/cs.c and seems to be of the structure >> Peter's patch wants to warn about. > > Yeah setting current to interruptable so early in the game is bogus. It > should be set after unlocking the skt_mutex. Yeah, but the debug check is triggering worse behavior, requiring bisecting back to the debug commit. How does the might_sleep() check here guarantee the task won't sleep? Regards, Peter Hurley ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-21 22:54 ` Peter Hurley @ 2015-01-30 1:22 ` Rafael J. Wysocki 2015-01-30 1:12 ` Linus Torvalds 0 siblings, 1 reply; 31+ messages in thread From: Rafael J. Wysocki @ 2015-01-30 1:22 UTC (permalink / raw) To: Peter Hurley, Davidlohr Bueso, Peter Zijlstra Cc: Bruno Prémont, Linus Torvalds, Linux Kernel Mailing List, tglx, ilya.dryomov, umgwanakikbuti, oleg On Wednesday, January 21, 2015 05:54:00 PM Peter Hurley wrote: > On 01/21/2015 05:12 PM, Davidlohr Bueso wrote: > > On Wed, 2015-01-21 at 22:37 +0100, Bruno Prémont wrote: > >> On Wed, 21 January 2015 Bruno Prémont wrote: > >>> On Tue, 20 January 2015 Linus Torvalds wrote: > >>>> On Tue, Jan 20, 2015 at 6:02 AM, Bruno Prémont wrote: > >>>>> > >>>>> No idea yet which rc is the offender (nor exact patch), but on my not > >>>>> so recent UP laptop with a pccard slot I have 2 pccardd kernel threads > >>>>> converting my laptop into a heater. > > [...] > > >> Bisecting to the end did point me at (the warning traces produced in great > >> quantities might not be the very same issue as the abusive CPU usage, but > >> certainly look very related): > >> [CCing people on CC for the patch] > >> > >> commit 8eb23b9f35aae413140d3fda766a98092c21e9b0 > >> Author: Peter Zijlstra <peterz@infradead.org> > >> Date: Wed Sep 24 10:18:55 2014 +0200 > > [...] > > >> Which does produce the following trace (hand-copied most important parts of it): > >> Warning: CPU 0 PID: 68 at kernel/sched/core.c:7311 __might_sleep+0x143/0x170 > >> do not call blocking ops when !TASK_RUNNING; state=1 set at [<c1436390>] pccardd+0xa0/0x3e0 > >> ... > >> Call trace: > >> ... > >> __might_sleep+0x143/0x170 > >> ? pccardd+0xa0/0x3e0 > >> ? pccardd+0xa0/0x3e0 > >> mutex_lock+0x17/0x2a > >> pccardd+0xe9/0x3e0 > >> ? pcmcia_socket_uevent+0x30/0x30 > >> > >> pccardd() is located in drivers/pcmcia/cs.c and seems to be of the structure > >> Peter's patch wants to warn about. > > > > Yeah setting current to interruptable so early in the game is bogus. It > > should be set after unlocking the skt_mutex. > > Yeah, but the debug check is triggering worse behavior, requiring > bisecting back to the debug commit. Yes, it is. So I'm wondering is anyone is working on fixing this in any way? It kind of sucks when this is happening on an otherwise perfectly usable old(ish) machine ... Rafael ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-30 1:22 ` Rafael J. Wysocki @ 2015-01-30 1:12 ` Linus Torvalds 2015-01-30 1:25 ` Linus Torvalds 2015-01-30 1:49 ` Rafael J. Wysocki 0 siblings, 2 replies; 31+ messages in thread From: Linus Torvalds @ 2015-01-30 1:12 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Peter Hurley, Davidlohr Bueso, Peter Zijlstra, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith, Oleg Nesterov On Thu, Jan 29, 2015 at 5:22 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > On Wednesday, January 21, 2015 05:54:00 PM Peter Hurley wrote: >> >> Yeah, but the debug check is triggering worse behavior, requiring >> bisecting back to the debug commit. > > Yes, it is. > > So I'm wondering is anyone is working on fixing this in any way? > > It kind of sucks when this is happening on an otherwise perfectly usable > old(ish) machine ... The WARN() was already changed to a WARN_ONCE(). So that debug check doesn't cause problems any more. If somebody is bisecting something else, and the WARN() is a problem for those intermediate kernels, then just disabling CONFIG_DEBUG_ATOMIC_SLEEP should get you past that point. IOW, this really shouldn't be an issue. Does the pccard thing still not work? Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-30 1:12 ` Linus Torvalds @ 2015-01-30 1:25 ` Linus Torvalds 2015-01-30 1:35 ` Linus Torvalds ` (5 more replies) 2015-01-30 1:49 ` Rafael J. Wysocki 1 sibling, 6 replies; 31+ messages in thread From: Linus Torvalds @ 2015-01-30 1:25 UTC (permalink / raw) To: Rafael J. Wysocki, Ingo Molnar Cc: Peter Hurley, Davidlohr Bueso, Peter Zijlstra, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith, Oleg Nesterov [-- Attachment #1: Type: text/plain, Size: 1844 bytes --] On Thu, Jan 29, 2015 at 5:12 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > The WARN() was already changed to a WARN_ONCE(). Oh, but I notice that the "__set_current_state(TASK_RUNNING) ends up always happening. So I think the right fix is to: - warn once like we do - but *not* do that __set_current_state() which was always total crap anyway Why do I say "total crap"? Because of two independent issues: (a) it actually changes behavior for a debug vs non-debug kernel, which is a really bad idea to begin with (b) it's really wrong. The whole "nested sleep" case was never a major bug to begin with, just a possible inefficiency where constant nested sleeps would possibly make the outer sleep not sleep. But that "could possibly make" case was the unlikely case, and the debug patch made it happen *all* the time by explicitly setting things running. So I think the proper patch is the attached. The comment is also crap. The comment says "Blocking primitives will set (and therefore destroy) current->state [...]" but the reality is that they *may* set it, and only in the unlikely slow-path where they actually block. So doing this in "__may_sleep()" is just bogus and horrible horrible crap. It turns the "harmless ugliness" into a real *harmful* bug. The key word of "__may_sleep()" is that "MAY" part. It's a debug thing to make relatively rare cases show up. PeterZ, please don't make "debugging" patches like this. Ever again. Because this was just stupid, and it took me too long to realize that despite the warning being shut up, the debug patch was still actively doing bad bad things. Ingo, maybe you'd want to apply this through the scheduler tree, the way you already did the WARN_ONCE() thing. Bruno, does this finally actually fix your pccard thing? Linus [-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 845 bytes --] kernel/sched/core.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c0accc00566e..76aaf5c61114 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7292,13 +7292,12 @@ void __might_sleep(const char *file, int line, int preempt_offset) * since we will exit with TASK_RUNNING make sure we enter with it, * otherwise we will destroy state. */ - if (WARN_ONCE(current->state != TASK_RUNNING, + WARN_ONCE(current->state != TASK_RUNNING, "do not call blocking ops when !TASK_RUNNING; " "state=%lx set at [<%p>] %pS\n", current->state, (void *)current->task_state_change, - (void *)current->task_state_change)) - __set_current_state(TASK_RUNNING); + (void *)current->task_state_change); ___might_sleep(file, line, preempt_offset); } ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-30 1:25 ` Linus Torvalds @ 2015-01-30 1:35 ` Linus Torvalds 2015-01-30 2:06 ` Rafael J. Wysocki ` (4 subsequent siblings) 5 siblings, 0 replies; 31+ messages in thread From: Linus Torvalds @ 2015-01-30 1:35 UTC (permalink / raw) To: Rafael J. Wysocki, Ingo Molnar Cc: Peter Hurley, Davidlohr Bueso, Peter Zijlstra, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith, Oleg Nesterov On Thu, Jan 29, 2015 at 5:25 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Ingo, maybe you'd want to apply this through the scheduler tree, the > way you already did the WARN_ONCE() thing. Side note: I can obviously just commit it myself, but for things that have obvious maintainers I tend to try to try to push the changes through channels. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-30 1:25 ` Linus Torvalds 2015-01-30 1:35 ` Linus Torvalds @ 2015-01-30 2:06 ` Rafael J. Wysocki 2015-01-30 15:47 ` Oleg Nesterov ` (3 subsequent siblings) 5 siblings, 0 replies; 31+ messages in thread From: Rafael J. Wysocki @ 2015-01-30 2:06 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Peter Hurley, Davidlohr Bueso, Peter Zijlstra, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith, Oleg Nesterov On Thursday, January 29, 2015 05:25:07 PM Linus Torvalds wrote: > On Thu, Jan 29, 2015 at 5:12 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > The WARN() was already changed to a WARN_ONCE(). > > Oh, but I notice that the "__set_current_state(TASK_RUNNING) ends up > always happening. > > So I think the right fix is to: > > - warn once like we do > > - but *not* do that __set_current_state() which was always total crap anyway > > Why do I say "total crap"? Because of two independent issues: > > (a) it actually changes behavior for a debug vs non-debug kernel, > which is a really bad idea to begin with > > (b) it's really wrong. The whole "nested sleep" case was never a > major bug to begin with, just a possible inefficiency where constant > nested sleeps would possibly make the outer sleep not sleep. But that > "could possibly make" case was the unlikely case, and the debug patch > made it happen *all* the time by explicitly setting things running. > > So I think the proper patch is the attached. I applied the patch and compiled the kernel with CONFIG_DEBUG_ATOMIC_SLEEP set again. The warning is back, so the DEBUG_ATOMIC_SLEEP thing appears to be working (and I don't really care about the freaking warning anyway), but the pccardd load issue is gone, so the patch fixes the problem for me. Thanks! Rafael ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-30 1:25 ` Linus Torvalds 2015-01-30 1:35 ` Linus Torvalds 2015-01-30 2:06 ` Rafael J. Wysocki @ 2015-01-30 15:47 ` Oleg Nesterov 2015-01-31 18:32 ` Linus Torvalds 2015-01-31 9:16 ` Bruno Prémont ` (2 subsequent siblings) 5 siblings, 1 reply; 31+ messages in thread From: Oleg Nesterov @ 2015-01-30 15:47 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Peter Zijlstra, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith On 01/29, Linus Torvalds wrote: > > (a) it actually changes behavior for a debug vs non-debug kernel, > which is a really bad idea to begin with Perhaps sched_annotate_sleep() shouldn't depend on CONFIG_DEBUG_ATOMIC_SLEEP too... Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-30 15:47 ` Oleg Nesterov @ 2015-01-31 18:32 ` Linus Torvalds 2015-01-31 20:16 ` Peter Zijlstra 0 siblings, 1 reply; 31+ messages in thread From: Linus Torvalds @ 2015-01-31 18:32 UTC (permalink / raw) To: Oleg Nesterov Cc: Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Peter Zijlstra, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith On Fri, Jan 30, 2015 at 7:47 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > Perhaps sched_annotate_sleep() shouldn't depend on CONFIG_DEBUG_ATOMIC_SLEEP > too... Ugh. That thing is horrible. The naming doesn't make it obvious at all that it's actually making sure that we have state set to TASK_RUNNING, and I could easily imagine that it would cause similar "busy-loops while scheduling" issues if anybody ever uses it in the wrong context. So I really think that whole thing is a sign of "the debug infrastructure is buggy, and people are introducing fragile things to just shut up the false positives". I don't know how to fix it. I really get the feeling that the whole new "nested sleep" detection code was a mistake to begin with, since it wasn't even a real bug, and it has now created more bugs than it ever detected afaik. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-31 18:32 ` Linus Torvalds @ 2015-01-31 20:16 ` Peter Zijlstra 2015-01-31 21:54 ` Linus Torvalds 2015-02-01 19:43 ` Oleg Nesterov 0 siblings, 2 replies; 31+ messages in thread From: Peter Zijlstra @ 2015-01-31 20:16 UTC (permalink / raw) To: Linus Torvalds Cc: Oleg Nesterov, Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith On Sat, Jan 31, 2015 at 10:32:23AM -0800, Linus Torvalds wrote: > On Fri, Jan 30, 2015 at 7:47 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > > > Perhaps sched_annotate_sleep() shouldn't depend on CONFIG_DEBUG_ATOMIC_SLEEP > > too... > > Ugh. That thing is horrible. The naming doesn't make it obvious at all > that it's actually making sure that we have state set to TASK_RUNNING, > and I could easily imagine that it would cause similar "busy-loops > while scheduling" issues if anybody ever uses it in the wrong context. The alternative was putting unconditional __set_task_state(TASK_RUNNING) things in a few code paths. I didn't want to cause the extra code in case we didn't need them. Particularly the include/net/sock.h one, as I know the network people are cycle counters. But this function should indeed be used very rarely. Currently we're at 5 instances. > So I really think that whole thing is a sign of "the debug > infrastructure is buggy, and people are introducing fragile things to > just shut up the false positives". Aside from this one annotation, which is used in cases where we broke out of the wait loop but haven't reset task state yet, there have not really been false positives. All the stuff it flagged are genuinely wrong, albeit not disastrously so, things mostly just work. Should I make the default wait_event() safe against sleeps? It would make it slightly more expensive, but on the whole that should not really be a problem. Alternatively we should provide an alternative to wait_event() that allows sleeps, but I'm failing to come up with a good name. > I don't know how to fix it. I really get the feeling that the whole > new "nested sleep" detection code was a mistake to begin with, since > it wasn't even a real bug, and it has now created more bugs than it > ever detected afaik. I appreciate I caused you some pain here, and I'm very sorry for that. But I do think we should be little more careful with task::state, on occasion things do go wrong there and funny stuff happens. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-31 20:16 ` Peter Zijlstra @ 2015-01-31 21:54 ` Linus Torvalds 2015-02-02 13:19 ` Peter Zijlstra 2015-02-01 19:43 ` Oleg Nesterov 1 sibling, 1 reply; 31+ messages in thread From: Linus Torvalds @ 2015-01-31 21:54 UTC (permalink / raw) To: Peter Zijlstra Cc: Oleg Nesterov, Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith On Sat, Jan 31, 2015 at 12:16 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > All the stuff it flagged are genuinely wrong, albeit not disastrously > so, things mostly just work. I really disagree. They weren't wrong. They *could* occasionally result in extra reschedules, but that was never wrong to begin with. But the debugging code made that much much worse, and made that "could result in extra reschedules" happen all the time if it triggered. And the whole "set task to sleep early" thing was rather intentional, and was a fundamental part of the original design for "select()/poll()" behavior, for example. Yes, we ended up having the waitqueue functions and using that "pollwake()" and use "wq->triggered" etc instead, but the whole optimistic early sleep thing worked reasonably well for a long time even when the "early TASK_SLEEP" was done before calling *thousands* of random poll routines. So you say "genuinely wrong", and I say "but that's how things were designed - it's an optimistic approach, not an exact one". Your debugging code changed that behavior, and actually introduced a real bug, exactly because you felt that the "no nested sleeps" was a harder requirement than it has ever actually been. In other words, I think the debugging code itself is wrong, and then that sched_annotate_sleep() thing is just a symptom of how it is wrong. If you have to sprinkle these kinds of random workarounds in a few core scheduler places (ok, mainly "wait()" paths, it looks like), why would you expect random *drivers* to have to care about things that even core kernel code says "I'm not going to care about this, I'll just shut the warning up, because the warning is wrong". Yes, the fact that select/poll was changed to try to avoid excessive polling scheduling because it actually *was* a problem under some loads does say that we generally want to try to avoid nested sleeping. Because while it is rare and the optimistic approach works fine in most cases, it certainly *can* become a problem if the optimistic "I'm normally not going to sleep" thing ends up not being sufficiently accurate. So don't get me wrong - I think the whole "add debug code to find places where we might have issues" was well worth it, and resulted in improvements. But once the low-hanging fruit and the core code that everybody hits has been fixed, and people cannot apparently even be bothered with the other cases it finds (like the pccardd case), at that point the value of the debug code becomes rather less obvious. And the downsides become bigger. The pccardd example is an example of legacy use of our old and original semantics of how the whole nested sleep was supposed to work. And it *does* work. It's not a bug. It's how things have worked time immemorial, and clearly nobody is really willing to bother with changing working - but legacy - cardbus code. And at that point, I think the debug code is actually *wrong*, and causes more problems than it "fixes". And debug code that causes more problems that it fixes should either be removed, or improved to the point where the problems go away. The "improved" part might be about saying "it's actually perfectly _fine_ to have nested sleeps, as long as it is truly rare that the nested sleep actually sleeps". And thus make the debug code really test that it's *rare*, rather than test that it never happens. Warn if it happens more than a couple of times a second for any particular process, or something like that. Hmm? Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-31 21:54 ` Linus Torvalds @ 2015-02-02 13:19 ` Peter Zijlstra 0 siblings, 0 replies; 31+ messages in thread From: Peter Zijlstra @ 2015-02-02 13:19 UTC (permalink / raw) To: Linus Torvalds Cc: Oleg Nesterov, Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith, Julia Lawall, Gilles Muller, Nicolas Palix, Michal Marek On Sat, Jan 31, 2015 at 01:54:36PM -0800, Linus Torvalds wrote: > On Sat, Jan 31, 2015 at 12:16 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > All the stuff it flagged are genuinely wrong, albeit not disastrously > > so, things mostly just work. > > I really disagree. > > They weren't wrong. They *could* occasionally result in extra > reschedules, but that was never wrong to begin with. > > But the debugging code made that much much worse, and made that "could > result in extra reschedules" happen all the time if it triggered. Yes, that was a bad choice. Again, sorry about that. > So you say "genuinely wrong", and I say "but that's how things were > designed - it's an optimistic approach, not an exact one". Your > debugging code changed that behavior, and actually introduced a real > bug, exactly because you felt that the "no nested sleeps" was a harder > requirement than it has ever actually been. > > In other words, I think the debugging code itself is wrong, and then > that sched_annotate_sleep() thing is just a symptom of how it is > wrong. If you have to sprinkle these kinds of random workarounds in a > few core scheduler places (ok, mainly "wait()" paths, it looks like), > why would you expect random *drivers* to have to care about things > that even core kernel code says "I'm not going to care about this, > I'll just shut the warning up, because the warning is wrong". I was not aware this pattern was so wide spread / accepted. I have totally underestimated the amount of fallout here. > So don't get me wrong - I think the whole "add debug code to find > places where we might have issues" was well worth it, and resulted in > improvements. > > But once the low-hanging fruit and the core code that everybody hits > has been fixed, and people cannot apparently even be bothered with the > other cases it finds (like the pccardd case), at that point the value > of the debug code becomes rather less obvious. My bad (again) for not paying proper attention there. No excuses for that. > The pccardd example is an example of legacy use of our old and > original semantics of how the whole nested sleep was supposed to work. > And it *does* work. It's not a bug. It's how things have worked time > immemorial, and clearly nobody is really willing to bother with > changing working - but legacy - cardbus code. And at that point, I > think the debug code is actually *wrong*, and causes more problems > than it "fixes". > > And debug code that causes more problems that it fixes should either > be removed, or improved to the point where the problems go away. > > The "improved" part might be about saying "it's actually perfectly > _fine_ to have nested sleeps, as long as it is truly rare that the > nested sleep actually sleeps". And thus make the debug code really > test that it's *rare*, rather than test that it never happens. Warn if > it happens more than a couple of times a second for any particular > process, or something like that. So the thing that makes it work is the fact that schedule is called in a loop, without that loop things come apart very quickly. Now most core primitives have this loop, and are safe from spurious wakeups -- and I think we can class this under that. But there's heaps of legacy code that has non looping calls of schedule and really breaks with spurious wakeups (I ran into that a few years ago when I made schedule() return 'randomly' for some other reason). So I worry about separating nesting sleep in loops, which are mostly harmless vs the non-loop case which is really bad. FWIW the bug that started all this was someone calling blocking functions from io_schedule(): io_schedule() -> blk_flush_plug() -> block-layer-magic -> device-magic() -> mutex_lock(). And mutex_lock(() used: if (need_resched()) schedule(); for preemption and would never return (because !TASK_RUNNING). Now, we've since fixed the mutex code to not assume TASK_RUNNING, so we won't actually trigger the reported 'deadlock' anymore. In any case, I can certainly make the warning go away for now and try again later with a (hopefully) more intelligent version. On a related note, would it be possible to make sparse/coccinelle try and warn on broken wait primitives? ie, no loop around schedule(), no conditional after set_current_state()? Examples: drivers/atm/atmtcp.c- while (test_bit(flag,&vcc->flags) == old_test) { drivers/atm/atmtcp.c- mb(); drivers/atm/atmtcp.c- out_vcc = PRIV(vcc->dev) ? PRIV(vcc->dev)->vcc : NULL; drivers/atm/atmtcp.c- if (!out_vcc) { drivers/atm/atmtcp.c- error = -EUNATCH; drivers/atm/atmtcp.c- break; drivers/atm/atmtcp.c- } set_bit(flag, &vcc->flags); wake_up_process(foo); drivers/atm/atmtcp.c- set_current_state(TASK_UNINTERRUPTIBLE); drivers/atm/atmtcp.c: schedule(); drivers/atm/atmtcp.c- } Would not ever wake again it seems. And I'm not entire sure what this is supposed to do, but it looks fishy (seems to be popular though, I've found more instances of it): drivers/iommu/amd_iommu_v2.c-static void put_device_state_wait(struct device_state *dev_state) drivers/iommu/amd_iommu_v2.c-{ drivers/iommu/amd_iommu_v2.c- DEFINE_WAIT(wait); drivers/iommu/amd_iommu_v2.c- drivers/iommu/amd_iommu_v2.c- prepare_to_wait(&dev_state->wq, &wait, TASK_UNINTERRUPTIBLE); drivers/iommu/amd_iommu_v2.c- if (!atomic_dec_and_test(&dev_state->count)) drivers/iommu/amd_iommu_v2.c: schedule(); drivers/iommu/amd_iommu_v2.c- finish_wait(&dev_state->wq, &wait); drivers/iommu/amd_iommu_v2.c- drivers/iommu/amd_iommu_v2.c- free_device_state(dev_state); drivers/iommu/amd_iommu_v2.c-} Did they want to write: if (atomic_dec_and_test(&dev_state->count)) { wake_up(&dev_state->wq); return; } wait_event(dev_state->wq, !atomic_read(&dev_state->count)); Also, people seem to have gotten the memo that sched_yield() is undesired, but not actually understood why; there's a absolute ton of: while(!foo) schedule(); Some explicitly set TASK_RUNNING, most not so much. I realize we're not ever going to fix all of that; but it would be good to avoid growing more of it. And I think the amd_iommu_v2 thing shows its not only legacy code. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-31 20:16 ` Peter Zijlstra 2015-01-31 21:54 ` Linus Torvalds @ 2015-02-01 19:43 ` Oleg Nesterov 2015-02-01 20:09 ` Linus Torvalds 1 sibling, 1 reply; 31+ messages in thread From: Oleg Nesterov @ 2015-02-01 19:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith On 01/31, Peter Zijlstra wrote: > > On Sat, Jan 31, 2015 at 10:32:23AM -0800, Linus Torvalds wrote: > > On Fri, Jan 30, 2015 at 7:47 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > Perhaps sched_annotate_sleep() shouldn't depend on CONFIG_DEBUG_ATOMIC_SLEEP > > > too... > > > > Ugh. That thing is horrible. The naming doesn't make it obvious at all > > that it's actually making sure that we have state set to TASK_RUNNING, > > and I could easily imagine that it would cause similar "busy-loops > > while scheduling" issues if anybody ever uses it in the wrong context. > > The alternative was putting unconditional __set_task_state(TASK_RUNNING) > things in a few code paths. I didn't want to cause the extra code in > case we didn't need them. Particularly the include/net/sock.h one, as I > know the network people are cycle counters. And personally I agree. sched_annotate_sleep() looks self-documented, it is clear that it is used to suppress the warning. Still. Can't we avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds? Oleg. --- x/include/linux/kernel.h +++ x/include/linux/kernel.h @@ -176,7 +176,7 @@ extern int _cond_resched(void); */ # define might_sleep() \ do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0) -# define sched_annotate_sleep() __set_current_state(TASK_RUNNING) +# define sched_annotate_sleep() (current->task_state_change = 0) #else static inline void ___might_sleep(const char *file, int line, int preempt_offset) { } --- x/kernel/sched/core.c +++ x/kernel/sched/core.c @@ -7292,7 +7292,7 @@ void __might_sleep(const char *file, int * since we will exit with TASK_RUNNING make sure we enter with it, * otherwise we will destroy state. */ - if (WARN_ONCE(current->state != TASK_RUNNING, + if (WARN_ONCE(current->state != TASK_RUNNING && current->task_state_change, "do not call blocking ops when !TASK_RUNNING; " "state=%lx set at [<%p>] %pS\n", current->state, ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-02-01 19:43 ` Oleg Nesterov @ 2015-02-01 20:09 ` Linus Torvalds 2015-02-01 20:19 ` Oleg Nesterov 2015-02-02 15:34 ` Peter Zijlstra 0 siblings, 2 replies; 31+ messages in thread From: Linus Torvalds @ 2015-02-01 20:09 UTC (permalink / raw) To: Oleg Nesterov Cc: Peter Zijlstra, Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith On Sun, Feb 1, 2015 at 11:43 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > And personally I agree. sched_annotate_sleep() looks self-documented, it > is clear that it is used to suppress the warning. But *that's not the problem*. If it was just silencing the warning, things would be fine. But it is actively screwing task state up, and actually changing behavior of the kernel (in a very subtle part of the code too), and doing so in ways that potentially introduce WORSE BUGS THAN THE WHOLE DAMN DEBUG SUPPORT WAS SUPPOSED TO FIND IN THE FIRST PLACE. That's the thing that upsets me. This is debug code. It's not even debugging things that are "buggy" - its' just finding things that can be potentially slightly inefficient. And it already introduced a subtle bug once. (Ok, it wasn't *that* subtle in the end, but it needed both a good bisection result and some reading of unrelated source code to figure out, so it certainly wasn't really obvious). > Still. Can't we avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP > adds? Now this patch I agree with 100% percent, apart from you calling it a "subtle change". It's more than a "subtle change", this was adding a real and present bug that took two weeks to get fixed! (And by "took two weeks to get fixed" I mean "not actually fixed yet, but now we at least know what was going on"). I like your patch, but I'm going to combine it with mine that actually fixes a real bug, because what you don't see in that patch of yours: > --- x/include/linux/kernel.h > +++ x/include/linux/kernel.h > @@ -176,7 +176,7 @@ extern int _cond_resched(void); > */ > # define might_sleep() \ > do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0) > -# define sched_annotate_sleep() __set_current_state(TASK_RUNNING) > +# define sched_annotate_sleep() (current->task_state_change = 0) > #else > static inline void ___might_sleep(const char *file, int line, > int preempt_offset) { } > --- x/kernel/sched/core.c > +++ x/kernel/sched/core.c > @@ -7292,7 +7292,7 @@ void __might_sleep(const char *file, int > * since we will exit with TASK_RUNNING make sure we enter with it, > * otherwise we will destroy state. > */ > - if (WARN_ONCE(current->state != TASK_RUNNING, > + if (WARN_ONCE(current->state != TASK_RUNNING && current->task_state_change, > "do not call blocking ops when !TASK_RUNNING; " > "state=%lx set at [<%p>] %pS\n", > current->state, is that the whole "if (WARN_ONCE()" remains horribly buggy, because of the line that follows it: __set_current_state(TASK_RUNNING); in the if-statement. Now, I have the patch that removes that thing (but I was hoping to get it from the scheduler tree before doing rc7, which seems to not have happened), but yes, that together with your patch seems like it should fix all the nasty bug-inducing crud where the "debugging helpers" end up silently changing core process state. I'll just combine it with yours to avoid extra noise in this area, and mark you as the author, fixing *both* of the incorrect state changes. Ok? Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-02-01 20:09 ` Linus Torvalds @ 2015-02-01 20:19 ` Oleg Nesterov 2015-02-02 15:34 ` Peter Zijlstra 1 sibling, 0 replies; 31+ messages in thread From: Oleg Nesterov @ 2015-02-01 20:19 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith On 02/01, Linus Torvalds wrote: > > On Sun, Feb 1, 2015 at 11:43 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > > > And personally I agree. sched_annotate_sleep() looks self-documented, it > > is clear that it is used to suppress the warning. > > But *that's not the problem*. > > If it was just silencing the warning, things would be fine. > > But it is actively screwing task state up, and actually changing > behavior of the kernel (in a very subtle part of the code too), and > doing so in ways that potentially introduce WORSE BUGS THAN THE WHOLE > DAMN DEBUG SUPPORT WAS SUPPOSED TO FIND IN THE FIRST PLACE. I understand, that is why I suggested to change it. > I like your patch, but I'm going to combine it with mine that actually > fixes a real bug, Sure, I agree. > because what you don't see in that patch of yours: > ... > > > is that the whole "if (WARN_ONCE()" remains horribly buggy, because of > the line that follows it: > > __set_current_state(TASK_RUNNING); > > in the if-statement. Ah. I just forgot to mention that this change should be rediffed on top of your patch, of course it is not enough. > I'll just combine it with yours to avoid extra noise in this area, and > mark you as the author, fixing *both* of the incorrect state changes. > Ok? Please combine them, but don't mark me as an author, I do not want to take the false credits ;) Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-02-01 20:09 ` Linus Torvalds 2015-02-01 20:19 ` Oleg Nesterov @ 2015-02-02 15:34 ` Peter Zijlstra 1 sibling, 0 replies; 31+ messages in thread From: Peter Zijlstra @ 2015-02-02 15:34 UTC (permalink / raw) To: Linus Torvalds Cc: Oleg Nesterov, Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith On Sun, Feb 01, 2015 at 12:09:32PM -0800, Linus Torvalds wrote: > Now, I have the patch that removes that thing (but I was hoping to get > it from the scheduler tree before doing rc7, which seems to not have > happened), but yes, that together with your patch seems like it should > fix all the nasty bug-inducing crud where the "debugging helpers" end > up silently changing core process state. > > I'll just combine it with yours to avoid extra noise in this area, and > mark you as the author, fixing *both* of the incorrect state changes. > Ok? Ah I see it in your tree; I was about to suggest: -# define sched_annotate_sleep() __set_current_state(TASK_RUNNING) +# define sched_annotate_sleep() do { current->task_state_change = 0; } while (0) Instead of the assignment, which has a rvalue. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-30 1:25 ` Linus Torvalds ` (2 preceding siblings ...) 2015-01-30 15:47 ` Oleg Nesterov @ 2015-01-31 9:16 ` Bruno Prémont 2015-01-31 9:48 ` Peter Zijlstra 2015-02-05 21:14 ` Bruno Prémont 5 siblings, 0 replies; 31+ messages in thread From: Bruno Prémont @ 2015-01-31 9:16 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Peter Zijlstra, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith, Oleg Nesterov On Thu, 29 Jan 2015 17:25:07 Linus Torvalds wrote: > On Thu, Jan 29, 2015 at 5:12 PM, Linus Torvalds wrote: > > > > The WARN() was already changed to a WARN_ONCE(). > > Oh, but I notice that the "__set_current_state(TASK_RUNNING) ends up > always happening. > > So I think the right fix is to: > > - warn once like we do > > - but *not* do that __set_current_state() which was always total > crap anyway > > Why do I say "total crap"? Because of two independent issues: > > (a) it actually changes behavior for a debug vs non-debug kernel, > which is a really bad idea to begin with > > (b) it's really wrong. The whole "nested sleep" case was never a > major bug to begin with, just a possible inefficiency where constant > nested sleeps would possibly make the outer sleep not sleep. But that > "could possibly make" case was the unlikely case, and the debug patch > made it happen *all* the time by explicitly setting things running. > > So I think the proper patch is the attached. > > The comment is also crap. The comment says > > "Blocking primitives will set (and therefore destroy) > current->state [...]" > > but the reality is that they *may* set it, and only in the unlikely > slow-path where they actually block. > > So doing this in "__may_sleep()" is just bogus and horrible horrible > crap. It turns the "harmless ugliness" into a real *harmful* bug. The > key word of "__may_sleep()" is that "MAY" part. It's a debug thing to > make relatively rare cases show up. > > PeterZ, please don't make "debugging" patches like this. Ever again. > Because this was just stupid, and it took me too long to realize that > despite the warning being shut up, the debug patch was still actively > doing bad bad things. > > Ingo, maybe you'd want to apply this through the scheduler tree, the > way you already did the WARN_ONCE() thing. > > Bruno, does this finally actually fix your pccard thing? I will report back on Wednesday when I'm back home from FOSDEM. I don't have the affected machine at hand at the moment. Thanks for looking into it! Bruno > Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-30 1:25 ` Linus Torvalds ` (3 preceding siblings ...) 2015-01-31 9:16 ` Bruno Prémont @ 2015-01-31 9:48 ` Peter Zijlstra 2015-02-03 11:18 ` Ingo Molnar 2015-02-05 21:14 ` Bruno Prémont 5 siblings, 1 reply; 31+ messages in thread From: Peter Zijlstra @ 2015-01-31 9:48 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith, Oleg Nesterov On Thu, Jan 29, 2015 at 05:25:07PM -0800, Linus Torvalds wrote: > PeterZ, please don't make "debugging" patches like this. Ever again. > Because this was just stupid, and it took me too long to realize that > despite the warning being shut up, the debug patch was still actively > doing bad bad things. Understood. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-31 9:48 ` Peter Zijlstra @ 2015-02-03 11:18 ` Ingo Molnar 0 siblings, 0 replies; 31+ messages in thread From: Ingo Molnar @ 2015-02-03 11:18 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Rafael J. Wysocki, Peter Hurley, Davidlohr Bueso, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith, Oleg Nesterov * Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Jan 29, 2015 at 05:25:07PM -0800, Linus Torvalds wrote: > > > PeterZ, please don't make "debugging" patches like this. Ever > > again. Because this was just stupid, and it took me too long > > to realize that despite the warning being shut up, the debug > > patch was still actively doing bad bad things. > > Understood. It was my fault too - I should have realized this side effect of the new debugging facility when I merged it, which side effect, as Linus pointed it out, is a really bad idea. We are generally pretty good at doing transparent debugging in the scheduler and in the locking code, but screwed up this time around :-( Thanks, Ingo ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-30 1:25 ` Linus Torvalds ` (4 preceding siblings ...) 2015-01-31 9:48 ` Peter Zijlstra @ 2015-02-05 21:14 ` Bruno Prémont 2015-02-06 11:50 ` Peter Zijlstra 5 siblings, 1 reply; 31+ messages in thread From: Bruno Prémont @ 2015-02-05 21:14 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Peter Zijlstra, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith, Oleg Nesterov On Thu, 29 January 2015 Linus Torvalds wrote: > On Thu, Jan 29, 2015 at 5:12 PM, Linus Torvalds wrote: > > > > The WARN() was already changed to a WARN_ONCE(). > > Oh, but I notice that the "__set_current_state(TASK_RUNNING) ends up > always happening. > > So I think the right fix is to: > > - warn once like we do > > - but *not* do that __set_current_state() which was always total crap anyway > > Why do I say "total crap"? Because of two independent issues: > > (a) it actually changes behavior for a debug vs non-debug kernel, > which is a really bad idea to begin with > > (b) it's really wrong. The whole "nested sleep" case was never a > major bug to begin with, just a possible inefficiency where constant > nested sleeps would possibly make the outer sleep not sleep. But that > "could possibly make" case was the unlikely case, and the debug patch > made it happen *all* the time by explicitly setting things running. > > So I think the proper patch is the attached. > > The comment is also crap. The comment says > > "Blocking primitives will set (and therefore destroy) current->state [...]" > > but the reality is that they *may* set it, and only in the unlikely > slow-path where they actually block. > > So doing this in "__may_sleep()" is just bogus and horrible horrible > crap. It turns the "harmless ugliness" into a real *harmful* bug. The > key word of "__may_sleep()" is that "MAY" part. It's a debug thing to > make relatively rare cases show up. > > PeterZ, please don't make "debugging" patches like this. Ever again. > Because this was just stupid, and it took me too long to realize that > despite the warning being shut up, the debug patch was still actively > doing bad bad things. > > Ingo, maybe you'd want to apply this through the scheduler tree, the > way you already did the WARN_ONCE() thing. > > Bruno, does this finally actually fix your pccard thing? Tested the variant that was applied by running rc7 and it fixes the endless loop. The loop is now replaced by a single WARN() trace - I guess expected: [ 3.083647] ------------[ cut here ]------------ [ 3.087477] WARNING: CPU: 0 PID: 67 at /usr/src/linux-git/kernel/sched/core.c:7300 __might_sleep+0x79/0x90() [ 3.091357] do not call blocking ops when !TASK_RUNNING; state=1 set at [<c1442040>] pccardd+0xa0/0x3e0 [ 3.095232] Modules linked in: [ 3.099020] CPU: 0 PID: 67 Comm: pccardd Not tainted 3.19.0-rc7-00003-g67288c4 #17 [ 3.102760] Hardware name: Acer TravelMate 660/TravelMate 660, BIOS 3A19 01/14/2004 [ 3.106504] c212def4 c212def4 c212deb4 c16caf23 c212dee4 c10416fd c1907334 c212df10 [ 3.110315] 00000043 c1907380 00001c84 c105a099 c105a099 c1442040 00000001 f5f54bc0 [ 3.114143] c212defc c104176e 00000009 c212def4 c1907334 c212df10 c212df30 c105a099 [ 3.117960] Call Trace: [ 3.121703] [<c16caf23>] dump_stack+0x16/0x18 [ 3.125447] [<c10416fd>] warn_slowpath_common+0x7d/0xc0 [ 3.129172] [<c105a099>] ? __might_sleep+0x79/0x90 [ 3.132868] [<c105a099>] ? __might_sleep+0x79/0x90 [ 3.136500] [<c1442040>] ? pccardd+0xa0/0x3e0 [ 3.140092] [<c104176e>] warn_slowpath_fmt+0x2e/0x30 [ 3.143657] [<c105a099>] __might_sleep+0x79/0x90 [ 3.147209] [<c1442040>] ? pccardd+0xa0/0x3e0 [ 3.150747] [<c1442040>] ? pccardd+0xa0/0x3e0 [ 3.154256] [<c16cf447>] mutex_lock+0x17/0x2a [ 3.157734] [<c1442089>] pccardd+0xe9/0x3e0 [ 3.161207] [<c1441fa0>] ? pcmcia_socket_uevent+0x30/0x30 [ 3.164660] [<c1056990>] kthread+0xa0/0xc0 [ 3.168059] [<c16d1040>] ret_from_kernel_thread+0x20/0x30 [ 3.171436] [<c10568f0>] ? kthread_worker_fn+0x140/0x140 [ 3.174796] ---[ end trace c3f708b642e3d8f0 ]--- >From my reading of the thread fixing pccardd/sched TASK_RUNNING usage/check is another issue left for the future. Thanks, Bruno > Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-02-05 21:14 ` Bruno Prémont @ 2015-02-06 11:50 ` Peter Zijlstra 2015-02-06 16:02 ` Linus Torvalds 0 siblings, 1 reply; 31+ messages in thread From: Peter Zijlstra @ 2015-02-06 11:50 UTC (permalink / raw) To: Bruno Prémont Cc: Linus Torvalds, Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith, Oleg Nesterov On Thu, Feb 05, 2015 at 10:14:36PM +0100, Bruno Prémont wrote: > The loop is now replaced by a single WARN() trace - I guess expected: > From my reading of the thread fixing pccardd/sched TASK_RUNNING usage/check > is another issue left for the future. Yeah, something like the below will make it go away -- under the assumption that that comment is actually correct, I don't know, the pcmcia people should probably write a better comment :/ Also, set_current_state(TASK_RUNNING) is almost always pointless, nobody cares about that barrier, so make it go away. --- drivers/pcmcia/cs.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/pcmcia/cs.c b/drivers/pcmcia/cs.c index 5292db69c426..5678e161a17d 100644 --- a/drivers/pcmcia/cs.c +++ b/drivers/pcmcia/cs.c @@ -635,6 +635,12 @@ static int pccardd(void *__skt) skt->sysfs_events = 0; spin_unlock_irqrestore(&skt->thread_lock, flags); + /* + * Supposedly this is a rarely contended mutex and + * sleeping is therefore unlikely, the occasional + * extra loop iteration is harmless. + */ + sched_annotate_sleep(); mutex_lock(&skt->skt_mutex); if (events & SS_DETECT) socket_detect_change(skt); @@ -679,7 +685,7 @@ static int pccardd(void *__skt) try_to_freeze(); } /* make sure we are running before we exit */ - set_current_state(TASK_RUNNING); + __set_current_state(TASK_RUNNING); /* shut down socket, if a device is still present */ if (skt->state & SOCKET_PRESENT) { ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-02-06 11:50 ` Peter Zijlstra @ 2015-02-06 16:02 ` Linus Torvalds 2015-02-06 16:39 ` Peter Zijlstra 0 siblings, 1 reply; 31+ messages in thread From: Linus Torvalds @ 2015-02-06 16:02 UTC (permalink / raw) To: Peter Zijlstra Cc: Bruno Prémont, Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith, Oleg Nesterov On Fri, Feb 6, 2015 at 3:50 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > Also, set_current_state(TASK_RUNNING) is almost always pointless, nobody > cares about that barrier, so make it go away. I'd rather not mix this with the patch, and wonder if we should just do that globally with some preprocessor magic. We do have a fair number of "set_current_state(TASK_RUNNING)" and at least for the *documented* reason for the memory barrier, all of them could/should be barrier-less. So something like if (__is_constant_p(state) && state == TASK_RUNNING) tsk->state = state; else set_mb(tsk->state, state); might be more general solution than randomly doing one at a time when changing code around it.. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-02-06 16:02 ` Linus Torvalds @ 2015-02-06 16:39 ` Peter Zijlstra 2015-02-06 16:46 ` Linus Torvalds 0 siblings, 1 reply; 31+ messages in thread From: Peter Zijlstra @ 2015-02-06 16:39 UTC (permalink / raw) To: Linus Torvalds Cc: Bruno Prémont, Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith, Oleg Nesterov On Fri, Feb 06, 2015 at 08:02:41AM -0800, Linus Torvalds wrote: > On Fri, Feb 6, 2015 at 3:50 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > Also, set_current_state(TASK_RUNNING) is almost always pointless, nobody > > cares about that barrier, so make it go away. > > I'd rather not mix this with the patch, and wonder if we should just > do that globally with some preprocessor magic. We do have a fair > number of "set_current_state(TASK_RUNNING)" 138 > and at least for the > *documented* reason for the memory barrier, all of them could/should > be barrier-less. > > So something like > > if (__is_constant_p(state) && state == TASK_RUNNING) > tsk->state = state; > else > set_mb(tsk->state, state); > > might be more general solution than randomly doing one at a time when > changing code around it.. Yeah, or we could do the coccinelle thing and do a mass conversion. I like the macro one though; I worry a wee bit about non-documented cases through. If someone is doing something way subtle we'll break it :/ --- Subject: sched: Avoid the full memory-barrier for set_current_state(TASK_RUNNING) One should never need the full memory barrier implied by set_current_state() to set TASK_RUNNING for the documented reason of avoiding races against wakeup. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- diff --git a/include/linux/sched.h b/include/linux/sched.h index 8db31ef98d2f..aea44c4eeed8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -243,6 +243,27 @@ extern char ___assert_task_state[1 - 2*!!( ((task->state & TASK_UNINTERRUPTIBLE) != 0 && \ (task->flags & PF_FROZEN) == 0) +/* + * set_current_state() includes a barrier so that the write of current->state + * is correctly serialised wrt the caller's subsequent test of whether to + * actually sleep: + * + * set_current_state(TASK_UNINTERRUPTIBLE); + * if (do_i_need_to_sleep()) + * schedule(); + * + * If the caller does not need such serialisation then use + * __set_current_state(). This is always true for TASK_RUNNING since + * there is no race against wakeup -- both write the same value. + */ +#define ___set_current_state(state) \ +do { \ + if (__is_constant_p(state) && (state) == TASK_RUNNING) \ + current->state = (state); \ + else \ + set_mb(current->state, (state)); \ +} while (0) + #ifdef CONFIG_DEBUG_ATOMIC_SLEEP #define __set_task_state(tsk, state_value) \ @@ -256,17 +277,6 @@ extern char ___assert_task_state[1 - 2*!!( set_mb((tsk)->state, (state_value)); \ } while (0) -/* - * set_current_state() includes a barrier so that the write of current->state - * is correctly serialised wrt the caller's subsequent test of whether to - * actually sleep: - * - * set_current_state(TASK_UNINTERRUPTIBLE); - * if (do_i_need_to_sleep()) - * schedule(); - * - * If the caller does not need such serialisation then use __set_current_state() - */ #define __set_current_state(state_value) \ do { \ current->task_state_change = _THIS_IP_; \ @@ -275,7 +285,7 @@ extern char ___assert_task_state[1 - 2*!!( #define set_current_state(state_value) \ do { \ current->task_state_change = _THIS_IP_; \ - set_mb(current->state, (state_value)); \ + ___set_current_state(state_value); \ } while (0) #else @@ -285,21 +295,10 @@ extern char ___assert_task_state[1 - 2*!!( #define set_task_state(tsk, state_value) \ set_mb((tsk)->state, (state_value)) -/* - * set_current_state() includes a barrier so that the write of current->state - * is correctly serialised wrt the caller's subsequent test of whether to - * actually sleep: - * - * set_current_state(TASK_UNINTERRUPTIBLE); - * if (do_i_need_to_sleep()) - * schedule(); - * - * If the caller does not need such serialisation then use __set_current_state() - */ #define __set_current_state(state_value) \ do { current->state = (state_value); } while (0) #define set_current_state(state_value) \ - set_mb(current->state, (state_value)) + ___set_current_state(state_value); #endif ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-02-06 16:39 ` Peter Zijlstra @ 2015-02-06 16:46 ` Linus Torvalds 0 siblings, 0 replies; 31+ messages in thread From: Linus Torvalds @ 2015-02-06 16:46 UTC (permalink / raw) To: Peter Zijlstra Cc: Bruno Prémont, Rafael J. Wysocki, Ingo Molnar, Peter Hurley, Davidlohr Bueso, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith, Oleg Nesterov On Fri, Feb 6, 2015 at 8:39 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > I like the macro one though; I worry a wee bit about non-documented > cases through. If someone is doing something way subtle we'll break it > :/ Yeah, I agree. And I wonder how much we should even care. People who do this thing by hand - rather than using the wait-event model - are *likely* legacy stuff or just random drivers, or simply stuff that isn't all that performance-sensitive. So I dunno how worthwhile this all is. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-30 1:12 ` Linus Torvalds 2015-01-30 1:25 ` Linus Torvalds @ 2015-01-30 1:49 ` Rafael J. Wysocki 2015-02-02 9:48 ` Zdenek Kabelac 1 sibling, 1 reply; 31+ messages in thread From: Rafael J. Wysocki @ 2015-01-30 1:49 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Hurley, Davidlohr Bueso, Peter Zijlstra, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith, Oleg Nesterov On Thursday, January 29, 2015 05:12:11 PM Linus Torvalds wrote: > On Thu, Jan 29, 2015 at 5:22 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > On Wednesday, January 21, 2015 05:54:00 PM Peter Hurley wrote: > >> > >> Yeah, but the debug check is triggering worse behavior, requiring > >> bisecting back to the debug commit. > > > > Yes, it is. > > > > So I'm wondering is anyone is working on fixing this in any way? > > > > It kind of sucks when this is happening on an otherwise perfectly usable > > old(ish) machine ... > > The WARN() was already changed to a WARN_ONCE(). > > So that debug check doesn't cause problems any more. If somebody is > bisecting something else, and the WARN() is a problem for those > intermediate kernels, then just disabling CONFIG_DEBUG_ATOMIC_SLEEP > should get you past that point. > > IOW, this really shouldn't be an issue. > > Does the pccard thing still not work? Interestingly enough, if the kernel is built with CONFIG_DEBUG_ATOMIC_SLEEP unset, the problem with 99+% CPU load from pccardd goes away, so thanks for the hint. Rafael ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Linux 3.19-rc5 2015-01-30 1:49 ` Rafael J. Wysocki @ 2015-02-02 9:48 ` Zdenek Kabelac 0 siblings, 0 replies; 31+ messages in thread From: Zdenek Kabelac @ 2015-02-02 9:48 UTC (permalink / raw) To: Rafael J. Wysocki, Linus Torvalds Cc: Peter Hurley, Davidlohr Bueso, Peter Zijlstra, Bruno Prémont, Linux Kernel Mailing List, Thomas Gleixner, Ilya Dryomov, Mike Galbraith, Oleg Nesterov Dne 30.1.2015 v 02:49 Rafael J. Wysocki napsal(a): > On Thursday, January 29, 2015 05:12:11 PM Linus Torvalds wrote: >> On Thu, Jan 29, 2015 at 5:22 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >>> On Wednesday, January 21, 2015 05:54:00 PM Peter Hurley wrote: >>>> >>>> Yeah, but the debug check is triggering worse behavior, requiring >>>> bisecting back to the debug commit. >>> >>> Yes, it is. >>> >>> So I'm wondering is anyone is working on fixing this in any way? >>> >>> It kind of sucks when this is happening on an otherwise perfectly usable >>> old(ish) machine ... >> >> The WARN() was already changed to a WARN_ONCE(). >> >> So that debug check doesn't cause problems any more. If somebody is >> bisecting something else, and the WARN() is a problem for those >> intermediate kernels, then just disabling CONFIG_DEBUG_ATOMIC_SLEEP >> should get you past that point. >> >> IOW, this really shouldn't be an issue. >> >> Does the pccard thing still not work? > > Interestingly enough, if the kernel is built with CONFIG_DEBUG_ATOMIC_SLEEP > unset, the problem with 99+% CPU load from pccardd goes away, so thanks for > the hint. Ok - I can now 'use' 3.19-rc7 on T61 (C2D) without having a single core occupied on 100% with pccardd, but I still get this one logged: [ 2.185320] pcmcia_socket pcmcia_socket0: cs: memory probe 0x0c0000-0x0fffff: [ 2.185363] excluding 0xc0000-0xcffff 0xe0000-0xfffff [ 2.185526] pcmcia_socket pcmcia_socket0: cs: memory probe 0xa0000000-0xa0ffffff: [ 2.185559] excluding 0xa0000000-0xa0ffffff [ 2.185720] pcmcia_socket pcmcia_socket0: cs: memory probe 0x60000000-0x60ffffff: [ 2.185754] excluding 0x60000000-0x60ffffff [ 2.297692] ------------[ cut here ]------------ [ 2.297698] WARNING: CPU: 1 PID: 185 at kernel/sched/core.c:7300 __might_sleep+0xae/0xc0() [ 2.297702] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff81518e18>] pccardd+0xe8/0x490 [ 2.297712] Modules linked in: uhci_hcd sr_mod cdrom ehci_pci ehci_hcd yenta_socket usbcore usb_common video backlight autofs4 [ 2.297715] CPU: 1 PID: 185 Comm: pccardd Not tainted 3.19.0-rc7-00002-g9afdf96 #5 [ 2.297716] Hardware name: LENOVO 6464CTO/6464CTO, BIOS 7LETC9WW (2.29 ) 03/18/2011 [ 2.297719] ffffffff819e4d8a ffff8800b8bdbc58 ffffffff8163926d ffffffff810a9e1e [ 2.297721] ffff8800b8bdbca8 ffff8800b8bdbc98 ffffffff810587ba ffff880000000000 [ 2.297724] ffffffff819e6073 000000000000026d 0000000000000000 0000000000000080 [ 2.297725] Call Trace: [ 2.297729] [<ffffffff8163926d>] dump_stack+0x4f/0x7b [ 2.297732] [<ffffffff810a9e1e>] ? down_trylock+0x2e/0x40 [ 2.297736] [<ffffffff810587ba>] warn_slowpath_common+0x8a/0xc0 [ 2.297738] [<ffffffff81058836>] warn_slowpath_fmt+0x46/0x50 [ 2.297741] [<ffffffff81518e18>] ? pccardd+0xe8/0x490 [ 2.297742] [<ffffffff81518e18>] ? pccardd+0xe8/0x490 [ 2.297744] [<ffffffff8108412e>] __might_sleep+0xae/0xc0 [ 2.297747] [<ffffffff8163cd7e>] mutex_lock_nested+0x2e/0x440 [ 2.297749] [<ffffffff8164189d>] ? _raw_spin_unlock_irqrestore+0x5d/0x80 [ 2.297753] [<ffffffff8108b42b>] ? preempt_count_sub+0xab/0x100 [ 2.297755] [<ffffffff81518e80>] pccardd+0x150/0x490 [ 2.297757] [<ffffffff81518d30>] ? pcmcia_socket_dev_complete+0x40/0x40 [ 2.297760] [<ffffffff8107db1f>] kthread+0x11f/0x140 [ 2.297763] [<ffffffff8107da00>] ? kthread_create_on_node+0x260/0x260 [ 2.297766] [<ffffffff8164242c>] ret_from_fork+0x7c/0xb0 [ 2.297768] [<ffffffff8107da00>] ? kthread_create_on_node+0x260/0x260 [ 2.297770] ---[ end trace ed9e591061d223e6 ]--- [ 2.464635] usb 3-1: new full-speed USB device number 2 using uhci_hcd and of course number of these: [ 29.660708] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device [ 29.667061] i915 0000:00:02.0: registered panic notifier [ 30.314206] ------------[ cut here ]------------ [ 30.318866] WARNING: CPU: 0 PID: 1010 at drivers/gpu/drm/drm_irq.c:1121 drm_wait_one_vblank+0x180/0x190 [drm]() [ 30.329085] vblank not available on crtc 1, ret=-22 [ 30.334003] Modules linked in: i915 i2c_algo_bit drm_kms_helper drm xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun brid ge stp llc ipv6 ebtables ip6_tables iptable_filter ip_tables x_tables bnep iTCO_wdt hid_generic iTCO_vendor_support snd_hda_codec_analo g snd_hda_codec_generic coretemp kvm_intel kvm microcode usbhid psmouse serio_raw hid arc4 r852 sm_common nand i2c_i801 nand_ecc i2c_co re nand_ids mtd btusb bluetooth iwl3945 sdhci_pci r592 iwlegacy memstick pcmcia snd_hda_intel sdhci mac80211 snd_hda_controller mmc_cor e snd_hda_codec snd_hwdep lpc_ich snd_seq mfd_core snd_seq_device snd_pcm e1000e cfg80211 ptp thinkpad_acpi snd_timer pps_core wmi nvra m [ 30.406592] snd soundcore evdev nfsd auth_rpcgss oid_registry nfs_acl lockd grace binfmt_misc loop sunrpc uhci_hcd sr_mod cdrom ehc i_pci ehci_hcd yenta_socket usbcore usb_common video backlight autofs4 [ 30.424031] CPU: 1 PID: 1010 Comm: Xorg.bin Tainted: G W 3.19.0-rc7-00002-g9afdf96 #5 [ 30.432930] Hardware name: LENOVO 6464CTO/6464CTO, BIOS 7LETC9WW (2.29 ) 03/18/2011 [ 30.440627] ffffffffa0e1017f ffff880135a37a28 ffffffff8163926d ffff88013abcf238 [ 30.448250] ffff880135a37a78 ffff880135a37a68 ffffffff810587ba 0000000000000001 [ 30.455780] ffff8800af5b4000 ffff880135974000 0000000000000001 ffff8800acca0000 [ 30.463328] Call Trace: [ 30.465850] [<ffffffff8163926d>] dump_stack+0x4f/0x7b [ 30.471066] [<ffffffff810587ba>] warn_slowpath_common+0x8a/0xc0 [ 30.477095] [<ffffffff81058836>] warn_slowpath_fmt+0x46/0x50 [ 30.482919] [<ffffffffa0de5a10>] drm_wait_one_vblank+0x180/0x190 [drm] [ 30.489683] [<ffffffffa0f11a70>] intel_enable_sdvo+0x70/0x120 [i915] [ 30.496206] [<ffffffffa0ed6241>] ? intel_enable_pipe+0xf1/0x220 [i915] [ 30.502914] [<ffffffffa0ede11b>] i9xx_crtc_enable+0x3fb/0x4b0 [i915] [ 30.509478] [<ffffffffa0edc692>] __intel_set_mode+0x8a2/0xcb0 [i915] [ 30.515995] [<ffffffffa0ee351b>] intel_crtc_set_config+0xc0b/0xf90 [i915] [ 30.522922] [<ffffffffa0def259>] drm_mode_set_config_internal+0x69/0x120 [drm] [ 30.530310] [<ffffffffa0e4e5d8>] restore_fbdev_mode+0xc8/0xf0 [drm_kms_helper] [ 30.537682] [<ffffffff811d1607>] ? kfree+0xe7/0x3b0 [ 30.542771] [<ffffffffa0e50649>] drm_fb_helper_restore_fbdev_mode_unlocked+0x29/0x80 [drm_kms_helper] [ 30.552175] [<ffffffffa0ef212e>] intel_fbdev_restore_mode+0x1e/0x50 [i915] [ 30.559259] [<ffffffffa0f16fee>] i915_driver_lastclose+0xe/0x10 [i915] [ 30.565916] [<ffffffffa0de1e1e>] drm_lastclose+0x2e/0x150 [drm] [ 30.571991] [<ffffffffa0de2280>] drm_release+0x340/0x550 [drm] [ 30.577986] [<ffffffff811ef5b3>] __fput+0xf3/0x210 [ 30.582926] [<ffffffff811ef71e>] ____fput+0xe/0x10 [ 30.587931] [<ffffffff8107bfdc>] task_work_run+0xcc/0x100 [ 30.593479] [<ffffffff81004181>] do_notify_resume+0x61/0x90 [ 30.599241] [<ffffffff816427c7>] int_signal+0x12/0x17 [ 30.604456] ---[ end trace ed9e591061d223e7 ]--- [ 30.609191] ------------[ cut here ]------------ Regards Zdenek ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2015-02-06 16:46 UTC | newest] Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-01-18 9:17 Linux 3.19-rc5 Linus Torvalds 2015-01-19 18:02 ` Bruno Prémont 2015-01-20 6:24 ` Linus Torvalds 2015-01-21 20:37 ` Bruno Prémont 2015-01-21 21:37 ` Bruno Prémont 2015-01-21 22:12 ` Davidlohr Bueso 2015-01-21 22:54 ` Peter Hurley 2015-01-30 1:22 ` Rafael J. Wysocki 2015-01-30 1:12 ` Linus Torvalds 2015-01-30 1:25 ` Linus Torvalds 2015-01-30 1:35 ` Linus Torvalds 2015-01-30 2:06 ` Rafael J. Wysocki 2015-01-30 15:47 ` Oleg Nesterov 2015-01-31 18:32 ` Linus Torvalds 2015-01-31 20:16 ` Peter Zijlstra 2015-01-31 21:54 ` Linus Torvalds 2015-02-02 13:19 ` Peter Zijlstra 2015-02-01 19:43 ` Oleg Nesterov 2015-02-01 20:09 ` Linus Torvalds 2015-02-01 20:19 ` Oleg Nesterov 2015-02-02 15:34 ` Peter Zijlstra 2015-01-31 9:16 ` Bruno Prémont 2015-01-31 9:48 ` Peter Zijlstra 2015-02-03 11:18 ` Ingo Molnar 2015-02-05 21:14 ` Bruno Prémont 2015-02-06 11:50 ` Peter Zijlstra 2015-02-06 16:02 ` Linus Torvalds 2015-02-06 16:39 ` Peter Zijlstra 2015-02-06 16:46 ` Linus Torvalds 2015-01-30 1:49 ` Rafael J. Wysocki 2015-02-02 9:48 ` Zdenek Kabelac
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).