* Linux 4.15-rc2 @ 2017-12-03 16:22 Linus Torvalds 2017-12-04 22:25 ` Linux 4.15-rc2: Regression in resume from ACPI S3 Rafael J. Wysocki 2018-02-21 18:36 ` Linux 4.15-rc2 Eugene Syromiatnikov 0 siblings, 2 replies; 83+ messages in thread From: Linus Torvalds @ 2017-12-03 16:22 UTC (permalink / raw) To: Linux Kernel Mailing List It's Sunday, but a few hours earlier than usual, since I'm on the east coast, three hours ahead of my normal release schedule. It's a slightly bigger rc2 than I would have wished for, but this early in the release process I don't worry about it. The appended shortlog gives the details, it's fixes all over the place - architectures, drivers, filesystems, networking, core kernel. One thing I'll point out is that I'm trying to get some kernel ASLR leaks plugged, and as part of that we now hash any pointers printed by "%p' by default. That won't affect a lot of people, but where it is a debugging problem (rather than leaking interesting kernel pointers), we will have to fix things up. It can be a small annoyance, but the alternatives (trying to actually find all the cases where we might be leaking) were worse. But let's see if anybody even notices - a lot of the pointer printouts are stale debug information from when some driver was originally written, and aren't actually really interesting. There will probably be some more leak fixes during this rc process, we'll see how that all sorts out. Linus --- Abhishek Goel (2): cpupowerutils: bench - Fix cpu online check cpupower : Fix cpupower working when cpu0 is offline Adrian Hunter (4): mmc: block: Fix missing blk_put_request() mmc: block: Check return value of blk_get_request() mmc: core: Do not leave the block driver in a suspended state mmc: block: Ensure that debugfs files are removed Ahmad Fatoum (1): e1000: Fix off-by-one in debug message Alex Deucher (4): drm/amdgpu/gfx7: cache raster_config values drm/amdgpu: used cached gca values for cik_read_register Revert "drm/amdgpu: fix rmmod KCQ disable failed error" drm/amdgpu: drop experimental flag for raven Amritha Nambiar (1): i40e: Fix reporting incorrect error codes Andrew Elble (2): nfsd: fix locking validator warning on nfs4_ol_stateid->st_mutex class nfsd: check for use of the closed special stateid Andrew Jiang (1): drm/amd/display: Don't reject 3D timings Andrew Waterman (3): RISC-V: Add VDSO entries for clock_get/gettimeofday/getcpu RISC-V: Flush I$ when making a dirty page executable RISC-V: Allow userspace to flush the instruction cache Andrey Grodzovsky (1): drm/amd/display: Switch to drm_atomic_helper_wait_for_flip_done Andrey Gusakov (6): drm/bridge: tc358767: do no fail on hi-res displays drm/bridge: tc358767: filter out too high modes drm/bridge: tc358767: fix DP0_MISC register set drm/bridge: tc358767: fix timing calculations drm/bridge: tc358767: fix AUXDATAn registers access drm/bridge: tc358767: fix 1-lane behavior Andy Shevchenko (1): scripts/bloat-o-meter: don't fail with division by 0 Antoine Tenart (4): net: mvpp2: fix the txq_init error path net: mvpp2: cleanup probed ports in the probe error path net: mvpp2: check ethtool sets the Tx ring size is to a valid min value net: phy: marvell10g: fix the PHY id mask Ard Biesheuvel (2): arm64: module-plts: factor out PLT generation code for ftrace arm64: ftrace: emit ftrace-mod.o contents through code Arnd Bergmann (2): drm/i915: fix intel_backlight_device_register declaration drm/omap: displays: panel-dpi: add backlight dependency Bartosz Golaszewski (1): eeprom: at24: correctly set the size for at24mac402 Bastian Stender (2): mmc: core: prepend 0x to pre_eol_info entry in sysfs mmc: core: prepend 0x to OCR entry in sysfs Benjamin Gaignard (1): ethernet: dwmac-stm32: Fix copyright Bhawanpreet Lakha (1): drm/amd/display: Add null check for 24BPP (xfm and dpp) Bhumika Goyal (3): sunrpc: make the function arg as const NFSD: make cache_detail structures const SUNRPC: make cache_detail structures const Bjorn Andersson (1): mmc: sdhci-msm: Optionally wait for signal level changes Carlos Maiolino (1): xfs: Properly retry failed dquot items in case of error during buffer writeback Changbin Du (1): drm/i915/gvt: Fix unsafe locking caused by spin_unlock_bh Chao Yu (1): quota: propagate error from __dquot_initialize Charlene Liu (2): drm/amd/display: fix seq issue: turn on clock before programming afmt. drm/amd/display: try to find matching audio inst for enc inst first Chris Wilson (3): drm/i915: Clear breadcrumb node when cancelling signaling drm/i915: Mark the userptr invalidate workqueue as WQ_MEM_RECLAIM drm/i915/fbdev: Serialise early hotplug events with async fbdev config Christian Borntraeger (1): s390/debug: use pK for kernel pointers Christian König (2): drm/amdgpu: don't try to move pinned BOs drm/ttm: fix populate_and_map() functions once more Christoph Hellwig (1): move libgcc.h to include/linux Christophe JAILLET (2): bnxt_en: Fix an error handling path in 'bnxt_get_module_eeprom()' drm/omap: Fix error handling path in 'omap_dmm_probe()' Chun-Yeow Yeoh (1): mac80211: fix the update of path metric for RANN frame Cihangir Akturk (1): drm: mali-dp: switch to drm_*_get(), drm_*_put() helpers Colin Ian King (9): nvme: fix spelling mistake: "requeing" -> "requeuing" drm/i915/gvt: ensure -ve return value is handled correctly i2c: i2c-boardinfo: fix memory leaks on devinfo ambassador: fix incorrect indentation of assignment statement atm: fore200e: use %pK to format kernel addresses instead of %x atm: lanai: use %p to format kernel addresses instead of %x atm: suni: remove extraneous space to fix indentation drm/amd/display: fix memory leaks on error exit return net: via: via-rhine: use %p to format void * address instead of %x Dan Carpenter (1): omapdrm: hdmi4_cec: signedness bug in hdmi4_cec_init() Dan Williams (11): mm: fix device-dax pud write-faults triggered by get_user_pages() mm: switch to 'define pmd_write' instead of __HAVE_ARCH_PMD_WRITE mm: replace pud_write with pud_access_permitted in fault + gup paths mm: replace pmd_write with pmd_access_permitted in fault + gup paths mm: replace pte_write with pte_access_permitted in fault + gup paths mm, hugetlbfs: introduce ->split() to vm_operations_struct device-dax: implement ->split() to catch invalid munmap attempts mm: introduce get_user_pages_longterm mm: fail get_vaddr_frames() for filesystem-dax mappings v4l2: disable filesystem-dax mapping support IB/core: disable memory registration of filesystem-dax vmas Darrick J. Wong (5): xfs: always free inline data before resetting inode fork during ifree xfs: log recovery should replay deferred ops in order xfs: ubsan fixes xfs: remove unused parameter from xfs_writepage_map xfs: scrub inode mode properly Dave Airlie (1): drm/ttm: don't attempt to use hugepages if dma32 requested (v2) Dave Martin (1): arm64: fpsimd: Fix failure to restore FPSIMD state after signals David Disseldorp (1): null_blk: fix dev->badblocks leak David Hildenbrand (1): KVM: x86: fix em_fxstor() sleeping while in atomic David Howells (14): rxrpc: The mutex lock returned by rxrpc_accept_call() needs releasing rxrpc: Don't set upgrade by default in sendmsg() rxrpc: Provide a different lockdep key for call->user_mutex for kernel calls rxrpc: Delay terminal ACK transmission on a client call rxrpc: Split the call params from the operation params rxrpc: Fix call timeouts rxrpc: Don't transmit DELAY ACKs immediately on proposal rxrpc: Express protocol timeouts in terms of RTT rxrpc: Add a timeout for detecting lost ACKs/lost DATA rxrpc: Add keepalive for a call rxrpc: Fix service endpoint expiry rxrpc: Fix conn expiry timers afs: Fix permit refcounting afs: Properly reset afs_vnode (inode) fields David S. Miller (1): sparc64: Fix boot on T4 and later. David Sterba (2): btrfs: add missing device::flush_bio puts btrfs: dev_alloc_list is not protected by RCU, use normal list_del Dmitry V. Levin (1): uapi: fix linux/kfd_ioctl.h userspace compilation errors Dmytro Laktyushkin (3): drm/amd/display: fix split recout calculation drm/amd/display: fix split recout offset drm/amd/display: fix split viewport rounding error Dr. David Alan Gilbert (2): KVM: lapic: Split out x2apic ldr calculation KVM: lapic: Fixup LDR on load in x2apic Eduardo Otubo (1): xen-netfront: remove warning when unloading module Eric Anholt (1): drm/bridge: Fix lvds-encoder since the panel_bridge rework. Eric Dumazet (1): net/packet: fix a race in packet_bind() and packet_notifier() Eric Sandeen (3): xfs: fix leaks on corruption errors in xfs_bmap.c xfs: fix uninitialized variable in xfs_scrub_quota xfs: calculate correct offset in xfs_scrub_quota_item Eric Yang (1): drm/amd/display: Add timing validation against dongle cap Eyal Moscovici (1): KVM: x86: Allow suppressing prints on RDMSR/WRMSR of unhandled MSRs Felix Kuehling (2): drm/amdgpu: Fix SDMA load/unload sequence on HWS disabled mode drm/amdkfd: Fix SDMA oversubsription handling Filipe Manana (3): Btrfs: move definition of the function btrfs_find_new_delalloc_bytes Btrfs: fix reported number of inode blocks after buffered append writes Btrfs: incremental send, fix wrong unlink path after renaming file Geert Uytterhoeven (1): net: ethernet: xilinx: Mark XILINX_LL_TEMAC broken on 64-bit Greg Kroah-Hartman (19): s390: block: add SPDX identifiers to the remaining files s390: crypto: add SPDX identifiers to the remaining files s390: cio: add SPDX identifiers to the remaining files s390: char: add SPDX identifiers to the remaining files s390: net: add SPDX identifiers to the remaining files s390: scsi: zfcp_aux: add SPDX identifier s390: virtio: add SPDX identifiers to the remaining files s390: crypto: Remove redundant license text s390: drivers: Remove redundant license text s390: kernel: add SPDX identifiers to the remaining files s390: crypto: add SPDX identifiers to the remaining files s390: mm: add SPDX identifiers to the remaining files s390: pci: add SPDX identifiers to the remaining files s390: appldata: add SPDX identifiers to the remaining files s390: add SPDX identifiers to the remaining files s390: kernel: Remove redundant license text s390: include: Remove redundant license text s390: crypto: Remove redundant license text s390: Remove redundant license text Gustavo A R Silva (1): i40e/virtchnl: fix application of sizeof to pointer Gustavo A. R. Silva (1): net: openvswitch: datapath: fix data type in queue_gso_packets Hans Verkuil (1): drm/bridge: adv7511/33: Fix adv7511_cec_init() failure handling Hans de Goede (4): ACPI / bus: Leave modalias empty for devices which are not present drm/i915: Fix false-positive assert_rpm_wakelock_held in i915_pmic_bus_access_notifier v2 drm/i915: Re-register PMIC bus access notifier on runtime resume i2c: i801: Fix Failed to allocate irq -2147483648 error Harald Freudenberger (1): s390/zcrypt: Fix wrong comparison leading to strange load balancing Harry Wentland (6): drm/amd/display: Fix amdgpu_dm bugs found by smatch drm/amd/display: Bunch of smatch error and warning fixes in DC drm/amd/display: Fix use before NULL check in validate_timing drm/amd/display: Fix hubp check in set_cursor_position drm/amd/display: Fix potential NULL and mem leak in create_links drm/amd/display: Fix couple more inconsistent NULL checks in dc_resource Heiko Carstens (2): s390: rework __switch_to() to allow larger task_struct offsets s390/disassembler: remove confusing code Heiner Kallweit (2): eeprom: at24: fix reading from 24MAC402/24MAC602 eeprom: at24: check at24_read/write arguments Hersen Wu (2): drm/amd/display: Handle as MST first and then DP dongle if sink support both drm/amd/display: USB-C / thunderbolt dock specific workaround Huacai Chen (1): bcache: Fix building error on MIPS Hyong-Youb Kim (1): myri10ge: Update MAINTAINERS Ian Kent (2): autofs: revert "autofs: take more care to not update last_used on path walk" autofs: revert "autofs: fix AT_NO_AUTOMOUNT not being honored" Israel Rukshin (1): nvme-rdma: Use mr pool Jakub Kicinski (1): cls_bpf: don't decrement net's refcount when offload fails James Hogan (1): cpufreq: Add Loongson machine dependencies James Smart (1): nvmet-fc: correct ref counting error when deferred rcv used Jan H. Schönherr (1): KVM: Let KVM_SET_SIGNAL_MASK work as advertised Janakarajan Natarajan (1): KVM: x86: Fix CPUID function for word 6 (80000001_ECX) Jean Delvare (1): hwmon: Drop reference to Jean's tree Jeff Layton (1): reiserfs: remove unneeded i_version bump Jeff Lien (1): nvme-pci: add quirk for delay before CHK RDY for WDC SN200 Jens Axboe (2): nvme-fc: don't use bit masks for set/test_bit() numbers blktrace: fix trace mutex deadlock Jerry (Fangzhi) Zuo (1): drm/amd/display: Check aux channel before MST resume Jesse Chan (1): cpufreq: mediatek: add missing MODULE_DESCRIPTION/AUTHOR/LICENSE Jiang Biao (1): fs/mbcache.c: make count_objects() more robust Jinbum Park (1): arm64: pgd: Mark pgd_cache as __ro_after_init Jiri Pirko (1): net: sched: cbq: create block for q->link.block Johannes Berg (2): cfg80211: select CRYPTO_SHA256 if needed mac80211: use QoS NDP for AP probing John Johansen (1): apparmor: fix oops in audit_signal_cb hook Jon Maloy (1): tipc: eliminate access after delete in group_filter_msg() Joonas Lahtinen (1): drm/i915: Disable THP until we have a GPU read BW W/A Jordan Lazare (1): drm/amd/display: Revert noisy assert messages Jorgen Hansen (2): VSOCK: Don't call vsock_stream_has_data in atomic context VSOCK: Don't set sk_state to TCP_CLOSE before testing it Josef Bacik (2): btrfs: clear space cache inode generation always btrfs: fix deadlock when writing out space cache Kai-Heng Feng (1): nvme-pci: disable APST on Samsung SSD 960 EVO + ASUS PRIME B350M-A Kees Cook (1): exec: avoid RLIMIT_STACK races with prlimit() Keith Busch (2): nvme: Fix NULL dereference on reservation request nvme: Suppress static analyis warning Kirill A. Shutemov (3): mm, thp: Do not make page table dirty unconditionally in touch_p[mu]d() mm, thp: Do not make pmd/pud dirty without a reason mm/hugetlb: fix NULL-pointer dereference on 5-level paging machine Laurent Pinchart (1): drm: omapdrm: Fix DPI on platforms using the DSI VDDS Leo (Sunpeng) Li (3): drm/amd/display: Should disable when new stream is null drm/amd/display: Do DC mode-change check when adding CRTCs drm/amd/display: Do not put drm_atomic_state on resume Leo Liu (1): drm/amdgpu: move UVD/VCE and VCN structure out from union Linus Torvalds (6): Rename superblock flags (MS_xyz -> SB_xyz) proc: don't report kernel addresses in /proc/<pid>/stack Revert "mm, thp: Do not make pmd/pud dirty without a reason" kallsyms: take advantage of the new '%px' format vsprintf: don't use 'restricted_pointer()' when not restricting Linux 4.15-rc2 Liran Alon (6): KVM: x86: pvclock: Handle first-time write to pvclock-page contains random junk KVM: nVMX/nSVM: Don't intercept #UD when running L2 KVM: x86: Exit to user-mode on #UD intercept when emulator requires KVM: x86: emulator: Return to user-mode on L1 CPL=0 emulation failure KVM: x86: Don't re-execute instruction when not passing CR2 value KVM: nVMX: Fix vmx_check_nested_events() return value in case an event was reinjected to L2 Liu Bo (3): Btrfs: add write_flags for compression bio Btrfs: bail out gracefully rather than BUG_ON Btrfs: fix list_add corruption and soft lockups in fsync Liu, Changcheng (1): scripts/faddr2line: extend usage on generic arch Liviu Dudau (3): drm: hdlcd: Update PM code to save/restore console. drm: mali-dp: Separate static internal data into a read-only structure. drm: mali-dp: Disable planes when their CRTC gets disabled. Lucas Stach (2): drm/atomic: make drm_atomic_helper_wait_for_vblanks more agressive drm/imx: always call wait_for_flip_done in commit_tail Lv Zheng (1): ACPI / EC: Fix regression related to PM ops support in ECDT device Maarten Lankhorst (2): drm/vblank: Pass crtc_id to page_flip_ioctl. drm/fb_helper: Disable all crtc's when initial setup fails. Mahesh Salgaonkar (1): powerpc/powernv: Fix kexec crashes caused by tlbie tracing Marcos Paulo de Souza (1): blktrace: Use blk_trace_bio_get_cgid inside blk_add_trace_bio Mark Rutland (1): arm64: mm: cleanup stale AIVIVT references Martin Schwidefsky (4): s390: fix alloc_pgste check in init_new_context again s390: sthyi: add SPDX identifiers to the remaining files s390: revert ELF_ET_DYN_BASE base changes s390/gs: add compat regset for the guarded storage broadcast control block Max Gurtovoy (1): nvme-rdma: fix memory leak during queue allocation Michael Ellerman (1): powerpc/kexec: Fix kexec/kdump in P9 guest kernels Michael Lyle (1): bcache: check return value of register_shrinker Michal Hocko (3): xfs: fortify xfs_alloc_buftarg error handling mm, memory_hotplug: do not back off draining pcp free pages from kworker context Revert "mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical" Michel Dänzer (2): drm/amdgpu: Set adev->vcn.irq.num_types for VCN drm/amdgpu: Use unsigned ring indices in amdgpu_queue_mgr_map Mika Westerberg (1): net: thunderbolt: Stop using zero to mean no valid DMA mapping Mike Kravetz (1): mm/cma: fix alloc_contig_range ret code/potential leak Mike Maloney (1): packet: fix crash in fanout_demux_rollover() Mikulas Patocka (1): block: remove useless assignment in bio_split Minwoo Im (2): nvme-pci: avoid hmb desc array idx out-of-bound when hmmaxd set. nvme-pci: fix NULL pointer dereference in nvme_free_host_mem() Mirza Krak (1): drm/rockchip: dw-mipi-dsi: fix possible un-balanced runtime PM enable Nadav Amit (1): fs/hugetlbfs/inode.c: change put_page/unlock_page order in hugetlbfs_fallocate() Naofumi Honda (1): nfsd: fix panic in posix_unblock_lock called from nfs4_laundromat Nikita Leshenko (5): KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC reconfigure race KVM: x86: ioapic: Don't fire level irq when Remote IRR set KVM: x86: ioapic: Remove redundant check for Remote IRR in ioapic_set_irq KVM: x86: ioapic: Clear Remote IRR when entry is switched to edge-triggered KVM: x86: ioapic: Preserve read-only values in the redirection table Nikolay Borisov (1): btrfs: Fix transaction abort during failure in btrfs_rm_dev_item OGAWA Hirofumi (1): fs/fat/inode.c: fix sb_rdonly() change Oded Gabbay (2): microblaze: add missing include to mmu_context_mm.h drm/radeon: remove init of CIK VMIDs 8-16 for amdkfd Olof Johansson (8): RISC-V: use generic serial.h RISC-V: use RISCV_{INT,SHORT} instead of {INT,SHORT} for asm macros RISC-V: io.h: type fixes for warnings RISC-V: move empty_zero_page definition to C and export it RISC-V: Export some expected symbols for modules RISC-V: Provide stub of setup_profiling_timer() RISC-V: Use define for get_cycles like other architectures RISC-V: Add missing include Ondrej Mosnáček (1): crypto: skcipher - Fix skcipher_walk_aead_common Palmer Dabbelt (11): RISC-V: Remove __vdso_cmpxchg{32,64} symbol versions RISC-V: Remove unused arguments from ATOMIC_OP RISC-V: Comment on why {,cmp}xchg is ordered how it is RISC-V: Remove __smp_bp__{before,after}_atomic RISC-V: Remove smb_mb__{before,after}_spinlock() RISC-V: __test_and_op_bit_ord should be strongly ordered RISC-V: Add READ_ONCE in arch_spin_is_locked() RISC-V: `sfence.vma` orderes the instruction cache RISC-V: remove spin_unlock_wait() RISC-V: Clean up an unused include RISC-V: __io_writes should respect the length argument Paolo Abeni (1): sch_sfq: fix null pointer dereference at timer expiration Paolo Bonzini (2): KVM: x86: inject exceptions produced by x86_decode_insn KVM: vmx: use X86_CR4_UMIP and X86_FEATURE_UMIP Paul Mackerras (1): KVM: PPC: Book3S HV: Fix migration and HPT resizing of HPT guests on radix hosts Peter Rosin (1): hwmon: (jc42) optionally try to disable the SMBUS timeout Peter Ujfalusi (1): omapdrm: hdmi4: Correct the SoC revision matching Petr Machata (4): mlxsw: spectrum_router: Offload decap only for up tunnels mlxsw: spectrum_router: Demote tunnels on VRF migration mlxsw: spectrum_router: Handle encap to demoted tunnels mlxsw: spectrum_router: Update nexthop RIF on update Pierre-Hugues Husson (1): drm/bridge: synopsys/dw-hdmi: Enable cec clock Qu Wenruo (2): btrfs: Fix wild memory access in compression level parser btrfs: tree-checker: Fix false panic for sanity test Randy Dunlap (2): block: genhd.c: fix message typo drm/amdkfd: fix amdkfd use-after-free GP fault Robert Lippert (1): hwmon: (pmbus) Use 64bit math for DIRECT format values Roman Kapl (1): net: sched: crash on blocks with goto chain action Roman Li (2): drm/amd/display: Fix S3 topology change drm/amd/display: fix gamma setting Rui Hua (1): bcache: recover data from backing when data is clean Russell King (1): ARM: avoid faulting on qemu Sagi Grimberg (7): nvme-fabrics: introduce init command check for a queue that is not alive nvme-fc: check if queue is ready in queue_rq nvme-loop: check if queue is ready in queue_rq nvme-rdma: don't suppress send completions nvme-rdma: don't complete requests before a send work request has completed nvme-rdma: wait for local invalidation before completing a request nvme-rdma: Check remotely invalidated rkey matches our expected rkey Sara Sharon (1): mac80211: tear down RX aggregations first Sasha Neftin (1): e1000e: fix the use of magic numbers for buffer overrun issue Shakeel Butt (1): mm, memcg: fix mem_cgroup_swapout() for THPs Shirish S (1): drm/amd/display: check plane state before validating fbc Srishti Sharma (2): drm/arm: Replace instances of drm_dev_unref with drm_dev_put. drm/arm: Replace instances of drm_dev_unref with drm_dev_put. Stefan Schake (1): drm/vc4: Account for interrupts in flight Stephan Mueller (2): crypto: algif_aead - skip SGL entries with NULL page crypto: af_alg - remove locking in async callback Stephen Hemminger (1): uapi: add SPDX identifier to vm_sockets_diag.h Sunil Goutham (1): net: thunderx: Fix TCP/UDP checksum offload for IPv6 pkts Takashi Iwai (1): Revert "ALSA: usb-audio: Fix potential zero-division at parsing FU" Tang Junhui (1): bcache: add a comment in journal bucket reading Tetsuo Handa (1): quota: Check for register_shrinker() failure. Thomas Meyer (1): auxdisplay: img-ascii-lcd: Only build on archs that have IOMEM Thomas Richter (1): s390/topology: fix compile error in file arch/s390/kernel/smp.c Tobin C. Harding (5): docs: correct documentation for %pK vsprintf: refactor %pK code out of pointer() printk: hash addresses printed with %p vsprintf: add printk specifier %px kasan: use %px to print addresses instead of %p Trond Myklebust (11): nfsd: Fix stateid races between OPEN and CLOSE nfsd: Fix another OPEN stateid race nfsd: CLOSE SHOULD return the invalid special stateid for NFSv4.x (x>0) nfsd: Ensure we don't recognise lock stateids after freeing them nfsd4: move find_lock_stateid nfsd: Fix race in lock stateid creation nfsd: Ensure we check stateid validity in the seqid operation checks nfsd: Fix races with check_stateid_generation() NFSv4: Ensure gcc 4.4.4 can compile initialiser for "invalid_stateid" SUNRPC: Allow connect to return EHOSTUNREACH SUNRPC: Handle ENETDOWN errors Ulf Hansson (1): mmc: sdhci: Avoid swiotlb buffer being full Vaibhav Jain (3): cxl: Check if vphb exists before iterating over AFU devices powerpc: Avoid signed to unsigned conversion in set_thread_tidr() powerpc: Do not assign thread.tidr if already assigned Vasily Averin (9): nfsd: remove net pointer from debug messages lockd: remove net pointer from messages grace: replace BUG_ON by WARN_ONCE in exit_net hook lockd: added cleanup checks in exit_net hook lockd: lost rollback of set_grace_period() in lockd_down_net() race of lockd inetaddr notifiers vs nlmsvc_rqst change race of nfsd inetaddr notifiers vs nn->nfsd_serv change nlm_shutdown_hosts_net() cleanup lockd: fix "list_add double add" caused by legacy signal interface Vasily Gorbik (1): s390/disassembler: correct disassembly lines alignment Vasyl Gomonovych (1): lmc: Use memdup_user() as a cleanup Ville Syrjälä (4): drm/edid: Don't send non-zero YQ in AVI infoframe for HDMI 1.x sinks drm/i915: Fix init_clock_gating for resume drm/i915: Don't try indexed reads to alternate slave addresses drm/i915: Prevent zero length "index" write Vitor Massaru Iha (1): drm: Fix checkpatch issue: "WARNING: braces {} are not necessary for single statement blocks." Vivien Didelot (1): net: dsa: fix 'increment on 0' warning Wang Nan (1): mm, oom_reaper: gather each vma to prevent leaking TLB entry Wanpeng Li (6): KVM: X86: Fix operand/address-size during instruction decoding KVM: nVMX: Validate the IA32_BNDCFGS on nested VM-entry KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure KVM: X86: Fix softlockup when get the current kvmclock KVM: VMX: Fix rflags cache during vCPU reset KVM: VMX: Fix vmx->nested freeing when no SMI handler Weinan Li (1): drm/i915/gvt: remove skl_misc_ctl_write handler Will Deacon (1): arm64: context: Fix comments and remove pointless smp_wmb() Xiaolin Zhang (1): drm/i915/gvt: enabled pipe A default on creating vgpu Xin Long (11): sctp: use sizeof(__u16) for each stream number length instead of magic number sctp: only allow the out stream reset when the stream outq is empty sctp: only allow the asoc reset when the asoc outq is empty sctp: avoid flushing unsent queue when doing asoc reset sctp: set sender next_tsn for the old result with ctsn_ack_point plus 1 sctp: force SCTP_ERROR_INV_STRM with __u32 when calling sctp_chunk_fail sctp: force the params with right types for sctp csum apis sctp: remove extern from stream sched sctp: use right member as the param of list_for_each_entry bonding: use nla_get_u64 to extract the value for IFLA_BOND_AD_ACTOR_SYSTEM vxlan: use __be32 type for the param vni in __vxlan_fdb_delete Xiong Zhang (1): drm/i915/gvt: Correct ADDR_4K/2M/1G_MASK definition Xu YiPing (1): arm64: perf: remove unsupported events for Cortex-A73 Yan Markman (1): net: mvpp2: do not disable GMAC padding Yisheng Xie (1): kmemleak: add scheduling point to kmemleak_scan() Yury Norov (1): arm64: cpu_ops: Add missing 'const' qualifiers Zhu Yanjun (1): forcedeth: replace pci_unmap_page with dma_unmap_page Zi Yan (1): mm: migrate: fix an incorrect call of prep_transhuge_page() chenjie (1): mm/madvise.c: fix madvise() infinite loop under special circumstances fred gao (1): drm/i915/gvt: Move request alloc to dispatch_workload path only shaoyunl (1): drm/amdkfd: Fix SDMA ring buffer size calculation weiping zhang (7): bdi: convert bdi_debug_register to int bdi: add error handle for bdi_debug_register block: add WARN_ON if bdi register fail blk-wbt: remove duplicated setting in wbt_init blk-sysfs: remove NULL pointer checking in queue_wb_lat_store blk-wbt: move wbt_clear_stat to common place in wbt_done blk-wbt: fix comments typo zhangliping (1): openvswitch: fix the incorrect flow action alloc size ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-03 16:22 Linux 4.15-rc2 Linus Torvalds @ 2017-12-04 22:25 ` Rafael J. Wysocki 2017-12-04 22:36 ` Linus Torvalds 2018-02-21 18:36 ` Linux 4.15-rc2 Eugene Syromiatnikov 1 sibling, 1 reply; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-04 22:25 UTC (permalink / raw) To: Linus Torvalds; +Cc: Linux Kernel Mailing List, x86 On Sunday, December 3, 2017 5:22:56 PM CET Linus Torvalds wrote: > It's Sunday, but a few hours earlier than usual, since I'm on the east > coast, three hours ahead of my normal release schedule. > > It's a slightly bigger rc2 than I would have wished for, but this > early in the release process I don't worry about it. The appended > shortlog gives the details, it's fixes all over the place - > architectures, drivers, filesystems, networking, core kernel. > > One thing I'll point out is that I'm trying to get some kernel ASLR > leaks plugged, and as part of that we now hash any pointers printed by > "%p' by default. That won't affect a lot of people, but where it is a > debugging problem (rather than leaking interesting kernel pointers), > we will have to fix things up. > > It can be a small annoyance, but the alternatives (trying to actually > find all the cases where we might be leaking) were worse. But let's > see if anybody even notices - a lot of the pointer printouts are stale > debug information from when some driver was originally written, and > aren't actually really interesting. > > There will probably be some more leak fixes during this rc process, > we'll see how that all sorts out. So far, resume from suspend-to-RAM (ACPI S3) is broken on all of the systems I have tested, so it is probably safe to assume it to be broken everywhere. I'm quite confident that this is not something that went in through the PM tree, because I was running those changes on the systems that turn out to be broken now. It looks like the the ACPI waking vector mechanism stopped working, so I'm suspecting some x86 changes having to do with virtual-to-physical address mapping. I've just started bisection. Thanks, Rafael ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-04 22:25 ` Linux 4.15-rc2: Regression in resume from ACPI S3 Rafael J. Wysocki @ 2017-12-04 22:36 ` Linus Torvalds 2017-12-04 22:38 ` Thomas Gleixner 2017-12-06 12:15 ` Michal Hocko 0 siblings, 2 replies; 83+ messages in thread From: Linus Torvalds @ 2017-12-04 22:36 UTC (permalink / raw) To: Rafael J. Wysocki, Andy Lutomirski Cc: Linux Kernel Mailing List, the arch/x86 maintainers On Mon, Dec 4, 2017 at 2:25 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > So far, resume from suspend-to-RAM (ACPI S3) is broken on all of the > systems I have tested, so it is probably safe to assume it to be > broken everywhere. Oh, it's definitely not broken everywhere, because I use it myself, and was traveling last week due to my mom's bday. HOWEVER. Some of the x86 work seems to have broken it for some configurations. In particular, do you have a big "everything enabled" kernel config - particularly lockdep and irqflags tracing enabled? Andy has a patch, but it hasn't made it to me yet (probably because the x86 people are very busy with the kaiser work): https://lkml.org/lkml/2017/11/30/546 (also note his follow-up "fix the commit message" note, but that one doesn't actually affect the code itself). Does that patch fix it for you? Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-04 22:36 ` Linus Torvalds @ 2017-12-04 22:38 ` Thomas Gleixner 2017-12-04 22:41 ` Rafael J. Wysocki 2017-12-06 12:15 ` Michal Hocko 1 sibling, 1 reply; 83+ messages in thread From: Thomas Gleixner @ 2017-12-04 22:38 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers On Mon, 4 Dec 2017, Linus Torvalds wrote: > On Mon, Dec 4, 2017 at 2:25 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > So far, resume from suspend-to-RAM (ACPI S3) is broken on all of the > > systems I have tested, so it is probably safe to assume it to be > > broken everywhere. > > Oh, it's definitely not broken everywhere, because I use it myself, > and was traveling last week due to my mom's bday. > > HOWEVER. > > Some of the x86 work seems to have broken it for some configurations. > In particular, do you have a big "everything enabled" kernel config - > particularly lockdep and irqflags tracing enabled? > > Andy has a patch, but it hasn't made it to me yet (probably because > the x86 people are very busy with the kaiser work): Picking it up right now. Thanks, tglx ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-04 22:38 ` Thomas Gleixner @ 2017-12-04 22:41 ` Rafael J. Wysocki 2017-12-05 0:25 ` Rafael J. Wysocki 0 siblings, 1 reply; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-04 22:41 UTC (permalink / raw) To: Thomas Gleixner, Linus Torvalds Cc: Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers On Monday, December 4, 2017 11:38:54 PM CET Thomas Gleixner wrote: > On Mon, 4 Dec 2017, Linus Torvalds wrote: > > > On Mon, Dec 4, 2017 at 2:25 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > > > So far, resume from suspend-to-RAM (ACPI S3) is broken on all of the > > > systems I have tested, so it is probably safe to assume it to be > > > broken everywhere. > > > > Oh, it's definitely not broken everywhere, because I use it myself, > > and was traveling last week due to my mom's bday. > > > > HOWEVER. > > > > Some of the x86 work seems to have broken it for some configurations. > > In particular, do you have a big "everything enabled" kernel config - > > particularly lockdep and irqflags tracing enabled? > > > > Andy has a patch, but it hasn't made it to me yet (probably because > > the x86 people are very busy with the kaiser work): This definitely fixes the problem at least on one of the affected machines. > Picking it up right now. Cool, thanks! Rafael ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-04 22:41 ` Rafael J. Wysocki @ 2017-12-05 0:25 ` Rafael J. Wysocki 2017-12-09 10:33 ` Pavel Machek 0 siblings, 1 reply; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-05 0:25 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers On Monday, December 4, 2017 11:41:06 PM CET Rafael J. Wysocki wrote: > On Monday, December 4, 2017 11:38:54 PM CET Thomas Gleixner wrote: > > On Mon, 4 Dec 2017, Linus Torvalds wrote: > > > > > On Mon, Dec 4, 2017 at 2:25 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > > > > > So far, resume from suspend-to-RAM (ACPI S3) is broken on all of the > > > > systems I have tested, so it is probably safe to assume it to be > > > > broken everywhere. > > > > > > Oh, it's definitely not broken everywhere, because I use it myself, > > > and was traveling last week due to my mom's bday. > > > > > > HOWEVER. > > > > > > Some of the x86 work seems to have broken it for some configurations. > > > In particular, do you have a big "everything enabled" kernel config - > > > particularly lockdep and irqflags tracing enabled? > > > > > > Andy has a patch, but it hasn't made it to me yet (probably because > > > the x86 people are very busy with the kaiser work): > > This definitely fixes the problem at least on one of the affected machines. I can confirm that the Andy's patch fixes it on all systems that had this issue here. Thanks, Rafael ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-05 0:25 ` Rafael J. Wysocki @ 2017-12-09 10:33 ` Pavel Machek 2017-12-09 11:41 ` Pavel Machek [not found] ` <CA+55aFw8tuoJ2gcXx3K2sKFf2Y9hXX4naMVQNqGOUivnjwhjkg@mail.gmail.com> 0 siblings, 2 replies; 83+ messages in thread From: Pavel Machek @ 2017-12-09 10:33 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Thomas Gleixner, Linus Torvalds, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers [-- Attachment #1: Type: text/plain, Size: 1556 bytes --] On Tue 2017-12-05 01:25:55, Rafael J. Wysocki wrote: > On Monday, December 4, 2017 11:41:06 PM CET Rafael J. Wysocki wrote: > > On Monday, December 4, 2017 11:38:54 PM CET Thomas Gleixner wrote: > > > On Mon, 4 Dec 2017, Linus Torvalds wrote: > > > > > > > On Mon, Dec 4, 2017 at 2:25 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > > > > > > > So far, resume from suspend-to-RAM (ACPI S3) is broken on all of the > > > > > systems I have tested, so it is probably safe to assume it to be > > > > > broken everywhere. > > > > > > > > Oh, it's definitely not broken everywhere, because I use it myself, > > > > and was traveling last week due to my mom's bday. > > > > > > > > HOWEVER. > > > > > > > > Some of the x86 work seems to have broken it for some configurations. > > > > In particular, do you have a big "everything enabled" kernel config - > > > > particularly lockdep and irqflags tracing enabled? > > > > > > > > Andy has a patch, but it hasn't made it to me yet (probably because > > > > the x86 people are very busy with the kaiser work): > > > > This definitely fixes the problem at least on one of the affected machines. > > I can confirm that the Andy's patch fixes it on all systems that had this > issue here. I believe I have the issue here, too (-next on thinkpad x60). Which patch is expected to fix it? Let me try recent -next... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-09 10:33 ` Pavel Machek @ 2017-12-09 11:41 ` Pavel Machek [not found] ` <CA+55aFw8tuoJ2gcXx3K2sKFf2Y9hXX4naMVQNqGOUivnjwhjkg@mail.gmail.com> 1 sibling, 0 replies; 83+ messages in thread From: Pavel Machek @ 2017-12-09 11:41 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Thomas Gleixner, Linus Torvalds, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers [-- Attachment #1: Type: text/plain, Size: 1690 bytes --] On Sat 2017-12-09 11:33:25, Pavel Machek wrote: > On Tue 2017-12-05 01:25:55, Rafael J. Wysocki wrote: > > On Monday, December 4, 2017 11:41:06 PM CET Rafael J. Wysocki wrote: > > > On Monday, December 4, 2017 11:38:54 PM CET Thomas Gleixner wrote: > > > > On Mon, 4 Dec 2017, Linus Torvalds wrote: > > > > > > > > > On Mon, Dec 4, 2017 at 2:25 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > > > > > > > > > So far, resume from suspend-to-RAM (ACPI S3) is broken on all of the > > > > > > systems I have tested, so it is probably safe to assume it to be > > > > > > broken everywhere. > > > > > > > > > > Oh, it's definitely not broken everywhere, because I use it myself, > > > > > and was traveling last week due to my mom's bday. > > > > > > > > > > HOWEVER. > > > > > > > > > > Some of the x86 work seems to have broken it for some configurations. > > > > > In particular, do you have a big "everything enabled" kernel config - > > > > > particularly lockdep and irqflags tracing enabled? > > > > > > > > > > Andy has a patch, but it hasn't made it to me yet (probably because > > > > > the x86 people are very busy with the kaiser work): > > > > > > This definitely fixes the problem at least on one of the affected machines. > > > > I can confirm that the Andy's patch fixes it on all systems that had this > > issue here. > > I believe I have the issue here, too (-next on thinkpad x60). Which > patch is expected to fix it? Let me try recent -next... Still there AFAICT. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 83+ messages in thread
[parent not found: <CA+55aFw8tuoJ2gcXx3K2sKFf2Y9hXX4naMVQNqGOUivnjwhjkg@mail.gmail.com>]
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 [not found] ` <CA+55aFw8tuoJ2gcXx3K2sKFf2Y9hXX4naMVQNqGOUivnjwhjkg@mail.gmail.com> @ 2017-12-09 22:01 ` Pavel Machek [not found] ` <CA+55aFySAdiBZhZ0PSDjH5PuvPPcMsBRXbxCkObfm1eY7gHDbQ@mail.gmail.com> 0 siblings, 1 reply; 83+ messages in thread From: Pavel Machek @ 2017-12-09 22:01 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Thomas Gleixner, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers [-- Attachment #1: Type: text/plain, Size: 820 bytes --] Hi! On Sat 2017-12-09 10:55:14, Linus Torvalds wrote: > On Dec 9, 2017 02:33, "Pavel Machek" <pavel@ucw.cz> wrote: > > > I believe I have the issue here, too (-next on thinkpad x60). Which > > patch is expected to fix it? Let me try recent -next... > > > It should be fixed in mainline. I don't know if next has picked it > up yet. Strange. I was at 4.15-rc1, and suspend worked there (thinkpad x60, 32-bit). It was broken in -next. I updated to current mainline ( 4ded3bec65a07343258ed8fd9d46483f032d866f ) and suspend is broken. It suspends ok, I press Fn button to make it resume, fans spin up but moon LED is still lit. Best regards, Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 83+ messages in thread
[parent not found: <CA+55aFySAdiBZhZ0PSDjH5PuvPPcMsBRXbxCkObfm1eY7gHDbQ@mail.gmail.com>]
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 [not found] ` <CA+55aFySAdiBZhZ0PSDjH5PuvPPcMsBRXbxCkObfm1eY7gHDbQ@mail.gmail.com> @ 2017-12-10 16:23 ` Pavel Machek 2017-12-10 16:37 ` Linus Torvalds 0 siblings, 1 reply; 83+ messages in thread From: Pavel Machek @ 2017-12-10 16:23 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Thomas Gleixner, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers [-- Attachment #1: Type: text/plain, Size: 964 bytes --] On Sat 2017-12-09 14:47:41, Linus Torvalds wrote: > On Dec 9, 2017 14:01, "Pavel Machek" <pavel@ucw.cz> wrote: > > > Strange. I was at 4.15-rc1, and suspend worked there (thinkpad x60, > 32-bit). It was broken in -next. I updated to current mainline ( > 4ded3bec65a07343258ed8fd9d46483f032d866f ) and suspend is broken. > Can you do something about html emails? Quoting them doesn't work too well. > Any chance to bisect it? This doesn't sound like the other problem > we had. Let me try... v4.15-rc2: suspends/resumes ok. 4ded3be: hangs on resume. Given that between those, there was supposed "fix" for suspend, I believe I should try reverting that one first. .. if someone can tell me commit id, that would help. And yes, if everything else fails, I can probably bisect. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-10 16:23 ` Pavel Machek @ 2017-12-10 16:37 ` Linus Torvalds 2017-12-10 18:56 ` Pavel Machek 0 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2017-12-10 16:37 UTC (permalink / raw) To: Pavel Machek Cc: Rafael J. Wysocki, Thomas Gleixner, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers On Sun, Dec 10, 2017 at 8:23 AM, Pavel Machek <pavel@ucw.cz> wrote: > > Can you do something about html emails? Quoting them doesn't work too well. Yeah, and they don't show up onlkml either because of rules. I try to avoid them, but have been more on mobile for various reasons lately than usual. That should be over and done with after today, though. >> Any chance to bisect it? This doesn't sound like the other problem >> we had. > > Let me try... > > v4.15-rc2: suspends/resumes ok. > 4ded3be: hangs on resume. > > Given that between those, there was supposed "fix" for suspend, I > believe I should try reverting that one first. .. if someone can tell > me commit id, that would help. The fix in there should be 5b06bbcfc2c6 ("x86/power: Fix some ordering bugs in __restore_processor_context()") so you can certainly see if it works before that (or just reverting it). But there are also a few other x86 low-level things there, and that fix really looks very safe, so I'd almost expect something else to have triggered your problem. There's less than 500 commits in that range you have, so a few bisections should narrow it down a lot. Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-10 16:37 ` Linus Torvalds @ 2017-12-10 18:56 ` Pavel Machek 2017-12-10 20:30 ` Linus Torvalds 0 siblings, 1 reply; 83+ messages in thread From: Pavel Machek @ 2017-12-10 18:56 UTC (permalink / raw) To: Linus Torvalds, luto, tglx, jarkko.nikula Cc: Rafael J. Wysocki, Thomas Gleixner, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers [-- Attachment #1: Type: text/plain, Size: 1950 bytes --] On Sun 2017-12-10 08:37:56, Linus Torvalds wrote: > On Sun, Dec 10, 2017 at 8:23 AM, Pavel Machek <pavel@ucw.cz> wrote: > > > > Can you do something about html emails? Quoting them doesn't work too well. > > Yeah, and they don't show up onlkml either because of rules. I try to > avoid them, but have been more on mobile for various reasons lately > than usual. That should be over and done with after today, though. > > >> Any chance to bisect it? This doesn't sound like the other problem > >> we had. > > > > Let me try... > > > > v4.15-rc2: suspends/resumes ok. > > 4ded3be: hangs on resume. > > > > Given that between those, there was supposed "fix" for suspend, I > > believe I should try reverting that one first. .. if someone can tell > > me commit id, that would help. > > The fix in there should be 5b06bbcfc2c6 ("x86/power: Fix some ordering > bugs in __restore_processor_context()") so you can certainly see if it > works before that (or just reverting it). Revert is easier. > But there are also a few other x86 low-level things there, and that > fix really looks very safe, so I'd almost expect something else to > have triggered your problem. There's less than 500 commits in that > range you have, so a few bisections should narrow it down a lot. No, that commit does _look_ pretty suspect to me... Confirmed, revert fixes it. You see how it moves fix_processor_context around #ifdef CONFIG_X86_32 block? And how people forget 32-bit machines exist? Aha. Which brings me to .. various people do automated testing of kernel. Testing 32-bit kernel for boot, and both 32-bit and 64-bit for boot and suspend would be very nice. The last item is not hard, either: sudo rtcwake -l -m mem -s 5 ...should take 10 seconds or so. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-10 18:56 ` Pavel Machek @ 2017-12-10 20:30 ` Linus Torvalds 2017-12-10 20:43 ` Pavel Machek ` (2 more replies) 0 siblings, 3 replies; 83+ messages in thread From: Linus Torvalds @ 2017-12-10 20:30 UTC (permalink / raw) To: Pavel Machek, Zhang Rui Cc: Andrew Lutomirski, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers On Sun, Dec 10, 2017 at 10:56 AM, Pavel Machek <pavel@ucw.cz> wrote: > > Confirmed, revert fixes it. You see how it moves fix_processor_context > around #ifdef CONFIG_X86_32 block? And how people forget 32-bit > machines exist? Aha. Yeah, people do. Andy? > Which brings me to .. various people do automated testing of > kernel. Testing 32-bit kernel for boot, and both 32-bit and 64-bit for > boot and suspend would be very nice. The last item is not hard, either: > > sudo rtcwake -l -m mem -s 5 > > ...should take 10 seconds or so. I'm told 0day does *some* suspend/resume testing, but I think it's pretty limited, partly because the kinds of machines it primarily works on don't really support suspend/resume at all. I'm also not sure just how many of those machines are 32-bit at all.. But I'm adding Zhang Rui to the cc, to see if my recollection is right. Because you're right, more suspend/resume automated testing would be good to have. And yes, people test mainly 64-bit these days. Also, I'm not even sure what the 0day rules are for just plain mainline. I don't tend to see a lot of breakage reports, even though I'd expect to. This came in from the x86 trees (and those do their own tests too, but probably not suspend/resume either), but it hit my tree fairly soon after going into the x86 -tip trees. Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-10 20:30 ` Linus Torvalds @ 2017-12-10 20:43 ` Pavel Machek 2017-12-10 21:28 ` Linus Torvalds 2017-12-10 21:38 ` [PATCH] Fix resume on x86-32 machines Pavel Machek 2017-12-11 14:09 ` Linux 4.15-rc2: Regression in resume from ACPI S3 Zhang Rui 2 siblings, 1 reply; 83+ messages in thread From: Pavel Machek @ 2017-12-10 20:43 UTC (permalink / raw) To: Linus Torvalds Cc: Zhang Rui, Andrew Lutomirski, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers [-- Attachment #1: Type: text/plain, Size: 1875 bytes --] On Sun 2017-12-10 12:30:52, Linus Torvalds wrote: > On Sun, Dec 10, 2017 at 10:56 AM, Pavel Machek <pavel@ucw.cz> wrote: > > > > Confirmed, revert fixes it. You see how it moves fix_processor_context > > around #ifdef CONFIG_X86_32 block? And how people forget 32-bit > > machines exist? Aha. > > Yeah, people do. > > Andy? For the record... this should fix it. Tested on x60. More tests pending. Signed-off-by: Pavel Machek <pavel@ucw.cz> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c index 5191de1..d59f05f 100644 --- a/arch/x86/power/cpu.c +++ b/arch/x86/power/cpu.c @@ -216,8 +216,8 @@ static void notrace __restore_processor_state(struct saved_context *ctxt) write_cr0(ctxt->cr0); /* - * now restore the descriptor tables to their proper values - * ltr is done i fix_processor_context(). + * Now restore the descriptor tables to their proper values + * ltr is done in fix_processor_context(). */ #ifdef CONFIG_X86_32 load_idt(&ctxt->idt); @@ -235,8 +235,6 @@ static void notrace __restore_processor_state(struct saved_context *ctxt) wrmsrl(MSR_GS_BASE, ctxt->gs_base); #endif - fix_processor_context(); - /* * Restore segment registers. This happens after restoring the GDT * and LDT, which happen in fix_processor_context(). @@ -252,8 +250,12 @@ static void notrace __restore_processor_state(struct saved_context *ctxt) */ if (boot_cpu_has(X86_FEATURE_SEP)) enable_sep_cpu(); + + fix_processor_context(); #else /* CONFIG_X86_64 */ + fix_processor_context(); + asm volatile ("movw %0, %%ds" :: "r" (ctxt->ds)); asm volatile ("movw %0, %%es" :: "r" (ctxt->es)); asm volatile ("movw %0, %%fs" :: "r" (ctxt->fs)); -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply related [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-10 20:43 ` Pavel Machek @ 2017-12-10 21:28 ` Linus Torvalds 2017-12-10 21:35 ` Pavel Machek 2017-12-12 17:27 ` Linus Torvalds 0 siblings, 2 replies; 83+ messages in thread From: Linus Torvalds @ 2017-12-10 21:28 UTC (permalink / raw) To: Pavel Machek Cc: Zhang Rui, Andrew Lutomirski, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers On Sun, Dec 10, 2017 at 12:43 PM, Pavel Machek <pavel@ucw.cz> wrote: > > For the record... this should fix it. Tested on x60. More tests pending. This can't be right. At the very least, now the comment is wrong. And the comment does seem relevant for 32-bit too: > - fix_processor_context(); > - > /* > * Restore segment registers. This happens after restoring the GDT > * and LDT, which happen in fix_processor_context(). Notice? You've moved down the 32-bit fix_processor_context() call to after the loadsegment() calls, which smells wrong. That said, this *all* smells wrong. Why is there a special fix_processor_context() function at all with different 32-bit and 64-bit behavior? This code is all written to be maximally confusing. I think this could do with some re-org to make it more logical. That "some random things done in fix_processor_context(), other random things done directly in __restore_processor_state()" makes no sense at all to me. There's no logic to what is done where. Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-10 21:28 ` Linus Torvalds @ 2017-12-10 21:35 ` Pavel Machek 2017-12-12 17:27 ` Linus Torvalds 1 sibling, 0 replies; 83+ messages in thread From: Pavel Machek @ 2017-12-10 21:35 UTC (permalink / raw) To: Linus Torvalds Cc: Zhang Rui, Andrew Lutomirski, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers [-- Attachment #1: Type: text/plain, Size: 1760 bytes --] On Sun 2017-12-10 13:28:50, Linus Torvalds wrote: > On Sun, Dec 10, 2017 at 12:43 PM, Pavel Machek <pavel@ucw.cz> wrote: > > > > For the record... this should fix it. Tested on x60. More tests pending. > > This can't be right. > > At the very least, now the comment is wrong. And the comment does seem > relevant for 32-bit too: Well, take a look at orignal patch. I'm reverting 32-bit code to v4.15-rc1 version, while keeping 64-bit code at v4.15-rc3 version. Yes, my brain hurts from looking at the code :-(. In the meantime, I did short test on 64-bit machine. No ill effect observed. Hmm. Aha. Yes, the comment is wrong... as it was in wrong in -rc1. > > - fix_processor_context(); > > - > > /* > > * Restore segment registers. This happens after restoring the GDT > > * and LDT, which happen in fix_processor_context(). > > Notice? You've moved down the 32-bit fix_processor_context() call to > after the loadsegment() calls, which smells wrong. Yeah, I did. There's where it was in v4.15-rc1, and that's what ws working for me. > That said, this *all* smells wrong. Why is there a special > fix_processor_context() function at all with different 32-bit and > 64-bit behavior? This code is all written to be maximally confusing. > > I think this could do with some re-org to make it more logical. That > "some random things done in fix_processor_context(), other random > things done directly in __restore_processor_state()" makes no sense at > all to me. There's no logic to what is done where. I have to agree. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-10 21:28 ` Linus Torvalds 2017-12-10 21:35 ` Pavel Machek @ 2017-12-12 17:27 ` Linus Torvalds 2017-12-12 18:05 ` Andy Lutomirski 1 sibling, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2017-12-12 17:27 UTC (permalink / raw) To: Pavel Machek Cc: Zhang Rui, Andrew Lutomirski, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers On Sun, Dec 10, 2017 at 1:28 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > That said, this *all* smells wrong. Why is there a special > fix_processor_context() function at all with different 32-bit and > 64-bit behavior? This code is all written to be maximally confusing. Hmm. Looking a bit more at this, I think it should be solved by: - load the original read-write GDT early, along with the IDT. We have already saved it off in __save_processor_state(), and it may have already gotten loaded very early in at least some of the paths, but it definitely hasn't gotten reloaded in all the paths (think "suspend/resume testing" etc). - add the LDT descriptor to the save area too, exactly like we already have IDT/GDT. Then, we can do "load_ldt()" early (along with IDT and GDT). - now we can just load all the segments early, and get the percpu and stack canary stuff without any special cases - do NOT use "load_gs_index()", which does that swapgs dance (twice!) and plays with interrupt state. Just load the segment register, and then do the wrmsrl() of the {FS,GS,KERNEL_GS}_BASE values. There is no need for the swapgs dance. - now that we have a fully working system, then the "fix_processor_context()" code can do the "fancy" stuff like setting up the RO fixmap GDT and re-initializing the TLB state. Those want percpu stuff. In other words, why not try to organize this so that we do a simple and fairly mindless restore of the low-level state first? Avoid all the "system is halfway up" crud. Would that work for people? Andy? Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-12 17:27 ` Linus Torvalds @ 2017-12-12 18:05 ` Andy Lutomirski 2017-12-12 18:36 ` Linus Torvalds 0 siblings, 1 reply; 83+ messages in thread From: Andy Lutomirski @ 2017-12-12 18:05 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Zhang Rui, Andrew Lutomirski, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers On Tue, Dec 12, 2017 at 9:27 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, Dec 10, 2017 at 1:28 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> That said, this *all* smells wrong. Why is there a special >> fix_processor_context() function at all with different 32-bit and >> 64-bit behavior? This code is all written to be maximally confusing. > > Hmm. Looking a bit more at this, I think it should be solved by: > > - load the original read-write GDT early, along with the IDT. > > We have already saved it off in __save_processor_state(), and it may > have already gotten loaded very early in at least some of the paths, > but it definitely hasn't gotten reloaded in all the paths (think > "suspend/resume testing" etc). > > - add the LDT descriptor to the save area too, exactly like we > already have IDT/GDT. > > Then, we can do "load_ldt()" early (along with IDT and GDT). > > - now we can just load all the segments early, and get the percpu and > stack canary stuff without any special cases > > - do NOT use "load_gs_index()", which does that swapgs dance (twice!) > and plays with interrupt state. Just load the segment register, and > then do the wrmsrl() of the {FS,GS,KERNEL_GS}_BASE values. There is no > need for the swapgs dance. Using what helper? On x86_64, it can fault, and IIRC we explicitly don't allow loadsegment(gs, ...). > > - now that we have a fully working system, then the > "fix_processor_context()" code can do the "fancy" stuff like setting > up the RO fixmap GDT and re-initializing the TLB state. Those want > percpu stuff. > > In other words, why not try to organize this so that we do a simple > and fairly mindless restore of the low-level state first? Avoid all > the "system is halfway up" crud. > > Would that work for people? Andy? Other than the above, more or less. But we should really do all the user segments last. They're not at all needed for normal kernel execution, so I think they should be the very last part. I'll try to get to this today. ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-12 18:05 ` Andy Lutomirski @ 2017-12-12 18:36 ` Linus Torvalds 2017-12-12 22:10 ` Andy Lutomirski 0 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2017-12-12 18:36 UTC (permalink / raw) To: Andy Lutomirski Cc: Pavel Machek, Zhang Rui, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers On Tue, Dec 12, 2017 at 10:05 AM, Andy Lutomirski <luto@kernel.org> wrote: >> >> - do NOT use "load_gs_index()", which does that swapgs dance (twice!) >> and plays with interrupt state. Just load the segment register, and >> then do the wrmsrl() of the {FS,GS,KERNEL_GS}_BASE values. There is no >> need for the swapgs dance. > > Using what helper? On x86_64, it can fault, and IIRC we explicitly > don't allow loadsegment(gs, ...). Just do the loadsegment() thing. The fact that we don't have a gs version of it is legacy - to catch bad users. It shouldn't stop us from having good users. That said - can it really fault? Because if it can, then why can't %fs fault? And on x86-64, we just do asm volatile ("movw %0, %%fs" :: "r" (ctxt->fs)); and don't actually use 'loadsegment()' for _any_ of the segments. We only do the fault protection on 32-bit. In fact, we really should try to avoid taking faults here anyway, shouldn't we? We haven't loaded enough of the context yet. Hmm. Maybe we should load only the fixed kernel segments at this point, and then do all the loadsegment() of gs/fs in the later phase when we're all set up. THERE we can do the swapgs dance with interrupt tracing etc, because *there* we actually are fully set up. I guess that means reloading the FS/GS base MSR's, Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-12 18:36 ` Linus Torvalds @ 2017-12-12 22:10 ` Andy Lutomirski 2017-12-12 22:33 ` Linus Torvalds 2017-12-13 11:16 ` Jarkko Nikula 0 siblings, 2 replies; 83+ messages in thread From: Andy Lutomirski @ 2017-12-12 22:10 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, Pavel Machek, Zhang Rui, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers On Tue, Dec 12, 2017 at 10:36 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Dec 12, 2017 at 10:05 AM, Andy Lutomirski <luto@kernel.org> wrote: >>> >>> - do NOT use "load_gs_index()", which does that swapgs dance (twice!) >>> and plays with interrupt state. Just load the segment register, and >>> then do the wrmsrl() of the {FS,GS,KERNEL_GS}_BASE values. There is no >>> need for the swapgs dance. >> >> Using what helper? On x86_64, it can fault, and IIRC we explicitly >> don't allow loadsegment(gs, ...). > > Just do the loadsegment() thing. The fact that we don't have a gs > version of it is legacy - to catch bad users. It shouldn't stop us > from having good users. > > That said - can it really fault? Because if it can, then why can't %fs > fault? And on x86-64, we just do > > asm volatile ("movw %0, %%fs" :: "r" (ctxt->fs)); > > and don't actually use 'loadsegment()' for _any_ of the segments. We > only do the fault protection on 32-bit. > > In fact, we really should try to avoid taking faults here anyway, > shouldn't we? We haven't loaded enough of the context yet. > > Hmm. > > Maybe we should load only the fixed kernel segments at this point, and > then do all the loadsegment() of gs/fs in the later phase when we're > all set up. > > THERE we can do the swapgs dance with interrupt tracing etc, because > *there* we actually are fully set up. I guess that means reloading the > FS/GS base MSR's, Like this? https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=cb855aa9679a15adbe43732f5854270de2b35856 I've barely tested it. It suspended and resumed once in a 64-bit VM. It compiles on 32-bit. (That link might not work for a little bit. I'm not sure what's up.) ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-12 22:10 ` Andy Lutomirski @ 2017-12-12 22:33 ` Linus Torvalds 2017-12-12 23:10 ` Andy Lutomirski 2017-12-13 11:16 ` Jarkko Nikula 1 sibling, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2017-12-12 22:33 UTC (permalink / raw) To: Andy Lutomirski Cc: Pavel Machek, Zhang Rui, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers On Tue, Dec 12, 2017 at 2:10 PM, Andy Lutomirski <luto@kernel.org> wrote: > > (That link might not work for a little bit. I'm not sure what's up.) I think your link is just bogus. https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes works. Anyway, the code looks much better to me. Whether it _works_ is another matter, of course. Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-12 22:33 ` Linus Torvalds @ 2017-12-12 23:10 ` Andy Lutomirski 0 siblings, 0 replies; 83+ messages in thread From: Andy Lutomirski @ 2017-12-12 23:10 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, Pavel Machek, Zhang Rui, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers On Tue, Dec 12, 2017 at 2:33 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Dec 12, 2017 at 2:10 PM, Andy Lutomirski <luto@kernel.org> wrote: >> >> (That link might not work for a little bit. I'm not sure what's up.) > > I think your link is just bogus. > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes > > works. If I click "commit" on that page, I get my link, and it's still busted. > > Anyway, the code looks much better to me. > > Whether it _works_ is another matter, of course. > > Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-12 22:10 ` Andy Lutomirski 2017-12-12 22:33 ` Linus Torvalds @ 2017-12-13 11:16 ` Jarkko Nikula 2017-12-13 12:40 ` Ingo Molnar 2017-12-13 18:50 ` Andy Lutomirski 1 sibling, 2 replies; 83+ messages in thread From: Jarkko Nikula @ 2017-12-13 11:16 UTC (permalink / raw) To: Andy Lutomirski, Linus Torvalds Cc: Pavel Machek, Zhang Rui, Thomas Gleixner, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers On 12/13/2017 12:10 AM, Andy Lutomirski wrote: > On Tue, Dec 12, 2017 at 10:36 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> On Tue, Dec 12, 2017 at 10:05 AM, Andy Lutomirski <luto@kernel.org> wrote: > Like this? > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=cb855aa9679a15adbe43732f5854270de2b35856 > > I've barely tested it. It suspended and resumed once in a 64-bit VM. > It compiles on 32-bit. > I tested with "rtcwake -s 5 -m mem" 2-3 times on a real machine. Both 64-bit and 32-bit work. Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-13 11:16 ` Jarkko Nikula @ 2017-12-13 12:40 ` Ingo Molnar 2017-12-13 18:50 ` Andy Lutomirski 1 sibling, 0 replies; 83+ messages in thread From: Ingo Molnar @ 2017-12-13 12:40 UTC (permalink / raw) To: Jarkko Nikula Cc: Andy Lutomirski, Linus Torvalds, Pavel Machek, Zhang Rui, Thomas Gleixner, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers * Jarkko Nikula <jarkko.nikula@linux.intel.com> wrote: > On 12/13/2017 12:10 AM, Andy Lutomirski wrote: > > On Tue, Dec 12, 2017 at 10:36 AM, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > On Tue, Dec 12, 2017 at 10:05 AM, Andy Lutomirski <luto@kernel.org> wrote: > > Like this? > > > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=cb855aa9679a15adbe43732f5854270de2b35856 > > > > I've barely tested it. It suspended and resumed once in a 64-bit VM. > > It compiles on 32-bit. > > > I tested with "rtcwake -s 5 -m mem" 2-3 times on a real machine. Both 64-bit > and 32-bit work. > > Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Great, thanks for the testing! Andy, mind sending the final fix with a changelog and the Reported-by/Tested-by tags, etc? Thanks, Ingo ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-13 11:16 ` Jarkko Nikula 2017-12-13 12:40 ` Ingo Molnar @ 2017-12-13 18:50 ` Andy Lutomirski 1 sibling, 0 replies; 83+ messages in thread From: Andy Lutomirski @ 2017-12-13 18:50 UTC (permalink / raw) To: Jarkko Nikula Cc: Andy Lutomirski, Linus Torvalds, Pavel Machek, Zhang Rui, Thomas Gleixner, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers On Wed, Dec 13, 2017 at 3:16 AM, Jarkko Nikula <jarkko.nikula@linux.intel.com> wrote: > On 12/13/2017 12:10 AM, Andy Lutomirski wrote: >> >> On Tue, Dec 12, 2017 at 10:36 AM, Linus Torvalds >> <torvalds@linux-foundation.org> wrote: >>> >>> On Tue, Dec 12, 2017 at 10:05 AM, Andy Lutomirski <luto@kernel.org> >>> wrote: >> >> Like this? >> >> >> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=cb855aa9679a15adbe43732f5854270de2b35856 >> >> I've barely tested it. It suspended and resumed once in a 64-bit VM. >> It compiles on 32-bit. >> > I tested with "rtcwake -s 5 -m mem" 2-3 times on a real machine. Both 64-bit > and 32-bit work. > > Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Thanks! I've split the patch into three and put your Tested-By on all of them, since the end state is exactly the same as what you tested. I'll email them out once I test them myself :) ^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH] Fix resume on x86-32 machines 2017-12-10 20:30 ` Linus Torvalds 2017-12-10 20:43 ` Pavel Machek @ 2017-12-10 21:38 ` Pavel Machek 2017-12-10 21:58 ` Andy Lutomirski 2017-12-11 14:09 ` Linux 4.15-rc2: Regression in resume from ACPI S3 Zhang Rui 2 siblings, 1 reply; 83+ messages in thread From: Pavel Machek @ 2017-12-10 21:38 UTC (permalink / raw) To: Linus Torvalds Cc: Zhang Rui, Andrew Lutomirski, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers [-- Attachment #1: Type: text/plain, Size: 2072 bytes --] After 4.15-rc2, suspend stopped working on Thinkpad X60. 5b06bbc (unintentionally?) reordered stuff with respect to fix_processor_context() on 32-bit and 64-bit. We undo that change on 32-bit. While we are at it, fix a comment. Signed-off-by: Pavel Machek <pavel@ucw.cz> Fixes: 5b06bbcfc2c621da3009da8decb7511500c293ed diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c index 5191de1..af7b613 100644 --- a/arch/x86/power/cpu.c +++ b/arch/x86/power/cpu.c @@ -216,8 +216,8 @@ static void notrace __restore_processor_state(struct saved_context *ctxt) write_cr0(ctxt->cr0); /* - * now restore the descriptor tables to their proper values - * ltr is done i fix_processor_context(). + * Now restore the descriptor tables to their proper values + * ltr is done in fix_processor_context(). */ #ifdef CONFIG_X86_32 load_idt(&ctxt->idt); @@ -235,13 +235,11 @@ static void notrace __restore_processor_state(struct saved_context *ctxt) wrmsrl(MSR_GS_BASE, ctxt->gs_base); #endif - fix_processor_context(); - +#ifdef CONFIG_X86_32 /* - * Restore segment registers. This happens after restoring the GDT - * and LDT, which happen in fix_processor_context(). + * Restore segment registers. */ -#ifdef CONFIG_X86_32 + loadsegment(es, ctxt->es); loadsegment(fs, ctxt->fs); loadsegment(gs, ctxt->gs); @@ -252,8 +250,17 @@ static void notrace __restore_processor_state(struct saved_context *ctxt) */ if (boot_cpu_has(X86_FEATURE_SEP)) enable_sep_cpu(); + + fix_processor_context(); #else /* CONFIG_X86_64 */ + /* + * Restore segment registers. This happens after restoring the GDT + * and LDT, which happen in fix_processor_context(). + */ + + fix_processor_context(); + asm volatile ("movw %0, %%ds" :: "r" (ctxt->ds)); asm volatile ("movw %0, %%es" :: "r" (ctxt->es)); asm volatile ("movw %0, %%fs" :: "r" (ctxt->fs)); -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply related [flat|nested] 83+ messages in thread
* Re: [PATCH] Fix resume on x86-32 machines 2017-12-10 21:38 ` [PATCH] Fix resume on x86-32 machines Pavel Machek @ 2017-12-10 21:58 ` Andy Lutomirski 2017-12-10 22:20 ` Pavel Machek ` (2 more replies) 0 siblings, 3 replies; 83+ messages in thread From: Andy Lutomirski @ 2017-12-10 21:58 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Zhang Rui, Andrew Lutomirski, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers > On Dec 10, 2017, at 1:38 PM, Pavel Machek <pavel@ucw.cz> wrote: > > > After 4.15-rc2, suspend stopped working on Thinkpad X60. 5b06bbc > (unintentionally?) reordered stuff with respect to > fix_processor_context() on 32-bit and 64-bit. We undo that change on > 32-bit. > Can you explain what was wrong with the reordering? Your patch certainly *looks* incorrect. I'm guessing that the real issue is that 32-bit needs %fs restored early for TLS. > While we are at it, fix a comment. > > Signed-off-by: Pavel Machek <pavel@ucw.cz> > Fixes: 5b06bbcfc2c621da3009da8decb7511500c293ed > > diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c > index 5191de1..af7b613 100644 > --- a/arch/x86/power/cpu.c > +++ b/arch/x86/power/cpu.c > @@ -216,8 +216,8 @@ static void notrace __restore_processor_state(struct saved_context *ctxt) > write_cr0(ctxt->cr0); > > /* > - * now restore the descriptor tables to their proper values > - * ltr is done i fix_processor_context(). > + * Now restore the descriptor tables to their proper values > + * ltr is done in fix_processor_context(). > */ > #ifdef CONFIG_X86_32 > load_idt(&ctxt->idt); > @@ -235,13 +235,11 @@ static void notrace __restore_processor_state(struct saved_context *ctxt) > wrmsrl(MSR_GS_BASE, ctxt->gs_base); > #endif > > - fix_processor_context(); > - > +#ifdef CONFIG_X86_32 > /* > - * Restore segment registers. This happens after restoring the GDT > - * and LDT, which happen in fix_processor_context(). > + * Restore segment registers. > */ > -#ifdef CONFIG_X86_32 > + > loadsegment(es, ctxt->es); > loadsegment(fs, ctxt->fs); > loadsegment(gs, ctxt->gs); > @@ -252,8 +250,17 @@ static void notrace __restore_processor_state(struct saved_context *ctxt) > */ > if (boot_cpu_has(X86_FEATURE_SEP)) > enable_sep_cpu(); > + > + fix_processor_context(); > #else > /* CONFIG_X86_64 */ > + /* > + * Restore segment registers. This happens after restoring the GDT > + * and LDT, which happen in fix_processor_context(). > + */ > + > + fix_processor_context(); > + > asm volatile ("movw %0, %%ds" :: "r" (ctxt->ds)); > asm volatile ("movw %0, %%es" :: "r" (ctxt->es)); > asm volatile ("movw %0, %%fs" :: "r" (ctxt->fs)); > > > -- > (english) http://www.livejournal.com/~pavelmachek > (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] Fix resume on x86-32 machines 2017-12-10 21:58 ` Andy Lutomirski @ 2017-12-10 22:20 ` Pavel Machek 2017-12-11 9:25 ` Jarkko Nikula 2017-12-11 14:22 ` Rafael J. Wysocki 2017-12-11 15:13 ` Ingo Molnar 2 siblings, 1 reply; 83+ messages in thread From: Pavel Machek @ 2017-12-10 22:20 UTC (permalink / raw) To: Andy Lutomirski Cc: Linus Torvalds, Zhang Rui, Andrew Lutomirski, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers [-- Attachment #1: Type: text/plain, Size: 3062 bytes --] On Sun 2017-12-10 13:58:23, Andy Lutomirski wrote: > > On Dec 10, 2017, at 1:38 PM, Pavel Machek <pavel@ucw.cz> wrote: > > > > > > After 4.15-rc2, suspend stopped working on Thinkpad X60. 5b06bbc > > (unintentionally?) reordered stuff with respect to > > fix_processor_context() on 32-bit and 64-bit. We undo that change on > > 32-bit. > > > > Can you explain what was wrong with the reordering? Your patch certainly *looks* incorrect. > No, I can't, sorry. > I'm guessing that the real issue is that 32-bit needs %fs restored > early for TLS. Maybe. I can test patches... I don't think it would be good idea to revert 5b06bbcfc2c621da3009da8decb7511500c293ed, but since it introduced regression in -rc2, I believe we should fix the regression now, and then we can try to provide cleaner solution. As Linus noted, the code is quite "interesting". Pavel > > > While we are at it, fix a comment. > > > > Signed-off-by: Pavel Machek <pavel@ucw.cz> > > Fixes: 5b06bbcfc2c621da3009da8decb7511500c293ed > > > > diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c > > index 5191de1..af7b613 100644 > > --- a/arch/x86/power/cpu.c > > +++ b/arch/x86/power/cpu.c > > @@ -216,8 +216,8 @@ static void notrace __restore_processor_state(struct saved_context *ctxt) > > write_cr0(ctxt->cr0); > > > > /* > > - * now restore the descriptor tables to their proper values > > - * ltr is done i fix_processor_context(). > > + * Now restore the descriptor tables to their proper values > > + * ltr is done in fix_processor_context(). > > */ > > #ifdef CONFIG_X86_32 > > load_idt(&ctxt->idt); > > @@ -235,13 +235,11 @@ static void notrace __restore_processor_state(struct saved_context *ctxt) > > wrmsrl(MSR_GS_BASE, ctxt->gs_base); > > #endif > > > > - fix_processor_context(); > > - > > +#ifdef CONFIG_X86_32 > > /* > > - * Restore segment registers. This happens after restoring the GDT > > - * and LDT, which happen in fix_processor_context(). > > + * Restore segment registers. > > */ > > -#ifdef CONFIG_X86_32 > > + > > loadsegment(es, ctxt->es); > > loadsegment(fs, ctxt->fs); > > loadsegment(gs, ctxt->gs); > > @@ -252,8 +250,17 @@ static void notrace __restore_processor_state(struct saved_context *ctxt) > > */ > > if (boot_cpu_has(X86_FEATURE_SEP)) > > enable_sep_cpu(); > > + > > + fix_processor_context(); > > #else > > /* CONFIG_X86_64 */ > > + /* > > + * Restore segment registers. This happens after restoring the GDT > > + * and LDT, which happen in fix_processor_context(). > > + */ > > + > > + fix_processor_context(); > > + > > asm volatile ("movw %0, %%ds" :: "r" (ctxt->ds)); > > asm volatile ("movw %0, %%es" :: "r" (ctxt->es)); > > asm volatile ("movw %0, %%fs" :: "r" (ctxt->fs)); > > -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] Fix resume on x86-32 machines 2017-12-10 22:20 ` Pavel Machek @ 2017-12-11 9:25 ` Jarkko Nikula 0 siblings, 0 replies; 83+ messages in thread From: Jarkko Nikula @ 2017-12-11 9:25 UTC (permalink / raw) To: Pavel Machek, Andy Lutomirski Cc: Linus Torvalds, Zhang Rui, Andrew Lutomirski, Thomas Gleixner, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers On 12/11/2017 12:20 AM, Pavel Machek wrote: > On Sun 2017-12-10 13:58:23, Andy Lutomirski wrote: >>> On Dec 10, 2017, at 1:38 PM, Pavel Machek <pavel@ucw.cz> wrote: >>> >>> >>> After 4.15-rc2, suspend stopped working on Thinkpad X60. 5b06bbc >>> (unintentionally?) reordered stuff with respect to >>> fix_processor_context() on 32-bit and 64-bit. We undo that change on >>> 32-bit. >>> >> >> Can you explain what was wrong with the reordering? Your patch certainly *looks* incorrect. >> > > No, I can't, sorry. > >> I'm guessing that the real issue is that 32-bit needs %fs restored >> early for TLS. > > Maybe. I can test patches... > > I don't think it would be good idea to revert > 5b06bbcfc2c621da3009da8decb7511500c293ed, but since it introduced > regression in -rc2, I believe we should fix the regression now, and > then we can try to provide cleaner solution. > I can confirm Pavel's findings. Commit 5b06bbcfc2c6 ("x86/power: Fix some ordering bugs in __restore_processor_context()") broke the suspend/resume on 32-bit kernel. v4.15-rc3 works either by reverting the commit or by Pavel's patch. Fortunately Pavel's patch still keeps the 64-bit suspend/resume ok. -- Jarkko ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] Fix resume on x86-32 machines 2017-12-10 21:58 ` Andy Lutomirski 2017-12-10 22:20 ` Pavel Machek @ 2017-12-11 14:22 ` Rafael J. Wysocki 2017-12-11 14:43 ` Rafael J. Wysocki ` (2 more replies) 2017-12-11 15:13 ` Ingo Molnar 2 siblings, 3 replies; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-11 14:22 UTC (permalink / raw) To: Pavel Machek, Andrew Lutomirski Cc: Andy Lutomirski, Linus Torvalds, Zhang Rui, Thomas Gleixner, Jarkko Nikula, Linux Kernel Mailing List, the arch/x86 maintainers On Sunday, December 10, 2017 10:58:23 PM CET Andy Lutomirski wrote: > > > On Dec 10, 2017, at 1:38 PM, Pavel Machek <pavel@ucw.cz> wrote: > > > > > > After 4.15-rc2, suspend stopped working on Thinkpad X60. 5b06bbc > > (unintentionally?) reordered stuff with respect to > > fix_processor_context() on 32-bit and 64-bit. We undo that change on > > 32-bit. > > > > Can you explain what was wrong with the reordering? Your patch certainly *looks* incorrect. > > I'm guessing that the real issue is that 32-bit needs %fs restored early for TLS. I *think* you are right. Anyway, that should be easy enough to verify. Pavel, can you please check if the below change works too? --- arch/x86/power/cpu.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux-pm/arch/x86/power/cpu.c =================================================================== --- linux-pm.orig/arch/x86/power/cpu.c +++ linux-pm/arch/x86/power/cpu.c @@ -221,6 +221,8 @@ static void notrace __restore_processor_ */ #ifdef CONFIG_X86_32 load_idt(&ctxt->idt); + + loadsegment(fs, ctxt->fs); #else /* CONFIG_X86_64 */ load_idt((const struct desc_ptr *)&ctxt->idt_limit); @@ -243,7 +245,6 @@ static void notrace __restore_processor_ */ #ifdef CONFIG_X86_32 loadsegment(es, ctxt->es); - loadsegment(fs, ctxt->fs); loadsegment(gs, ctxt->gs); loadsegment(ss, ctxt->ss); ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] Fix resume on x86-32 machines 2017-12-11 14:22 ` Rafael J. Wysocki @ 2017-12-11 14:43 ` Rafael J. Wysocki 2017-12-11 14:59 ` Jarkko Nikula 2017-12-11 18:31 ` Linus Torvalds 2 siblings, 0 replies; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-11 14:43 UTC (permalink / raw) To: Andrew Lutomirski Cc: Pavel Machek, Andy Lutomirski, Linus Torvalds, Zhang Rui, Thomas Gleixner, Jarkko Nikula, Linux Kernel Mailing List, the arch/x86 maintainers On Monday, December 11, 2017 3:22:39 PM CET Rafael J. Wysocki wrote: > On Sunday, December 10, 2017 10:58:23 PM CET Andy Lutomirski wrote: > > > > > On Dec 10, 2017, at 1:38 PM, Pavel Machek <pavel@ucw.cz> wrote: > > > > > > > > > After 4.15-rc2, suspend stopped working on Thinkpad X60. 5b06bbc > > > (unintentionally?) reordered stuff with respect to > > > fix_processor_context() on 32-bit and 64-bit. We undo that change on > > > 32-bit. > > > > > > > Can you explain what was wrong with the reordering? Your patch certainly *looks* incorrect. > > > > I'm guessing that the real issue is that 32-bit needs %fs restored early for TLS. > > I *think* you are right. Hmm. Don't we need to restore GS on 32-bit before doing the per_cpu() thing in fix_processor_context()? ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] Fix resume on x86-32 machines 2017-12-11 14:22 ` Rafael J. Wysocki 2017-12-11 14:43 ` Rafael J. Wysocki @ 2017-12-11 14:59 ` Jarkko Nikula 2017-12-11 18:31 ` Linus Torvalds 2 siblings, 0 replies; 83+ messages in thread From: Jarkko Nikula @ 2017-12-11 14:59 UTC (permalink / raw) To: Rafael J. Wysocki, Pavel Machek, Andrew Lutomirski Cc: Andy Lutomirski, Linus Torvalds, Zhang Rui, Thomas Gleixner, Linux Kernel Mailing List, the arch/x86 maintainers On 12/11/2017 04:22 PM, Rafael J. Wysocki wrote: > On Sunday, December 10, 2017 10:58:23 PM CET Andy Lutomirski wrote: >> >>> On Dec 10, 2017, at 1:38 PM, Pavel Machek <pavel@ucw.cz> wrote: >>> >>> >>> After 4.15-rc2, suspend stopped working on Thinkpad X60. 5b06bbc >>> (unintentionally?) reordered stuff with respect to >>> fix_processor_context() on 32-bit and 64-bit. We undo that change on >>> 32-bit. >>> >> >> Can you explain what was wrong with the reordering? Your patch certainly *looks* incorrect. >> >> I'm guessing that the real issue is that 32-bit needs %fs restored early for TLS. > > I *think* you are right. > > Anyway, that should be easy enough to verify. > > Pavel, can you please check if the below change works too? > > --- > arch/x86/power/cpu.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > Works here on my test machine. Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] Fix resume on x86-32 machines 2017-12-11 14:22 ` Rafael J. Wysocki 2017-12-11 14:43 ` Rafael J. Wysocki 2017-12-11 14:59 ` Jarkko Nikula @ 2017-12-11 18:31 ` Linus Torvalds 2017-12-11 18:41 ` Andy Lutomirski 2 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2017-12-11 18:31 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Pavel Machek, Andrew Lutomirski, Andy Lutomirski, Zhang Rui, Thomas Gleixner, Jarkko Nikula, Linux Kernel Mailing List, the arch/x86 maintainers On Mon, Dec 11, 2017 at 6:22 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > On Sunday, December 10, 2017 10:58:23 PM CET Andy Lutomirski wrote: >> >> I'm guessing that the real issue is that 32-bit needs %fs restored early for TLS. > > I *think* you are right. > > Anyway, that should be easy enough to verify. > > Pavel, can you please check if the below change works too? So Jarkko confirmed this works for him, but the more I look at this crap, the less I like it. Why do we save fs/ds/es/ss at all on x86-32? Don't they all have fixed values in the kernel, with %fs being __KERNEL_PERCPU, and the others being __USER_DS? Nothing else can possibly be valid, as far as I can tell. I think we actually leave the user-space percpu segment in %gs (or the stack canary base), so that one we should actually save/restore, but I'm getting the feeling that we should just reset the other segment registers to known values on 32-bit. Also, why does the 32-bit code do loadsegment(es, ctxt->es); but the 64-bit code does asm volatile ("movw %0, %%es" :: "r" (ctxt->es)); And look at that confusion between MSR_GS_BASE and MSR_KERNEL_GS_BASE all within the 64-bit case. In particular, note how we reload the %gs segment in between the two - wouldn't that mess with the currently active gs base if %gs can be non-zero? Christ, what a mess. So I think that whole sequence is garbage. It has been written as some kind of "save and restore registers", but that's not what it really then does - or what it should do. It should make sure to restore a sane kernel state, not some random register state. And the 32-bit and 64-bit code really should strive to be at least _sanely_ different, not this randomly and insanely different mess. But yes, Rafael's patch looks like the minimal one-liner. But I think we should do the %gs load early too for the 32-bit stack canary case, kind of like we need to do %fs for percpu base. Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] Fix resume on x86-32 machines 2017-12-11 18:31 ` Linus Torvalds @ 2017-12-11 18:41 ` Andy Lutomirski 2017-12-11 19:12 ` Linus Torvalds 2017-12-14 20:38 ` Pavel Machek 0 siblings, 2 replies; 83+ messages in thread From: Andy Lutomirski @ 2017-12-11 18:41 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Pavel Machek, Andrew Lutomirski, Zhang Rui, Thomas Gleixner, Jarkko Nikula, Linux Kernel Mailing List, the arch/x86 maintainers On Mon, Dec 11, 2017 at 10:31 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, Dec 11, 2017 at 6:22 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >> On Sunday, December 10, 2017 10:58:23 PM CET Andy Lutomirski wrote: >>> >>> I'm guessing that the real issue is that 32-bit needs %fs restored early for TLS. >> >> I *think* you are right. >> >> Anyway, that should be easy enough to verify. >> >> Pavel, can you please check if the below change works too? > > So Jarkko confirmed this works for him, but the more I look at this > crap, the less I like it. > > Why do we save fs/ds/es/ss at all on x86-32? Don't they all have fixed > values in the kernel, with %fs being __KERNEL_PERCPU, and the others > being __USER_DS? > > Nothing else can possibly be valid, as far as I can tell. > > I think we actually leave the user-space percpu segment in %gs (or the > stack canary base), so that one we should actually save/restore, but > I'm getting the feeling that we should just reset the other segment > registers to known values on 32-bit. > > Also, why does the 32-bit code do > > loadsegment(es, ctxt->es); > > but the 64-bit code does > > asm volatile ("movw %0, %%es" :: "r" (ctxt->es)); > > And look at that confusion between MSR_GS_BASE and MSR_KERNEL_GS_BASE > all within the 64-bit case. > > In particular, note how we reload the %gs segment in between the two - > wouldn't that mess with the currently active gs base if %gs can be > non-zero? > > Christ, what a mess. > > So I think that whole sequence is garbage. It has been written as some > kind of "save and restore registers", but that's not what it really > then does - or what it should do. > > It should make sure to restore a sane kernel state, not some random > register state. > > And the 32-bit and 64-bit code really should strive to be at least > _sanely_ different, not this randomly and insanely different mess. > > But yes, Rafael's patch looks like the minimal one-liner. But I think > we should do the %gs load early too for the 32-bit stack canary case, > kind of like we need to do %fs for percpu base. I'll try to get to this in a day or so -- is that okay? Or should we do some trivial fix/revert and fix it for real next time around? ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] Fix resume on x86-32 machines 2017-12-11 18:41 ` Andy Lutomirski @ 2017-12-11 19:12 ` Linus Torvalds 2017-12-14 20:38 ` Pavel Machek 1 sibling, 0 replies; 83+ messages in thread From: Linus Torvalds @ 2017-12-11 19:12 UTC (permalink / raw) To: Andy Lutomirski Cc: Rafael J. Wysocki, Pavel Machek, Andrew Lutomirski, Zhang Rui, Thomas Gleixner, Jarkko Nikula, Linux Kernel Mailing List, the arch/x86 maintainers On Mon, Dec 11, 2017 at 10:41 AM, Andy Lutomirski <luto@amacapital.net> wrote: > > I'll try to get to this in a day or so -- is that okay? Or should we > do some trivial fix/revert and fix it for real next time around? I don't think we want some trivial fix/revert just to keep it working. This code is too fragile as-is, and I think that "make it work" is more than reverting. You did fix real issues on x86-64 with odd segment use, for example. Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] Fix resume on x86-32 machines 2017-12-11 18:41 ` Andy Lutomirski 2017-12-11 19:12 ` Linus Torvalds @ 2017-12-14 20:38 ` Pavel Machek 2017-12-14 20:47 ` Linus Torvalds 1 sibling, 1 reply; 83+ messages in thread From: Pavel Machek @ 2017-12-14 20:38 UTC (permalink / raw) To: Andy Lutomirski Cc: Linus Torvalds, Rafael J. Wysocki, Andrew Lutomirski, Zhang Rui, Thomas Gleixner, Jarkko Nikula, Linux Kernel Mailing List, the arch/x86 maintainers [-- Attachment #1: Type: text/plain, Size: 738 bytes --] Hi! > > But yes, Rafael's patch looks like the minimal one-liner. But I think > > we should do the %gs load early too for the 32-bit stack canary case, > > kind of like we need to do %fs for percpu base. > > I'll try to get to this in a day or so -- is that okay? Or should we > do some trivial fix/revert and fix it for real next time around? I can test the patch.... But given this is already "regression fix for x86-64 caused regression on x86-32", I really believe we should merge trivial fix now, and do the cleanups / nicer fixes sometime later? Thanks, Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] Fix resume on x86-32 machines 2017-12-14 20:38 ` Pavel Machek @ 2017-12-14 20:47 ` Linus Torvalds 2017-12-14 21:20 ` Andy Lutomirski 2017-12-14 22:22 ` Pavel Machek 0 siblings, 2 replies; 83+ messages in thread From: Linus Torvalds @ 2017-12-14 20:47 UTC (permalink / raw) To: Pavel Machek Cc: Andy Lutomirski, Rafael J. Wysocki, Andrew Lutomirski, Zhang Rui, Thomas Gleixner, Jarkko Nikula, Linux Kernel Mailing List, the arch/x86 maintainers On Thu, Dec 14, 2017 at 12:38 PM, Pavel Machek <pavel@ucw.cz> wrote: > > But given this is already "regression fix for x86-64 caused regression > on x86-32", I really believe we should merge trivial fix now, and do > the cleanups / nicer fixes sometime later? The fix patch was already posted, but in another thread (confusingly with _almost_ the same subject: "Re: Linux 4.15-rc2: Regression in resume from ACPI S3"). Here: https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes but it hasn't actually gotten to me yet (I think Luto made a new version where he split it up into three smaller patches, and I'm assuming it's in one of the -tip trees heading for me asap). All the x86 people have been insanely busy with the crazy page table isolation patches, that probably is slowing down things. But I was expecting to get the -tip tree pulls tomorrow (.. because.. Friday. It's my busiest day of the week because everybody wants to get their work out before the weekend). Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] Fix resume on x86-32 machines 2017-12-14 20:47 ` Linus Torvalds @ 2017-12-14 21:20 ` Andy Lutomirski 2017-12-14 22:22 ` Pavel Machek 1 sibling, 0 replies; 83+ messages in thread From: Andy Lutomirski @ 2017-12-14 21:20 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Rafael J. Wysocki, Andrew Lutomirski, Zhang Rui, Thomas Gleixner, Jarkko Nikula, Linux Kernel Mailing List, the arch/x86 maintainers On Thu, Dec 14, 2017 at 12:47 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Dec 14, 2017 at 12:38 PM, Pavel Machek <pavel@ucw.cz> wrote: >> >> But given this is already "regression fix for x86-64 caused regression >> on x86-32", I really believe we should merge trivial fix now, and do >> the cleanups / nicer fixes sometime later? > > The fix patch was already posted, but in another thread (confusingly > with _almost_ the same subject: "Re: Linux 4.15-rc2: Regression in > resume from ACPI S3"). > > Here: > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes > > but it hasn't actually gotten to me yet (I think Luto made a new > version where he split it up into three smaller patches, and I'm > assuming it's in one of the -tip trees heading for me asap). > > All the x86 people have been insanely busy with the crazy page table > isolation patches, that probably is slowing down things. But I was > expecting to get the -tip tree pulls tomorrow (.. because.. Friday. > It's my busiest day of the week because everybody wants to get their > work out before the weekend). Nah, it's that I was busy with the stupid PTI stuff, and the original version of the patch didn't compile on some configurations. I've tested the version I just sent for suspend-to-ram and suspend-to-disk on 32-bit and 64-bit in a couple of configurations. ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] Fix resume on x86-32 machines 2017-12-14 20:47 ` Linus Torvalds 2017-12-14 21:20 ` Andy Lutomirski @ 2017-12-14 22:22 ` Pavel Machek 1 sibling, 0 replies; 83+ messages in thread From: Pavel Machek @ 2017-12-14 22:22 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, Rafael J. Wysocki, Andrew Lutomirski, Zhang Rui, Thomas Gleixner, Jarkko Nikula, Linux Kernel Mailing List, the arch/x86 maintainers [-- Attachment #1: Type: text/plain, Size: 1074 bytes --] On Thu 2017-12-14 12:47:56, Linus Torvalds wrote: > On Thu, Dec 14, 2017 at 12:38 PM, Pavel Machek <pavel@ucw.cz> wrote: > > > > But given this is already "regression fix for x86-64 caused regression > > on x86-32", I really believe we should merge trivial fix now, and do > > the cleanups / nicer fixes sometime later? > > The fix patch was already posted, but in another thread (confusingly > with _almost_ the same subject: "Re: Linux 4.15-rc2: Regression in > resume from ACPI S3"). > > Here: > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes > > but it hasn't actually gotten to me yet (I think Luto made a new > version where he split it up into three smaller patches, and I'm > assuming it's in one of the -tip trees heading for me asap). Ok.. It did not apply cleanly. ... but I see fresh patches in my inbox, let me test those. Best regards, Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] Fix resume on x86-32 machines 2017-12-10 21:58 ` Andy Lutomirski 2017-12-10 22:20 ` Pavel Machek 2017-12-11 14:22 ` Rafael J. Wysocki @ 2017-12-11 15:13 ` Ingo Molnar 2017-12-11 16:26 ` Andy Lutomirski 2 siblings, 1 reply; 83+ messages in thread From: Ingo Molnar @ 2017-12-11 15:13 UTC (permalink / raw) To: Andy Lutomirski Cc: Pavel Machek, Linus Torvalds, Zhang Rui, Andrew Lutomirski, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers * Andy Lutomirski <luto@amacapital.net> wrote: > > > > On Dec 10, 2017, at 1:38 PM, Pavel Machek <pavel@ucw.cz> wrote: > > > > > > After 4.15-rc2, suspend stopped working on Thinkpad X60. 5b06bbc > > (unintentionally?) reordered stuff with respect to > > fix_processor_context() on 32-bit and 64-bit. We undo that change on > > 32-bit. > > > > Can you explain what was wrong with the reordering? Your patch certainly *looks* incorrect. > > I'm guessing that the real issue is that 32-bit needs %fs restored early for TLS. Does some early percpu primitive need GS as well perhaps? Might be safest to restore both FS and GS early. Thanks, Ingo ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] Fix resume on x86-32 machines 2017-12-11 15:13 ` Ingo Molnar @ 2017-12-11 16:26 ` Andy Lutomirski 0 siblings, 0 replies; 83+ messages in thread From: Andy Lutomirski @ 2017-12-11 16:26 UTC (permalink / raw) To: Ingo Molnar Cc: Pavel Machek, Linus Torvalds, Zhang Rui, Andrew Lutomirski, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers On Mon, Dec 11, 2017 at 7:13 AM, Ingo Molnar <mingo@kernel.org> wrote: > > * Andy Lutomirski <luto@amacapital.net> wrote: > >> >> >> > On Dec 10, 2017, at 1:38 PM, Pavel Machek <pavel@ucw.cz> wrote: >> > >> > >> > After 4.15-rc2, suspend stopped working on Thinkpad X60. 5b06bbc >> > (unintentionally?) reordered stuff with respect to >> > fix_processor_context() on 32-bit and 64-bit. We undo that change on >> > 32-bit. >> > >> >> Can you explain what was wrong with the reordering? Your patch certainly *looks* incorrect. >> >> I'm guessing that the real issue is that 32-bit needs %fs restored early for TLS. > > Does some early percpu primitive need GS as well perhaps? > > Might be safest to restore both FS and GS early. > fs needs to be restored early for TLS. gs needs to be restored early, maybe, if !X86_32_LAZY_GS -- it's used for stack-protector. If X86_32_LAZY_GS, gs must *not* be restored early. ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-10 20:30 ` Linus Torvalds 2017-12-10 20:43 ` Pavel Machek 2017-12-10 21:38 ` [PATCH] Fix resume on x86-32 machines Pavel Machek @ 2017-12-11 14:09 ` Zhang Rui 2017-12-11 16:28 ` Andy Lutomirski 2017-12-12 8:00 ` Pavel Machek 2 siblings, 2 replies; 83+ messages in thread From: Zhang Rui @ 2017-12-11 14:09 UTC (permalink / raw) To: Linus Torvalds, Pavel Machek Cc: Andrew Lutomirski, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers On Sun, 2017-12-10 at 12:30 -0800, Linus Torvalds wrote: > On Sun, Dec 10, 2017 at 10:56 AM, Pavel Machek <pavel@ucw.cz> wrote: > > > > > > Confirmed, revert fixes it. You see how it moves > > fix_processor_context > > around #ifdef CONFIG_X86_32 block? And how people forget 32-bit > > machines exist? Aha. > Yeah, people do. > > Andy? > > > > > Which brings me to .. various people do automated testing of > > kernel. Testing 32-bit kernel for boot, and both 32-bit and 64-bit > > for > > boot and suspend would be very nice. The last item is not hard, > > either: > > > > sudo rtcwake -l -m mem -s 5 > > > > ...should take 10 seconds or so. > I'm told 0day does *some* suspend/resume testing, but I think it's > pretty limited, partly because the kinds of machines it primarily > works on don't really support suspend/resume at all. currently, we're running suspend test on 1 platform only, with 64 bit kernel. suspend test will be enabled on more platforms (laptops) in next two weeks. I will check why it does not find the first regression introduced by ca37e57bbe0c ("x86/entry/64: Add missing irqflags tracing to native_load_gs_index()"). > I'm also not sure > just how many of those machines are 32-bit at all.. for this, I suppose it can be reproduced if we use 32-bit kernel and rootfs, right? Then it's easier to enable this in 0Day. thanks, rui > > But I'm adding Zhang Rui to the cc, to see if my recollection is > right. > > Because you're right, more suspend/resume automated testing would be > good to have. And yes, people test mainly 64-bit these days. > > Also, I'm not even sure what the 0day rules are for just plain > mainline. I don't tend to see a lot of breakage reports, even though > I'd expect to. This came in from the x86 trees (and those do their > own > tests too, but probably not suspend/resume either), but it hit my > tree > fairly soon after going into the x86 -tip trees. > > Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-11 14:09 ` Linux 4.15-rc2: Regression in resume from ACPI S3 Zhang Rui @ 2017-12-11 16:28 ` Andy Lutomirski 2017-12-12 8:00 ` Pavel Machek 1 sibling, 0 replies; 83+ messages in thread From: Andy Lutomirski @ 2017-12-11 16:28 UTC (permalink / raw) To: Zhang Rui Cc: Linus Torvalds, Pavel Machek, Andrew Lutomirski, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers On Mon, Dec 11, 2017 at 6:09 AM, Zhang Rui <rui.zhang@intel.com> wrote: > On Sun, 2017-12-10 at 12:30 -0800, Linus Torvalds wrote: >> On Sun, Dec 10, 2017 at 10:56 AM, Pavel Machek <pavel@ucw.cz> wrote: >> > >> > >> > Confirmed, revert fixes it. You see how it moves >> > fix_processor_context >> > around #ifdef CONFIG_X86_32 block? And how people forget 32-bit >> > machines exist? Aha. >> Yeah, people do. >> >> Andy? >> >> > >> > Which brings me to .. various people do automated testing of >> > kernel. Testing 32-bit kernel for boot, and both 32-bit and 64-bit >> > for >> > boot and suspend would be very nice. The last item is not hard, >> > either: >> > >> > sudo rtcwake -l -m mem -s 5 >> > >> > ...should take 10 seconds or so. >> I'm told 0day does *some* suspend/resume testing, but I think it's >> pretty limited, partly because the kinds of machines it primarily >> works on don't really support suspend/resume at all. > > currently, we're running suspend test on 1 platform only, with 64 bit > kernel. suspend test will be enabled on more platforms (laptops) in > next two weeks. > > I will check why it does not find the first regression introduced by > ca37e57bbe0c ("x86/entry/64: Add missing irqflags tracing to > native_load_gs_index()"). > >> I'm also not sure >> just how many of those machines are 32-bit at all.. > > for this, I suppose it can be reproduced if we use 32-bit kernel and > rootfs, right? Then it's easier to enable this in 0Day. > Yes. The 64-bit problem should also be reproducible with rtcwake even in a vm. Also, on this topic, could make run_tests in tools/testing/selftests/x86 be added to the rotation as well? The testing dir should match the kernel being tested IMO. > thanks, > rui >> >> But I'm adding Zhang Rui to the cc, to see if my recollection is >> right. >> >> Because you're right, more suspend/resume automated testing would be >> good to have. And yes, people test mainly 64-bit these days. >> >> Also, I'm not even sure what the 0day rules are for just plain >> mainline. I don't tend to see a lot of breakage reports, even though >> I'd expect to. This came in from the x86 trees (and those do their >> own >> tests too, but probably not suspend/resume either), but it hit my >> tree >> fairly soon after going into the x86 -tip trees. >> >> Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-11 14:09 ` Linux 4.15-rc2: Regression in resume from ACPI S3 Zhang Rui 2017-12-11 16:28 ` Andy Lutomirski @ 2017-12-12 8:00 ` Pavel Machek 1 sibling, 0 replies; 83+ messages in thread From: Pavel Machek @ 2017-12-12 8:00 UTC (permalink / raw) To: Zhang Rui Cc: Linus Torvalds, Andrew Lutomirski, Thomas Gleixner, Jarkko Nikula, Rafael J. Wysocki, Linux Kernel Mailing List, the arch/x86 maintainers [-- Attachment #1: Type: text/plain, Size: 1008 bytes --] Hi! > > > ...should take 10 seconds or so. > > I'm told 0day does *some* suspend/resume testing, but I think it's > > pretty limited, partly because the kinds of machines it primarily > > works on don't really support suspend/resume at all. > > currently, we're running suspend test on 1 platform only, with 64 bit > kernel. suspend test will be enabled on more platforms (laptops) in > next two weeks. Thanks! > > I'm also not sure > > just how many of those machines are 32-bit at all.. > > for this, I suppose it can be reproduced if we use 32-bit kernel and > rootfs, right? Then it's easier to enable this in 0Day. Yes, Intel cpus are pretty good at backwards compatibility, and most problems are not subtle at all. So yes, 32-bit kernel / rootfs on recent machine should be good for testing. Best regards, Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-04 22:36 ` Linus Torvalds 2017-12-04 22:38 ` Thomas Gleixner @ 2017-12-06 12:15 ` Michal Hocko 2017-12-06 12:23 ` Thomas Gleixner ` (2 more replies) 1 sibling, 3 replies; 83+ messages in thread From: Michal Hocko @ 2017-12-06 12:15 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers On Mon 04-12-17 14:36:20, Linus Torvalds wrote: > On Mon, Dec 4, 2017 at 2:25 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > So far, resume from suspend-to-RAM (ACPI S3) is broken on all of the > > systems I have tested, so it is probably safe to assume it to be > > broken everywhere. > > Oh, it's definitely not broken everywhere, because I use it myself, > and was traveling last week due to my mom's bday. > > HOWEVER. > > Some of the x86 work seems to have broken it for some configurations. > In particular, do you have a big "everything enabled" kernel config - > particularly lockdep and irqflags tracing enabled? > > Andy has a patch, but it hasn't made it to me yet (probably because > the x86 people are very busy with the kaiser work): > > https://lkml.org/lkml/2017/11/30/546 > > (also note his follow-up "fix the commit message" note, but that one > doesn't actually affect the code itself). merging tip/x86/urgent on top of your tree fixed this problem for me, but I am seeing something else [ 131.711412] ACPI: Preparing to enter system sleep state S3 [ 131.755328] ACPI: EC: event blocked [ 131.755328] ACPI: EC: EC stopped [ 131.755328] PM: Saving platform NVS memory [ 131.755344] Disabling non-boot CPUs ... [ 131.779330] IRQ 124: no longer affine to CPU1 [ 131.780334] smpboot: CPU 1 is now offline [ 131.804465] smpboot: CPU 2 is now offline [ 131.827291] IRQ 122: no longer affine to CPU3 [ 131.827292] IRQ 123: no longer affine to CPU3 [ 131.828293] smpboot: CPU 3 is now offline [ 131.830991] ACPI: Low-level resume complete [ 131.831092] ACPI: EC: EC started [ 131.831093] PM: Restoring platform NVS memory [ 131.831864] do_IRQ: 0.55 No irq handler for vector [ 131.831884] Enabling non-boot CPUs ... [ 131.831909] x86: Booting SMP configuration: [ 131.831910] smpboot: Booting Node 0 Processor 1 APIC 0x2 [ 131.832913] cache: parent cpu1 should not be sleeping [ 131.833058] CPU1 is up [ 131.833067] smpboot: Booting Node 0 Processor 2 APIC 0x1 [ 131.833864] cache: parent cpu2 should not be sleeping [ 131.833983] CPU2 is up [ 131.833995] smpboot: Booting Node 0 Processor 3 APIC 0x3 [ 131.834776] cache: parent cpu3 should not be sleeping [ 131.834923] CPU3 is up "No irq handler" part looks a bit scary (maybe related to lost affinity messages?) but the following messages look quite as well. Is this something known? The system seems to be up and running without any visible issues. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-06 12:15 ` Michal Hocko @ 2017-12-06 12:23 ` Thomas Gleixner 2017-12-06 14:04 ` Rafael J. Wysocki 2017-12-06 12:31 ` Maarten Lankhorst 2017-12-07 7:55 ` Michal Hocko 2 siblings, 1 reply; 83+ messages in thread From: Thomas Gleixner @ 2017-12-06 12:23 UTC (permalink / raw) To: Michal Hocko Cc: Linus Torvalds, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers On Wed, 6 Dec 2017, Michal Hocko wrote: > merging tip/x86/urgent on top of your tree fixed this problem for me, > but I am seeing something else > [ 131.711412] ACPI: Preparing to enter system sleep state S3 > [ 131.755328] ACPI: EC: event blocked > [ 131.755328] ACPI: EC: EC stopped > [ 131.755328] PM: Saving platform NVS memory > [ 131.755344] Disabling non-boot CPUs ... > [ 131.779330] IRQ 124: no longer affine to CPU1 > [ 131.780334] smpboot: CPU 1 is now offline > [ 131.804465] smpboot: CPU 2 is now offline > [ 131.827291] IRQ 122: no longer affine to CPU3 > [ 131.827292] IRQ 123: no longer affine to CPU3 > [ 131.828293] smpboot: CPU 3 is now offline > [ 131.830991] ACPI: Low-level resume complete > [ 131.831092] ACPI: EC: EC started > [ 131.831093] PM: Restoring platform NVS memory > [ 131.831864] do_IRQ: 0.55 No irq handler for vector Hmm, that's really odd. > [ 131.831884] Enabling non-boot CPUs ... > [ 131.831909] x86: Booting SMP configuration: > [ 131.831910] smpboot: Booting Node 0 Processor 1 APIC 0x2 > [ 131.832913] cache: parent cpu1 should not be sleeping This is an old one. > [ 131.833058] CPU1 is up > [ 131.833067] smpboot: Booting Node 0 Processor 2 APIC 0x1 > [ 131.833864] cache: parent cpu2 should not be sleeping > [ 131.833983] CPU2 is up > [ 131.833995] smpboot: Booting Node 0 Processor 3 APIC 0x3 > [ 131.834776] cache: parent cpu3 should not be sleeping > [ 131.834923] CPU3 is up > > "No irq handler" part looks a bit scary (maybe related to lost affinity > messages?) but the following messages look quite as well. Is this > something known? The system seems to be up and running without any > visible issues. I assume it's due to the affinity break, just that we don't know right now on which CPU that do_IRQ() message triggered. I assume it's CPU0 because the others are offline already, but .... I'll think about it how we can figure out what's going on. Thanks, tglx ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-06 12:23 ` Thomas Gleixner @ 2017-12-06 14:04 ` Rafael J. Wysocki 0 siblings, 0 replies; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-06 14:04 UTC (permalink / raw) To: Thomas Gleixner Cc: Michal Hocko, Linus Torvalds, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers On Wednesday, December 6, 2017 1:23:34 PM CET Thomas Gleixner wrote: > On Wed, 6 Dec 2017, Michal Hocko wrote: > > merging tip/x86/urgent on top of your tree fixed this problem for me, > > but I am seeing something else > > [ 131.711412] ACPI: Preparing to enter system sleep state S3 > > [ 131.755328] ACPI: EC: event blocked > > [ 131.755328] ACPI: EC: EC stopped > > [ 131.755328] PM: Saving platform NVS memory > > [ 131.755344] Disabling non-boot CPUs ... > > [ 131.779330] IRQ 124: no longer affine to CPU1 > > [ 131.780334] smpboot: CPU 1 is now offline > > [ 131.804465] smpboot: CPU 2 is now offline > > [ 131.827291] IRQ 122: no longer affine to CPU3 > > [ 131.827292] IRQ 123: no longer affine to CPU3 > > [ 131.828293] smpboot: CPU 3 is now offline > > [ 131.830991] ACPI: Low-level resume complete > > [ 131.831092] ACPI: EC: EC started > > [ 131.831093] PM: Restoring platform NVS memory > > [ 131.831864] do_IRQ: 0.55 No irq handler for vector > > Hmm, that's really odd. > > > [ 131.831884] Enabling non-boot CPUs ... > > [ 131.831909] x86: Booting SMP configuration: > > [ 131.831910] smpboot: Booting Node 0 Processor 1 APIC 0x2 > > [ 131.832913] cache: parent cpu1 should not be sleeping > > This is an old one. > > > [ 131.833058] CPU1 is up > > [ 131.833067] smpboot: Booting Node 0 Processor 2 APIC 0x1 > > [ 131.833864] cache: parent cpu2 should not be sleeping > > [ 131.833983] CPU2 is up > > [ 131.833995] smpboot: Booting Node 0 Processor 3 APIC 0x3 > > [ 131.834776] cache: parent cpu3 should not be sleeping > > [ 131.834923] CPU3 is up > > > > "No irq handler" part looks a bit scary (maybe related to lost affinity > > messages?) but the following messages look quite as well. Is this > > something known? The system seems to be up and running without any > > visible issues. > > I assume it's due to the affinity break, just that we don't know right now > on which CPU that do_IRQ() message triggered. I assume it's CPU0 because > the others are offline already, but .... This is resume from S3, so the firmware might do something odd to the other CPUs, but in case it didn't (which is quite likely or we would have seen more of these messages), they are offline and in mwait_play_dead(), so IMO it is safe to assume that this was CPU0. And this appears to have happened at the atch_suspend_enable_irqs() time, which is just local_irq_enable() on x86 running on CPU0. > I'll think about it how we can figure out what's going on. It looks like an interrupt that have triggered right after we've enabled interrupts on the boot CPU. Thanks, Rafael ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-06 12:15 ` Michal Hocko 2017-12-06 12:23 ` Thomas Gleixner @ 2017-12-06 12:31 ` Maarten Lankhorst 2017-12-06 12:46 ` Thomas Gleixner 2017-12-07 7:55 ` Michal Hocko 2 siblings, 1 reply; 83+ messages in thread From: Maarten Lankhorst @ 2017-12-06 12:31 UTC (permalink / raw) To: Michal Hocko, Linus Torvalds Cc: Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers Op 06-12-17 om 13:15 schreef Michal Hocko: > On Mon 04-12-17 14:36:20, Linus Torvalds wrote: >> On Mon, Dec 4, 2017 at 2:25 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: >>> So far, resume from suspend-to-RAM (ACPI S3) is broken on all of the >>> systems I have tested, so it is probably safe to assume it to be >>> broken everywhere. >> Oh, it's definitely not broken everywhere, because I use it myself, >> and was traveling last week due to my mom's bday. >> >> HOWEVER. >> >> Some of the x86 work seems to have broken it for some configurations. >> In particular, do you have a big "everything enabled" kernel config - >> particularly lockdep and irqflags tracing enabled? >> >> Andy has a patch, but it hasn't made it to me yet (probably because >> the x86 people are very busy with the kaiser work): >> >> https://lkml.org/lkml/2017/11/30/546 >> >> (also note his follow-up "fix the commit message" note, but that one >> doesn't actually affect the code itself). > merging tip/x86/urgent on top of your tree fixed this problem for me, > but I am seeing something else > [ 131.711412] ACPI: Preparing to enter system sleep state S3 > [ 131.755328] ACPI: EC: event blocked > [ 131.755328] ACPI: EC: EC stopped > [ 131.755328] PM: Saving platform NVS memory > [ 131.755344] Disabling non-boot CPUs ... > [ 131.779330] IRQ 124: no longer affine to CPU1 > [ 131.780334] smpboot: CPU 1 is now offline > [ 131.804465] smpboot: CPU 2 is now offline > [ 131.827291] IRQ 122: no longer affine to CPU3 > [ 131.827292] IRQ 123: no longer affine to CPU3 > [ 131.828293] smpboot: CPU 3 is now offline > [ 131.830991] ACPI: Low-level resume complete > [ 131.831092] ACPI: EC: EC started > [ 131.831093] PM: Restoring platform NVS memory > [ 131.831864] do_IRQ: 0.55 No irq handler for vector > [ 131.831884] Enabling non-boot CPUs ... > [ 131.831909] x86: Booting SMP configuration: > [ 131.831910] smpboot: Booting Node 0 Processor 1 APIC 0x2 > [ 131.832913] cache: parent cpu1 should not be sleeping > [ 131.833058] CPU1 is up > [ 131.833067] smpboot: Booting Node 0 Processor 2 APIC 0x1 > [ 131.833864] cache: parent cpu2 should not be sleeping > [ 131.833983] CPU2 is up > [ 131.833995] smpboot: Booting Node 0 Processor 3 APIC 0x3 > [ 131.834776] cache: parent cpu3 should not be sleeping > [ 131.834923] CPU3 is up > > "No irq handler" part looks a bit scary (maybe related to lost affinity > messages?) but the following messages look quite as well. Is this > something known? The system seems to be up and running without any > visible issues. Another reproducer for https://bugzilla.kernel.org/show_bug.cgi?id=198033 ? Symptoms are similar.. ~Maarten ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-06 12:31 ` Maarten Lankhorst @ 2017-12-06 12:46 ` Thomas Gleixner 2017-12-06 13:09 ` Maarten Lankhorst 0 siblings, 1 reply; 83+ messages in thread From: Thomas Gleixner @ 2017-12-06 12:46 UTC (permalink / raw) To: Maarten Lankhorst Cc: Michal Hocko, Linus Torvalds, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers On Wed, 6 Dec 2017, Maarten Lankhorst wrote: > Op 06-12-17 om 13:15 schreef Michal Hocko: > > > > "No irq handler" part looks a bit scary (maybe related to lost affinity > > messages?) but the following messages look quite as well. Is this > > something known? The system seems to be up and running without any > > visible issues. > > Another reproducer for https://bugzilla.kernel.org/show_bug.cgi?id=198033 ? > Symptoms are similar.. Well, the spurious interrupt is one thing, but you obviously lose interrupts for some reason. Did you ever manage to get the data out which I asked for? Thanks, tglx ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-06 12:46 ` Thomas Gleixner @ 2017-12-06 13:09 ` Maarten Lankhorst 2017-12-06 14:15 ` Thomas Gleixner 0 siblings, 1 reply; 83+ messages in thread From: Maarten Lankhorst @ 2017-12-06 13:09 UTC (permalink / raw) To: Thomas Gleixner Cc: Michal Hocko, Linus Torvalds, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers Op 06-12-17 om 13:46 schreef Thomas Gleixner: > On Wed, 6 Dec 2017, Maarten Lankhorst wrote: >> Op 06-12-17 om 13:15 schreef Michal Hocko: >>> "No irq handler" part looks a bit scary (maybe related to lost affinity >>> messages?) but the following messages look quite as well. Is this >>> something known? The system seems to be up and running without any >>> visible issues. >> Another reproducer for https://bugzilla.kernel.org/show_bug.cgi?id=198033 ? >> Symptoms are similar.. > Well, the spurious interrupt is one thing, but you obviously lose > interrupts for some reason. > > Did you ever manage to get the data out which I asked for? > > Thanks, > > tglx > Yes, sent this out about an hour ago https://lkml.org/lkml/2017/12/6/215 Cheers, Maarten ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-06 13:09 ` Maarten Lankhorst @ 2017-12-06 14:15 ` Thomas Gleixner 2017-12-07 13:33 ` Maarten Lankhorst 0 siblings, 1 reply; 83+ messages in thread From: Thomas Gleixner @ 2017-12-06 14:15 UTC (permalink / raw) To: Maarten Lankhorst Cc: Michal Hocko, Linus Torvalds, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers On Wed, 6 Dec 2017, Maarten Lankhorst wrote: > Op 06-12-17 om 13:46 schreef Thomas Gleixner: > > On Wed, 6 Dec 2017, Maarten Lankhorst wrote: > >> Op 06-12-17 om 13:15 schreef Michal Hocko: > >>> "No irq handler" part looks a bit scary (maybe related to lost affinity > >>> messages?) but the following messages look quite as well. Is this > >>> something known? The system seems to be up and running without any > >>> visible issues. > >> Another reproducer for https://bugzilla.kernel.org/show_bug.cgi?id=198033 ? > >> Symptoms are similar.. > > Well, the spurious interrupt is one thing, but you obviously lose > > interrupts for some reason. > > > > Did you ever manage to get the data out which I asked for? > > > > Thanks, > > > > tglx > > > Yes, sent this out about an hour ago > > https://lkml.org/lkml/2017/12/6/215 Weird. Did not reach me ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-06 14:15 ` Thomas Gleixner @ 2017-12-07 13:33 ` Maarten Lankhorst 2017-12-08 10:30 ` Thomas Gleixner 0 siblings, 1 reply; 83+ messages in thread From: Maarten Lankhorst @ 2017-12-07 13:33 UTC (permalink / raw) To: Thomas Gleixner Cc: Michal Hocko, Linus Torvalds, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers Op 06-12-17 om 15:15 schreef Thomas Gleixner: > On Wed, 6 Dec 2017, Maarten Lankhorst wrote: >> Op 06-12-17 om 13:46 schreef Thomas Gleixner: >>> On Wed, 6 Dec 2017, Maarten Lankhorst wrote: >>>> Op 06-12-17 om 13:15 schreef Michal Hocko: >>>>> "No irq handler" part looks a bit scary (maybe related to lost affinity >>>>> messages?) but the following messages look quite as well. Is this >>>>> something known? The system seems to be up and running without any >>>>> visible issues. >>>> Another reproducer for https://bugzilla.kernel.org/show_bug.cgi?id=198033 ? >>>> Symptoms are similar.. >>> Well, the spurious interrupt is one thing, but you obviously lose >>> interrupts for some reason. >>> >>> Did you ever manage to get the data out which I asked for? >>> >>> Thanks, >>> >>> tglx >>> >> Yes, sent this out about an hour ago >> >> https://lkml.org/lkml/2017/12/6/215 > Weird. Did not reach me > But do you have any idea? ~Maarten ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-07 13:33 ` Maarten Lankhorst @ 2017-12-08 10:30 ` Thomas Gleixner 2017-12-13 15:57 ` Thomas Gleixner 0 siblings, 1 reply; 83+ messages in thread From: Thomas Gleixner @ 2017-12-08 10:30 UTC (permalink / raw) To: Maarten Lankhorst Cc: Michal Hocko, Linus Torvalds, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers On Thu, 7 Dec 2017, Maarten Lankhorst wrote: > Op 06-12-17 om 15:15 schreef Thomas Gleixner: > > On Wed, 6 Dec 2017, Maarten Lankhorst wrote: > >> Op 06-12-17 om 13:46 schreef Thomas Gleixner: > >>> On Wed, 6 Dec 2017, Maarten Lankhorst wrote: > >>>> Op 06-12-17 om 13:15 schreef Michal Hocko: > >>>>> "No irq handler" part looks a bit scary (maybe related to lost affinity > >>>>> messages?) but the following messages look quite as well. Is this > >>>>> something known? The system seems to be up and running without any > >>>>> visible issues. > >>>> Another reproducer for https://bugzilla.kernel.org/show_bug.cgi?id=198033 ? > >>>> Symptoms are similar.. > >>> Well, the spurious interrupt is one thing, but you obviously lose > >>> interrupts for some reason. > >>> > >>> Did you ever manage to get the data out which I asked for? > >>> > >>> Thanks, > >>> > >>> tglx > >>> > >> Yes, sent this out about an hour ago > >> > >> https://lkml.org/lkml/2017/12/6/215 > > Weird. Did not reach me > > > But do you have any idea? Can you please provide the full trace, dmesg and the full output of .../debug/irq/... ? Thanks, tglx ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-08 10:30 ` Thomas Gleixner @ 2017-12-13 15:57 ` Thomas Gleixner 2017-12-13 16:23 ` Bjorn Helgaas 0 siblings, 1 reply; 83+ messages in thread From: Thomas Gleixner @ 2017-12-13 15:57 UTC (permalink / raw) To: Maarten Lankhorst Cc: Michal Hocko, Linus Torvalds, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki So I was finally able to figure out what the hell is going on: Suspend: - The device suspend code puts the graphics card into a power state != PCI_D0. - Offline non boot CPUs - Break interrupt affinity. Allocate new vector on CPU 0, compose and write MSI message which ends up in: __pci_write_msi_msg(entry, msg) { if (dev->current_state != PCI_D0 || pci_dev_is_disconnected(dev)) { /* Don't touch the hardware now */ } else { .... } entry->msg = *msg; } So because the device is not in PCI_D0 the message is not written. It's written in the device resume path. Resume: [ 139.670446] ACPI: Low-level resume complete [ 139.670541] PM: Restoring platform NVS memory [ 139.672462] do_IRQ: 0.55 No irq handler for vector [ 139.672475] Enabling non-boot CPUs ... So the spurious interrupt happens early and way before the device resume code writes the new MSI message. I checked the behaviour on 4.14. The MSI write is delayed there in the same way, but there is no spurious interrupt. There is no interrupt coming in at all _BEFORE_ the device is put out of PCI_D0. And this has certainly nothing to do with the vector management changes, but I can't figure yet what makes that spurious interrupt to be sent. Any ideas welcome. Thanks, tglx ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-13 15:57 ` Thomas Gleixner @ 2017-12-13 16:23 ` Bjorn Helgaas 2017-12-13 16:41 ` Thomas Gleixner 0 siblings, 1 reply; 83+ messages in thread From: Bjorn Helgaas @ 2017-12-13 16:23 UTC (permalink / raw) To: Thomas Gleixner Cc: Maarten Lankhorst, Michal Hocko, Linus Torvalds, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm [+cc linux-pci, linux-pm] On Wed, Dec 13, 2017 at 04:57:56PM +0100, Thomas Gleixner wrote: > So I was finally able to figure out what the hell is going on: > > Suspend: > > - The device suspend code puts the graphics card into a power > state != PCI_D0. > > - Offline non boot CPUs > > - Break interrupt affinity. Allocate new vector on CPU 0, compose and > write MSI message which ends up in: > > __pci_write_msi_msg(entry, msg) > { > if (dev->current_state != PCI_D0 || pci_dev_is_disconnected(dev)) { > /* Don't touch the hardware now */ > } else { > .... > } > entry->msg = *msg; > } > > So because the device is not in PCI_D0 the message is not written. It's > written in the device resume path. I'm not a PM guru, but this ordering seems fragile. If we offline CPUs before re-targeting interrupts directed at those CPUs, aren't we always going to be at risk of sending interrupts to an offline CPU? Even if the device is now asleep and therefore should not generate an interrupt, it seems like there's a window when the device returns to PCI_D0 where it could generate an interrupt before we have a chance to update the MSI message. > Resume: > [ 139.670446] ACPI: Low-level resume complete > [ 139.670541] PM: Restoring platform NVS memory > [ 139.672462] do_IRQ: 0.55 No irq handler for vector > [ 139.672475] Enabling non-boot CPUs ... > > So the spurious interrupt happens early and way before the device resume > code writes the new MSI message. > > I checked the behaviour on 4.14. The MSI write is delayed there in the same > way, but there is no spurious interrupt. There is no interrupt coming in at > all _BEFORE_ the device is put out of PCI_D0. > > And this has certainly nothing to do with the vector management changes, > but I can't figure yet what makes that spurious interrupt to be sent. > > Any ideas welcome. > > Thanks, > > tglx > ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-13 16:23 ` Bjorn Helgaas @ 2017-12-13 16:41 ` Thomas Gleixner 2017-12-13 17:45 ` Linus Torvalds 0 siblings, 1 reply; 83+ messages in thread From: Thomas Gleixner @ 2017-12-13 16:41 UTC (permalink / raw) To: Bjorn Helgaas Cc: Maarten Lankhorst, Michal Hocko, Linus Torvalds, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Wed, 13 Dec 2017, Bjorn Helgaas wrote: > [+cc linux-pci, linux-pm] > > On Wed, Dec 13, 2017 at 04:57:56PM +0100, Thomas Gleixner wrote: > > So I was finally able to figure out what the hell is going on: > > > > Suspend: > > > > - The device suspend code puts the graphics card into a power > > state != PCI_D0. > > > > - Offline non boot CPUs > > > > - Break interrupt affinity. Allocate new vector on CPU 0, compose and > > write MSI message which ends up in: > > > > __pci_write_msi_msg(entry, msg) > > { > > if (dev->current_state != PCI_D0 || pci_dev_is_disconnected(dev)) { > > /* Don't touch the hardware now */ > > } else { > > .... > > } > > entry->msg = *msg; > > } > > > > So because the device is not in PCI_D0 the message is not written. It's > > written in the device resume path. > > I'm not a PM guru, but this ordering seems fragile. If we offline > CPUs before re-targeting interrupts directed at those CPUs, aren't we > always going to be at risk of sending interrupts to an offline CPU? > > Even if the device is now asleep and therefore should not generate an > interrupt, it seems like there's a window when the device returns to > PCI_D0 where it could generate an interrupt before we have a chance to > update the MSI message. Definitely. That was fragile forever but puzzles me is that I can't figure out what now causes that spurious interrupt to surface out of the blue. Thanks, tglx ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-13 16:41 ` Thomas Gleixner @ 2017-12-13 17:45 ` Linus Torvalds 2017-12-13 18:19 ` Thomas Gleixner 0 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2017-12-13 17:45 UTC (permalink / raw) To: Thomas Gleixner Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > Definitely. That was fragile forever but puzzles me is that I can't figure > out what now causes that spurious interrupt to surface out of the blue. Perhaps just timing? How hard would it be to change the ordering to just redirect irqs first? Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-13 17:45 ` Linus Torvalds @ 2017-12-13 18:19 ` Thomas Gleixner 2017-12-13 20:52 ` Thomas Gleixner 2017-12-13 22:39 ` Rafael J. Wysocki 0 siblings, 2 replies; 83+ messages in thread From: Thomas Gleixner @ 2017-12-13 18:19 UTC (permalink / raw) To: Linus Torvalds Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Wed, 13 Dec 2017, Linus Torvalds wrote: > On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > Definitely. That was fragile forever but puzzles me is that I can't figure > > out what now causes that spurious interrupt to surface out of the blue. > > Perhaps just timing? That's what I'm trying to figure out right now, because that is the only sensible explanation left. The whole machinery of suspend is exactly the same with and without the vector changes. I instrumented all functions involved and the picture is the same. I even do not see any fundamental timing differences where one would say: That's it. What puzzles me even more is that in the range of commits I'm fiddling with there is no other change than the vector management stuff and the point where it breaks makes no sense at all. The point Maarten bisected it to works nicely here, so that might just point to a very subtle timing issue. > How hard would it be to change the ordering to just redirect irqs first? The whole interrupt redirection happens when the non boot CPUs are brought down, which is the very last step before the actual suspend happens. We could probably do that earlier, but that's something Rafael needs to answer ultimately. Thanks, tglx ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-13 18:19 ` Thomas Gleixner @ 2017-12-13 20:52 ` Thomas Gleixner 2017-12-13 21:06 ` Thomas Gleixner 2017-12-13 22:39 ` Rafael J. Wysocki 1 sibling, 1 reply; 83+ messages in thread From: Thomas Gleixner @ 2017-12-13 20:52 UTC (permalink / raw) To: Linus Torvalds Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Wed, 13 Dec 2017, Thomas Gleixner wrote: > On Wed, 13 Dec 2017, Linus Torvalds wrote: > > > On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > Definitely. That was fragile forever but puzzles me is that I can't figure > > > out what now causes that spurious interrupt to surface out of the blue. > > > > Perhaps just timing? > > That's what I'm trying to figure out right now, because that is the only > sensible explanation left. The whole machinery of suspend is exactly the > same with and without the vector changes. I instrumented all functions > involved and the picture is the same. I even do not see any fundamental > timing differences where one would say: That's it. > > What puzzles me even more is that in the range of commits I'm fiddling with > there is no other change than the vector management stuff and the point > where it breaks makes no sense at all. The point Maarten bisected it to > works nicely here, so that might just point to a very subtle timing issue. After doing more debugging on this it turns out that this looks like a legacy interrupt coming in. The vector number is always 55, which is legacy IRQ 7 as seen from the PIC. The corresponding IOAPIC interrupt pin is masked and vector 55 is completely unused. More questions than answers. Still investigating. Thanks, tglx ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-13 20:52 ` Thomas Gleixner @ 2017-12-13 21:06 ` Thomas Gleixner 2017-12-13 22:48 ` Rafael J. Wysocki 2017-12-14 11:54 ` Thomas Gleixner 0 siblings, 2 replies; 83+ messages in thread From: Thomas Gleixner @ 2017-12-13 21:06 UTC (permalink / raw) To: Linus Torvalds Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Wed, 13 Dec 2017, Thomas Gleixner wrote: > On Wed, 13 Dec 2017, Thomas Gleixner wrote: > > On Wed, 13 Dec 2017, Linus Torvalds wrote: > > > > > On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > > > Definitely. That was fragile forever but puzzles me is that I can't figure > > > > out what now causes that spurious interrupt to surface out of the blue. > > > > > > Perhaps just timing? > > > > That's what I'm trying to figure out right now, because that is the only > > sensible explanation left. The whole machinery of suspend is exactly the > > same with and without the vector changes. I instrumented all functions > > involved and the picture is the same. I even do not see any fundamental > > timing differences where one would say: That's it. > > > > What puzzles me even more is that in the range of commits I'm fiddling with > > there is no other change than the vector management stuff and the point > > where it breaks makes no sense at all. The point Maarten bisected it to > > works nicely here, so that might just point to a very subtle timing issue. > > After doing more debugging on this it turns out that this looks like a > legacy interrupt coming in. The vector number is always 55, which is legacy > IRQ 7 as seen from the PIC. The corresponding IOAPIC interrupt pin is > masked and vector 55 is completely unused. > > More questions than answers. Still investigating. And it does not explain Maartens report which gets a spurious vector 33 on CPU4 after the non boot cpus have been brought online again. And that's the vector which was assigned before the affinity was moved by unplugging CPU4. Hrmpf. Even more mystery to solve. Thanks, tglx ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-13 21:06 ` Thomas Gleixner @ 2017-12-13 22:48 ` Rafael J. Wysocki 2017-12-14 11:54 ` Thomas Gleixner 1 sibling, 0 replies; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-13 22:48 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Wednesday, December 13, 2017 10:06:40 PM CET Thomas Gleixner wrote: > On Wed, 13 Dec 2017, Thomas Gleixner wrote: > > On Wed, 13 Dec 2017, Thomas Gleixner wrote: > > > On Wed, 13 Dec 2017, Linus Torvalds wrote: > > > > > > > On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > > > > > Definitely. That was fragile forever but puzzles me is that I can't figure > > > > > out what now causes that spurious interrupt to surface out of the blue. > > > > > > > > Perhaps just timing? > > > > > > That's what I'm trying to figure out right now, because that is the only > > > sensible explanation left. The whole machinery of suspend is exactly the > > > same with and without the vector changes. I instrumented all functions > > > involved and the picture is the same. I even do not see any fundamental > > > timing differences where one would say: That's it. > > > > > > What puzzles me even more is that in the range of commits I'm fiddling with > > > there is no other change than the vector management stuff and the point > > > where it breaks makes no sense at all. The point Maarten bisected it to > > > works nicely here, so that might just point to a very subtle timing issue. > > > > After doing more debugging on this it turns out that this looks like a > > legacy interrupt coming in. The vector number is always 55, which is legacy > > IRQ 7 as seen from the PIC. The corresponding IOAPIC interrupt pin is > > masked and vector 55 is completely unused. > > > > More questions than answers. Still investigating. > > And it does not explain Maartens report which gets a spurious vector 33 on > CPU4 after the non boot cpus have been brought online again. And that's the > vector which was assigned before the affinity was moved by unplugging CPU4. > > Hrmpf. Even more mystery to solve. Any chance to look at /proc/interrupts from a machine where that can be reproduced? I'm also curious if that can be reproduced by doing CPU offline/online without suspending? ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-13 21:06 ` Thomas Gleixner 2017-12-13 22:48 ` Rafael J. Wysocki @ 2017-12-14 11:54 ` Thomas Gleixner 2017-12-14 12:12 ` Rafael J. Wysocki ` (2 more replies) 1 sibling, 3 replies; 83+ messages in thread From: Thomas Gleixner @ 2017-12-14 11:54 UTC (permalink / raw) To: Linus Torvalds Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Wed, 13 Dec 2017, Thomas Gleixner wrote: > On Wed, 13 Dec 2017, Thomas Gleixner wrote: > > On Wed, 13 Dec 2017, Thomas Gleixner wrote: > > > On Wed, 13 Dec 2017, Linus Torvalds wrote: > > > > > > > On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > > > > > Definitely. That was fragile forever but puzzles me is that I can't figure > > > > > out what now causes that spurious interrupt to surface out of the blue. > > > > > > > > Perhaps just timing? > > > > > > That's what I'm trying to figure out right now, because that is the only > > > sensible explanation left. The whole machinery of suspend is exactly the > > > same with and without the vector changes. I instrumented all functions > > > involved and the picture is the same. I even do not see any fundamental > > > timing differences where one would say: That's it. > > > > > > What puzzles me even more is that in the range of commits I'm fiddling with > > > there is no other change than the vector management stuff and the point > > > where it breaks makes no sense at all. The point Maarten bisected it to > > > works nicely here, so that might just point to a very subtle timing issue. > > > > After doing more debugging on this it turns out that this looks like a > > legacy interrupt coming in. The vector number is always 55, which is legacy > > IRQ 7 as seen from the PIC. The corresponding IOAPIC interrupt pin is > > masked and vector 55 is completely unused. > > > > More questions than answers. Still investigating. At least that one could be explained by the changes. In the previous management scheme the IOAPIC interrupts were always allocated even when the interrupt was not in use. The new scheme does not longer do that because people complained about the vector waste (16 vectors on each CPU) and it got rid of all the special casing of IRQ0-15. So the old scheme silently consumed the spurious vector. I added debug code to that effect to 4.14 and on that machine IRQ7 is triggered at the same point post resume and the core code drops it silently because the interrupt is marked masked and no action assigned. So the only difference to today is that the new code complains, while the old one does an extra mask of the already masked IOAPIC pin and silently returns. After quite some investigation I found out that its independent of the graphics thing. That's a genuine issue on that platform which seems to emit random legacy vectors which were never ever used for unknown reasons. I verified that both the IOAPIC and the PIC are masked, so they cannot send crap. Though it turned out that the silly firmware unmasks the PIC and leaves it that way when it returns from suspend. Now there is a race whether the kernel resume path manages to mask the PIC again early enough before something triggers IRQ7 or not. Adding/removing debug code makes the problem come and go. So I really don't worry about that one and rather prefer to have the spurious interrupt printed than silently consumed by chance. Now the graphics issue is a different story. That only happens on hibernation after doing the snapshot. There all non boot cpus are onlined again and after that the devices are 'thawed'. The following reenable of interrupts fails because i915 is not in PCI_D0 state. Suspend: irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10 __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a __pci_write_msi_msg: Not written <- Device not in PCI_D0 .... device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [resume] pci_pm_resume_noirq <-dpm_run_callback pci_pm_resume_noirq <-dpm_run_callback pci_pm_default_resume_early <-pci_pm_resume_noirq pci_pm_default_resume_early <-pci_pm_resume_noirq __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a <-- Set the new affinity device_pm_callback_end: i915 0000:00:02.0, err=0 Hibernate: irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10 __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a __pci_write_msi_msg: Not written <- Device not in PCI_D0 .... device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [thaw] pci_pm_thaw_noirq <-dpm_run_callback __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a __pci_write_msi_msg: Not written <--- Device is not in PCI_D0 device_pm_callback_end: i915 0000:00:02.0, err=0 So that code path fails to set the new affinity because at the point where the MSI msg should be written the device state is != PCI_D0. Now, what's different vs. 4.14: The 4.14 code accidentaly had the irq descriptor for this vector still populated in the old CPU due to the convoluted way the vector allocation worked. I have still to investigate if one of those cases is actually leaking the descriptor, which would be a fatal bug. But the new code does a proper cleanup and does not repopulate it on the offline CPU. So that unearthes the issue. I'm handing that over to the PM folks to look at. I got lost in that maze of callbacks. Thanks, tglx ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-14 11:54 ` Thomas Gleixner @ 2017-12-14 12:12 ` Rafael J. Wysocki 2017-12-14 12:30 ` Thomas Gleixner 2017-12-14 13:24 ` Linux 4.15-rc2: Regression in resume from ACPI S3 Thomas Gleixner 2017-12-14 19:03 ` Linus Torvalds 2 siblings, 1 reply; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-14 12:12 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Thursday, December 14, 2017 12:54:05 PM CET Thomas Gleixner wrote: > On Wed, 13 Dec 2017, Thomas Gleixner wrote: > > On Wed, 13 Dec 2017, Thomas Gleixner wrote: > > > On Wed, 13 Dec 2017, Thomas Gleixner wrote: > > > > On Wed, 13 Dec 2017, Linus Torvalds wrote: > > > > > > > > > On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > > > > > > > Definitely. That was fragile forever but puzzles me is that I can't figure > > > > > > out what now causes that spurious interrupt to surface out of the blue. > > > > > > > > > > Perhaps just timing? > > > > > > > > That's what I'm trying to figure out right now, because that is the only > > > > sensible explanation left. The whole machinery of suspend is exactly the > > > > same with and without the vector changes. I instrumented all functions > > > > involved and the picture is the same. I even do not see any fundamental > > > > timing differences where one would say: That's it. > > > > > > > > What puzzles me even more is that in the range of commits I'm fiddling with > > > > there is no other change than the vector management stuff and the point > > > > where it breaks makes no sense at all. The point Maarten bisected it to > > > > works nicely here, so that might just point to a very subtle timing issue. > > > > > > After doing more debugging on this it turns out that this looks like a > > > legacy interrupt coming in. The vector number is always 55, which is legacy > > > IRQ 7 as seen from the PIC. The corresponding IOAPIC interrupt pin is > > > masked and vector 55 is completely unused. > > > > > > More questions than answers. Still investigating. > > At least that one could be explained by the changes. In the previous > management scheme the IOAPIC interrupts were always allocated even when the > interrupt was not in use. The new scheme does not longer do that because > people complained about the vector waste (16 vectors on each CPU) and it > got rid of all the special casing of IRQ0-15. > > So the old scheme silently consumed the spurious vector. I added debug code > to that effect to 4.14 and on that machine IRQ7 is triggered at the same > point post resume and the core code drops it silently because the interrupt > is marked masked and no action assigned. > > So the only difference to today is that the new code complains, while the > old one does an extra mask of the already masked IOAPIC pin and silently > returns. > > After quite some investigation I found out that its independent of the > graphics thing. That's a genuine issue on that platform which seems to emit > random legacy vectors which were never ever used for unknown reasons. I > verified that both the IOAPIC and the PIC are masked, so they cannot send > crap. Though it turned out that the silly firmware unmasks the PIC and > leaves it that way when it returns from suspend. Now there is a race > whether the kernel resume path manages to mask the PIC again early enough > before something triggers IRQ7 or not. Adding/removing debug code makes the > problem come and go. So I really don't worry about that one and rather > prefer to have the spurious interrupt printed than silently consumed by > chance. OK > Now the graphics issue is a different story. That only happens on > hibernation after doing the snapshot. There all non boot cpus are onlined > again and after that the devices are 'thawed'. The following reenable of > interrupts fails because i915 is not in PCI_D0 state. > > Suspend: > > irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10 > __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a > __pci_write_msi_msg: Not written <- Device not in PCI_D0 > .... > device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [resume] > pci_pm_resume_noirq <-dpm_run_callback > pci_pm_resume_noirq <-dpm_run_callback > pci_pm_default_resume_early <-pci_pm_resume_noirq > pci_pm_default_resume_early <-pci_pm_resume_noirq > __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a <-- Set the new affinity > device_pm_callback_end: i915 0000:00:02.0, err=0 So this works, because we power up the device during resume even if it had been suspended (via runtime PM) before the suspend started. > Hibernate: > > irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10 > __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a > __pci_write_msi_msg: Not written <- Device not in PCI_D0 > .... > device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [thaw] > pci_pm_thaw_noirq <-dpm_run_callback > __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a > __pci_write_msi_msg: Not written <--- Device is not in PCI_D0 > device_pm_callback_end: i915 0000:00:02.0, err=0 And here we try to leave the device alone which is OK for devices in D0, but not for suspended ones. It looks like we need to power up them at the "thaw" time too or at least I don't see how to address that differently. > So that code path fails to set the new affinity because at the point where > the MSI msg should be written the device state is != PCI_D0. > > Now, what's different vs. 4.14: > > The 4.14 code accidentaly had the irq descriptor for this vector still > populated in the old CPU due to the convoluted way the vector allocation > worked. I have still to investigate if one of those cases is actually > leaking the descriptor, which would be a fatal bug. > > But the new code does a proper cleanup and does not repopulate it on the > offline CPU. So that unearthes the issue. I'm handing that over to the PM > folks to look at. I got lost in that maze of callbacks. OK, thanks so much for getting to the bottom of this! Rafael ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-14 12:12 ` Rafael J. Wysocki @ 2017-12-14 12:30 ` Thomas Gleixner 2017-12-14 15:30 ` Rafael J. Wysocki 0 siblings, 1 reply; 83+ messages in thread From: Thomas Gleixner @ 2017-12-14 12:30 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Thu, 14 Dec 2017, Rafael J. Wysocki wrote: > On Thursday, December 14, 2017 12:54:05 PM CET Thomas Gleixner wrote: > > Now the graphics issue is a different story. That only happens on > > hibernation after doing the snapshot. There all non boot cpus are onlined > > again and after that the devices are 'thawed'. The following reenable of > > interrupts fails because i915 is not in PCI_D0 state. > > > > Suspend: > > > > irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10 > > __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a > > __pci_write_msi_msg: Not written <- Device not in PCI_D0 > > .... > > device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [resume] > > pci_pm_resume_noirq <-dpm_run_callback > > pci_pm_resume_noirq <-dpm_run_callback > > pci_pm_default_resume_early <-pci_pm_resume_noirq > > pci_pm_default_resume_early <-pci_pm_resume_noirq > > __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a <-- Set the new affinity > > device_pm_callback_end: i915 0000:00:02.0, err=0 > > So this works, because we power up the device during resume even if it > had been suspended (via runtime PM) before the suspend started. > > > Hibernate: > > > > irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10 > > __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a > > __pci_write_msi_msg: Not written <- Device not in PCI_D0 > > .... > > device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [thaw] > > pci_pm_thaw_noirq <-dpm_run_callback > > __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a > > __pci_write_msi_msg: Not written <--- Device is not in PCI_D0 > > device_pm_callback_end: i915 0000:00:02.0, err=0 > > And here we try to leave the device alone which is OK for devices in D0, > but not for suspended ones. > > It looks like we need to power up them at the "thaw" time too or at least > I don't see how to address that differently. The question is whether the code which brings the device out of D0 should write the message unconditionally. That would be sufficient I think. Thanks, tglx ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-14 12:30 ` Thomas Gleixner @ 2017-12-14 15:30 ` Rafael J. Wysocki 2017-12-14 15:52 ` Thomas Gleixner 0 siblings, 1 reply; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-14 15:30 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Thursday, December 14, 2017 1:30:37 PM CET Thomas Gleixner wrote: > On Thu, 14 Dec 2017, Rafael J. Wysocki wrote: > > On Thursday, December 14, 2017 12:54:05 PM CET Thomas Gleixner wrote: > > > Now the graphics issue is a different story. That only happens on > > > hibernation after doing the snapshot. There all non boot cpus are onlined > > > again and after that the devices are 'thawed'. The following reenable of > > > interrupts fails because i915 is not in PCI_D0 state. > > > > > > Suspend: > > > > > > irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10 > > > __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a > > > __pci_write_msi_msg: Not written <- Device not in PCI_D0 > > > .... > > > device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [resume] > > > pci_pm_resume_noirq <-dpm_run_callback > > > pci_pm_resume_noirq <-dpm_run_callback > > > pci_pm_default_resume_early <-pci_pm_resume_noirq > > > pci_pm_default_resume_early <-pci_pm_resume_noirq > > > __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a <-- Set the new affinity > > > device_pm_callback_end: i915 0000:00:02.0, err=0 > > > > So this works, because we power up the device during resume even if it > > had been suspended (via runtime PM) before the suspend started. > > > > > Hibernate: > > > > > > irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10 > > > __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a > > > __pci_write_msi_msg: Not written <- Device not in PCI_D0 > > > .... > > > device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [thaw] > > > pci_pm_thaw_noirq <-dpm_run_callback > > > __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a > > > __pci_write_msi_msg: Not written <--- Device is not in PCI_D0 > > > device_pm_callback_end: i915 0000:00:02.0, err=0 > > > > And here we try to leave the device alone which is OK for devices in D0, > > but not for suspended ones. > > > > It looks like we need to power up them at the "thaw" time too or at least > > I don't see how to address that differently. > > The question is whether the code which brings the device out of D0 should > write the message unconditionally. That would be sufficient I think. It doesn't have to do that. The problem here is that pci_pm_thaw_noirq() calls pci_restore_state() which in fact requires the device to be in D0, so the caller should put it into D0 instead of trying to "update" its power state. [Note that the PCI layer doesn't put devices into low-power states during the hibernation's "freeze" transition, but drivers can legitimately do that in their "freeze" callbacks which was overlooked in that code and that's what i915 does.] So IMO what we need is the change below. I'm going to test it shortly, but please give it a go too. --- drivers/pci/pci-driver.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) Index: linux-pm/drivers/pci/pci-driver.c =================================================================== --- linux-pm.orig/drivers/pci/pci-driver.c +++ linux-pm/drivers/pci/pci-driver.c @@ -1027,7 +1027,12 @@ static int pci_pm_thaw_noirq(struct devi if (pci_has_legacy_pm_support(pci_dev)) return pci_legacy_resume_early(dev); - pci_update_current_state(pci_dev, PCI_D0); + /* + * pci_restore_state() requires the device to be in D0 (because of MSI + * restoration among other things), so force it into D0 in case the + * driver's "freeze" callbacks put it into a low-power state directly. + */ + pci_set_power_state(pci_dev, PCI_D0); pci_restore_state(pci_dev); if (drv && drv->pm && drv->pm->thaw_noirq) ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-14 15:30 ` Rafael J. Wysocki @ 2017-12-14 15:52 ` Thomas Gleixner 2017-12-14 15:54 ` Rafael J. Wysocki 0 siblings, 1 reply; 83+ messages in thread From: Thomas Gleixner @ 2017-12-14 15:52 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Thu, 14 Dec 2017, Rafael J. Wysocki wrote: > The problem here is that pci_pm_thaw_noirq() calls pci_restore_state() which > in fact requires the device to be in D0, so the caller should put it into > D0 instead of trying to "update" its power state. > > [Note that the PCI layer doesn't put devices into low-power states during the > hibernation's "freeze" transition, but drivers can legitimately do that in > their "freeze" callbacks which was overlooked in that code and that's what > i915 does.] > > So IMO what we need is the change below. I'm going to test it shortly, > but please give it a go too. So now this looks more reasonable: irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10 __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a __pci_write_msi_msg: Not written ... device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [thaw] pci_pm_thaw_noirq <-dpm_run_callback __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a device_pm_callback_end: i915 0000:00:02.0, err=0 ... resume_irqs: Resume 125 ... irq_handler_entry: irq=125 name=i915 Thanks, tglx > --- > drivers/pci/pci-driver.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > Index: linux-pm/drivers/pci/pci-driver.c > =================================================================== > --- linux-pm.orig/drivers/pci/pci-driver.c > +++ linux-pm/drivers/pci/pci-driver.c > @@ -1027,7 +1027,12 @@ static int pci_pm_thaw_noirq(struct devi > if (pci_has_legacy_pm_support(pci_dev)) > return pci_legacy_resume_early(dev); > > - pci_update_current_state(pci_dev, PCI_D0); > + /* > + * pci_restore_state() requires the device to be in D0 (because of MSI > + * restoration among other things), so force it into D0 in case the > + * driver's "freeze" callbacks put it into a low-power state directly. > + */ > + pci_set_power_state(pci_dev, PCI_D0); > pci_restore_state(pci_dev); > > if (drv && drv->pm && drv->pm->thaw_noirq) > > ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-14 15:52 ` Thomas Gleixner @ 2017-12-14 15:54 ` Rafael J. Wysocki 2017-12-14 16:17 ` Maarten Lankhorst 2017-12-15 2:07 ` [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq() Rafael J. Wysocki 0 siblings, 2 replies; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-14 15:54 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Thursday, December 14, 2017 4:52:22 PM CET Thomas Gleixner wrote: > On Thu, 14 Dec 2017, Rafael J. Wysocki wrote: > > The problem here is that pci_pm_thaw_noirq() calls pci_restore_state() which > > in fact requires the device to be in D0, so the caller should put it into > > D0 instead of trying to "update" its power state. > > > > [Note that the PCI layer doesn't put devices into low-power states during the > > hibernation's "freeze" transition, but drivers can legitimately do that in > > their "freeze" callbacks which was overlooked in that code and that's what > > i915 does.] > > > > So IMO what we need is the change below. I'm going to test it shortly, > > but please give it a go too. > > So now this looks more reasonable: > > irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10 > __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a > __pci_write_msi_msg: Not written > ... > device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [thaw] > pci_pm_thaw_noirq <-dpm_run_callback > __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a > device_pm_callback_end: i915 0000:00:02.0, err=0 > ... > resume_irqs: Resume 125 > ... > irq_handler_entry: irq=125 name=i915 Cool. Let me respin it with a changelog etc then. Thanks, Rafael ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-14 15:54 ` Rafael J. Wysocki @ 2017-12-14 16:17 ` Maarten Lankhorst 2017-12-15 2:07 ` [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq() Rafael J. Wysocki 1 sibling, 0 replies; 83+ messages in thread From: Maarten Lankhorst @ 2017-12-14 16:17 UTC (permalink / raw) To: Rafael J. Wysocki, Thomas Gleixner Cc: Linus Torvalds, Bjorn Helgaas, Michal Hocko, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm Op 14-12-17 om 16:54 schreef Rafael J. Wysocki: > On Thursday, December 14, 2017 4:52:22 PM CET Thomas Gleixner wrote: >> On Thu, 14 Dec 2017, Rafael J. Wysocki wrote: >>> The problem here is that pci_pm_thaw_noirq() calls pci_restore_state() which >>> in fact requires the device to be in D0, so the caller should put it into >>> D0 instead of trying to "update" its power state. >>> >>> [Note that the PCI layer doesn't put devices into low-power states during the >>> hibernation's "freeze" transition, but drivers can legitimately do that in >>> their "freeze" callbacks which was overlooked in that code and that's what >>> i915 does.] >>> >>> So IMO what we need is the change below. I'm going to test it shortly, >>> but please give it a go too. >> So now this looks more reasonable: >> >> irq_migrate_all_off_this_cpu: Mask 125 pci_msi_mask_irq+0x0/0x10 >> __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a >> __pci_write_msi_msg: Not written >> ... >> device_pm_callback_start: i915 0000:00:02.0, parent: pci0000:00, noirq bus [thaw] >> pci_pm_thaw_noirq <-dpm_run_callback >> __pci_write_msi_msg: 0000:00:02.0 00000000fee0100c 0000412a >> device_pm_callback_end: i915 0000:00:02.0, err=0 >> ... >> resume_irqs: Resume 125 >> ... >> irq_handler_entry: irq=125 name=i915 > Cool. > > Let me respin it with a changelog etc then. > > Thanks, > Rafael > > The machine I was using for reproducing the bug appears to be fixed with this patch, so I now sent it to intel's trybot for results. https://patchwork.freedesktop.org/series/35367/ Thanks for looking at the bug! ~Maarten ^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq() 2017-12-14 15:54 ` Rafael J. Wysocki 2017-12-14 16:17 ` Maarten Lankhorst @ 2017-12-15 2:07 ` Rafael J. Wysocki 2017-12-15 14:28 ` Rafael J. Wysocki 2017-12-15 18:30 ` Bjorn Helgaas 1 sibling, 2 replies; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-15 2:07 UTC (permalink / raw) To: Thomas Gleixner, linux-pci Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pm, Chen Yu From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> It is incorrect to call pci_restore_state() for devices in low-power states (D1-D3), as that involves the restoration of MSI setup which requires MMIO to be operational and that is only the case in D0. However, pci_pm_thaw_noirq() may do that if the driver's "freeze" callbacks put the device into a low-power state, so fix it by making it force devices into D0 via pci_set_power_state() instead of trying to "update" their power state which is pointless. Fixes: e60514bd4485 (PCI/PM: Restore the status of PCI devices across hibernation) Cc: 4.13+ <stable@vger.kernel.org> # 4.13+ Reported-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Maarten Lankhorst <dev@mblankhorst.nl> Tested-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Maarten Lankhorst <dev@mblankhorst.nl> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- The bug is not as old as I thought, actually. Yes, we did the pci_update_current_state() in pci_pm_thaw_noirq() forever, but it started to be problematic in 4.13, when we started to call pci_restore_state() in addition to it to fix another issue. --- drivers/pci/pci-driver.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) Index: linux-pm/drivers/pci/pci-driver.c =================================================================== --- linux-pm.orig/drivers/pci/pci-driver.c +++ linux-pm/drivers/pci/pci-driver.c @@ -1027,7 +1027,12 @@ static int pci_pm_thaw_noirq(struct devi if (pci_has_legacy_pm_support(pci_dev)) return pci_legacy_resume_early(dev); - pci_update_current_state(pci_dev, PCI_D0); + /* + * pci_restore_state() requires the device to be in D0 (because of MSI + * restoration among other things), so force it into D0 in case the + * driver's "freeze" callbacks put it into a low-power state directly. + */ + pci_set_power_state(pci_dev, PCI_D0); pci_restore_state(pci_dev); if (drv && drv->pm && drv->pm->thaw_noirq) ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq() 2017-12-15 2:07 ` [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq() Rafael J. Wysocki @ 2017-12-15 14:28 ` Rafael J. Wysocki 2017-12-15 18:30 ` Bjorn Helgaas 1 sibling, 0 replies; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-15 14:28 UTC (permalink / raw) To: Bjorn Helgaas Cc: Thomas Gleixner, Linux PCI, Linus Torvalds, Maarten Lankhorst, Michal Hocko, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, Linux PM, Chen Yu On Fri, Dec 15, 2017 at 3:07 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > It is incorrect to call pci_restore_state() for devices in low-power > states (D1-D3), as that involves the restoration of MSI setup which > requires MMIO to be operational and that is only the case in D0. > > However, pci_pm_thaw_noirq() may do that if the driver's "freeze" > callbacks put the device into a low-power state, so fix it by making > it force devices into D0 via pci_set_power_state() instead of trying > to "update" their power state which is pointless. > > Fixes: e60514bd4485 (PCI/PM: Restore the status of PCI devices across hibernation) > Cc: 4.13+ <stable@vger.kernel.org> # 4.13+ > Reported-by: Thomas Gleixner <tglx@linutronix.de> > Reported-by: Maarten Lankhorst <dev@mblankhorst.nl> > Tested-by: Thomas Gleixner <tglx@linutronix.de> > Tested-by: Maarten Lankhorst <dev@mblankhorst.nl> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > > The bug is not as old as I thought, actually. > > Yes, we did the pci_update_current_state() in pci_pm_thaw_noirq() > forever, but it started to be problematic in 4.13, when we started > to call pci_restore_state() in addition to it to fix another issue. Bjorn, any concerns about this one? > --- > drivers/pci/pci-driver.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > Index: linux-pm/drivers/pci/pci-driver.c > =================================================================== > --- linux-pm.orig/drivers/pci/pci-driver.c > +++ linux-pm/drivers/pci/pci-driver.c > @@ -1027,7 +1027,12 @@ static int pci_pm_thaw_noirq(struct devi > if (pci_has_legacy_pm_support(pci_dev)) > return pci_legacy_resume_early(dev); > > - pci_update_current_state(pci_dev, PCI_D0); > + /* > + * pci_restore_state() requires the device to be in D0 (because of MSI > + * restoration among other things), so force it into D0 in case the > + * driver's "freeze" callbacks put it into a low-power state directly. > + */ > + pci_set_power_state(pci_dev, PCI_D0); > pci_restore_state(pci_dev); > > if (drv && drv->pm && drv->pm->thaw_noirq) > ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq() 2017-12-15 2:07 ` [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq() Rafael J. Wysocki 2017-12-15 14:28 ` Rafael J. Wysocki @ 2017-12-15 18:30 ` Bjorn Helgaas 2017-12-15 23:44 ` Rafael J. Wysocki 1 sibling, 1 reply; 83+ messages in thread From: Bjorn Helgaas @ 2017-12-15 18:30 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Thomas Gleixner, linux-pci, Linus Torvalds, Maarten Lankhorst, Michal Hocko, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pm, Chen Yu On Fri, Dec 15, 2017 at 03:07:18AM +0100, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > It is incorrect to call pci_restore_state() for devices in low-power > states (D1-D3), as that involves the restoration of MSI setup which > requires MMIO to be operational and that is only the case in D0. > > However, pci_pm_thaw_noirq() may do that if the driver's "freeze" > callbacks put the device into a low-power state, so fix it by making > it force devices into D0 via pci_set_power_state() instead of trying > to "update" their power state which is pointless. > > Fixes: e60514bd4485 (PCI/PM: Restore the status of PCI devices across hibernation) > Cc: 4.13+ <stable@vger.kernel.org> # 4.13+ > Reported-by: Thomas Gleixner <tglx@linutronix.de> > Reported-by: Maarten Lankhorst <dev@mblankhorst.nl> > Tested-by: Thomas Gleixner <tglx@linutronix.de> > Tested-by: Maarten Lankhorst <dev@mblankhorst.nl> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Let me know if you want me to take this. I don't have anything currently queued up that touches pci-driver.c, so I'm happy if you take it yourself. > --- > > The bug is not as old as I thought, actually. > > Yes, we did the pci_update_current_state() in pci_pm_thaw_noirq() > forever, but it started to be problematic in 4.13, when we started > to call pci_restore_state() in addition to it to fix another issue. > > --- > drivers/pci/pci-driver.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > Index: linux-pm/drivers/pci/pci-driver.c > =================================================================== > --- linux-pm.orig/drivers/pci/pci-driver.c > +++ linux-pm/drivers/pci/pci-driver.c > @@ -1027,7 +1027,12 @@ static int pci_pm_thaw_noirq(struct devi > if (pci_has_legacy_pm_support(pci_dev)) > return pci_legacy_resume_early(dev); > > - pci_update_current_state(pci_dev, PCI_D0); > + /* > + * pci_restore_state() requires the device to be in D0 (because of MSI > + * restoration among other things), so force it into D0 in case the > + * driver's "freeze" callbacks put it into a low-power state directly. > + */ > + pci_set_power_state(pci_dev, PCI_D0); > pci_restore_state(pci_dev); > > if (drv && drv->pm && drv->pm->thaw_noirq) > ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq() 2017-12-15 18:30 ` Bjorn Helgaas @ 2017-12-15 23:44 ` Rafael J. Wysocki 0 siblings, 0 replies; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-15 23:44 UTC (permalink / raw) To: Bjorn Helgaas Cc: Rafael J. Wysocki, Thomas Gleixner, Linux PCI, Linus Torvalds, Maarten Lankhorst, Michal Hocko, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, Linux PM, Chen Yu On Fri, Dec 15, 2017 at 7:30 PM, Bjorn Helgaas <helgaas@kernel.org> wrote: > On Fri, Dec 15, 2017 at 03:07:18AM +0100, Rafael J. Wysocki wrote: >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >> >> It is incorrect to call pci_restore_state() for devices in low-power >> states (D1-D3), as that involves the restoration of MSI setup which >> requires MMIO to be operational and that is only the case in D0. >> >> However, pci_pm_thaw_noirq() may do that if the driver's "freeze" >> callbacks put the device into a low-power state, so fix it by making >> it force devices into D0 via pci_set_power_state() instead of trying >> to "update" their power state which is pointless. >> >> Fixes: e60514bd4485 (PCI/PM: Restore the status of PCI devices across hibernation) >> Cc: 4.13+ <stable@vger.kernel.org> # 4.13+ >> Reported-by: Thomas Gleixner <tglx@linutronix.de> >> Reported-by: Maarten Lankhorst <dev@mblankhorst.nl> >> Tested-by: Thomas Gleixner <tglx@linutronix.de> >> Tested-by: Maarten Lankhorst <dev@mblankhorst.nl> >> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Acked-by: Bjorn Helgaas <bhelgaas@google.com> > > Let me know if you want me to take this. I don't have anything > currently queued up that touches pci-driver.c, so I'm happy if you > take it yourself. I will take it. Depending of what Yu finds, we may need an additional fix to make the Purley system work. >> --- >> >> The bug is not as old as I thought, actually. >> >> Yes, we did the pci_update_current_state() in pci_pm_thaw_noirq() >> forever, but it started to be problematic in 4.13, when we started >> to call pci_restore_state() in addition to it to fix another issue. >> >> --- >> drivers/pci/pci-driver.c | 7 ++++++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> Index: linux-pm/drivers/pci/pci-driver.c >> =================================================================== >> --- linux-pm.orig/drivers/pci/pci-driver.c >> +++ linux-pm/drivers/pci/pci-driver.c >> @@ -1027,7 +1027,12 @@ static int pci_pm_thaw_noirq(struct devi >> if (pci_has_legacy_pm_support(pci_dev)) >> return pci_legacy_resume_early(dev); >> >> - pci_update_current_state(pci_dev, PCI_D0); >> + /* >> + * pci_restore_state() requires the device to be in D0 (because of MSI >> + * restoration among other things), so force it into D0 in case the >> + * driver's "freeze" callbacks put it into a low-power state directly. >> + */ >> + pci_set_power_state(pci_dev, PCI_D0); >> pci_restore_state(pci_dev); >> >> if (drv && drv->pm && drv->pm->thaw_noirq) >> ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-14 11:54 ` Thomas Gleixner 2017-12-14 12:12 ` Rafael J. Wysocki @ 2017-12-14 13:24 ` Thomas Gleixner 2017-12-14 19:03 ` Linus Torvalds 2 siblings, 0 replies; 83+ messages in thread From: Thomas Gleixner @ 2017-12-14 13:24 UTC (permalink / raw) To: Linus Torvalds Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Thu, 14 Dec 2017, Thomas Gleixner wrote: > Now, what's different vs. 4.14: > > The 4.14 code accidentaly had the irq descriptor for this vector still > populated in the old CPU due to the convoluted way the vector allocation > worked. I have still to investigate if one of those cases is actually > leaking the descriptor, which would be a fatal bug. It doesn't leak. It repopulates it at the same place out of sheer luck. Thanks, tglx ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-14 11:54 ` Thomas Gleixner 2017-12-14 12:12 ` Rafael J. Wysocki 2017-12-14 13:24 ` Linux 4.15-rc2: Regression in resume from ACPI S3 Thomas Gleixner @ 2017-12-14 19:03 ` Linus Torvalds 2017-12-14 22:36 ` Thomas Gleixner 2 siblings, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2017-12-14 19:03 UTC (permalink / raw) To: Thomas Gleixner Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Thu, Dec 14, 2017 at 3:54 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > So the old scheme silently consumed the spurious vector. I added debug code > to that effect to 4.14 and on that machine IRQ7 is triggered at the same > point post resume and the core code drops it silently because the interrupt > is marked masked and no action assigned. > > So the only difference to today is that the new code complains, while the > old one does an extra mask of the already masked IOAPIC pin and silently > returns. Great debugging, and it looks like Rafael has a patch that already got positive testing. I just wanted to pipe up about that "irq7", because judging from your email it seems like you think it's a real irq: > Now there is a race > whether the kernel resume path manages to mask the PIC again early enough > before something triggers IRQ7 or not. ..and that's not how the PIC works. In fact, "legacy irq 7" is the _normal_ and very traditional spurious interrupt, and it's documented. If the PIC gets an interrupt from _any_ source, but the interrupt goes away before the PIC gets an acknowledge from the CPU (and by "acknowledge", I'm not talking about the explicit software IRQ ACK, I'm talking about the hardware protocol, between the PIC and the CPU), the PIC will then report irq 7 as the interrupt - regardless of what the original was. The reason is almost always something like - CPU interrupts are disabled or masked - driver does a write to the external hardware that causes an interrupt to be raised - CPU doesn't react to the irq due to the disabled/masked nature - but the driver then does something that masks the interrupt again - interrupts are enabled/unmasked on the CPU - CPU now acks the interrupt, but the PIC no longer sees any interrupt source, so the PIC (that has to reply with *something*) replies with that documented spurious irq7. To confuse things further, irq7 is not _exclusively_ the spurious interrupt, You can definitely put real hardware and connect it to pin7 of the PIC, and get real irq7 reports. And to confuse things even *more*, this "irq7" thing is per-PIC, and the PC model obviously has the whole "nested PIC" thing where the second PIC is connected to irq2 of the first PIC. So there are *two* different "spurious interrupt" reports, one for each PIC. Anyway, to avoid this issue, drivers should strive to (a) actually take the interrupt when doing things that can cause them, and have the interrupt handler do whatever it is that causes the interrupt to go away (ie: "normal operation") (b) if you play games with clearing the source of the interrupt _without_ taking the interrupt, you should strive to basically mask the interrupt first. So to do (b) you can do something like mask_device_interrupt(dev); read_from_device_to_synchronize(dev); instead of (or perhaps _before_) disabling interrupts at a CPU level. Suspend/resume obviously does tend to play games with these kinds of things where you are no longer in "normal operation" and you do setup without having interrupts actually enabled. Or you can just decide that spurious interrupts are ok, and ignore the issue. But they *can* be very confusing, and obviously in this case that confusion then seems to have caused actual problems. Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-14 19:03 ` Linus Torvalds @ 2017-12-14 22:36 ` Thomas Gleixner 2017-12-14 22:47 ` Linus Torvalds 2017-12-15 0:34 ` Rafael J. Wysocki 0 siblings, 2 replies; 83+ messages in thread From: Thomas Gleixner @ 2017-12-14 22:36 UTC (permalink / raw) To: Linus Torvalds Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Thu, 14 Dec 2017, Linus Torvalds wrote: > On Thu, Dec 14, 2017 at 3:54 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > I just wanted to pipe up about that "irq7", because judging from your > email it seems like you think it's a real irq: > > > Now there is a race > > whether the kernel resume path manages to mask the PIC again early enough > > before something triggers IRQ7 or not. > > ..and that's not how the PIC works. > > In fact, "legacy irq 7" is the _normal_ and very traditional spurious > interrupt, and it's documented. If the PIC gets an interrupt from > _any_ source, but the interrupt goes away before the PIC gets an > acknowledge from the CPU (and by "acknowledge", I'm not talking about > the explicit software IRQ ACK, I'm talking about the hardware > protocol, between the PIC and the CPU), the PIC will then report irq 7 > as the interrupt - regardless of what the original was. > > The reason is almost always something like > > - CPU interrupts are disabled or masked > > - driver does a write to the external hardware that causes an > interrupt to be raised Which should be a non issue because _ALL_ PIC irq lines are masked at the PIC itself. All interrupts are routed through IOAPIC. So unless the IOAPIC sports similar behaviour the PIC should not ever observe that scenario. But, because the silly firmware comes out of suspend with all PIC lines unmasked for whatever reason, the PIC can observe that IRQ being raised and the CPU not handling it. So yes, I forgot about 7 being magic, but I still think it's the firmware which causes it by unmasking the PIC irqs. Thanks, tglx ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-14 22:36 ` Thomas Gleixner @ 2017-12-14 22:47 ` Linus Torvalds 2017-12-15 9:05 ` Thomas Gleixner 2017-12-15 0:34 ` Rafael J. Wysocki 1 sibling, 1 reply; 83+ messages in thread From: Linus Torvalds @ 2017-12-14 22:47 UTC (permalink / raw) To: Thomas Gleixner Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Thu, Dec 14, 2017 at 2:36 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > But, because the silly firmware comes out of suspend with all PIC lines > unmasked for whatever reason, the PIC can observe that IRQ being raised and > the CPU not handling it. So yes, I forgot about 7 being magic, but I still > think it's the firmware which causes it by unmasking the PIC irqs. Yes, that sounds quite likely. Linus ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-14 22:47 ` Linus Torvalds @ 2017-12-15 9:05 ` Thomas Gleixner 0 siblings, 0 replies; 83+ messages in thread From: Thomas Gleixner @ 2017-12-15 9:05 UTC (permalink / raw) To: Linus Torvalds Cc: Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Thu, 14 Dec 2017, Linus Torvalds wrote: > On Thu, Dec 14, 2017 at 2:36 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > But, because the silly firmware comes out of suspend with all PIC lines > > unmasked for whatever reason, the PIC can observe that IRQ being raised and > > the CPU not handling it. So yes, I forgot about 7 being magic, but I still > > think it's the firmware which causes it by unmasking the PIC irqs. > > Yes, that sounds quite likely. And just for the record I was able to figure out which interrupt comes in and goes away again. It's the only level triggered interrupt, which is the ACPI interrupt..... Thanks, tglx ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-14 22:36 ` Thomas Gleixner 2017-12-14 22:47 ` Linus Torvalds @ 2017-12-15 0:34 ` Rafael J. Wysocki 1 sibling, 0 replies; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-15 0:34 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, Linux PCI, Linux PM On Thu, Dec 14, 2017 at 11:36 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > On Thu, 14 Dec 2017, Linus Torvalds wrote: >> On Thu, Dec 14, 2017 at 3:54 AM, Thomas Gleixner <tglx@linutronix.de> wrote: >> I just wanted to pipe up about that "irq7", because judging from your >> email it seems like you think it's a real irq: >> >> > Now there is a race >> > whether the kernel resume path manages to mask the PIC again early enough >> > before something triggers IRQ7 or not. >> >> ..and that's not how the PIC works. >> >> In fact, "legacy irq 7" is the _normal_ and very traditional spurious >> interrupt, and it's documented. If the PIC gets an interrupt from >> _any_ source, but the interrupt goes away before the PIC gets an >> acknowledge from the CPU (and by "acknowledge", I'm not talking about >> the explicit software IRQ ACK, I'm talking about the hardware >> protocol, between the PIC and the CPU), the PIC will then report irq 7 >> as the interrupt - regardless of what the original was. >> >> The reason is almost always something like >> >> - CPU interrupts are disabled or masked >> >> - driver does a write to the external hardware that causes an >> interrupt to be raised > > Which should be a non issue because _ALL_ PIC irq lines are masked at the > PIC itself. All interrupts are routed through IOAPIC. So unless the IOAPIC > sports similar behaviour the PIC should not ever observe that scenario. > > But, because the silly firmware comes out of suspend with all PIC lines > unmasked for whatever reason, the PIC can observe that IRQ being raised and > the CPU not handling it. So yes, I forgot about 7 being magic, but I still > think it's the firmware which causes it by unmasking the PIC irqs. That's my understanding too. Thanks, Rafael ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-13 18:19 ` Thomas Gleixner 2017-12-13 20:52 ` Thomas Gleixner @ 2017-12-13 22:39 ` Rafael J. Wysocki 2017-12-13 23:26 ` Rafael J. Wysocki 1 sibling, 1 reply; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-13 22:39 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, linux-pci, linux-pm On Wednesday, December 13, 2017 7:19:17 PM CET Thomas Gleixner wrote: > On Wed, 13 Dec 2017, Linus Torvalds wrote: > > > On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > Definitely. That was fragile forever but puzzles me is that I can't figure > > > out what now causes that spurious interrupt to surface out of the blue. > > > > Perhaps just timing? > > That's what I'm trying to figure out right now, because that is the only > sensible explanation left. The whole machinery of suspend is exactly the > same with and without the vector changes. I instrumented all functions > involved and the picture is the same. I even do not see any fundamental > timing differences where one would say: That's it. > > What puzzles me even more is that in the range of commits I'm fiddling with > there is no other change than the vector management stuff and the point > where it breaks makes no sense at all. The point Maarten bisected it to > works nicely here, so that might just point to a very subtle timing issue. > > > How hard would it be to change the ordering to just redirect irqs first? > > The whole interrupt redirection happens when the non boot CPUs are brought > down, which is the very last step before the actual suspend happens. > > We could probably do that earlier, but that's something Rafael needs to > answer ultimately. Well, that's both flattering and concerning. ;-) Anyway, yes, we can do that earlier AFAICS. Action handlers are not going to run after we've called suspend_device_irqs() which happens before the final stage of PCI devices suspend (suspend_noirq) and it doesn't matter which CPU gets the interrupt from that point on (it is either wakeup or unwanted then). Thanks, Rafael ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-13 22:39 ` Rafael J. Wysocki @ 2017-12-13 23:26 ` Rafael J. Wysocki 0 siblings, 0 replies; 83+ messages in thread From: Rafael J. Wysocki @ 2017-12-13 23:26 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Bjorn Helgaas, Maarten Lankhorst, Michal Hocko, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers, Daniel Vetter, Bjorn Helgaas, Rafael J. Wysocki, Linux PCI, Linux PM On Wed, Dec 13, 2017 at 11:39 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > On Wednesday, December 13, 2017 7:19:17 PM CET Thomas Gleixner wrote: >> On Wed, 13 Dec 2017, Linus Torvalds wrote: >> >> > On Wed, Dec 13, 2017 at 8:41 AM, Thomas Gleixner <tglx@linutronix.de> wrote: >> > > >> > > Definitely. That was fragile forever but puzzles me is that I can't figure >> > > out what now causes that spurious interrupt to surface out of the blue. >> > >> > Perhaps just timing? >> >> That's what I'm trying to figure out right now, because that is the only >> sensible explanation left. The whole machinery of suspend is exactly the >> same with and without the vector changes. I instrumented all functions >> involved and the picture is the same. I even do not see any fundamental >> timing differences where one would say: That's it. >> >> What puzzles me even more is that in the range of commits I'm fiddling with >> there is no other change than the vector management stuff and the point >> where it breaks makes no sense at all. The point Maarten bisected it to >> works nicely here, so that might just point to a very subtle timing issue. >> >> > How hard would it be to change the ordering to just redirect irqs first? >> >> The whole interrupt redirection happens when the non boot CPUs are brought >> down, which is the very last step before the actual suspend happens. >> >> We could probably do that earlier, but that's something Rafael needs to >> answer ultimately. > > Well, that's both flattering and concerning. ;-) > > Anyway, yes, we can do that earlier AFAICS. Action handlers are not going to > run after we've called suspend_device_irqs() which happens before the final > stage of PCI devices suspend (suspend_noirq) and it doesn't matter which CPU > gets the interrupt from that point on (it is either wakeup or unwanted then). There is a catch that we don't and likely should not do that for suspend-to-idle, but since we have pm_suspend_target_state now, that case can be distinguished from the "full suspend" one readily. Thanks, Rafael ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-06 12:15 ` Michal Hocko 2017-12-06 12:23 ` Thomas Gleixner 2017-12-06 12:31 ` Maarten Lankhorst @ 2017-12-07 7:55 ` Michal Hocko 2017-12-10 20:30 ` Michal Hocko 2 siblings, 1 reply; 83+ messages in thread From: Michal Hocko @ 2017-12-07 7:55 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers On Wed 06-12-17 13:14:52, Michal Hocko wrote: > On Mon 04-12-17 14:36:20, Linus Torvalds wrote: > > On Mon, Dec 4, 2017 at 2:25 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > > > So far, resume from suspend-to-RAM (ACPI S3) is broken on all of the > > > systems I have tested, so it is probably safe to assume it to be > > > broken everywhere. > > > > Oh, it's definitely not broken everywhere, because I use it myself, > > and was traveling last week due to my mom's bday. > > > > HOWEVER. > > > > Some of the x86 work seems to have broken it for some configurations. > > In particular, do you have a big "everything enabled" kernel config - > > particularly lockdep and irqflags tracing enabled? > > > > Andy has a patch, but it hasn't made it to me yet (probably because > > the x86 people are very busy with the kaiser work): > > > > https://lkml.org/lkml/2017/11/30/546 > > > > (also note his follow-up "fix the commit message" note, but that one > > doesn't actually affect the code itself). > > merging tip/x86/urgent on top of your tree fixed this problem for me, > but I am seeing something else > [ 131.711412] ACPI: Preparing to enter system sleep state S3 > [ 131.755328] ACPI: EC: event blocked > [ 131.755328] ACPI: EC: EC stopped > [ 131.755328] PM: Saving platform NVS memory > [ 131.755344] Disabling non-boot CPUs ... > [ 131.779330] IRQ 124: no longer affine to CPU1 > [ 131.780334] smpboot: CPU 1 is now offline > [ 131.804465] smpboot: CPU 2 is now offline > [ 131.827291] IRQ 122: no longer affine to CPU3 > [ 131.827292] IRQ 123: no longer affine to CPU3 > [ 131.828293] smpboot: CPU 3 is now offline > [ 131.830991] ACPI: Low-level resume complete > [ 131.831092] ACPI: EC: EC started > [ 131.831093] PM: Restoring platform NVS memory > [ 131.831864] do_IRQ: 0.55 No irq handler for vector > [ 131.831884] Enabling non-boot CPUs ... > [ 131.831909] x86: Booting SMP configuration: > [ 131.831910] smpboot: Booting Node 0 Processor 1 APIC 0x2 > [ 131.832913] cache: parent cpu1 should not be sleeping > [ 131.833058] CPU1 is up > [ 131.833067] smpboot: Booting Node 0 Processor 2 APIC 0x1 > [ 131.833864] cache: parent cpu2 should not be sleeping > [ 131.833983] CPU2 is up > [ 131.833995] smpboot: Booting Node 0 Processor 3 APIC 0x3 > [ 131.834776] cache: parent cpu3 should not be sleeping > [ 131.834923] CPU3 is up > > "No irq handler" part looks a bit scary (maybe related to lost affinity > messages?) but the following messages look quite as well. Is this > something known? The system seems to be up and running without any > visible issues. Hmm, there is still something bad going on during resume. My laptop haven't woken up from s2ram this morning. The screen was powered on but the system hasn't come up. The last thing that made it into the kernel log on fs is this Dec 6 19:32:29 tiehlicka kernel: [21898.084685] PM: suspend entry (deep) which won't tell us much I suspect. I've tried dozen s2ram cycles and it hasn't reproduced so it smells like a timing issue. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2: Regression in resume from ACPI S3 2017-12-07 7:55 ` Michal Hocko @ 2017-12-10 20:30 ` Michal Hocko 0 siblings, 0 replies; 83+ messages in thread From: Michal Hocko @ 2017-12-10 20:30 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Andy Lutomirski, Linux Kernel Mailing List, the arch/x86 maintainers On Thu 07-12-17 08:55:08, Michal Hocko wrote: > On Wed 06-12-17 13:14:52, Michal Hocko wrote: > > On Mon 04-12-17 14:36:20, Linus Torvalds wrote: > > > On Mon, Dec 4, 2017 at 2:25 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > > > > > So far, resume from suspend-to-RAM (ACPI S3) is broken on all of the > > > > systems I have tested, so it is probably safe to assume it to be > > > > broken everywhere. > > > > > > Oh, it's definitely not broken everywhere, because I use it myself, > > > and was traveling last week due to my mom's bday. > > > > > > HOWEVER. > > > > > > Some of the x86 work seems to have broken it for some configurations. > > > In particular, do you have a big "everything enabled" kernel config - > > > particularly lockdep and irqflags tracing enabled? > > > > > > Andy has a patch, but it hasn't made it to me yet (probably because > > > the x86 people are very busy with the kaiser work): > > > > > > https://lkml.org/lkml/2017/11/30/546 > > > > > > (also note his follow-up "fix the commit message" note, but that one > > > doesn't actually affect the code itself). > > > > merging tip/x86/urgent on top of your tree fixed this problem for me, > > but I am seeing something else > > [ 131.711412] ACPI: Preparing to enter system sleep state S3 > > [ 131.755328] ACPI: EC: event blocked > > [ 131.755328] ACPI: EC: EC stopped > > [ 131.755328] PM: Saving platform NVS memory > > [ 131.755344] Disabling non-boot CPUs ... > > [ 131.779330] IRQ 124: no longer affine to CPU1 > > [ 131.780334] smpboot: CPU 1 is now offline > > [ 131.804465] smpboot: CPU 2 is now offline > > [ 131.827291] IRQ 122: no longer affine to CPU3 > > [ 131.827292] IRQ 123: no longer affine to CPU3 > > [ 131.828293] smpboot: CPU 3 is now offline > > [ 131.830991] ACPI: Low-level resume complete > > [ 131.831092] ACPI: EC: EC started > > [ 131.831093] PM: Restoring platform NVS memory > > [ 131.831864] do_IRQ: 0.55 No irq handler for vector > > [ 131.831884] Enabling non-boot CPUs ... > > [ 131.831909] x86: Booting SMP configuration: > > [ 131.831910] smpboot: Booting Node 0 Processor 1 APIC 0x2 > > [ 131.832913] cache: parent cpu1 should not be sleeping > > [ 131.833058] CPU1 is up > > [ 131.833067] smpboot: Booting Node 0 Processor 2 APIC 0x1 > > [ 131.833864] cache: parent cpu2 should not be sleeping > > [ 131.833983] CPU2 is up > > [ 131.833995] smpboot: Booting Node 0 Processor 3 APIC 0x3 > > [ 131.834776] cache: parent cpu3 should not be sleeping > > [ 131.834923] CPU3 is up > > > > "No irq handler" part looks a bit scary (maybe related to lost affinity > > messages?) but the following messages look quite as well. Is this > > something known? The system seems to be up and running without any > > visible issues. > > Hmm, there is still something bad going on during resume. My laptop > haven't woken up from s2ram this morning. The screen was powered on > but the system hasn't come up. It's been few days and I haven't seen this problem again. And I am doing s2ram all the time... -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: Linux 4.15-rc2 2017-12-03 16:22 Linux 4.15-rc2 Linus Torvalds 2017-12-04 22:25 ` Linux 4.15-rc2: Regression in resume from ACPI S3 Rafael J. Wysocki @ 2018-02-21 18:36 ` Eugene Syromiatnikov 1 sibling, 0 replies; 83+ messages in thread From: Eugene Syromiatnikov @ 2018-02-21 18:36 UTC (permalink / raw) To: Linus Torvalds; +Cc: Linux Kernel Mailing List, Al Viro On Sun, Dec 03, 2017 at 11:22:56AM -0500, Linus Torvalds wrote: > > Linus Torvalds (6): > Rename superblock flags (MS_xyz -> SB_xyz) This commit, while claims that it changes internal flags, also touches an UAPI header (include/uapi/linux/bfs_fs.h), specifically, the macro BFS_UNCLEAN. I expect that either this macro should be in a private header, or (if this check is expected to be available to the userspace) at least SB_RDONLY should be also defined there. ^ permalink raw reply [flat|nested] 83+ messages in thread
end of thread, other threads:[~2018-02-21 18:36 UTC | newest] Thread overview: 83+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-12-03 16:22 Linux 4.15-rc2 Linus Torvalds 2017-12-04 22:25 ` Linux 4.15-rc2: Regression in resume from ACPI S3 Rafael J. Wysocki 2017-12-04 22:36 ` Linus Torvalds 2017-12-04 22:38 ` Thomas Gleixner 2017-12-04 22:41 ` Rafael J. Wysocki 2017-12-05 0:25 ` Rafael J. Wysocki 2017-12-09 10:33 ` Pavel Machek 2017-12-09 11:41 ` Pavel Machek [not found] ` <CA+55aFw8tuoJ2gcXx3K2sKFf2Y9hXX4naMVQNqGOUivnjwhjkg@mail.gmail.com> 2017-12-09 22:01 ` Pavel Machek [not found] ` <CA+55aFySAdiBZhZ0PSDjH5PuvPPcMsBRXbxCkObfm1eY7gHDbQ@mail.gmail.com> 2017-12-10 16:23 ` Pavel Machek 2017-12-10 16:37 ` Linus Torvalds 2017-12-10 18:56 ` Pavel Machek 2017-12-10 20:30 ` Linus Torvalds 2017-12-10 20:43 ` Pavel Machek 2017-12-10 21:28 ` Linus Torvalds 2017-12-10 21:35 ` Pavel Machek 2017-12-12 17:27 ` Linus Torvalds 2017-12-12 18:05 ` Andy Lutomirski 2017-12-12 18:36 ` Linus Torvalds 2017-12-12 22:10 ` Andy Lutomirski 2017-12-12 22:33 ` Linus Torvalds 2017-12-12 23:10 ` Andy Lutomirski 2017-12-13 11:16 ` Jarkko Nikula 2017-12-13 12:40 ` Ingo Molnar 2017-12-13 18:50 ` Andy Lutomirski 2017-12-10 21:38 ` [PATCH] Fix resume on x86-32 machines Pavel Machek 2017-12-10 21:58 ` Andy Lutomirski 2017-12-10 22:20 ` Pavel Machek 2017-12-11 9:25 ` Jarkko Nikula 2017-12-11 14:22 ` Rafael J. Wysocki 2017-12-11 14:43 ` Rafael J. Wysocki 2017-12-11 14:59 ` Jarkko Nikula 2017-12-11 18:31 ` Linus Torvalds 2017-12-11 18:41 ` Andy Lutomirski 2017-12-11 19:12 ` Linus Torvalds 2017-12-14 20:38 ` Pavel Machek 2017-12-14 20:47 ` Linus Torvalds 2017-12-14 21:20 ` Andy Lutomirski 2017-12-14 22:22 ` Pavel Machek 2017-12-11 15:13 ` Ingo Molnar 2017-12-11 16:26 ` Andy Lutomirski 2017-12-11 14:09 ` Linux 4.15-rc2: Regression in resume from ACPI S3 Zhang Rui 2017-12-11 16:28 ` Andy Lutomirski 2017-12-12 8:00 ` Pavel Machek 2017-12-06 12:15 ` Michal Hocko 2017-12-06 12:23 ` Thomas Gleixner 2017-12-06 14:04 ` Rafael J. Wysocki 2017-12-06 12:31 ` Maarten Lankhorst 2017-12-06 12:46 ` Thomas Gleixner 2017-12-06 13:09 ` Maarten Lankhorst 2017-12-06 14:15 ` Thomas Gleixner 2017-12-07 13:33 ` Maarten Lankhorst 2017-12-08 10:30 ` Thomas Gleixner 2017-12-13 15:57 ` Thomas Gleixner 2017-12-13 16:23 ` Bjorn Helgaas 2017-12-13 16:41 ` Thomas Gleixner 2017-12-13 17:45 ` Linus Torvalds 2017-12-13 18:19 ` Thomas Gleixner 2017-12-13 20:52 ` Thomas Gleixner 2017-12-13 21:06 ` Thomas Gleixner 2017-12-13 22:48 ` Rafael J. Wysocki 2017-12-14 11:54 ` Thomas Gleixner 2017-12-14 12:12 ` Rafael J. Wysocki 2017-12-14 12:30 ` Thomas Gleixner 2017-12-14 15:30 ` Rafael J. Wysocki 2017-12-14 15:52 ` Thomas Gleixner 2017-12-14 15:54 ` Rafael J. Wysocki 2017-12-14 16:17 ` Maarten Lankhorst 2017-12-15 2:07 ` [PATCH] PCI / PM: Force devices to D0 in pci_pm_thaw_noirq() Rafael J. Wysocki 2017-12-15 14:28 ` Rafael J. Wysocki 2017-12-15 18:30 ` Bjorn Helgaas 2017-12-15 23:44 ` Rafael J. Wysocki 2017-12-14 13:24 ` Linux 4.15-rc2: Regression in resume from ACPI S3 Thomas Gleixner 2017-12-14 19:03 ` Linus Torvalds 2017-12-14 22:36 ` Thomas Gleixner 2017-12-14 22:47 ` Linus Torvalds 2017-12-15 9:05 ` Thomas Gleixner 2017-12-15 0:34 ` Rafael J. Wysocki 2017-12-13 22:39 ` Rafael J. Wysocki 2017-12-13 23:26 ` Rafael J. Wysocki 2017-12-07 7:55 ` Michal Hocko 2017-12-10 20:30 ` Michal Hocko 2018-02-21 18:36 ` Linux 4.15-rc2 Eugene Syromiatnikov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).