* Linux 2.6.34-rc3
@ 2010-03-30 17:50 Linus Torvalds
  2010-03-30 21:16 ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) Rafael J. Wysocki
  2010-04-02 17:59 ` Ugly rmap NULL ptr deref oopsie on hibernate (was " Borislav Petkov
  0 siblings, 2 replies; 231+ messages in thread
From: Linus Torvalds @ 2010-03-30 17:50 UTC
To: Linux Kernel Mailing List

Ok, so -rc2 was messy, no question about it. I'm too much of a softie to
hold back some people's work, so my hard-line -rc1 didn't work out the way
I wanted. But _next_ time! For sure this time.

Anyway, from a messy -rc2 we now have a -rc3 that should be in much better
shape. Regressions fixed, and the ShortLog is short enough to be worth
posting to lkml (-rc1 never is, and -rc2 seldom is. It's not like -rc2's
are generally wonderful, this time around wasn't _that_ much different).

One regression fix that is worth pointing out is the EXT3_STATE_NEW
handling in ext3, because the regression that one fixed was potentially
quite nasty. It wouldn't cause data corruption, but it _could_ cause at
least corrupt security labels.

So if you have SELinux enabled (either in permissive or enforcing mode)
_and_ you ran 2.6.34-rc[12] _and_ your filesystem is ext3, you should not
just update, you should make sure your extended attributes are fixed.

The easiest way to fix them is likely to just check the "Relabel on next
boot" checkmark in the SELinux config GUI ("system-config-selinux" if you
don't do that whole admin menu thing), and reboot into 2.6.34-rc3.

[ And you can use something like 'restorecon -rv /' or whatever after
  booting into the fixed kernel. See your nearest SELinux manual for real
  details. ]

You might even want to do the whole "touch /forcefsck" before rebooting to
make sure fsck runs (I don't think it matters, but it won't hurt - the
relabeling will be so slow that whatever time your fsck takes is totally
irrelevant, so do them both and get it over with).

Of course, I suspect most people who run experimental kernels are also the
kinds of people who have turned off SELinux in annoyance long ago, or tend
to be the kinds of people who long since upgraded to ext4 (which didn't
have the problem), but hey, what do I know.

In short - if you started seeing odd security messages after running early
2.6.34 -rc kernels, now you know what was going on.

Other than that? Random fixes and updates all over. Mostly drivers and
filesystems, and mostly fairly small things. If you had PCI resource
conflict problems with the early -rc's due to the _CRS window thing, for
example, that should hopefully be fixed. See the appended shortlog for
other details.

		Linus

---
Abraham Arce (1): KS8851: Avoid NULL pointer in set rx mode Adel Gadllah (1): iwlwifi: Silence tfds_in_queue message Adrian Hunter (1): mmc: fix incorrect interpretation of card type bits Al Viro (1): Restore LOOKUP_DIRECTORY hint handling in final lookup on open() Alexander Duyck (3): igb: only use vlan_gro_receive if vlans are registered skbuff: remove unused dma_head & dma_maps fields igb: use correct bits to identify if managability is enabled Alexandra Kossovsky (1): tcp: Fix OOB POLLIN avoidance. 
Amerigo Wang (1): netpoll: warn when there are spaces in parameters Ameya Palande (1): regulator: Get rid of lockdep warning Amit Kumar Salecha (4): netxen: fix bios version calculation netxen: fix warning in ioaddr for NX3031 chip netxen: added sanity check for pci map netxen: update version to 4.0.73 Amit Shah (2): virtio: console: Generate a kobject CHANGE event on adding 'name' attribute virtio: console: Check if port is valid in resize_console Andreas Bombe (1): sh64: Remove long unused mid_sched macro Andreas Herrmann (1): x86, amd: Restrict usage of c1e_idle() Andrei Emeltchenko (1): Bluetooth: Fix kernel crash on L2CAP stress tests Andrew Morton (2): timer stats: Fix del_timer_sync() and try_to_del_timer_sync() kernel/sched.c: Suppress unused var warning Andy Gospodarek (1): bonding: fix broken multicast with round-robin mode Anton Blanchard (1): ppc64 sys_ipc breakage in 2.6.34-rc2 Arnaldo Carvalho de Melo (2): perf top: Improve the autosizing of column lenghts perf top: Add missing initialization to zero Axel Lin (2): lp3971: Fix setting val for LDO2 and LDO4 lp3971: Fix BUCK_VOL_CHANGE_SHIFT logic Ben Blum (1): cgroups: net_cls as module Ben Menchaca (1): gianfar: fix undo of reserve() Benjamin Li (1): bnx2: Fix netpoll crash. 
Bjorn Helgaas (11): resources: add interfaces that return conflict information PCI: for address space collisions, show conflicting resource PCI: break out primary/secondary/subordinate for readability PCI: make disabled window printk style match the enabled ones PCI: print resources consistently with %pR PCI: complain about devices that seem to be broken PCI: don't say we claimed a resource if we failed x86/PCI: remove redundant warnings frv/PCI: remove redundant warnings x86/PCI: for host bridge address space collisions, show conflicting resource x86/PCI: truncate _CRS windows with _LEN > _MAX - _MIN + 1 Borislav Petkov (2): edac, mce: Filter out invalid values fs/binfmt_aout.c: fix pointer warnings Brandon L Black (1): net: Add MSG_WAITFORONE flag to recvmmsg Carolyn Wyborny (1): igb: Add support for 82576 ET2 Quad Port Server Adapter Cheng Renquan (1): ceph: some documentations fixes Chris Leech (1): ixgbe: filter FIP frames into the FCoE offload queues Chris Wilson (1): drm/i915: Avoid NULL deref in get_pages() unwind after error. 
Christian Borntraeger (1): [S390] system.h: Fix compile error for 1 and 2 byte cmpxchg Christian Lamparter (2): [ARM] Kirkwood: WPS button keycode mapping [ARM] Orion5x: replace KEY_WLAN with KEY_WPS_BUTTON Clemens Ladisch (4): firewire: core: fw_iso_resource_manage: fix error handling firewire: ohci: add cycle timer quirk for the TI TSB12LV22 ALSA: cmipci: work around invalid PCM pointer PCI quirk: RS780/RS880: work around missing MSI initialization Colin Ian King (1): softlockup: Stop spurious softlockup messages due to overflow Crane Cai (1): i2c-scmi: Support IBM SMBus CMI devices Daisuke Nishimura (1): memcg: disable move charge in no mmu case Dan Carpenter (11): drm/i915: fix small leak on overlay error path sunrpc: handle allocation errors from __rpc_lookup_create() pxa168fb: fix incorrect resource calculation AFS: Potential null dereference regulator: handle kcalloc() failure ceph: handle kmalloc() failure af_key: return error if pfkey_xfrm_policy2msg_prep() fails memcontrol: fix potential null deref kcore: fix test for end of list fscache: add missing unlock hwmon: (w83793) Saving negative errors in unsigned Daniel Chen (1): ALSA: ac97: Add Toshiba P500 to ac97 jack sense blacklist Daniel Mack (2): ASoC: pxa-pcm-lib: initialize DMA channel to -1 [ARM] pxa/raumfeld: fix button name Daniel T Chen (3): ALSA: hda: Fix 0 dB offset for HP laptops using CX20551 (Waikiki) ALSA: ac97: Add IBM ThinkPad R40e to Headphone/Line Jack Sense blacklist ALSA: hda: Use LPIB for ga-ma770-ud3 board Daniel Taylor (1): fs/partitions/msdos: add support for large disks Daniel Vetter (1): drm/intel: fix up set_tiling for untiled->tiled transition Darrick J. 
Wong (2): acpi: Support IBM SMBus CMI devices i2c-scmi: Provide module aliases for automatic loading Dave Airlie (1): slow-work: use get_ref wrapper instead of directly calling get_ref David Howells (9): nommu: fix an incorrect comment in the do_mmap_shared_file() Document Linux's circular buffering capabilities FDPIC: For-loop in elf_core_vma_data_size() is incorrect do_sync_read/write() should set kiocb.ki_nbytes to be consistent NOMMU: Revert 'nommu: get_user_pages(): pin last page on non-page-aligned start' NOMMU: Fix __get_user_pages() to pin last page on offset buffers SLOW_WORK: CONFIG_SLOW_WORK_PROC should be CONFIG_SLOW_WORK_DEBUG frv/chris: fix lines with a missing semicolons KEYS: Add MAINTAINERS record David Härdeman (1): kfifo: fix KFIFO_INIT in include/linux/kfifo.h David S. Miller (7): via-velocity: Fix FLOW_CNTL_TX_RX handling in set_mii_flow_control() isdn: Add netdev to lists in MAINTAINERS entry. Revert "r8169: enable 64-bit DMA by default for PCI Express devices (v2)" Revert "via82cxxx: workaround h/w bugs" tulip: Add missing parens. Revert "ide: skip probe if there are no devices on the port (v2)" sparc64: Properly truncate pt_regs framepointer in perf callback. 
Dean Nelson (4): PCI: fix return value from pcix_get_max_mmrbc() PCI: fix access of PCI_X_CMD by pcix get and set mmrbc functions PCI: cleanup error return for pcix get and set mmrbc functions hwmon: (coretemp) Add missing newline to dev_warn() message Derek Kelly (1): ALSA: hda - Add support of Nvidia GT220 HDMI Dmitry Torokhov (1): Regulators: max8925-regulator - clean up driver data after removal Dominik Brodowski (4): pcmcia: do not use ioports < 0x100 on x86 pcmcia: allow for four multifunction subdevices (again) power: support _noirq actions on device types and classes pcmcia: use dev_pm_ops for class pcmcia_socket_class Emil Tantilov (4): igb: do not modify tx_queue_len on link speed change igbvf: do not modify tx_queue_len on link speed change e1000e: do not modify tx_queue_len on link speed change e1000: do not modify tx_queue_len on link speed change Eric Anholt (6): drm/i915: Don't bother with the BKL for GEM ioctls. drm/i915: Enable VS timer dispatch. agp/intel: Respect the GTT size on Sandybridge for scratch page setup. agp/intel: Don't do the chipset flush on Sandybridge. drm/i915: Set up the documented clock gating on Sandybridge and Ironlake. drm/i915: Stop trying to use ACPI lid status to determine LVDS connection. Eric Dumazet (3): net: Potential null skb->dev dereference netfilter: xt_hashlimit: dl_seq_stop() fix netfilter: xt_hashlimit: IPV6 bugfix Eric Miao (3): [ARM] mmp: fix for variables in uncompress.h being discarded [ARM] pxa: remove unnecessary 'select FB_W100' from some platforms [ARM] pxa/sharpsl: add dependency of max1111 driver to sharpsl_pm Eric Sandeen (1): ext4: Fixed inode allocator to correctly track a flex_bg's used_dirs Eric W. Biederman (1): netxen: The driver doesn't work on NX_P3_B1 so cause probe to fail. 
FUJITA Tomonori (1): Documentation: rename PCI/PCI-DMA-mapping.txt to DMA-API-HOWTO.txt Felix Fietkau (1): ath9k: fix BUG_ON triggered by PAE frames Francois Romieu (1): r8169: fix broken register writes Grazvydas Ignotas (1): wl1251: fix potential crash Greg Rose (7): ixgbevf: Fix VF Stats accounting after reset ixgbevf: Shorten up delay timer for watchdog task ixgbevf: Message formatting cleanups ixgbevf: Fix signed/unsigned int error ixgbe: In SR-IOV mode insert delay before bring the adapter up ixgbe: Change where clear_to_send_flag is reset to zero. ixgbe: Do not run all Diagnostic offline tests when VFs are active Greg Thelen (1): memcg: fix typo in memcg documentation Guennadi Liakhovetski (3): ASoC: SIU driver shall select FW_LOADER SH: fix SCIFA SCASCR register bit definitions SH: remove superfluous warning from the serial driver Guenter Roeck (1): ipv4: Don't drop redirected route cache entry unless PTMU actually expired Guo-Fu Tseng (3): jme: Fix VLAN memory leak jme: Protect vlgrp structure by pause RX actions. 
jme: Advance driver version number H Hartley Sweeten (2): [ARM] locomo: fix SPI register offset [ARM] locomo: fix unpaired spin_lock_irqsave Hans-Joachim Picht (1): [S390] fix broken proc interface for sclp_async Heiko Carstens (2): [S390] smp: fix lowcore allocation [S390] sclp: avoid 64 bit division Henrik Kretzschmar (5): genirq: Move two IRQ functions from .init.text to .text isdn: Cleanup Sections in PCMCIA driver sedlbauer isdn: Cleanup Sections in PCMCIA driver teles isdn: Cleanup Sections in PCMCIA driver avma1 isdn: Cleanup Sections in PCMCIA driver elsa Herbert Xu (1): ipv6: Remove redundant dst NULL check in ip6_dst_check Huang Weiyi (1): [ARM] pxa/raumfeld: remove duplicated #include Ian Campbell (1): x86: Do not free zero sized per cpu areas Jan Beulich (1): x86: Fix placement of FIX_OHCI1394_BASE Jan Kara (2): ext4: Fix estimate of # of blocks needed to write indirect-mapped files ext4: Don't use delayed allocation by default when used instead of ext3 Jani Nikula (1): c2port: fix device_create() return value check Jarkko Nikula (1): ALSA: pcm_lib - fix xrun functionality Jaswinder Singh Rajput (1): hwmon: (asc7621) Add X58 entry in Kconfig Jeff Dike (1): vhost: fix error path in vhost_net_set_backend Jeff Layton (1): NFS: don't try to decode GETATTR if DELEGRETURN returned error Jeff Mahoney (2): reiserfs: fix oops while creating privroot with selinux enabled reiserfs: properly honor read-only devices Jens Rottmann (1): ksz884x: fix return value of netdev_set_eeprom Jiri Kosina (1): x86: Remove excessive early_res debug output Joe Perches (3): drivers/gpu/drm/i915/intel_bios.c: fix continuation line formats MAINTAINERS: use tab not spaces for delimiter drivers/net: Fix continuation lines Joern Engel (12): Open segment file before using it Limit max_pages for insane devices Plug memory leak in writeseg_end_io Prevent schedule while atomic in __logfs_readdir Write out both superblocks on mismatch Fix logfs_get_sb_final error path Use 
deactivate_locked_super Prevent data corruption in logfs_rewrite_block() Simplify and fix pad_wbuf [LogFS] Clear PagePrivate when moving journal [LogFS] Move reserved segments with journal [LogFS] Erase new journal segments John Fastabend (1): ixgbe: cleanup maximum number of tx queues John Stultz (1): time: Fix accumulation bug triggered by long delay. Jon Maloy (1): TIPC: Removed inactive maintainer Jonathan Cameron (2): [ARM] pxa: fix for variables in uncompress.h being discarded [ARM] pxa: remove spi cs gpio direction to avoid clash with driver JosephChan@via.com.tw (2): pata_via: Add VIA VX900 support pata_via: fix VT6410/6415/6330 detection issue Jozsef Kadlecsik (1): netfilter: ip6table_raw: fix table priority Julia Lawall (2): sound/oss/vidc.c: change the field used with DMA_ACTIVE arch/sparc/kernel: Use set_cpus_allowed_ptr KOSAKI Motohiro (6): sched: sched_getaffinity(): Allow less than NR_CPUS length sched: Use proper type in sched_getaffinity() tmpfs: mpol=bind:0 don't cause mount error. tmpfs: handle MPOL_LOCAL mount option properly tmpfs: cleanup mpol_parse_str() doc: add the documentation for mpol=local Ken Kawasaki (1): pcnet_cs: add new id Komuro (1): pd6729: Coding Style fixes Kunal Gangakhedkar (1): ALSA: hda - Add PCI quirk for HP dv6-1110ax. 
Kuninori Morimoto (3): sh: mach-ecovec24: Add i2c_put_adapter on sh_eth_init sh: ms7724: Add tiny-document for sound sh: Add watch-dog register address for SH7722/SH7723/SH7724 Kyle McMartin (1): tulip: Fix null dereference in uli526x_rx_packet() Lai Jiangshan (2): rcu: Fix tracepoints & lockdep false positive rcu: Fix local_irq_disable() CONFIG_PROVE_RCU=y false positives Lee Schermerhorn (1): mempolicy: fix get_mempolicy() for relative and static nodes Lennart Schulte (1): tcp: Fix tcp_mark_head_lost() with packets == 0 Li Zefan (1): cgroups: remove duplicate include Linus Torvalds (3): Fix up prototype for sys_ipc breakage ext3: fix broken handling of EXT3_STATE_NEW Linux 2.6.34-rc3 Magnus Damm (1): serial: sh-sci: fix SH-Mobile SH breakage Mallikarjuna R Chilakala (3): ixgbe: Fix 82599 multispeed fiber link issues due to Tx laser flapping ixgbe: Fix 82599 KX4 Wake on LAN issue after an improper system shutdown ixgbe: Set IXGBE_RSC_CB(skb)->DMA field to zero after unmapping the address Marcel Holtmann (2): Bluetooth: Fix potential bad memory access with sysfs files Bluetooth: Convert debug files to actually use debugfs instead of sysfs Mark Brown (2): ASoC: Bail out of wm_hubs DC servo if calibration fails ASoC: Remove BROKEN from i.MX audio after dependencies merged Mark Fasheh (3): ocfs2: set i_mode on disk during acl operations ocfs2: Always try for maximum bits with new local alloc windows ocfs2: Clear undo bits when local alloc is freed Martin Schwidefsky (1): [S390] fix boot failures with compressed kernels Masami Hiramatsu (4): perf probe: Fix probe_point buffer overrun perf probe: Fix need_dwarf flag if lazy matching is used perf probe: Fix offset to allow signed value perf probe: Use original address instead of CU-based address Mathieu Desnoyers (1): CRED: Fix memory leak in error handling Matt Fleming (3): sh: Flush ITLB too in PTEAEX's flush_tlb_page() sh: Replace unsafe manipulation of MMUCR sh: Fix build after dynamic PMB rework Matthew Wilcox (1): 
PCI quirk: Disable MSI on VIA K8T890 systems Miao Xie (2): cpuset: fix the problem that cpuset_mem_spread_node() returns an offline node cpuset: alloc nodemask_t on the heap rather than the stack Michael Chan (1): bnx2: Use proper handler during netpoll. Michael Grzeschik (1): lxfb: set the H- and V-SYNC polarity of the flatpanel output Michael Holzheu (1): [S390] zcore: CPU registers are not saved under LPAR Michael S. Tsirkin (3): vhost: fix interrupt mitigation with raw sockets vhost: fix error handling in vring ioctls exit: fix oops in sync_mm_rss Mike Frysinger (2): can: bfin_can: switch to common Blackfin can header blackfin: enable DEBUG_SECTION_MISMATCH Mitch Williams (1): igb: count Rx FIFO errors correctly Neil Horman (1): r8169: offical fix for CVE-2009-4537 (overlength frame DMAs) Nick Bowler (1): Staging: et131x: Properly disable FC in txmac. Nicolas Dichtel (1): net: ipmr/ip6mr: prevent out-of-bounds vif_table access OGAWA Hirofumi (1): fs/partition/msdos: fix unusable extended partition for > 512B sector Owain G. Ainsworth (1): drm/i915: remove an unnecessary wait_request() Pablo Neira Ayuso (3): netlink: fix unaligned access in nla_get_be64() netlink: fix NETLINK_RECV_NO_ENOBUFS in netlink_set_err() netfilter: ctnetlink: fix reliable event delivery if message building fails Patrick McHardy (3): net: ipmr/ip6mr: fix potential out-of-bounds vif_table access netfilter: xt_recent: fix regression in rules using a zero hit_count net: fix netlink address dumping in IPv4/IPv6 Paul E. McKenney (2): rcu: Make rcu_read_lock_bh_held() allow for disabled BH net: suppress lockdep-RCU false positive in FIB trie. Paul Mackerras (1): powerpc/perf_events: Fix call-graph recording, add perf_arch_fetch_caller_regs Paul Mundt (3): PCI: kill off pci_register_set_vga_state() symbol export. sh: Tidy up a couple of section mismatches. sh: Silence unintialized variable warnings in dwarf unwinder. 
Paulius Zaleckas (1): if_tunnel.h: add missing ams/byteorder.h include Pavel Emelyanov (2): ipv4: Cleanup struct net dereference in rt_intern_hash ipv4: Restart rt_intern_hash after emergency rebuild (v2) Peter Ujfalusi (2): ASoC: tlv320dac33: Fix DSP modes ASoC: tlv320dac33: Internal clocking changes Prarit Bhargava (1): hwmon: (coretemp) Fix cpu model output Priit Laes (1): drm/i915: Rename FBC_C3_IDLE to FBC_CTL_C3_IDLE to match other registers Rafael J. Wysocki (1): x86 / perf: Fix suspend to RAM on HP nx6325 Randy Dunlap (2): scripts/kernel-doc: handle struct member __aligned scripts/kernel-doc: fix fatal error on function prototype Ravikiran G Thirumalai (1): tmpfs: fix oops on mounts with mpol=default Richard Röjfors (1): drivers/gpio/max730x.c: add license macro Rob Landley (1): sparc: Fix use of uid16_t and gid16_t in asm/stat.h Robert Love (2): ixgbe: Don't allow user buffer count to exceed 256 ixgbe: Priority tag FIP frames Robin Holt (1): mm/ksm.c is doing an unneeded _notify in write_protect_page. 
Russell King (3): ARM: Fix IXP23xx build error in mach/memory.h ARM: Update mach-types Documentation/volatile-considered-harmful.txt: correct cpu_relax() documentation Ryusuke Konishi (3): nilfs2: fix duplicate call to nilfs_segctor_cancel_freev nilfs2: fix hang-up of cleaner after log writer returned with error nilfs2: fix imperfect completion wait in nilfs_wait_on_logs Sachin Prabhu (1): Skip check for mandatory locks when unlocking Sage Weil (26): ceph: implemented caps should always be superset of issued caps ceph: add missing locking to protect i_snap_realm_item during split ceph: fix inode removal from snap realm when racing with migration ceph: fix authenticator timeout ceph: fix authenticator buffer size calculation ceph: release old ticket_blob buffer ceph: clean up service ticket decoding ceph: fix null pointer deref of r_osd in debug output ceph: drop unnecessary WARN_ON in caps migration ceph: fix session locking in handle_caps, ceph_check_caps ceph: clean up handle_cap_grant, handle_caps wrt session mutex ceph: only release unused caps with mds requests ceph: fix mds sync() race with completing requests ceph: fix pg pool decoding from incremental osdmap update ceph: prevent dup stale messages to console for restarting mds ceph: fix connection fault con_work reentrancy problem ceph: rename r_sent_stamp r_stamp ceph: avoid reopening osd connections when address hasn't changed ceph: fix snap rebuild condition ceph: make write_begin wait propagate ERESTARTSYS ceph: propagate mds session allocation failures to caller ceph: fix session check on mds reply ceph: fix possible double-free of mds request reference ceph: avoid loaded term 'OSD' in documention ceph: fix use after free on mds __unregister_request ceph: update discussion list address in MAINTAINERS Srinivas Eeda (1): ocfs2: Fix a race in o2dlm lockres mastery Stanislaw Gruszka (1): posix-cpu-timers: Reset expire cache when no timer is running Stefan Haberland (1): [S390] dasd: check tsb validity 
Stefan Richter (2): firewire: core: fix Model_ID in modalias firewire: core: align driver match with modalias Stefan Weinhuber (1): [S390] dasd: fix alignment of transport mode recovery TCW Steve Glendinning (1): smsc95xx: Fix tx checksum offload for small packets Steven J. Magnani (1): NET_DMA: free skbs periodically Steven Rostedt (1): ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align Suresh Siddha (1): x86: Handle legacy PIC interrupts on all the cpu's Takashi Iwai (3): ALSA: hda - Sort codec entry list of Nvidia HDMI ALSA: hda - Fix access-after-free in patch_realtek.c ALSA: hda - Don't set invalid connection index in Realtek initialiaiton Tao Ma (4): ocfs2: Change bg_chain check for ocfs2_validate_gd_parent. ocfs2: Update i_blocks in reflink operations. ocfs2: Fix the update of name_offset when removing xattrs ocfs2: Init meta_ac properly in ocfs2_create_empty_xattr_block. Tejun Heo (1): libata-sff: fix spurious IRQ handling Tetsuo Handa (2): rxrpc: Check allocation failure. rxrpc: Check allocation failure. Theodore Ts'o (1): ext4: Fix spelling of CONTIG_FS_EXT3 to CONFIG_FS_EXT3 Thomas Gleixner (3): genirq: Prevent oneshot irq thread race clockevents: Sanitize min_delta_ns adjustment and prevent overflows genirq: Protect access to irq_desc->action in can_request_irq() Thomas Weber (1): OMAP: DSS2: VRAM: Fix early_param for vram Tim Yamin (1): PCI quirk: only apply CX700 PCI bus parking quirk if external VT6212L is present Timo Teräs (2): ipv4: check rt_genid in dst_check ip_gre: include route header_len in max_headroom calculation Tomi Valkeinen (2): OMAP: DSS2: initialize dss clk sources properly OMAP: DSS2: panel-generic: re-implement mode changing Tristan Ye (2): Ocfs2: Journaling i_flags and i_orphaned_slot when adding inode to orphan dir. Ocfs2: Handle deletion of reflinked oprhan inodes correctly. 
Trond Myklebust (4): NFS: Prevent another deadlock in nfs_release_page() SUNRPC: Fix a potential memory leak in auth_gss SUNRPC: Fix a use after free bug with the NFSv4.1 backchannel SUNRPC: Fix the return value of rpc_run_bc_task() Uwe Kleine-König (1): rtc/mc13783: fix use after free bug Vasu Dev (3): ixgbe: fix for real_num_tx_queues update issue vlan: adds vlan_dev_select_queue vlan: updates vlan real_num_tx_queues Wolfram Sang (2): regulator: fix dangling pointers get_maintainer: repair STDIN usage YOSHIFUJI Hideaki / 吉藤英明 (1): ipv6: Don't drop cache route entry unless timer actually expired. Yegor Yefremov (1): KS8695: update ksp->next_rx_desc_read at the end of rx loop Yinghai Lu (2): x86: Make smp_locks end with page alignment x86: Make sure free_init_pages() frees pages on page boundary Zhenyu Wang (1): drm/i915: Fix check with IS_GEN6 stephen hemminger (1): TCP: check min TTL on received ICMP packets wzt wzt (1): benet: Fix compile warnnings in drivers/net/benet/be_ethtool.c ^ permalink raw reply [flat|nested] 231+ messages in thread
* [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-03-30 17:50 Linux 2.6.34-rc3 Linus Torvalds
@ 2010-03-30 21:16 ` Rafael J. Wysocki
  2010-03-31 20:34   ` [stable] " Greg KH
  2010-04-01  1:13   ` Rafael J. Wysocki
  2010-04-02 17:59 ` Ugly rmap NULL ptr deref oopsie on hibernate (was " Borislav Petkov
  1 sibling, 2 replies; 231+ messages in thread
From: Rafael J. Wysocki @ 2010-03-30 21:16 UTC
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Dave Airlie, dri-devel, Jesse Barnes,
	Linux PCI, Clemens Ladisch, Alex Deucher, stable, Greg KH

On Tuesday 30 March 2010, Linus Torvalds wrote:
...
> Other than that? Random fixes and updates all over. Mostly drivers and
> filesystems, and mostly fairly small things. If you had PCI resource
> conflict problems with the early -rc's due to the _CRS window thing, for
> example, that should hopefully be fixed. See the appended shortlog for
> other details.

...

> Clemens Ladisch (4):
>       firewire: core: fw_iso_resource_manage: fix error handling
>       firewire: ohci: add cycle timer quirk for the TI TSB12LV22
>       ALSA: cmipci: work around invalid PCM pointer
>       PCI quirk: RS780/RS880: work around missing MSI initialization

This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
which happens to have a RS780.

The symptom is that every operation involving the GPU is _very_ slow, so the
window manager eventually disables compositing. Reverting this commit makes
things work flawlessly again.

So, please revert.

BTW, I don't think it's a -stable material.

Thanks,
Rafael

* Re: [stable] [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-03-30 21:16 ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) Rafael J. Wysocki
@ 2010-03-31 20:34   ` Greg KH
  2010-04-01  1:13   ` Rafael J. Wysocki
  1 sibling, 0 replies; 231+ messages in thread
From: Greg KH @ 2010-03-31 20:34 UTC
To: Rafael J. Wysocki
Cc: Linus Torvalds, Linux PCI, Greg KH, Clemens Ladisch,
	Linux Kernel Mailing List, Jesse Barnes, Alex Deucher, dri-devel,
	Dave Airlie, stable

On Tue, Mar 30, 2010 at 11:16:45PM +0200, Rafael J. Wysocki wrote:
> On Tuesday 30 March 2010, Linus Torvalds wrote:
> ...
> > Other than that? Random fixes and updates all over. Mostly drivers and
> > filesystems, and mostly fairly small things. If you had PCI resource
> > conflict problems with the early -rc's due to the _CRS window thing, for
> > example, that should hopefully be fixed. See the appended shortlog for
> > other details.
>
> ...
>
> > Clemens Ladisch (4):
> >       firewire: core: fw_iso_resource_manage: fix error handling
> >       firewire: ohci: add cycle timer quirk for the TI TSB12LV22
> >       ALSA: cmipci: work around invalid PCM pointer
> >       PCI quirk: RS780/RS880: work around missing MSI initialization
>
> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
> which happens to have a RS780.
>
> The symptom is that every operation involving the GPU is _very_ slow, so the
> window manager eventually disables compositing. Reverting this commit makes
> things work flawlessly again.
>
> So, please revert.
>
> BTW, I don't think it's a -stable material.

Ok, I'll go drop it.

thanks,

greg k-h

* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3)
  2010-03-30 21:16 ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) Rafael J. Wysocki
  2010-03-31 20:34   ` [stable] " Greg KH
@ 2010-04-01  1:13   ` Rafael J. Wysocki
  2010-04-01  2:19     ` Alex Deucher
  2010-04-01 16:29     ` Linus Torvalds
  1 sibling, 2 replies; 231+ messages in thread
From: Rafael J. Wysocki @ 2010-04-01 1:13 UTC
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Dave Airlie, dri-devel, Jesse Barnes,
	Linux PCI, Clemens Ladisch, Alex Deucher, stable, Greg KH

On Tuesday 30 March 2010, Rafael J. Wysocki wrote:
> On Tuesday 30 March 2010, Linus Torvalds wrote:
> ...
> > Other than that? Random fixes and updates all over. Mostly drivers and
> > filesystems, and mostly fairly small things. If you had PCI resource
> > conflict problems with the early -rc's due to the _CRS window thing, for
> > example, that should hopefully be fixed. See the appended shortlog for
> > other details.
>
> ...
>
> > Clemens Ladisch (4):
> >       firewire: core: fw_iso_resource_manage: fix error handling
> >       firewire: ohci: add cycle timer quirk for the TI TSB12LV22
> >       ALSA: cmipci: work around invalid PCM pointer
> >       PCI quirk: RS780/RS880: work around missing MSI initialization
>
> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box
> which happens to have a RS780.
>
> The symptom is that every operation involving the GPU is _very_ slow, so the
> window manager eventually disables compositing. Reverting this commit makes
> things work flawlessly again.
>
> So, please revert.
>
> BTW, I don't think it's a -stable material.

OK, I've verified that partial revert (below) is sufficient.

Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: DRM / radeon: Really do not try to enable MSI on RS780 and RS880

Commit a5ee4eb75413c145334c30e43f1af9875dad6fd7
(PCI quirk: RS780/RS880: work around missing MSI initialization)
removed a quirk to disable MSI on RS780 and RS880, which still is
necessary on my Acer Ferrari One, because pci_enable_msi() attempts
to enable the MSI and apparently succeeds despite the PCI quirk
added by that commit. Add the removed radeon quirk again.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/gpu/drm/radeon/radeon_irq_kms.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
===================================================================
--- linux-2.6.orig/drivers/gpu/drm/radeon/radeon_irq_kms.c
+++ linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
@@ -116,7 +116,13 @@ int radeon_irq_kms_init(struct radeon_de
 	}
 	/* enable msi */
 	rdev->msi_enabled = 0;
-	if (rdev->family >= CHIP_RV380) {
+	/* MSIs don't seem to work on my rs780;
+	 * not sure about rs880 or other rs780s.
+	 * Needs more investigation.
+	 */
+	if ((rdev->family >= CHIP_RV380) &&
+	    (rdev->family != CHIP_RS780) &&
+	    (rdev->family != CHIP_RS880)) {
 		int ret = pci_enable_msi(rdev->pdev);
 		if (!ret) {
 			rdev->msi_enabled = 1;

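[ Editorial sketch, not part of the thread: the condition introduced by the
  patch above can be modelled in user space to see which chip families would
  still attempt MSI. The enum below is a hypothetical, heavily shortened
  stand-in for the radeon family list (only the relative order of its few
  entries is assumed), and the function names are made up for illustration.
  Nothing here calls real kernel APIs. ]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical, heavily shortened stand-in for the radeon chip family enum.
 * Only the relative order of these few entries matters for this sketch;
 * the real driver enum has many more members.
 */
enum chip_family {
	CHIP_R300,
	CHIP_RV380,
	CHIP_RS780,
	CHIP_RS880,
	CHIP_RV770,
};

/*
 * Same condition as in the patch: attempt MSI on RV380 and newer,
 * but skip the RS780 and RS880 IGPs. The function name is an
 * assumption made for this example.
 */
static bool radeon_should_try_msi(enum chip_family family)
{
	return family >= CHIP_RV380 &&
	       family != CHIP_RS780 &&
	       family != CHIP_RS880;
}

/* Small self-check covering each branch of the condition. */
static void radeon_msi_policy_selfcheck(void)
{
	assert(!radeon_should_try_msi(CHIP_R300));  /* older than RV380 */
	assert(radeon_should_try_msi(CHIP_RV380));  /* first family allowed */
	assert(!radeon_should_try_msi(CHIP_RS780)); /* excluded by the patch */
	assert(!radeon_should_try_msi(CHIP_RS880)); /* excluded by the patch */
	assert(radeon_should_try_msi(CHIP_RV770));  /* newer, not excluded */
}
```

[ Calling radeon_msi_policy_selfcheck() from a small main() exercises every
  case. In the driver itself the result only decides whether
  pci_enable_msi() is attempted at all; whether that call then succeeds is
  a separate matter. ]
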
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 1:13 ` Rafael J. Wysocki @ 2010-04-01 2:19 ` Alex Deucher 2010-04-01 6:36 ` Clemens Ladisch 2010-04-01 16:29 ` Linus Torvalds 1 sibling, 1 reply; 231+ messages in thread From: Alex Deucher @ 2010-04-01 2:19 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linus Torvalds, Linux PCI, Greg KH, Clemens Ladisch, Linux Kernel Mailing List, Jesse Barnes, Alex Deucher, dri-devel, stable, Dave Airlie [-- Attachment #1: Type: text/plain, Size: 2991 bytes --] On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > On Tuesday 30 March 2010, Rafael J. Wysocki wrote: >> On Tuesday 30 March 2010, Linus Torvalds wrote: >> ... >> > Other than that? Random fixes and updates all over. Mostly drivers and >> > filesystems, and mostly fairly small things. If you had PCI resource >> > conflict problems with the early -rc's due to the _CRS window thing, for >> > example, that should hopefully be fixed. See the appended shortlog for >> > other details. >> >> ... >> >> > Clemens Ladisch (4): >> > firewire: core: fw_iso_resource_manage: fix error handling >> > firewire: ohci: add cycle timer quirk for the TI TSB12LV22 >> > ALSA: cmipci: work around invalid PCM pointer >> > PCI quirk: RS780/RS880: work around missing MSI initialization >> >> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box >> which happens to have a RS780. >> >> The symptom is that every operation involving the GPU is _very_ slow, so the >> window manager eventually disables compositing. Reverting this commit makes >> things work flawlessly again. >> >> So, please revert. >> >> BTW, I don't think it's a -stable material. > > OK, I've verified that partial revert (below) is sufficient. > > Rafael > > --- > From: Rafael J. 
Wysocki <rjw@sisk.pl> > Subject: DRM / radeon: Really do not try to enable MSI on RS780 and RS880 > > Commit a5ee4eb75413c145334c30e43f1af9875dad6fd7 > (PCI quirk: RS780/RS880: work around missing MSI initialization) > removed a quirk to disable MSI on RS780 and RS880, which still is > necessary on my Acer Ferrari One, because pci_enable_msi() attempts > to enable the MSI and apparently succeeds despite the PCI quirk > added by that commit. Add the removed radeon quirk again. > > Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> > --- > drivers/gpu/drm/radeon/radeon_irq_kms.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > Index: linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c > =================================================================== > --- linux-2.6.orig/drivers/gpu/drm/radeon/radeon_irq_kms.c > +++ linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c > @@ -116,7 +116,13 @@ int radeon_irq_kms_init(struct radeon_de > } > /* enable msi */ > rdev->msi_enabled = 0; > - if (rdev->family >= CHIP_RV380) { > + /* MSIs don't seem to work on my rs780; > + * not sure about rs880 or other rs780s. > + * Needs more investigation. > + */ > + if ((rdev->family >= CHIP_RV380) && > + (rdev->family != CHIP_RS780) && > + (rdev->family != CHIP_RS880)) { > int ret = pci_enable_msi(rdev->pdev); > if (!ret) { > rdev->msi_enabled = 1; I also have the attached patch queued in via Dave's tree to disable MSI on all IGP chips for the time being. Alex [-- Attachment #2: 0001-drm-radeon-kms-disable-MSI-on-IGP-chips.patch --] [-- Type: application/mbox, Size: 1248 bytes --] ^ permalink raw reply [flat|nested] 231+ messages in thread
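For readers comparing the two radeon-side checks discussed so far, here is a minimal userspace sketch, not driver code. It assumes a reduced, hypothetical `model_family` enum (only the relative order of its values matters, and that order is an assumption) and a hypothetical `MODEL_IS_IGP` flag standing in for `RADEON_IS_IGP`. The first predicate mirrors the family-based condition in the quoted patch; the second mirrors the broader idea of skipping MSI on every integrated chip.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-ins for this sketch only. The real driver
 * has many more families; the ordering used here is an assumption. */
enum model_family {
	MODEL_R300,
	MODEL_RV380,
	MODEL_RS780,
	MODEL_RS880,
	MODEL_RV770
};

#define MODEL_IS_IGP 0x1u	/* stand-in for an "integrated chip" flag */

/* Family-based variant: try MSI on RV380 and newer, except RS780/RS880. */
static bool model_msi_by_family(enum model_family family)
{
	return family >= MODEL_RV380 &&
	       family != MODEL_RS780 &&
	       family != MODEL_RS880;
}

/* Flag-based variant: try MSI on RV380 and newer, except any IGP. */
static bool model_msi_by_igp_flag(enum model_family family, unsigned int flags)
{
	return family >= MODEL_RV380 && !(flags & MODEL_IS_IGP);
}
```

With these stand-ins, `assert(!model_msi_by_igp_flag(MODEL_RS780, MODEL_IS_IGP));` holds. The flag-based check is the broader of the two, since it would also cover integrated chips that the family list does not name.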
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 2:19 ` Alex Deucher @ 2010-04-01 6:36 ` Clemens Ladisch 2010-04-01 15:01 ` Alex Deucher 0 siblings, 1 reply; 231+ messages in thread From: Clemens Ladisch @ 2010-04-01 6:36 UTC (permalink / raw) To: Alex Deucher Cc: Rafael J. Wysocki, Linus Torvalds, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie Alex Deucher wrote: > On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: >> On Tuesday 30 March 2010, Rafael J. Wysocki wrote: >>> > PCI quirk: RS780/RS880: work around missing MSI initialization >>> >>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box >>> which happens to have a RS780. So it's better to disable MSI unconditionally. Rafael, can you check if MSI works for the HDMI audio device? (I'd guess it doesn't.) > I also have the attached patch queued in via Dave's tree to disable > MSI on all IGP chips for the time being. This disables MSI only for the graphics device. I'd prefer to have the quirk on its bridge so that MSI gets disabled for the HDMI audio device too, to avoid having to duplicate this quirk in the snd-hda-intel driver. ========== PCI quirk: RS780/RS880: disable MSI completely The missing initialization of the nb_cntl.strap_msi_enable does not seem to be the only problem that prevents MSI, so that quirk is not sufficient to enable MSI on all machines. To be safe, unconditionally disable MSI for the internal graphics and HDMI audio on these chipsets. 
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>

--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -2123,6 +2123,8 @@ static void __devinit quirk_disable_msi(struct pci_dev *dev)
 	}
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi);
 
 /* Go through the list of Hypertransport capabilities and
@@ -2495,39 +2497,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4374,
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375,
 			quirk_msi_intx_disable_bug);
 
-/*
- * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
- * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
- */
-static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
-{
-	u32 nb_cntl;
-
-	if (!int_gfx_bridge->subordinate)
-		return;
-
-	pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				   0x60, 0);
-	pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				  0x64, &nb_cntl);
-
-	if (!(nb_cntl & BIT(10))) {
-		dev_warn(&int_gfx_bridge->dev,
-			 FW_WARN "RS780: MSI for internal graphics disabled\n");
-		int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
-	}
-}
-
-#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX	0x9602
-
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD,
-			PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
-			rs780_int_gfx_disable_msi);
-/* wrong vendor ID on M4A785TD motherboard: */
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK,
-			PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
-			rs780_int_gfx_disable_msi);
-
 #endif /* CONFIG_PCI_MSI */
 
 #ifdef CONFIG_PCI_IOV

^ permalink raw reply [flat|nested] 231+ messages in thread
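The argument for putting the quirk on the bridge rather than in each driver can be illustrated with a small userspace model. This is only a sketch: `model_bus`, `model_dev`, and `MODEL_BUS_FLAGS_NO_MSI` are hypothetical stand-ins invented here, not the kernel's `struct pci_bus`, `struct pci_dev`, or `PCI_BUS_FLAGS_NO_MSI`. It assumes, as the removed code in the patch suggests, that the quirk flags the bus behind the bridge, and that a device is refused MSI when any bus above it carries that flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BUS_FLAGS_NO_MSI 0x1u	/* hypothetical stand-in flag */

struct model_bus {
	struct model_bus *parent;	/* NULL for the root bus */
	unsigned int bus_flags;
};

struct model_dev {
	struct model_bus *bus;		/* bus the device sits on */
	struct model_bus *subordinate;	/* secondary bus, if this is a bridge */
};

/* Models the idea of a bridge quirk: flag the bus behind the bridge. */
static void model_quirk_disable_msi(struct model_dev *bridge)
{
	assert(bridge != NULL);
	if (bridge->subordinate)
		bridge->subordinate->bus_flags |= MODEL_BUS_FLAGS_NO_MSI;
}

/* A device may use MSI only if no bus between it and the root is flagged. */
static bool model_msi_allowed(const struct model_dev *dev)
{
	const struct model_bus *bus;

	assert(dev != NULL);
	for (bus = dev->bus; bus != NULL; bus = bus->parent)
		if (bus->bus_flags & MODEL_BUS_FLAGS_NO_MSI)
			return false;
	return true;
}
```

In this model a single call on the bridge turns MSI off for every device on the bus behind it (graphics and HDMI audio alike), while devices elsewhere in the tree are unaffected. That is the design point made in the message above: one quirk, no per-driver duplication.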
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 6:36 ` Clemens Ladisch @ 2010-04-01 15:01 ` Alex Deucher 2010-04-01 20:28 ` Rafael J. Wysocki 0 siblings, 1 reply; 231+ messages in thread From: Alex Deucher @ 2010-04-01 15:01 UTC (permalink / raw) To: Clemens Ladisch Cc: Rafael J. Wysocki, Linus Torvalds, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote: > Alex Deucher wrote: >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote: >>>> > PCI quirk: RS780/RS880: work around missing MSI initialization >>>> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box >>>> which happens to have a RS780. > > So it's better to disable MSI unconditionally. > > Rafael, can you check if MSI works for the HDMI audio device? > (I'd guess it doesn't.) > >> I also have the attached patch queued in via Dave's tree to disable >> MSI on all IGP chips for the time being. > > This disables MSI only for the graphics device. I'd prefer to have > the quirk on its bridge so that MSI gets disabled for the HDMI audio > device too, to avoid having to duplicate this quirk in the snd-hda-intel > driver. > > ========== > > PCI quirk: RS780/RS880: disable MSI completely > > The missing initialization of the nb_cntl.strap_msi_enable does not seem > to be the only problem that prevents MSI, so that quirk is not > sufficient to enable MSI on all machines. To be safe, unconditionally > disable MSI for the internal graphics and HDMI audio on these chipsets. > > Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Works fine here. 
Tested-by: Alex Deucher <alexdeucher@gmail.com> > > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -2123,6 +2123,8 @@ static void __devinit quirk_disable_msi(struct pci_dev *dev) > } > } > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi); > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi); > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi); > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi); > > /* Go through the list of Hypertransport capabilities and > @@ -2495,39 +2497,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4374, > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375, > quirk_msi_intx_disable_bug); > > -/* > - * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio > - * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit. > - */ > -static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge) > -{ > - u32 nb_cntl; > - > - if (!int_gfx_bridge->subordinate) > - return; > - > - pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0), > - 0x60, 0); > - pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0), > - 0x64, &nb_cntl); > - > - if (!(nb_cntl & BIT(10))) { > - dev_warn(&int_gfx_bridge->dev, > - FW_WARN "RS780: MSI for internal graphics disabled\n"); > - int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI; > - } > -} > - > -#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX 0x9602 > - > -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, > - PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX, > - rs780_int_gfx_disable_msi); > -/* wrong vendor ID on M4A785TD motherboard: */ > -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, > - PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX, > - rs780_int_gfx_disable_msi); > - > #endif /* CONFIG_PCI_MSI */ > > #ifdef CONFIG_PCI_IOV > ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 15:01 ` Alex Deucher @ 2010-04-01 20:28 ` Rafael J. Wysocki 2010-04-01 20:39 ` Alex Deucher 0 siblings, 1 reply; 231+ messages in thread From: Rafael J. Wysocki @ 2010-04-01 20:28 UTC (permalink / raw) To: Alex Deucher Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie On Thursday 01 April 2010, Alex Deucher wrote: > On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote: > > Alex Deucher wrote: > >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote: > >>>> > PCI quirk: RS780/RS880: work around missing MSI initialization > >>>> > >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box > >>>> which happens to have a RS780. > > > > So it's better to disable MSI unconditionally. > > > > Rafael, can you check if MSI works for the HDMI audio device? > > (I'd guess it doesn't.) > > > >> I also have the attached patch queued in via Dave's tree to disable > >> MSI on all IGP chips for the time being. > > > > This disables MSI only for the graphics device. I'd prefer to have > > the quirk on its bridge so that MSI gets disabled for the HDMI audio > > device too, to avoid having to duplicate this quirk in the snd-hda-intel > > driver. > > > > ========== > > > > PCI quirk: RS780/RS880: disable MSI completely > > > > The missing initialization of the nb_cntl.strap_msi_enable does not seem > > to be the only problem that prevents MSI, so that quirk is not > > sufficient to enable MSI on all machines. To be safe, unconditionally > > disable MSI for the internal graphics and HDMI audio on these chipsets. > > > > Signed-off-by: Clemens Ladisch <clemens@ladisch.de> > > Works fine here. 
>
> Tested-by: Alex Deucher <alexdeucher@gmail.com>

Unfortunately it doesn't work for me without the

	if ((rdev->family >= CHIP_RV380) &&
	    (!(rdev->flags & RADEON_IS_IGP)))

radeon quirk.

Thanks,
Rafael

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 20:28 ` Rafael J. Wysocki @ 2010-04-01 20:39 ` Alex Deucher 2010-04-01 20:48 ` Rafael J. Wysocki 0 siblings, 1 reply; 231+ messages in thread From: Alex Deucher @ 2010-04-01 20:39 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > On Thursday 01 April 2010, Alex Deucher wrote: >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote: >> > Alex Deucher wrote: >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote: >> >>>> > PCI quirk: RS780/RS880: work around missing MSI initialization >> >>>> >> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box >> >>>> which happens to have a RS780. >> > >> > So it's better to disable MSI unconditionally. >> > >> > Rafael, can you check if MSI works for the HDMI audio device? >> > (I'd guess it doesn't.) >> > >> >> I also have the attached patch queued in via Dave's tree to disable >> >> MSI on all IGP chips for the time being. >> > >> > This disables MSI only for the graphics device. I'd prefer to have >> > the quirk on its bridge so that MSI gets disabled for the HDMI audio >> > device too, to avoid having to duplicate this quirk in the snd-hda-intel >> > driver. >> > >> > ========== >> > >> > PCI quirk: RS780/RS880: disable MSI completely >> > >> > The missing initialization of the nb_cntl.strap_msi_enable does not seem >> > to be the only problem that prevents MSI, so that quirk is not >> > sufficient to enable MSI on all machines. To be safe, unconditionally >> > disable MSI for the internal graphics and HDMI audio on these chipsets. 
>> >
>> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
>>
>> Works fine here.
>>
>> Tested-by: Alex Deucher <alexdeucher@gmail.com>
>
> Unfortunately it doesn't work for me without the
>
> 	if ((rdev->family >= CHIP_RV380) &&
> 	    (!(rdev->flags & RADEON_IS_IGP)))
>
> radeon quirk.

what are your pci ids?

Alex

>
> Thanks,
> Rafael
>

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 20:39 ` Alex Deucher @ 2010-04-01 20:48 ` Rafael J. Wysocki 2010-04-01 21:00 ` Alex Deucher 2010-04-01 21:01 ` Alex Deucher 0 siblings, 2 replies; 231+ messages in thread From: Rafael J. Wysocki @ 2010-04-01 20:48 UTC (permalink / raw) To: Alex Deucher Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie On Thursday 01 April 2010, Alex Deucher wrote: > On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > > On Thursday 01 April 2010, Alex Deucher wrote: > >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote: > >> > Alex Deucher wrote: > >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote: > >> >>>> > PCI quirk: RS780/RS880: work around missing MSI initialization > >> >>>> > >> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box > >> >>>> which happens to have a RS780. > >> > > >> > So it's better to disable MSI unconditionally. > >> > > >> > Rafael, can you check if MSI works for the HDMI audio device? > >> > (I'd guess it doesn't.) > >> > > >> >> I also have the attached patch queued in via Dave's tree to disable > >> >> MSI on all IGP chips for the time being. > >> > > >> > This disables MSI only for the graphics device. I'd prefer to have > >> > the quirk on its bridge so that MSI gets disabled for the HDMI audio > >> > device too, to avoid having to duplicate this quirk in the snd-hda-intel > >> > driver. > >> > > >> > ========== > >> > > >> > PCI quirk: RS780/RS880: disable MSI completely > >> > > >> > The missing initialization of the nb_cntl.strap_msi_enable does not seem > >> > to be the only problem that prevents MSI, so that quirk is not > >> > sufficient to enable MSI on all machines. 
To be safe, unconditionally
> >> > disable MSI for the internal graphics and HDMI audio on these chipsets.
> >> >
> >> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
> >>
> >> Works fine here.
> >>
> >> Tested-by: Alex Deucher <alexdeucher@gmail.com>
> >
> > Unfortunately it doesn't work for me without the
> >
> > 	if ((rdev->family >= CHIP_RV380) &&
> > 	    (!(rdev->flags & RADEON_IS_IGP)))
> >
> > radeon quirk.
>
> what are your pci ids?

1022:960b

I guess 1022 is AMD.

OK, I'll try to add that.

Rafael

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 20:48 ` Rafael J. Wysocki @ 2010-04-01 21:00 ` Alex Deucher 2010-04-01 21:01 ` Alex Deucher 1 sibling, 0 replies; 231+ messages in thread From: Alex Deucher @ 2010-04-01 21:00 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie On Thu, Apr 1, 2010 at 4:48 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > On Thursday 01 April 2010, Alex Deucher wrote: >> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: >> > On Thursday 01 April 2010, Alex Deucher wrote: >> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote: >> >> > Alex Deucher wrote: >> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: >> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote: >> >> >>>> > PCI quirk: RS780/RS880: work around missing MSI initialization >> >> >>>> >> >> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box >> >> >>>> which happens to have a RS780. >> >> > >> >> > So it's better to disable MSI unconditionally. >> >> > >> >> > Rafael, can you check if MSI works for the HDMI audio device? >> >> > (I'd guess it doesn't.) >> >> > >> >> >> I also have the attached patch queued in via Dave's tree to disable >> >> >> MSI on all IGP chips for the time being. >> >> > >> >> > This disables MSI only for the graphics device. I'd prefer to have >> >> > the quirk on its bridge so that MSI gets disabled for the HDMI audio >> >> > device too, to avoid having to duplicate this quirk in the snd-hda-intel >> >> > driver. 
>> >> >
>> >> > ==========
>> >> >
>> >> > PCI quirk: RS780/RS880: disable MSI completely
>> >> >
>> >> > The missing initialization of the nb_cntl.strap_msi_enable does not seem
>> >> > to be the only problem that prevents MSI, so that quirk is not
>> >> > sufficient to enable MSI on all machines. To be safe, unconditionally
>> >> > disable MSI for the internal graphics and HDMI audio on these chipsets.
>> >> >
>> >> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
>> >>
>> >> Works fine here.
>> >>
>> >> Tested-by: Alex Deucher <alexdeucher@gmail.com>
>> >
>> > Unfortunately it doesn't work for me without the
>> >
>> > 	if ((rdev->family >= CHIP_RV380) &&
>> > 	    (!(rdev->flags & RADEON_IS_IGP)))
>> >
>> > radeon quirk.
>>
>> what are your pci ids?
>
> 1022:960b
>
> I guess 1022 is AMD.
>
> OK, I'll try to add that.

0x960b won't affect the internal gfx. That bridge is for the pcie x16 gfx slot.

0x9600  Host bridge
0x9602  Internal GFX PCI-PCI bridge ID
0x9603  External GFX - port 0
0x960B  External GFX - port 1
0x9604  PCI-PCI bridge - Port 0
0x9605  PCI-PCI bridge - Port 1
0x9606  PCI-PCI bridge - Port 2
0x9607  PCI-PCI bridge - Port 3
0x9608  PCI-PCI bridge - Port 4
0x9609  PCI-PCI bridge - Port 5
0x960A  PCI-PCI bridge (SB)
0x960F  HD Audio controller
0x791A  HDMI Audio codec

Alex

>
> Rafael
>

^ permalink raw reply [flat|nested] 231+ messages in thread
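The device ID list in the message above lends itself to a small lookup table. The following is a userspace sketch only: the IDs and descriptions are copied from that list, while the `rs780_id` struct and the two helper functions are hypothetical names introduced here for illustration.

```c
#include <assert.h>	/* for caller-side checks like the one in the note below */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct rs780_id {
	uint16_t device;
	const char *role;
};

/* IDs and descriptions as listed in the message above. */
static const struct rs780_id rs780_ids[] = {
	{ 0x9600, "Host bridge" },
	{ 0x9602, "Internal GFX PCI-PCI bridge ID" },
	{ 0x9603, "External GFX - port 0" },
	{ 0x960B, "External GFX - port 1" },
	{ 0x9604, "PCI-PCI bridge - Port 0" },
	{ 0x9605, "PCI-PCI bridge - Port 1" },
	{ 0x9606, "PCI-PCI bridge - Port 2" },
	{ 0x9607, "PCI-PCI bridge - Port 3" },
	{ 0x9608, "PCI-PCI bridge - Port 4" },
	{ 0x9609, "PCI-PCI bridge - Port 5" },
	{ 0x960A, "PCI-PCI bridge (SB)" },
	{ 0x960F, "HD Audio controller" },
	{ 0x791A, "HDMI Audio codec" },
};

/* Returns the listed description, or NULL if the ID is not in the list. */
static const char *rs780_role(uint16_t device)
{
	size_t i;

	for (i = 0; i < sizeof(rs780_ids) / sizeof(rs780_ids[0]); i++)
		if (rs780_ids[i].device == device)
			return rs780_ids[i].role;
	return NULL;
}

/* True only for the bridge that the MSI quirk in this thread keys on. */
static bool rs780_is_internal_gfx_bridge(uint16_t device)
{
	return device == 0x9602;
}
```

For example, `assert(!rs780_is_internal_gfx_bridge(0x960B));` holds, which matches the point made above: a quirk keyed on 0x960B would sit on the external graphics port and would not reach the internal graphics.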
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 20:48 ` Rafael J. Wysocki 2010-04-01 21:00 ` Alex Deucher @ 2010-04-01 21:01 ` Alex Deucher 2010-04-01 21:08 ` Rafael J. Wysocki 1 sibling, 1 reply; 231+ messages in thread From: Alex Deucher @ 2010-04-01 21:01 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie On Thu, Apr 1, 2010 at 4:48 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > On Thursday 01 April 2010, Alex Deucher wrote: >> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: >> > On Thursday 01 April 2010, Alex Deucher wrote: >> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote: >> >> > Alex Deucher wrote: >> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: >> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote: >> >> >>>> > PCI quirk: RS780/RS880: work around missing MSI initialization >> >> >>>> >> >> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box >> >> >>>> which happens to have a RS780. >> >> > >> >> > So it's better to disable MSI unconditionally. >> >> > >> >> > Rafael, can you check if MSI works for the HDMI audio device? >> >> > (I'd guess it doesn't.) >> >> > >> >> >> I also have the attached patch queued in via Dave's tree to disable >> >> >> MSI on all IGP chips for the time being. >> >> > >> >> > This disables MSI only for the graphics device. I'd prefer to have >> >> > the quirk on its bridge so that MSI gets disabled for the HDMI audio >> >> > device too, to avoid having to duplicate this quirk in the snd-hda-intel >> >> > driver. 
>> >> >
>> >> > ==========
>> >> >
>> >> > PCI quirk: RS780/RS880: disable MSI completely
>> >> >
>> >> > The missing initialization of the nb_cntl.strap_msi_enable does not seem
>> >> > to be the only problem that prevents MSI, so that quirk is not
>> >> > sufficient to enable MSI on all machines. To be safe, unconditionally
>> >> > disable MSI for the internal graphics and HDMI audio on these chipsets.
>> >> >
>> >> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
>> >>
>> >> Works fine here.
>> >>
>> >> Tested-by: Alex Deucher <alexdeucher@gmail.com>
>> >
>> > Unfortunately it doesn't work for me without the
>> >
>> > 	if ((rdev->family >= CHIP_RV380) &&
>> > 	    (!(rdev->flags & RADEON_IS_IGP)))
>> >
>> > radeon quirk.
>>
>> what are your pci ids?
>
> 1022:960b
>
> I guess 1022 is AMD.
>
> OK, I'll try to add that.

It's possible your oem has the wrong vendor id for the 0x9602 bridge.

Alex

>
> Rafael
>

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 21:01 ` Alex Deucher @ 2010-04-01 21:08 ` Rafael J. Wysocki 2010-04-01 21:13 ` Alex Deucher 0 siblings, 1 reply; 231+ messages in thread From: Rafael J. Wysocki @ 2010-04-01 21:08 UTC (permalink / raw) To: Alex Deucher Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie On Thursday 01 April 2010, Alex Deucher wrote: > On Thu, Apr 1, 2010 at 4:48 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > > On Thursday 01 April 2010, Alex Deucher wrote: > >> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > >> > On Thursday 01 April 2010, Alex Deucher wrote: > >> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote: > >> >> > Alex Deucher wrote: > >> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > >> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote: > >> >> >>>> > PCI quirk: RS780/RS880: work around missing MSI initialization > >> >> >>>> > >> >> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box > >> >> >>>> which happens to have a RS780. > >> >> > > >> >> > So it's better to disable MSI unconditionally. > >> >> > > >> >> > Rafael, can you check if MSI works for the HDMI audio device? > >> >> > (I'd guess it doesn't.) > >> >> > > >> >> >> I also have the attached patch queued in via Dave's tree to disable > >> >> >> MSI on all IGP chips for the time being. > >> >> > > >> >> > This disables MSI only for the graphics device. I'd prefer to have > >> >> > the quirk on its bridge so that MSI gets disabled for the HDMI audio > >> >> > device too, to avoid having to duplicate this quirk in the snd-hda-intel > >> >> > driver. 
> >> >> > > >> >> > ========== > >> >> > > >> >> > PCI quirk: RS780/RS880: disable MSI completely > >> >> > > >> >> > The missing initialization of the nb_cntl.strap_msi_enable does not seem > >> >> > to be the only problem that prevents MSI, so that quirk is not > >> >> > sufficient to enable MSI on all machines. To be safe, unconditionally > >> >> > disable MSI for the internal graphics and HDMI audio on these chipsets. > >> >> > > >> >> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de> > >> >> > >> >> Works fine here. > >> >> > >> >> Tested-by: Alex Deucher <alexdeucher@gmail.com> > >> > > >> > Unfortunately it doesn't work for me without the > >> > > >> > if ((rdev->family >= CHIP_RV380) && > >> > (!(rdev->flags & RADEON_IS_IGP))) > >> > > >> > radeon quirk. > >> > >> what are your pci ids? > > > > 1022:960b > > > > I guess 1022 is AMD. > > > > OK, I'll try to add that. > > It's possible your oem has the wrong vendor id for the 0x9602 bridge. Yes, the patch below works. Thanks, Rafael --- drivers/gpu/drm/radeon/radeon_irq_kms.c | 3 -- drivers/pci/quirks.c | 36 ++------------------------------ 2 files changed, 4 insertions(+), 35 deletions(-) Index: linux-2.6/drivers/pci/quirks.c =================================================================== --- linux-2.6.orig/drivers/pci/quirks.c +++ linux-2.6/drivers/pci/quirks.c @@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi( } } DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi); DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi); /* Go through the list of Hypertransport capabilities and @@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375, 
 			quirk_msi_intx_disable_bug);
 
-/*
- * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio
- * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit.
- */
-static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge)
-{
-	u32 nb_cntl;
-
-	if (!int_gfx_bridge->subordinate)
-		return;
-
-	pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				   0x60, 0);
-	pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0),
-				  0x64, &nb_cntl);
-
-	if (!(nb_cntl & BIT(10))) {
-		dev_warn(&int_gfx_bridge->dev,
-			 FW_WARN "RS780: MSI for internal graphics disabled\n");
-		int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
-	}
-}
-
-#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX	0x9602
-
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD,
-			PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
-			rs780_int_gfx_disable_msi);
-/* wrong vendor ID on M4A785TD motherboard: */
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK,
-			PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX,
-			rs780_int_gfx_disable_msi);
-
 #endif /* CONFIG_PCI_MSI */
 
 #ifdef CONFIG_PCI_IOV
Index: linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
===================================================================
--- linux-2.6.orig/drivers/gpu/drm/radeon/radeon_irq_kms.c
+++ linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c
@@ -117,8 +117,7 @@ int radeon_irq_kms_init(struct radeon_de
 	/* MSIs don't seem to work reliably on all IGP
 	 * chips. Disable MSI on them for now.
 	 */
-	if ((rdev->family >= CHIP_RV380) &&
-	    (!(rdev->flags & RADEON_IS_IGP))) {
+	if (rdev->family >= CHIP_RV380) {
 		int ret = pci_enable_msi(rdev->pdev);
 		if (!ret) {
 			rdev->msi_enabled = 1;

^ permalink raw reply [flat|nested] 231+ messages in thread
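Why the earlier version of the quirk had no effect on this machine can be shown with a small model of how a fixup entry is selected: the entry applies only when both the vendor and the device ID match. This is a hedged userspace sketch, not the kernel's fixup machinery. The `model_fixup` struct and `model_fixup_matches()` are hypothetical, and the numeric vendor values for ASUSTEK and AI are quoted from memory of `include/linux/pci_ids.h`, so treat them as assumptions (only 0x1022 for AMD is stated in the thread).

```c
#include <assert.h>	/* for caller-side checks like the one in the note below */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed numeric values; only the AMD one is confirmed by the thread. */
#define MODEL_VENDOR_AMD	0x1022
#define MODEL_VENDOR_ASUSTEK	0x1043
#define MODEL_VENDOR_AI		0x1025

struct model_fixup {
	uint16_t vendor;
	uint16_t device;
};

/* The three entries the patch above registers for the 0x9602 bridge. */
static const struct model_fixup msi_off_bridges[] = {
	{ MODEL_VENDOR_AMD,     0x9602 },
	{ MODEL_VENDOR_ASUSTEK, 0x9602 },
	{ MODEL_VENDOR_AI,      0x9602 },
};

/* An entry applies only if vendor and device both match exactly. */
static bool model_fixup_matches(uint16_t vendor, uint16_t device)
{
	size_t i;

	for (i = 0; i < sizeof(msi_off_bridges) / sizeof(msi_off_bridges[0]); i++)
		if (msi_off_bridges[i].vendor == vendor &&
		    msi_off_bridges[i].device == device)
			return true;
	return false;
}
```

Under these assumptions, `assert(!model_fixup_matches(0x1022, 0x960b));` holds: the 1022:960b device reported earlier is not the bridge the quirk targets. A 0x9602 bridge that reports an OEM vendor ID matches only once an entry for that vendor exists, which is what the added `PCI_VENDOR_ID_AI` line provides.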
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 21:08 ` Rafael J. Wysocki @ 2010-04-01 21:13 ` Alex Deucher 2010-04-01 21:46 ` Rafael J. Wysocki 0 siblings, 1 reply; 231+ messages in thread From: Alex Deucher @ 2010-04-01 21:13 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie On Thu, Apr 1, 2010 at 5:08 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > On Thursday 01 April 2010, Alex Deucher wrote: >> On Thu, Apr 1, 2010 at 4:48 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: >> > On Thursday 01 April 2010, Alex Deucher wrote: >> >> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: >> >> > On Thursday 01 April 2010, Alex Deucher wrote: >> >> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote: >> >> >> > Alex Deucher wrote: >> >> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: >> >> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote: >> >> >> >>>> > PCI quirk: RS780/RS880: work around missing MSI initialization >> >> >> >>>> >> >> >> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box >> >> >> >>>> which happens to have a RS780. >> >> >> > >> >> >> > So it's better to disable MSI unconditionally. >> >> >> > >> >> >> > Rafael, can you check if MSI works for the HDMI audio device? >> >> >> > (I'd guess it doesn't.) >> >> >> > >> >> >> >> I also have the attached patch queued in via Dave's tree to disable >> >> >> >> MSI on all IGP chips for the time being. >> >> >> > >> >> >> > This disables MSI only for the graphics device. I'd prefer to have >> >> >> > the quirk on its bridge so that MSI gets disabled for the HDMI audio >> >> >> > device too, to avoid having to duplicate this quirk in the snd-hda-intel >> >> >> > driver. 
>> >> >> > >> >> >> > ========== >> >> >> > >> >> >> > PCI quirk: RS780/RS880: disable MSI completely >> >> >> > >> >> >> > The missing initialization of the nb_cntl.strap_msi_enable does not seem >> >> >> > to be the only problem that prevents MSI, so that quirk is not >> >> >> > sufficient to enable MSI on all machines. To be safe, unconditionally >> >> >> > disable MSI for the internal graphics and HDMI audio on these chipsets. >> >> >> > >> >> >> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de> >> >> >> >> >> >> Works fine here. >> >> >> >> >> >> Tested-by: Alex Deucher <alexdeucher@gmail.com> >> >> > >> >> > Unfortunately it doesn't work for me without the >> >> > >> >> > if ((rdev->family >= CHIP_RV380) && >> >> > (!(rdev->flags & RADEON_IS_IGP))) >> >> > >> >> > radeon quirk. >> >> >> >> what are your pci ids? >> > >> > 1022:960b >> > >> > I guess 1022 is AMD. >> > >> > OK, I'll try to add that. >> >> It's possible your oem has the wrong vendor id for the 0x9602 bridge. > > Yes, the patch below works. 
> > Thanks, > Rafael > > > --- > drivers/gpu/drm/radeon/radeon_irq_kms.c | 3 -- > drivers/pci/quirks.c | 36 ++------------------------------ > 2 files changed, 4 insertions(+), 35 deletions(-) > > Index: linux-2.6/drivers/pci/quirks.c > =================================================================== > --- linux-2.6.orig/drivers/pci/quirks.c > +++ linux-2.6/drivers/pci/quirks.c > @@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi( > } > } > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi); > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi); > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi); > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi); > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi); > > /* Go through the list of Hypertransport capabilities and > @@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375, > quirk_msi_intx_disable_bug); > > -/* > - * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio > - * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit. 
> - */ > -static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge) > -{ > - u32 nb_cntl; > - > - if (!int_gfx_bridge->subordinate) > - return; > - > - pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0), > - 0x60, 0); > - pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0), > - 0x64, &nb_cntl); > - > - if (!(nb_cntl & BIT(10))) { > - dev_warn(&int_gfx_bridge->dev, > - FW_WARN "RS780: MSI for internal graphics disabled\n"); > - int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI; > - } > -} > - > -#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX 0x9602 > - > -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, > - PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX, > - rs780_int_gfx_disable_msi); > -/* wrong vendor ID on M4A785TD motherboard: */ > -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, > - PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX, > - rs780_int_gfx_disable_msi); > - > #endif /* CONFIG_PCI_MSI */ > > #ifdef CONFIG_PCI_IOV > Index: linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c > =================================================================== > --- linux-2.6.orig/drivers/gpu/drm/radeon/radeon_irq_kms.c > +++ linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c > @@ -117,8 +117,7 @@ int radeon_irq_kms_init(struct radeon_de > /* MSIs don't seem to work reliably on all IGP > * chips. Disable MSI on them for now. > */ > - if ((rdev->family >= CHIP_RV380) && > - (!(rdev->flags & RADEON_IS_IGP))) { > + if (rdev->family >= CHIP_RV380) { > int ret = pci_enable_msi(rdev->pdev); > if (!ret) { > rdev->msi_enabled = 1; > Let's skip this second chunk for now as there are other non-RS780 IGP chips that could be problematic, so I'd rather just leave MSIs disabled for now. Alex ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 21:13 ` Alex Deucher @ 2010-04-01 21:46 ` Rafael J. Wysocki 2010-04-01 22:07 ` Alex Deucher 0 siblings, 1 reply; 231+ messages in thread From: Rafael J. Wysocki @ 2010-04-01 21:46 UTC (permalink / raw) To: Alex Deucher Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie On Thursday 01 April 2010, Alex Deucher wrote: > On Thu, Apr 1, 2010 at 5:08 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > > On Thursday 01 April 2010, Alex Deucher wrote: > >> On Thu, Apr 1, 2010 at 4:48 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > >> > On Thursday 01 April 2010, Alex Deucher wrote: > >> >> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > >> >> > On Thursday 01 April 2010, Alex Deucher wrote: > >> >> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote: > >> >> >> > Alex Deucher wrote: > >> >> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > >> >> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote: > >> >> >> >>>> > PCI quirk: RS780/RS880: work around missing MSI initialization > >> >> >> >>>> > >> >> >> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box > >> >> >> >>>> which happens to have a RS780. > >> >> >> > > >> >> >> > So it's better to disable MSI unconditionally. > >> >> >> > > >> >> >> > Rafael, can you check if MSI works for the HDMI audio device? > >> >> >> > (I'd guess it doesn't.) > >> >> >> > > >> >> >> >> I also have the attached patch queued in via Dave's tree to disable > >> >> >> >> MSI on all IGP chips for the time being. > >> >> >> > > >> >> >> > This disables MSI only for the graphics device. 
I'd prefer to have > >> >> >> > the quirk on its bridge so that MSI gets disabled for the HDMI audio > >> >> >> > device too, to avoid having to duplicate this quirk in the snd-hda-intel > >> >> >> > driver. > >> >> >> > > >> >> >> > ========== > >> >> >> > > >> >> >> > PCI quirk: RS780/RS880: disable MSI completely > >> >> >> > > >> >> >> > The missing initialization of the nb_cntl.strap_msi_enable does not seem > >> >> >> > to be the only problem that prevents MSI, so that quirk is not > >> >> >> > sufficient to enable MSI on all machines. To be safe, unconditionally > >> >> >> > disable MSI for the internal graphics and HDMI audio on these chipsets. > >> >> >> > > >> >> >> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de> > >> >> >> > >> >> >> Works fine here. > >> >> >> > >> >> >> Tested-by: Alex Deucher <alexdeucher@gmail.com> > >> >> > > >> >> > Unfortunately it doesn't work for me without the > >> >> > > >> >> > if ((rdev->family >= CHIP_RV380) && > >> >> > (!(rdev->flags & RADEON_IS_IGP))) > >> >> > > >> >> > radeon quirk. > >> >> > >> >> what are your pci ids? > >> > > >> > 1022:960b > >> > > >> > I guess 1022 is AMD. > >> > > >> > OK, I'll try to add that. > >> > >> It's possible your oem has the wrong vendor id for the 0x9602 bridge. > > > > Yes, the patch below works. 
> > > > Thanks, > > Rafael > > > > > > --- > > drivers/gpu/drm/radeon/radeon_irq_kms.c | 3 -- > > drivers/pci/quirks.c | 36 ++------------------------------ > > 2 files changed, 4 insertions(+), 35 deletions(-) > > > > Index: linux-2.6/drivers/pci/quirks.c > > =================================================================== > > --- linux-2.6.orig/drivers/pci/quirks.c > > +++ linux-2.6/drivers/pci/quirks.c > > @@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi( > > } > > } > > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi); > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi); > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi); > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi); > > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi); > > > > /* Go through the list of Hypertransport capabilities and > > @@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT > > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375, > > quirk_msi_intx_disable_bug); > > > > -/* > > - * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio > > - * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit. 
> > - */ > > -static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge) > > -{ > > - u32 nb_cntl; > > - > > - if (!int_gfx_bridge->subordinate) > > - return; > > - > > - pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0), > > - 0x60, 0); > > - pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0), > > - 0x64, &nb_cntl); > > - > > - if (!(nb_cntl & BIT(10))) { > > - dev_warn(&int_gfx_bridge->dev, > > - FW_WARN "RS780: MSI for internal graphics disabled\n"); > > - int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI; > > - } > > -} > > - > > -#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX 0x9602 > > - > > -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, > > - PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX, > > - rs780_int_gfx_disable_msi); > > -/* wrong vendor ID on M4A785TD motherboard: */ > > -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, > > - PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX, > > - rs780_int_gfx_disable_msi); > > - > > #endif /* CONFIG_PCI_MSI */ > > > > #ifdef CONFIG_PCI_IOV > > Index: linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c > > =================================================================== > > --- linux-2.6.orig/drivers/gpu/drm/radeon/radeon_irq_kms.c > > +++ linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c > > @@ -117,8 +117,7 @@ int radeon_irq_kms_init(struct radeon_de > > /* MSIs don't seem to work reliably on all IGP > > * chips. Disable MSI on them for now. > > */ > > - if ((rdev->family >= CHIP_RV380) && > > - (!(rdev->flags & RADEON_IS_IGP))) { > > + if (rdev->family >= CHIP_RV380) { > > int ret = pci_enable_msi(rdev->pdev); > > if (!ret) { > > rdev->msi_enabled = 1; > > > > Let's skip this second chunk for now as there are other non-RS780 IGP > chips that could be problematic, so I'd rather just leave MSIs > disabled for now. Works for me. So do you want me to resubmit? Rafael ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 21:46 ` Rafael J. Wysocki @ 2010-04-01 22:07 ` Alex Deucher 2010-04-01 23:20 ` Rafael J. Wysocki 0 siblings, 1 reply; 231+ messages in thread From: Alex Deucher @ 2010-04-01 22:07 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie On Thu, Apr 1, 2010 at 5:46 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > On Thursday 01 April 2010, Alex Deucher wrote: >> On Thu, Apr 1, 2010 at 5:08 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: >> > On Thursday 01 April 2010, Alex Deucher wrote: >> >> On Thu, Apr 1, 2010 at 4:48 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: >> >> > On Thursday 01 April 2010, Alex Deucher wrote: >> >> >> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: >> >> >> > On Thursday 01 April 2010, Alex Deucher wrote: >> >> >> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote: >> >> >> >> > Alex Deucher wrote: >> >> >> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: >> >> >> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote: >> >> >> >> >>>> > PCI quirk: RS780/RS880: work around missing MSI initialization >> >> >> >> >>>> >> >> >> >> >>>> This one (commit a5ee4eb7541) broke OpenGL acceleration on my new test box >> >> >> >> >>>> which happens to have a RS780. >> >> >> >> > >> >> >> >> > So it's better to disable MSI unconditionally. >> >> >> >> > >> >> >> >> > Rafael, can you check if MSI works for the HDMI audio device? >> >> >> >> > (I'd guess it doesn't.) >> >> >> >> > >> >> >> >> >> I also have the attached patch queued in via Dave's tree to disable >> >> >> >> >> MSI on all IGP chips for the time being. >> >> >> >> > >> >> >> >> > This disables MSI only for the graphics device. 
I'd prefer to have >> >> >> >> > the quirk on its bridge so that MSI gets disabled for the HDMI audio >> >> >> >> > device too, to avoid having to duplicate this quirk in the snd-hda-intel >> >> >> >> > driver. >> >> >> >> > >> >> >> >> > ========== >> >> >> >> > >> >> >> >> > PCI quirk: RS780/RS880: disable MSI completely >> >> >> >> > >> >> >> >> > The missing initialization of the nb_cntl.strap_msi_enable does not seem >> >> >> >> > to be the only problem that prevents MSI, so that quirk is not >> >> >> >> > sufficient to enable MSI on all machines. To be safe, unconditionally >> >> >> >> > disable MSI for the internal graphics and HDMI audio on these chipsets. >> >> >> >> > >> >> >> >> > Signed-off-by: Clemens Ladisch <clemens@ladisch.de> >> >> >> >> >> >> >> >> Works fine here. >> >> >> >> >> >> >> >> Tested-by: Alex Deucher <alexdeucher@gmail.com> >> >> >> > >> >> >> > Unfortunately it doesn't work for me without the >> >> >> > >> >> >> > if ((rdev->family >= CHIP_RV380) && >> >> >> > (!(rdev->flags & RADEON_IS_IGP))) >> >> >> > >> >> >> > radeon quirk. >> >> >> >> >> >> what are your pci ids? >> >> > >> >> > 1022:960b >> >> > >> >> > I guess 1022 is AMD. >> >> > >> >> > OK, I'll try to add that. >> >> >> >> It's possible your oem has the wrong vendor id for the 0x9602 bridge. >> > >> > Yes, the patch below works. 
>> > >> > Thanks, >> > Rafael >> > >> > >> > --- >> > drivers/gpu/drm/radeon/radeon_irq_kms.c | 3 -- >> > drivers/pci/quirks.c | 36 ++------------------------------ >> > 2 files changed, 4 insertions(+), 35 deletions(-) >> > >> > Index: linux-2.6/drivers/pci/quirks.c >> > =================================================================== >> > --- linux-2.6.orig/drivers/pci/quirks.c >> > +++ linux-2.6/drivers/pci/quirks.c >> > @@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi( >> > } >> > } >> > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi); >> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi); >> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi); >> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi); >> > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi); >> > >> > /* Go through the list of Hypertransport capabilities and >> > @@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT >> > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375, >> > quirk_msi_intx_disable_bug); >> > >> > -/* >> > - * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio >> > - * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit. 
>> > - */ >> > -static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge) >> > -{ >> > - u32 nb_cntl; >> > - >> > - if (!int_gfx_bridge->subordinate) >> > - return; >> > - >> > - pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0), >> > - 0x60, 0); >> > - pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0), >> > - 0x64, &nb_cntl); >> > - >> > - if (!(nb_cntl & BIT(10))) { >> > - dev_warn(&int_gfx_bridge->dev, >> > - FW_WARN "RS780: MSI for internal graphics disabled\n"); >> > - int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI; >> > - } >> > -} >> > - >> > -#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX 0x9602 >> > - >> > -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, >> > - PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX, >> > - rs780_int_gfx_disable_msi); >> > -/* wrong vendor ID on M4A785TD motherboard: */ >> > -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, >> > - PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX, >> > - rs780_int_gfx_disable_msi); >> > - >> > #endif /* CONFIG_PCI_MSI */ >> > >> > #ifdef CONFIG_PCI_IOV >> > Index: linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c >> > =================================================================== >> > --- linux-2.6.orig/drivers/gpu/drm/radeon/radeon_irq_kms.c >> > +++ linux-2.6/drivers/gpu/drm/radeon/radeon_irq_kms.c >> > @@ -117,8 +117,7 @@ int radeon_irq_kms_init(struct radeon_de >> > /* MSIs don't seem to work reliably on all IGP >> > * chips. Disable MSI on them for now. >> > */ >> > - if ((rdev->family >= CHIP_RV380) && >> > - (!(rdev->flags & RADEON_IS_IGP))) { >> > + if (rdev->family >= CHIP_RV380) { >> > int ret = pci_enable_msi(rdev->pdev); >> > if (!ret) { >> > rdev->msi_enabled = 1; >> > >> >> Let's skip this second chunk for now as there are other non-RS780 IGP >> chips that could be problematic, so I'd rather just leave MSIs >> disabled for now. > > Works for me. > > So do you want me to resubmit? > Please. 
Thanks, Alex > Rafael > ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 22:07 ` Alex Deucher @ 2010-04-01 23:20 ` Rafael J. Wysocki 2010-04-02 0:23 ` Linus Torvalds 0 siblings, 1 reply; 231+ messages in thread From: Rafael J. Wysocki @ 2010-04-01 23:20 UTC (permalink / raw) To: Alex Deucher Cc: Clemens Ladisch, Linus Torvalds, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie On Friday 02 April 2010, Alex Deucher wrote: > On Thu, Apr 1, 2010 at 5:46 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > > On Thursday 01 April 2010, Alex Deucher wrote: > >> On Thu, Apr 1, 2010 at 5:08 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > >> > On Thursday 01 April 2010, Alex Deucher wrote: > >> >> On Thu, Apr 1, 2010 at 4:48 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > >> >> > On Thursday 01 April 2010, Alex Deucher wrote: > >> >> >> On Thu, Apr 1, 2010 at 4:28 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > >> >> >> > On Thursday 01 April 2010, Alex Deucher wrote: > >> >> >> >> On Thu, Apr 1, 2010 at 2:36 AM, Clemens Ladisch <clemens@ladisch.de> wrote: > >> >> >> >> > Alex Deucher wrote: > >> >> >> >> >> On Wed, Mar 31, 2010 at 9:13 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > >> >> >> >> >>> On Tuesday 30 March 2010, Rafael J. Wysocki wrote: ... > > So do you want me to resubmit? > > > > Please. Appended, with sign-offs and changelog. Thanks, Rafael --- Subject: PCI quirk: RS780/RS880: disable MSI completely The missing initialization of the nb_cntl.strap_msi_enable does not seem to be the only problem that prevents MSI, so that quirk is not sufficient to enable MSI on all machines. To be safe, disable MSI unconditionally for the internal graphics and HDMI audio on these chipsets. [rjw: Added the PCI_VENDOR_ID_AI quirk.] Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Rafael J. 
Wysocki <rjw@sisk.pl> --- drivers/pci/quirks.c | 36 +++--------------------------------- 1 file changed, 3 insertions(+), 33 deletions(-) Index: linux-2.6/drivers/pci/quirks.c =================================================================== --- linux-2.6.orig/drivers/pci/quirks.c +++ linux-2.6/drivers/pci/quirks.c @@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi( } } DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi); DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi); /* Go through the list of Hypertransport capabilities and @@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375, quirk_msi_intx_disable_bug); -/* - * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio - * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit. 
- */ -static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge) -{ - u32 nb_cntl; - - if (!int_gfx_bridge->subordinate) - return; - - pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0), - 0x60, 0); - pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0), - 0x64, &nb_cntl); - - if (!(nb_cntl & BIT(10))) { - dev_warn(&int_gfx_bridge->dev, - FW_WARN "RS780: MSI for internal graphics disabled\n"); - int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI; - } -} - -#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX 0x9602 - -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, - PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX, - rs780_int_gfx_disable_msi); -/* wrong vendor ID on M4A785TD motherboard: */ -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, - PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX, - rs780_int_gfx_disable_msi); - #endif /* CONFIG_PCI_MSI */ #ifdef CONFIG_PCI_IOV ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 23:20 ` Rafael J. Wysocki @ 2010-04-02 0:23 ` Linus Torvalds 2010-04-02 16:46 ` Rafael J. Wysocki 0 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-02 0:23 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Alex Deucher, Clemens Ladisch, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie On Fri, 2 Apr 2010, Rafael J. Wysocki wrote: > > Appended, with sign-offs and changelog. > > --- > Subject: PCI quirk: RS780/RS880: disable MSI completely Hmm. Isn't this missing a From: Clemens Ladisch <clemens@ladisch.de> too? Or was the original patch yours? Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-02 0:23 ` Linus Torvalds @ 2010-04-02 16:46 ` Rafael J. Wysocki 2010-04-03 18:08 ` Clemens Ladisch 0 siblings, 1 reply; 231+ messages in thread From: Rafael J. Wysocki @ 2010-04-02 16:46 UTC (permalink / raw) To: Linus Torvalds Cc: Alex Deucher, Clemens Ladisch, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie On Friday 02 April 2010, Linus Torvalds wrote: > > On Fri, 2 Apr 2010, Rafael J. Wysocki wrote: > > > > Appended, with sign-offs and changelog. > > > > --- > > Subject: PCI quirk: RS780/RS880: disable MSI completely > > Hmm. Isn't this missing a > > From: Clemens Ladisch <clemens@ladisch.de> > > too? Ouch, yes it is, sorry. This one should be complete. --- From: Clemens Ladisch <clemens@ladisch.de> Subject: PCI quirk: RS780/RS880: disable MSI completely The missing initialization of the nb_cntl.strap_msi_enable does not seem to be the only problem that prevents MSI, so that quirk is not sufficient to enable MSI on all machines. To be safe, disable MSI unconditionally for the internal graphics and HDMI audio on these chipsets. [rjw: Added the PCI_VENDOR_ID_AI quirk.] Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Rafael J. 
Wysocki <rjw@sisk.pl> --- drivers/pci/quirks.c | 36 +++--------------------------------- 1 file changed, 3 insertions(+), 33 deletions(-) Index: linux-2.6/drivers/pci/quirks.c =================================================================== --- linux-2.6.orig/drivers/pci/quirks.c +++ linux-2.6/drivers/pci/quirks.c @@ -2123,6 +2123,9 @@ static void __devinit quirk_disable_msi( } } DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_8131_BRIDGE, quirk_disable_msi); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi); DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, 0xa238, quirk_disable_msi); /* Go through the list of Hypertransport capabilities and @@ -2495,39 +2498,6 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x4375, quirk_msi_intx_disable_bug); -/* - * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio - * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit. 
- */ -static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge) -{ - u32 nb_cntl; - - if (!int_gfx_bridge->subordinate) - return; - - pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0), - 0x60, 0); - pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0), - 0x64, &nb_cntl); - - if (!(nb_cntl & BIT(10))) { - dev_warn(&int_gfx_bridge->dev, - FW_WARN "RS780: MSI for internal graphics disabled\n"); - int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI; - } -} - -#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX 0x9602 - -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, - PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX, - rs780_int_gfx_disable_msi); -/* wrong vendor ID on M4A785TD motherboard: */ -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, - PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX, - rs780_int_gfx_disable_msi); - #endif /* CONFIG_PCI_MSI */ #ifdef CONFIG_PCI_IOV ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-02 16:46 ` Rafael J. Wysocki @ 2010-04-03 18:08 ` Clemens Ladisch 2010-04-03 19:33 ` Rafael J. Wysocki 0 siblings, 1 reply; 231+ messages in thread From: Clemens Ladisch @ 2010-04-03 18:08 UTC (permalink / raw) To: Rafael J. Wysocki, Linus Torvalds Cc: Alex Deucher, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie Rafael J. Wysocki wrote: > From: Clemens Ladisch <clemens@ladisch.de> > Subject: PCI quirk: RS780/RS880: disable MSI completely > > The missing initialization of the nb_cntl.strap_msi_enable does not > seem to be the only problem that prevents MSI, so that quirk is not > sufficient to enable MSI on all machines. To be safe, disable MSI > unconditionally for the internal graphics and HDMI audio on these > chipsets. > > [rjw: Added the PCI_VENDOR_ID_AI quirk.] > ... > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi); > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi); > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi); I fear I have to NACK this. The fact that two OEMs have changed the vendor ID makes it likely that this is a bug in AMD's template BIOS code, and that we will see the same problem on other systems using other vendor IDs. So we should not use the vendor ID of device 0x9602 to declare the quirk, but use some other device with an ID that is known to be correct. We already access the configuration space of the host bridge, so we should use that. Furthermore, the quirk in my first patch was never run at all on the ALi system, so it is probable that the nb_cntl.strap_msi_enable detection would actually work. Rafael, please test this patch; if it doesn't work on your system, we can still remove the check for the strap_msi_enable bit. 
========== Subject: PCI quirk: RS780/RS880: work around wrong vendor IDs of RS780 bridge On many RS780 systems, the vendor ID of the PCI/PCI bridge for the internal graphics is set to that of the mainboard vendor, so the quirk would not match and failed to notice the disabled MSI. Since we do not know in advance all possible vendor IDs, we have to declare the quirk on another device with an ID that is known to be correct, and use that as a stepping stone to find the PCI/PCI bridge, if present. Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Cc: <stable@kernel.org> --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -2483,34 +2483,38 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AT * MSI does not work with the AMD RS780/RS880 internal graphics and HDMI audio * devices unless the BIOS has initialized the nb_cntl.strap_msi_enable bit. */ -static void __init rs780_int_gfx_disable_msi(struct pci_dev *int_gfx_bridge) +static void __init rs780_int_gfx_disable_msi(struct pci_dev *host_bridge) { + struct pci_dev *int_gfx_bridge; u32 nb_cntl; - if (!int_gfx_bridge->subordinate) + /* + * Many OEMs change the vendor ID of the internal graphics PCI/PCI + * bridge, so we use the possible vendor/device IDs of the host bridge + * for the declared quirk, and search for the PCI/PCI bridge by slot + * number. 
+ */ + int_gfx_bridge = pci_get_slot(host_bridge->bus, PCI_DEVFN(1, 0)); + if (!int_gfx_bridge) return; + if (int_gfx_bridge->device != 0x9602 || !int_gfx_bridge->subordinate) + goto out; - pci_bus_write_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0), - 0x60, 0); - pci_bus_read_config_dword(int_gfx_bridge->bus, PCI_DEVFN(0, 0), - 0x64, &nb_cntl); + pci_write_config_dword(host_bridge, 0x60, 0); + pci_read_config_dword(host_bridge, 0x64, &nb_cntl); if (!(nb_cntl & BIT(10))) { dev_warn(&int_gfx_bridge->dev, FW_WARN "RS780: MSI for internal graphics disabled\n"); int_gfx_bridge->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI; } -} -#define PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX 0x9602 +out: + pci_dev_put(int_gfx_bridge); +} -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, - PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX, - rs780_int_gfx_disable_msi); -/* wrong vendor ID on M4A785TD motherboard: */ -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, - PCI_DEVICE_ID_AMD_RS780_P2P_INT_GFX, - rs780_int_gfx_disable_msi); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9600, rs780_int_gfx_disable_msi); +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9601, rs780_int_gfx_disable_msi); #endif /* CONFIG_PCI_MSI */ ^ permalink raw reply [flat|nested] 231+ messages in thread
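The lookup in the patch above can be modelled in plain userspace C. This is only a sketch under assumptions: every model_* and MODEL_* name is hypothetical and not a kernel symbol, and nb_cntl is passed in as a ready value rather than read through the 0x60/0x64 index/data pair on the host bridge. It shows the two decisions the quirk makes: the bridge is identified by slot and device ID only (never by vendor ID), and the bus behind it loses MSI only when bit 10 of nb_cntl is clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
#define MODEL_DEVFN(slot, fn)   ((((slot) & 0x1f) << 3) | ((fn) & 0x07))
#define MODEL_STRAP_MSI_ENABLE  (1u << 10)  /* bit 10 of nb_cntl, as in the patch */
#define MODEL_INT_GFX_BRIDGE_ID 0x9602      /* device ID checked by the patch */

struct model_bus {
	bool no_msi;                    /* plays the role of PCI_BUS_FLAGS_NO_MSI */
};

struct model_dev {
	uint16_t vendor;                /* may be wrong on OEM boards */
	uint16_t device;
	unsigned int devfn;
	struct model_bus *subordinate;  /* bus behind a bridge, or NULL */
};

/* Find the device at a given devfn, like a lookup by slot number. */
static struct model_dev *model_find_slot(struct model_dev *devs, size_t n,
					 unsigned int devfn)
{
	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (devs[i].devfn == devfn)
			return &devs[i];
	return NULL;
}

/*
 * Returns true if MSI was disabled on the bus behind the bridge.
 * The bridge's vendor ID is deliberately never examined.
 */
static bool model_rs780_quirk(struct model_dev *devs, size_t n,
			      uint32_t nb_cntl)
{
	struct model_dev *bridge = model_find_slot(devs, n, MODEL_DEVFN(1, 0));

	if (!bridge || bridge->device != MODEL_INT_GFX_BRIDGE_ID ||
	    !bridge->subordinate)
		return false;
	if (nb_cntl & MODEL_STRAP_MSI_ENABLE)
		return false;           /* BIOS set the strap bit: leave MSI alone */
	bridge->subordinate->no_msi = true;
	return true;
}
```

Because the vendor field is never read, a board that reports an OEM vendor ID on the 0x9602 bridge is handled the same way as one that reports AMD's.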
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-03 18:08 ` Clemens Ladisch @ 2010-04-03 19:33 ` Rafael J. Wysocki 0 siblings, 0 replies; 231+ messages in thread From: Rafael J. Wysocki @ 2010-04-03 19:33 UTC (permalink / raw) To: Clemens Ladisch Cc: Linus Torvalds, Alex Deucher, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable, Dave Airlie On Saturday 03 April 2010, Clemens Ladisch wrote: > Rafael J. Wysocki wrote: > > From: Clemens Ladisch <clemens@ladisch.de> > > Subject: PCI quirk: RS780/RS880: disable MSI completely > > > > The missing initialization of the nb_cntl.strap_msi_enable does not > > seem to be the only problem that prevents MSI, so that quirk is not > > sufficient to enable MSI on all machines. To be safe, disable MSI > > unconditionally for the internal graphics and HDMI audio on these > > chipsets. > > > > [rjw: Added the PCI_VENDOR_ID_AI quirk.] > > ... > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x9602, quirk_disable_msi); > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASUSTEK, 0x9602, quirk_disable_msi); > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AI, 0x9602, quirk_disable_msi); > > I fear I have to NACK this. I'm afraid it's too late, the patch has been merged. > The fact that two OEMs have changed the vendor > ID makes it likely that this is a bug in AMD's template BIOS code, and that > we will see the same problem on other systems using other vendor IDs. > > So we should not use the vendor ID of device 0x9602 to declare the quirk, but > use some other device with an ID that is known to be correct. We already > access the configuration space of the host bridge, so we should use that. > > Furthermore, the quirk in my first patch was never run at all on the ALi > system, so it is probable that the nb_cntl.strap_msi_enable detection > would actually work. 
Rafael, please test this patch; if it doesn't work > on your system, we can still remove the check for the strap_msi_enable bit. > > ========== > > Subject: PCI quirk: RS780/RS880: work around wrong vendor IDs of RS780 bridge > > On many RS780 systems, the vendor ID of the PCI/PCI bridge for the > internal graphics is set to that of the mainboard vendor, so the quirk > would not match and failed to notice the disabled MSI. > > Since we do not know in advance all possible vendor IDs, we have to > declare the quirk on another device with an ID that is known to be > correct, and use that as a stepping stone to find the PCI/PCI bridge, > if present. > > Signed-off-by: Clemens Ladisch <clemens@ladisch.de> > Cc: <stable@kernel.org> Yes, this works (after reverting commit 5193d7a7f500cfbbfc0de221e808208199723521 and removing the (rdev->flags & RADEON_IS_IGP) test from radeon_irq_kms_init()). Thanks, Rafael ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 1:13 ` Rafael J. Wysocki 2010-04-01 2:19 ` Alex Deucher @ 2010-04-01 16:29 ` Linus Torvalds 2010-04-01 17:07 ` Alex Deucher ` (2 more replies) 1 sibling, 3 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-01 16:29 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux Kernel Mailing List, Dave Airlie, dri-devel, Jesse Barnes, Linux PCI, Clemens Ladisch, Alex Deucher, stable, Greg KH On Thu, 1 Apr 2010, Rafael J. Wysocki wrote: > > OK, I've verified that partial revert (below) is sufficient. Hmm. Through the DRM merge I just did, this area actually conflicted, and the resolved version is now if ((rdev->family >= CHIP_RV380) && (!(rdev->flags & RADEON_IS_IGP))) { which presumably also fixes your issue? [ Side note: somebody in the DRM tree seems to be way too used to LISP, and thinks that adding parenthesis always improves the code ;-] However, I do suspect that we should probably revert the quirk regardless as being useless (ie it probably was related to those IGP chips that apparently don't do MSI anyway). So the patch that reverts the quirk by Clemens (to replace it with disabling MSI entirely when the AMD NB doesn't accept them) seems to be a good idea regardless, since it's apparently not just about gfx. Jesse? Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 16:29 ` Linus Torvalds @ 2010-04-01 17:07 ` Alex Deucher 2010-04-01 17:24 ` Linus Torvalds 2010-04-01 19:46 ` Rafael J. Wysocki 2010-04-01 22:48 ` Jesse Barnes 2 siblings, 1 reply; 231+ messages in thread From: Alex Deucher @ 2010-04-01 17:07 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Linux PCI, Greg KH, Clemens Ladisch, Linux Kernel Mailing List, Jesse Barnes, Alex Deucher, dri-devel, stable On Thu, Apr 1, 2010 at 12:29 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Thu, 1 Apr 2010, Rafael J. Wysocki wrote: >> >> OK, I've verified that partial revert (below) is sufficient. > > Hmm. Through the DRM merge I just did, this area actually conflicted, and > the resolved version is now > > if ((rdev->family >= CHIP_RV380) && > (!(rdev->flags & RADEON_IS_IGP))) { > > which presumably also fixes your issue? > > [ Side note: somebody in the DRM tree seems to be way too used to LISP, > and thinks that adding parenthesis always improves the code ;-] > heh, that's me. habit I guess, just to be sure. > However, I do suspect that we should probably revert the quirk regardless > as being useless (ie it probably was related to those IGP chips that > apparently don't do MSI anyway). > > So the patch that reverts the quirk by Clemens (to replace it with > disabling MSI entirely when the AMD NB doesn't accept them) seems to be a > good idea regardless, since it's apparently not just about gfx. Jesse? Clemens' "PCI quirk: RS780/RS880: disable MSI completely" patch is the right approach I think. Note that it's only devices hung off the int gfx pci to pci bridge that have broken MSI (gfx and audio). MSI works fine on the PCIE slots.
I have a similar patch for rs400 chips on bug 15626: https://bugzilla.kernel.org/show_bug.cgi?id=15626 Alex > > Linus > ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 17:07 ` Alex Deucher @ 2010-04-01 17:24 ` Linus Torvalds 2010-04-01 17:50 ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 Clemens Ladisch 2010-04-01 17:53 ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) Alex Deucher 0 siblings, 2 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-01 17:24 UTC (permalink / raw) To: Alex Deucher Cc: Rafael J. Wysocki, Linux PCI, Greg KH, Clemens Ladisch, Linux Kernel Mailing List, Jesse Barnes, Alex Deucher, dri-devel, stable On Thu, 1 Apr 2010, Alex Deucher wrote: > > Clemens' "PCI quirk: RS780/RS880: disable MSI completely" patch is the > right approach I think. Note that it's only devices hung off the int > gfx pci to pci bridge that have broken MSI (gfx and audio). MSI works > fine on the PCIE slots. I have a similar patch for rs400 chips on bug > 15626: > https://bugzilla.kernel.org/show_bug.cgi?id=15626 Hmm. Does 'pci_msi_enable' only cover regular PCI devices? Or will that pci_no_msi() quirk disable MSI for PCIE too? I think it will trigger for PCIE drivers too. Put another way: it sounds like the quirk now disables MSI for all devices. Maybe there would be some more targeted mode? Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 2010-04-01 17:24 ` Linus Torvalds @ 2010-04-01 17:50 ` Clemens Ladisch 2010-04-01 17:53 ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) Alex Deucher 1 sibling, 0 replies; 231+ messages in thread From: Clemens Ladisch @ 2010-04-01 17:50 UTC (permalink / raw) To: Linus Torvalds Cc: Alex Deucher, Rafael J. Wysocki, Linux PCI, Greg KH, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable Linus Torvalds wrote: > On Thu, 1 Apr 2010, Alex Deucher wrote: > > Clemems' "PCI quirk: RS780/RS880: disable MSI completely" patch is the > > right approach I think. Note that it's only devices hung off the int > > gfx pci to pci bridge that have broken MSI (gfx and audio). MSI works > > fine on the PCIE slots. I have a similar patch for rs400 chips on bug > > 15626: > > https://bugzilla.kernel.org/show_bug.cgi?id=15626 > > Hmm. Does 'pci_msi_enable' only cover regular PCI devices? Or will that > pci_no_msi() quirk disable MSI for PCIE too? A quirk that used pci_no_msi() would disable all MSI for all devices. However, these patches (and that in bug 15626) use PCI_BUS_FLAGS_NO_MSI so that only the internal GPU devices are affected. That "completely" in my patch title should better read "unconditionally". Regards, Clemens ^ permalink raw reply [flat|nested] 231+ messages in thread
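The distinction Clemens draws above (a global pci_no_msi() versus a per-bus PCI_BUS_FLAGS_NO_MSI flag) can be sketched as a small user-space model. This is only an illustration, not kernel code: every type and function name below (model_bus, model_dev, model_no_msi, model_quirk_disable_msi, model_msi_allowed) is hypothetical, and the rule that a flag on any bus between the device and the root blocks MSI is an assumption about how the flag is honored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel structures; not the real definitions. */
#define BUS_FLAG_NO_MSI 0x1u            /* models PCI_BUS_FLAGS_NO_MSI */

struct model_bus {
    struct model_bus *parent;           /* NULL for the root bus */
    unsigned int flags;
};

struct model_dev {
    struct model_bus *bus;              /* bus the device sits on */
};

static bool msi_globally_disabled;      /* models the effect of pci_no_msi() */

/* Global switch: after this, no device may use MSI. */
static void model_no_msi(void)
{
    msi_globally_disabled = true;
}

/* Bridge quirk: flag only the bus behind one particular bridge. */
static void model_quirk_disable_msi(struct model_bus *secondary)
{
    assert(secondary != NULL);
    secondary->flags |= BUS_FLAG_NO_MSI;
}

/*
 * A device may use MSI unless the global switch is set or some bus on
 * the path from the device up to the root carries the NO_MSI flag.
 */
static bool model_msi_allowed(const struct model_dev *dev)
{
    const struct model_bus *bus;

    if (msi_globally_disabled)
        return false;
    for (bus = dev->bus; bus != NULL; bus = bus->parent)
        if (bus->flags & BUS_FLAG_NO_MSI)
            return false;
    return true;
}
```

With the flag set only on the bus behind the internal-graphics bridge, a device on a sibling bus (an add-on card, say) is still allowed MSI in this model, while calling model_no_msi() turns it off for everything. That matches Linus's worry about the root bridge: if the flagged bus were the root, every device below it would be affected.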
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 17:24 ` Linus Torvalds 2010-04-01 17:50 ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 Clemens Ladisch @ 2010-04-01 17:53 ` Alex Deucher 2010-04-01 20:17 ` Linus Torvalds 1 sibling, 1 reply; 231+ messages in thread From: Alex Deucher @ 2010-04-01 17:53 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Linux PCI, Greg KH, Clemens Ladisch, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable On Thu, Apr 1, 2010 at 1:24 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Thu, 1 Apr 2010, Alex Deucher wrote: >> >> Clemems' "PCI quirk: RS780/RS880: disable MSI completely" patch is the >> right approach I think. Note that it's only devices hung off the int >> gfx pci to pci bridge that have broken MSI (gfx and audio). MSI works >> fine on the PCIE slots. I have a similar patch for rs400 chips on bug >> 15626: >> https://bugzilla.kernel.org/show_bug.cgi?id=15626 > > Hmm. Does 'pci_msi_enable' only cover regular PCI devices? Or will that > pci_no_msi() quirk disable MSI for PCIE too? I think it will trigger for > PCIE drivers too. > > Put another way: it sounds like the quirk now disables MSI for all > devices. Maybe there would some more targeted mode? > What I meant to say was MSI works fine on bridges other than the bridge the internal gfx lives on. quirk_disable_msi() just disables MSI on the devices on that particular bridge as far as I understand it, but I'm by no means an expert on the PCI code. E.g., on my RS780 board, MSIs are only problematic on the integrated gfx chip. MSIs work fine on PCI/PCIE add-on cards and the integrated Ethernet. Alex > Linus > ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 17:53 ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) Alex Deucher @ 2010-04-01 20:17 ` Linus Torvalds 2010-04-01 20:23 ` Alex Deucher 0 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-01 20:17 UTC (permalink / raw) To: Alex Deucher Cc: Rafael J. Wysocki, Linux PCI, Greg KH, Clemens Ladisch, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable On Thu, 1 Apr 2010, Alex Deucher wrote: > > What I meant to say was MSI works fine on bridges other than the > bridge the internal gfx lives on. quirk_disable_msi() just disables > MSI on the devices on that particular bridge as far as I understand > it, but I'm by no means an expert on the PCI code. Yes, it disabled MSI only on devices under that bridge. But if it's the northbridge, that would be everything, no? But I don't know what devices those PCI_VENDOR_ID_AMD, 0x9602, PCI_VENDOR_ID_ASUSTEK, 0x9602, things are. If they are just a PCIE->PCI bridge rather than the root bridge, then everything looks fine to me. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 20:17 ` Linus Torvalds @ 2010-04-01 20:23 ` Alex Deucher 0 siblings, 0 replies; 231+ messages in thread From: Alex Deucher @ 2010-04-01 20:23 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Linux PCI, Greg KH, Clemens Ladisch, Linux Kernel Mailing List, Jesse Barnes, dri-devel, stable On Thu, Apr 1, 2010 at 4:17 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Thu, 1 Apr 2010, Alex Deucher wrote: >> >> What I meant to say was MSI works fine on bridges other than the >> bridge the internal gfx lives on. quirk_disable_msi() just disables >> MSI on the devices on that particular bridge as far as I understand >> it, but I'm by no means an expert on the PCI code. > > Yes, it disabled MSI only on devices under that bridge. But if it's the > northbridge, that would be everything, no? > > But I don't know what devices those > > PCI_VENDOR_ID_AMD, 0x9602, > PCI_VENDOR_ID_ASUSTEK, 0x9602, > > things are. If they are just a PCIE->PCI bridge rather than the root > bridge, then everything looks fine to me. > Yup, those are just the pci to pci bridges used for the internal gfx. Really there's only one, 0x9602, but some asus oem boards have the vendor id wrong. > Linus > ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 16:29 ` Linus Torvalds 2010-04-01 17:07 ` Alex Deucher @ 2010-04-01 19:46 ` Rafael J. Wysocki 2010-04-01 22:48 ` Jesse Barnes 2 siblings, 0 replies; 231+ messages in thread From: Rafael J. Wysocki @ 2010-04-01 19:46 UTC (permalink / raw) To: Linus Torvalds Cc: Linux Kernel Mailing List, Dave Airlie, dri-devel, Jesse Barnes, Linux PCI, Clemens Ladisch, Alex Deucher, stable, Greg KH On Thursday 01 April 2010, Linus Torvalds wrote: > > On Thu, 1 Apr 2010, Rafael J. Wysocki wrote: > > > > OK, I've verified that partial revert (below) is sufficient. > > Hmm. Through the DRM merge I just did, this area actually conflicted, and > the resolved version is now > > if ((rdev->family >= CHIP_RV380) && > (!(rdev->flags & RADEON_IS_IGP))) { > > which presumably also fixes your issue? Yes, it does. Rafael ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 16:29 ` Linus Torvalds 2010-04-01 17:07 ` Alex Deucher 2010-04-01 19:46 ` Rafael J. Wysocki @ 2010-04-01 22:48 ` Jesse Barnes 2010-04-01 23:23 ` Rafael J. Wysocki 2 siblings, 1 reply; 231+ messages in thread From: Jesse Barnes @ 2010-04-01 22:48 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Dave Airlie, dri-devel, Linux PCI, Clemens Ladisch, Alex Deucher, stable, Greg KH On Thu, 1 Apr 2010 09:29:23 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Thu, 1 Apr 2010, Rafael J. Wysocki wrote: > > > > OK, I've verified that partial revert (below) is sufficient. > > Hmm. Through the DRM merge I just did, this area actually conflicted, and > the resolved version is now > > if ((rdev->family >= CHIP_RV380) && > (!(rdev->flags & RADEON_IS_IGP))) { > > which presumably also fixes your issue? > > [ Side note: somebody in the DRM tree seems to be way too used to LISP, > and thinks that adding parenthesis always improves the code ;-] > > However, I do suspect that we should probably revert the quirk regardless > as being useless (ie it probably was related to those IGP chips that > apparently don't do MSI anyway). > > So the patch that reverts the quirk by Clemens (to replace it with > disabling MSI entirely when the AMD NB doesn't accept them) seems to be a > good idea regardless, since it's apparently not just about gfx. Jesse? Yeah, that sounds fine. I can include it in my next pull req or you can just pick it up directly. -- Jesse Barnes, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) 2010-04-01 22:48 ` Jesse Barnes @ 2010-04-01 23:23 ` Rafael J. Wysocki 0 siblings, 0 replies; 231+ messages in thread From: Rafael J. Wysocki @ 2010-04-01 23:23 UTC (permalink / raw) To: Jesse Barnes Cc: Linus Torvalds, Linux Kernel Mailing List, Dave Airlie, dri-devel, Linux PCI, Clemens Ladisch, Alex Deucher, stable, Greg KH On Friday 02 April 2010, Jesse Barnes wrote: > On Thu, 1 Apr 2010 09:29:23 -0700 (PDT) > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > > > On Thu, 1 Apr 2010, Rafael J. Wysocki wrote: > > > > > > OK, I've verified that partial revert (below) is sufficient. > > > > Hmm. Through the DRM merge I just did, this area actually conflicted, and > > the resolved version is now > > > > if ((rdev->family >= CHIP_RV380) && > > (!(rdev->flags & RADEON_IS_IGP))) { > > > > which presumably also fixes your issue? > > > > [ Side note: somebody in the DRM tree seems to be way too used to LISP, > > and thinks that adding parenthesis always improves the code ;-] > > > > However, I do suspect that we should probably revert the quirk regardless > > as being useless (ie it probably was related to those IGP chips that > > apparently don't do MSI anyway). > > > > So the patch that reverts the quirk by Clemens (to replace it with > > disabling MSI entirely when the AMD NB doesn't accept them) seems to be a > > good idea regardless, since it's apparently not just about gfx. Jesse? > > Yeah, that sounds fine. I can include it in my next pull req or you > can just pick it up directly. Not exactly that one, please, it's missing a quirk for the affected system. I've just sent a corrected version, here: https://patchwork.kernel.org/patch/90275/ Rafael ^ permalink raw reply [flat|nested] 231+ messages in thread
* Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-03-30 17:50 Linux 2.6.34-rc3 Linus Torvalds 2010-03-30 21:16 ` [Regression, post-rc2] Commit a5ee4eb7541 breaks OpenGL on RS780 (was: Re: Linux 2.6.34-rc3) Rafael J. Wysocki @ 2010-04-02 17:59 ` Borislav Petkov 2010-04-02 18:09 ` Linus Torvalds 1 sibling, 1 reply; 231+ messages in thread From: Borislav Petkov @ 2010-04-02 17:59 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton; +Cc: Linux Kernel Mailing List Hi, I've got the following oopsie two times now when hibernating - this means, I don't get it everytime I hibernate but only sometimes, say once in a blue moon. And yeah, I couldn't catch it over serial console so I had to make ugly pictures. By the way, the numbers in the filenames increment as I scroll down the whole oops (yep, it hadn't completely frozen and I still could do Shift->PgUp or Shift->PgDn on the console): http://www.kernel.org/pub/linux/kernel/people/bp/ So, here's what I could decipher from the oopsie, someone else who's more knowledgeable in mm, rmap and anon_vma's list traversal should be able to tell what goes wrong there. 
EIP is at page_referenced+0xee which is <disasm> 10c4: 41 01 c4 add %eax,%r12d 10c7: 83 7d cc 00 cmpl $0x0,-0x34(%rbp) 10cb: 74 19 je 10e6 <page_referenced+0xff> 10cd: 4d 8b 6d 20 mov 0x20(%r13),%r13 10d1: 49 83 ed 20 sub $0x20,%r13 10d5: 49 8b 45 20 mov 0x20(%r13),%rax <-------------- 10d9: 0f 18 08 prefetcht0 (%rax) 10dc: 49 8d 45 20 lea 0x20(%r13),%rax 10e0: 48 39 45 80 cmp %rax,-0x80(%rbp) </disasm> Corresponding asm: <asm> .loc 1 496 0 movq 32(%r13), %r13 # <variable>.same_anon_vma.next, __mptr.451 .LVL295: subq $32, %r13 #, avc .LVL296: .L184: .LBE1278: movq 32(%r13), %rax # <variable>.same_anon_vma.next, <variable>.same_anon_vma.next <---------------- prefetcht0 (%rax) # <variable>.same_anon_vma.next leaq 32(%r13), %rax #, tmp97 cmpq %rax, -128(%rbp) # tmp97, %sfp jne .L187 #, .L186: .loc 1 514 0 movq %r14, %rdi # anon_vma, call page_unlock_anon_vma # </asm> and the NULL pointer in question is being written into %r13 and then 32 is subtracted from it (I'm guessing container_of()). This is consistent with the register snapshot - %r13 contains 0xffffffffffffffe0 which is -32 and with the code dump in the oops, in CIMG1640.JPG code points to opcode 49 8b 45 20. Which is the following piece of code in <mm/rmap.c:page_referenced_anon()>. <source> mapcount = page_mapcount(page); list_for_each_entry(avc, &anon_vma->head, same_anon_vma) { struct vm_area_struct *vma = avc->vma; unsigned long address = vma_address(page, vma); if (address == -EFAULT) continue; </source> which tells us that same_anon_vma.next is NULL. Hmm... -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
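The container_of() arithmetic Boris guesses at above can be checked with a small user-space sketch. It assumes a struct laid out like the anon_vma_chain in the report (two pointers, then two list heads), which puts same_anon_vma at offset 32 on an LP64 target; struct model_avc and both helper functions are hypothetical names used only for illustration. The arithmetic is done on uintptr_t so that no invalid pointer is ever formed or dereferenced.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; only the field layout matters here. */
struct list_head {
    struct list_head *next, *prev;
};

struct model_avc {
    void *vma;                          /* offset 0 on LP64 */
    void *anon_vma;                     /* offset 8 */
    struct list_head same_vma;          /* offset 16 */
    struct list_head same_anon_vma;     /* offset 32 */
};

/*
 * What container_of() does with the address of the embedded list link:
 * subtract the member offset to get the address of the enclosing entry.
 */
static uintptr_t entry_from_link(uintptr_t link_addr)
{
    return link_addr - offsetof(struct model_avc, same_anon_vma);
}

/*
 * Address the loop reads next: the .next field of the embedded link
 * inside the entry computed above.
 */
static uintptr_t next_field_addr(uintptr_t entry_addr)
{
    return entry_addr + offsetof(struct model_avc, same_anon_vma)
                      + offsetof(struct list_head, next);
}
```

Feeding a NULL link (address 0) into entry_from_link() wraps around to 0xffffffffffffffe0 on a 64-bit target, which is the value reported in %r13, and next_field_addr() of that value is 0 again, so the load at 0x20(%r13) is the NULL dereference.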
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-02 17:59 ` Ugly rmap NULL ptr deref oopsie on hibernate (was " Borislav Petkov @ 2010-04-02 18:09 ` Linus Torvalds 2010-04-02 15:24 ` Andrew Morton 2010-04-06 8:53 ` Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) KOSAKI Motohiro 0 siblings, 2 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-02 18:09 UTC (permalink / raw) To: Borislav Petkov, Rik van Riel Cc: Andrew Morton, Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn, Minchan Kim, Nick Piggin, Andrea Arcangeli, Hugh Dickins I think this is likely due to the new scalable anon_vma linking by Rik. Nothing else I can imagine should have introduced anything like it. Rik: the pictures have the information, but you need to look at several to see both the oops and the backtrace. Here's a condensed version: shrink_all_memory -> do_try_to_free_pages -> shrink_zone -> shrink_inactive_list -> shrink_page_list -> page_referenced where page_referenced() oopses due to page_referenced_anon() as per Borislav's description below. Added all the usual suspects to the Cc list. Left the full report appended so that the new people don't have to search for it on lkml. Linus On Fri, 2 Apr 2010, Borislav Petkov wrote: > > I've got the following oopsie two times now when hibernating - this > means, I don't get it everytime I hibernate but only sometimes, say once > in a blue moon. > > And yeah, I couldn't catch it over serial console so I had to make ugly > pictures. By the way, the numbers in the filenames increment as I scroll > down the whole oops (yep, it hadn't completely frozen and I still could > do Shift->PgUp or Shift->PgDn on the console): > > http://www.kernel.org/pub/linux/kernel/people/bp/ > > So, here's what I could decipher from the oopsie, someone else who's > more knowledgeable in mm, rmap and anon_vma's list traversal should be > able to tell what goes wrong there. 
> > EIP is at page_referenced+0xee > > which is > > <disasm> > 10c4: 41 01 c4 add %eax,%r12d > 10c7: 83 7d cc 00 cmpl $0x0,-0x34(%rbp) > 10cb: 74 19 je 10e6 <page_referenced+0xff> > 10cd: 4d 8b 6d 20 mov 0x20(%r13),%r13 > 10d1: 49 83 ed 20 sub $0x20,%r13 > > 10d5: 49 8b 45 20 mov 0x20(%r13),%rax <-------------- > > 10d9: 0f 18 08 prefetcht0 (%rax) > 10dc: 49 8d 45 20 lea 0x20(%r13),%rax > 10e0: 48 39 45 80 cmp %rax,-0x80(%rbp) > </disasm> > > > Corresponding asm: > > <asm> > .loc 1 496 0 > movq 32(%r13), %r13 # <variable>.same_anon_vma.next, __mptr.451 > .LVL295: > subq $32, %r13 #, avc > .LVL296: > .L184: > .LBE1278: > movq 32(%r13), %rax # <variable>.same_anon_vma.next, <variable>.same_anon_vma.next <---------------- > prefetcht0 (%rax) # <variable>.same_anon_vma.next > leaq 32(%r13), %rax #, tmp97 > cmpq %rax, -128(%rbp) # tmp97, %sfp > jne .L187 #, > .L186: > .loc 1 514 0 > movq %r14, %rdi # anon_vma, > call page_unlock_anon_vma # > </asm> > > > and the NULL pointer in question is being written into %r13 and then 32 > is subtracted from it (I'm guessing container_of()). This is consistent > with the register snapshot - %r13 contains 0xffffffffffffffe0 which is > -32 and with the code dump in the oops, in CIMG1640.JPG code points to > opcode 49 8b 45 20. > > Which is the following piece of code in <mm/rmap.c:page_referenced_anon()>. > > <source> > > mapcount = page_mapcount(page); > list_for_each_entry(avc, &anon_vma->head, same_anon_vma) { > struct vm_area_struct *vma = avc->vma; > unsigned long address = vma_address(page, vma); > if (address == -EFAULT) > continue; > > </source> > > which tells us that same_anon_vma.next is NULL. Hmm... > > -- > Regards/Gruss, > Boris. > ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-02 18:09 ` Linus Torvalds @ 2010-04-02 15:24 ` Andrew Morton 2010-04-02 18:37 ` Linus Torvalds 2010-04-06 8:53 ` Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) KOSAKI Motohiro 1 sibling, 1 reply; 231+ messages in thread From: Andrew Morton @ 2010-04-02 15:24 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Rik van Riel, Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn, Minchan Kim, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I think this is likely due to the new scalable anon_vma linking by Rik. Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680 > Nothing else I can imagine should have introduced anything like it. > > Rik: the picures have the information, but you need to look at several to > see both the oops and the backtrace. Here's a condensed version: > > shrink_all_memory -> > do_try_to_free_pages -> > shrink_zone -> > shrink_inactive_list -> > shrink_page_list -> > page_referenced > > where page_referenced() oopses due page_referenced_anon() as per > Borislav's description below. > > Added all the usual suspects to the Cc list. Left the full report appended > so that the new people don't have to search for it on lkml. > > Linus > > On Fri, 2 Apr 2010, Borislav Petkov wrote: > > > > I've got the following oopsie two times now when hibernating - this > > means, I don't get it everytime I hibernate but only sometimes, say once > > in a blue moon. > > > > And yeah, I couldn't catch it over serial console so I had to make ugly > > pictures. 
By the way, the numbers in the filenames increment as I scroll > > down the whole oops (yep, it hadn't completely frozen and I still could > > do Shift->PgUp or Shift->PgDn on the console): > > > > http://www.kernel.org/pub/linux/kernel/people/bp/ > > > > So, here's what I could decipher from the oopsie, someone else who's > > more knowledgeable in mm, rmap and anon_vma's list traversal should be > > able to tell what goes wrong there. > > > > EIP is at page_referenced+0xee > > > > which is > > > > <disasm> > > 10c4: 41 01 c4 add %eax,%r12d > > 10c7: 83 7d cc 00 cmpl $0x0,-0x34(%rbp) > > 10cb: 74 19 je 10e6 <page_referenced+0xff> > > 10cd: 4d 8b 6d 20 mov 0x20(%r13),%r13 > > 10d1: 49 83 ed 20 sub $0x20,%r13 > > > > 10d5: 49 8b 45 20 mov 0x20(%r13),%rax <-------------- > > > > 10d9: 0f 18 08 prefetcht0 (%rax) > > 10dc: 49 8d 45 20 lea 0x20(%r13),%rax > > 10e0: 48 39 45 80 cmp %rax,-0x80(%rbp) > > </disasm> > > > > > > Corresponding asm: > > > > <asm> > > .loc 1 496 0 > > movq 32(%r13), %r13 # <variable>.same_anon_vma.next, __mptr.451 > > .LVL295: > > subq $32, %r13 #, avc > > .LVL296: > > .L184: > > .LBE1278: > > movq 32(%r13), %rax # <variable>.same_anon_vma.next, <variable>.same_anon_vma.next <---------------- > > prefetcht0 (%rax) # <variable>.same_anon_vma.next > > leaq 32(%r13), %rax #, tmp97 > > cmpq %rax, -128(%rbp) # tmp97, %sfp > > jne .L187 #, > > .L186: > > .loc 1 514 0 > > movq %r14, %rdi # anon_vma, > > call page_unlock_anon_vma # > > </asm> > > > > > > and the NULL pointer in question is being written into %r13 and then 32 > > is subtracted from it (I'm guessing container_of()). This is consistent > > with the register snapshot - %r13 contains 0xffffffffffffffe0 which is > > -32 and with the code dump in the oops, in CIMG1640.JPG code points to > > opcode 49 8b 45 20. > > > > Which is the following piece of code in <mm/rmap.c:page_referenced_anon()>. 
> > > > <source> > > > > mapcount = page_mapcount(page); > > list_for_each_entry(avc, &anon_vma->head, same_anon_vma) { > > struct vm_area_struct *vma = avc->vma; > > unsigned long address = vma_address(page, vma); > > if (address == -EFAULT) > > continue; > > > > </source> > > > > which tells us that same_anon_vma.next is NULL. Hmm... > > > > -- > > Regards/Gruss, > > Boris. > > ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-02 15:24 ` Andrew Morton @ 2010-04-02 18:37 ` Linus Torvalds 2010-04-02 22:01 ` Rik van Riel 0 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-02 18:37 UTC (permalink / raw) To: Andrew Morton Cc: Borislav Petkov, Rik van Riel, Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn, Minchan Kim, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Fri, 2 Apr 2010, Andrew Morton wrote: > On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > I think this is likely due to the new scalable anon_vma linking by Rik. > > Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680 Yup, looks like the same thing, except that bugzilla entry was due to swapping rather than hibernation and memory shrinking. But same end result, just different reasons for why we were trying to shrink the page lists. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-02 18:37 ` Linus Torvalds @ 2010-04-02 22:01 ` Rik van Riel 2010-04-03 0:19 ` Linus Torvalds 2010-04-04 16:12 ` Minchan Kim 0 siblings, 2 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-02 22:01 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Borislav Petkov, Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn, Minchan Kim, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/02/2010 02:37 PM, Linus Torvalds wrote: > On Fri, 2 Apr 2010, Andrew Morton wrote: >> On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds<torvalds@linux-foundation.org> wrote: >> >>> >>> I think this is likely due to the new scalable anon_vma linking by Rik. >> >> Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680 > > Yup, looks like the same thing, except that bugzilla entry was due to > swapping rather than hibernation and memory shrinking. But same end > result, just different reasons for why we were trying to shrink the page > lists. Interesting that it is a null pointer dereference, given that we do not zero out the anon_vma_chain structs before freeing them. Page_referenced_anon() takes the anon_vma->lock before walking the list. The three places where we modify the anon_vma_chain->same_anon_vma list, we also hold the lock. No doubt something in mm/ is doing something silly, but I have not found anything yet :( If I had to guess, I'd say maybe we got one of the mprotect & vma_adjust cases wrong. Maybe a page stayed around in the LRU (and in a process?) after its anon_vma already got freed? There has to be a reason why a very heavy AIM7 workload and some other stress tests did not trigger it, but a few people are able to trigger it on their systems... ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-02 22:01 ` Rik van Riel @ 2010-04-03 0:19 ` Linus Torvalds 2010-04-04 16:12 ` Minchan Kim 1 sibling, 0 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-03 0:19 UTC (permalink / raw) To: Rik van Riel Cc: Andrew Morton, Borislav Petkov, Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn, Minchan Kim, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Fri, 2 Apr 2010, Rik van Riel wrote: > > Interesting that it is a null pointer dereference, given > that we do not zero out the anon_vma_chain structs before > freeing them. > > Page_referenced_anon() takes the anon_vma->lock before > walking the list. The three places where we modify the > anon_vma_chain->same_anon_vma list, we also hold the > lock. So let's look at the individual anon_vma_chain entries instead. What is the protection of the 'vma->anon_vma_chain' list? In anon_vma_prepare(), the code implies that it is the page_table_lock, but what about anon_vma_clone()? If I'm reading it correctly, it is some odd mix of "mmap_sem held for writing" or "mmap_sem held for reading _and_ page_table_lock". And then we have the exit case that apparently has no locking at all, but that should hopefully be single-threaded. That thing is subtle. A few more comments about the locking would be good, so that people like me wouldn't have to try to guess the rules from reading the source. > There has to be a reason why a very heavy AIM7 workload > and some other stress tests did not trigger it, but a few > people are able to trigger it on their systems... I don't think AIM7 is at all a very interesting workload, and not likely to stress anything at all. Did your AIM7 test actually cause heavy swapping? I doubt it. Page swapout is where a lot of the magic happens, since that happens without mmap_sem held etc. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-02 22:01 ` Rik van Riel 2010-04-03 0:19 ` Linus Torvalds @ 2010-04-04 16:12 ` Minchan Kim 2010-04-04 17:24 ` Rik van Riel 2010-04-04 23:09 ` [PATCH] rmap: fix anon_vma_fork() memory leak Rik van Riel 1 sibling, 2 replies; 231+ messages in thread From: Minchan Kim @ 2010-04-04 16:12 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Andrew Morton, Borislav Petkov, Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson Hi, Rik. On Fri, 2010-04-02 at 18:01 -0400, Rik van Riel wrote: > On 04/02/2010 02:37 PM, Linus Torvalds wrote: > > On Fri, 2 Apr 2010, Andrew Morton wrote: > >> On Fri, 2 Apr 2010 11:09:14 -0700 (PDT) Linus Torvalds<torvalds@linux-foundation.org> wrote: > >> > >>> > >>> I think this is likely due to the new scalable anon_vma linking by Rik. > >> > >> Similar to https://bugzilla.kernel.org/show_bug.cgi?id=15680 > > > > Yup, looks like the same thing, except that bugzilla entry was due to > > swapping rather than hibernation and memory shrinking. But same end > > result, just different reasons for why we were trying to shrink the page > > lists. > > Interesting that it is a null pointer dereference, given > that we do not zero out the anon_vma_chain structs before > freeing them. > > Page_referenced_anon() takes the anon_vma->lock before > walking the list. The three places where we modify the > anon_vma_chain->same_anon_vma list, we also hold the > lock. > > No doubt something in mm/ is doing something silly, but > I have not found anything yet :( > > If I had to guess, I'd say maybe we got one of the > mprotect & vma_adjust cases wrong. Maybe a page stayed > around in the LRU (and in a process?) after its anon_vma > already got freed? While I review the code again due to this BUG, I found some strange thing. In anon_vma_fork, if anon_vma_clone is successful but anon_vma_alloc is failed, what happens? 
The parent VMA's anon_vmas have an anon_vma_chain that points to a vma which has been destroyed. I couldn't find any cleanup routine to remove this garbage. Am I missing something? But I think it isn't related to this bug because the oops point is not vma_address but anon_vma_chain.next. -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-04 16:12 ` Minchan Kim @ 2010-04-04 17:24 ` Rik van Riel 2010-04-04 23:09 ` [PATCH] rmap: fix anon_vma_fork() memory leak Rik van Riel 1 sibling, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-04 17:24 UTC (permalink / raw) To: Minchan Kim Cc: Linus Torvalds, Andrew Morton, Borislav Petkov, Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/04/2010 12:12 PM, Minchan Kim wrote: > While I review the code again due to this BUG, I found some strange > thing. > > In anon_vma_fork, if anon_vma_clone is successful but anon_vma_alloc is > failed, what happens? Parent VMA's anon_vmas have anon_vma_chain which > has vma which is destroyed. > I couldn't find any clean routine to remove this garbage. > I am missing something? Good catch. The parent VMA's anon_vmas will get delinked eventually, but we need to get rid of the newly allocated child anon_vmas. You found a hopefully rare memory leak... We need a call to unlink_anon_vmas(vma) at the error label to do that. > But I think it isn't related to this bug because oops point is not > vma_address but anon_vma_chain.next. Agreed, it's probably not it. ^ permalink raw reply [flat|nested] 231+ messages in thread
* [PATCH] rmap: fix anon_vma_fork() memory leak 2010-04-04 16:12 ` Minchan Kim 2010-04-04 17:24 ` Rik van Riel @ 2010-04-04 23:09 ` Rik van Riel 2010-04-04 23:56 ` Minchan Kim 2010-04-05 15:37 ` Linus Torvalds 1 sibling, 2 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-04 23:09 UTC (permalink / raw) To: Minchan Kim Cc: Linus Torvalds, Andrew Morton, Borislav Petkov, Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson Fix a memory leak in anon_vma_fork(), where we fail to tear down the anon_vmas attached to the new VMA in case setting up the new anon_vma fails. Reported-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Rik van Riel <riel@redhat.com> diff --git a/mm/rmap.c b/mm/rmap.c index fcd593c..fb7ce99 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -231,6 +231,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) out_error_free_anon_vma: anon_vma_free(anon_vma); + unlink_anon_vmas(vma); out_error: return -ENOMEM; } ^ permalink raw reply related [flat|nested] 231+ messages in thread
* Re: [PATCH] rmap: fix anon_vma_fork() memory leak 2010-04-04 23:09 ` [PATCH] rmap: fix anon_vma_fork() memory leak Rik van Riel @ 2010-04-04 23:56 ` Minchan Kim 2010-04-05 15:37 ` Linus Torvalds 1 sibling, 0 replies; 231+ messages in thread From: Minchan Kim @ 2010-04-04 23:56 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Andrew Morton, Borislav Petkov, Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Mon, Apr 5, 2010 at 8:09 AM, Rik van Riel <riel@redhat.com> wrote: > Fix a memory leak in anon_vma_fork(), where we fail to tear down the > anon_vmas attached to the new VMA in case setting up the new anon_vma > fails. > > Reported-by: Minchan Kim <minchan.kim@gmail.com> > Signed-off-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH] rmap: fix anon_vma_fork() memory leak 2010-04-04 23:09 ` [PATCH] rmap: fix anon_vma_fork() memory leak Rik van Riel 2010-04-04 23:56 ` Minchan Kim @ 2010-04-05 15:37 ` Linus Torvalds 2010-04-05 15:48 ` Minchan Kim ` (2 more replies) 1 sibling, 3 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-05 15:37 UTC (permalink / raw) To: Rik van Riel Cc: Minchan Kim, Andrew Morton, Borislav Petkov, Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Sun, 4 Apr 2010, Rik van Riel wrote:
>
> Fix a memory leak in anon_vma_fork(), where we fail to tear down the
> anon_vmas attached to the new VMA in case setting up the new anon_vma
> fails.
>
> Reported-by: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> ---
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index fcd593c..fb7ce99 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -231,6 +231,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>
>  out_error_free_anon_vma:
>  	anon_vma_free(anon_vma);
> +	unlink_anon_vmas(vma);
>  out_error:
>  	return -ENOMEM;
>  }

This looks _very_ wrong to me.

Shouldn't the unlink_anon_vmas() be in the "out_error" case? IOW, we
should do it even if the "anon_vma_alloc()" failed, not just if the
"anon_vma_chain_alloc()" failed?

No?

What am I missing?

		Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH] rmap: fix anon_vma_fork() memory leak 2010-04-05 15:37 ` Linus Torvalds @ 2010-04-05 15:48 ` Minchan Kim 2010-04-05 16:04 ` Rik van Riel 2010-04-05 16:13 ` [PATCH -v2] " Rik van Riel 2 siblings, 0 replies; 231+ messages in thread From: Minchan Kim @ 2010-04-05 15:48 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Andrew Morton, Borislav Petkov, Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, Apr 6, 2010 at 12:37 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Sun, 4 Apr 2010, Rik van Riel wrote: >> >> Fix a memory leak in anon_vma_fork(), where we fail to tear down the >> anon_vmas attached to the new VMA in case setting up the new anon_vma >> fails. >> >> Reported-by: Minchan Kim <minchan.kim@gmail.com> >> Signed-off-by: Rik van Riel <riel@redhat.com> >> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> >> --- >> >> diff --git a/mm/rmap.c b/mm/rmap.c >> index fcd593c..fb7ce99 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -231,6 +231,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) >> >> out_error_free_anon_vma: >> anon_vma_free(anon_vma); >> + unlink_anon_vmas(vma); >> out_error: >> return -ENOMEM; >> } > > This looks _very_ wrong to me. > > Shouldn't the unlink_anon_vmas() be in the "out_error" case? IOW, we > should do it even if the "anon_vma_alloc()" failed, nbot just if the > "anon_vma_chain_alloc()" failed? > > No? > > What am I missing? Indeed. You're right. I should have been reviewed more carefully. -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH] rmap: fix anon_vma_fork() memory leak 2010-04-05 15:37 ` Linus Torvalds 2010-04-05 15:48 ` Minchan Kim @ 2010-04-05 16:04 ` Rik van Riel 2010-04-05 16:13 ` [PATCH -v2] " Rik van Riel 2 siblings, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-05 16:04 UTC (permalink / raw) To: Linus Torvalds Cc: Minchan Kim, Andrew Morton, Borislav Petkov, Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/05/2010 11:37 AM, Linus Torvalds wrote: > This looks _very_ wrong to me. > > Shouldn't the unlink_anon_vmas() be in the "out_error" case? Indeed it should. I've had my mind somewhere else this weekend :/ New patch in the next mail. ^ permalink raw reply [flat|nested] 231+ messages in thread
* [PATCH -v2] rmap: fix anon_vma_fork() memory leak 2010-04-05 15:37 ` Linus Torvalds 2010-04-05 15:48 ` Minchan Kim 2010-04-05 16:04 ` Rik van Riel @ 2010-04-05 16:13 ` Rik van Riel 2 siblings, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-05 16:13 UTC (permalink / raw) To: Linus Torvalds Cc: Minchan Kim, Andrew Morton, Borislav Petkov, Linux Kernel Mailing List, KOSAKI Motohiro, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

Fix a memory leak in anon_vma_fork(), where we fail to tear down the
anon_vmas attached to the new VMA in case setting up the new anon_vma
fails.

This bug also has the potential to leave behind anon_vma_chain structs
with pointers to invalid memory.

Reported-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Rik van Riel <riel@redhat.com>

diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..eaa7a09 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -232,6 +232,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 out_error_free_anon_vma:
 	anon_vma_free(anon_vma);
 out_error:
+	unlink_anon_vmas(vma);
 	return -ENOMEM;
 }

^ permalink raw reply related [flat|nested] 231+ messages in thread
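[Editorial sketch, not part of the thread.] The difference between the two placements of unlink_anon_vmas() can be shown with a small user-space model of the goto unwind. Everything here is an assumption made for illustration: model_fork(), model_unlink() and the allocation counter are hypothetical stand-ins, not kernel code, and the real anon_vma_fork() does more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy model: count outstanding allocations instead of using kernel objects. */
static int outstanding;

static void *model_alloc(bool fail)
{
	void *p = fail ? NULL : malloc(1);

	if (p)
		outstanding++;
	return p;
}

static void model_free(void *p)
{
	if (p) {
		outstanding--;
		free(p);
	}
}

struct model_vma {
	void *cloned;	/* stands in for chain entries attached by the clone step */
	void *anon;	/* stands in for the new anon_vma */
	void *avc;	/* stands in for the new chain entry */
};

/* Stands in for unlink_anon_vmas(): tear down what the clone step attached. */
static void model_unlink(struct model_vma *vma)
{
	model_free(vma->cloned);
	vma->cloned = NULL;
}

/*
 * fail_step: 1 = the first allocation fails, 2 = the second one fails.
 * unlink_at_out_error: true models the -v2 placement, false the first patch.
 */
static int model_fork(struct model_vma *vma, int fail_step,
		      bool unlink_at_out_error)
{
	void *anon, *avc;

	vma->cloned = model_alloc(false);	/* clone step succeeded */

	anon = model_alloc(fail_step == 1);
	if (!anon)
		goto out_error;
	avc = model_alloc(fail_step == 2);
	if (!avc)
		goto out_error_free_anon;

	vma->anon = anon;
	vma->avc = avc;
	return 0;

out_error_free_anon:
	model_free(anon);
	if (!unlink_at_out_error)
		model_unlink(vma);	/* first patch: skipped by "goto out_error" */
out_error:
	if (unlink_at_out_error)
		model_unlink(vma);	/* -v2: reached from both failure paths */
	return -1;
}

/* Run one failure scenario and report how many objects were left behind. */
static int leaked_after_failure(int fail_step, bool unlink_at_out_error)
{
	struct model_vma vma = { NULL, NULL, NULL };
	int ret, leaked;

	outstanding = 0;
	ret = model_fork(&vma, fail_step, unlink_at_out_error);
	assert(ret == -1);
	leaked = outstanding;
	model_unlink(&vma);	/* tidy up so the sketch itself does not leak */
	return leaked;
}
```

In this model, leaked_after_failure(1, false) reports one object left behind, which is the case Linus points at: a failure of the first allocation jumps straight to out_error and never reaches the cleanup placed above that label. With the -v2 placement both failure paths report zero.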
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-02 18:09 ` Linus Torvalds 2010-04-02 15:24 ` Andrew Morton @ 2010-04-06 8:53 ` KOSAKI Motohiro 2010-04-06 10:09 ` KOSAKI Motohiro 2010-04-06 14:38 ` Rik van Riel 1 sibling, 2 replies; 231+ messages in thread From: KOSAKI Motohiro @ 2010-04-06 8:53 UTC (permalink / raw) To: Linus Torvalds Cc: kosaki.motohiro, Borislav Petkov, Rik van Riel, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Minchan Kim, Nick Piggin, Andrea Arcangeli, Hugh Dickins

>
> I think this is likely due to the new scalable anon_vma linking by Rik.
> Nothing else I can imagine should have introduced anything like it.
>
> Rik: the pictures have the information, but you need to look at several to
> see both the oops and the backtrace. Here's a condensed version:
>
>   shrink_all_memory ->
>     do_try_to_free_pages ->
>       shrink_zone ->
>         shrink_inactive_list ->
>           shrink_page_list ->
>             page_referenced
>
> where page_referenced() oopses due to page_referenced_anon() as per
> Borislav's description below.
>
> Added all the usual suspects to the Cc list. Left the full report appended
> so that the new people don't have to search for it on lkml.

Today I reviewed this patch carefully, but I haven't found any bug.

1) anon_vma->list is always protected by anon_vma->lock.
2) Even if someone forgets to take the lock, list_add() and/or list_del()
   never assign NULL.

So a NULL means one of three possibilities:

 a) we see uninitialized data
 b) we see already-freed data
 c) we see memory corruption caused by another bug

But (a) can't happen, because

static inline void __list_add()
{
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;	(*)
}

By the time (*) links an uninitialized entry into the avc list,
new->next has already been set to a non-NULL value.

(b) is also impossible. SLAB_DESTROY_BY_RCU delays freeing the page that
backs the anon_vma until the next RCU grace period. That means
rcu_read_lock() + page_mapped() can see a kfree()ed page, but that is
safe; no one corrupts it.

Now I suspect (c) ;-)

Also, I ran a stress workload with shrink_all_memory() today, but I
couldn't reproduce the issue. Hmm... (Perhaps I'm just not a lucky guy;
I frequently fail to reproduce bugs.)

I'll continue to work on it.

^ permalink raw reply [flat|nested] 231+ messages in thread
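[Editorial sketch, not part of the thread.] The argument about case (a) rests on the order of the four stores in __list_add(). A minimal user-space model, assuming a hypothetical struct node in place of the kernel's struct list_head and using "entry" for the argument the message calls "new", shows that entries whose link fields start out as NULL never leave a NULL link on the list once inserted:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the kernel's struct list_head. */
struct node {
	struct node *next, *prev;
};

/* The four stores from the message, in the same order. */
static void node_insert(struct node *entry, struct node *prev,
			struct node *next)
{
	next->prev = entry;
	entry->next = next;
	entry->prev = prev;
	prev->next = entry;	/* (*) only now can a forward walk reach entry */
}

static void node_add_tail(struct node *entry, struct node *head)
{
	node_insert(entry, head->prev, head);
}

/*
 * Walk forward from head. Returns the number of entries, or -1 if a NULL
 * link is met or the walk does not return to head within 'limit' steps.
 */
static int walk_count(const struct node *head, int limit)
{
	const struct node *pos = head->next;
	int n = 0;

	while (pos != head) {
		if (pos == NULL || n >= limit)
			return -1;
		pos = pos->next;
		n++;
	}
	return n;
}

/* Insert entries whose link fields start out as NULL. */
static int demo_null_initialised_entries(void)
{
	struct node head = { &head, &head };
	struct node a = { NULL, NULL }, b = { NULL, NULL }, c = { NULL, NULL };

	node_add_tail(&a, &head);
	node_add_tail(&b, &head);
	node_add_tail(&c, &head);

	/* Every link was overwritten by the insert itself. */
	assert(a.next && a.prev && b.next && b.prev && c.next && c.prev);
	return walk_count(&head, 16);
}
```

This only covers the single-threaded store order. It says nothing about what a concurrent walker that does not hold the lock might observe, which is the part the rest of the thread argues about.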
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 8:53 ` Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) KOSAKI Motohiro @ 2010-04-06 10:09 ` KOSAKI Motohiro 2010-04-06 14:34 ` Rik van Riel 2010-04-06 14:38 ` Rik van Riel 1 sibling, 1 reply; 231+ messages in thread From: KOSAKI Motohiro @ 2010-04-06 10:09 UTC (permalink / raw) To: KOSAKI Motohiro Cc: kosaki.motohiro, Linus Torvalds, Borislav Petkov, Rik van Riel, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Minchan Kim, Nick Piggin, Andrea Arcangeli, Hugh Dickins

> (b) is also impossible. SLAB_DESTROY_BY_RCU delay the page for anon_vma
> freeing until next rcu period. It mean rcu_read_lock()+page_mapped()
> can see kfree()ed page. but it is safe. noone corrupt it.

By the way, I don't understand why Rik's per-process anon_vma concept
works correctly with KSM. KSM increases anon_vma->ksm_refcount, but it
does not seem guaranteed that vma->anon_vma and page->anon_vma are the
same.

But I guess the bug reporter doesn't use KSM; it's a minor feature.

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 10:09 ` KOSAKI Motohiro @ 2010-04-06 14:34 ` Rik van Riel 0 siblings, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-06 14:34 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Linus Torvalds, Borislav Petkov, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Minchan Kim, Nick Piggin, Andrea Arcangeli, Hugh Dickins On 04/06/2010 06:09 AM, KOSAKI Motohiro wrote: >> (b) is also impossible. SLAB_DESTROY_BY_RCU delay the page for anon_vma >> freeing until next rcu period. It mean rcu_read_lock()+page_mapped() >> can see kfree()ed page. but it is safe. noone corrupt it. > > by the way: I haven't understand why rik's per process anon_vma concept > works correctly with ksm. ksm increase anon_vma->ksm_refcount. but it seems > not guranteed vma->anon_vma and page->anon_vma are the same. KSM removes the page from its original anon_vma. If the page gets reinstantiated (copy on write), it will be created in the vma->anon_vma. Am I overlooking something? ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 8:53 ` Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) KOSAKI Motohiro 2010-04-06 10:09 ` KOSAKI Motohiro @ 2010-04-06 14:38 ` Rik van Riel 2010-04-06 15:34 ` Minchan Kim 2010-04-06 17:05 ` Borislav Petkov 1 sibling, 2 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-06 14:38 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Linus Torvalds, Borislav Petkov, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Minchan Kim, Nick Piggin, Andrea Arcangeli, Hugh Dickins On 04/06/2010 04:53 AM, KOSAKI Motohiro wrote: > Today, I've reviewed this patch carefully. but I haven't found any bug. > Also, I've runned stress workload with shrink_all_memory() today. but > I couldn't reproduce the issue. hmm.. (perhaps I'm no lucky guy. > I'm frequently fail to reproduce) > > I'll continue to work. My status with this bug is the same - I have gone through the code from all angles, but have not found any other bugs yet (except for that leak - which could leave invalid pointers behind). This makes me wonder if perhaps the bug is a side effect of something Borislav (and the other reproducers) have in their kernel configuration, which we do not have. Another (unlikely) thing is that the fix for the leak makes the bug go away. Yes, very unlikely. Borislav, could you please send us your .config ? Also, if you have the time, could you try out the patch (-v2) I mailed in a little up this thread that fixes the memory leak in anon_vma_fork? I suspect it should not change anything, but it could be useful to rule out anyway. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 14:38 ` Rik van Riel @ 2010-04-06 15:34 ` Minchan Kim 2010-04-06 15:40 ` Rik van Riel 2010-04-06 15:55 ` Linus Torvalds 2010-04-06 17:05 ` Borislav Petkov 1 sibling, 2 replies; 231+ messages in thread From: Minchan Kim @ 2010-04-06 15:34 UTC (permalink / raw) To: Rik van Riel Cc: KOSAKI Motohiro, Linus Torvalds, Borislav Petkov, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins On Tue, 2010-04-06 at 10:38 -0400, Rik van Riel wrote: > On 04/06/2010 04:53 AM, KOSAKI Motohiro wrote: > > > Today, I've reviewed this patch carefully. but I haven't found any bug. > > > Also, I've runned stress workload with shrink_all_memory() today. but > > I couldn't reproduce the issue. hmm.. (perhaps I'm no lucky guy. > > I'm frequently fail to reproduce) > > > > I'll continue to work. > > My status with this bug is the same - I have gone through > the code from all angles, but have not found any other bugs > yet (except for that leak - which could leave invalid > pointers behind). Let's see the unlink_anon_vmas. 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma) 2. anon_vma_unlink 3. spin_lock(anon_vma->lock) <-- HERE LOCK. 4. list_del(anon_vma_chain->same_anon_vma); What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another anon_vma object between 2 and 3? I mean how to make sure 3) does lock valid anon_vma? I hope it is culprit. -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 15:34 ` Minchan Kim @ 2010-04-06 15:40 ` Rik van Riel 2010-04-06 15:58 ` Minchan Kim 2010-04-06 15:55 ` Linus Torvalds 1 sibling, 1 reply; 231+ messages in thread From: Rik van Riel @ 2010-04-06 15:40 UTC (permalink / raw) To: Minchan Kim Cc: KOSAKI Motohiro, Linus Torvalds, Borislav Petkov, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins On 04/06/2010 11:34 AM, Minchan Kim wrote: > Let's see the unlink_anon_vmas. > > 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma) > 2. anon_vma_unlink > 3. spin_lock(anon_vma->lock)<-- HERE LOCK. > 4. list_del(anon_vma_chain->same_anon_vma); > > What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another > anon_vma object between 2 and 3? > I mean how to make sure 3) does lock valid anon_vma? > > I hope it is culprit. How can the anon_vma get destroyed and reused, when this anon_vma_chain still has a reference to it (and the anon_vma has not been freed yet)? What combination of circumstances is necessary for your bug hypothetical to happen? ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 15:40 ` Rik van Riel @ 2010-04-06 15:58 ` Minchan Kim 0 siblings, 0 replies; 231+ messages in thread From: Minchan Kim @ 2010-04-06 15:58 UTC (permalink / raw) To: Rik van Riel Cc: KOSAKI Motohiro, Linus Torvalds, Borislav Petkov, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins On Tue, 2010-04-06 at 11:40 -0400, Rik van Riel wrote: > On 04/06/2010 11:34 AM, Minchan Kim wrote: > > > Let's see the unlink_anon_vmas. > > > > 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma) > > 2. anon_vma_unlink > > 3. spin_lock(anon_vma->lock)<-- HERE LOCK. > > 4. list_del(anon_vma_chain->same_anon_vma); > > > > What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another > > anon_vma object between 2 and 3? > > I mean how to make sure 3) does lock valid anon_vma? > > > > I hope it is culprit. > > How can the anon_vma get destroyed and reused, when this > anon_vma_chain still has a reference to it (and the Doesn't anon_vma_chain have a ref counter on anon_vma? > anon_vma has not been freed yet)? AFAIK, anon_vma can be reused without free by SLAB_XXX_RCU. So we always use it carefully by page_lock_anon_vma or manual check with RCU and page_mapped. What am I missing? > > What combination of circumstances is necessary for > your bug hypothetical to happen? CPU A CPU B unlink_anon_vmas list_for_each_entry free_pgtable anon_vma_unlink <crazy stall> spin_lock(anon_vma); list_del(same_anon_vma) spin_unlock(anon_vma) anon_vma_unlink anon_vma_free reuse for another anon_vma spin_lock(another anon_vma) list_del(another anon_vma) If my assumption is wrong, please correct me. Thanks, Rik. -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 15:34 ` Minchan Kim 2010-04-06 15:40 ` Rik van Riel @ 2010-04-06 15:55 ` Linus Torvalds 2010-04-06 16:23 ` Minchan Kim 2010-04-07 8:37 ` Peter Zijlstra 1 sibling, 2 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-06 15:55 UTC (permalink / raw) To: Minchan Kim Cc: Rik van Riel, KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins

On Wed, 7 Apr 2010, Minchan Kim wrote:
>
> Let's see the unlink_anon_vmas.
>
> 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma)
> 2. anon_vma_unlink
> 3. spin_lock(anon_vma->lock) <-- HERE LOCK.
> 4. list_del(anon_vma_chain->same_anon_vma);
>
> What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another
> anon_vma object between 2 and 3?
> I mean how to make sure 3) does lock valid anon_vma?
>
> I hope it is culprit.

I don't think so. That isn't the racy case. We're working with an
anon_vma_chain, so the anon_vma is all there.

The racy case is when we look up an anon_vma by the page, and the page gets
unmapped at the same time because somebody else is travelling over the LRU
list of the page itself, isn't it?

I do wonder if "page_lock_anon_vma()" should check the whole
"page_mapped()" case _after_ taking the anon_vma lock. Because if the race
happens, we're following an anon_vma list that has nothing to do with that
page (it's still a _valid_ list, since we locked the anon_vma, but will it
be ok?)

IOW, what is it that really keeps the anon_vma list reliable _and_
relevant wrt the page? We know we may get a stale anon_vma, are we ok if
that anon_vma list doesn't actually have anything to do with the page any
more?

I think the first check in "page_address_in_vma()" protects us, but
whatever.

However, that made me look at the PAGE_MIGRATION case. That seems to be
just broken. It's doing that page_anon_vma() + spin_lock without holding
any RCU locks, so there is no guarantee that anon_vma there is at all
valid.

Is that function always called with rcu_read_lock()?

		Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 15:55 ` Linus Torvalds @ 2010-04-06 16:23 ` Minchan Kim 2010-04-06 16:28 ` Linus Torvalds 2010-04-06 16:32 ` Linus Torvalds 2010-04-07 8:37 ` Peter Zijlstra 1 sibling, 2 replies; 231+ messages in thread From: Minchan Kim @ 2010-04-06 16:23 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins Hi, Linus. On Tue, 2010-04-06 at 08:55 -0700, Linus Torvalds wrote: > > On Wed, 7 Apr 2010, Minchan Kim wrote: > > > > Let's see the unlink_anon_vmas. > > > > 1. list_for_each_entry_safe(avc,next, vma->anon_vma_chain, same_vma) > > 2. anon_vma_unlink > > 3. spin_lock(anon_vma->lock) <-- HERE LOCK. > > 4. list_del(anon_vma_chain->same_anon_vma); > > > > What if anon_vma is destroyed and reuse by SLAB_XXX_RCU for another > > anon_vma object between 2 and 3? > > I mean how to make sure 3) does lock valid anon_vma? > > > > I hope it is culprit. > > I don't think so. That isn't the racy case. We're working with a > anon_vma_chain, so the anonvma is all there. > But the anon_vma is using for another anon_vma. Nonetheless, anon_vma_unlink does list_del(anon_vma's same_anon_vma). I doubt it. > The racy case is when we look up an anonvma by the page, and the page gets > unmapped at the same time because somebody else is travelling over the LRU > list of the page itself, isn't it? Yes. but I thought page might travel with anon_vmas which have same_anon_vma deleted by race. > > I do wonder if "page_lock_anon_vma()" should check the whole > "page_mapped()" case _after_ taking the anon_vma lock. Because if the race > happens, we're following a anon_vma list that has nothing to do with that > page (it's stilla _valid_ list, since we locked the anon_vma, but will it > be ok?) 
So we always use it with (vma_address and page_check_address) to make sure validation of anon_vma. But I think it's not good design. I want to hold lock ahead checking of page_mapped but maybe performance issue? I am not sure. > > IOW, what is it that really keeps the anon_vma list reliable _and_ > relevant wrt the page? We know we may get a stale anon_vma, are we ok if > that anon_vma list doesn't actually have anything to do with the page any > more? > I think the first check in "page_address_in_vma()" protects us, but > whatever. > > However, that made me look at the PAGE_MIGRATION case. That seems to be > just broken. It's doing that page_anon_vma() + spin_lock without holding > any RCU locks, so there is no guarantee that anon_vma there is at all > valid. FYI, recently there is a patch about migration case. http://lkml.org/lkml/2010/4/2/145 > > Is that function always called with rcu_read_lock()? > > Linus -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 16:23 ` Minchan Kim @ 2010-04-06 16:28 ` Linus Torvalds 2010-04-06 16:45 ` Minchan Kim 2010-04-06 16:32 ` Linus Torvalds 1 sibling, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-06 16:28 UTC (permalink / raw) To: Minchan Kim Cc: Rik van Riel, KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins

On Wed, 7 Apr 2010, Minchan Kim wrote:
>
> >
> > However, that made me look at the PAGE_MIGRATION case. That seems to be
> > just broken. It's doing that page_anon_vma() + spin_lock without holding
> > any RCU locks, so there is no guarantee that anon_vma there is at all
> > valid.
>
> FYI, recently there is a patch about migration case.
> http://lkml.org/lkml/2010/4/2/145

No, I'm talking about rmap_walk_anon():

	anon_vma = page_anon_vma(page);
	if (!anon_vma)
		return ret;
	spin_lock(&anon_vma->lock);

which seems to be simply buggy. The anon_vma may not exist any more,
because an RCU event might have really freed the page between looking it
up and locking it.

		Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 16:28 ` Linus Torvalds @ 2010-04-06 16:45 ` Minchan Kim 2010-04-06 16:53 ` Linus Torvalds 0 siblings, 1 reply; 231+ messages in thread From: Minchan Kim @ 2010-04-06 16:45 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins On Tue, 2010-04-06 at 09:28 -0700, Linus Torvalds wrote: > > On Wed, 7 Apr 2010, Minchan Kim wrote: > > > > > > However, that made me look at the PAGE_MIGRATION case. That seems to be > > > just broken. It's doing that page_anon_vma() + spin_lock without holding > > > any RCU locks, so there is no guarantee that anon_vma there is at all > > > valid. > > > > FYI, recently there is a patch about migration case. > > http://lkml.org/lkml/2010/4/2/145 > > No, I'm talking about rmap_walk_anon(): > > anon_vma = page_anon_vma(page); > if (!anon_vma) > return ret; > spin_lock(&anon_vma->lock); > > which seems to be simply buggy. The anon_vma may not exist any more, > because an RCU event might have really freed the page between looking it > up and locking it. > > Linus unmap_and_move remove_migration_ptes rmap_walk rmap_walk_anon We always has rcu_read_lock about anon page in unmap_and_move. So I think it's not buggy. What am I missing? -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 16:45 ` Minchan Kim @ 2010-04-06 16:53 ` Linus Torvalds 2010-04-06 17:04 ` Rik van Riel 0 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-06 16:53 UTC (permalink / raw) To: Minchan Kim Cc: Rik van Riel, KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins On Wed, 7 Apr 2010, Minchan Kim wrote: > > unmap_and_move > remove_migration_ptes > rmap_walk > rmap_walk_anon > > We always has rcu_read_lock about anon page in unmap_and_move. > So I think it's not buggy. What am I missing? Ok, in that case it's fine. However, it does bring back my comment about all those anonvma changes: the locking is totally undocumented. Why isn't there a thing _saying_ that it's ok because of this? Why is there no comment about the locking of that 'same_vma' / 'vma->anon_vma_chain' except for the totally nonsensical one about page_table_lock (which doesn't protect _any_ of the other cases)? Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 16:53 ` Linus Torvalds @ 2010-04-06 17:04 ` Rik van Riel 2010-04-06 18:28 ` Linus Torvalds 0 siblings, 1 reply; 231+ messages in thread From: Rik van Riel @ 2010-04-06 17:04 UTC (permalink / raw) To: Linus Torvalds Cc: Minchan Kim, KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins On 04/06/2010 12:53 PM, Linus Torvalds wrote: > On Wed, 7 Apr 2010, Minchan Kim wrote: >> >> unmap_and_move >> remove_migration_ptes >> rmap_walk >> rmap_walk_anon >> >> We always has rcu_read_lock about anon page in unmap_and_move. >> So I think it's not buggy. What am I missing? > > Ok, in that case it's fine. > > However, it does bring back my comment about all those anonvma changes: > the locking is totally undocumented. > > Why isn't there a thing _saying_ that it's ok because of this? > > Why is there no comment about the locking of that 'same_vma' / > 'vma->anon_vma_chain' except for the totally nonsensical one about > page_table_lock (which doesn't protect _any_ of the other cases)? Which other cases? When do we ever walk the "same_vma" list not from the context of the process owning the vma? This bug in page_referenced is walking the "same_anon_vma" list, which is locked with the anon_vma->lock. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 17:04 ` Rik van Riel @ 2010-04-06 18:28 ` Linus Torvalds 2010-04-06 19:03 ` Andrew Morton 2010-04-07 8:36 ` Peter Zijlstra 0 siblings, 2 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-06 18:28 UTC (permalink / raw) To: Rik van Riel Cc: Minchan Kim, KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins

On Tue, 6 Apr 2010, Rik van Riel wrote:
>
> Which other cases? When do we ever walk the "same_vma" list
> not from the context of the process owning the vma?

That's the point. What does 'owning the vma' mean? That's exactly what I'm
asking to be documented.

Quite frankly, the thing is a mess. There is _no_ comment on why it's ok
to modify the list or walk the list, except for the one totally misleading
one, since the page_table_lock has at most a _secondary_ meaning in the
whole ownership (ie it is used only when we do _not_ own the vma chain
exclusively).

So your very comment shows the whole confusion. No, we do not "own the
vma" in all cases. Sometimes we just have a read-lock on it.

> This bug in page_referenced is walking the "same_anon_vma" list,
> which is locked with the anon_vma->lock.

Umm. Wake the hell up, Rik!

It's walking a _corrupt_ same_anon_vma list.

In other words, we _know_ that the 'anon_vma_chain' entry is crap. We know
that exactly because it contains "impossible" values with regard to the
list.

And what's the easiest way to get such a corrupt list, considering that
the locking looks correct for that particular list?

That's right: by having something like anon_vma_clone() do something bad
when it walks the same avc entries using the 'same_vma' list and creates
copies of it.

You can't just say "but but but same_anon_vma list is always locked
properly". Because it doesn't matter if that list is locked properly if
walking _another_ list doesn't work right.

I really don't understand why you keep on harping on that same_anon_vma
list. The fact that that was the corrupt list IN ABSOLUTELY NO WAY implies
that that is the list that caused the corruption.

For example, let's say that the 'anon_vma_chain' list is corrupted. Never
mind how. So what could happen is that you'd have vma->anon_vma pointing
to one thing, and one or more entries on the 'vma->anon_vma_chain' list
pointing to _another_ anon_vma. What happens then?

I have no idea. Maybe nothing bad. But the point is, if one avc list is
corrupted and you may end up referencing those avc's in unexpected cases,
how can you trust the other list that is in the same data structure?

For example, maybe some list corruption causes us to do that
"anon_vma_chain_link()" _twice_ on the same avc entry. So we do that
"list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that
already had "same_anon_vma" on one list.

No, I really don't see how that could happen, but my argument is that a
corrupt list can do odd things. The same entry might end up pointing to
itself, so that you end up freeing it twice or something.

Just as an example of the kind of code that makes me worry:

	void unlink_anon_vmas(struct vm_area_struct *vma)
	{
		struct anon_vma_chain *avc, *next;

		/* Unlink each anon_vma chained to the VMA. */
		list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
			anon_vma_unlink(avc);
			list_del(&avc->same_vma);
			anon_vma_chain_free(avc);
		}
	}

Now, think about what happens for the *last* entry in that avc chain. It
will call that "anon_vma_unlink()" thing, which will delete perhaps the
last entry in the "same_anon_vma" one, and then it does

	if (empty)
		anon_vma_free(anon_vma);

*before* unlink_anon_vmas() has actually done that

	list_del(&avc->same_vma);

and what we essentially have is a stale anon_vma_chain entry that still
exists on that same_vma list, and points to an anon_vma that already got
deleted.

Does it matter? I really can't see that it does. But that's the kind of
thing that makes me nervous. It makes me _especially_ nervous when the
whole locking for that anon_vma_chain thing isn't entirely obvious.

		Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
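[Editorial sketch, not part of the thread.] The ordering worry in unlink_anon_vmas() can be made visible with a toy model. The structs, the chain_users counter and the freed flag below are all hypothetical. The kernel has no such flag, and the model only records that the window exists. It does not show that anything can observe it, which matches the "Does it matter? I really can't see that it does" in the message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model objects; 'freed' is a flag here, not real memory reuse. */
struct model_anon_vma {
	int chain_users;	/* chain entries still linked to this anon_vma */
	bool freed;
};

struct model_avc {
	struct model_anon_vma *anon_vma;
	bool on_same_vma_list;	/* still linked on the per-VMA list */
};

/* Models anon_vma_unlink(): drop our link, free the anon_vma if we were last. */
static void model_anon_vma_unlink(struct model_avc *avc)
{
	struct model_anon_vma *av = avc->anon_vma;

	assert(!av->freed && av->chain_users > 0);
	if (--av->chain_users == 0)
		av->freed = true;	/* models "if (empty) anon_vma_free()" */
}

/*
 * Same order as the quoted loop body. Returns how many entries were, for a
 * moment, still on the per-VMA list while pointing at a freed anon_vma.
 */
static int model_unlink_all(struct model_avc *chain, int n)
{
	int stale = 0;

	for (int i = 0; i < n; i++) {
		model_anon_vma_unlink(&chain[i]);
		if (chain[i].on_same_vma_list && chain[i].anon_vma->freed)
			stale++;
		chain[i].on_same_vma_list = false;	/* models list_del(&avc->same_vma) */
	}
	return stale;
}

static int demo_stale_window(void)
{
	struct model_anon_vma shared = { 2, false };	/* also linked from another VMA */
	struct model_anon_vma own = { 1, false };	/* only this VMA links to it */
	struct model_avc chain[2] = { { &shared, true }, { &own, true } };

	return model_unlink_all(chain, 2);
}
```

In this model demo_stale_window() counts one such entry: the one whose anon_vma had no other users. Swapping the two steps in the loop body of the model would make the count zero, but whether that reordering is appropriate for the real function is not something the sketch can answer.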
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 18:28 ` Linus Torvalds @ 2010-04-06 19:03 ` Andrew Morton 2010-04-06 19:10 ` Steinar H. Gunderson ` (2 more replies) 2010-04-07 8:36 ` Peter Zijlstra 1 sibling, 3 replies; 231+ messages in thread From: Andrew Morton @ 2010-04-06 19:03 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Minchan Kim, KOSAKI Motohiro, Borislav Petkov, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, 6 Apr 2010 11:28:52 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > For example, maybe some list corruption causes us to do that > "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that > "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that > already had "same_anon_vma" on one list. The lib/list_debug.c stuff might detect such things. I wonder if either Borislav or Steinar had CONFIG_DEBUG_LIST enabled? ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 19:03 ` Andrew Morton @ 2010-04-06 19:10 ` Steinar H. Gunderson 2010-04-06 19:10 ` Linus Torvalds 2010-04-06 19:42 ` Borislav Petkov 2 siblings, 0 replies; 231+ messages in thread From: Steinar H. Gunderson @ 2010-04-06 19:10 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Borislav Petkov, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins On Tue, Apr 06, 2010 at 12:03:15PM -0700, Andrew Morton wrote: >> For example, maybe some list corruption causes us to do that >> "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that >> "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that >> already had "same_anon_vma" on one list. > The lib/list_debug.c stuff might detect such things. I wonder if > either Borislav or Steinar had CONFIG_DEBUG_LIST enabled? Not set on my kernel. /* Steinar */ -- Homepage: http://www.sesse.net/ ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 19:03 ` Andrew Morton 2010-04-06 19:10 ` Steinar H. Gunderson @ 2010-04-06 19:10 ` Linus Torvalds 2010-04-06 19:35 ` Linus Torvalds 2010-04-06 19:42 ` Borislav Petkov 2 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-06 19:10 UTC (permalink / raw) To: Andrew Morton Cc: Rik van Riel, Minchan Kim, KOSAKI Motohiro, Borislav Petkov, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, 6 Apr 2010, Andrew Morton wrote: > On Tue, 6 Apr 2010 11:28:52 -0700 (PDT) > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > For example, maybe some list corruption causes us to do that > > "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that > > "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that > > already had "same_anon_vma" on one list. > > The lib/list_debug.c stuff might detect such things. I wonder if > either Borislav or Steinar had CONFIG_DEBUG_LIST enabled? Well, even without CONFIG_DEBUG_LIST we'd catch _some_ things, and conversely, even with DEBUG_LIST on we don't catch everything. For example, doing list_del() twice on the same entry will die with a really nice pattern due to poisoning even without DEBUG_LIST. But list_add() twice on the same entry will sadly silently succeed both with and without list debugging (the list debugging will check the target list head, but there is no way to check the "new->next/prev" entries). Anyway, I've not actually found anything wrong in the same_vma locking. And I'm not at all convinced there is any list corruption there. My point was really only that (a) the locking rules seem very unclear and certainly not documented and (b) corruption of one list could easily be the cause of corruption of another list of the same structure. But I don't actually see anything wrong anywhere.
Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
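The double list_add() case described above can be sketched in userspace. This is only an illustration under stated assumptions: struct list_head here is a stand-in with the same two-pointer circular layout, sketch_list_add_tail() mirrors the pointer updates of a plain list_add_tail() with no debug checks, and the helper names are made up for this example rather than taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (assumption: same
 * two-pointer circular layout, none of the kernel's debug checks). */
struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Same pointer updates a plain list_add_tail() performs: link `entry`
 * just before `head`, i.e. at the tail of the list. Whatever was in
 * entry->next and entry->prev before is simply overwritten. */
static void sketch_list_add_tail(struct list_head *entry,
                                 struct list_head *head)
{
    struct list_head *prev = head->prev;

    head->prev = entry;
    entry->next = head;
    entry->prev = prev;
    prev->next = entry;
}

/* Walk forward from head and count entries, giving up after `limit`
 * steps so a corrupted list cannot make the walk spin forever. */
static int count_forward(const struct list_head *head, int limit)
{
    const struct list_head *pos;
    int n = 0;

    for (pos = head->next; pos != head && n < limit; pos = pos->next)
        n++;
    return n;
}

/* Add one entry to list `a`, then add the very same entry to list `b`.
 * Returns 1 if a forward walk of `a` no longer gets back to the head of
 * `a` within `limit` steps, 0 otherwise. */
int double_add_corrupts_first_list(int limit)
{
    struct list_head a, b, entry;

    assert(limit >= 2);
    init_list_head(&a);
    init_list_head(&b);

    sketch_list_add_tail(&entry, &a);
    assert(count_forward(&a, limit) == 1);  /* list a is sane so far */

    sketch_list_add_tail(&entry, &b);       /* no error, no warning */

    return count_forward(&a, limit) == limit;
}
```

The second add succeeds silently, and afterwards a walk starting at `a` bounces between `entry` and `b` without ever reaching `a` again. Nothing on the target list `b` looks wrong, which is consistent with the point that a check of the target list head cannot notice the problem.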
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 19:10 ` Linus Torvalds @ 2010-04-06 19:35 ` Linus Torvalds 0 siblings, 0 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-06 19:35 UTC (permalink / raw) To: Andrew Morton Cc: Rik van Riel, Minchan Kim, KOSAKI Motohiro, Borislav Petkov, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, 6 Apr 2010, Linus Torvalds wrote: > > Anyway, I've not actually found anything wrong in the same_vma locking. > And I'm not at all convinced there is any list corruption there. My point > was really only that > (a) the locking rules seem very unclear and certainly not documented and > (b) corruption of one list could easily be the cause of corruption of > another list of the same structure. > but I don't actually see anything wrong anywhere. I _have_ found what looks like a few clues, though. In particular, the disassembly in Steinar Gunderson's case looks much more like the disassembly I get, and if I read that correctly, it's actually the _first_ iteration of the for_each_entry() loop that crashes. Why do I think so? In Steinar's oops, we have "RAX: ffff880169111fc8", which is clearly a kernel pointer. 
However, the code from Steinar's oops decodes to:

   0:   3b 56 10                cmp    0x10(%rsi),%edx
   3:   73 1e                   jae    0x23
   5:   48 83 fa f2             cmp    $0xfffffffffffffff2,%rdx
   9:   74 18                   je     0x23
   b:   4d 89 f8                mov    %r15,%r8
   e:   48 8d 4d cc             lea    -0x34(%rbp),%rcx
  12:   4c 89 e7                mov    %r12,%rdi
  15:   e8 44 f2 ff ff          callq  0xfffffffffffff25e
  1a:   41 01 c5                add    %eax,%r13d
  1d:   83 7d cc 00             cmpl   $0x0,-0x34(%rbp)
  21:   74 19                   je     0x3c
  23:   48 8b 43 20             mov    0x20(%rbx),%rax
  27:   48 8d 58 e0             lea    -0x20(%rax),%rbx
  2b:*  48 8b 43 20             mov    0x20(%rbx),%rax     <-- trapping instruction
  2f:   0f 18 08                prefetcht0 (%rax)
  32:   48 8d 43 20             lea    0x20(%rbx),%rax
  36:   48 39 45 88             cmp    %rax,-0x78(%rbp)
  3a:   75 a7                   jne    0xffffffffffffffe3
  3c:   41 fe 06                incb   (%r14)
  3f:   e9                      .byte 0xe9

which matches my code pretty well, and the point is, _if_ it went through the loop, then %rbx should be %rax-0x20. And it's not.

IOW, the code you see above before the trapping instruction is the end of the loop: it's the

                referenced += page_referenced_one(page, vma, address,
                                                  &mapcount, vm_flags);
                if (!mapcount)
                        break;
        }

part (the "callq" and "add %eax" is that "referenced +=", and %r13d is "referenced").

What you cannot see from the code decode is the loop setup and _entry_, which looks like this for me:

        movl    12(%rbx), %eax  # <variable>.D.11299._mapcount.counter, D.33294
        xorl    %r12d, %r12d    # referenced
        incl    %eax    # tmp89
        movl    %eax, -52(%rbp) # tmp89, mapcount
        leaq    48(%r14), %rax  #,
        movq    48(%r14), %r13  # <variable>.head.next, <variable>.head.next
        movq    %rax, -128(%rbp)        #, %sfp
        subq    $32, %r13       #, avc
        jmp     .L167   #

where that "L167" is actually the oopsing instruction (ie the "while" loop has been turned around, and we jump to the end of the loop that does the loop end test).

In other words, what is NULL here is not an anon_vma_chain entry, but actually the initial "anon_vma->head.next" pointer. The whole _head_ of the list has never been initialized, in other words.

So we can entirely ignore the 'anon_vma_chain' issues. We need to look at the initializations of the 'anon_vma's themselves.
Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
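The address arithmetic behind that conclusion can be written out as a small sketch. The struct below is hypothetical and is not the real struct anon_vma_chain: its padding is chosen only so that same_anon_vma sits at offset 0x20, the offset that appears in the listings ("subq $32" at loop entry, "mov 0x20(%rbx),%rax" at the trap). The function names are likewise made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical layout, NOT the real struct anon_vma_chain: the padding
 * exists only to put same_anon_vma at offset 0x20. */
struct avc_sketch {
    unsigned char pad[0x20];
    struct list_head same_anon_vma;
};

#define AVC_LINK_OFFSET offsetof(struct avc_sketch, same_anon_vma)

/* container_of() done as integer arithmetic: the value the loop cursor
 * "avc" gets at loop entry, computed from the list head's next pointer
 * (the "subq $32" step in the entry code). */
uintptr_t first_cursor(uintptr_t head_next)
{
    assert(AVC_LINK_OFFSET == 0x20);    /* layout sanity check */
    return head_next - AVC_LINK_OFFSET;
}

/* Address touched by the first avc->same_anon_vma.next load, i.e. the
 * "mov 0x20(%reg),%rax" that traps. */
uintptr_t first_load_address(uintptr_t head_next)
{
    return first_cursor(head_next) + AVC_LINK_OFFSET;
}
```

With head_next equal to 0, first_load_address() gives 0, which is the NULL dereference, and on a 64-bit build first_cursor() gives 0xffffffffffffffe0. Borislav's register dump later in the thread shows R13: ffffffffffffffe0, and R13 is the loop cursor in his build, so it fits the same picture.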
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 19:03 ` Andrew Morton 2010-04-06 19:10 ` Steinar H. Gunderson 2010-04-06 19:10 ` Linus Torvalds @ 2010-04-06 19:42 ` Borislav Petkov 2010-04-06 20:02 ` Linus Torvalds 2 siblings, 1 reply; 231+ messages in thread From: Borislav Petkov @ 2010-04-06 19:42 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Andrew Morton <akpm@linux-foundation.org> Date: Tue, Apr 06, 2010 at 12:03:15PM -0700 > On Tue, 6 Apr 2010 11:28:52 -0700 (PDT) > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > For example, maybe some list corruption causes us to do that > > "anon_vma_chain_link()" _twice_ on the same avc entry. So we do that > > "list_add_tail(&avc->same_anon_vma, &anon_vma->head);" on an entry that > > already had "same_anon_vma" on one list. > > The lib/list_debug.c stuff might detect such things. I wonder if > either Borislav or Steinar had CONFIG_DEBUG_LIST enabled? No, it is off in my .config. I'll turn it on and retest to see whether it screams something. In the meantime, I've been testing current git (v2.6.34-rc3-288-gab195c5), and especially Rik's mem leak fix which Linus already committed (4946d54cb55e86a156216fcfeed5568514b0830f), and tried to retrigger the bug by hibernating the machine several times. Now, this machine has 8G of memory so I thought maybe starting several assorted guests on it would put some pressure on the anon_vma lists, but no, the machine hibernated happily, creating an almost 600MB hibernation image with all three guests loaded. Then, I said, well, let's have one last test run and started firefox, which went into reloading the last session. And I remember that firefox still hadn't finished loading all pages when I hibernated and boom, it oopsed.
So, it definitely is some anon_vma lists concurrency issue ... The good thing is, I was able to catch the oops in its sheer magnificence over netconsole this time: [ 2995.478125] PM: Preallocating image memory... [ 2995.713692] BUG: unable to handle kernel NULL pointer dereference at (null) [ 2995.714001] IP: [<ffffffff810c194d>] page_referenced+0xee/0x1dc [ 2995.714001] PGD 22d1b8067 PUD 22dd85067 PMD 0 [ 2995.714001] Oops: 0000 [#1] PREEMPT SMP [ 2995.714001] last sysfs file: /sys/power/state [ 2995.714001] CPU 0 [ 2995.714001] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core [ 2995.714001] [ 2995.714001] Pid: 7440, comm: hib.sh Not tainted 2.6.34-rc3-00288-gab195c5 #1 M3A78 PRO/System Product Name [ 2995.714001] RIP: 0010:[<ffffffff810c194d>] [<ffffffff810c194d>] page_referenced+0xee/0x1dc [ 2995.714001] RSP: 0018:ffff88022fa038b8 EFLAGS: 00010283 [ 2995.714001] RAX: ffff88022d747098 RBX: ffffea00078efb70 RCX: 0000000000000000 [ 2995.714001] RDX: ffff88022fa03cf8 RSI: ffff88022d747070 RDI: ffff88022fb32520 [ 2995.714001] RBP: ffff88022fa03938 R08: 0000000000000002 R09: 0000000000000000 [ 2995.714001] R10: ffff88022fa038a8 R11: ffff88022d295d10 R12: 0000000000000000 [ 2995.714001] R13: ffffffffffffffe0 R14: ffff88022d747058 R15: ffff88022fa03a00 [ 2995.714001] FS: 00007f4da8b966f0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000 [ 2995.714001] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 2995.714001] CR2: 0000000000000000 CR3: 000000022d11e000 CR4: 00000000000006f0 [ 2995.714001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2995.714001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 2995.714001] Process hib.sh (pid: 7440, threadinfo ffff88022fa02000, task ffff88022fb32520) [ 2995.714001] Stack: [ 2995.714001] ffff88022d747098 
00000000813fd2ac ffffffff8165ee28 0000000000000416 [ 2995.714001] <0> ffff88022fa038f8 ffffffff810c6d40 ffffea00078fae60 ffffea00078fae60 [ 2995.714001] <0> ffff88022fa03938 00000002810abd98 ffffea00078ec530 ffffea00078efb98 [ 2995.714001] Call Trace: [ 2995.714001] [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c [ 2995.714001] [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1 [ 2995.714001] [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58 [ 2995.714001] [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623 [ 2995.714001] [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4 [ 2995.714001] [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1 [ 2995.714001] [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79 [ 2995.714001] [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4 [ 2995.714001] [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c [ 2995.714001] [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f [ 2995.714001] [<ffffffff8103de67>] ? irq_exit+0x93/0x95 [ 2995.714001] [<ffffffff810add03>] shrink_all_memory+0x95/0xc4 [ 2995.714001] [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217 [ 2995.714001] [<ffffffff81077503>] ? count_data_pages+0x65/0x79 [ 2995.714001] [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb [ 2995.714001] [<ffffffff813f95b5>] ? printk+0x41/0x44 [ 2995.714001] [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1 [ 2995.714001] [<ffffffff8107632c>] hibernate+0xce/0x172 [ 2995.714001] [<ffffffff81075099>] state_store+0x5c/0xd3 [ 2995.714001] [<ffffffff8118728f>] kobj_attr_store+0x17/0x19 [ 2995.714001] [<ffffffff81127b69>] sysfs_write_file+0x108/0x144 [ 2995.714001] [<ffffffff810d66ff>] vfs_write+0xb2/0x153 [ 2995.714001] [<ffffffff810641a9>] ? 
trace_hardirqs_on_caller+0x1f/0x14b [ 2995.714001] [<ffffffff810d6863>] sys_write+0x4a/0x71 [ 2995.714001] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b [ 2995.714001] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 4d f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 [ 2995.714001] RIP [<ffffffff810c194d>] page_referenced+0xee/0x1dc [ 2995.714001] RSP <ffff88022fa038b8> [ 2995.714001] CR2: 0000000000000000 [ 2995.729717] ---[ end trace 92c25d74e4800968 ]--- [ 2995.729862] note: hib.sh[7440] exited with preempt_count 2 [ 2995.730022] BUG: scheduling while atomic: hib.sh/7440/0x10000003 [ 2995.730170] INFO: lockdep is turned off. [ 2995.730319] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core [ 2995.731749] Pid: 7440, comm: hib.sh Tainted: G D 2.6.34-rc3-00288-gab195c5 #1 [ 2995.732003] Call Trace: [ 2995.732158] [<ffffffff810636bf>] ? __debug_show_held_locks+0x1b/0x24 [ 2995.732305] [<ffffffff8102d499>] __schedule_bug+0x72/0x77 [ 2995.732454] [<ffffffff813f9a0a>] schedule+0xd9/0x730 [ 2995.732603] [<ffffffff81030301>] __cond_resched+0x18/0x24 [ 2995.732751] [<ffffffff813fa12e>] _cond_resched+0x2c/0x37 [ 2995.732900] [<ffffffff810b8a21>] unmap_vmas+0x6ce/0x893 [ 2995.733053] [<ffffffff810bd0f5>] exit_mmap+0xd7/0x182 [ 2995.733206] [<ffffffff81035b58>] mmput+0x43/0xea [ 2995.733356] [<ffffffff81039e99>] exit_mm+0x110/0x11d [ 2995.733505] [<ffffffff8103b8ed>] do_exit+0x1c5/0x6a2 [ 2995.733653] [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155 [ 2995.733802] [<ffffffff810060db>] ? 
oops_end+0x47/0x93 [ 2995.733950] [<ffffffff81006122>] oops_end+0x8e/0x93 [ 2995.734102] [<ffffffff8101ed99>] no_context+0x1fc/0x20b [ 2995.734255] [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af [ 2995.734407] [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d [ 2995.734556] [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15 [ 2995.734705] [<ffffffff8101f23a>] do_page_fault+0x173/0x32d [ 2995.734854] [<ffffffff810802f9>] ? __call_rcu+0x11d/0x130 [ 2995.735008] [<ffffffff813fdaa3>] ? error_sti+0x5/0x6 [ 2995.735161] [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9 [ 2995.735313] [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 2995.735463] [<ffffffff813fd8bf>] page_fault+0x1f/0x30 [ 2995.735612] [<ffffffff810c194d>] ? page_referenced+0xee/0x1dc [ 2995.735761] [<ffffffff810c18df>] ? page_referenced+0x80/0x1dc [ 2995.735910] [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c [ 2995.736062] [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1 [ 2995.736216] [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58 [ 2995.736368] [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623 [ 2995.736518] [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4 [ 2995.736666] [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1 [ 2995.736816] [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79 [ 2995.736965] [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4 [ 2995.737117] [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c [ 2995.737270] [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f [ 2995.737422] [<ffffffff8103de67>] ? irq_exit+0x93/0x95 [ 2995.737570] [<ffffffff810add03>] shrink_all_memory+0x95/0xc4 [ 2995.737719] [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217 [ 2995.737868] [<ffffffff81077503>] ? count_data_pages+0x65/0x79 [ 2995.738020] [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb [ 2995.738175] [<ffffffff813f95b5>] ? 
printk+0x41/0x44 [ 2995.738326] [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1 [ 2995.738475] [<ffffffff8107632c>] hibernate+0xce/0x172 [ 2995.738623] [<ffffffff81075099>] state_store+0x5c/0xd3 [ 2995.738772] [<ffffffff8118728f>] kobj_attr_store+0x17/0x19 [ 2995.738920] [<ffffffff81127b69>] sysfs_write_file+0x108/0x144 [ 2995.739073] [<ffffffff810d66ff>] vfs_write+0xb2/0x153 [ 2995.739226] [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 2995.739378] [<ffffffff810d6863>] sys_write+0x4a/0x71 [ 2995.739526] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b [ 2995.739940] BUG: unable to handle kernel paging request at 00007faf064ff1f0 [ 2995.740220] IP: [<ffffffff8119c0d0>] do_raw_spin_trylock+0x4/0x3a [ 2995.740441] PGD 0 [ 2995.740646] Oops: 0000 [#2] PREEMPT SMP [ 2995.740685] last sysfs file: /sys/power/state [ 2995.740685] CPU 1 [ 2995.740685] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core [ 2995.740685] [ 2995.740685] Pid: 7440, comm: hib.sh Tainted: G D 2.6.34-rc3-00288-gab195c5 #1 M3A78 PRO/System Product Name [ 2995.740685] RIP: 0010:[<ffffffff8119c0d0>] [<ffffffff8119c0d0>] do_raw_spin_trylock+0x4/0x3a [ 2995.740685] RSP: 0018:ffff88022fa03438 EFLAGS: 00010292 [ 2995.740685] RAX: ffff88022fb32520 RBX: 00007faf064ff1f0 RCX: 0000000000000000 [ 2995.740685] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007faf064ff1f0 [ 2995.740685] RBP: ffff88022fa03438 R08: 0000000000000002 R09: 0000000000000000 [ 2995.740685] R10: dead000000100100 R11: ffffffff810d26f5 R12: 00007faf064ff208 [ 2995.740685] R13: fffffffffffffff0 R14: ffff88022d747068 R15: 00007f4da81fa000 [ 2995.740685] FS: 00007f4da8b966f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000 [ 2995.740685] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 2995.740685] CR2: 00007faf064ff1f0 CR3: 
0000000001646000 CR4: 00000000000006e0 [ 2995.740685] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2995.740685] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 2995.740685] Process hib.sh (pid: 7440, threadinfo ffff88022fa02000, task ffff88022fb32520) [ 2995.740685] Stack: [ 2995.740685] ffff88022fa03468 ffffffff813fc6c3 ffffffff810c1ae3 ffff8801cbfde880 [ 2995.740685] <0> ffff88022fb32510 00007faf064ff1f0 ffff88022fa034a8 ffffffff810c1ae3 [ 2995.740685] <0> ffff88022fa034a8 ffff88022d747000 0000000000000000 0000000000000000 [ 2995.740685] Call Trace: [ 2995.740685] [<ffffffff813fc6c3>] _raw_spin_lock+0x48/0x73 [ 2995.740685] [<ffffffff810c1ae3>] ? unlink_anon_vmas+0x40/0xe1 [ 2995.740685] [<ffffffff810c1ae3>] unlink_anon_vmas+0x40/0xe1 [ 2995.740685] [<ffffffff810bb562>] free_pgtables+0x68/0xce [ 2995.740685] [<ffffffff810bd11e>] exit_mmap+0x100/0x182 [ 2995.740685] [<ffffffff81035b58>] mmput+0x43/0xea [ 2995.740685] [<ffffffff81039e99>] exit_mm+0x110/0x11d [ 2995.740685] [<ffffffff8103b8ed>] do_exit+0x1c5/0x6a2 [ 2995.740685] [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155 [ 2995.740685] [<ffffffff810060db>] ? oops_end+0x47/0x93 [ 2995.740685] [<ffffffff81006122>] oops_end+0x8e/0x93 [ 2995.740685] [<ffffffff8101ed99>] no_context+0x1fc/0x20b [ 2995.740685] [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af [ 2995.740685] [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d [ 2995.740685] [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15 [ 2995.740685] [<ffffffff8101f23a>] do_page_fault+0x173/0x32d [ 2995.740685] [<ffffffff810802f9>] ? __call_rcu+0x11d/0x130 [ 2995.740685] [<ffffffff813fdaa3>] ? error_sti+0x5/0x6 [ 2995.740685] [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9 [ 2995.740685] [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 2995.740685] [<ffffffff813fd8bf>] page_fault+0x1f/0x30 [ 2995.740685] [<ffffffff810c194d>] ? page_referenced+0xee/0x1dc [ 2995.740685] [<ffffffff810c18df>] ? 
page_referenced+0x80/0x1dc [ 2995.740685] [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c [ 2995.740685] [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1 [ 2995.740685] [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58 [ 2995.740685] [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623 [ 2995.740685] [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4 [ 2995.740685] [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1 [ 2995.740685] [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79 [ 2995.740685] [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4 [ 2995.740685] [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c [ 2995.740685] [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f [ 2995.740685] [<ffffffff8103de67>] ? irq_exit+0x93/0x95 [ 2995.740685] [<ffffffff810add03>] shrink_all_memory+0x95/0xc4 [ 2995.740685] [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217 [ 2995.740685] [<ffffffff81077503>] ? count_data_pages+0x65/0x79 [ 2995.740685] [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb [ 2995.740685] [<ffffffff813f95b5>] ? printk+0x41/0x44 [ 2995.740685] [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1 [ 2995.740685] [<ffffffff8107632c>] hibernate+0xce/0x172 [ 2995.740685] [<ffffffff81075099>] state_store+0x5c/0xd3 [ 2995.740685] [<ffffffff8118728f>] kobj_attr_store+0x17/0x19 [ 2995.740685] [<ffffffff81127b69>] sysfs_write_file+0x108/0x144 [ 2995.740685] [<ffffffff810d66ff>] vfs_write+0xb2/0x153 [ 2995.740685] [<ffffffff810641a9>] ? 
trace_hardirqs_on_caller+0x1f/0x14b [ 2995.740685] [<ffffffff810d6863>] sys_write+0x4a/0x71 [ 2995.740685] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b [ 2995.740685] Code: c7 c7 90 16 67 81 e8 79 f1 25 00 48 c7 c7 90 16 67 81 e8 e1 e7 25 00 48 c7 c7 30 18 67 81 e8 d5 e7 25 00 c9 c3 90 90 55 48 89 e5 <0f> b7 07 38 e0 8d 90 00 01 00 00 75 05 f0 66 0f b1 17 0f 94 c2 [ 2995.740685] RIP [<ffffffff8119c0d0>] do_raw_spin_trylock+0x4/0x3a [ 2995.740685] RSP <ffff88022fa03438> [ 2995.740685] CR2: 00007faf064ff1f0 [ 2995.762521] ---[ end trace 92c25d74e4800969 ]--- [ 2995.762686] Fixing recursive fault but reboot is needed! [ 2995.762855] BUG: scheduling while atomic: hib.sh/7440/0x00000005 [ 2995.763026] INFO: lockdep is turned off. [ 2995.763203] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core [ 2995.764799] Pid: 7440, comm: hib.sh Tainted: G D 2.6.34-rc3-00288-gab195c5 #1 [ 2995.765080] Call Trace: [ 2995.765256] [<ffffffff810636bf>] ? __debug_show_held_locks+0x1b/0x24 [ 2995.765429] [<ffffffff8102d499>] __schedule_bug+0x72/0x77 [ 2995.765600] [<ffffffff813f9a0a>] schedule+0xd9/0x730 [ 2995.765771] [<ffffffff8103b7f7>] do_exit+0xcf/0x6a2 [ 2995.765941] [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155 [ 2995.766115] [<ffffffff810060db>] ? oops_end+0x47/0x93 [ 2995.766295] [<ffffffff81006122>] oops_end+0x8e/0x93 [ 2995.766462] [<ffffffff8101ed99>] no_context+0x1fc/0x20b [ 2995.766632] [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 2995.766806] [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af [ 2995.766977] [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d [ 2995.767161] [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15 [ 2995.767330] [<ffffffff8101f23a>] do_page_fault+0x173/0x32d [ 2995.767501] [<ffffffff810a9eca>] ? 
release_pages+0x1ee/0x200 [ 2995.767673] [<ffffffff813fdaa3>] ? error_sti+0x5/0x6 [ 2995.767842] [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9 [ 2995.768017] [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 2995.768202] [<ffffffff810d26f5>] ? kmem_cache_free+0x56/0x129 [ 2995.768373] [<ffffffff813fd8bf>] page_fault+0x1f/0x30 [ 2995.768544] [<ffffffff810d26f5>] ? kmem_cache_free+0x56/0x129 [ 2995.768716] [<ffffffff8119c0d0>] ? do_raw_spin_trylock+0x4/0x3a [ 2995.768888] [<ffffffff813fc6c3>] _raw_spin_lock+0x48/0x73 [ 2995.769064] [<ffffffff810c1ae3>] ? unlink_anon_vmas+0x40/0xe1 [ 2995.769246] [<ffffffff810c1ae3>] unlink_anon_vmas+0x40/0xe1 [ 2995.769415] [<ffffffff810bb562>] free_pgtables+0x68/0xce [ 2995.769586] [<ffffffff810bd11e>] exit_mmap+0x100/0x182 [ 2995.769756] [<ffffffff81035b58>] mmput+0x43/0xea [ 2995.769925] [<ffffffff81039e99>] exit_mm+0x110/0x11d [ 2995.770099] [<ffffffff8103b8ed>] do_exit+0x1c5/0x6a2 [ 2995.770279] [<ffffffff81038f84>] ? kmsg_dump+0x13b/0x155 [ 2995.770447] [<ffffffff810060db>] ? oops_end+0x47/0x93 [ 2995.770616] [<ffffffff81006122>] oops_end+0x8e/0x93 [ 2995.770785] [<ffffffff8101ed99>] no_context+0x1fc/0x20b [ 2995.770955] [<ffffffff8101ef34>] __bad_area_nosemaphore+0x18c/0x1af [ 2995.771141] [<ffffffff8101f16f>] ? do_page_fault+0xa8/0x32d [ 2995.771311] [<ffffffff8101ef6a>] bad_area_nosemaphore+0x13/0x15 [ 2995.771482] [<ffffffff8101f23a>] do_page_fault+0x173/0x32d [ 2995.771653] [<ffffffff810802f9>] ? __call_rcu+0x11d/0x130 [ 2995.771824] [<ffffffff813fdaa3>] ? error_sti+0x5/0x6 [ 2995.771994] [<ffffffff81063167>] ? trace_hardirqs_off_caller+0x1f/0xa9 [ 2995.772179] [<ffffffff813fc48e>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 2995.772352] [<ffffffff813fd8bf>] page_fault+0x1f/0x30 [ 2995.772524] [<ffffffff810c194d>] ? page_referenced+0xee/0x1dc [ 2995.772696] [<ffffffff810c18df>] ? page_referenced+0x80/0x1dc [ 2995.772867] [<ffffffff810c6d40>] ? 
swapcache_free+0x37/0x3c [ 2995.773043] [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1 [ 2995.773226] [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58 [ 2995.773398] [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623 [ 2995.773572] [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4 [ 2995.773742] [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1 [ 2995.773916] [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79 [ 2995.774093] [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4 [ 2995.774274] [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c [ 2995.774444] [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f [ 2995.774617] [<ffffffff8103de67>] ? irq_exit+0x93/0x95 [ 2995.774786] [<ffffffff810add03>] shrink_all_memory+0x95/0xc4 [ 2995.774958] [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217 [ 2995.775144] [<ffffffff81077503>] ? count_data_pages+0x65/0x79 [ 2995.775314] [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb [ 2995.775487] [<ffffffff813f95b5>] ? printk+0x41/0x44 [ 2995.775657] [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1 [ 2995.775828] [<ffffffff8107632c>] hibernate+0xce/0x172 [ 2995.775998] [<ffffffff81075099>] state_store+0x5c/0xd3 [ 2995.776182] [<ffffffff8118728f>] kobj_attr_store+0x17/0x19 [ 2995.776350] [<ffffffff81127b69>] sysfs_write_file+0x108/0x144 [ 2995.776521] [<ffffffff810d66ff>] vfs_write+0xb2/0x153 [ 2995.776690] [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 2995.776863] [<ffffffff810d6863>] sys_write+0x4a/0x71 [ 2995.777038] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 19:42 ` Borislav Petkov @ 2010-04-06 20:02 ` Linus Torvalds 2010-04-06 20:46 ` Steinar H. Gunderson ` (2 more replies) 0 siblings, 3 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-06 20:02 UTC (permalink / raw) To: Borislav Petkov Cc: Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, 6 Apr 2010, Borislav Petkov wrote: > > [ 2995.478125] PM: Preallocating image memory... > [ 2995.713692] BUG: unable to handle kernel NULL pointer dereference at (null) > [ 2995.714001] IP: [<ffffffff810c194d>] page_referenced+0xee/0x1dc > [ 2995.714001] PGD 22d1b8067 PUD 22dd85067 PMD 0 > [ 2995.714001] Oops: 0000 [#1] PREEMPT SMP > [ 2995.714001] last sysfs file: /sys/power/state > [ 2995.714001] CPU 0 > [ 2995.714001] Modules linked in: tun powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr 8250_pnp 8250 k10temp edac_core serial_core > [ 2995.714001] > [ 2995.714001] Pid: 7440, comm: hib.sh Not tainted 2.6.34-rc3-00288-gab195c5 #1 M3A78 PRO/System Product Name > [ 2995.714001] RIP: 0010:[<ffffffff810c194d>] [<ffffffff810c194d>] page_referenced+0xee/0x1dc > [ 2995.714001] RSP: 0018:ffff88022fa038b8 EFLAGS: 00010283 > [ 2995.714001] RAX: ffff88022d747098 RBX: ffffea00078efb70 RCX: 0000000000000000 > [ 2995.714001] RDX: ffff88022fa03cf8 RSI: ffff88022d747070 RDI: ffff88022fb32520 > [ 2995.714001] RBP: ffff88022fa03938 R08: 0000000000000002 R09: 0000000000000000 > [ 2995.714001] R10: ffff88022fa038a8 R11: ffff88022d295d10 R12: 0000000000000000 > [ 2995.714001] R13: ffffffffffffffe0 R14: ffff88022d747058 R15: ffff88022fa03a00 > [ 2995.714001] FS: 00007f4da8b966f0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000 > [ 2995.714001] CS: 0010 DS: 0000 ES: 
0000 CR0: 000000008005003b > [ 2995.714001] CR2: 0000000000000000 CR3: 000000022d11e000 CR4: 00000000000006f0 > [ 2995.714001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 2995.714001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [ 2995.714001] Process hib.sh (pid: 7440, threadinfo ffff88022fa02000, task ffff88022fb32520) > [ 2995.714001] Stack: > [ 2995.714001] ffff88022d747098 00000000813fd2ac ffffffff8165ee28 0000000000000416 > [ 2995.714001] <0> ffff88022fa038f8 ffffffff810c6d40 ffffea00078fae60 ffffea00078fae60 > [ 2995.714001] <0> ffff88022fa03938 00000002810abd98 ffffea00078ec530 ffffea00078efb98 > [ 2995.714001] Call Trace: > [ 2995.714001] [<ffffffff810c6d40>] ? swapcache_free+0x37/0x3c > [ 2995.714001] [<ffffffff810ac31d>] shrink_page_list+0x171/0x4b1 > [ 2995.714001] [<ffffffff813fd1e6>] ? _raw_spin_unlock_irq+0x30/0x58 > [ 2995.714001] [<ffffffff810ac9b9>] shrink_inactive_list+0x35c/0x623 > [ 2995.714001] [<ffffffff810acd94>] ? shrink_zone+0x114/0x3d4 > [ 2995.714001] [<ffffffff81064f29>] ? print_lock_contention_bug+0x1b/0xe1 > [ 2995.714001] [<ffffffff813fc790>] ? _raw_spin_lock_irq+0x19/0x79 > [ 2995.714001] [<ffffffff810acf8a>] shrink_zone+0x30a/0x3d4 > [ 2995.714001] [<ffffffff810ad19e>] ? shrink_slab+0x14a/0x15c > [ 2995.714001] [<ffffffff810adb65>] do_try_to_free_pages+0x176/0x27f > [ 2995.714001] [<ffffffff8103de67>] ? irq_exit+0x93/0x95 > [ 2995.714001] [<ffffffff810add03>] shrink_all_memory+0x95/0xc4 > [ 2995.714001] [<ffffffff810ab0f0>] ? isolate_pages_global+0x0/0x217 > [ 2995.714001] [<ffffffff81077503>] ? count_data_pages+0x65/0x79 > [ 2995.714001] [<ffffffff8107776a>] hibernate_preallocate_memory+0x1aa/0x2cb > [ 2995.714001] [<ffffffff813f95b5>] ? 
printk+0x41/0x44 > [ 2995.714001] [<ffffffff810760b3>] hibernation_snapshot+0x36/0x1e1 > [ 2995.714001] [<ffffffff8107632c>] hibernate+0xce/0x172 > [ 2995.714001] [<ffffffff81075099>] state_store+0x5c/0xd3 > [ 2995.714001] [<ffffffff8118728f>] kobj_attr_store+0x17/0x19 > [ 2995.714001] [<ffffffff81127b69>] sysfs_write_file+0x108/0x144 > [ 2995.714001] [<ffffffff810d66ff>] vfs_write+0xb2/0x153 > [ 2995.714001] [<ffffffff810641a9>] ? trace_hardirqs_on_caller+0x1f/0x14b > [ 2995.714001] [<ffffffff810d6863>] sys_write+0x4a/0x71 > [ 2995.714001] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b > [ 2995.714001] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 4d f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 > [ 2995.714001] RIP [<ffffffff810c194d>] page_referenced+0xee/0x1dc > [ 2995.714001] RSP <ffff88022fa038b8> > [ 2995.714001] CR2: 0000000000000000 > [ 2995.729717] ---[ end trace 92c25d74e4800968 ]---

So again, I can show that the code has never actually been through the loop. The above code decodes to:

   0:   3b 56 10                cmp    0x10(%rsi),%edx
   3:   73 1e                   jae    0x23
   5:   48 83 fa f2             cmp    $0xfffffffffffffff2,%rdx
   9:   74 18                   je     0x23
   b:   48 8d 4d cc             lea    -0x34(%rbp),%rcx
   f:   4d 89 f8                mov    %r15,%r8
  12:   48 89 df                mov    %rbx,%rdi
  15:   e8 4d f2 ff ff          callq  0xfffffffffffff267
  1a:   41 01 c4                add    %eax,%r12d
  1d:   83 7d cc 00             cmpl   $0x0,-0x34(%rbp)
  21:   74 19                   je     0x3c
  23:   4d 8b 6d 20             mov    0x20(%r13),%r13
  27:   49 83 ed 20             sub    $0x20,%r13
  2b:*  49 8b 45 20             mov    0x20(%r13),%rax     <-- trapping instruction
  2f:   0f 18 08                prefetcht0 (%rax)
  32:   49 8d 45 20             lea    0x20(%r13),%rax
  36:   48 39 45 80             cmp    %rax,-0x80(%rbp)
  3a:   75 aa                   jne    0xffffffffffffffe6
  3c:   4c 89 f7                mov    %r14,%rdi
  3f:   e8                      .byte 0xe8

and in your case, if we had gone through the loop, then %rax would still contain the return value from page_referenced_one(). But %rax is a kernel pointer, and %r12d is 0.
So again, it's actually anon_vma.head.next that is NULL, not any of the entries on the list itself.

Now, I can see several cases for this:

 - the obvious one: anon_vma just wasn't correctly initialized, and is missing an INIT_LIST_HEAD(&anon_vma->head). That's either a slab bug (we don't have a whole lot of coverage of constructors), or somebody allocated an anon_vma without using the anon_vma_cachep.

 - Related to the above: perhaps the RCU freeing isn't working, or slub/slab/slob ends up reusing the allocations for something else than anonvma's, so together with the race _and_ an unlucky re-use, you get some odd crud.

   I haven't looked at the kernel config files: do they perhaps share the same (odd?) SLUB/SLAB/SLOB config?

 - anon_vma isn't actually an anonvma at all. 'page->mapping' was crud with the low bit set. That sounds unlikely, but who knows. The ksm code sets mapping to "stable_node + PAGE_MAPPING_ANON | PAGE_MAPPING_KSM"

   Did people have KSM enabled?

.. and probably other things I haven't even thought about.

Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
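For the third possibility, here is a rough sketch of how tag bits in the low end of a pointer-sized 'page->mapping' word could be decoded. The constants are assumptions for illustration: the mail only says that the low bit marks an anon mapping and that KSM ORs in a second flag, so the values 1 and 2 and all names prefixed SKETCH_ or mapping_ are hypothetical and not taken from the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag values, for illustration only. */
#define SKETCH_MAPPING_ANON  1UL   /* low bit: anon mapping */
#define SKETCH_MAPPING_KSM   2UL   /* second bit: KSM stable node */
#define SKETCH_MAPPING_FLAGS (SKETCH_MAPPING_ANON | SKETCH_MAPPING_KSM)

enum mapping_kind {
    MAPPING_FILE_OR_NONE,   /* low bit clear */
    MAPPING_ANON,           /* only the anon bit set */
    MAPPING_KSM             /* anon and KSM bits both set */
};

/* Classify a mapping word purely from its two low bits. */
enum mapping_kind classify_mapping(uintptr_t mapping)
{
    switch (mapping & SKETCH_MAPPING_FLAGS) {
    case SKETCH_MAPPING_ANON:
        return MAPPING_ANON;
    case SKETCH_MAPPING_FLAGS:
        return MAPPING_KSM;
    default:
        return MAPPING_FILE_OR_NONE;
    }
}

/* Strip the tag bits to recover the pointer value underneath. */
uintptr_t mapping_to_pointer(uintptr_t mapping)
{
    return mapping & ~(uintptr_t)SKETCH_MAPPING_FLAGS;
}

/* Small self-check: any word with only the low bit set is taken for an
 * anon mapping, whatever the remaining bits happen to point at. */
void mapping_sketch_selfcheck(void)
{
    assert(classify_mapping(0x1001) == MAPPING_ANON);
    assert(mapping_to_pointer(0x1001) == 0x1000);
}
```

The classification looks only at the tag bits, so a garbage word that happens to have the low bit set would be treated as an anon mapping, and the bytes it points at would then be read as a list head. That is the scenario the "crud with the low bit set" remark describes. Whether it actually happened here is not something this sketch can show.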
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 20:02 ` Linus Torvalds @ 2010-04-06 20:46 ` Steinar H. Gunderson 2010-04-06 20:56 ` Linus Torvalds 2010-04-06 20:51 ` Borislav Petkov 2010-04-07 8:41 ` Peter Zijlstra 2 siblings, 1 reply; 231+ messages in thread From: Steinar H. Gunderson @ 2010-04-06 20:46 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins On Tue, Apr 06, 2010 at 01:02:35PM -0700, Linus Torvalds wrote: > I haven't looked at the kernel config files: do they perhaps share the > same (odd?) SLUB/SLAB/SLOB config? http://storage.sesse.net/config-crashing-2.6.34-rc2 > Did people have KSM enabled? No KSM for me. /* Steinar */ -- Homepage: http://www.sesse.net/ ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 20:46 ` Steinar H. Gunderson @ 2010-04-06 20:56 ` Linus Torvalds 2010-04-06 21:05 ` Steinar H. Gunderson 0 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-06 20:56 UTC (permalink / raw) To: Steinar H. Gunderson Cc: Borislav Petkov, Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins On Tue, 6 Apr 2010, Steinar H. Gunderson wrote: > On Tue, Apr 06, 2010 at 01:02:35PM -0700, Linus Torvalds wrote: > > I haven't looked at the kernel config files: do they perhaps share the > > same (odd?) SLUB/SLAB/SLOB config? > > http://storage.sesse.net/config-crashing-2.6.34-rc2 Ok, CONFIG_SLUB, which is the common case. Not likely to be buggy. > > Did people have KSM enabled? > > No KSM for me. Ok, not anything odd there either, and you're not using any odd RCU setup either. Nothing odd at all strikes me about your config, in fact. Lots and lots of modules, but I guess it comes from some distro default config.. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 20:56 ` Linus Torvalds @ 2010-04-06 21:05 ` Steinar H. Gunderson 0 siblings, 0 replies; 231+ messages in thread From: Steinar H. Gunderson @ 2010-04-06 21:05 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins On Tue, Apr 06, 2010 at 01:56:19PM -0700, Linus Torvalds wrote: >>> Did people have KSM enabled? >> No KSM for me. > Ok, not anything odd there either, and you're not using any odd RCU setup > either. Nothing odd at all strikes me about your config, in fact. Lots and > lots of modules, but I guess it comes from some distro default config.. I think it was originally some distro config, yes, but that «config fork» was at 2.6.16 or something... /* Steinar */ -- Homepage: http://www.sesse.net/ ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 20:02 ` Linus Torvalds 2010-04-06 20:46 ` Steinar H. Gunderson @ 2010-04-06 20:51 ` Borislav Petkov 2010-04-06 21:27 ` Linus Torvalds 2010-04-07 8:41 ` Peter Zijlstra 2 siblings, 1 reply; 231+ messages in thread From: Borislav Petkov @ 2010-04-06 20:51 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Linus Torvalds <torvalds@linux-foundation.org> Date: Tue, Apr 06, 2010 at 01:02:35PM -0700 > So again, I can show that the code has never actually been through the > loop. The above code decodes to: > > 0: 3b 56 10 cmp 0x10(%rsi),%edx > 3: 73 1e jae 0x23 > 5: 48 83 fa f2 cmp $0xfffffffffffffff2,%rdx > 9: 74 18 je 0x23 > b: 48 8d 4d cc lea -0x34(%rbp),%rcx > f: 4d 89 f8 mov %r15,%r8 > 12: 48 89 df mov %rbx,%rdi > 15: e8 4d f2 ff ff callq 0xfffffffffffff267 > 1a: 41 01 c4 add %eax,%r12d > 1d: 83 7d cc 00 cmpl $0x0,-0x34(%rbp) > 21: 74 19 je 0x3c > 23: 4d 8b 6d 20 mov 0x20(%r13),%r13 > 27: 49 83 ed 20 sub $0x20,%r13 > 2b:* 49 8b 45 20 mov 0x20(%r13),%rax <-- trapping instruction > 2f: 0f 18 08 prefetcht0 (%rax) > 32: 49 8d 45 20 lea 0x20(%r13),%rax > 36: 48 39 45 80 cmp %rax,-0x80(%rbp) > 3a: 75 aa jne 0xffffffffffffffe6 > 3c: 4c 89 f7 mov %r14,%rdi > 3f: e8 .byte 0xe8 > > and in your case, if we had gone through the loop, then %rax would still > contain the return value from page_referenced_one(). > > But %rax is a kernel pointer, and %r12d is 0. > > So again, it's actually anon_vma.head.next that is NULL, not any of the > entries on the list itself. > > Now, I can see several cases for this: > > - the obvious one: anon_vma just wasn't correctly initialized, and is > missing a INIT_LIST_HEAD(&anon_vma->head). 
That's either a slab bug (we > don't have a whole lot of coverage of constructors), or somebody > allocated an anon_vma without using the anon_vma_cachep. I've added code to verify this and am suspend/resuming now... Wait a minute, Linus, you're good! :) : [ 873.083074] PM: Preallocating image memory... [ 873.254359] NULL anon_vma->head.next, page 2182681 This is the page_to_pfn number. Now, how do we track back to the place which is missing anon_vma->head init? Can we use the struct page *page arg to page_referenced_anon() somehow? [ 873.254654] Pid: 3642, comm: hib.sh Not tainted 2.6.34-rc3-00288-gab195c5-dirty #3 [ 873.254904] Call Trace: [ 873.255063] [<ffffffff810c0c28>] page_referenced+0xd3/0x219 [ 873.255212] [<ffffffff810c5fb0>] ? swapcache_free+0x37/0x3c [ 873.255364] [<ffffffff810ab782>] shrink_page_list+0x14a/0x477 [ 873.255512] [<ffffffff810aa6e0>] ? isolate_pages_global+0xc4/0x1f0 [ 873.255662] [<ffffffff813f8a76>] ? _raw_spin_unlock_irq+0x30/0x58 [ 873.255811] [<ffffffff810abe06>] shrink_inactive_list+0x357/0x5e5 [ 873.255960] [<ffffffff810ab626>] ? shrink_active_list+0x232/0x244 [ 873.256112] [<ffffffff810ac39e>] shrink_zone+0x30a/0x3d4 [ 873.256264] [<ffffffff810acf79>] do_try_to_free_pages+0x176/0x27f [ 873.256416] [<ffffffff810ad117>] shrink_all_memory+0x95/0xc4 [ 873.256564] [<ffffffff810aa61c>] ? isolate_pages_global+0x0/0x1f0 [ 873.256713] [<ffffffff81076e4c>] ? count_data_pages+0x65/0x79 [ 873.256862] [<ffffffff810770b3>] hibernate_preallocate_memory+0x1aa/0x2cb [ 873.257036] [<ffffffff813f4f75>] ? printk+0x41/0x44 [ 873.257186] [<ffffffff81075a53>] hibernation_snapshot+0x36/0x1e1 [ 873.257337] [<ffffffff81075ccc>] hibernate+0xce/0x172 [ 873.257485] [<ffffffff81074a39>] state_store+0x5c/0xd3 [ 873.257634] [<ffffffff81184eff>] kobj_attr_store+0x17/0x19 [ 873.257783] [<ffffffff81125d43>] sysfs_write_file+0x108/0x144 [ 873.257932] [<ffffffff810d560f>] vfs_write+0xb2/0x153 [ 873.258084] [<ffffffff81063bd9>] ? 
trace_hardirqs_on_caller+0x1f/0x14b [ 873.258237] [<ffffffff810d5773>] sys_write+0x4a/0x71 [ 873.258388] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b > - Related to the above: perhaps the RCU freeing isn't working, or > slub/slab/slob ends up reusing the allocations for something else than > anonvma's, so together with the race _and_ an unlucky re-use, you get > some odd crud. > > I haven't looked at the kernel config files: do they perhaps share the > same (odd?) SLUB/SLAB/SLOB config? what is an odd SL[AOU]B config? > - anon_vma isn't actually an anonvma at all. 'page->mapping' was crud > with the low bit set. That sounds unlikely, but who knows. The ksm code > sets mapping to "stable_node + PAGE_MAPPING_ANON | PAGE_MAPPING_KSM" > > Did people have KSM enabled? Nope, KSM is off here. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 20:51 ` Borislav Petkov @ 2010-04-06 21:27 ` Linus Torvalds 2010-04-06 22:59 ` Borislav Petkov 2010-04-06 23:22 ` Rik van Riel 0 siblings, 2 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-06 21:27 UTC (permalink / raw) To: Borislav Petkov Cc: Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, 6 Apr 2010, Borislav Petkov wrote: > > So again, it's actually anon_vma.head.next that is NULL, not any of the > > entries on the list itself. > > > > Now, I can see several cases for this: > > > > - the obvious one: anon_vma just wasn't correctly initialized, and is > > missing a INIT_LIST_HEAD(&anon_vma->head). That's either a slab bug (we > > don't have a whole lot of coverage of constructors), or somebody > > allocated an anon_vma without using the anon_vma_cachep. > > I've added code to verify this and am suspend/resuming now... Wait a > minute, Linus, you're good! :) : > > [ 873.083074] PM: Preallocating image memory... > [ 873.254359] NULL anon_vma->head.next, page 2182681 Yeah, I was pretty sure of that thing. I still don't see _how_ it happens, though. That 'struct anon_vma' is very simple, and contains literally just the lock and that list_head. Now, 'head.next' is kind of magical, because it contains that magic low-bit "have I been locked" thing (see "vm_lock_anon_vma()" in mm/mmap.c). But I'm not seeing anything else touching it. And if you allocate a anon_vma the proper way, the SLUB constructor should have made sure that the head is initialized. And no normal list operation ever sets any list pointer to zero, although a "list_del()" on the first list entry could do it if that first list entry had a NULL next pointer. > Now, how do we track back to the place which is missing anon_vma->head > init? 
Can we use the struct page *page arg to page_referenced_anon() > somehow? You might enable SLUB debugging (both SLUB_DEBUG _and_ SLUB_DEBUG_ON), and then make the "object_err()" function in mm/slub.c be non-static. You could call it when you see the problem, perhaps. Or you could just add tests to both alloc_anon_vma() and free_anon_vma() to check that 'list_empty(&anon_vma->head)' is true. I dunno. > > I haven't looked at the kernel config files: do they perhaps share the > > same (odd?) SLUB/SLAB/SLOB config? > > what is an odd SL[AOU]B config? Probably anything but the default SLUB these days. But Steinar already said he had SLUB, so it's unlikely to be something odd. > > - anon_vma isn't actually an anonvma at all. 'page->mapping' was crud > > with the low bit set. That sounds unlikely, but who knows. The ksm code > > sets mapping to "stable_node + PAGE_MAPPING_ANON | PAGE_MAPPING_KSM" > > > > Did people have KSM enabled? > > Nope, KSM is off here. Yeah, wasn't for Steinar either. So it doesn't look like it's any odd corner case that depends on some odd configuration. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 21:27 ` Linus Torvalds @ 2010-04-06 22:59 ` Borislav Petkov 2010-04-06 23:27 ` Linus Torvalds 2010-04-06 23:37 ` Linus Torvalds 2010-04-06 23:22 ` Rik van Riel 1 sibling, 2 replies; 231+ messages in thread From: Borislav Petkov @ 2010-04-06 22:59 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Linus Torvalds <torvalds@linux-foundation.org> Date: Tue, Apr 06, 2010 at 02:27:37PM -0700 > On Tue, 6 Apr 2010, Borislav Petkov wrote: > > > So again, it's actually anon_vma.head.next that is NULL, not any of the > > > entries on the list itself. > > > > > > Now, I can see several cases for this: > > > > > > - the obvious one: anon_vma just wasn't correctly initialized, and is > > > missing a INIT_LIST_HEAD(&anon_vma->head). That's either a slab bug (we > > > don't have a whole lot of coverage of constructors), or somebody > > > allocated an anon_vma without using the anon_vma_cachep. > > > > I've added code to verify this and am suspend/resuming now... Wait a > > minute, Linus, you're good! :) : > > > > [ 873.083074] PM: Preallocating image memory... > > [ 873.254359] NULL anon_vma->head.next, page 2182681 > > Yeah, I was pretty sure of that thing. > > I still don't see _how_ it happens, though. That 'struct anon_vma' is very > simple, and contains literally just the lock and that list_head. > > Now, 'head.next' is kind of magical, because it contains that magic > low-bit "have I been locked" thing (see "vm_lock_anon_vma()" in > mm/mmap.c). But I'm not seeing anything else touching it. > > And if you allocate a anon_vma the proper way, the SLUB constructor should > have made sure that the head is initialized. 
And no normal list operation > ever sets any list pointer to zero, although a "list_del()" on the first > list entry could do it if that first list entry had a NULL next pointer. > > > Now, how do we track back to the place which is missing anon_vma->head > > init? Can we use the struct page *page arg to page_referenced_anon() > > somehow? > > You might enable SLUB debugging (both SLUB_DEBUG _and_ SLUB_DEBUG_ON), and > then make the "object_err()" function in mm/slub.c be non-static. You > could call it when you see the problem, perhaps. > > Or you could just add tests to both alloc_anon_vma() and free_anon_vma() > to check that 'list_empty(&anon_vma->head)' is true. I dunno. Ok, I tried doing all you suggested and here's what came out. Please, take this with a grain of salt because I'm almost falling asleep - even the coffee is not working anymore so it could be just as well that I've made a mistake somewhere (the new OOPS is a #GP, by the way), just watch: Source changes locally: -- diff --git a/include/linux/slab.h b/include/linux/slab.h index 4884462..0c11dfb 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -108,6 +108,8 @@ unsigned int kmem_cache_size(struct kmem_cache *); const char *kmem_cache_name(struct kmem_cache *); int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr); +void object_err(struct kmem_cache *s, struct page *page, u8 *object, char *reason); + /* * Please use this macro to create slab caches. Simply specify the * name of the structure and maybe some flags that are listed above. 
diff --git a/mm/rmap.c b/mm/rmap.c index eaa7a09..7b35b3f 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -66,11 +66,24 @@ static struct kmem_cache *anon_vma_chain_cachep; static inline struct anon_vma *anon_vma_alloc(void) { - return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL); + struct anon_vma *ret; + ret = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL); + + if (!ret->head.next) { + printk("%s NULL anon_vma->head.next\n", __func__); + dump_stack(); + } + + return ret; } void anon_vma_free(struct anon_vma *anon_vma) { + if (!anon_vma->head.next) { + printk("%s NULL anon_vma->head.next\n", __func__); + dump_stack(); + } + kmem_cache_free(anon_vma_cachep, anon_vma); } @@ -494,6 +507,18 @@ static int page_referenced_anon(struct page *page, return referenced; mapcount = page_mapcount(page); + + if (!anon_vma->head.next) { + printk(KERN_ERR "NULL anon_vma->head.next, page %lu\n", + page_to_pfn(page)); + + object_err(anon_vma_cachep, page, (u8 *)anon_vma, "NULL next"); + + dump_stack(); + + return referenced; + } + list_for_each_entry(avc, &anon_vma->head, same_anon_vma) { struct vm_area_struct *vma = avc->vma; unsigned long address = vma_address(page, vma); diff --git a/mm/slub.c b/mm/slub.c index b364844..bcf5416 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -477,7 +477,7 @@ static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p) dump_stack(); } -static void object_err(struct kmem_cache *s, struct page *page, +void object_err(struct kmem_cache *s, struct page *page, u8 *object, char *reason) { slab_bug(s, "%s", reason); --- do the same exercise of starting several guests and then shutting them down, and hibernating at the same time. After having shutdown the guests, start firefox and let it load a big html page and hibernate while doing so, boom! [ 269.104940] Freezing user space processes ... (elapsed 0.03 seconds) done. [ 269.141953] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done. [ 269.155115] PM: Preallocating image memory... 
[ 269.423811] general protection fault: 0000 [#1] PREEMPT SMP [ 269.424003] last sysfs file: /sys/power/state [ 269.424003] CPU 0 [ 269.424003] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_co nservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr edac_core k10temp 8250_pnp 8250 serial_ core [ 269.424003] [ 269.424003] Pid: 2617, comm: hib.sh Tainted: G W 2.6.34-rc3-00288-gab195c5-dirty #4 M3A78 PRO/System Product Name [ 269.424003] RIP: 0010:[<ffffffff810c0cb4>] [<ffffffff810c0cb4>] page_referenced+0x147/0x232 [ 269.424003] RSP: 0018:ffff88022a1218b8 EFLAGS: 00010246 [ 269.424003] RAX: ffff8802126fa468 RBX: ffffea000700b210 RCX: 0000000000000000 [ 269.424003] RDX: ffff8802126fa429 RSI: ffff8802126fa440 RDI: ffff88022dc3cb80 [ 269.424003] RBP: ffff88022a121938 R08: 0000000000000002 R09: 0000000000000000 [ 269.424003] R10: 0000000000000246 R11: ffff88021a030478 R12: 0000000000000000 [ 269.424003] R13: 002e2e2e002e2e0e R14: ffff8802126fa428 R15: ffff88022a121a00 [ 269.424003] FS: 00007fe2799796f0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000 [ 269.424003] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 269.424003] CR2: 00007fffdefb3880 CR3: 00000002171c0000 CR4: 00000000000006f0 [ 269.424003] DR0: 0000000000000090 DR1: 00000000000000a4 DR2: 00000000000000ff [ 269.424003] DR3: 000000000000000f DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 269.424003] Process hib.sh (pid: 2617, threadinfo ffff88022a120000, task ffff88022dc3cb80) [ 269.424003] Stack: [ 269.424003] ffff8802126fa468 00000000813f8cfc ffffffff8165ae28 00000000000042e7 [ 269.424003] <0> ffff88022a1218f8 ffffffff810c6051 ffffea0006f968c8 ffffea0006f968c8 [ 269.424003] <0> ffff88022a121938 00000002810ab275 0000000006f96890 ffffea000700b238 [ 269.424003] Call Trace: [ 269.424003] [<ffffffff810c6051>] ? 
swapcache_free+0x37/0x3c [ 269.424003] [<ffffffff810ab79a>] shrink_page_list+0x14a/0x477 [ 269.424003] [<ffffffff813f8c36>] ? _raw_spin_unlock_irq+0x30/0x58 [ 269.424003] [<ffffffff810abe1e>] shrink_inactive_list+0x357/0x5e5 [ 269.424003] [<ffffffff810ac3b6>] shrink_zone+0x30a/0x3d4 [ 269.424003] [<ffffffff810acf91>] do_try_to_free_pages+0x176/0x27f [ 269.424003] [<ffffffff810ad12f>] shrink_all_memory+0x95/0xc4 [ 269.424003] [<ffffffff810aa634>] ? isolate_pages_global+0x0/0x1f0 [ 269.424003] [<ffffffff81076e64>] ? count_data_pages+0x65/0x79 [ 269.424003] [<ffffffff810770cb>] hibernate_preallocate_memory+0x1aa/0x2cb [ 269.424003] [<ffffffff813f5135>] ? printk+0x41/0x44 [ 269.424003] [<ffffffff81075a6b>] hibernation_snapshot+0x36/0x1e1 [ 269.424003] [<ffffffff81075ce4>] hibernate+0xce/0x172 [ 269.424003] [<ffffffff81074a51>] state_store+0x5c/0xd3 [ 269.424003] [<ffffffff81185097>] kobj_attr_store+0x17/0x19 [ 269.424003] [<ffffffff81125edb>] sysfs_write_file+0x108/0x144 [ 269.424003] [<ffffffff810d57a7>] vfs_write+0xb2/0x153 [ 269.424003] [<ffffffff81063bf1>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 269.424003] [<ffffffff810d590b>] sys_write+0x4a/0x71 [ 269.424003] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b [ 269.424003] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 1e f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 [ 269.424003] RIP [<ffffffff810c0cb4>] page_referenced+0x147/0x232 [ 269.424003] RSP <ffff88022a1218b8> [ 269.438405] ---[ end trace ad5b4172ee94398e ]--- [ 269.438553] note: hib.sh[2617] exited with preempt_count 2 [ 269.438709] BUG: scheduling while atomic: hib.sh/2617/0x10000003 [ 269.438858] INFO: lockdep is turned off. 
[ 269.439075] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_co nservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd pcspkr edac_core k10temp 8250_pnp 8250 serial_core [ 269.440875] Pid: 2617, comm: hib.sh Tainted: G D W 2.6.34-rc3-00288-gab195c5-dirty #4 [ 269.441137] Call Trace: [ 269.441288] [<ffffffff81063107>] ? __debug_show_held_locks+0x1b/0x24 [ 269.441440] [<ffffffff8102d3c0>] __schedule_bug+0x72/0x77 [ 269.441590] [<ffffffff813f553e>] schedule+0xd9/0x730 [ 269.441741] [<ffffffff8103022c>] __cond_resched+0x18/0x24 [ 269.441891] [<ffffffff813f5c62>] _cond_resched+0x2c/0x37 [ 269.442045] [<ffffffff810b7d7d>] unmap_vmas+0x6ce/0x893 [ 269.442205] [<ffffffff810bc42f>] exit_mmap+0xd7/0x182 [ 269.442352] [<ffffffff81035951>] mmput+0x48/0xb9 [ 269.442502] [<ffffffff81039c21>] exit_mm+0x110/0x11d [ 269.442652] [<ffffffff8103b663>] do_exit+0x1c5/0x691 [ 269.442802] [<ffffffff81038d0d>] ? kmsg_dump+0x13b/0x155 [ 269.442953] [<ffffffff810060db>] ? oops_end+0x47/0x93 [ 269.443107] [<ffffffff81006122>] oops_end+0x8e/0x93 [ 269.443262] [<ffffffff81006313>] die+0x5a/0x63 [ 269.443414] [<ffffffff81003eaf>] do_general_protection+0x134/0x13c [ 269.443566] [<ffffffff813f90f0>] ? irq_return+0x0/0x2 [ 269.443716] [<ffffffff813f92cf>] general_protection+0x1f/0x30 [ 269.443867] [<ffffffff810c0cb4>] ? page_referenced+0x147/0x232 [ 269.444021] [<ffffffff810c0bf0>] ? page_referenced+0x83/0x232 [ 269.444176] [<ffffffff810c6051>] ? swapcache_free+0x37/0x3c [ 269.444328] [<ffffffff810ab79a>] shrink_page_list+0x14a/0x477 [ 269.444479] [<ffffffff813f8c36>] ? _raw_spin_unlock_irq+0x30/0x58 [ 269.444630] [<ffffffff810abe1e>] shrink_inactive_list+0x357/0x5e5 [ 269.444782] [<ffffffff810ac3b6>] shrink_zone+0x30a/0x3d4 [ 269.444933] [<ffffffff810acf91>] do_try_to_free_pages+0x176/0x27f [ 269.445087] [<ffffffff810ad12f>] shrink_all_memory+0x95/0xc4 [ 269.445243] [<ffffffff810aa634>] ? 
isolate_pages_global+0x0/0x1f0 [ 269.445396] [<ffffffff81076e64>] ? count_data_pages+0x65/0x79 [ 269.445547] [<ffffffff810770cb>] hibernate_preallocate_memory+0x1aa/0x2cb [ 269.445698] [<ffffffff813f5135>] ? printk+0x41/0x44 [ 269.445848] [<ffffffff81075a6b>] hibernation_snapshot+0x36/0x1e1 [ 269.445999] [<ffffffff81075ce4>] hibernate+0xce/0x172 [ 269.446160] [<ffffffff81074a51>] state_store+0x5c/0xd3 [ 269.446307] [<ffffffff81185097>] kobj_attr_store+0x17/0x19 [ 269.446457] [<ffffffff81125edb>] sysfs_write_file+0x108/0x144 [ 269.446607] [<ffffffff810d57a7>] vfs_write+0xb2/0x153 [ 269.446757] [<ffffffff81063bf1>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 269.446908] [<ffffffff810d590b>] sys_write+0x4a/0x71 [ 269.447063] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b This time we have [ 269.424003] RIP: 0010:[<ffffffff810c0cb4>] [<ffffffff810c0cb4>] page_referenced+0x147/0x232 which is offset 0x1104. which is 10eb: 48 89 df mov %rbx,%rdi 10ee: e8 00 00 00 00 callq 10f3 <page_referenced+0x136> 10f3: 41 01 c4 add %eax,%r12d 10f6: 83 7d cc 00 cmpl $0x0,-0x34(%rbp) 10fa: 74 19 je 1115 <page_referenced+0x158> 10fc: 4d 8b 6d 20 mov 0x20(%r13),%r13 1100: 49 83 ed 20 sub $0x20,%r13 1104: 49 8b 45 20 mov 0x20(%r13),%rax <------------------------- 1108: 0f 18 08 prefetcht0 (%rax) 110b: 49 8d 45 20 lea 0x20(%r13),%rax 110f: 48 39 45 80 cmp %rax,-0x80(%rbp) 1113: 75 aa jne 10bf <page_referenced+0x102> 1115: 4c 89 f7 mov %r14,%rdi and asm is .loc 1 522 0 movq 32(%r13), %r13 # <variable>.same_anon_vma.next, __mptr.454 .LVL295: subq $32, %r13 #, avc .LVL296: .L186: .LBE1224: movq 32(%r13), %rax # <variable>.same_anon_vma.next, <variable>.same_anon_vma.next <-------------- prefetcht0 (%rax) # <variable>.same_anon_vma.next leaq 32(%r13), %rax #, tmp104 cmpq %rax, -128(%rbp) # tmp104, %sfp jne .L189 #, .L188: .loc 1 540 0 movq %r14, %rdi # anon_vma, call page_unlock_anon_vma # and %r13 contains some funny stuff, could be some mangled SLUB debug poison or something: R13: 
002e2e2e002e2e0e. Maybe this is the reason for the #GP. But yes, even if the oopsing instruction is movq 32(%r13), %rax # <variable>.same_anon_vma.next, <variable>.same_anon_vma.next this is not same_anon_vma.next because we've come to the above instruction through the ".L186:" label, before which we have %r13 already loaded with anon_vma->head.next. To be continued... -- Regards/Gruss, Boris. ^ permalink raw reply related [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 22:59 ` Borislav Petkov @ 2010-04-06 23:27 ` Linus Torvalds 2010-04-06 23:54 ` [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel ` (2 more replies) 2010-04-06 23:37 ` Linus Torvalds 1 sibling, 3 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-06 23:27 UTC (permalink / raw) To: Borislav Petkov Cc: Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Wed, 7 Apr 2010, Borislav Petkov wrote:
>
> Ok, I tried doing all you suggested and here's what came out. Please,
> take this with a grain of salt because I'm almost falling asleep - even
> the coffee is not working anymore so it could be just as well that I've
> made a mistake somewhere (the new OOPS is a #GP, by the way), just
> watch:

Hey ho, yeah. The reason it's a #GP fault is that it's not a NULL pointer dereference any more, but a wild pointer that is not in the legal region of pointers on x86-64. That is also why your debugging code didn't catch it: the pointer isn't NULL, so you got the #GP fault on the same old instruction:

  2b:* 49 8b 45 20 mov 0x20(%r13),%rax <-- trapping instruction

for all the same old reasons. But now %r13 has a non-zero value: 0x002e2e2e002e2e0e, which I do _not_ recognize as any of the normal poison values.

> and %r13 contains some funny stuff, could be some mangled SLUB debug
> poison or something: R13: 002e2e2e002e2e0e. Maybe this is the reason for
> the #GP.

Correct. You don't get a page fault if the pointer was totally bogus.

> But yes, even if the oopsing instruction is
>
> movq 32(%r13), %rax # <variable>.same_anon_vma.next, <variable>.same_anon_vma.next
>
> this is not same_anon_vma.next because we've come to the above
> instruction through the ".L186:" label, before which we have %r13
> already loaded with anon_vma->head.next.
No, you're mis-reading the asm. It's again the first iteration, and the code above it is again the end of the loop. And %rax is once more a kernel pointer, not the return value of 'page_referenced_one()'. So it once more is 'anon_vma->head.next' that is crap, but now it's not NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20 subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e). What does '0x2e' mean? It's ASCII '.', but that doesn't really mean anything either. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-06 23:27 ` Linus Torvalds @ 2010-04-06 23:54 ` Rik van Riel 2010-04-07 7:00 ` KOSAKI Motohiro 2010-04-07 7:29 ` Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) Borislav Petkov 2010-04-07 14:05 ` Paulo Marques 2 siblings, 1 reply; 231+ messages in thread From: Rik van Riel @ 2010-04-06 23:54 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Andrew Morton, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes When a new VMA has a mergeable anon_vma with a neighboring VMA, make sure all of the neighbor's old anon_vma structs are also linked in. This is necessary because at some point the VMAs could get merged, and we want to ensure no anon_vma structs get freed prematurely, while the system still has anonymous pages that belong to those structs. Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Rik van Riel <riel@redhat.com> --- include/linux/mm.h | 2 +- mm/mmap.c | 6 +++--- mm/rmap.c | 20 +++++++++++++------- 3 files changed, 17 insertions(+), 11 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e70f21b..90ac50e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1228,7 +1228,7 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *, struct vm_area_struct *prev, unsigned long addr, unsigned long end, unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t, struct mempolicy *); -extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *); +extern struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *); extern int split_vma(struct mm_struct *, struct vm_area_struct *, unsigned long addr, int new_below); extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *); diff --git a/mm/mmap.c b/mm/mmap.c index 75557c6..bf0600c 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -832,7 
+832,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, * anon_vmas being allocated, preventing vma merge in subsequent * mprotect. */ -struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma) +struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *vma) { struct vm_area_struct *near; unsigned long vm_flags; @@ -855,7 +855,7 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma) can_vma_merge_before(near, vm_flags, NULL, vma->vm_file, vma->vm_pgoff + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT))) - return near->anon_vma; + return near; try_prev: /* * It is potentially slow to have to call find_vma_prev here. @@ -875,7 +875,7 @@ try_prev: mpol_equal(vma_policy(near), vma_policy(vma)) && can_vma_merge_after(near, vm_flags, NULL, vma->vm_file, vma->vm_pgoff)) - return near->anon_vma; + return near; none: /* * There's no absolute need to look only at touching neighbours: diff --git a/mm/rmap.c b/mm/rmap.c index eaa7a09..60616db 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -119,20 +119,26 @@ int anon_vma_prepare(struct vm_area_struct *vma) might_sleep(); if (unlikely(!anon_vma)) { struct mm_struct *mm = vma->vm_mm; + struct vm_area_struct *merge_vma; struct anon_vma *allocated; + merge_vma = find_mergeable_anon_vma(vma); + if (merge_vma) { + if (anon_vma_clone(vma, merge_vma)) + goto out_enomem; + return 0; + } + avc = anon_vma_chain_alloc(); if (!avc) goto out_enomem; - anon_vma = find_mergeable_anon_vma(vma); allocated = NULL; - if (!anon_vma) { - anon_vma = anon_vma_alloc(); - if (unlikely(!anon_vma)) - goto out_enomem_free_avc; - allocated = anon_vma; - } + anon_vma = anon_vma_alloc(); + if (unlikely(!anon_vma)) + goto out_enomem_free_avc; + allocated = anon_vma; + spin_lock(&anon_vma->lock); /* page_table_lock to protect against threads */ ^ permalink raw reply related [flat|nested] 231+ messages in thread
* Re: [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-06 23:54 ` [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel @ 2010-04-07 7:00 ` KOSAKI Motohiro 2010-04-07 14:48 ` Rik van Riel 2010-04-07 14:54 ` [PATCH -v2] " Rik van Riel 0 siblings, 2 replies; 231+ messages in thread From: KOSAKI Motohiro @ 2010-04-07 7:00 UTC (permalink / raw) To: Rik van Riel Cc: kosaki.motohiro, Linus Torvalds, Borislav Petkov, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

> When a new VMA has a mergeable anon_vma with a neighboring VMA,
> make sure all of the neighbor's old anon_vma structs are also
> linked in.
>
> This is necessary because at some point the VMAs could get merged,
> and we want to ensure no anon_vma structs get freed prematurely,
> while the system still has anonymous pages that belong to those
> structs.

Ahhhh, shame on me. Sure, a neighbor vma might have lots of avcs ;-)

A few comments are below.
> > Reported-by: Borislav Petkov <bp@alien8.de> > Signed-off-by: Rik van Riel <riel@redhat.com> > > --- > include/linux/mm.h | 2 +- > mm/mmap.c | 6 +++--- > mm/rmap.c | 20 +++++++++++++------- > 3 files changed, 17 insertions(+), 11 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e70f21b..90ac50e 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1228,7 +1228,7 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *, > struct vm_area_struct *prev, unsigned long addr, unsigned long end, > unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t, > struct mempolicy *); > -extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *); > +extern struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *); > extern int split_vma(struct mm_struct *, > struct vm_area_struct *, unsigned long addr, int new_below); > extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *); > diff --git a/mm/mmap.c b/mm/mmap.c > index 75557c6..bf0600c 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -832,7 +832,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, > * anon_vmas being allocated, preventing vma merge in subsequent > * mprotect. > */ > -struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma) > +struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *vma) > { > struct vm_area_struct *near; > unsigned long vm_flags; > @@ -855,7 +855,7 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma) > can_vma_merge_before(near, vm_flags, > NULL, vma->vm_file, vma->vm_pgoff + > ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT))) > - return near->anon_vma; > + return near; > try_prev: > /* > * It is potentially slow to have to call find_vma_prev here. 
> @@ -875,7 +875,7 @@ try_prev: > mpol_equal(vma_policy(near), vma_policy(vma)) && > can_vma_merge_after(near, vm_flags, > NULL, vma->vm_file, vma->vm_pgoff)) > - return near->anon_vma; > + return near; > none: > /* > * There's no absolute need to look only at touching neighbours: > diff --git a/mm/rmap.c b/mm/rmap.c > index eaa7a09..60616db 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -119,20 +119,26 @@ int anon_vma_prepare(struct vm_area_struct *vma) > might_sleep(); > if (unlikely(!anon_vma)) { > struct mm_struct *mm = vma->vm_mm; > + struct vm_area_struct *merge_vma; > struct anon_vma *allocated; > > + merge_vma = find_mergeable_anon_vma(vma); > + if (merge_vma) { > + if (anon_vma_clone(vma, merge_vma)) > + goto out_enomem; > + return 0; > + } > + Hmm.. probably I'm moron. I'm also confusing this locking rule as same as linus said. after this patch, new locking order are down_read(mmap_sem) anon_vma_clone(vma, merge_vma) list_add(&avc->same_vma, &vma->anon_vma_chain); spin_lock(&anon_vma->lock); list_add_tail(&avc->same_anon_vma, &anon_vma->head); spin_unlock(&anon_vma->lock); spin_lock(&anon_vma->lock); spin_lock(&mm->page_table_lock); So, Why mmap_sem read lock can protect vma->anon_vma_chain? An another threads seems to be able to change avc list concurrentlly and freely. plus, Why don't we need "vma->anon_vma = merge_vma->anon_vma" assignment? if vma->anon_vma keep NULL, I think anon_vma_prepare() call anon_vma_clone() multiple times. 
> avc = anon_vma_chain_alloc(); > if (!avc) > goto out_enomem; > > - anon_vma = find_mergeable_anon_vma(vma); > allocated = NULL; > - if (!anon_vma) { > - anon_vma = anon_vma_alloc(); > - if (unlikely(!anon_vma)) > - goto out_enomem_free_avc; > - allocated = anon_vma; > - } > + anon_vma = anon_vma_alloc(); > + if (unlikely(!anon_vma)) > + goto out_enomem_free_avc; > + allocated = anon_vma; > + > spin_lock(&anon_vma->lock); > > /* page_table_lock to protect against threads */ ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 7:00 ` KOSAKI Motohiro
@ 2010-04-07 14:48 ` Rik van Riel
  2010-04-07 14:54 ` [PATCH -v2] " Rik van Riel
  1 sibling, 0 replies; 231+ messages in thread
From: Rik van Riel @ 2010-04-07 14:48 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Borislav Petkov, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On 04/07/2010 03:00 AM, KOSAKI Motohiro wrote:

> Hmm.. probably I'm moron.

Someone might be, but it's not you :)

> I'm also confusing this locking rule as same as linus said.
>
> after this patch, new locking order are

> So, Why mmap_sem read lock can protect vma->anon_vma_chain?
> An another threads seems to be able to change avc list concurrently and freely.

You are right, the code needs to take the page_table_lock
around the call to anon_vma_clone, so other threads get
locked out.

This means the locking order has now been inverted, with the
page_table_lock on the outside and the anon_vma locks on the
inside.

I have checked all the other call sites to the anon_vma code.

The direct callers of anon_vma_clone and anon_vma_fork
already hold the mmap_sem for write.

The callers of anon_vma_prepare hold the mmap_sem for read,
so excluding other callers of anon_vma_prepare with the
page_table_lock is enough.

mm_take_all_locks has the mmap_sem for write.

There seem to be no other traversals of the same_vma list,
so changing the locking order to have the page_table_lock
on the outside of the anon_vma locks works.

> plus, Why don't we need "vma->anon_vma = merge_vma->anon_vma" assignment?
> if vma->anon_vma keep NULL, I think anon_vma_prepare() call anon_vma_clone()
> multiple times.

Added in the new version. See the next email.

^ permalink raw reply [flat|nested] 231+ messages in thread
* [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-07 7:00 ` KOSAKI Motohiro 2010-04-07 14:48 ` Rik van Riel @ 2010-04-07 14:54 ` Rik van Riel 2010-04-07 15:30 ` Linus Torvalds 2010-04-07 15:55 ` Minchan Kim 1 sibling, 2 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-07 14:54 UTC (permalink / raw) To: KOSAKI Motohiro Cc: kosaki.motohiro, Linus Torvalds, Borislav Petkov, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes When a new VMA has a mergeable anon_vma with a neighboring VMA, make sure all of the neighbor's old anon_vma structs are also linked in. This is necessary because at some point the VMAs could get merged, and we want to ensure no anon_vma structs get freed prematurely, while the system still has anonymous pages that belong to those structs. Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Rik van Riel <riel@redhat.com> --- v2: - fix the locking issues spotted by Kosaki Motohiro - set vma->anon_vma correctly include/linux/mm.h | 2 +- mm/mmap.c | 6 +++--- mm/rmap.c | 27 ++++++++++++++++++--------- 3 files changed, 22 insertions(+), 13 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e70f21b..90ac50e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1228,7 +1228,7 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *, struct vm_area_struct *prev, unsigned long addr, unsigned long end, unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t, struct mempolicy *); -extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *); +extern struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *); extern int split_vma(struct mm_struct *, struct vm_area_struct *, unsigned long addr, int new_below); extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *); diff --git a/mm/mmap.c b/mm/mmap.c index 75557c6..bf0600c 100644 
--- a/mm/mmap.c +++ b/mm/mmap.c @@ -832,7 +832,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, * anon_vmas being allocated, preventing vma merge in subsequent * mprotect. */ -struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma) +struct vm_area_struct *find_mergeable_anon_vma(struct vm_area_struct *vma) { struct vm_area_struct *near; unsigned long vm_flags; @@ -855,7 +855,7 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma) can_vma_merge_before(near, vm_flags, NULL, vma->vm_file, vma->vm_pgoff + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT))) - return near->anon_vma; + return near; try_prev: /* * It is potentially slow to have to call find_vma_prev here. @@ -875,7 +875,7 @@ try_prev: mpol_equal(vma_policy(near), vma_policy(vma)) && can_vma_merge_after(near, vm_flags, NULL, vma->vm_file, vma->vm_pgoff)) - return near->anon_vma; + return near; none: /* * There's no absolute need to look only at touching neighbours: diff --git a/mm/rmap.c b/mm/rmap.c index eaa7a09..abe7aa5 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -119,24 +119,33 @@ int anon_vma_prepare(struct vm_area_struct *vma) might_sleep(); if (unlikely(!anon_vma)) { struct mm_struct *mm = vma->vm_mm; + struct vm_area_struct *merge_vma; struct anon_vma *allocated; + merge_vma = find_mergeable_anon_vma(vma); + if (merge_vma) { + int ret; + spin_lock(&mm->page_table_lock); + ret = anon_vma_clone(vma, merge_vma); + if (!ret) + vma->anon_vma = merge_vma->anon_vma; + spin_unlock(&mm->page_table_lock); + return ret; + } + avc = anon_vma_chain_alloc(); if (!avc) goto out_enomem; - anon_vma = find_mergeable_anon_vma(vma); allocated = NULL; - if (!anon_vma) { - anon_vma = anon_vma_alloc(); - if (unlikely(!anon_vma)) - goto out_enomem_free_avc; - allocated = anon_vma; - } - spin_lock(&anon_vma->lock); + anon_vma = anon_vma_alloc(); + if (unlikely(!anon_vma)) + goto out_enomem_free_avc; + allocated = anon_vma; /* page_table_lock to protect against threads */ 
spin_lock(&mm->page_table_lock); + spin_lock(&anon_vma->lock); if (likely(!vma->anon_vma)) { vma->anon_vma = anon_vma; avc->anon_vma = anon_vma; @@ -145,9 +154,9 @@ int anon_vma_prepare(struct vm_area_struct *vma) list_add(&avc->same_anon_vma, &anon_vma->head); allocated = NULL; } + spin_unlock(&anon_vma->lock); spin_unlock(&mm->page_table_lock); - spin_unlock(&anon_vma->lock); if (unlikely(allocated)) { anon_vma_free(allocated); anon_vma_chain_free(avc); ^ permalink raw reply related [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 14:54 ` [PATCH -v2] " Rik van Riel
@ 2010-04-07 15:30 ` Linus Torvalds
  2010-04-07 15:52 ` Rik van Riel
  2010-04-07 15:55 ` Minchan Kim
  1 sibling, 1 reply; 231+ messages in thread
From: Linus Torvalds @ 2010-04-07 15:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On Wed, 7 Apr 2010, Rik van Riel wrote:
>
> - fix the locking issues spotted by Kosaki Motohiro

No, they're broken.

And Rik, please explain the locking rather than make even more of these
kinds of random ad-hoc locking rules.

I've said this now _three_ times, but let me repeat once more:

 - the locking rules for that anon_vma_chain are very unclear. I _think_
   you mean for them to be "mmap_sem held for writing, _or_ mmap_sem held
   for reading and page_table_lock held", but nowhere is that actually
   documented.

Why is it so hard for you to just admit that? Especially after you
yourself got it wrong.

> +		merge_vma = find_mergeable_anon_vma(vma);
> +		if (merge_vma) {
> +			int ret;
> +			spin_lock(&mm->page_table_lock);
> +			ret = anon_vma_clone(vma, merge_vma);
> +			if (!ret)
> +				vma->anon_vma = merge_vma->anon_vma;
> +			spin_unlock(&mm->page_table_lock);
> +			return ret;
> +		}

Rik, the above is obviously total crap.

anon_vma_clone() needs to allocate memory, and it does so with GFP_KERNEL.
You can't do that with a spinlock held.

		Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
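
[ Editorial sketch, not part of the thread. The constraint raised in the message above (no sleeping allocation while a spinlock is held) is commonly met by allocating first, then taking the lock, re-checking the state, and freeing the allocation if another thread got there first. The userspace C model below illustrates that ordering under stated assumptions: a C11 atomic_flag stands in for mm->page_table_lock, calloc() stands in for a GFP_KERNEL allocation, and toy_vma, toy_anon_vma, toy_lock, toy_unlock and toy_prepare are hypothetical names, not kernel APIs. ]

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; these are not kernel types. */
struct toy_anon_vma {
	int refs;
};

struct toy_vma {
	atomic_flag lock;              /* models mm->page_table_lock */
	struct toy_anon_vma *anon_vma; /* NULL until prepared */
};

static void toy_lock(atomic_flag *l)
{
	/* Busy-wait: nothing that can sleep may run while this is held. */
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;
}

static void toy_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/*
 * Returns 0 on success, -1 if the allocation failed.
 * The allocation happens before the lock is taken, so the
 * locked region only performs pointer assignments.
 */
int toy_prepare(struct toy_vma *vma)
{
	struct toy_anon_vma *allocated;

	allocated = calloc(1, sizeof(*allocated)); /* no lock held here */
	if (!allocated)
		return -1;

	toy_lock(&vma->lock);
	if (!vma->anon_vma) {          /* re-check under the lock */
		allocated->refs = 1;
		vma->anon_vma = allocated;
		allocated = NULL;      /* ownership moved to the vma */
	}
	toy_unlock(&vma->lock);

	free(allocated);               /* lost the race: drop our copy */
	return 0;
}
```

[ Calling toy_prepare() twice on the same toy_vma leaves the first anon_vma in place and frees the second allocation, which mirrors the "allocated" handling visible in the quoted anon_vma_prepare() diff. The model always allocates; a real implementation would likely skip the allocation when the pointer is already set. ]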
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 15:30 ` Linus Torvalds
@ 2010-04-07 15:52 ` Rik van Riel
  2010-04-07 16:56 ` Linus Torvalds
  0 siblings, 1 reply; 231+ messages in thread
From: Rik van Riel @ 2010-04-07 15:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On 04/07/2010 11:30 AM, Linus Torvalds wrote:

> I've said this now _three_ times, but let me repeat once more:
>
>  - the locking rules for that anon_vma_chain are very unclear. I _think_
>    you mean for them to be "mmap_sem held for writing, _or_ mmap_sem held
>    for reading and page_table_lock held", but nowhere is that actually
>    documented.
>
> Why is it so hard for you to just admit that? Especially after you
> yourself got it wrong.

You are right, the idea was to continue to use the locking that
the anon_vma code was already using, without introducing any
new locking with the anon_vma patches.

However, it has become clear that this is no longer possible,
due to the need to hold a secondary lock across anon_vma_clone,
when we come from a code path that holds the mmap_sem for read.

>> +		merge_vma = find_mergeable_anon_vma(vma);
>> +		if (merge_vma) {
>> +			int ret;
>> +			spin_lock(&mm->page_table_lock);
>> +			ret = anon_vma_clone(vma, merge_vma);
>> +			if (!ret)
>> +				vma->anon_vma = merge_vma->anon_vma;
>> +			spin_unlock(&mm->page_table_lock);
>> +			return ret;
>> +		}
>
> Rik, the above is obviously total crap.
>
> anon_vma_clone() needs to allocate memory, and it does so with GFP_KERNEL.
> You can't do that with a spinlock held.

Looks like we'll either have to introduce a per-mm semaphore
for the same_vma anon_vma chains, or move the complexity of
solving this bug to anon_vma_merge, where we can ensure that
the resulting VMA has the sum of the anon_vmas of each VMA.

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 15:52 ` Rik van Riel
@ 2010-04-07 16:56 ` Linus Torvalds
  2010-04-07 21:19 ` Linus Torvalds
  0 siblings, 1 reply; 231+ messages in thread
From: Linus Torvalds @ 2010-04-07 16:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On Wed, 7 Apr 2010, Rik van Riel wrote:
>
> You are right, the idea was to continue to use the locking that
> the anon_vma code was already using, without introducing any
> new locking with the anon_vma patches.
>
> However, it has become clear that this is no longer possible,
> due to the need to hold a secondary lock across anon_vma_clone,
> when we come from a code path that holds the mmap_sem for read.

I do wonder if we could possibly simplify this a _lot_ by just requiring
that the anon_vma gets allocated at vma creation time (ie mmap), rather
than doing it on-demand when we actually do the page fault.

That would make all of this crap happen under mmap_sem held for writing,
and it would simplify the faulting code (which is the much more critical
code) a lot.

And it would make all your locking problems go away. Now all anon_vma code
really _would_ run with mmap_sem held exclusively, without any races.

When I tried to do a "fill in multiple page table entries in one go"
patch, that annoying anon_vma issue was a problem as well. Allocating the
anon_vma up-front would have simplified that code too.

I can't imagine that we ever really have mappings without an anon_vma in
practice _anyway_, so why delay the allocation until page fault time?

Maybe I'm missing something subtle.

		Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 16:56 ` Linus Torvalds
@ 2010-04-07 21:19 ` Linus Torvalds
  2010-04-07 21:52 ` Rik van Riel
  0 siblings, 1 reply; 231+ messages in thread
From: Linus Torvalds @ 2010-04-07 21:19 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On Wed, 7 Apr 2010, Linus Torvalds wrote:
>
> I do wonder if we could possibly simplify this a _lot_ by just requiring
> that the anon_vma gets allocated at vma creation time (ie mmap), rather
> than doing it on-demand when we actually do the page fault.
>
> That would make all of this crap happen under mmap_sem held for writing,
> and it would simplify the faulting code (which is the much more critical
> code) a lot.

Here is a patch that boots for me (but has had _zero_ serious testing:
caveat emptor etc etc).

It basically moves "anon_vma_prepare()" to be called in vma_link and in
__insert_vm_struct() - which I _think_ should cover all normal vma
creation events. I did a "WARN_ONCE(!vma->anon_vma)" just to check, I
haven't triggered one yet.

Now, this clearly will create anon_vma's that may never get used at all,
ie for things like shared mappings etc that never have anonymous memory
associated with them. But that structure is pretty small, so I don't find
it in myself to care too deeply.

And with this, all the anon_vma games should all happen with mmap_sem held
for writing, which should hopefully simplify things a lot. Rik, can you
use this to make a new version of your fixing patch?

Comments?

Linus --- mm/memory.c | 10 +--------- mm/mmap.c | 17 ++++------------- 2 files changed, 5 insertions(+), 22 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 833952d..0abefd8 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2223,9 +2223,6 @@ reuse: gotten: pte_unmap_unlock(page_table, ptl); - if (unlikely(anon_vma_prepare(vma))) - goto oom; - if (is_zero_pfn(pte_pfn(orig_pte))) { new_page = alloc_zeroed_user_highpage_movable(vma, address); if (!new_page) @@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, /* Allocate our own private page. */ pte_unmap(page_table); - if (unlikely(anon_vma_prepare(vma))) - goto oom; page = alloc_zeroed_user_highpage_movable(vma, address); if (!page) goto oom; @@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (flags & FAULT_FLAG_WRITE) { if (!(vma->vm_flags & VM_SHARED)) { anon = 1; - if (unlikely(anon_vma_prepare(vma))) { - ret = VM_FAULT_OOM; - goto out; - } page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); if (!page) { @@ -3115,6 +3106,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, pmd_t *pmd; pte_t *pte; + WARN_ONCE(!vma->anon_vma, "No anonvma"); __set_current_state(TASK_RUNNING); count_vm_event(PGFAULT); diff --git a/mm/mmap.c b/mm/mmap.c index 75557c6..c14284b 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, mm->map_count++; validate_mm(mm); + + anon_vma_prepare(vma); } /* @@ -479,6 +481,8 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) BUG_ON(__vma && __vma->vm_start < vma->vm_end); __vma_link(mm, vma, prev, rb_link, rb_parent); mm->map_count++; + + anon_vma_prepare(vma); } static inline void @@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) if (!(vma->vm_flags & VM_GROWSUP)) return -EFAULT; - /* - * We must make sure the anon_vma is 
allocated - * so that the anon_vma locking is not a noop. - */ - if (unlikely(anon_vma_prepare(vma))) - return -ENOMEM; anon_vma_lock(vma); /* @@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma, { int error; - /* - * We must make sure the anon_vma is allocated - * so that the anon_vma locking is not a noop. - */ - if (unlikely(anon_vma_prepare(vma))) - return -ENOMEM; - address &= PAGE_MASK; error = security_file_mmap(NULL, 0, 0, 0, address, 1); if (error) ^ permalink raw reply related [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 21:19 ` Linus Torvalds
@ 2010-04-07 21:52 ` Rik van Riel
  2010-04-07 22:09 ` Linus Torvalds
  0 siblings, 1 reply; 231+ messages in thread
From: Rik van Riel @ 2010-04-07 21:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On 04/07/2010 05:19 PM, Linus Torvalds wrote:

> Comments?

I remember there being an "unfixable" spot with this approach
when I originally wrote the new anon_vma linking code. However,
I can't for the life of me find that spot. I am starting to
believe I made it fixable as a side effect of one of the changes
I made :)

One of the issues with your patch is that anon_vma_prepare
can fail and this patch ignores its return value.

Having anon_vma_prepare fail after an mremap or mprotect
might result in messing up the VMAs of a process, or having
to undo the VMA changes that were made.

In fact, this may be the problem I was running into - not
wanting to add even more complex error paths to the vma
shuffling code.

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 21:52 ` Rik van Riel
@ 2010-04-07 22:09 ` Linus Torvalds
  2010-04-07 22:15 ` Linus Torvalds
  2010-04-07 23:37 ` Linus Torvalds
  0 siblings, 2 replies; 231+ messages in thread
From: Linus Torvalds @ 2010-04-07 22:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On Wed, 7 Apr 2010, Rik van Riel wrote:
>
> One of the issues with your patch is that anon_vma_prepare
> can fail and this patch ignores its return value.

Yes. The failure point is too late to do anything really interesting with,
and the old code also just causes a SIGBUS. My intention was to change the

	WARN_ONCE(!vma->anon_vma);

into returning that SIGBUS - which is not wonderful, but is no different
from old failures.

In the long run, it would be nicer to actually return an error from the
mmap() that fails, but that's more complicated, and as mentioned, it's not
what the old code used to do either (since the failure point was always at
the page fault stage).

> Having anon_vma_prepare fail after an mremap or mprotect
> might result in messing up the VMAs of a process, or having
> to undo the VMA changes that were made.

We really aren't any worse off than we have always been. If
anon_vma_prepare() fails, the vma list will be valid, but no new pages can
be added to that vma. That used to be true before too.

		Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 22:09 ` Linus Torvalds
@ 2010-04-07 22:15 ` Linus Torvalds
  2010-04-08 0:38 ` Rik van Riel
  2010-04-07 23:37 ` Linus Torvalds
  1 sibling, 1 reply; 231+ messages in thread
From: Linus Torvalds @ 2010-04-07 22:15 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On Wed, 7 Apr 2010, Linus Torvalds wrote:
>
> In the long run, it would be nicer to actually return an error from the
> mmap() that fails, but that's more complicated, and as mentioned, it's not
> what the old code used to do either (since the failure point was always at
> the page fault stage).

Put another way: I'm not proud of it, but the new code isn't any worse
than what we used to have, and I think the new code is _fixable_.

The easiest way to do that would likely be to pre-allocate the anon_vma
struct (and anon_vma_chain), and pass it down to anon_vma_prepare. That
way anon_vma_prepare() itself can never fail, and all we need to do is a
simple allocation earlier in the call-chain.

		Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
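
[ Editorial sketch, not part of the thread. One way to read the pre-allocation idea in the message above: the caller obtains everything that could fail while it can still return an error cleanly, then hands it to a linking step that has no failure path. The userspace C model below shows that shape under stated assumptions. toy_prealloc, toy_prealloc_get, toy_prepare_prealloc and toy_prealloc_put are hypothetical names, not kernel APIs, and the model deliberately ignores the case raised in the reply that follows, where merging with a neighbor may need more than one chain entry. ]

```c
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; these are not kernel types. */
struct toy_anon_vma {
	int users;
};

struct toy_chain {
	struct toy_anon_vma *anon_vma;
};

struct toy_vma {
	struct toy_anon_vma *anon_vma;
	struct toy_chain *chain;
};

struct toy_prealloc {
	struct toy_anon_vma *anon_vma;
	struct toy_chain *chain;
};

/* The only step that can fail; call it before modifying any vma. */
int toy_prealloc_get(struct toy_prealloc *pre)
{
	pre->anon_vma = calloc(1, sizeof(*pre->anon_vma));
	pre->chain = calloc(1, sizeof(*pre->chain));
	if (!pre->anon_vma || !pre->chain) {
		free(pre->anon_vma);
		free(pre->chain);
		pre->anon_vma = NULL;
		pre->chain = NULL;
		return -1;
	}
	return 0;
}

/* Cannot fail: consumes the preallocated objects if the vma needs them. */
void toy_prepare_prealloc(struct toy_vma *vma, struct toy_prealloc *pre)
{
	if (vma->anon_vma)
		return; /* already prepared; the caller releases *pre */

	pre->chain->anon_vma = pre->anon_vma;
	pre->anon_vma->users = 1;
	vma->anon_vma = pre->anon_vma;
	vma->chain = pre->chain;

	/* Ownership has moved to the vma. */
	pre->anon_vma = NULL;
	pre->chain = NULL;
}

/* Release whatever toy_prepare_prealloc() did not consume. */
void toy_prealloc_put(struct toy_prealloc *pre)
{
	free(pre->anon_vma);
	free(pre->chain);
	pre->anon_vma = NULL;
	pre->chain = NULL;
}
```

[ A caller would run toy_prealloc_get(), return an error if it fails, then call toy_prepare_prealloc() followed by toy_prealloc_put(). The cost of this design is an allocation that is sometimes thrown away; the benefit is that the step which touches the vma never has to unwind. ]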
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 22:15 ` Linus Torvalds
@ 2010-04-08 0:38 ` Rik van Riel
  0 siblings, 0 replies; 231+ messages in thread
From: Rik van Riel @ 2010-04-08 0:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On 04/07/2010 06:15 PM, Linus Torvalds wrote:
> On Wed, 7 Apr 2010, Linus Torvalds wrote:
>>
>> In the long run, it would be nicer to actually return an error from the
>> mmap() that fails, but that's more complicated, and as mentioned, it's not
>> what the old code used to do either (since the failure point was always at
>> the page fault stage).
>
> Put another way: I'm not proud of it, but the new code isn't any worse
> than what we used to have, and I think the new code is _fixable_.

Agreed, it is no worse than what we had before.

As to fixable, I suspect both situations are fixable. The new
code by getting the error paths right, the old code by
completely bailing out of the page fault and retrying it
(the pageout code should trigger an OOM kill at some point,
if we are really out of memory).

> The easiest way to do that would likely be to pre-allocate the anon_vma
> struct (and anon_vma_chain), and pass it down to anon_vma_prepare. That
> way anon_vma_prepare() itself can never fail, and all we need to do is a
> simple allocation earlier in the call-chain.

That may not work, because we may want to merge the anon_vma
with the anon_vma in an adjacent VMA ... and that adjacent
VMA could be chained onto multiple anon_vmas.

That means allocating a single anon_vma_chain may not be
enough.

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-07 22:09 ` Linus Torvalds 2010-04-07 22:15 ` Linus Torvalds @ 2010-04-07 23:37 ` Linus Torvalds 2010-04-08 2:03 ` KOSAKI Motohiro 1 sibling, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-07 23:37 UTC (permalink / raw) To: Rik van Riel Cc: KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes On Wed, 7 Apr 2010, Linus Torvalds wrote: > > Yes. The failure point is too late to do anything really interesting with, > and the old code also just causes a SIGBUS. My intention was to change the > > WARN_ONCE(!vma->anon_vma); > > into returning that SIGBUS - which is not wonderful, but is no different > from old failures. Not SIGBUS, but VM_FAULT_OOM, of course. IOW, something like this should be no worse than what we have now, and has the much nicer locking semantics. Having done some more digging, I can point to a downside: we do end up having about twice as many anon_vma entries. It seems about half of the vma's never need an anon_vma entry, probably because they end up being read-only file mappings, and thus never trigger the anonvma case. That said: - I don't really think you can fix the locking problem you have in a saner way - the anon_vma entry is much smaller than the vm_area_struct, so we're still using much less memory for them than for vma's. - We _could_ avoid allocating anonvma entries for shared mappings or for mappings that are read-only. That might force us to allocate some of them at mprotect time, and/or when doing a forced COW event with ptrace, but we have the mmap_sem for writing for the one case, and we could decide to get it for the other. So it's not a _fundamental_ problem if we decide we want to recover most of the memory lost by doing unconditional allocations. There are alternative models. 
For example, the VM layer _could_ decide to just release the mmap_sem, and re-do it and take it for writing if the vma doesn't have an anon_vma. I dunno. I like how this patch makes things so much less subtle, though. For example: with this in place, we could further simplify anon_vma_prepare(), since it would now never have the re-entrancy issue and wouldn't need to worry about taking that page_table_lock and re-testing vma->anon_vma for races. Linus --- mm/memory.c | 12 +++--------- mm/mmap.c | 17 ++++------------- 2 files changed, 7 insertions(+), 22 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 833952d..b5efe76 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2223,9 +2223,6 @@ reuse: gotten: pte_unmap_unlock(page_table, ptl); - if (unlikely(anon_vma_prepare(vma))) - goto oom; - if (is_zero_pfn(pte_pfn(orig_pte))) { new_page = alloc_zeroed_user_highpage_movable(vma, address); if (!new_page) @@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, /* Allocate our own private page. 
*/ pte_unmap(page_table); - if (unlikely(anon_vma_prepare(vma))) - goto oom; page = alloc_zeroed_user_highpage_movable(vma, address); if (!page) goto oom; @@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (flags & FAULT_FLAG_WRITE) { if (!(vma->vm_flags & VM_SHARED)) { anon = 1; - if (unlikely(anon_vma_prepare(vma))) { - ret = VM_FAULT_OOM; - goto out; - } page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); if (!page) { @@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, pmd_t *pmd; pte_t *pte; + if (!vma->anon_vma) + return VM_FAULT_OOM; + __set_current_state(TASK_RUNNING); count_vm_event(PGFAULT); diff --git a/mm/mmap.c b/mm/mmap.c index 75557c6..c14284b 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, mm->map_count++; validate_mm(mm); + + anon_vma_prepare(vma); } /* @@ -479,6 +481,8 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) BUG_ON(__vma && __vma->vm_start < vma->vm_end); __vma_link(mm, vma, prev, rb_link, rb_parent); mm->map_count++; + + anon_vma_prepare(vma); } static inline void @@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) if (!(vma->vm_flags & VM_GROWSUP)) return -EFAULT; - /* - * We must make sure the anon_vma is allocated - * so that the anon_vma locking is not a noop. - */ - if (unlikely(anon_vma_prepare(vma))) - return -ENOMEM; anon_vma_lock(vma); /* @@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma, { int error; - /* - * We must make sure the anon_vma is allocated - * so that the anon_vma locking is not a noop. - */ - if (unlikely(anon_vma_prepare(vma))) - return -ENOMEM; - address &= PAGE_MASK; error = security_file_mmap(NULL, 0, 0, 0, address, 1); if (error) ^ permalink raw reply related [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-07 23:37 ` Linus Torvalds
@ 2010-04-08 2:03 ` KOSAKI Motohiro
  2010-04-08 2:33 ` Linus Torvalds
  0 siblings, 1 reply; 231+ messages in thread
From: KOSAKI Motohiro @ 2010-04-08 2:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kosaki.motohiro, Rik van Riel, Borislav Petkov, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

Hi

Wow, your patch is very cool. I'm surprised that such a 20-line patch
simplifies so much.

> On Wed, 7 Apr 2010, Linus Torvalds wrote:
> >
> > Yes. The failure point is too late to do anything really interesting with,
> > and the old code also just causes a SIGBUS. My intention was to change the
> >
> > 	WARN_ONCE(!vma->anon_vma);
> >
> > into returning that SIGBUS - which is not wonderful, but is no different
> > from old failures.
>
> Not SIGBUS, but VM_FAULT_OOM, of course.

Now the page fault path doesn't insert an anon_vma anymore, right? If so,
SIGBUS is better. SIGBUS and VM_FAULT_OOM now produce different results:

	SIGBUS       -> kill the current task
	VM_FAULT_OOM -> invoke the oom-killer (see pagefault_out_of_memory())

If the current task can't recover a proper anon_vma, we should just kill
the current task instead of a random process with the highest badness.
Otherwise, the !anon_vma process will keep randomly invoking the oom-killer.

Perhaps I'm missing something.

^ permalink raw reply [flat|nested] 231+ messages in thread
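
[ Editorial sketch, not part of the thread. The distinction drawn in the message above is about who pays for a vma that has no anon_vma: with a SIGBUS-style result only the faulting task is killed, while an OOM-style result hands the decision to the oom-killer, which may pick a different process. The small C model below encodes just that mapping. The enum values and the names toy_handle_fault and toy_fault_action are hypothetical and do not correspond to the kernel's VM_FAULT_* constants or functions. ]

```c
#include <stddef.h>

/* Hypothetical model; names and values are illustrative, not the kernel's. */
enum toy_fault {
	TOY_FAULT_OK,
	TOY_FAULT_OOM,
	TOY_FAULT_SIGBUS
};

enum toy_action {
	TOY_CONTINUE,        /* fault handled, task keeps running */
	TOY_KILL_CURRENT,    /* only the faulting task is killed */
	TOY_RUN_OOM_KILLER   /* a victim is chosen system-wide */
};

struct toy_vma {
	void *anon_vma;      /* NULL models a failed earlier allocation */
};

/* What the fault handler reports for a vma, under either policy. */
enum toy_fault toy_handle_fault(const struct toy_vma *vma, int use_sigbus)
{
	if (!vma->anon_vma)
		return use_sigbus ? TOY_FAULT_SIGBUS : TOY_FAULT_OOM;
	return TOY_FAULT_OK;
}

/* What the caller of the fault handler does with that report. */
enum toy_action toy_fault_action(enum toy_fault result)
{
	switch (result) {
	case TOY_FAULT_SIGBUS:
		return TOY_KILL_CURRENT;
	case TOY_FAULT_OOM:
		return TOY_RUN_OOM_KILLER;
	default:
		return TOY_CONTINUE;
	}
}
```

[ In this model a vma with a NULL anon_vma faults the same way every time, so under the OOM-style policy each retry would trigger another victim selection. That is the repeated oom-killer invocation the message warns about. ]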
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-08 2:03 ` KOSAKI Motohiro
@ 2010-04-08 2:33 ` Linus Torvalds
  2010-04-08 5:47 ` Borislav Petkov
  0 siblings, 1 reply; 231+ messages in thread
From: Linus Torvalds @ 2010-04-08 2:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Borislav Petkov, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On Thu, 8 Apr 2010, KOSAKI Motohiro wrote:
>
> Now pagefault don't insert anon_vma anymore, right? if so, SIGBUS is better.
> Now SIGBUS and VM_FAULT_OOM make different result.
>
> SIGBUS -> kill current task
> VM_FAULT_OOM -> invoke oom-killer (see pagefault_out_of_memory())

Yeah, maybe VM_FAULT_SIGBUS works ok instead of VM_FAULT_OOM. But the
cause of it is the system having been oom when the mapping was created, so
I think either is fine.

> If current task can't recover proper anon_vma. we should just kill current
> instead random highest badness process. otherwise !anon_vma process continue
> to randomly invoke oom-killer.

Yes, that is a good point.

Anyway, I think it might be interesting to test my anon_vma_prepare()
locking change patch together with Rik's _first_ version of his "fix
anon_vma_prepare" thing (the one without the spinlock). They should apply
independently of each other, and maybe it all even works together.

		Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-08 2:33 ` Linus Torvalds @ 2010-04-08 5:47 ` Borislav Petkov 2010-04-08 14:11 ` Linus Torvalds 0 siblings, 1 reply; 231+ messages in thread From: Borislav Petkov @ 2010-04-08 5:47 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes From: Linus Torvalds <torvalds@linux-foundation.org> Date: Wed, Apr 07, 2010 at 07:33:01PM -0700 > Anyway, I think it might be interesting to test my anon_vma_prepare() > locking change patch together with Rik's _first_ version of his "fix > anon_vma_prepare" thing (the one without the spinlock). They should apply > independently of each other, and maybe it all even works together. There are still issues: vma_adjust() grabs mapping->i_mmap_lock for file mappings while we might sleep in anon_vma_prepare(): [ 9.386929] BUG: sleeping function called from invalid context at mm/rmap.c:119 [ 9.387188] in_atomic(): 1, irqs_disabled(): 0, pid: 1068, name: modprobe [ 9.387343] 3 locks held by modprobe/1068: [ 9.387524] #0: (&p->cred_guard_mutex){+.+.+.}, at: [<ffffffff810d97fc>] prepare_bprm_creds+0x29/0x5a [ 9.387959] #1: (&mm->mmap_sem){++++++}, at: [<ffffffff81110ee2>] elf_map+0x70/0x190 [ 9.388416] #2: (&(&inode->i_data.i_mmap_lock)->rlock){+.+...}, at: [<ffffffff810bcbdf>] vma_adjust+0x190/0x3ca [ 9.388848] Pid: 1068, comm: modprobe Not tainted 2.6.34-rc3-00290-ge4b2849 #6 [ 9.389102] Call Trace: [ 9.389256] [<ffffffff810630f6>] ? __debug_show_held_locks+0x22/0x24 [ 9.389418] [<ffffffff8102c288>] __might_sleep+0x117/0x11b [ 9.389570] [<ffffffff810c0f2e>] anon_vma_prepare+0x30/0x132 [ 9.389722] [<ffffffff810bcd95>] vma_adjust+0x346/0x3ca [ 9.389874] [<ffffffff810bcf68>] __split_vma+0x14f/0x1b9 [ 9.390027] [<ffffffff810bd143>] do_munmap+0x171/0x315 [ 9.390181] [<ffffffff81110ee2>] ?
elf_map+0x70/0x190 [ 9.390335] [<ffffffff81110f9d>] elf_map+0x12b/0x190 [ 9.390493] [<ffffffff81111b35>] load_elf_binary+0xb33/0x170e [ 9.390645] [<ffffffff8102d529>] ? sub_preempt_count+0xa3/0xb6 [ 9.390800] [<ffffffff810d945a>] search_binary_handler+0x166/0x30e [ 9.390952] [<ffffffff810d92ab>] ? copy_strings+0x1d4/0x1e5 [ 9.391111] [<ffffffff81111002>] ? load_elf_binary+0x0/0x170e [ 9.391265] [<ffffffff810dadff>] do_execve+0x1fc/0x2f5 [ 9.391424] [<ffffffff8100a379>] sys_execve+0x43/0x61 [ 9.391576] [<ffffffff810025fa>] stub_execve+0x6a/0xc0 -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-08 5:47 ` Borislav Petkov @ 2010-04-08 14:11 ` Linus Torvalds 2010-04-08 18:25 ` Rik van Riel 2010-04-08 21:00 ` Borislav Petkov 0 siblings, 2 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-08 14:11 UTC (permalink / raw) To: Borislav Petkov Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes On Thu, 8 Apr 2010, Borislav Petkov wrote: > > There are still issues: vma_adjust() grabs mapping->i_mmap_lock for file > mappings while we might sleep in anon_vma_prepare(): Ahh. Good catch. So I can't actually do that anon_vma_prepare() thing in __insert_vm_struct. It should be simple enough to just move it into the caller, just after it releases that lock. There's only one user of that __insert_vm_struct() anyway. You can do it yourself, or you can replace my previous patch with this.. [ The patch below also makes it warn once and return SIGBUS for the case where there is no anon_vma. I decided I still want to hear about it if there might be some path that tries to insert a vma on its own ] Linus --- mm/memory.c | 12 +++--------- mm/mmap.c | 17 ++++------------- 2 files changed, 7 insertions(+), 22 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 833952d..08d4423 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2223,9 +2223,6 @@ reuse: gotten: pte_unmap_unlock(page_table, ptl); - if (unlikely(anon_vma_prepare(vma))) - goto oom; - if (is_zero_pfn(pte_pfn(orig_pte))) { new_page = alloc_zeroed_user_highpage_movable(vma, address); if (!new_page) @@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, /* Allocate our own private page. 
*/ pte_unmap(page_table); - if (unlikely(anon_vma_prepare(vma))) - goto oom; page = alloc_zeroed_user_highpage_movable(vma, address); if (!page) goto oom; @@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (flags & FAULT_FLAG_WRITE) { if (!(vma->vm_flags & VM_SHARED)) { anon = 1; - if (unlikely(anon_vma_prepare(vma))) { - ret = VM_FAULT_OOM; - goto out; - } page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); if (!page) { @@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, pmd_t *pmd; pte_t *pte; + if (WARN_ONCE(!vma->anon_vma, "Mapping with no anon_vma")) + return VM_FAULT_SIGBUS; + __set_current_state(TASK_RUNNING); count_vm_event(PGFAULT); diff --git a/mm/mmap.c b/mm/mmap.c index 75557c6..82392c2 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, mm->map_count++; validate_mm(mm); + + anon_vma_prepare(vma); } /* @@ -628,6 +630,8 @@ again: remove_next = 1 + (end > next->vm_end); if (mapping) spin_unlock(&mapping->i_mmap_lock); + anon_vma_prepare(vma); + if (remove_next) { if (file) { fput(file); @@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) if (!(vma->vm_flags & VM_GROWSUP)) return -EFAULT; - /* - * We must make sure the anon_vma is allocated - * so that the anon_vma locking is not a noop. - */ - if (unlikely(anon_vma_prepare(vma))) - return -ENOMEM; anon_vma_lock(vma); /* @@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma, { int error; - /* - * We must make sure the anon_vma is allocated - * so that the anon_vma locking is not a noop. - */ - if (unlikely(anon_vma_prepare(vma))) - return -ENOMEM; - address &= PAGE_MASK; error = security_file_mmap(NULL, 0, 0, 0, address, 1); if (error) ^ permalink raw reply related [flat|nested] 231+ messages in thread
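[ The ordering bug reported above, and the fix of calling anon_vma_prepare() only after the spinlock is released, can be modelled in userspace C. This is a minimal sketch under stated assumptions, not the kernel implementation: every toy_* name, the atomic_depth counter and the two *_adjust() helpers are hypothetical, and the real __might_sleep() check looks at the preempt count and more. ]

```c
#include <assert.h>

/*
 * Toy model: a count of held spinlocks stands in for the kernel's
 * notion of atomic context. All names are hypothetical.
 */
static int atomic_depth;	/* > 0 means "in atomic context" */
static int sleep_violations;	/* number of would-be BUG splats */

static void toy_spin_lock(void)
{
	atomic_depth++;
}

static void toy_spin_unlock(void)
{
	assert(atomic_depth > 0);	/* unlock without lock is a bug */
	atomic_depth--;
}

/* Models the might-sleep debug check: complain in atomic context. */
static void toy_might_sleep(void)
{
	if (atomic_depth > 0)
		sleep_violations++;
}

/* Stand-in for anon_vma_prepare(): it may allocate, so it may sleep. */
static void toy_prepare(void)
{
	toy_might_sleep();
}

/* Buggy ordering: prepare while the lock is still held. */
static int buggy_adjust(void)
{
	sleep_violations = 0;
	toy_spin_lock();
	toy_prepare();
	toy_spin_unlock();
	return sleep_violations;
}

/* Fixed ordering: drop the lock first, then prepare. */
static int fixed_adjust(void)
{
	sleep_violations = 0;
	toy_spin_lock();
	toy_spin_unlock();
	toy_prepare();
	return sleep_violations;
}
```

[ In this model buggy_adjust() records one violation and fixed_adjust() records none, which mirrors why the call was moved to just after the spin_unlock() in vma_adjust(). ]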
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-08 14:11 ` Linus Torvalds @ 2010-04-08 18:25 ` Rik van Riel 2010-04-08 18:32 ` Linus Torvalds 2010-04-08 21:00 ` Borislav Petkov 1 sibling, 1 reply; 231+ messages in thread From: Rik van Riel @ 2010-04-08 18:25 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes On 04/08/2010 10:11 AM, Linus Torvalds wrote: > > > On Thu, 8 Apr 2010, Borislav Petkov wrote: >> >> There are still issues: vma_adjust() grabs mapping->i_mmap_lock for file >> mappings while we might sleep in anon_vma_prepare(): > > Ahh. Good catch. So I can't actually do that anon_vma_prepare() thing in > __insert_vm_struct. > > It should be simple enough to just move it into the caller, just after it > releases that lock. There's only one user of that __insert_vm_struct() > anyway. You can do it yourself, or you can replace my previous patch with > this.. > > [ The patch below also makes it warn once and return SIGBUS for the case > where there is no anon_vma. I decided I still want to hear about it if > there might be some path that tries to insert a vma on its own ] Reviewed-by: Rik van Riel <riel@redhat.com> I haven't seen any places that insert VMAs by itself. Several strange places that allocate them, but they all appear to use the standard functions to insert them. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-08 18:25 ` Rik van Riel @ 2010-04-08 18:32 ` Linus Torvalds 2010-04-08 20:31 ` Borislav Petkov 0 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-08 18:32 UTC (permalink / raw) To: Rik van Riel Cc: Borislav Petkov, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes On Thu, 8 Apr 2010, Rik van Riel wrote: > > Reviewed-by: Rik van Riel <riel@redhat.com> Yeah, I think I'll commit it as-is, assuming we get confirmation that it (along with your patch) actually ends up fixing the original problem. I had actually had lockdep etc on with that patch, but for some reason I'd overlooked the SPINLOCK_SLEEP debugging, so I hadn't seen the stupid issue that Borislav pointed out. I wonder if LOCKDEP or spinlock debugging should just select it. Small detail, but I should have caught that obvious bug myself. > I haven't seen any places that insert VMAs by itself. > Several strange places that allocate them, but they > all appear to use the standard functions to insert them. Yeah, it's complicated enough to add a vma with all the rbtree etc stuff that I hope nobody actually cooks their own. But I too grepped for vma allocations, and there were more of them than I expected, so... Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-08 18:32 ` Linus Torvalds @ 2010-04-08 20:31 ` Borislav Petkov 0 siblings, 0 replies; 231+ messages in thread From: Borislav Petkov @ 2010-04-08 20:31 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes From: Linus Torvalds <torvalds@linux-foundation.org> Date: Thu, Apr 08, 2010 at 11:32:06AM -0700 Here we go, another night of testing starts... got more caffeine this time :) > > I haven't seen any places that insert VMAs by itself. > > Several strange places that allocate them, but they > > all appear to use the standard functions to insert them. > > Yeah, it's complicated enough to add a vma with all the rbtree etc stuff > that I hope nobody actually cooks their own. But I too grepped for vma > allocations, and there were more of them than I expected, so... ... and of course, I just hit that WARN_ONCE on the first suspend (it did suspend ok though): [ 88.078958] ------------[ cut here ]------------ [ 88.079007] WARNING: at mm/memory.c:3110 handle_mm_fault+0x56/0x67c() [ 88.079032] Hardware name: System Product Name [ 88.079056] Mapping with no anon_vma [ 88.079082] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod k10temp 8250_pnp 8250 serial_core edac_core ohci_hcd pcspkr [ 88.079637] Pid: 1965, comm: console-kit-dae Not tainted 2.6.34-rc3-00290-g2156db9 #7 [ 88.079676] Call Trace: [ 88.079713] [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94 [ 88.079744] [<ffffffff81037f17>] warn_slowpath_fmt+0x41/0x43 [ 88.079774] [<ffffffff810b857d>] handle_mm_fault+0x56/0x67c [ 88.079805] [<ffffffff8101f392>] do_page_fault+0x30b/0x32d [ 88.079838] [<ffffffff810615ce>] ? 
put_lock_stats+0xe/0x27 [ 88.079866] [<ffffffff81062a55>] ? lock_release_holdtime+0x104/0x109 [ 88.079898] [<ffffffff813f93e3>] ? error_sti+0x5/0x6 [ 88.079929] [<ffffffff813f7de2>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 88.079960] [<ffffffff813f91ff>] page_fault+0x1f/0x30 [ 88.079988] ---[ end trace 154dd7f6249e1cc3 ]--- and then sysfs triggered that lockdep circular locking warning - I thought it was fixed already :( [ 256.831204] ======================================================= [ 256.831210] [ INFO: possible circular locking dependency detected ] [ 256.831216] 2.6.34-rc3-00290-g2156db9 #7 [ 256.831221] ------------------------------------------------------- [ 256.831226] hib.sh/2464 is trying to acquire lock: [ 256.831231] (s_active#80){++++.+}, at: [<ffffffff81127412>] sysfs_addrm_finish+0x36/0x5f [ 256.831250] [ 256.831252] but task is already holding lock: [ 256.831256] (&per_cpu(cpu_policy_rwsem, cpu)){+++++.}, at: [<ffffffff8131bb52>] lock_policy_rwsem_write+0x4f/0x80 [ 256.831271] [ 256.831273] which lock already depends on the new lock. 
[ 256.831275] [ 256.831278] [ 256.831280] the existing dependency chain (in reverse order) is: [ 256.831284] [ 256.831286] -> #1 (&per_cpu(cpu_policy_rwsem, cpu)){+++++.}: [ 256.831294] [<ffffffff8106790a>] __lock_acquire+0x1306/0x169f [ 256.831305] [<ffffffff81067d95>] lock_acquire+0xf2/0x118 [ 256.831314] [<ffffffff813f727a>] down_read+0x4c/0x91 [ 256.831323] [<ffffffff8131c9f3>] lock_policy_rwsem_read+0x4f/0x80 [ 256.831332] [<ffffffff8131ca5c>] show+0x38/0x71 [ 256.831341] [<ffffffff81125ef0>] sysfs_read_file+0xb9/0x13e [ 256.831348] [<ffffffff810d5901>] vfs_read+0xaf/0x150 [ 256.831357] [<ffffffff810d5a65>] sys_read+0x4a/0x71 [ 256.831364] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b [ 256.831375] [ 256.831376] -> #0 (s_active#80){++++.+}: [ 256.831385] [<ffffffff810675c1>] __lock_acquire+0xfbd/0x169f [ 256.831385] [<ffffffff81067d95>] lock_acquire+0xf2/0x118 [ 256.831385] [<ffffffff81126a79>] sysfs_deactivate+0x91/0xe6 [ 256.831385] [<ffffffff81127412>] sysfs_addrm_finish+0x36/0x5f [ 256.831385] [<ffffffff81127504>] sysfs_remove_dir+0x7a/0x8d [ 256.831385] [<ffffffff8118522e>] kobject_del+0x16/0x37 [ 256.831385] [<ffffffff8118528d>] kobject_release+0x3e/0x66 [ 256.831385] [<ffffffff811860d9>] kref_put+0x43/0x4d [ 256.831385] [<ffffffff811851a9>] kobject_put+0x47/0x4b [ 256.831385] [<ffffffff8131ba68>] __cpufreq_remove_dev+0x1e5/0x241 [ 256.831385] [<ffffffff813f4e33>] cpufreq_cpu_callback+0x67/0x7f [ 256.831385] [<ffffffff8105846b>] notifier_call_chain+0x37/0x63 [ 256.831385] [<ffffffff81058505>] __raw_notifier_call_chain+0xe/0x10 [ 256.831385] [<ffffffff813e6091>] _cpu_down+0x98/0x2a6 [ 256.831385] [<ffffffff810396b1>] disable_nonboot_cpus+0x74/0x10d [ 256.831385] [<ffffffff81075ac9>] hibernation_snapshot+0xac/0x1e1 [ 256.831385] [<ffffffff81075ccc>] hibernate+0xce/0x172 [ 256.831385] [<ffffffff81074a39>] state_store+0x5c/0xd3 [ 256.831385] [<ffffffff81184fb7>] kobj_attr_store+0x17/0x19 [ 256.831385] [<ffffffff81125dfb>] sysfs_write_file+0x108/0x144 [ 
256.831385] [<ffffffff810d56c7>] vfs_write+0xb2/0x153 [ 256.831385] [<ffffffff810d582b>] sys_write+0x4a/0x71 [ 256.831385] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b [ 256.831385] [ 256.831385] other info that might help us debug this: [ 256.831385] [ 256.831385] 6 locks held by hib.sh/2464: [ 256.831385] #0: (&buffer->mutex){+.+.+.}, at: [<ffffffff81125d2f>] sysfs_write_file+0x3c/0x144 [ 256.831385] #1: (s_active#49){.+.+.+}, at: [<ffffffff81125dda>] sysfs_write_file+0xe7/0x144 [ 256.831385] #2: (pm_mutex){+.+.+.}, at: [<ffffffff81075c1a>] hibernate+0x1c/0x172 [ 256.831385] #3: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff810395d1>] cpu_maps_update_begin+0x17/0x19 [ 256.831385] #4: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81039616>] cpu_hotplug_begin+0x2c/0x53 [ 256.831385] #5: (&per_cpu(cpu_policy_rwsem, cpu)){+++++.}, at: [<ffffffff8131bb52>] lock_policy_rwsem_write+0x4f/0x80 [ 256.831385] [ 256.831385] stack backtrace: [ 256.831385] Pid: 2464, comm: hib.sh Tainted: G W 2.6.34-rc3-00290-g2156db9 #7 [ 256.831385] Call Trace: [ 256.831385] [<ffffffff810643c3>] print_circular_bug+0xae/0xbd [ 256.831385] [<ffffffff810675c1>] __lock_acquire+0xfbd/0x169f [ 256.831385] [<ffffffff81127412>] ? sysfs_addrm_finish+0x36/0x5f [ 256.831385] [<ffffffff81067d95>] lock_acquire+0xf2/0x118 [ 256.831385] [<ffffffff81127412>] ? sysfs_addrm_finish+0x36/0x5f [ 256.831385] [<ffffffff81126a79>] sysfs_deactivate+0x91/0xe6 [ 256.831385] [<ffffffff81127412>] ? sysfs_addrm_finish+0x36/0x5f [ 256.831385] [<ffffffff81063d12>] ? trace_hardirqs_on+0xd/0xf [ 256.831385] [<ffffffff81126f3d>] ? release_sysfs_dirent+0x89/0xa9 [ 256.831385] [<ffffffff81127412>] sysfs_addrm_finish+0x36/0x5f [ 256.831385] [<ffffffff81127504>] sysfs_remove_dir+0x7a/0x8d [ 256.831385] [<ffffffff8118522e>] kobject_del+0x16/0x37 [ 256.831385] [<ffffffff8118528d>] kobject_release+0x3e/0x66 [ 256.831385] [<ffffffff8118524f>] ? 
kobject_release+0x0/0x66 [ 256.831385] [<ffffffff811860d9>] kref_put+0x43/0x4d [ 256.831385] [<ffffffff811851a9>] kobject_put+0x47/0x4b [ 256.831385] [<ffffffff8131ba68>] __cpufreq_remove_dev+0x1e5/0x241 [ 256.831385] [<ffffffff813f4e33>] cpufreq_cpu_callback+0x67/0x7f [ 256.831385] [<ffffffff8105846b>] notifier_call_chain+0x37/0x63 [ 256.831385] [<ffffffff81058505>] __raw_notifier_call_chain+0xe/0x10 [ 256.831385] [<ffffffff813e6091>] _cpu_down+0x98/0x2a6 [ 256.831385] [<ffffffff810396b1>] disable_nonboot_cpus+0x74/0x10d [ 256.831385] [<ffffffff81075ac9>] hibernation_snapshot+0xac/0x1e1 [ 256.831385] [<ffffffff81075ccc>] hibernate+0xce/0x172 [ 256.831385] [<ffffffff81074a39>] state_store+0x5c/0xd3 [ 256.831385] [<ffffffff81184fb7>] kobj_attr_store+0x17/0x19 [ 256.831385] [<ffffffff81125dfb>] sysfs_write_file+0x108/0x144 [ 256.831385] [<ffffffff810d56c7>] vfs_write+0xb2/0x153 [ 256.831385] [<ffffffff81063cda>] ? trace_hardirqs_on_caller+0x120/0x14b [ 256.831385] [<ffffffff810d582b>] sys_write+0x4a/0x71 [ 256.831385] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-08 14:11 ` Linus Torvalds 2010-04-08 18:25 ` Rik van Riel @ 2010-04-08 21:00 ` Borislav Petkov 2010-04-08 23:16 ` Linus Torvalds 1 sibling, 1 reply; 231+ messages in thread From: Borislav Petkov @ 2010-04-08 21:00 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes From: Linus Torvalds <torvalds@linux-foundation.org> Date: Thu, Apr 08, 2010 at 07:11:11AM -0700 > [ The patch below also makes it warn once and return SIGBUS for the case > where there is no anon_vma. I decided I still want to hear about it if > there might be some path that tries to insert a vma on its own ] And this happens quite often - I changed the WARN_ONCE to WARN and can't start kvm, iceowl (mozilla calendar) and the console-kit-daemon craps up upon boot too: [ 55.814570] ------------[ cut here ]------------ [ 55.814623] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x66a() [ 55.814648] Hardware name: System Product Name [ 55.814671] Mapping with no anon_vma [ 55.814693] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core ohci_hcd serial_core k10temp pcspkr [ 55.815249] Pid: 1936, comm: console-kit-dae Not tainted 2.6.34-rc3-00290-g2156db9-dirty #8 [ 55.815290] Call Trace: [ 55.815327] [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94 [ 55.815362] [<ffffffff81037f17>] warn_slowpath_fmt+0x41/0x43 [ 55.815391] [<ffffffff810b856a>] handle_mm_fault+0x43/0x66a [ 55.815420] [<ffffffff8101f392>] do_page_fault+0x30b/0x32d [ 55.815452] [<ffffffff810615ce>] ? put_lock_stats+0xe/0x27 [ 55.815483] [<ffffffff81062a55>] ? lock_release_holdtime+0x104/0x109 [ 55.815518] [<ffffffff813f93e3>] ? 
error_sti+0x5/0x6 [ 55.815553] [<ffffffff813f7dd2>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 55.815585] [<ffffffff813f91ff>] page_fault+0x1f/0x30 [ 55.815613] ---[ end trace fa59f67cbfeeca44 ]--- [ 60.801651] ------------[ cut here ]------------ [ 60.801672] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x66a() [ 60.801681] Hardware name: System Product Name [ 60.801689] Mapping with no anon_vma [ 60.801702] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core ohci_hcd serial_core k10temp pcspkr [ 60.802156] Pid: 2008, comm: iceowl-bin Tainted: G W 2.6.34-rc3-00290-g2156db9-dirty #8 [ 60.802169] Call Trace: [ 60.802181] [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94 [ 60.802191] [<ffffffff81037f17>] warn_slowpath_fmt+0x41/0x43 [ 60.802203] [<ffffffff810b856a>] handle_mm_fault+0x43/0x66a [ 60.802213] [<ffffffff8101f392>] do_page_fault+0x30b/0x32d [ 60.802225] [<ffffffff810615ce>] ? put_lock_stats+0xe/0x27 [ 60.802235] [<ffffffff81062a55>] ? lock_release_holdtime+0x104/0x109 [ 60.802268] [<ffffffff813f93e3>] ? error_sti+0x5/0x6 [ 60.802279] [<ffffffff813f7dd2>] ? 
trace_hardirqs_off_thunk+0x3a/0x3c [ 60.802290] [<ffffffff813f91ff>] page_fault+0x1f/0x30 [ 60.802305] ---[ end trace fa59f67cbfeeca45 ]--- [ 92.123350] ------------[ cut here ]------------ [ 92.123402] WARNING: at kernel/sched.c:3555 add_preempt_count+0x9c/0xcb() [ 92.123428] Hardware name: System Product Name [ 92.123451] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core ohci_hcd serial_core k10temp pcspkr [ 92.123902] Pid: 2111, comm: kvm Tainted: G W 2.6.34-rc3-00290-g2156db9-dirty #8 [ 92.123940] Call Trace: [ 92.123973] [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94 [ 92.124002] [<ffffffff81037ed4>] warn_slowpath_null+0x14/0x16 [ 92.124031] [<ffffffff8102d5d8>] add_preempt_count+0x9c/0xcb [ 92.124061] [<ffffffff813f7ee9>] _raw_spin_lock_nest_lock+0x21/0x7a [ 92.124090] [<ffffffff810bc079>] ? mm_take_all_locks+0xf9/0x150 [ 92.124118] [<ffffffff810bc079>] mm_take_all_locks+0xf9/0x150 [ 92.124146] [<ffffffff810cc48d>] ? do_mmu_notifier_register+0xd3/0x19d [ 92.124174] [<ffffffff810cc495>] do_mmu_notifier_register+0xdb/0x19d [ 92.124202] [<ffffffff810cc57c>] mmu_notifier_register+0x13/0x15 [ 92.124256] [<ffffffffa00c67e3>] kvm_dev_ioctl+0x2c8/0x495 [kvm] [ 92.124318] [<ffffffff810e24ff>] vfs_ioctl+0x32/0xa6 [ 92.124357] [<ffffffff810e2a91>] do_vfs_ioctl+0x495/0x4db [ 92.124390] [<ffffffff813f93e3>] ? error_sti+0x5/0x6 [ 92.124425] [<ffffffff813f8fad>] ? 
retint_swapgs+0xe/0x13 [ 92.124458] [<ffffffff810e2b1e>] sys_ioctl+0x47/0x6a [ 92.124498] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b [ 92.124527] ---[ end trace fa59f67cbfeeca46 ]--- [ 92.213834] ------------[ cut here ]------------ [ 92.213888] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x66a() [ 92.213913] Hardware name: System Product Name [ 92.213937] Mapping with no anon_vma [ 92.213959] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core ohci_hcd serial_core k10temp pcspkr [ 92.214529] Pid: 2111, comm: kvm Tainted: G W 2.6.34-rc3-00290-g2156db9-dirty #8 [ 92.214571] Call Trace: [ 92.214612] [<ffffffff81037ea8>] warn_slowpath_common+0x7c/0x94 [ 92.214647] [<ffffffff81037f17>] warn_slowpath_fmt+0x41/0x43 [ 92.214683] [<ffffffff810b856a>] handle_mm_fault+0x43/0x66a [ 92.214718] [<ffffffff8101f392>] do_page_fault+0x30b/0x32d [ 92.214751] [<ffffffff810be3ab>] ? do_mmap_pgoff+0x290/0x2f3 [ 92.214787] [<ffffffff813f93e3>] ? error_sti+0x5/0x6 [ 92.214821] [<ffffffff81062b97>] ? trace_hardirqs_off_caller+0x1f/0xa9 [ 92.214857] [<ffffffff813f7dd2>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 92.214896] [<ffffffff813f91ff>] page_fault+0x1f/0x30 [ 92.214928] ---[ end trace fa59f67cbfeeca47 ]--- -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-08 21:00 ` Borislav Petkov @ 2010-04-08 23:16 ` Linus Torvalds 2010-04-08 23:47 ` Borislav Petkov 0 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-08 23:16 UTC (permalink / raw) To: Borislav Petkov Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes On Thu, 8 Apr 2010, Borislav Petkov wrote: > > And this happens quite often - I changed the WARN_ONCE to WARN and can't > start kvm, iceowl (mozilla calendar) and the console-kit-daemon craps up > upon boot too: Hmm. I tried console-kit-daemon, which I had installed, but didn't get anything like that. Probably some setup difference. I also went through every user of 'vm_area_cachep', and saw nothing suspicious at least for the mmu case (I didn't check the nommu.c code). I must have missed something. One thing you could do is to add some more debugging info when that "no anon_vma" warning happens. In particular, if you still have the SLUB debugging on, you could try to do that page = virt_to_head_page(vma); object_err(vm_area_cachep, page, (void *)vma, "NULL anon_vma"); and it should give you _which_ routine did the kmem_cache_alloc() for the vma that doesn't have an anon_vma. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-08 23:16 ` Linus Torvalds @ 2010-04-08 23:47 ` Borislav Petkov 2010-04-09 0:50 ` Linus Torvalds 0 siblings, 1 reply; 231+ messages in thread From: Borislav Petkov @ 2010-04-08 23:47 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes From: Linus Torvalds <torvalds@linux-foundation.org> Date: Thu, Apr 08, 2010 at 04:16:23PM -0700 > > And this happens quite often - I changed the WARN_ONCE to WARN and can't > > start kvm, iceowl (mozilla calendar) and the console-kit-daemon craps up > > upon boot too: > > Hmm. I tried console-kit-daemon, which I had installed, but didn't get > anything like that. Probably some setup difference. > > I also went through every user of 'vm_area_cachep', and saw nothing > suspicious at least for the mmu case (I didn't check the nommu.c code). I > must have missed something. > > One thing you could do is to add some more debugging info when that "no > anon_vma" warning happens. In particular, if you still have the SLUB > debugging on, you could try to do that > > page = virt_to_head_page(vma); > object_err(vm_area_cachep, page, (void *)vma, "NULL anon_vma"); > > and it should give you _which_ routine did the kmem_cache_alloc() for the > vma that doesn't have an anon_vma. Yep, looks good: it's mmap_region()...
[ 88.237326] ------------[ cut here ]------------ [ 88.237377] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x6ab() [ 88.237403] Hardware name: System Product Name [ 88.237428] Mapping with no anon_vma [ 88.237451] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd edac_core serial_core pcspkr k10temp [ 88.237938] Pid: 1978, comm: console-kit-dae Not tainted 2.6.34-rc3-00290-g2156db9-dirty #9 [ 88.237980] Call Trace: [ 88.239269] [<ffffffff81037ec0>] warn_slowpath_common+0x7c/0x94 [ 88.239320] [<ffffffff81037f2f>] warn_slowpath_fmt+0x41/0x43 [ 88.239378] [<ffffffff810b8582>] handle_mm_fault+0x43/0x6ab [ 88.239440] [<ffffffff8101f3b2>] do_page_fault+0x30b/0x32d [ 88.239471] [<ffffffff810615e6>] ? put_lock_stats+0xe/0x27 [ 88.239517] [<ffffffff81062a6d>] ? lock_release_holdtime+0x104/0x109 [ 88.239548] [<ffffffff813f9463>] ? error_sti+0x5/0x6 [ 88.239597] [<ffffffff813f7e52>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 88.239626] [<ffffffff813f927f>] page_fault+0x1f/0x30 [ 88.239674] ---[ end trace 42d53170a0d3ccef ]--- [ 88.239699] ============================================================================= [ 88.239750] BUG vm_area_struct: NULL anon_vma [ 88.239790] ----------------------------------------------------------------------------- [ 88.239794] [ 88.239805] INFO: Allocated in mmap_region+0x23d/0x500 age=2 cpu=0 pid=1978 [ 88.239815] INFO: Slab 0xffffea0007a0f0e8 objects=17 used=1 fp=0xffff88022dfbb0f0 flags=0x80000000000000c2 [ 88.239823] INFO: Object 0xffff88022dfbb000 @offset=0 fp=0xffff88022dfbb0f0 [ 88.239827] [ 88.239832] Object 0xffff88022dfbb000: 00 32 53 2b 02 88 ff ff 00 20 ab 29 d1 7f 00 00 .2S+..ÿÿ..«)Ñ... 
[ 88.239861] Object 0xffff88022dfbb010: 00 30 ac 29 d1 7f 00 00 e0 81 2b 2c 02 88 ff ff .0¬)Ñ...à.+,..ÿÿ [ 88.239886] Object 0xffff88022dfbb020: 25 00 00 00 00 00 00 80 73 00 10 00 00 00 00 00 %.......s....... [ 88.239910] Object 0xffff88022dfbb030: 10 82 2b 2c 02 88 ff ff 00 00 00 00 00 00 00 00 ..+,..ÿÿ........ [ 88.239966] Object 0xffff88022dfbb040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 88.240016] Object 0xffff88022dfbb050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 88.240077] Object 0xffff88022dfbb060: 00 00 00 00 00 00 00 00 10 a0 1c 2c 02 88 ff ff ...........,..ÿÿ [ 88.240160] Object 0xffff88022dfbb070: 10 a0 1c 2c 02 88 ff ff 00 00 00 00 00 00 00 00 ...,..ÿÿ........ [ 88.240225] Object 0xffff88022dfbb080: 00 00 00 00 00 00 00 00 b2 9a 12 fd 07 00 00 00 ........²..ý.... [ 88.240294] Object 0xffff88022dfbb090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 88.240352] Object 0xffff88022dfbb0a0: 00 00 00 00 00 00 00 00 ........ [ 88.240442] Redzone 0xffff88022dfbb0a8: cc cc cc cc cc cc cc cc ÌÌÌÌÌÌÌÌ [ 88.240509] Padding 0xffff88022dfbb0e8: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ [ 88.240567] Pid: 1978, comm: console-kit-dae Tainted: G W 2.6.34-rc3-00290-g2156db9-dirty #9 [ 88.240578] Call Trace: [ 88.240593] [<ffffffff810cd802>] print_trailer+0x139/0x142 [ 88.240607] [<ffffffff810cd845>] object_err+0x3a/0x42 [ 88.240617] [<ffffffff810b85e2>] handle_mm_fault+0xa3/0x6ab [ 88.240641] [<ffffffff8101f3b2>] do_page_fault+0x30b/0x32d [ 88.240652] [<ffffffff810615e6>] ? put_lock_stats+0xe/0x27 [ 88.240663] [<ffffffff81062a6d>] ? lock_release_holdtime+0x104/0x109 [ 88.240685] [<ffffffff813f9463>] ? error_sti+0x5/0x6 [ 88.240695] [<ffffffff813f7e52>] ? 
trace_hardirqs_off_thunk+0x3a/0x3c [ 88.240707] [<ffffffff813f927f>] page_fault+0x1f/0x30 [ 93.841666] ------------[ cut here ]------------ [ 93.841716] WARNING: at mm/memory.c:3110 handle_mm_fault+0x43/0x6ab() [ 93.841741] Hardware name: System Product Name [ 93.841766] Mapping with no anon_vma [ 93.841793] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd edac_core serial_core pcspkr k10temp [ 93.842339] Pid: 2050, comm: iceowl-bin Tainted: G W 2.6.34-rc3-00290-g2156db9-dirty #9 [ 93.842383] Call Trace: [ 93.842424] [<ffffffff81037ec0>] warn_slowpath_common+0x7c/0x94 [ 93.842457] [<ffffffff81037f2f>] warn_slowpath_fmt+0x41/0x43 [ 93.842492] [<ffffffff810b8582>] handle_mm_fault+0x43/0x6ab [ 93.842527] [<ffffffff8101f3b2>] do_page_fault+0x30b/0x32d [ 93.842561] [<ffffffff810615e6>] ? put_lock_stats+0xe/0x27 [ 93.842593] [<ffffffff81062a6d>] ? lock_release_holdtime+0x104/0x109 [ 93.842627] [<ffffffff813f9463>] ? error_sti+0x5/0x6 [ 93.842660] [<ffffffff813f7e52>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 93.842694] [<ffffffff813f927f>] page_fault+0x1f/0x30 [ 93.842724] ---[ end trace 42d53170a0d3ccf0 ]--- [ 93.842750] ============================================================================= [ 93.842794] BUG vm_area_struct: NULL anon_vma [ 93.842822] ----------------------------------------------------------------------------- [ 93.842827] [ 93.842889] INFO: Allocated in mmap_region+0x23d/0x500 age=1 cpu=2 pid=2050 [ 93.842918] INFO: Slab 0xffffea00079b84b8 objects=17 used=7 fp=0xffff88022c6f1690 flags=0x80000000000000c2 [ 93.842961] INFO: Object 0xffff88022c6f15a0 @offset=1440 fp=0xffff88022c6f1690 [ 93.842965] [ 93.843005] Bytes b4 0xffff88022c6f1590: 48 d9 fc ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a HÙüÿ....ZZZZZZZZ [ 93.843466] Object 0xffff88022c6f15a0: 00 78 b4 2e 02 88 ff ff 00 80 ce 49 5f 7f 00 00 .x´...ÿÿ..ÎI_... 
[ 93.843877] Object 0xffff88022c6f15b0: 00 90 4e 4a 5f 7f 00 00 c0 13 6f 2c 02 88 ff ff ..NJ_...À.o,..ÿÿ [ 93.844391] Object 0xffff88022c6f15c0: 25 00 00 00 00 00 00 80 73 00 10 00 00 00 00 00 %.......s....... [ 93.844794] Object 0xffff88022c6f15d0: e0 94 4a 2c 02 88 ff ff 00 00 00 00 00 00 00 00 à.J,..ÿÿ........ [ 93.845198] Object 0xffff88022c6f15e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 93.845665] Object 0xffff88022c6f15f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 93.846076] Object 0xffff88022c6f1600: 00 00 00 00 00 00 00 00 30 2d ec 2a 02 88 ff ff ........0-ì*..ÿÿ [ 93.846518] Object 0xffff88022c6f1610: 30 2d ec 2a 02 88 ff ff 00 00 00 00 00 00 00 00 0-ì*..ÿÿ........ [ 93.846931] Object 0xffff88022c6f1620: 00 00 00 00 00 00 00 00 e8 9c f4 f5 07 00 00 00 ........è.ôõ.... [ 93.847372] Object 0xffff88022c6f1630: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 93.847787] Object 0xffff88022c6f1640: 00 00 00 00 00 00 00 00 ........ [ 93.848194] Redzone 0xffff88022c6f1648: cc cc cc cc cc cc cc cc ÌÌÌÌÌÌÌÌ [ 93.848635] Padding 0xffff88022c6f1688: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ [ 93.849036] Pid: 2050, comm: iceowl-bin Tainted: G W 2.6.34-rc3-00290-g2156db9-dirty #9 [ 93.849078] Call Trace: [ 93.849111] [<ffffffff810cd802>] print_trailer+0x139/0x142 [ 93.849142] [<ffffffff810cd845>] object_err+0x3a/0x42 [ 93.849174] [<ffffffff810b85e2>] handle_mm_fault+0xa3/0x6ab [ 93.849204] [<ffffffff8101f3b2>] do_page_fault+0x30b/0x32d [ 93.849237] [<ffffffff810615e6>] ? put_lock_stats+0xe/0x27 [ 93.849301] [<ffffffff81062a6d>] ? lock_release_holdtime+0x104/0x109 [ 93.849337] [<ffffffff813f9463>] ? error_sti+0x5/0x6 [ 93.849370] [<ffffffff813f7e52>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 93.849418] [<ffffffff813f927f>] page_fault+0x1f/0x30 -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-08 23:47 ` Borislav Petkov @ 2010-04-09 0:50 ` Linus Torvalds 2010-04-09 1:30 ` Borislav Petkov 2010-04-09 1:45 ` KOSAKI Motohiro 0 siblings, 2 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-09 0:50 UTC (permalink / raw) To: Borislav Petkov Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes On Fri, 9 Apr 2010, Borislav Petkov wrote: > > Yep, looks good: its mmap_region()... Can you double-check your current diffs - maybe something got corrupted. mmap_region installs the vma with vma_link(), and the last thing vma_link() does with my patch is that "anon_vma_prepare()". Maybe with all the patches flying around, you had a reject or something, and you lost that one anon_vma_prepare()? Or maybe I screwed up somewhere and sent you the wrong patch. Here it is again, just in case. [ I have a horrible cold, and can hardly think straight. So who knows, maybe I'm missing something. But if you have lost one of the 'anon_vma_prepare()' call sites, that would certainly explain why you get NULL anon_vma's ] Linus --- mm/memory.c | 12 +++--------- mm/mmap.c | 17 ++++------------- 2 files changed, 7 insertions(+), 22 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 833952d..08d4423 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2223,9 +2223,6 @@ reuse: gotten: pte_unmap_unlock(page_table, ptl); - if (unlikely(anon_vma_prepare(vma))) - goto oom; - if (is_zero_pfn(pte_pfn(orig_pte))) { new_page = alloc_zeroed_user_highpage_movable(vma, address); if (!new_page) @@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, /* Allocate our own private page. 
*/ pte_unmap(page_table); - if (unlikely(anon_vma_prepare(vma))) - goto oom; page = alloc_zeroed_user_highpage_movable(vma, address); if (!page) goto oom; @@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (flags & FAULT_FLAG_WRITE) { if (!(vma->vm_flags & VM_SHARED)) { anon = 1; - if (unlikely(anon_vma_prepare(vma))) { - ret = VM_FAULT_OOM; - goto out; - } page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); if (!page) { @@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, pmd_t *pmd; pte_t *pte; + if (WARN_ONCE(!vma->anon_vma, "Mapping with no anon_vma")) + return VM_FAULT_SIGBUS; + __set_current_state(TASK_RUNNING); count_vm_event(PGFAULT); diff --git a/mm/mmap.c b/mm/mmap.c index 75557c6..82392c2 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, mm->map_count++; validate_mm(mm); + + anon_vma_prepare(vma); } /* @@ -628,6 +630,8 @@ again: remove_next = 1 + (end > next->vm_end); if (mapping) spin_unlock(&mapping->i_mmap_lock); + anon_vma_prepare(vma); + if (remove_next) { if (file) { fput(file); @@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) if (!(vma->vm_flags & VM_GROWSUP)) return -EFAULT; - /* - * We must make sure the anon_vma is allocated - * so that the anon_vma locking is not a noop. - */ - if (unlikely(anon_vma_prepare(vma))) - return -ENOMEM; anon_vma_lock(vma); /* @@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma, { int error; - /* - * We must make sure the anon_vma is allocated - * so that the anon_vma locking is not a noop. - */ - if (unlikely(anon_vma_prepare(vma))) - return -ENOMEM; - address &= PAGE_MASK; error = security_file_mmap(NULL, 0, 0, 0, address, 1); if (error) ^ permalink raw reply related [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-09 0:50 ` Linus Torvalds @ 2010-04-09 1:30 ` Borislav Petkov 2010-04-09 9:21 ` Borislav Petkov 2010-04-09 1:45 ` KOSAKI Motohiro 1 sibling, 1 reply; 231+ messages in thread From: Borislav Petkov @ 2010-04-09 1:30 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes From: Linus Torvalds <torvalds@linux-foundation.org> Date: Thu, Apr 08, 2010 at 05:50:21PM -0700 > > Yep, looks good: its mmap_region()... > > Can you double-check your current diffs - maybe something got corrupted. > > mmap_region installs the vma with vma_link(), and the last thing > vma_link() does with my patch is that "anon_vma_prepare()". Right, it looks like it. I'll add some more debugging calls there tomorrow - it might give us more clues in case someone hasn't caught it until then. > Maybe with all the patches flying around, you had a reject or something, > and you lost that one anon_vma_prepare()? > > Or maybe I screwed up somewhere and sent you the wrong patch. Here it is > again, just in case. 
Doesn't look like it - here's the diff between yours and what I have applied here (yep, only minor fuzz but no code differences) Also, I've added my version at the end: --- a.diff 2010-04-09 03:03:35.000000000 +0200 +++ b.diff 2010-04-09 03:03:52.000000000 +0200 @@ -1,8 +1,8 @@ diff --git a/mm/memory.c b/mm/memory.c -index 1d2ea39..bd7ea7f 100644 +index 833952d..08d4423 100644 --- a/mm/memory.c +++ b/mm/memory.c -@@ -2224,9 +2224,6 @@ reuse: +@@ -2223,9 +2223,6 @@ reuse: gotten: pte_unmap_unlock(page_table, ptl); @@ -12,7 +12,7 @@ index 1d2ea39..bd7ea7f 100644 if (is_zero_pfn(pte_pfn(orig_pte))) { new_page = alloc_zeroed_user_highpage_movable(vma, address); if (!new_page) -@@ -2767,8 +2764,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, +@@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, /* Allocate our own private page. */ pte_unmap(page_table); @@ -21,7 +21,7 @@ index 1d2ea39..bd7ea7f 100644 page = alloc_zeroed_user_highpage_movable(vma, address); if (!page) goto oom; -@@ -2864,10 +2859,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, +@@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (flags & FAULT_FLAG_WRITE) { if (!(vma->vm_flags & VM_SHARED)) { anon = 1; @@ -32,7 +32,7 @@ index 1d2ea39..bd7ea7f 100644 page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); if (!page) { -@@ -3116,6 +3107,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, +@@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, pmd_t *pmd; pte_t *pte; @@ -43,7 +43,7 @@ index 1d2ea39..bd7ea7f 100644 count_vm_event(PGFAULT); diff --git a/mm/mmap.c b/mm/mmap.c -index bf0600c..4592a93 100644 +index 75557c6..82392c2 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, > [ I have a horrible cold, and can 
hardly think straight. So who knows, > maybe I'm missing something. But if you have lost one of the > 'anon_vma_prepare()' call sites, that would certainly explain why you > get NULL anon_vma's ] Oh, sorry to hear that. Ok, let's stop for today - it is 3am here and even if some would say, "well, this is just getting interesting" :), I think it would be best to "sleep on it." :) Thanks. -- commit 2156db98fd84d07e3b86564f429fcc8c6b7d61df Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Thu Apr 8 22:09:53 2010 +0200 rmap: preallocate anon VMAs On Thu, 8 Apr 2010, Borislav Petkov wrote: > > There are still issues: vma_adjust() grabs mapping->i_mmap_lock for file > mappings while we might sleep in anon_vma_prepare(): Ahh. Good catch. So I can't actually do that anon_vma_prepare() thing in __insert_vm_struct. It should be simple enough to just move it into the caller, just after it releases that lock. There's only one user of that __insert_vm_struct() anyway. You can do it yourself, or you can replace my previous patch with this.. [ The patch below also makes it warn once and return SIGBUS for the case where there is no anon_vma. I decided I still want to hear about it if there might be some path that tries to insert a vma on its own ] Linus diff --git a/mm/memory.c b/mm/memory.c index 1d2ea39..bd7ea7f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2224,9 +2224,6 @@ reuse: gotten: pte_unmap_unlock(page_table, ptl); - if (unlikely(anon_vma_prepare(vma))) - goto oom; - if (is_zero_pfn(pte_pfn(orig_pte))) { new_page = alloc_zeroed_user_highpage_movable(vma, address); if (!new_page) @@ -2767,8 +2764,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, /* Allocate our own private page. 
*/ pte_unmap(page_table); - if (unlikely(anon_vma_prepare(vma))) - goto oom; page = alloc_zeroed_user_highpage_movable(vma, address); if (!page) goto oom; @@ -2864,10 +2859,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (flags & FAULT_FLAG_WRITE) { if (!(vma->vm_flags & VM_SHARED)) { anon = 1; - if (unlikely(anon_vma_prepare(vma))) { - ret = VM_FAULT_OOM; - goto out; - } page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); if (!page) { @@ -3116,6 +3107,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, pmd_t *pmd; pte_t *pte; + if (WARN_ONCE(!vma->anon_vma, "Mapping with no anon_vma")) + return VM_FAULT_SIGBUS; + __set_current_state(TASK_RUNNING); count_vm_event(PGFAULT); diff --git a/mm/mmap.c b/mm/mmap.c index bf0600c..4592a93 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, mm->map_count++; validate_mm(mm); + + anon_vma_prepare(vma); } /* @@ -628,6 +630,8 @@ again: remove_next = 1 + (end > next->vm_end); if (mapping) spin_unlock(&mapping->i_mmap_lock); + anon_vma_prepare(vma); + if (remove_next) { if (file) { fput(file); @@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) if (!(vma->vm_flags & VM_GROWSUP)) return -EFAULT; - /* - * We must make sure the anon_vma is allocated - * so that the anon_vma locking is not a noop. - */ - if (unlikely(anon_vma_prepare(vma))) - return -ENOMEM; anon_vma_lock(vma); /* @@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma, { int error; - /* - * We must make sure the anon_vma is allocated - * so that the anon_vma locking is not a noop. - */ - if (unlikely(anon_vma_prepare(vma))) - return -ENOMEM; - address &= PAGE_MASK; error = security_file_mmap(NULL, 0, 0, 0, address, 1); if (error) -- Regards/Gruss, Boris. ^ permalink raw reply related [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 1:30 ` Borislav Petkov
@ 2010-04-09 9:21 ` Borislav Petkov
  2010-04-09 16:35 ` Linus Torvalds
  0 siblings, 1 reply; 231+ messages in thread
From: Borislav Petkov @ 2010-04-09 9:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
      Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
      Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

From: Borislav Petkov <bp@alien8.de>
Date: Fri, Apr 09, 2010 at 03:30:12AM +0200

> > Maybe with all the patches flying around, you had a reject or something,
> > and you lost that one anon_vma_prepare()?
> >
> > Or maybe I screwed up somewhere and sent you the wrong patch. Here it is
> > again, just in case.
>
> Doesn't look like it - here's the diff between yours and what I have
> applied here (yep, only minor fuzz but no code differences) Also, I've
> added my version at the end:

So I went and reapplied the three patches (3rd is the object_err export
for SLUB debugging) on a new branch of today's git - same results, the
same processes crap up in the WARN(!vma->anon_vma) check so it should be
something else we're missing.

More code staring later...

-- 
Regards/Gruss,
Boris.

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 9:21 ` Borislav Petkov
@ 2010-04-09 16:35 ` Linus Torvalds
  2010-04-09 17:40 ` Borislav Petkov
  0 siblings, 1 reply; 231+ messages in thread
From: Linus Torvalds @ 2010-04-09 16:35 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
      Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
      Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On Fri, 9 Apr 2010, Borislav Petkov wrote:
>
> So I went and reapplied the three patches (3rd is the object_err export
> for SLUB debugging) on a new branch of today's git - same results, the
> same processes crap up in the WARN(!vma->anon_vma) check so it should be
> something else we're missing.
>
> More code staring later...

Can you try with _just_ my patch?

Or add a

	vma->anon_vma = merge_vma->anon_vma;

to Rik's "merge_vma" case in anon_vma_prepare().

Because I'm staring at Rik's patch, and one thing strikes me: it does that
"anon_vma_clone()" in anon_vma_prepare(), and maybe I'm blind, but I don't
see where that actually sets vma->anon_vma.

As far as I can tell, anon_vma_clone() was designed purely for the fork()
case, which has done

	*new = *vma;

which will set new->anon_vma to the same anon_vma. But Rik's patch never
does that for the anon_vma_prepare() case.

And maybe we should do it in anon_vma_clone() itself, just to make it
impossible to mistakenly leave it out, the way I think Rik's patch did.

Anyway, I'm still groggy from all the flu medication, so take everything
I say with a grain of salt.

In fact, the more I look at this, the less I think I like Rik's patch in
the first place. I think the real bug that Rik tried to fix is that
apparently anon_vma_merge() doesn't necessarily merge everything right.

From Rik's bug-explanation, step 5:

>> 5) vma_adjust calls anon_vma_merge, causing the anon_vma
>>    chain of one of the VMAs to get nuked - with bad luck,
>>    this is the original one, leaving just the new anon_vma
>>    attached to the VMA

and I think that _this_ is the real bug to begin with. The real fix should
be in vma_adjust/anon_vma_merge, not in how we set up the anon_vma in the
first place. I do _not_ think we should require that we always merge
things at mmap() time, because we may _never_ be able to merge perfectly
(ie start out with two disjoint mmaps, and fill in the middle).

		Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
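[ Editorial illustration, not part of the thread. The point above, that
  cloning the chain does not by itself set vma->anon_vma, can be sketched
  with a small userspace model. Everything here is an assumption made for
  illustration: the structs are heavily simplified stand-ins for the
  kernel's, and model_anon_vma_clone(), fork_style_has_anon_vma() and
  prepare_style_has_anon_vma() are hypothetical names, not kernel
  functions. It only models the behaviour as described in the message. ]

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct anon_vma { int id; };

struct anon_vma_chain {
	struct anon_vma *anon_vma;
	struct anon_vma_chain *next;
};

struct vma {
	struct anon_vma *anon_vma;	/* the vma's own anon_vma pointer */
	struct anon_vma_chain *chain;	/* every anon_vma it is linked to */
};

/* Free the heap-allocated chain links owned by v. */
void model_free_chain(struct vma *v)
{
	while (v->chain) {
		struct anon_vma_chain *next = v->chain->next;
		free(v->chain);
		v->chain = next;
	}
}

/*
 * Model of the clone step as described in the mail: link dst into every
 * anon_vma that src is linked to. Note that it never touches
 * dst->anon_vma; the caller is expected to have set it already.
 */
int model_anon_vma_clone(struct vma *dst, const struct vma *src)
{
	const struct anon_vma_chain *avc;

	for (avc = src->chain; avc; avc = avc->next) {
		struct anon_vma_chain *link = malloc(sizeof(*link));
		if (!link) {
			model_free_chain(dst);
			return -1;
		}
		link->anon_vma = avc->anon_vma;
		link->next = dst->chain;
		dst->chain = link;
	}
	return 0;
}

/* Caller copied the whole vma first ("*new = *vma"), so the pointer is set. */
int fork_style_has_anon_vma(void)
{
	struct anon_vma av = { 1 };
	struct anon_vma_chain link = { &av, NULL };
	struct vma parent = { &av, &link };
	struct vma child = parent;	/* copies parent.anon_vma too */
	int ok;

	child.chain = NULL;		/* child builds its own chain */
	ok = model_anon_vma_clone(&child, &parent) == 0 &&
	     child.anon_vma == &av;
	model_free_chain(&child);
	return ok;
}

/* Fresh vma, no struct copy: pointer stays NULL unless assigned explicitly. */
int prepare_style_has_anon_vma(int assign_explicitly)
{
	struct anon_vma av = { 1 };
	struct anon_vma_chain link = { &av, NULL };
	struct vma neighbour = { &av, &link };
	struct vma fresh = { NULL, NULL };
	int ok;

	if (model_anon_vma_clone(&fresh, &neighbour) != 0)
		return 0;
	if (assign_explicitly)
		fresh.anon_vma = neighbour.anon_vma;	/* the suggested one-liner */
	ok = fresh.anon_vma != NULL;
	model_free_chain(&fresh);
	return ok;
}
```

[ In this model the fresh vma ends up linked into the neighbour's chain
  but still has a NULL anon_vma pointer unless the explicit assignment is
  made, which is the kind of state the WARN_ONCE check in the test patch
  would trip over. ]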
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 16:35 ` Linus Torvalds
@ 2010-04-09 17:40 ` Borislav Petkov
  2010-04-09 17:50 ` Linus Torvalds
  0 siblings, 1 reply; 231+ messages in thread
From: Borislav Petkov @ 2010-04-09 17:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
      Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
      Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, Apr 09, 2010 at 09:35:15AM -0700

> Can you try with _just_ my patch?

Yep, yours along with the SLUB debugging piece just survived one
hibernation cycle without a problem. Also, no SIGBUS-killed processes,
all seems fine. Will continue stressing it though...

Let me know what you want me to do next.

-- 
Regards/Gruss,
Boris.

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 17:40 ` Borislav Petkov
@ 2010-04-09 17:50 ` Linus Torvalds
  2010-04-09 19:14 ` Borislav Petkov
  0 siblings, 1 reply; 231+ messages in thread
From: Linus Torvalds @ 2010-04-09 17:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
      Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
      Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

On Fri, 9 Apr 2010, Borislav Petkov wrote:
>
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Fri, Apr 09, 2010 at 09:35:15AM -0700
>
> > Can you try with _just_ my patch?
>
> Yep, yours along with the SLUB debugging piece just survived one
> hibernation cycle without a problem. Also, no SIGBUS-killed processes,
> all seems fine. Will continue stressing it though...
>
> Let me know what you want me to do next.

Continue stress-testing it. I don't think my patch on its own should fix
the original problem, but at least we now know why you got those NULL
anon_vma's.

So what I _think_ will happen is that you'll be able to re-create the
problem that started this all. But I'd like to verify that, just because
I'm anal and I'd like these things to be tested independently.

So assuming that the original problem happens again, if you can then apply
Rik's patch, but add a

	dst->anon_vma = src->anon_vma;

to just before the success case (the "return 0") in anon_vma_clone(),
that would be good.

		Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-09 17:50 ` Linus Torvalds
@ 2010-04-09 19:14 ` Borislav Petkov
  2010-04-09 19:32 ` Linus Torvalds
  0 siblings, 1 reply; 231+ messages in thread
From: Borislav Petkov @ 2010-04-09 19:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim,
      Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
      Andrea Arcangeli, Hugh Dickins, sgunderson, hannes

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, Apr 09, 2010 at 10:50:23AM -0700

> Continue stress-testing it. I don't think my patch on its own should fix
> the original problem, but at least we now know why you got those NULL
> anon_vma's.
>
> So what I _think_ will happen is that you'll be able to re-create the
> problem that started this all. But I'd like to verify that, just because
> I'm anal and I'd like these things to be tested independently.

Heh, that was easy. Third hibernate cycle is a charm^Wboom :)

> So assuming that the original problem happens again, if you can then apply
> Rik's patch, but add a
>
> 	dst->anon_vma = src->anon_vma;
>
> to just before the success case (the "return 0") in anon_vma_clone(),
> that would be good.

It looks like this way we mangle the anon_vma chains somehow. From
what I can see and if I'm not mistaken, we save the anon_vmas alright
but end up in what seems like an endless list_for_each_entry() loop
having grabbed anon_vma->lock in page_lock_anon_vma() and we can't seem
to yield it through page_unlock_anon_vma() at the end of
page_referenced_anon() so it has to be that code in between iterating
over each list entry...

I could be completely wrong though...

[ 373.683545] PM: Syncing filesystems ... done.
[ 373.950289] Freezing user space processes ... (elapsed 0.04 seconds) done.
[ 373.998878] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[ 374.011121] PM: Preallocating image memory...
[ 439.161126] BUG: soft lockup - CPU#1 stuck for 61s! [hib.sh:3617] [ 439.161315] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd [ 439.162302] irq event stamp: 0 [ 439.162302] hardirqs last enabled at (0): [<(null)>] (null) [ 439.162302] hardirqs last disabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc [ 439.163297] softirqs last enabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc [ 439.163297] softirqs last disabled at (0): [<(null)>] (null) [ 439.163297] CPU 1 [ 439.163297] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd [ 439.165297] [ 439.165297] Pid: 3617, comm: hib.sh Tainted: G W 2.6.34-rc3-00413-g1028f7c-dirty #12 M3A78 PRO/System Product Name [ 439.165297] RIP: 0010:[<ffffffff8118b731>] [<ffffffff8118b731>] delay_tsc+0x0/0xca [ 439.165297] RSP: 0018:ffff8801f68b77f0 EFLAGS: 00000202 [ 439.166300] RAX: 0000000000000000 RBX: ffff8801f68b77f8 RCX: 000000000000f100 [ 439.166300] RDX: 0000000000000001 RSI: ffff8801f68b7848 RDI: 0000000000000001 [ 439.166300] RBP: ffffffff81002b4e R08: 0000000000000001 R09: 0000000000000000 [ 439.166300] R10: ffff88022c9a3ac8 R11: ffffffff00000012 R12: 000000000000f100 [ 439.166300] R13: 00000000cc444700 R14: 0000000000000001 R15: 0000000000000000 [ 439.166300] FS: 00007f8d00e676f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000 [ 439.167296] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 439.167296] CR2: 00007fff93e8a9c0 CR3: 00000001f5397000 CR4: 00000000000006e0 [ 439.167296] DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003 [ 439.167296] DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 439.167296] 
Process hib.sh (pid: 3617, threadinfo ffff8801f68b6000, task ffff88022a2e8000) [ 439.167296] Stack: [ 439.168297] ffffffff8118b72f ffff8801f68b7848 ffffffff8119a1ca ffff880214972868 [ 439.168297] <0> 0000000000000001 ffff880100000000 ffff880214972850 ffff880214972868 [ 439.168297] <0> ffff8801f68b7cf8 ffff8801f68b7b78 ffff8801f68b7a00 ffff8801f68b7878 [ 439.169298] Call Trace: [ 439.169298] [<ffffffff8118b72f>] ? __delay+0xf/0x11 [ 439.169298] [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c [ 439.169298] [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73 [ 439.170299] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac [ 439.170299] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac [ 439.170299] [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac [ 439.170299] [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc [ 439.170299] [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4 [ 439.170299] [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477 [ 439.170299] [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58 [ 439.171296] [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5 [ 439.171296] [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244 [ 439.171296] [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6 [ 439.171296] [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f [ 439.171296] [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4 [ 439.171296] [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0 [ 439.171296] [<ffffffff81076e60>] ? count_data_pages+0x65/0x79 [ 439.172298] [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb [ 439.172298] [<ffffffff813f5285>] ? printk+0x41/0x44 [ 439.172298] [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1 [ 439.172298] [<ffffffff81075ce0>] ? hibernate+0xce/0x172 [ 439.172298] [<ffffffff81074a4d>] ? state_store+0x5c/0xd3 [ 439.172298] [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19 [ 439.173296] [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144 [ 439.173296] [<ffffffff810d575f>] ? 
vfs_write+0xb2/0x153 [ 439.173296] [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 439.173296] [<ffffffff810d58c3>] ? sys_write+0x4a/0x71 [ 439.173296] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b [ 439.173296] Code: ff c8 c9 c3 55 48 89 e5 0f 1f 44 00 00 48 c7 05 12 35 4e 00 31 b7 18 81 c9 c3 55 48 89 e5 0f 1f 44 00 00 ff 15 01 35 4e 00 c9 c3 <55> 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 08 0f 1f 44 00 [ 439.176296] Call Trace: [ 439.177297] [<ffffffff8118b72f>] ? __delay+0xf/0x11 [ 439.177297] [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c [ 439.177297] [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73 [ 439.177297] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac [ 439.177297] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac [ 439.177297] [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac [ 439.177297] [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc [ 439.178295] [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4 [ 439.178295] [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477 [ 439.178295] [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58 [ 439.178295] [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5 [ 439.178295] [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244 [ 439.178295] [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6 [ 439.178295] [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f [ 439.179299] [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4 [ 439.179299] [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0 [ 439.179299] [<ffffffff81076e60>] ? count_data_pages+0x65/0x79 [ 439.179299] [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb [ 439.179299] [<ffffffff813f5285>] ? printk+0x41/0x44 [ 439.179299] [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1 [ 439.180296] [<ffffffff81075ce0>] ? hibernate+0xce/0x172 [ 439.180296] [<ffffffff81074a4d>] ? state_store+0x5c/0xd3 [ 439.180296] [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19 [ 439.180296] [<ffffffff81125dd7>] ? 
sysfs_write_file+0x108/0x144 [ 439.180296] [<ffffffff810d575f>] ? vfs_write+0xb2/0x153 [ 439.180296] [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 439.180296] [<ffffffff810d58c3>] ? sys_write+0x4a/0x71 [ 439.181297] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b [ 504.659125] BUG: soft lockup - CPU#1 stuck for 61s! [hib.sh:3617] [ 504.659126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd [ 504.660297] irq event stamp: 0 [ 504.660297] hardirqs last enabled at (0): [<(null)>] (null) [ 504.660297] hardirqs last disabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc [ 504.661298] softirqs last enabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc [ 504.661298] softirqs last disabled at (0): [<(null)>] (null) [ 504.661298] CPU 1 [ 504.661298] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd [ 504.663297] [ 504.663297] Pid: 3617, comm: hib.sh Tainted: G W 2.6.34-rc3-00413-g1028f7c-dirty #12 M3A78 PRO/System Product Name [ 504.663297] RIP: 0010:[<ffffffff8118b775>] [<ffffffff8118b775>] delay_tsc+0x44/0xca [ 504.663297] RSP: 0018:ffff8801f68b77b8 EFLAGS: 00000206 [ 504.663297] RAX: 00000000a4911fed RBX: ffff8801f68b77e8 RCX: 000000000000f100 [ 504.664326] RDX: 00000000000000f1 RSI: ffff8801f68b7848 RDI: 0000000000000001 [ 504.664326] RBP: ffffffff81002b4e R08: 0000000000000001 R09: 0000000000000000 [ 504.664326] R10: ffff88022c9a3ac8 R11: ffffffff00000012 R12: 0000000000000010 [ 504.664326] R13: ffff88000a200000 R14: ffff8801f68b6000 R15: ffff8801f68b7fd8 [ 504.664326] FS: 00007f8d00e676f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000 [ 504.664326] CS: 0010 DS: 0000 
ES: 0000 CR0: 000000008005003b [ 504.665296] CR2: 00007fff93e8a9c0 CR3: 00000001f5397000 CR4: 00000000000006e0 [ 504.665296] DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003 [ 504.665296] DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 504.665296] Process hib.sh (pid: 3617, threadinfo ffff8801f68b6000, task ffff88022a2e8000) [ 504.665296] Stack: [ 504.665296] 0000000000000001 ffff880214972850 ffff88022a2e8000 00000000b3450160 [ 504.666297] <0> ffff88022a2e83a8 000000005486e668 ffff8801f68b77f8 ffffffff8118b72f [ 504.666297] <0> ffff8801f68b7848 ffffffff8119a1ca ffff880214972868 0000000000000001 [ 504.667298] Call Trace: [ 504.667298] [<ffffffff8118b72f>] ? __delay+0xf/0x11 [ 504.667298] [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c [ 504.667298] [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73 [ 504.667298] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac [ 504.668288] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac [ 504.668298] [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac [ 504.668298] [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc [ 504.668298] [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4 [ 504.668298] [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477 [ 504.668298] [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58 [ 504.668298] [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5 [ 504.669296] [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244 [ 504.669296] [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6 [ 504.669296] [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f [ 504.669296] [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4 [ 504.669296] [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0 [ 504.669296] [<ffffffff81076e60>] ? count_data_pages+0x65/0x79 [ 504.669296] [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb [ 504.670302] [<ffffffff813f5285>] ? printk+0x41/0x44 [ 504.670302] [<ffffffff81075a67>] ? 
hibernation_snapshot+0x36/0x1e1 [ 504.670302] [<ffffffff81075ce0>] ? hibernate+0xce/0x172 [ 504.670302] [<ffffffff81074a4d>] ? state_store+0x5c/0xd3 [ 504.670302] [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19 [ 504.670302] [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144 [ 504.670302] [<ffffffff810d575f>] ? vfs_write+0xb2/0x153 [ 504.671297] [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 504.671297] [<ffffffff810d58c3>] ? sys_write+0x4a/0x71 [ 504.673315] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b [ 504.674350] Code: bf 01 00 00 00 e8 f8 1d ea ff e8 9f f4 00 00 41 89 c5 0f ae f0 66 66 90 0f 31 89 c3 65 4c 8b 34 25 48 b5 00 00 0f ae f0 66 66 90 <0f> 31 41 89 c7 4c 89 f8 48 29 d8 4c 39 e0 73 49 bf 01 00 00 00 [ 504.677299] Call Trace: [ 504.677299] [<ffffffff8118b72f>] ? __delay+0xf/0x11 [ 504.677299] [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c [ 504.677299] [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73 [ 504.677299] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac [ 504.678287] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac [ 504.678296] [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac [ 504.678296] [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc [ 504.678296] [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4 [ 504.678296] [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477 [ 504.678296] [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58 [ 504.678296] [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5 [ 504.679297] [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244 [ 504.679297] [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6 [ 504.679297] [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f [ 504.679297] [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4 [ 504.679297] [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0 [ 504.679297] [<ffffffff81076e60>] ? count_data_pages+0x65/0x79 [ 504.679297] [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb [ 504.680303] [<ffffffff813f5285>] ? 
printk+0x41/0x44 [ 504.680303] [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1 [ 504.680303] [<ffffffff81075ce0>] ? hibernate+0xce/0x172 [ 504.680303] [<ffffffff81074a4d>] ? state_store+0x5c/0xd3 [ 504.680303] [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19 [ 504.680303] [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144 [ 504.680303] [<ffffffff810d575f>] ? vfs_write+0xb2/0x153 [ 504.681297] [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 504.681297] [<ffffffff810d58c3>] ? sys_write+0x4a/0x71 [ 504.681297] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b [ 570.157125] BUG: soft lockup - CPU#1 stuck for 61s! [hib.sh:3617] [ 570.157126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd [ 570.158283] irq event stamp: 0 [ 570.158283] hardirqs last enabled at (0): [<(null)>] (null) [ 570.158283] hardirqs last disabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc [ 570.159297] softirqs last enabled at (0): [<ffffffff8103655c>] copy_process+0x3c1/0x10cc [ 570.159297] softirqs last disabled at (0): [<(null)>] (null) [ 570.159297] CPU 1 [ 570.159297] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd [ 570.161297] [ 570.161297] Pid: 3617, comm: hib.sh Tainted: G W 2.6.34-rc3-00413-g1028f7c-dirty #12 M3A78 PRO/System Product Name [ 570.161297] RIP: 0010:[<ffffffff8118b777>] [<ffffffff8118b777>] delay_tsc+0x46/0xca [ 570.161297] RSP: 0018:ffff8801f68b77b8 EFLAGS: 00000206 [ 570.161297] RAX: 000000007cdde43c RBX: ffff8801f68b77e8 RCX: 000000000000f100 [ 570.162296] RDX: 000000000000011f RSI: ffff8801f68b7848 RDI: 0000000000000001 [ 570.162296] RBP: ffffffff81002b4e R08: 0000000000000001 
R09: 0000000000000000 [ 570.162296] R10: ffff88022c9a3ac8 R11: ffffffff00000012 R12: 0000000000000010 [ 570.162296] R13: ffff88000a200000 R14: ffff8801f68b6000 R15: ffff8801f68b7fd8 [ 570.162296] FS: 00007f8d00e676f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000 [ 570.162296] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 570.163296] CR2: 00007fff93e8a9c0 CR3: 00000001f5397000 CR4: 00000000000006e0 [ 570.163296] DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003 [ 570.163296] DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 570.163296] Process hib.sh (pid: 3617, threadinfo ffff8801f68b6000, task ffff88022a2e8000) [ 570.163296] Stack: [ 570.163296] 0000000000000001 ffff880214972850 ffff88022a2e8000 00000000b3450160 [ 570.164335] <0> ffff88022a2e83a8 000000007f0025c7 ffff8801f68b77f8 ffffffff8118b72f [ 570.164335] <0> ffff8801f68b7848 ffffffff8119a1ca ffff880214972868 0000000000000001 [ 570.165299] Call Trace: [ 570.165299] [<ffffffff8118b72f>] ? __delay+0xf/0x11 [ 570.165299] [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c [ 570.165299] [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73 [ 570.165299] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac [ 570.165299] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac [ 570.166297] [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac [ 570.166297] [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc [ 570.166297] [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4 [ 570.166297] [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477 [ 570.166297] [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58 [ 570.166297] [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5 [ 570.167296] [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244 [ 570.167296] [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6 [ 570.167296] [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f [ 570.167296] [<ffffffff810ad13d>] ? shrink_all_memory+0x95/0xc4 [ 570.167296] [<ffffffff810aa640>] ? 
isolate_pages_global+0x0/0x1f0 [ 570.167296] [<ffffffff81076e60>] ? count_data_pages+0x65/0x79 [ 570.167296] [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb [ 570.168286] [<ffffffff813f5285>] ? printk+0x41/0x44 [ 570.168286] [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1 [ 570.168286] [<ffffffff81075ce0>] ? hibernate+0xce/0x172 [ 570.168286] [<ffffffff81074a4d>] ? state_store+0x5c/0xd3 [ 570.168286] [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19 [ 570.168286] [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144 [ 570.168286] [<ffffffff810d575f>] ? vfs_write+0xb2/0x153 [ 570.169297] [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 570.169297] [<ffffffff810d58c3>] ? sys_write+0x4a/0x71 [ 570.169297] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b [ 570.169297] Code: 00 00 00 e8 f8 1d ea ff e8 9f f4 00 00 41 89 c5 0f ae f0 66 66 90 0f 31 89 c3 65 4c 8b 34 25 48 b5 00 00 0f ae f0 66 66 90 0f 31 <41> 89 c7 4c 89 f8 48 29 d8 4c 39 e0 73 49 bf 01 00 00 00 e8 07 [ 570.172299] Call Trace: [ 570.172299] [<ffffffff8118b72f>] ? __delay+0xf/0x11 [ 570.172299] [<ffffffff8119a1ca>] ? do_raw_spin_lock+0xd2/0x13c [ 570.173297] [<ffffffff813f827b>] ? _raw_spin_lock+0x60/0x73 [ 570.173297] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac [ 570.173297] [<ffffffff810c0a37>] ? page_lock_anon_vma+0x63/0xac [ 570.173297] [<ffffffff810c09d4>] ? page_lock_anon_vma+0x0/0xac [ 570.173297] [<ffffffff810c0c1d>] ? page_referenced+0x80/0x1dc [ 570.173297] [<ffffffff810c0b22>] ? try_to_unmap_anon+0xa2/0xb4 [ 570.174329] [<ffffffff810ab7a6>] ? shrink_page_list+0x14a/0x477 [ 570.174329] [<ffffffff813f8d86>] ? _raw_spin_unlock_irq+0x30/0x58 [ 570.174329] [<ffffffff810abe2a>] ? shrink_inactive_list+0x357/0x5e5 [ 570.174329] [<ffffffff810ab64a>] ? shrink_active_list+0x232/0x244 [ 570.174329] [<ffffffff810ac3c4>] ? shrink_zone+0x30c/0x3d6 [ 570.174329] [<ffffffff810acf9f>] ? do_try_to_free_pages+0x176/0x27f [ 570.174329] [<ffffffff810ad13d>] ? 
shrink_all_memory+0x95/0xc4 [ 570.175297] [<ffffffff810aa640>] ? isolate_pages_global+0x0/0x1f0 [ 570.175297] [<ffffffff81076e60>] ? count_data_pages+0x65/0x79 [ 570.175297] [<ffffffff810770c7>] ? hibernate_preallocate_memory+0x1aa/0x2cb [ 570.175297] [<ffffffff813f5285>] ? printk+0x41/0x44 [ 570.175297] [<ffffffff81075a67>] ? hibernation_snapshot+0x36/0x1e1 [ 570.175297] [<ffffffff81075ce0>] ? hibernate+0xce/0x172 [ 570.175297] [<ffffffff81074a4d>] ? state_store+0x5c/0xd3 [ 570.176298] [<ffffffff81184f8f>] ? kobj_attr_store+0x17/0x19 [ 570.176298] [<ffffffff81125dd7>] ? sysfs_write_file+0x108/0x144 [ 570.176298] [<ffffffff810d575f>] ? vfs_write+0xb2/0x153 [ 570.176298] [<ffffffff81063bed>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 570.176298] [<ffffffff810d58c3>] ? sys_write+0x4a/0x71 [ 570.176298] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-09 19:14 ` Borislav Petkov @ 2010-04-09 19:32 ` Linus Torvalds 2010-04-09 20:03 ` Rik van Riel 2010-04-09 20:43 ` Johannes Weiner 0 siblings, 2 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-09 19:32 UTC (permalink / raw) To: Borislav Petkov Cc: KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes On Fri, 9 Apr 2010, Borislav Petkov wrote: > > > > So what I _think_ will happen is that you'll be able to re-create the > > problem that started this all. But I'd like to verify that, just because > > I'm anal and I'd like these things to be tested independently. > > Heh, that was easy. Third hibernate cycle is a charm^Wboom :) Ok, good to know that I'm still tracking ok on the issue. > > So assuming that the original problem happens again, if you can then apply > > Rik's patch, but add a > > > > dst->anon_vma = src->anon_vma; > > > > to just before the success case (the "return 0") in anon_vma_clone(), > > that would be good. > > It looks like this way we mangle the anon_vma chains somehow. From > what I can see and if I'm not mistaken, we save the anon_vmas alright > but end up in what seems like an endless list_for_each_entry() > loop having grabbed anon_vma->lock in page_lock_anon_vma() and we > can't seem to yield it through page_unlock_anon_vma() at the end of > page_referenced_anon() so it has to be that code in between iterating > over each list entry... Ok. So scratch Rik's patch. It doesn't work even with the anon_vma set up. Rik? I think it's back to you. I'm not going to bother committing the change to the anon_vma locking unless you actually need the locking guarantees for anon_vma_prepare(). And I've got the feeling that the proper fix is in the vma_adjust() handling if your original idea was right. Anybody? 
We're at the point where I've already delayed -rc4 several days because it's pointless cutting it without fixing this. One option is to just say "f*ck it, we'll revert it all and try again later". But it feels so close.. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-09 19:32 ` Linus Torvalds @ 2010-04-09 20:03 ` Rik van Riel 2010-04-09 20:43 ` Johannes Weiner 1 sibling, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-09 20:03 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes On 04/09/2010 03:32 PM, Linus Torvalds wrote: > Rik? I think it's back to you. I'm not going to bother committing the > change to the anon_vma locking unless you actually need the locking > guarantees for anon_vma_prepare(). > And I've got the feeling that the proper fix is in the vma_adjust() > handling if your original idea was right. We can fix it on the other side, by changing anon_vma_merge to actually link all the anon_vma structs into the VMA. An added benefit is that we are already holding the required lock (mmap_sem) exclusively in that code path. I'll cook up a patch and I'll mail it out after a little testing. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-09 19:32 ` Linus Torvalds 2010-04-09 20:03 ` Rik van Riel @ 2010-04-09 20:43 ` Johannes Weiner 2010-04-09 20:57 ` Rik van Riel ` (2 more replies) 1 sibling, 3 replies; 231+ messages in thread From: Johannes Weiner @ 2010-04-09 20:43 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Fri, Apr 09, 2010 at 12:32:30PM -0700, Linus Torvalds wrote: > > > On Fri, 9 Apr 2010, Borislav Petkov wrote: > > > > > > So what I _think_ will happen is that you'll be able to re-create the > > > problem that started this all. But I'd like to verify that, just because > > > I'm anal and I'd like these things to be tested independently. > > > > Heh, that was easy. Third hibernate cycle is a charm^Wboom :) > > Ok, good to know that I'm still tracking ok on the issue. > > > > So assuming that the original problem happens again, if you can then apply > > > Rik's patch, but add a > > > > > > dst->anon_vma = src->anon_vma; > > > > > > to just before the success case (the "return 0") in anon_vma_clone(), > > > that would be good. > > > > It looks like this way we mangle the anon_vma chains somehow. From > > what I can see and if I'm not mistaken, we save the anon_vmas alright > > but end up in what seems like an endless list_for_each_entry() > > loop having grabbed anon_vma->lock in page_lock_anon_vma() and we > > can't seem to yield it through page_unlock_anon_vma() at the end of > > page_referenced_anon() so it has to be that code in between iterating > > over each list entry... > > Ok. So scratch Rik's patch. It doesn't work even with the anon_vma set up. > > Rik? I think it's back to you. 
I'm not going to bother committing the > change to the anon_vma locking unless you actually need the locking > guarantees for anon_vma_prepare(). > > And I've got the feeling that the proper fix is in the vma_adjust() > handling if your original idea was right. > > Anybody? Okay, I think I got it working. I first thought we would need an m^n loop to properly merge the anon_vma_chains, but we can actually be cleverer than that: --- Subject: mm: properly merge anon_vma_chains when merging vmas Merging can happen when two VMAs were split from one root VMA or a mergeable VMA was instantiated and reused a nearby VMA's anon_vma. In both cases, none of the VMAs can grow any more anon_vmas and forked VMAs can no longer get merged due to differing primary anon_vmas for their private COW-broken pages. In the split case, the anon_vma_chains are equal and we can just drop the one of the VMA that is going away. In the other case, the VMA that was instantiated later has only one anon_vma on its chain: the primary anon_vma of its merge partner (due to anon_vma_prepare()). If the VMA that came later is going away, its anon_vma_chain is a subset of the one that is staying, so it can be dropped like in the split case. Only if the VMA that came first is going away, its potential parent anon_vmas need to be migrated to the VMA that is staying. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- It compiles and boots but I have not really exercised this code. Boris, could you give it a spin? Thanks! 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h index d25bd22..ecef882 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -114,13 +114,7 @@ int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *); int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *); void __anon_vma_link(struct vm_area_struct *); void anon_vma_free(struct anon_vma *); - -static inline void anon_vma_merge(struct vm_area_struct *vma, - struct vm_area_struct *next) -{ - VM_BUG_ON(vma->anon_vma != next->anon_vma); - unlink_anon_vmas(next); -} +void anon_vma_merge(struct vm_area_struct *, struct vm_area_struct *); /* * rmap interfaces called when adding or removing pte of page diff --git a/mm/rmap.c b/mm/rmap.c index eaa7a09..498a46e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -268,6 +268,58 @@ void unlink_anon_vmas(struct vm_area_struct *vma) } } +void anon_vma_merge(struct vm_area_struct *vma, struct vm_area_struct *next) +{ + VM_BUG_ON(vma->anon_vma != next->anon_vma); + /* + * 1. case: vma and next are split parts of one root vma. + * Their anon_vma_chain is equal and we can drop that of next. + * + * 2. case: one vma was instantiated as mergeable with the + * other one and inherited the other one's primary anon_vma as + * the singleton in its chain. + * + * If next came after vma, vma's chain is already an unstrict + * superset of next's and we can treat it like case 1. + * + * If vma has the singleton chain, we have to copy next's + * unique anon_vmas over. + */ + if (!list_is_singular(&vma->anon_vma_chain)) { + unlink_anon_vmas(next); + return; + } + while (!list_empty(&next->anon_vma_chain)) { + struct anon_vma_chain *avc; + + avc = list_first_entry(&next->anon_vma_chain, + struct anon_vma_chain, same_vma); + if (avc->anon_vma == vma->anon_vma) { + /* + * The shared one that vma inherited in + * anon_vma_prepare. Don't copy it, we + * already have it. 
+ */ + spin_lock(&avc->anon_vma->lock); + list_del(&avc->same_anon_vma); + spin_unlock(&avc->anon_vma->lock); + + list_del(&avc->same_vma); + anon_vma_chain_free(avc); + } else { + /* + * One of the parent anon_vmas, move it over. + * Make sure nobody walks the vma list while + * the entries are in flux. + */ + spin_lock(&avc->anon_vma->lock); + avc->vma = vma; + list_move_tail(&avc->same_vma, &vma->anon_vma_chain); + spin_unlock(&avc->anon_vma->lock); + } + } +} + static void anon_vma_ctor(void *data) { struct anon_vma *anon_vma = data; ^ permalink raw reply related [flat|nested] 231+ messages in thread
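The two cases in the patch comment above can be modelled in userspace C. This is only a toy sketch under stated assumptions, not the kernel code: anon_vmas are plain integer ids, a chain is a small array, and `toy_chain`, `toy_has` and `toy_merge` are hypothetical names made up for illustration. Locking and the `same_anon_vma` lists are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX 8

/* Hypothetical stand-in for a VMA and its anon_vma_chain. */
struct toy_chain {
	int primary;		/* models vma->anon_vma */
	int ids[TOY_MAX];	/* models the anon_vmas on vma->anon_vma_chain */
	size_t len;
};

/* Helper for checks: is anon_vma `id` on this chain? */
static bool toy_has(const struct toy_chain *c, int id)
{
	for (size_t i = 0; i < c->len; i++)
		if (c->ids[i] == id)
			return true;
	return false;
}

/* Merge `next` into `vma`, following the two cases of the patch comment. */
static void toy_merge(struct toy_chain *vma, struct toy_chain *next)
{
	/* Mirrors the VM_BUG_ON: only VMAs with equal primaries are merged. */
	assert(vma->primary == next->primary);

	if (vma->len != 1) {
		/* Split case, or next is the later singleton: drop next's chain. */
		next->len = 0;
		return;
	}

	/* vma holds only the shared primary: move next's other entries over. */
	for (size_t i = 0; i < next->len; i++) {
		if (next->ids[i] == vma->primary)
			continue;	/* vma already has the shared one */
		assert(vma->len < TOY_MAX);
		vma->ids[vma->len++] = next->ids[i];
	}
	next->len = 0;
}
```

In this model the surviving chain ends up as the union of both chains in either direction of the merge, which is the property the patch description is aiming for. It says nothing about whether the real list manipulation is safe against concurrent rmap walkers.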
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-09 20:43 ` Johannes Weiner @ 2010-04-09 20:57 ` Rik van Riel 2010-04-09 21:33 ` Borislav Petkov 2010-04-09 23:22 ` Linus Torvalds 2 siblings, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-09 20:57 UTC (permalink / raw) To: Johannes Weiner Cc: Linus Torvalds, Borislav Petkov, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/09/2010 04:43 PM, Johannes Weiner wrote: > Okay, I think I got it working. I first thought we would need an > m^n loop to properly merge the anon_vma_chains, but we can actually > be cleverer than that: I've looked it over 5 times, can't find anything wrong with it. Your approach looks like it should work just fine. Certainly easier than the things Linus and I tried :) > Signed-off-by: Johannes Weiner<hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-09 20:43 ` Johannes Weiner 2010-04-09 20:57 ` Rik van Riel @ 2010-04-09 21:33 ` Borislav Petkov 2010-04-09 23:22 ` Linus Torvalds 2 siblings, 0 replies; 231+ messages in thread From: Borislav Petkov @ 2010-04-09 21:33 UTC (permalink / raw) To: Johannes Weiner Cc: Linus Torvalds, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Johannes Weiner <hannes@cmpxchg.org> Date: Fri, Apr 09, 2010 at 10:43:28PM +0200 Hi Hannes :) , > --- > Subject: mm: properly merge anon_vma_chains when merging vmas > > Merging can happen when two VMAs were split from one root VMA or > a mergeable VMA was instantiated and reused a nearby VMA's anon_vma. > > In both cases, none of the VMAs can grow any more anon_vmas and forked > VMAs can no longer get merged due to differing primary anon_vmas for > their private COW-broken pages. > > In the split case, the anon_vma_chains are equal and we can just drop > the one of the VMA that is going away. > > In the other case, the VMA that was instantiated later has only one > anon_vma on its chain: the primary anon_vma of its merge partner (due > to anon_vma_prepare()). > > If the VMA that came later is going away, its anon_vma_chain is a > subset of the one that is staying, so it can be dropped like in the > split case. > > Only if the VMA that came first is going away, its potential parent > anon_vmas need to be migrated to the VMA that is staying. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > > It compiles and boots but I have not really exercised this code. > Boris, could you give it a spin? Thanks! ok, I got this on top of mainline (no other patches from this thread) but unfortunately it breaks at the same spot while under heavy page reclaiming when trying to hibernate while booting 3 guests. 
[ 322.171120] PM: Preallocating image memory... [ 322.477374] BUG: unable to handle kernel NULL pointer dereference at (null) [ 322.477376] IP: [<ffffffff810c0c87>] page_referenced+0xee/0x1dc [ 322.477376] PGD 2014e8067 PUD 221b4e067 PMD 0 [ 322.477376] Oops: 0000 [#1] PREEMPT SMP [ 322.477376] last sysfs file: /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq [ 322.477376] CPU 3 [ 322.477376] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 pcspkr serial_core k10temp ohci_hcd edac_core [ 322.477376] [ 322.477376] Pid: 2750, comm: hib.sh Tainted: G W 2.6.34-rc3-00411-ga7247b6 #13 M3A78 PRO/System Product Name [ 322.477376] RIP: 0010:[<ffffffff810c0c87>] [<ffffffff810c0c87>] page_referenced+0xee/0x1dc [ 322.477376] RSP: 0018:ffff88020936d8b8 EFLAGS: 00010283 [ 322.477376] RAX: ffff88022de91af0 RBX: ffffea0006dcb488 RCX: 0000000000000000 [ 322.477376] RDX: ffff88020936dcf8 RSI: ffff88022de91ac8 RDI: ffff88022ced0000 [ 322.477376] RBP: ffff88020936d938 R08: 0000000000000002 R09: 0000000000000000 [ 322.477376] R10: 0000000000000246 R11: 0000000000000003 R12: 0000000000000000 [ 322.477376] R13: ffffffffffffffe0 R14: ffff88022de91ab0 R15: ffff88020936da00 [ 322.477376] FS: 00007f286493e6f0(0000) GS:ffff88000a600000(0000) knlGS:0000000000000000 [ 322.477376] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 322.477376] CR2: 0000000000000000 CR3: 00000001f8354000 CR4: 00000000000006e0 [ 322.477376] DR0: 0000000000000090 DR1: 00000000000000a4 DR2: 00000000000000ff [ 322.477376] DR3: 000000000000000f DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 322.477376] Process hib.sh (pid: 2750, threadinfo ffff88020936c000, task ffff88022ced0000) [ 322.477376] Stack: [ 322.477376] ffff88022de91af0 00000000813f8eec ffffffff8165ce28 000000000000002e [ 322.477376] <0> ffff88020936d8f8 ffffffff810c60bc ffffea0006dcb450 ffffea0006dcb450 [ 322.477376] <0> 
ffff88020936d938 00000002810ab29d 0000000006f316b0 ffffea0006dcb4b0 [ 322.477376] Call Trace: [ 322.477376] [<ffffffff810c60bc>] ? swapcache_free+0x37/0x3c [ 322.477376] [<ffffffff810ab7c2>] shrink_page_list+0x14a/0x477 [ 322.477376] [<ffffffff810abe46>] shrink_inactive_list+0x357/0x5e5 [ 322.477376] [<ffffffff810ab666>] ? shrink_active_list+0x232/0x244 [ 322.477376] [<ffffffff810ac3e0>] shrink_zone+0x30c/0x3d6 [ 322.477376] [<ffffffff810acfbb>] do_try_to_free_pages+0x176/0x27f [ 322.477376] [<ffffffff810ad159>] shrink_all_memory+0x95/0xc4 [ 322.477376] [<ffffffff810aa65c>] ? isolate_pages_global+0x0/0x1f0 [ 322.477376] [<ffffffff81076e7c>] ? count_data_pages+0x65/0x79 [ 322.477376] [<ffffffff810770e3>] hibernate_preallocate_memory+0x1aa/0x2cb [ 322.477376] [<ffffffff813f5325>] ? printk+0x41/0x44 [ 322.477376] [<ffffffff81075a83>] hibernation_snapshot+0x36/0x1e1 [ 322.477376] [<ffffffff81075cfc>] hibernate+0xce/0x172 [ 322.477376] [<ffffffff81074a69>] state_store+0x5c/0xd3 [ 322.477376] [<ffffffff81185043>] kobj_attr_store+0x17/0x19 [ 322.477376] [<ffffffff81125e87>] sysfs_write_file+0x108/0x144 [ 322.477376] [<ffffffff810d580f>] vfs_write+0xb2/0x153 [ 322.477376] [<ffffffff81063c09>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 322.477376] [<ffffffff810d5973>] sys_write+0x4a/0x71 [ 322.477376] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b [ 322.477376] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 77 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 [ 322.477376] RIP [<ffffffff810c0c87>] page_referenced+0xee/0x1dc [ 322.477376] RSP <ffff88020936d8b8> [ 322.477376] CR2: 0000000000000000 [ 322.491359] ---[ end trace 520a5274d8859b71 ]--- [ 322.491509] note: hib.sh[2750] exited with preempt_count 2 [ 322.491663] BUG: scheduling while atomic: hib.sh/2750/0x10000003 [ 322.491810] INFO: lockdep is turned off. 
[ 322.491956] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 pcspkr serial_core k10temp ohci_hcd edac_core [ 322.493364] Pid: 2750, comm: hib.sh Tainted: G D W 2.6.34-rc3-00411-ga7247b6 #13 [ 322.493622] Call Trace: [ 322.493768] [<ffffffff8106311f>] ? __debug_show_held_locks+0x1b/0x24 [ 322.493919] [<ffffffff8102d3d0>] __schedule_bug+0x72/0x77 [ 322.494070] [<ffffffff813f572e>] schedule+0xd9/0x730 [ 322.494223] [<ffffffff8103023c>] __cond_resched+0x18/0x24 [ 322.494378] [<ffffffff813f5e52>] _cond_resched+0x2c/0x37 [ 322.494527] [<ffffffff810b7da5>] unmap_vmas+0x6ce/0x893 [ 322.494678] [<ffffffff813f8e86>] ? _raw_spin_unlock_irqrestore+0x38/0x69 [ 322.494829] [<ffffffff810bc457>] exit_mmap+0xd7/0x182 [ 322.494978] [<ffffffff81035969>] mmput+0x48/0xb9 [ 322.495131] [<ffffffff81039c39>] exit_mm+0x110/0x11d [ 322.495280] [<ffffffff8103b67b>] do_exit+0x1c5/0x691 [ 322.495521] [<ffffffff81038d25>] ? kmsg_dump+0x13b/0x155 [ 322.495668] [<ffffffff810060db>] ? oops_end+0x47/0x93 [ 322.495816] [<ffffffff81006122>] oops_end+0x8e/0x93 [ 322.495964] [<ffffffff8101ed95>] no_context+0x1fc/0x20b [ 322.496118] [<ffffffff8101ef30>] __bad_area_nosemaphore+0x18c/0x1af [ 322.496267] [<ffffffff8101f16b>] ? do_page_fault+0xa8/0x32d [ 322.496484] [<ffffffff8101ef66>] bad_area_nosemaphore+0x13/0x15 [ 322.496630] [<ffffffff8101f236>] do_page_fault+0x173/0x32d [ 322.496780] [<ffffffff813f96e3>] ? error_sti+0x5/0x6 [ 322.496928] [<ffffffff81062bc7>] ? trace_hardirqs_off_caller+0x1f/0xa9 [ 322.497082] [<ffffffff813f80d2>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 322.497232] [<ffffffff813f94ff>] page_fault+0x1f/0x30 [ 322.497392] [<ffffffff810c0c87>] ? page_referenced+0xee/0x1dc [ 322.497541] [<ffffffff810c0c19>] ? page_referenced+0x80/0x1dc [ 322.497690] [<ffffffff810c60bc>] ? 
swapcache_free+0x37/0x3c [ 322.497839] [<ffffffff810ab7c2>] shrink_page_list+0x14a/0x477 [ 322.497989] [<ffffffff810abe46>] shrink_inactive_list+0x357/0x5e5 [ 322.498141] [<ffffffff810ab666>] ? shrink_active_list+0x232/0x244 [ 322.498291] [<ffffffff810ac3e0>] shrink_zone+0x30c/0x3d6 [ 322.498444] [<ffffffff810acfbb>] do_try_to_free_pages+0x176/0x27f [ 322.498594] [<ffffffff810ad159>] shrink_all_memory+0x95/0xc4 [ 322.498743] [<ffffffff810aa65c>] ? isolate_pages_global+0x0/0x1f0 [ 322.498892] [<ffffffff81076e7c>] ? count_data_pages+0x65/0x79 [ 322.499046] [<ffffffff810770e3>] hibernate_preallocate_memory+0x1aa/0x2cb [ 322.499195] [<ffffffff813f5325>] ? printk+0x41/0x44 [ 322.499344] [<ffffffff81075a83>] hibernation_snapshot+0x36/0x1e1 [ 322.499498] [<ffffffff81075cfc>] hibernate+0xce/0x172 [ 322.499647] [<ffffffff81074a69>] state_store+0x5c/0xd3 [ 322.499795] [<ffffffff81185043>] kobj_attr_store+0x17/0x19 [ 322.499944] [<ffffffff81125e87>] sysfs_write_file+0x108/0x144 [ 322.500097] [<ffffffff810d580f>] vfs_write+0xb2/0x153 [ 322.500246] [<ffffffff81063c09>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 322.500399] [<ffffffff810d5973>] sys_write+0x4a/0x71 [ 322.500547] [<ffffffff810021db>] system_call_fastpath+0x16/0x1b -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-09 20:43 ` Johannes Weiner 2010-04-09 20:57 ` Rik van Riel 2010-04-09 21:33 ` Borislav Petkov @ 2010-04-09 23:22 ` Linus Torvalds 2010-04-09 23:45 ` Rik van Riel ` (2 more replies) 2 siblings, 3 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-09 23:22 UTC (permalink / raw) To: Johannes Weiner Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Fri, 9 Apr 2010, Johannes Weiner wrote: > + /* > + * 1. case: vma and next are split parts of one root vma. > + * Their anon_vma_chain is equal and we can drop that of next. > + * > + * 2. case: one vma was instantiated as mergeable with the > + * other one and inherited the other one's primary anon_vma as > + * the singleton in its chain. > + * > + * If next came after vma, vma's chain is already an unstrict > + * superset of next's and we can treat it like case 1. > + * > + * If vma has the singleton chain, we have to copy next's > + * unique anon_vmas over. > + */ This comment makes my head hurt. In fact, the whole anon_vma thing hurts my head. Can we have some better high-level documentation on what happens for all the cases? - split (mprotect, or munmap in the middle): anon_vma_clone: the two vma's will have the same anon_vma, and the anon_vma chains will be equivalent. - merge (mprotect that creates a mergeable state): anon_vma_merge: we're supposed to have an anon_vma_chain that is a superset of the two chains of the merged entries. - fork: anon_vma_fork: each new vma will have a _new_ anon_vma as its primary one, and will link to the old primary through the anon_vma_chain. It's doing this with an anon_vma_clone() followed by adding an extra entry to the new anon_vma, and setting vma->anon_vma to the new one. 
- create/mmap: anon_vma_prepare: find a mergeable anon_vma and use that as a singleton, because the other entries on the anon_vma chain won't matter, since they cannot be associated with any pages associated with the newly created vma.. Correct? Quite frankly, just looking at that, I can't see how we get to your rules. At least not trivially. Especially with multiple merges, I don't see how "singleton" is such a special case. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-09 23:22 ` Linus Torvalds @ 2010-04-09 23:45 ` Rik van Riel 2010-04-10 0:03 ` Linus Torvalds 2010-04-09 23:54 ` Johannes Weiner 2010-04-09 23:56 ` Linus Torvalds 2 siblings, 1 reply; 231+ messages in thread From: Rik van Riel @ 2010-04-09 23:45 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, Borislav Petkov, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/09/2010 07:22 PM, Linus Torvalds wrote: > > > On Fri, 9 Apr 2010, Johannes Weiner wrote: >> + /* >> + * 1. case: vma and next are split parts of one root vma. >> + * Their anon_vma_chain is equal and we can drop that of next. >> + * >> + * 2. case: one vma was instantiated as mergeable with the >> + * other one and inherited the other one's primary anon_vma as >> + * the singleton in its chain. >> + * >> + * If next came after vma, vma's chain is already an unstrict >> + * superset of next's and we can treat it like case 1. >> + * >> + * If vma has the singleton chain, we have to copy next's >> + * unique anon_vmas over. >> + */ > > This comment makes my head hurt. In fact, the whole anon_vma thing hurts > my head. > > Can we have some better high-level documentation on what happens for all > the cases? > > - split (mprotect, or munmap in the middle): > > anon_vma_clone: the two vma's will have the same anon_vma, and the > anon_vma chains will be equivalent. > > - merge (mprotect that creates a mergeable state): > > anon_vma_merge: we're supposed to have an anon_vma_chain that is > a superset of the two chains of the merged entries. > > - fork: > > anon_vma_fork: each new vma will have a _new_ anon_vma as its > primary one, and will link to the old primary through the > anon_vma_chain. 
It's doing this with an anon_vma_clone() followed > by adding an extra entry to the new anon_vma, and setting > vma->anon_vma to the new one. > > - create/mmap: > > anon_vma_prepare: find a mergeable anon_vma and use that as a > singleton, because the other entries on the anon_vma chain won't > matter, since they cannot be associated with any pages associated > with the newly created vma.. > > Correct? This is indeed correct. > Quite frankly, just looking at that, I can't see how we get to your rules. > At least not trivially. Especially with multiple merges, I don't see > how "singleton" is such a special case. The trick is in the fact that anon_vma_merge is only called when vma->anon_vma == vma1->anon_vma. If the top anon_vmas are different, then anon_vma_merge will not be called. This means that VMAs which have recently passed through fork will not be passed to anon_vma_merge, because their top anon_vmas are different. That leaves just the split & create cases, which will be passed to anon_vma_merge when they are merged. In case of split, they will have identical anon_vma chains. In case of create + merge, one of the two VMAs will have the whole anon_vma chain, while the other one has just the top anon_vma. ^ permalink raw reply [flat|nested] 231+ messages in thread
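The argument above (fork yields a different top anon_vma, split yields identical chains, create yields a singleton that is a subset of the neighbour's chain) can be checked with a small userspace model. This is a sketch under assumptions, not kernel code: anon_vmas are integer ids, and `toy_vma`, `toy_split`, `toy_fork`, `toy_create_near`, `toy_can_merge` and `chain_is_subset` are hypothetical names invented here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHAIN 8

/* Hypothetical stand-in for a VMA; anon_vmas are plain integer ids. */
struct toy_vma {
	int primary;		/* models vma->anon_vma (the "top" anon_vma) */
	int chain[MAX_CHAIN];	/* models the anon_vmas on the anon_vma_chain */
	size_t len;
};

static bool chain_has(const struct toy_vma *v, int id)
{
	for (size_t i = 0; i < v->len; i++)
		if (v->chain[i] == id)
			return true;
	return false;
}

static void chain_add(struct toy_vma *v, int id)
{
	if (chain_has(v, id))
		return;
	assert(v->len < MAX_CHAIN);
	v->chain[v->len++] = id;
}

/* split: both halves keep the same primary and an equivalent chain. */
static struct toy_vma toy_split(const struct toy_vma *src)
{
	return *src;
}

/* fork: clone the parent's chain, then add a brand-new primary anon_vma. */
static struct toy_vma toy_fork(const struct toy_vma *parent, int new_id)
{
	struct toy_vma child = *parent;

	chain_add(&child, new_id);
	child.primary = new_id;
	return child;
}

/* create: reuse the neighbour's primary as the only entry on the chain. */
static struct toy_vma toy_create_near(const struct toy_vma *neighbour)
{
	struct toy_vma v = { .primary = neighbour->primary, .len = 0 };

	chain_add(&v, neighbour->primary);
	return v;
}

/* The merge precondition: equal top anon_vmas. */
static bool toy_can_merge(const struct toy_vma *a, const struct toy_vma *b)
{
	return a->primary == b->primary;
}

/* True when every anon_vma on a's chain is also on b's chain. */
static bool chain_is_subset(const struct toy_vma *a, const struct toy_vma *b)
{
	for (size_t i = 0; i < a->len; i++)
		if (!chain_has(b, a->chain[i]))
			return false;
	return true;
}
```

The model only covers one level of each operation. It does not capture chains that were split after a create, which is the situation the following reply worries about.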
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-09 23:45 ` Rik van Riel @ 2010-04-10 0:03 ` Linus Torvalds 2010-04-10 0:11 ` Rik van Riel 0 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-10 0:03 UTC (permalink / raw) To: Rik van Riel Cc: Johannes Weiner, Borislav Petkov, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Fri, 9 Apr 2010, Rik van Riel wrote: > > The trick is in the fact that anon_vma_merge is only called > when vma->anon_vma == vma1->anon_vma. Sure sure. I still think it's _way_ too complex. See my previous email where I suggested one single simple additional rule that I think makes things _much_ simpler. > If the top anon_vmas are different, then anon_vma_merge will > not be called. Right. The case of different anon_vma's is the trivial one. I don't worry about that. > That leaves just the split & create cases, which will be > passed to anon_vma_merge when they are merged. > > In case of split, they will have identical anon_vma chains. And yes, split is fundamentally simple. Split guarantees that the chains look identical. But: > In case of create + merge, one of the two VMAs will have > the whole anon_vma chain, while the other one has just > the top anon_vma. THIS is where I think you simplified a lot and said "and magic happens". The thing is, in the case of create, we create a different chain. That simple fact just makes merging fundamentally complicated. And we now have two different chains, and both of those can split, so those differences can "spread out". And you need to guarantee that "merge" really works. It didn't work in your original code, and quite frankly, I do _not_ think it's entirely obvious that it works in Johannes' code either. Don't get me wrong: _maybe_ Johannes' code works fine. I just don't think it's obvious at all. 
And if it doesn't work fine, now you're just spreading the differences even further. This is why I suggest that we limit the "re-use an existing vma for a new case" to the singleton case, which means that now you _never_ have differences at all. There's no spreading on splitting. Merging is trivial. Now, admittedly, I'm really hopped up on cough medication, so the feeling of this solving all the problems in the universe may not be entirely accurate. But it feels so _right_. I hope it feels right when I'm off my meds too. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 0:03 ` Linus Torvalds @ 2010-04-10 0:11 ` Rik van Riel 0 siblings, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-10 0:11 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, Borislav Petkov, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/09/2010 08:03 PM, Linus Torvalds wrote: > This is why I suggest that we limit the "re-use an existing vma for a new > case" to the singleton case, which means that now you _never_ have > differences at all. There's no spreading on splitting. Merging is trivial. That looks like it should work. > Now, admittedly, I'm really hopped up on cough medication, so the feeling > of this solving all the problems in the universe may not be entirely > accurate. But it feels so _right_. > > I hope it feels right when I'm off my meds too. I am not on any cough meds, and your patch looks right. OTOH, maybe I should be on some kind of cold meds, because I haven't been feeling right all week... ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-09 23:22 ` Linus Torvalds 2010-04-09 23:45 ` Rik van Riel @ 2010-04-09 23:54 ` Johannes Weiner 2010-04-09 23:56 ` Linus Torvalds 2 siblings, 0 replies; 231+ messages in thread From: Johannes Weiner @ 2010-04-09 23:54 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Fri, Apr 09, 2010 at 04:22:19PM -0700, Linus Torvalds wrote: > > > On Fri, 9 Apr 2010, Johannes Weiner wrote: > > + /* > > + * 1. case: vma and next are split parts of one root vma. > > + * Their anon_vma_chain is equal and we can drop that of next. > > + * > > + * 2. case: one vma was instantiated as mergeable with the > > + * other one and inherited the other one's primary anon_vma as > > + * the singleton in its chain. > > + * > > + * If next came after vma, vma's chain is already an unstrict > > + * superset of next's and we can treat it like case 1. > > + * > > + * If vma has the singleton chain, we have to copy next's > > + * unique anon_vmas over. > > + */ > > This comment makes my head hurt. In fact, the whole anon_vma thing hurts > my head. I can relate ;) > Can we have some better high-level documentation on what happens for all > the cases. > > - split (mprotect, or munmap in the middle): > > anon_vma_clone: the two vma's will have the same anon_vma, and the > anon_vma chains will be equivalent. > > - merge (mprotect that creates a mergeable state): > > anon_vma_merge: we're supposed to have an anon_vma_chain that is > a superset of the two chains of the merged entries. > > - fork: > > anon_vma_fork: each new vma will have a _new_ anon_vma as its > primary one, and will link to the old primary through the > anon_vma_chain. 
It's doing this with an anon_vma_clone() followed > by adding an extra entry to the new anon_vma, and setting > vma->anon_vma to the new one. > > - create/mmap: > > anon_vma_prepare: find a mergeable anon_vma and use that as a > singleton, because the other entries on the anon_vma chain won't > matter, since they cannot be associated with any pages associated > with the newly created vma. > > Correct? > > Quite frankly, just looking at that, I can't see how we get to your rules. > At least not trivially. Especially with multiple merges, I don't see > how "singleton" is such a special case. The key is that merging is only possible if the primary anon_vmas are equivalent. This only happens if we split a vma in two and clone the old vma's anon_vma_chain into the new vma. So the chains are equivalent. Or anon_vma_prepare() finds a mergeable anon_vma, in which case this will be the singleton on the vma's chain. If a split vma is merged, the old anon_vma_chains are equivalent, we drop one completely and the one that stays has not changed. If a mergeable vma (singleton anon_vma) is merged into another one, this singleton is the primary anon_vma of the swallowing vma, thus already linked and the swallowing vma's anon_vma_chain stays unchanged. If it's the other way round and the singleton vma swallows the other one, every anon_vma of the vanishing vma is moved over (except your singleton anon_vma, you already have that). The result should look exactly like the chain we swallowed. So in all this merging, no unique and new combination of anon_vma_chains should have been created! Thus you can merge as much as you want, either you swallow singletons and don't change yourself or you are the singleton and after the merger have an equivalent anon_vma_chain to the vma you swallowed. Again: no new anon_vmas should enter the game for mergeable vmas and no _new_ anon_vma_chains should be created while merging. 
Thus it is always true that you either merge with a singleton or the chains are equivalent. At least those are my assumptions. Maybe they are crap, but I don't see how right now. And according to Boris' test, somewhere we still drop anon_vmas where we let pages in the field pointing at them. Hannes ^ permalink raw reply [flat|nested] 231+ messages in thread
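The merge argument above can be restated as a toy userspace check. This is a sketch under assumptions, not the kernel's anon_vma_merge(): chains are modelled as bitmasks with one bit per anon_vma, and the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one bit per anon_vma, a chain is the set of linked anon_vmas. */
struct toy_vma {
    unsigned primary;   /* stands in for vma->anon_vma */
    uint32_t chain;     /* stands in for the anon_vma_chain list */
};

/* A singleton chain holds nothing but the primary anon_vma. */
static bool toy_is_singleton(struct toy_vma v)
{
    return v.chain == (UINT32_C(1) << v.primary);
}

/* Merge `gone` into `keep`: the survivor links every anon_vma it lacks. */
static struct toy_vma toy_merge(struct toy_vma keep, struct toy_vma gone)
{
    assert(keep.primary == gone.primary);   /* precondition for merging */
    keep.chain |= gone.chain;
    return keep;
}

/* Claim under test: merging yields a chain that one side already had. */
static bool toy_merge_creates_no_new_chain(struct toy_vma keep,
                                           struct toy_vma gone)
{
    uint32_t merged = toy_merge(keep, gone).chain;
    return merged == keep.chain || merged == gone.chain;
}
```

With one full chain and one singleton sharing the same primary, the check holds in both merge directions. If two different non-singleton chains ever shared a primary (a hypothetical state here, and the kind of divergence the thread is worried about), the union would be a new combination and the check would fail.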
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-09 23:22 ` Linus Torvalds 2010-04-09 23:45 ` Rik van Riel 2010-04-09 23:54 ` Johannes Weiner @ 2010-04-09 23:56 ` Linus Torvalds 2010-04-10 0:19 ` Rik van Riel 2010-04-10 0:31 ` Johannes Weiner 2 siblings, 2 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-09 23:56 UTC (permalink / raw) To: Johannes Weiner Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Fri, 9 Apr 2010, Linus Torvalds wrote: > > Can we have some better high-level documentation on what happens for all > the cases. > > - split (mprotect, or munmap in the middle): > > anon_vma_clone: the two vma's will have the same anon_vma, and the > anon_vma chains will be equivalent. > > - merge (mprotect that creates a mergeable state): > > anon_vma_merge: we're supposed to have an anon_vma_chain that is > a superset of the two chains of the merged entries. > > - fork: > > anon_vma_fork: each new vma will have a _new_ anon_vma as its > primary one, and will link to the old primary through the > anon_vma_chain. It's doing this with an anon_vma_clone() followed > by adding an extra entry to the new anon_vma, and setting > vma->anon_vma to the new one. > > - create/mmap: > > anon_vma_prepare: find a mergeable anon_vma and use that as a > singleton, because the other entries on the anon_vma chain won't > matter, since they cannot be associated with any pages associated > with the newly created vma. > > Correct? Ok, so I don't know if the above is correct, but if it is, let's ignore the "merge" case as being complex, and look at the other cases. With fork, the main anon_vma becomes different, so let's ignore that. That always means that the resulting list is not comparable or compatible, and we'll never mix them up. 
If we make one very _simple_ rule for the create/mmap case, namely that we only re-use another _singleton_ anon_vma, then the split and create cases will look exactly the same. And in particular, we get a very simple and powerful rule: if the anon_vma matches, then the _list_ will also always match. And that, in turn, would make 'merge' trivial too: you really can always drop the side that goes away. There's never any question about how to merge the lists, or which to pick, because every single operation that leaves the anon_vma the same will guarantee that the list will be identical too. So now the simple rule is that if the anon_vma is the same, then the list of associated anon_vma's will always be the same - across all of merge, split and create. Isn't that a _much_ simpler model to think about? So _instead_ of all the patches that have floated about, I would suggest this simple change to "find_mergeable_anon_vma()" instead.. Oh, and maybe it's the meds talking again. I'm feeling better than yesterday, but am still a bit lightheaded. 
Linus --- mm/mmap.c | 6 ++++-- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 75557c6..462a8ca 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -850,7 +850,8 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma) vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC); vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC); - if (near->anon_vma && vma->vm_end == near->vm_start && + if (near->anon_vma && list_is_singular(&near->anon_vma_chain) && + vma->vm_end == near->vm_start && mpol_equal(vma_policy(vma), vma_policy(near)) && can_vma_merge_before(near, vm_flags, NULL, vma->vm_file, vma->vm_pgoff + @@ -871,7 +872,8 @@ try_prev: vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC); vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC); - if (near->anon_vma && near->vm_end == vma->vm_start && + if (near->anon_vma && list_is_singular(&near->anon_vma_chain) && + near->vm_end == vma->vm_start && mpol_equal(vma_policy(near), vma_policy(vma)) && can_vma_merge_after(near, vm_flags, NULL, vma->vm_file, vma->vm_pgoff)) ^ permalink raw reply related [flat|nested] 231+ messages in thread
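The patch above hinges on list_is_singular(), which is true only for a list with exactly one entry. A minimal userspace sketch of that test on a circular doubly-linked list follows; the *_model names are hypothetical stand-ins chosen here to avoid implying these are the kernel's own definitions.

```c
#include <stdbool.h>

/* Circular doubly-linked list node, in the style of the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head that points at itself in both directions. */
static void list_init_model(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Link `entry` in just before `head`, i.e. at the tail of the list. */
static void list_add_tail_model(struct list_head *entry,
                                struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

static bool list_empty_model(const struct list_head *head)
{
    return head->next == head;
}

/* Exactly one entry: not empty, and first entry == last entry. */
static bool list_is_singular_model(const struct list_head *head)
{
    return !list_empty_model(head) && head->next == head->prev;
}
```

Applied to near->anon_vma_chain, the test means that only a neighbour whose chain contains nothing but its own primary anon_vma is offered for re-use, so a matching anon_vma implies a matching list.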
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-09 23:56 ` Linus Torvalds @ 2010-04-10 0:19 ` Rik van Riel 2010-04-10 0:31 ` Johannes Weiner 1 sibling, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-10 0:19 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, Borislav Petkov, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/09/2010 07:56 PM, Linus Torvalds wrote: > So _instead_ of all the patches that have floated about, I would suggest > this simple change to "find_mergeable_anon_vma()" instead.. Boris, this is your chance to really ruin our week :) If the bug persists with Linus's patch, we've been fixing the wrong bug all week long, and you are experiencing something else... I'm getting really curious now. > --- > mm/mmap.c | 6 ++++-- > 1 files changed, 4 insertions(+), 2 deletions(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > index 75557c6..462a8ca 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -850,7 +850,8 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma) > vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC); > vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC); > > - if (near->anon_vma && vma->vm_end == near->vm_start && > + if (near->anon_vma && list_is_singular(&near->anon_vma_chain) && > + vma->vm_end == near->vm_start && > mpol_equal(vma_policy(vma), vma_policy(near)) && > can_vma_merge_before(near, vm_flags, > NULL, vma->vm_file, vma->vm_pgoff + > @@ -871,7 +872,8 @@ try_prev: > vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC); > vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC); > > - if (near->anon_vma && near->vm_end == vma->vm_start && > + if (near->anon_vma && list_is_singular(&near->anon_vma_chain) && > + near->vm_end == vma->vm_start && > mpol_equal(vma_policy(near), vma_policy(vma)) && > can_vma_merge_after(near, vm_flags, > NULL, vma->vm_file, 
vma->vm_pgoff)) ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-09 23:56 ` Linus Torvalds 2010-04-10 0:19 ` Rik van Riel @ 2010-04-10 0:31 ` Johannes Weiner 2010-04-10 0:32 ` Linus Torvalds 1 sibling, 1 reply; 231+ messages in thread From: Johannes Weiner @ 2010-04-10 0:31 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Fri, Apr 09, 2010 at 04:56:13PM -0700, Linus Torvalds wrote: > So _instead_ of all the patches that have floated about, I would suggest > this simple change to "find_mergeable_anon_vma()" instead.. That leaves the chance that my code was correct and we leave a conceptual error around somewhere that can materialize again. But I am at a point where simplification never sounded more blissful, so yeah, I like it :) Let's hope it fixes Boris's issue. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 0:31 ` Johannes Weiner @ 2010-04-10 0:32 ` Linus Torvalds 2010-04-10 7:27 ` Borislav Petkov 0 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-10 0:32 UTC (permalink / raw) To: Johannes Weiner Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sat, 10 Apr 2010, Johannes Weiner wrote: > > That leaves the chance that my code was correct and we leave a conceptual > error around somewhere that can materialize again. Absolutely. I really don't know whether your merge routine works or not. I'd just rather not have to even _try_ to understand it. I have a fairly simple rule for most of the code I see: if I have a hard time understanding why it should work, I don't really want to rely on it. > But I am at a point where simplification never sounded more blissful, so > yeah, I like it :) Exactly. This is the "let's limit things a bit to keep them much simpler" rule. > Let's hope it fixes Boris's issue. I'm going to just guess that it won't, and that Boris' issue was actually due to something else entirely, and we've all been staring at totally the wrong code. But we can hope. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 0:32 ` Linus Torvalds @ 2010-04-10 7:27 ` Borislav Petkov 2010-04-10 11:26 ` Borislav Petkov 0 siblings, 1 reply; 231+ messages in thread From: Borislav Petkov @ 2010-04-10 7:27 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Linus Torvalds <torvalds@linux-foundation.org> Date: Fri, Apr 09, 2010 at 05:32:36PM -0700 > Exactly. This is the "let's limit things a bit to keep them much simpler. You gotta love that rule :) > > Let's hope it fixes Boris's issue. > > I'm going to just guess that it won't, and that Boris' issue was actually > due to something else entirely, and we've all been staring at totally the > wrong code. > > But we can hope. Now why would you go and jinx it like that... :) Hibernation runs back-to-back: 1. light system load after boot... ok 2. 3 kvm guests, 3Gb mem free of 8Gb total acc. to /proc/meminfo... ok [ this was the fireproof way to trigger the bug, btw] 3. kvm guests down, firefox loading a 4Mb html page... ok 4. start ubuntu guest, firefox keeps loading the 4Mb html page after previous resume... ok 5. ubuntu guest booting done, firefox done, play video... ok 6. video broken after resume due to: [AO_ALSA] Pcm in suspend mode, trying to resume. 212% 2% 1.7% 1 0 [AO_ALSA] alsa-lib: pcm_hw.c:709:(snd_pcm_hw_resume) SNDRV_PCM_IOCTL_RESUME failed: Function not implemented i.e., unrelated... still ok 7. ubuntu guest downloading a 100Mb file causing allocation of a bunch of anon memory in the host... ok 8. all guests off, firefox off, back to light load... ok No oopsies or problems in dmesg except the old lockdep sysfs warning. I will keep running that kernel in the next couple of days and keep you informed in case this is the fix we're gonna use. 
-- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 7:27 ` Borislav Petkov @ 2010-04-10 11:26 ` Borislav Petkov 2010-04-10 14:45 ` Rik van Riel 2010-04-10 15:24 ` Linus Torvalds 0 siblings, 2 replies; 231+ messages in thread From: Borislav Petkov @ 2010-04-10 11:26 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Borislav Petkov <bp@alien8.de> Date: Sat, Apr 10, 2010 at 09:27:14AM +0200 > Now why would you go and jinx it like that... :) > > Hibernation runs back-to-back: > > 1. light system load after boot... ok > 2. 3 kvm guests, 3Gb mem free of 8Gb total acc. to /proc/meminfo... ok [ this was the fireproof way to trigger the bug, btw] > 3. kvm guests down, firefox loading a 4Mb html page... ok > 4. start ubuntu guest, firefox keeps loading the 4Mb html page after previous resume... ok > 5. ubuntu guest booting done, firefox done, play video... ok > 6. video broken after resume due to: > > [AO_ALSA] Pcm in suspend mode, trying to resume. 212% 2% 1.7% 1 0 > [AO_ALSA] alsa-lib: pcm_hw.c:709:(snd_pcm_hw_resume) SNDRV_PCM_IOCTL_RESUME failed: Function not implemented > > i.e., unrelated... still ok > > 7. ubuntu guest downloading a 100Mb file causing allocation of a bunch of anon memory in the host... ok > 8. all guests off, firefox off, back to light load... ok > > No oopsies or problems in dmesg except the old lockdep sysfs warning. > > I will keep running that kernel in the next couple of days and keep you > informed in case this is the fix we're gonna use. Yep, you jinxed it :) This time we got stuck on the anon_vma->lock (yep, we've seen that oopsie before). So, it might be that we _really_ are staring at the wrong code... Back to square one. [18969.797126] BUG: soft lockup - CPU#1 stuck for 61s! 
[hib.sh:5605] [18969.797126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd pcspkr serial_core k10temp edac_core [18969.798029] irq event stamp: 0 [18969.798029] hardirqs last enabled at (0): [<(null)>] (null) [18969.798029] hardirqs last disabled at (0): [<ffffffff8103657c>] copy_process+0x3c1/0x10cc [18969.798029] softirqs last enabled at (0): [<ffffffff8103657c>] copy_process+0x3c1/0x10cc [18969.798029] softirqs last disabled at (0): [<(null)>] (null) [18969.798029] CPU 1 [18969.798029] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd pcspkr serial_core k10temp edac_core [18969.798029] [18969.798029] Pid: 5605, comm: hib.sh Not tainted 2.6.34-rc3-00501-gefb57c0 #1 M3A78 PRO/System Product Name [18969.798029] RIP: 0010:[<ffffffff8118b7f4>] [<ffffffff8118b7f4>] delay_tsc+0x33/0xca [18969.798029] RSP: 0018:ffff8801aebdf7b8 EFLAGS: 00000206 [18969.798029] RAX: 00000000fc6fc9e8 RBX: ffff8801aebdf7e8 RCX: 0000000000001200 [18969.798029] RDX: 0000000000002806 RSI: ffff8801aebdf848 RDI: 0000000000000001 [18969.798029] RBP: ffffffff81002b4e R08: 0000000000000001 R09: 0000000000000000 [18969.798029] R10: ffff8801aebdf8a8 R11: 0000000000000001 R12: 0000000000000014 [18969.798029] R13: ffff88000a200000 R14: ffff8801aebde000 R15: ffff8801aebdffd8 [18969.798029] FS: 00007f2c86c656f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000 [18969.798029] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [18969.798029] CR2: 00007fd515101870 CR3: 000000022bd9a000 CR4: 00000000000006e0 [18969.798029] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [18969.798029] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [18969.798029] Process hib.sh (pid: 5605, threadinfo 
ffff8801aebde000, task ffff88022e194b80) [18969.798029] Stack: [18969.798029] 0000000000000001 ffff88022d2db720 ffff88022e194b80 00000000b3477260 [18969.798029] <0> ffff88022e194f28 000000002a5200c6 ffff8801aebdf7f8 ffffffff8118b7bf [18969.798029] <0> ffff8801aebdf848 ffffffff8119a296 ffff88022d2db738 0000000000000001 [18969.798029] Call Trace: [18969.798029] [<ffffffff8118b7bf>] ? __delay+0xf/0x11 [18969.798029] [<ffffffff8119a296>] ? do_raw_spin_lock+0xd2/0x13c [18969.798029] [<ffffffff813f843b>] ? _raw_spin_lock+0x60/0x73 [18969.798029] [<ffffffff810c0ae3>] ? page_lock_anon_vma+0x63/0xac [18969.798029] [<ffffffff810c0ae3>] ? page_lock_anon_vma+0x63/0xac [18969.798029] [<ffffffff810c0a80>] ? page_lock_anon_vma+0x0/0xac [18969.798029] [<ffffffff810c0cc9>] ? page_referenced+0x80/0x1dc [18969.798029] [<ffffffff810c60a0>] ? swapcache_free+0x37/0x3c [18969.798029] [<ffffffff810ab7e6>] ? shrink_page_list+0x14a/0x477 [18969.798029] [<ffffffff810abe6a>] ? shrink_inactive_list+0x357/0x5e5 [18969.798029] [<ffffffff810ab68a>] ? shrink_active_list+0x232/0x244 [18969.798029] [<ffffffff810ac404>] ? shrink_zone+0x30c/0x3d6 [18969.798029] [<ffffffff810acfdf>] ? do_try_to_free_pages+0x176/0x27f [18969.798029] [<ffffffff810ad17d>] ? shrink_all_memory+0x95/0xc4 [18969.798029] [<ffffffff810aa680>] ? isolate_pages_global+0x0/0x1f0 [18969.798029] [<ffffffff81076e80>] ? count_data_pages+0x65/0x79 [18969.798029] [<ffffffff810770e7>] ? hibernate_preallocate_memory+0x1aa/0x2cb [18969.798029] [<ffffffff813f5445>] ? printk+0x41/0x44 [18969.798029] [<ffffffff81075a87>] ? hibernation_snapshot+0x36/0x1e1 [18969.798029] [<ffffffff81075d00>] ? hibernate+0xce/0x172 [18969.798029] [<ffffffff81074a6d>] ? state_store+0x5c/0xd3 [18969.798029] [<ffffffff8118504b>] ? kobj_attr_store+0x17/0x19 [18969.798029] [<ffffffff81125e8b>] ? sysfs_write_file+0x108/0x144 [18969.798029] [<ffffffff810d5807>] ? vfs_write+0xb2/0x153 [18969.798029] [<ffffffff81063c0d>] ? 
trace_hardirqs_on_caller+0x1f/0x14b [18969.798029] [<ffffffff810d596b>] ? sys_write+0x4a/0x71 [18969.798029] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b [18969.798029] Code: 41 55 41 54 53 48 83 ec 08 0f 1f 44 00 00 49 89 fc bf 01 00 00 00 e8 88 1d ea ff e8 db f4 00 00 41 89 c5 0f ae f0 66 66 90 0f 31 <89> c3 65 4c 8b 34 25 48 b5 00 00 0f ae f0 66 66 90 0f 31 41 89 [18969.798029] Call Trace: [18969.798029] [<ffffffff8118b7bf>] ? __delay+0xf/0x11 [18969.798029] [<ffffffff8119a296>] ? do_raw_spin_lock+0xd2/0x13c [18969.798029] [<ffffffff813f843b>] ? _raw_spin_lock+0x60/0x73 [18969.798029] [<ffffffff810c0ae3>] ? page_lock_anon_vma+0x63/0xac [18969.798029] [<ffffffff810c0ae3>] ? page_lock_anon_vma+0x63/0xac [18969.798029] [<ffffffff810c0a80>] ? page_lock_anon_vma+0x0/0xac [18969.798029] [<ffffffff810c0cc9>] ? page_referenced+0x80/0x1dc [18969.798029] [<ffffffff810c60a0>] ? swapcache_free+0x37/0x3c [18969.798029] [<ffffffff810ab7e6>] ? shrink_page_list+0x14a/0x477 [18969.798029] [<ffffffff810abe6a>] ? shrink_inactive_list+0x357/0x5e5 [18969.798029] [<ffffffff810ab68a>] ? shrink_active_list+0x232/0x244 [18969.798029] [<ffffffff810ac404>] ? shrink_zone+0x30c/0x3d6 [18969.798029] [<ffffffff810acfdf>] ? do_try_to_free_pages+0x176/0x27f [18969.798029] [<ffffffff810ad17d>] ? shrink_all_memory+0x95/0xc4 [18969.798029] [<ffffffff810aa680>] ? isolate_pages_global+0x0/0x1f0 [18969.798029] [<ffffffff81076e80>] ? count_data_pages+0x65/0x79 [18969.798029] [<ffffffff810770e7>] ? hibernate_preallocate_memory+0x1aa/0x2cb [18969.798029] [<ffffffff813f5445>] ? printk+0x41/0x44 [18969.798029] [<ffffffff81075a87>] ? hibernation_snapshot+0x36/0x1e1 [18969.798029] [<ffffffff81075d00>] ? hibernate+0xce/0x172 [18969.798029] [<ffffffff81074a6d>] ? state_store+0x5c/0xd3 [18969.798029] [<ffffffff8118504b>] ? kobj_attr_store+0x17/0x19 [18969.798029] [<ffffffff81125e8b>] ? sysfs_write_file+0x108/0x144 [18969.798029] [<ffffffff810d5807>] ? 
vfs_write+0xb2/0x153 [18969.798029] [<ffffffff81063c0d>] ? trace_hardirqs_on_caller+0x1f/0x14b [18969.798029] [<ffffffff810d596b>] ? sys_write+0x4a/0x71 [18969.798029] [<ffffffff810021db>] ? system_call_fastpath+0x16/0x1b [19005.426655] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) [19005.663484] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) [19007.018563] SysRq : Emergency Sync [19007.018969] Emergency Sync complete [19007.582218] SysRq : Emergency Remount R/O [19008.251934] SysRq : Power Off [19010.076146] SysRq : Resetting -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 11:26 ` Borislav Petkov @ 2010-04-10 14:45 ` Rik van Riel 2010-04-10 15:24 ` Linus Torvalds 1 sibling, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-10 14:45 UTC (permalink / raw) To: Borislav Petkov, Linus Torvalds, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/10/2010 07:26 AM, Borislav Petkov wrote: > This time we got stuck on the anon_vma->lock (yep, we've seen that > oopsie before). So, it might be that we _really_ are staring at the > wrong code... Back to square one. This is a different bug, though. If the null pointer dereference is gone, Linus's patch fixed that bug and we can move forward to fixing the anon_vma->lock bug. I'll start auditing the code to see if we forget to unlock the anon_vma in some unlikely error path... ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 11:26 ` Borislav Petkov 2010-04-10 14:45 ` Rik van Riel @ 2010-04-10 15:24 ` Linus Torvalds 2010-04-10 16:38 ` Borislav Petkov 2010-04-10 16:41 ` Linus Torvalds 1 sibling, 2 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-10 15:24 UTC (permalink / raw) To: Borislav Petkov Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sat, 10 Apr 2010, Borislav Petkov wrote: > > > > I will keep running that kernel in the next couple of days and keep you > > informed in case this is the fix we're gonna use. > > Yep, you jinxed it :) > > This time we got stuck on the anon_vma->lock (yep, we've seen that > oopsie before). So, it might be that we _really_ are staring at the > wrong code... Back to square one. No, I think we're good. I suspect this is a different issue. Do you have lockdep enabled, along with mutex and spinlock debugging etc? That might help pinpoint what triggers this. But I think the fact that you are apparently not able to get the list corruption is a good sign. Of course, it might just be harder to trigger, and these things could all be a sign of a different bug, but my gut feel is that we did fix something, and you are just damn good at stressing the new code. Kudos. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 15:24 ` Linus Torvalds @ 2010-04-10 16:38 ` Borislav Petkov 2010-04-10 17:05 ` Linus Torvalds 2010-04-10 17:07 ` Borislav Petkov 2010-04-10 16:41 ` Linus Torvalds 1 sibling, 2 replies; 231+ messages in thread From: Borislav Petkov @ 2010-04-10 16:38 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Linus Torvalds <torvalds@linux-foundation.org> Date: Sat, Apr 10, 2010 at 08:24:02AM -0700 > No, I think we're good. I suspect this is a different issue. Do you have > lockdep enabled, along with mutex and spinlock debugging etc? That might > help pinpoint what triggers this. I had pretty much all lock debugging options enabled except PROVE_RCU. > But I think the fact that you are apparently not able to get the list > corruption is a good sign. Of course, it might just be harder to trigger, > and these things could all be a sign of a different bug, but my gut feel > is that we did fix something, and you are just damn good at stressing the > new code. Kudos. Yep, even my mom says I'm good at breaking things :) But seriously, thanks - means a lot coming from you. And I got an oops again, this time the #GP from a couple of days ago. <thinking out loud> I'm starting to think that maybe there could be something wrong with the machine I'm running it on. Especially since there are only two people who reported this issue, Steinar and me, so how probable is it that maybe those two machines have a failing RAM module somewhere? Or some other data corrupting thing? Although I should be getting mchecks... Hmm... </thinking out loud> I'm going to run the stress test on 2.6.33.2 to verify whether this is actually software-related. Just in case. 
Oh, yes, I almost forgot, the latest and greatest in the world of oopsies: [ 452.351588] general protection fault: 0000 [#1] PREEMPT SMP [ 452.352119] last sysfs file: /sys/power/state [ 452.352131] CPU 1 [ 452.352131] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core ohci_hcd pcspkr k10temp [ 452.352131] [ 452.352131] Pid: 2929, comm: hib.sh Not tainted 2.6.34-rc3-00501-gefb57c0 #4 M3A78 PRO/System Product Name [ 452.352131] RIP: 0010:[<ffffffff810c5f00>] [<ffffffff810c5f00>] page_referenced+0xee/0x1dc [ 452.352131] RSP: 0018:ffff88022adb18b8 EFLAGS: 00010206 [ 452.352131] RAX: ffff88022ad5c468 RBX: ffffea0007598558 RCX: 0000000000000000 [ 452.352131] RDX: ffff88022adb1cf8 RSI: ffff88022ad5c440 RDI: ffff88022e7d38a0 [ 452.352131] RBP: ffff88022adb1938 R08: 0000000000000002 R09: 0000000000000000 [ 452.352131] R10: ffff88022be83868 R11: ffffffff00000012 R12: 0000000000000000 [ 452.352131] R13: 0032323200323212 R14: ffff88022ad5c428 R15: ffff88022adb1a00 [ 452.352131] FS: 00007f056a1e36f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000 [ 452.352131] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 452.352131] CR2: 000000000250e408 CR3: 000000022983f000 CR4: 00000000000006e0 [ 452.352131] DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003 [ 452.352131] DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 452.352131] Process hib.sh (pid: 2929, threadinfo ffff88022adb0000, task ffff88022e7d38a0) [ 452.352131] Stack: [ 452.352131] ffff88022ad5c468 00000000810c5c1f ffff88022adb1918 ffffffff810c5d88 [ 452.352131] <0> ffff88022adb18f8 ffffffff00000001 ffffea00075c89c0 ffffea00075984e8 [ 452.352131] <0> ffffea00075984e8 000000022adb1cf8 ffffea00075984e8 ffffea0007598580 [ 452.352131] Call Trace: [ 452.352131] [<ffffffff810c5d88>] ? 
try_to_unmap_anon+0xa2/0xb4 [ 452.352131] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7 [ 452.352131] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1 [ 452.352131] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc [ 452.352131] [<ffffffff8140f856>] ? _raw_spin_unlock_irq+0x30/0x58 [ 452.352131] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c [ 452.352131] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6 [ 452.352131] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a [ 452.352131] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4 [ 452.352131] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc [ 452.352131] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79 [ 452.352131] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb [ 452.352131] [<ffffffff8140bbd4>] ? printk+0x41/0x45 [ 452.352131] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1 [ 452.352131] [<ffffffff81078a08>] hibernate+0xce/0x172 [ 452.352131] [<ffffffff81077775>] state_store+0x5c/0xd3 [ 452.352131] [<ffffffff8118f3cf>] kobj_attr_store+0x17/0x19 [ 452.352131] [<ffffffff8112e288>] sysfs_write_file+0x108/0x144 [ 452.352131] [<ffffffff810db4ff>] vfs_write+0xb2/0x153 [ 452.352131] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 452.352131] [<ffffffff810db663>] sys_write+0x4a/0x71 [ 452.352131] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b [ 452.352131] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 11 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 [ 452.352131] RIP [<ffffffff810c5f00>] page_referenced+0xee/0x1dc [ 452.352131] RSP <ffff88022adb18b8> [ 452.368192] ---[ end trace a9c84cb81ab9fd41 ]--- [ 452.368372] note: hib.sh[2929] exited with preempt_count 2 [ 452.368564] BUG: scheduling while atomic: hib.sh/2929/0x10000003 [ 452.368742] INFO: lockdep is turned off. 
[ 452.368915] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core ohci_hcd pcspkr k10temp [ 452.370749] Pid: 2929, comm: hib.sh Tainted: G D 2.6.34-rc3-00501-gefb57c0 #4 [ 452.371051] Call Trace: [ 452.371239] [<ffffffff810658df>] ? __debug_show_held_locks+0x1b/0x24 [ 452.371425] [<ffffffff8102dfac>] __schedule_bug+0x72/0x77 [ 452.371608] [<ffffffff8140bfe8>] schedule+0xe3/0x7ff [ 452.371788] [<ffffffff810bd066>] ? unmap_vmas+0x88e/0x893 [ 452.371973] [<ffffffff81030ecb>] __cond_resched+0x18/0x24 [ 452.372168] [<ffffffff8140c7d1>] _cond_resched+0x2c/0x37 [ 452.372348] [<ffffffff810bcea6>] unmap_vmas+0x6ce/0x893 [ 452.372531] [<ffffffff8140f8b6>] ? _raw_spin_unlock_irqrestore+0x38/0x69 [ 452.372721] [<ffffffff810c1604>] exit_mmap+0xd7/0x182 [ 452.372903] [<ffffffff810368bc>] mmput+0x48/0xb9 [ 452.373088] [<ffffffff8103ad90>] exit_mm+0x110/0x11d [ 452.373284] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5 [ 452.373464] [<ffffffff81039e2f>] ? kmsg_dump+0x13b/0x155 [ 452.373645] [<ffffffff8100616b>] ? oops_end+0x47/0x93 [ 452.373826] [<ffffffff810061b2>] oops_end+0x8e/0x93 [ 452.374006] [<ffffffff810063a3>] die+0x5a/0x63 [ 452.374198] [<ffffffff81003eef>] do_general_protection+0x134/0x13c [ 452.374382] [<ffffffff8140fdb0>] ? irq_return+0x0/0x2 [ 452.374565] [<ffffffff8140ff8f>] general_protection+0x1f/0x30 [ 452.374754] [<ffffffff810c5f00>] ? page_referenced+0xee/0x1dc [ 452.374940] [<ffffffff810c5e92>] ? page_referenced+0x80/0x1dc [ 452.375147] [<ffffffff810c5d88>] ? try_to_unmap_anon+0xa2/0xb4 [ 452.375335] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7 [ 452.375519] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1 [ 452.375703] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc [ 452.375888] [<ffffffff8140f856>] ? 
_raw_spin_unlock_irq+0x30/0x58 [ 452.376080] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c [ 452.376284] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6 [ 452.376476] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a [ 452.376664] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4 [ 452.376852] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc [ 452.377038] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79 [ 452.377238] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb [ 452.377429] [<ffffffff8140bbd4>] ? printk+0x41/0x45 [ 452.377611] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1 [ 452.377794] [<ffffffff81078a08>] hibernate+0xce/0x172 [ 452.377975] [<ffffffff81077775>] state_store+0x5c/0xd3 [ 452.378170] [<ffffffff8118f3cf>] kobj_attr_store+0x17/0x19 [ 452.378351] [<ffffffff8112e288>] sysfs_write_file+0x108/0x144 [ 452.378533] [<ffffffff810db4ff>] vfs_write+0xb2/0x153 [ 452.378714] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 452.378898] [<ffffffff810db663>] sys_write+0x4a/0x71 [ 452.379084] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 16:38 ` Borislav Petkov @ 2010-04-10 17:05 ` Linus Torvalds 2010-04-10 18:21 ` Linus Torvalds 2010-04-10 17:07 ` Borislav Petkov 1 sibling, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-10 17:05 UTC (permalink / raw) To: Borislav Petkov Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sat, 10 Apr 2010, Borislav Petkov wrote: > > And I got an oops again, this time the #GP from couple of days ago. Oh damn. So the list corruption really does happen still. And the pattern is similar, but not the same: now it's 0032323200323232, rather than 002e2e2e002e2e2e. Very intriguing. 0x32 instead of 0x2e, but the same pattern of duplicated bytes. And not very helpful in that it still doesn't actually make any sense. > <thinking out loud> > > I'm starting to think that maybe there could be something wrong with the > machine I'm running it on. Especially since there are only two people > who reported this issue, Steinar and me, so how probable is it that > maybe those two machines have failing RAM module somewhere? Or some > other data corrupting thing? Although I should be getting mchecks... > Hmm... No. Just the fact that there are two people who reported the same thing is already a pretty strong sign that it's real. Also, hardware problems don't tend to be as consistent in the details as yours have been. And in fact I have seen it personally (but couldn't reproduce it) on the kids mac mini after you reported it. So I'm convinced the problem is real, and just not so easily triggered, and you're being a great tester. Linus -- Here's the one I've seen, in case you care. I haven't posted it, because it doesn't really add anything new. 
BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<c02850cf>] page_referenced+0xd6/0x199 *pde = 21d73067 *pte = 00000000 Oops: 0000 [#2] SMP last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host2/target2:0:0/2:0:0:0/block/sda/uevent Modules linked in: [last unloaded: scsi_wait_scan] Pid: 14440, comm: firefox Tainted: G D 2.6.34-rc2-00391-gfc1203c #3 Mac-F4208EC8/Macmini1,1 EIP: 0060:[<c02850cf>] EFLAGS: 00210287 CPU: 1 EIP is at page_referenced+0xd6/0x199 EAX: f59e65d4 EBX: c10b5480 ECX: 00000000 EDX: fffffff0 ESI: f59e65d0 EDI: 00000000 EBP: d8f77cd8 ESP: d8f77ca0 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process firefox (pid: 14440, ti=d8f76000 task=cb795440 task.ti=d8f76000) Stack: f59e65d4 00000000 fffffff0 c15ba000 d8f77cbc c02885b8 c07972c4 d8f77cdc c0276712 00000000 00000001 c10b5498 c10b5480 d8f77e94 d8f77d58 c0276b53 d8f77d48 00000000 00000000 00000000 0000001d d8f77de8 00000001 c07972c4 Call Trace: [<c02885b8>] ? swapcache_free+0x1b/0x24 [<c0276712>] ? __remove_mapping+0x90/0xb2 [<c0276b53>] ? shrink_page_list+0x109/0x3ba [<c0277099>] ? shrink_inactive_list+0x295/0x48e [<c0273d68>] ? determine_dirtyable_memory+0x34/0x4b [<c0273dd0>] ? get_dirty_limits+0x16/0x26d [<c027750c>] ? shrink_zone+0x27a/0x327 [<c03c55a5>] ? i915_gem_shrink+0x67/0x22c [<c0277e6d>] ? do_try_to_free_pages+0x17d/0x292 [<c0278078>] ? try_to_free_pages+0x6a/0x72 [<c0275cd7>] ? isolate_pages_global+0x0/0x1bd [<c0273210>] ? __alloc_pages_nodemask+0x2c2/0x447 [<c027f1c1>] ? handle_mm_fault+0x188/0x605 [<c02192c3>] ? do_page_fault+0x253/0x269 [<c0219070>] ? do_page_fault+0x0/0x269 [<c05b9e82>] ? error_code+0x66/0x6c [<c05b0000>] ? azx_probe+0x5e8/0x8ae [<c0219070>] ? 
do_page_fault+0x0/0x269 Code: f9 f2 74 18 ff 75 08 8d 45 f0 50 89 d8 e8 62 f6 ff ff 01 c7 59 83 7d f0 00 58 74 20 8b 55 d0 8b 42 10 83 e8 10 89 45 d0 8b 55 d0 <8b> 42 10 0f 18 00 90 89 d0 83 c0 10 39 45 c8 75 ab fe 06 e9 90 EIP: [<c02850cf>] page_referenced+0xd6/0x199 SS:ESP 0068:d8f77ca0 CR2: 0000000000000000 ---[ end trace 890710798f4c0070 ]--- ^ permalink raw reply [flat|nested] 231+ messages in thread
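[ Editorial sketch, not part of the thread. ] The register values in the 64-bit oopses are easier to read once the list_entry() arithmetic is undone. The Code: bytes around the faulting instruction contain "4d 8b 6d 20 49 83 ed 20", which decodes to loading a pointer from 0x20(%r13) and then subtracting 0x20, so R13 holds "next list pointer minus 0x20". The following minimal C sketch assumes that 0x20 offset (the macro and function names are made up for illustration) and shows how a NULL next pointer turns into ffffffffffffffe0, and how R13 = 0032323200323212 corresponds to the 0032323200323232 pattern mentioned in the reply above.

```c
#include <stdint.h>

/*
 * Assumption for illustration: the list node sits 0x20 bytes into its
 * containing structure, as suggested by the "sub $0x20,%r13" in the
 * oops Code: bytes. NODE_OFFSET is a made-up name.
 */
#define NODE_OFFSET UINT64_C(0x20)

/*
 * What list_entry()/container_of() boils down to: the address of the
 * embedded node minus the node's offset inside the container.
 */
static inline uint64_t entry_from_node(uint64_t node_addr)
{
	return node_addr - NODE_OFFSET;	/* unsigned wraparound is well defined */
}

/* Undo the subtraction to recover what the list pointer really held. */
static inline uint64_t node_from_entry(uint64_t entry_addr)
{
	return entry_addr + NODE_OFFSET;
}
```

With these helpers, entry_from_node(0) gives 0xffffffffffffffe0 (the NULL-next case), and node_from_entry(0x0032323200323212) gives 0x0032323200323232 (the overwritten-pointer case).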
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 17:05 ` Linus Torvalds @ 2010-04-10 18:21 ` Linus Torvalds 2010-04-10 18:26 ` Linus Torvalds ` (3 more replies) 0 siblings, 4 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-10 18:21 UTC (permalink / raw) To: Borislav Petkov Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sat, 10 Apr 2010, Linus Torvalds wrote: > On Sat, 10 Apr 2010, Borislav Petkov wrote: > > > > And I got an oops again, this time the #GP from couple of days ago. > > Oh damn. So the list corruption really does happen still. Ho humm. Maybe I'm crazy, but something started bothering me. And I started wondering: when is the 'page->mapping' of an anonymous page actually cleared? The thing is, the mapping of an anonymous page is actually cleared only when the page is _freed_, in "free_hot_cold_page()". Now, let's think about that. And in particular, let's think about how that relates to the freeing of the 'anon_vma' that the page->mapping points to. The way the anon_vma is freed is when the mapping is torn down, and we do roughly: tlb = tlb_gather_mmu(mm,..) .. unmap_vmas(&tlb, vma .. .. free_pgtables() .. tlb_finish_mmu(tlb, start, end); and we actually unmap all the pages in "unmap_vmas()", and then _after_ unmapping all the pages we do the "unlink_anon_vmas(vma);" in "free_pgtables()". Fine so far - the anon_vma stay around until after the page has been happily unmapped. But "unmapped all the pages" is _not_ actually the same as "free'd all the pages". The actual _freeing_ of the page happens generally in tlb_finish_mmu(), because we can free the page only after we've flushed any TLB entries. So what we have in that tlb_gather structure is a list of _pending_ pages to be freed, while we already actually free'd the anon_vmas earlier! 
Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because we use a per-cpu variable), but as far as I can tell it is _not_ an RCU-safe region.

So I think we might actually get a real RCU freeing event while this all happens. So now the 'anon_vma' that 'page->mapping' points to has not just been released back to the SLUB caches, the page itself might have been released too.

I dunno. Does the above sound at all sane? Or am I just raving?

Something hacky like the above might fix it if I'm not just raving. I really might be missing something here.

Linus

---
 include/asm-generic/tlb.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e43f976..2678118 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -14,6 +14,7 @@
 #define _ASM_GENERIC__TLB_H
 
 #include <linux/swap.h>
+#include <linux/rcupdate.h>
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 
@@ -62,6 +63,7 @@ tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
 
 	tlb->fullmm = full_mm_flush;
 
+	rcu_read_lock();
 	return tlb;
 }
 
@@ -90,6 +92,7 @@ tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
+	rcu_read_unlock();
 	put_cpu_var(mmu_gathers);
 }

^ permalink raw reply related [flat|nested] 231+ messages in thread
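[ Editorial sketch, not part of the thread. ] The teardown ordering described in the message above can be modelled in userspace C. This is only a toy under stated assumptions: every toy_* name is hypothetical, a boolean flag stands in for the anon_vma really being freed, and nothing here is kernel code. It shows the window the message worries about: after the "free_pgtables" stage and before the "tlb_finish_mmu" stage, pages still queued in the gather structure carry a mapping pointer to an anon_vma that is already gone.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4

/* Hypothetical stand-ins for the kernel structures, for illustration only. */
struct toy_anon_vma { bool freed; };
struct toy_page { struct toy_anon_vma *mapping; bool mapped; };
struct toy_gather { struct toy_page *pending[NPAGES]; size_t nr; };

/* Stage 1: unmap the pages and queue them; nothing is freed yet. */
static inline void toy_unmap_vmas(struct toy_gather *tlb,
				  struct toy_page *pages, size_t n)
{
	for (size_t i = 0; i < n && tlb->nr < NPAGES; i++) {
		pages[i].mapped = false;
		tlb->pending[tlb->nr++] = &pages[i];
	}
}

/* Stage 2: the anon_vma is unlinked and released (modelled as a flag). */
static inline void toy_free_pgtables(struct toy_anon_vma *av)
{
	av->freed = true;
}

/* Stage 3: the queued pages are finally freed, clearing ->mapping. */
static inline void toy_finish_mmu(struct toy_gather *tlb)
{
	for (size_t i = 0; i < tlb->nr; i++)
		tlb->pending[i]->mapping = NULL;
	tlb->nr = 0;
}

/* What a reclaim-side walker could trip over: mapping -> freed anon_vma. */
static inline size_t toy_count_dangling(const struct toy_page *pages, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (pages[i].mapping && pages[i].mapping->freed)
			count++;
	return count;
}

/* Run the three stages and record the dangling count after each one. */
static inline void toy_teardown(size_t dangling[3])
{
	struct toy_anon_vma av = { .freed = false };
	struct toy_page pages[NPAGES];
	struct toy_gather tlb = { .nr = 0 };

	for (size_t i = 0; i < NPAGES; i++) {
		pages[i].mapping = &av;
		pages[i].mapped = true;
	}

	toy_unmap_vmas(&tlb, pages, NPAGES);
	dangling[0] = toy_count_dangling(pages, NPAGES);

	toy_free_pgtables(&av);
	dangling[1] = toy_count_dangling(pages, NPAGES);

	toy_finish_mmu(&tlb);
	dangling[2] = toy_count_dangling(pages, NPAGES);
}
```

In this model the dangling count is non-zero only between the second and third stage, which is the window the rcu_read_lock()/rcu_read_unlock() test patch tries to cover.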
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 18:21 ` Linus Torvalds @ 2010-04-10 18:26 ` Linus Torvalds 2010-04-10 18:51 ` Borislav Petkov ` (2 subsequent siblings) 3 siblings, 0 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-10 18:26 UTC (permalink / raw) To: Borislav Petkov Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sat, 10 Apr 2010, Linus Torvalds wrote: > > I dunno. Does the above sound at all sane? Or am I just raving? > > Something hacky like the above might fix it if I'm not just raving. I > really might be missing something here. Btw, if this turns out to be accurate, the real fix is probably to just have a separate phase at the very end to actually release all the vma's, rather than do it in "free_pgtables()". We don't want to make the tlb-gather any more atomic than it already is. In fact, Nick is trying to make it preemptible. So the patch included in that mail was meant very much as a "let's test my crazy theory" patch, rather than as the real solution. The patch is also untested. Maybe it doesn't work at all and introduces new bugs. Caveat emptor. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 18:21 ` Linus Torvalds 2010-04-10 18:26 ` Linus Torvalds @ 2010-04-10 18:51 ` Borislav Petkov 2010-04-10 18:58 ` Borislav Petkov 2010-04-10 19:36 ` Rik van Riel 2010-04-12 14:40 ` Peter Zijlstra 3 siblings, 1 reply; 231+ messages in thread From: Borislav Petkov @ 2010-04-10 18:51 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Linus Torvalds <torvalds@linux-foundation.org> Date: Sat, Apr 10, 2010 at 11:21:39AM -0700 > On Sat, 10 Apr 2010, Linus Torvalds wrote: > > On Sat, 10 Apr 2010, Borislav Petkov wrote: > > > > > > And I got an oops again, this time the #GP from couple of days ago. > > > > Oh damn. So the list corruption really does happen still. > > Ho humm. > > Maybe I'm crazy, but something started bothering me. And I started > wondering: when is the 'page->mapping' of an anonymous page actually > cleared? > > The thing is, the mapping of an anonymous page is actually cleared only > when the page is _freed_, in "free_hot_cold_page()". > > Now, let's think about that. And in particular, let's think about how that > relates to the freeing of the 'anon_vma' that the page->mapping points to. > > The way the anon_vma is freed is when the mapping is torn down, and we do > roughly: > > tlb = tlb_gather_mmu(mm,..) > .. > unmap_vmas(&tlb, vma .. > .. > free_pgtables() > .. > tlb_finish_mmu(tlb, start, end); > > and we actually unmap all the pages in "unmap_vmas()", and then _after_ > unmapping all the pages we do the "unlink_anon_vmas(vma);" in > "free_pgtables()". Fine so far - the anon_vma stay around until after the > page has been happily unmapped. > > But "unmapped all the pages" is _not_ actually the same as "free'd all the > pages". 
The actual _freeing_ of the page happens generally in > tlb_finish_mmu(), because we can free the page only after we've flushed > any TLB entries. > > So what we have in that tlb_gather structure is a list of _pending_ pages > to be freed, while we already actually free'd the anon_vmas earlier! > > Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because > we use a per-cpu variable), but as far as I can tell it is _not_ an > RCU-safe region. > > So I think we might actually get a real RCU freeing event while this all > happens. So now the 'anon_vma' that 'page->mapping' points to has not just > been released back to the SLUB caches, the page itself might have been > released too. So, if I understand you correctly, the list_head anon_vma gets freed _before_ the page descriptor itself, therefore we still get a valid page->mapping in page_lock_anon_vma(). Maybe that explains the funny patterns in %r13. But how do they come to exist when the anon_vma is freed, shouldn't there be LIST_POISON or something recognizable? Anyways, testing... -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 18:51 ` Borislav Petkov @ 2010-04-10 18:58 ` Borislav Petkov 2010-04-10 20:05 ` Linus Torvalds 0 siblings, 1 reply; 231+ messages in thread From: Borislav Petkov @ 2010-04-10 18:58 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Borislav Petkov <bp@alien8.de> Date: Sat, Apr 10, 2010 at 08:51:45PM +0200 > Anyways, testing... Nope, still b0rked. And this time is not a funny pattern but ffffffffffffffe0 we had originally. [ 521.306972] BUG: unable to handle kernel NULL pointer dereference at (null) [ 521.307126] IP: [<ffffffff810c60b4>] page_referenced+0xee/0x1dc [ 521.307126] PGD 22d952067 PUD 2291db067 PMD 0 [ 521.307126] Oops: 0000 [#1] PREEMPT SMP [ 521.307126] last sysfs file: /sys/power/state [ 521.307126] CPU 1 [ 521.307126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 pcspkr serial_core ohci_hcd edac_core k10temp [ 521.307126] [ 521.307126] Pid: 2896, comm: hib.sh Not tainted 2.6.34-rc3-00501-gefb57c0-dirty #5 M3A78 PRO/System Product Name [ 521.307126] RIP: 0010:[<ffffffff810c60b4>] [<ffffffff810c60b4>] page_referenced+0xee/0x1dc [ 521.307126] RSP: 0018:ffff88022bd9f8b8 EFLAGS: 00010283 [ 521.307126] RAX: ffff88022af8c338 RBX: ffffea00067e2998 RCX: 0000000000000000 [ 521.307126] RDX: ffff88022bd9fcf8 RSI: ffff88022af8c310 RDI: ffff88022c0c5e60 [ 521.307126] RBP: ffff88022bd9f938 R08: 0000000000000002 R09: 0000000000000000 [ 521.307126] R10: ffff88022b4454d8 R11: ffffffff00000012 R12: 0000000000000000 [ 521.307126] R13: ffffffffffffffe0 R14: ffff88022af8c2f8 R15: ffff88022bd9fa00 [ 521.307126] FS: 00007ff70fb586f0(0000) GS:ffff88000a200000(0000) 
knlGS:0000000000000000 [ 521.307126] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 521.307126] CR2: 0000000000000000 CR3: 000000022e19c000 CR4: 00000000000006e0 [ 521.307126] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 521.307126] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 521.307126] Process hib.sh (pid: 2896, threadinfo ffff88022bd9e000, task ffff88022c0c5e60) [ 521.307126] Stack: [ 521.307126] ffff88022af8c338 00000000810c5dd3 ffff88022bd9f918 ffffffff810c5f3c [ 521.307126] <0> ffff880200000000 ffffffff00000001 ffff88022bd9ffd8 ffffea00067d2cf0 [ 521.307126] <0> ffffea00067d2cf0 000000022bd9fcf8 ffffea00067d2cf0 ffffea00067e29c0 [ 521.307126] Call Trace: [ 521.307126] [<ffffffff810c5f3c>] ? try_to_unmap_anon+0xa2/0xb4 [ 521.307126] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7 [ 521.307126] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1 [ 521.307126] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc [ 521.307126] [<ffffffff8140fa66>] ? _raw_spin_unlock_irq+0x30/0x58 [ 521.307126] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c [ 521.307126] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6 [ 521.307126] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a [ 521.307126] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4 [ 521.307126] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc [ 521.307126] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79 [ 521.307126] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb [ 521.307126] [<ffffffff8140bde4>] ? printk+0x41/0x45 [ 521.307126] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1 [ 521.307126] [<ffffffff81078a08>] hibernate+0xce/0x172 [ 521.307126] [<ffffffff81077775>] state_store+0x5c/0xd3 [ 521.307126] [<ffffffff8118f5eb>] kobj_attr_store+0x17/0x19 [ 521.307126] [<ffffffff8112e4a4>] sysfs_write_file+0x108/0x144 [ 521.307126] [<ffffffff810db6b3>] vfs_write+0xb2/0x153 [ 521.307126] [<ffffffff810663c9>] ? 
trace_hardirqs_on_caller+0x1f/0x14b [ 521.307126] [<ffffffff810db817>] sys_write+0x4a/0x71 [ 521.307126] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b [ 521.307126] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 11 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 [ 521.307126] RIP [<ffffffff810c60b4>] page_referenced+0xee/0x1dc [ 521.307126] RSP <ffff88022bd9f8b8> [ 521.307126] CR2: 0000000000000000 [ 521.320888] ---[ end trace 023d26183296e92e ]--- [ 521.321033] note: hib.sh[2896] exited with preempt_count 2 [ 521.321206] BUG: scheduling while atomic: hib.sh/2896/0x10000003 [ 521.321355] INFO: lockdep is turned off. [ 521.321500] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 pcspkr serial_core ohci_hcd edac_core k10temp [ 521.322884] Pid: 2896, comm: hib.sh Tainted: G D 2.6.34-rc3-00501-gefb57c0-dirty #5 [ 521.323139] Call Trace: [ 521.323288] [<ffffffff810658df>] ? __debug_show_held_locks+0x1b/0x24 [ 521.323440] [<ffffffff8102dfac>] __schedule_bug+0x72/0x77 [ 521.323587] [<ffffffff8140c1f8>] schedule+0xe3/0x7ff [ 521.323735] [<ffffffff81030ecb>] __cond_resched+0x18/0x24 [ 521.323882] [<ffffffff8140c9e1>] _cond_resched+0x2c/0x37 [ 521.324029] [<ffffffff810bcef1>] unmap_vmas+0x719/0x911 [ 521.324207] [<ffffffff810c1781>] exit_mmap+0x102/0x1e4 [ 521.324356] [<ffffffff810c16e8>] ? exit_mmap+0x69/0x1e4 [ 521.324503] [<ffffffff810368bc>] mmput+0x48/0xb9 [ 521.324651] [<ffffffff8103ad90>] exit_mm+0x110/0x11d [ 521.324798] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5 [ 521.324945] [<ffffffff81039e2f>] ? kmsg_dump+0x13b/0x155 [ 521.325093] [<ffffffff8100616b>] ? 
oops_end+0x47/0x93 [ 521.325244] [<ffffffff810061b2>] oops_end+0x8e/0x93 [ 521.325396] [<ffffffff8101f3e5>] no_context+0x1fc/0x20b [ 521.325544] [<ffffffff8101f580>] __bad_area_nosemaphore+0x18c/0x1af [ 521.325691] [<ffffffff8101f7bb>] ? do_page_fault+0xa8/0x32d [ 521.325839] [<ffffffff8101f5b6>] bad_area_nosemaphore+0x13/0x15 [ 521.325987] [<ffffffff8101f886>] do_page_fault+0x173/0x32d [ 521.326138] [<ffffffff81082b84>] ? __call_rcu+0x11d/0x130 [ 521.326289] [<ffffffff814103e3>] ? error_sti+0x5/0x6 [ 521.326437] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9 [ 521.326586] [<ffffffff8140ed0b>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 521.326737] [<ffffffff814101ff>] page_fault+0x1f/0x30 [ 521.326885] [<ffffffff810c60b4>] ? page_referenced+0xee/0x1dc [ 521.327034] [<ffffffff810c6046>] ? page_referenced+0x80/0x1dc [ 521.327185] [<ffffffff810c5f3c>] ? try_to_unmap_anon+0xa2/0xb4 [ 521.327336] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7 [ 521.327483] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1 [ 521.327632] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc [ 521.327780] [<ffffffff8140fa66>] ? _raw_spin_unlock_irq+0x30/0x58 [ 521.327928] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c [ 521.328079] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6 [ 521.328232] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a [ 521.328387] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4 [ 521.328535] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc [ 521.328683] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79 [ 521.328831] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb [ 521.328979] [<ffffffff8140bde4>] ? 
printk+0x41/0x45 [ 521.329130] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1 [ 521.329283] [<ffffffff81078a08>] hibernate+0xce/0x172 [ 521.329432] [<ffffffff81077775>] state_store+0x5c/0xd3 [ 521.329580] [<ffffffff8118f5eb>] kobj_attr_store+0x17/0x19 [ 521.329727] [<ffffffff8112e4a4>] sysfs_write_file+0x108/0x144 [ 521.329875] [<ffffffff810db6b3>] vfs_write+0xb2/0x153 [ 521.330022] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 521.330174] [<ffffffff810db817>] sys_write+0x4a/0x71 [ 521.330326] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 18:58 ` Borislav Petkov @ 2010-04-10 20:05 ` Linus Torvalds 2010-04-10 20:12 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-10 20:05 UTC (permalink / raw) To: Borislav Petkov Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sat, 10 Apr 2010, Borislav Petkov wrote: > From: Borislav Petkov <bp@alien8.de> > Date: Sat, Apr 10, 2010 at 08:51:45PM +0200 > > > Anyways, testing... > > Nope, still b0rked. And this time is not a funny pattern but > ffffffffffffffe0 we had originally. Ok, I think that just depends on who happens to re-use the allocation and how it does it. I'm pretty sure it's a use-after-free issue, where we have free'd an anon_vma too early, even though it has pages associated with it. If it wasn't the RCU case, it's just something else. I think it's worth looking at "vma_adjust()", because as I already mentioned to Rik earlier - the code is very hard to understand, and it's accrued crud over many many years. And vma_adjust is the one place that does that anon_vma_merge(), which is apart from the actual unmapping sequence the only other place that actually free's anon_vmas. So there are reasons to be very suspicious of that code. And I think that code can actually lose an anon_vma chain. It's totally screwing up the "import anonvma" case: when it does if (anon_vma_clone(importer, vma)) { return -ENOMEM; } importer->anon_vma = anon_vma; we can actually have "importer == vma", but "anon_vma = next->anon_vma". In which case we actually end up with an _empty_ chain (because importer didn't have a chain to begin with!) but "importer->anon_vma" points to an anon_vma. 
And then when we do that "remove_next", we actually get rid of the only chain we ever had, and have lost all our references to the anon_vma.

That looks _horribly_ buggy.

Also, the conditional nesting makes no sense (the whole anon_vma_clone() only makes sense if importer is set, and it is only ever set _inside_ the earlier if-statement, so the whole code should be moved inside there), nor do some of the comments.

This patch is scary and untested, but the more I look at that code, the more convinced I am that vma_adjust was _really_ badly screwed up. The patch below may make things worse. I'll test it myself too, but I'm sending it out first, since I was writing the email as I was looking at the piece of cr*p.

Linus

---
 mm/mmap.c |   24 ++++++++----------------
 1 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index acb023e..f90ea92 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -507,11 +507,12 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	struct address_space *mapping = NULL;
 	struct prio_tree_root *root = NULL;
 	struct file *file = vma->vm_file;
-	struct anon_vma *anon_vma = NULL;
 	long adjust_next = 0;
 	int remove_next = 0;
 
 	if (next && !insert) {
+		struct vm_area_struct *exporter = NULL;
+
 		if (end >= next->vm_end) {
 			/*
 			 * vma expands, overlapping all the next, and
@@ -519,7 +520,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 */
 again:			remove_next = 1 + (end > next->vm_end);
 			end = next->vm_end;
-			anon_vma = next->anon_vma;
+			exporter = next;
 			importer = vma;
 		} else if (end > next->vm_start) {
 			/*
@@ -527,7 +528,7 @@ again: remove_next = 1 + (end > next->vm_end);
 			 * mprotect case 5 shifting the boundary up.
 			 */
 			adjust_next = (end - next->vm_start) >> PAGE_SHIFT;
-			anon_vma = next->anon_vma;
+			exporter = next;
 			importer = vma;
 		} else if (end < vma->vm_end) {
 			/*
@@ -536,28 +537,19 @@ again: remove_next = 1 + (end > next->vm_end);
 			 * mprotect case 4 shifting the boundary down.
 			 */
 			adjust_next = - ((vma->vm_end - end) >> PAGE_SHIFT);
-			anon_vma = next->anon_vma;
+			exporter = vma;
 			importer = next;
 		}
-	}
 
-	/*
-	 * When changing only vma->vm_end, we don't really need anon_vma lock.
-	 */
-	if (vma->anon_vma && (insert || importer || start != vma->vm_start))
-		anon_vma = vma->anon_vma;
-	if (anon_vma) {
 		/*
 		 * Easily overlooked: when mprotect shifts the boundary,
 		 * make sure the expanding vma has anon_vma set if the
 		 * shrinking vma had, to cover any anon pages imported.
 		 */
-		if (importer && !importer->anon_vma) {
-			/* Block reverse map lookups until things are set up. */
-			if (anon_vma_clone(importer, vma)) {
+		if (exporter && exporter->anon_vma && !importer->anon_vma) {
+			if (anon_vma_clone(importer, exporter))
 				return -ENOMEM;
-			}
-			importer->anon_vma = anon_vma;
+			importer->anon_vma = exporter->anon_vma;
 		}
 	}
 

^ permalink raw reply related [flat|nested] 231+ messages in thread
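[ Editorial sketch, not part of the thread. ] The "importer == vma" failure described in the message above can be illustrated with a small userspace model. Everything here is an assumption made for illustration: the toy_* types and functions are hypothetical, an anon_vma is reduced to a counter of chain entries pointing at it, and "links == 0" stands for "the anon_vma would be considered unused and freed". The sketch contrasts cloning from the wrong source (the old call, which clones vma's own empty chain into itself) with cloning from the exporter (what the patch does).

```c
#include <stddef.h>

#define TOY_MAX_CHAIN 4

/* Hypothetical: an anon_vma reduced to "how many chain entries point at me". */
struct toy_anon_vma { int links; };

/* Hypothetical: a VMA with its anon_vma pointer and its chain of links. */
struct toy_vma {
	struct toy_anon_vma *anon_vma;
	struct toy_anon_vma *chain[TOY_MAX_CHAIN];
	size_t nr_chain;
};

/* Model of cloning: dst gains a link to every anon_vma in src's chain. */
static inline void toy_clone(struct toy_vma *dst, struct toy_vma *src)
{
	size_t n = src->nr_chain;	/* snapshot, since dst may alias src */

	for (size_t i = 0; i < n && dst->nr_chain < TOY_MAX_CHAIN; i++) {
		struct toy_anon_vma *av = src->chain[i];

		dst->chain[dst->nr_chain++] = av;
		av->links++;
	}
}

/* Model of tearing down a VMA's chain when the VMA is removed. */
static inline void toy_unlink(struct toy_vma *vma)
{
	for (size_t i = 0; i < vma->nr_chain; i++)
		vma->chain[i]->links--;
	vma->nr_chain = 0;
}

/*
 * vma (no anon_vma, empty chain) expands over next (has an anon_vma).
 * Returns how many links next's anon_vma has left once next is removed.
 * fixed == 0: clone from vma itself (old behaviour in this model).
 * fixed != 0: clone from the exporter, i.e. next.
 */
static inline int toy_merge(int fixed)
{
	struct toy_anon_vma av = { .links = 0 };
	struct toy_vma vma = { 0 };
	struct toy_vma next = { 0 };
	struct toy_vma *importer = &vma;
	struct toy_vma *exporter = &next;

	next.anon_vma = &av;
	next.chain[0] = &av;
	next.nr_chain = 1;
	av.links = 1;

	toy_clone(importer, fixed ? exporter : &vma);
	importer->anon_vma = exporter->anon_vma;

	toy_unlink(&next);	/* the "remove_next" step */
	return av.links;
}
```

In this model toy_merge(0) leaves the anon_vma with zero links while vma->anon_vma still points at it, and toy_merge(1) leaves one link in place, which matches the reasoning in the message.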
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 20:05 ` Linus Torvalds @ 2010-04-10 20:12 ` Linus Torvalds 2010-04-10 20:36 ` Borislav Petkov 2010-04-10 20:24 ` Rik van Riel 2010-04-10 20:32 ` Rik van Riel 2 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-10 20:12 UTC (permalink / raw) To: Borislav Petkov Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sat, 10 Apr 2010, Linus Torvalds wrote: > > This patch is scary and untested, but the more I look at that code, the > more convinced I am that vma_adjust was _really_ badly screwed up. The > patch below may make things worse. I'll test it myself too, but I'm > sending it out first, since I was writing the email as I was looking at > the piece of cr*p. Ok, it boots. Which means it must be bug-free and perfect. And I really am convinced that the old vma_adjust() use of anon_vma_clone() was _totally_ broken, so this really could explain everything. The RCU grace period thing for the TLB flush does look like a real bug too, but it's one that is probably impossible to hit in practice. A broken vma_adjust(), however, would seem to be trivial to hit once you just get the right memory freeing patterns going, because the anon_vma would easily be _loong_ gone because we didn't create a chain to it at all, so the anon_vma code decided that it's not used any more. So I'm actually pretty optimistic that this really is it. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 20:12 ` Linus Torvalds
@ 2010-04-10 20:36 ` Borislav Petkov
  2010-04-10 20:40 ` Linus Torvalds
  0 siblings, 1 reply; 231+ messages in thread
From: Borislav Petkov @ 2010-04-10 20:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
      Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
      Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, Apr 10, 2010 at 01:12:46PM -0700

> So I'm actually pretty optimistic that this really is it.

Ok, let me verify what/in which order should be tested before I test
something wrongly. The RCU-safe fix for the TLB flush can stay for
correctness reasons, this last patch, obviously, what happens with the
find_mergeable_anon_vma() changes to use only singleton lists for
merging? Should I keep those too?

--
Regards/Gruss,
Boris.

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 20:36 ` Borislav Petkov
@ 2010-04-10 20:40 ` Linus Torvalds
  2010-04-10 21:25 ` Borislav Petkov
  0 siblings, 1 reply; 231+ messages in thread
From: Linus Torvalds @ 2010-04-10 20:40 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
      Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
      Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Sat, 10 Apr 2010, Borislav Petkov wrote:

> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Sat, Apr 10, 2010 at 01:12:46PM -0700
>
> > So I'm actually pretty optimistic that this really is it.
>
> Ok, let me verify what/in which order should be tested before I test
> something wrongly. The RCU-safe fix for the TLB flush can stay for
> correctness reasons, this last patch, obviously, what happens with the
> find_mergeable_anon_vma() changes to use only singleton lists for
> merging? Should I keep those too?

Yes. So the patches I actually think are important are:

 - the RCU fix is real, although admittedly the race window is probably
   too small to ever really hit.

 - the simplification rule to find_mergeable_anon_vma's is required,
   because otherwise our anon_vma_merge() will do the wrong thing (maybe
   Johannes' patch would be an alternative, but quite frankly, I think we
   want the simpler code, and I don't think we even _want_ to share
   anon_vma's that are complex due to forking)

   I like my "cleanup" version (the bigger one with lots of comments) more
   than the two-liner version, but they should be equivalent.

 - the vma_adjust() fix is the one that I think may actually end up fixing
   your problems for good. Knock wood.

So I think they are all required, but I suspect that the vma_adjust() one
is finally the most direct explanation of the problem you've seen.

		Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 20:40 ` Linus Torvalds
@ 2010-04-10 21:25 ` Borislav Petkov
  2010-04-10 21:30 ` Linus Torvalds
  0 siblings, 1 reply; 231+ messages in thread
From: Borislav Petkov @ 2010-04-10 21:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
      Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
      Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, Apr 10, 2010 at 01:40:39PM -0700

> Yes. So the patches I actually think are important are:
>
>  - the RCU fix is real, although admittedly the race window is probably
>    too small to ever really hit.
>
>  - the simplification rule to find_mergeable_anon_vma's is required,
>    because otherwise our anon_vma_merge() will do the wrong thing (maybe
>    Johannes' patch would be an alternative, but quite frankly, I think we
>    want the simpler code, and I don't think we even _want_ to share
>    anon_vma's that are complex due to forking)
>
>    I like my "cleanup" version (the bigger one with lots of comments) more
>    than the two-liner version, but they should be equivalent.
>
>  - the vma_adjust() fix is the one that I think may actually end up fixing
>    your problems for good. Knock wood.
>
> So I think they are all required, but I suspect that the vma_adjust() one
> is finally the most direct explanation of the problem you've seen.

Damn, nope, still no joy :(. It looked like it was fixed but one of the
tests was to hibernate right after the 3 kvm guests were shut down and I
guess the mem freeing pattern kinda hits it where it most hurts.

Anyways, I'm going to bed soon, will test whatever you come up with guys
tomorrow morning when I can think again.

By the way, do we want to create a new thread - the mailchain is off the
screen limits of my netbook :)

Thanks.

p.s. Oopsie:

[ 647.288638] PM: Syncing filesystems ... done.
[ 647.307459] Freezing user space processes ... (elapsed 0.01 seconds) done. [ 647.320981] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done. [ 647.334152] PM: Preallocating image memory... [ 647.492781] BUG: unable to handle kernel NULL pointer dereference at (null) [ 647.493001] IP: [<ffffffff810c60a0>] page_referenced+0xee/0x1dc [ 647.493001] PGD 22a1d1067 PUD 1cb6a9067 PMD 0 [ 647.493001] Oops: 0000 [#1] PREEMPT SMP [ 647.493001] last sysfs file: /sys/power/state [ 647.493001] CPU 0 [ 647.493001] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp ohci_hcd 8250 serial_core pcspkr k10temp edac_core [ 647.493001] [ 647.493001] Pid: 3231, comm: hib.sh Not tainted 2.6.34-rc3-00503-g8b3334b #6 M3A78 PRO/System Product Name [ 647.493001] RIP: 0010:[<ffffffff810c60a0>] [<ffffffff810c60a0>] page_referenced+0xee/0x1dc [ 647.493001] RSP: 0018:ffff880223b6f8b8 EFLAGS: 00010283 [ 647.493001] RAX: ffff88022aa316c8 RBX: ffffea0006882fc0 RCX: 0000000000000000 [ 647.493001] RDX: ffff880223b6fcf8 RSI: ffff88022aa316a0 RDI: ffff88022de6de60 [ 647.493001] RBP: ffff880223b6f938 R08: 0000000000000002 R09: 0000000000000000 [ 647.493001] R10: ffff880228cb03a8 R11: ffffffff00000012 R12: 0000000000000000 [ 647.493001] R13: ffffffffffffffe0 R14: ffff88022aa31688 R15: ffff880223b6fa00 [ 647.493001] FS: 00007f0eea2086f0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000 [ 647.493001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 647.493001] CR2: 0000000000000000 CR3: 0000000223df5000 CR4: 00000000000006f0 [ 647.493001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 647.493001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 647.493001] Process hib.sh (pid: 3231, threadinfo ffff880223b6e000, task ffff88022de6de60) [ 647.493001] Stack: [ 647.493001] ffff88022aa316c8 00000000810c5dbf ffff880223b6f918 
ffffffff810c5f28 [ 647.493001] <0> ffff880223b6f8f8 ffffffff00000001 ffffea0006867570 ffffea0006889070 [ 647.493001] <0> ffffea0006889070 0000000223b6fcf8 ffffea0006889070 ffffea0006882fe8 [ 647.493001] Call Trace: [ 647.493001] [<ffffffff810c5f28>] ? try_to_unmap_anon+0xa2/0xb4 [ 647.493001] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7 [ 647.493001] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c [ 647.493001] [<ffffffff810b1155>] ? shrink_zone+0x11a/0x3d6 [ 647.493001] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1 [ 647.493001] [<ffffffff8140f000>] ? _raw_spin_lock_irq+0x19/0x79 [ 647.493001] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6 [ 647.493001] [<ffffffff810b155b>] ? shrink_slab+0x14a/0x15c [ 647.493001] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a [ 647.493001] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4 [ 647.493001] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc [ 647.493001] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79 [ 647.493001] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb [ 647.493001] [<ffffffff8140bdd4>] ? printk+0x41/0x45 [ 647.493001] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1 [ 647.493001] [<ffffffff81078a08>] hibernate+0xce/0x172 [ 647.493001] [<ffffffff81077775>] state_store+0x5c/0xd3 [ 647.493001] [<ffffffff8118f5d7>] kobj_attr_store+0x17/0x19 [ 647.493001] [<ffffffff8112e490>] sysfs_write_file+0x108/0x144 [ 647.493001] [<ffffffff810db69f>] vfs_write+0xb2/0x153 [ 647.493001] [<ffffffff810663c9>] ? 
trace_hardirqs_on_caller+0x1f/0x14b [ 647.493001] [<ffffffff810db803>] sys_write+0x4a/0x71 [ 647.493001] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b [ 647.493001] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 11 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 [ 647.493001] RIP [<ffffffff810c60a0>] page_referenced+0xee/0x1dc [ 647.493001] RSP <ffff880223b6f8b8> [ 647.493001] CR2: 0000000000000000 [ 647.508991] ---[ end trace 91f57fb5ef398fd2 ]--- [ 647.509150] note: hib.sh[3231] exited with preempt_count 2 [ 647.509311] BUG: scheduling while atomic: hib.sh/3231/0x10000003 [ 647.509462] INFO: lockdep is turned off. [ 647.509610] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp ohci_hcd 8250 serial_core pcspkr k10temp edac_core [ 647.511093] Pid: 3231, comm: hib.sh Tainted: G D 2.6.34-rc3-00503-g8b3334b #6 [ 647.511353] Call Trace: [ 647.511504] [<ffffffff810658df>] ? __debug_show_held_locks+0x1b/0x24 [ 647.511658] [<ffffffff8102dfac>] __schedule_bug+0x72/0x77 [ 647.511811] [<ffffffff8140c1e8>] schedule+0xe3/0x7ff [ 647.511962] [<ffffffff810bd0e4>] ? unmap_vmas+0x90c/0x911 [ 647.512191] [<ffffffff81030ecb>] __cond_resched+0x18/0x24 [ 647.512337] [<ffffffff8140c9d1>] _cond_resched+0x2c/0x37 [ 647.512550] [<ffffffff810bcef1>] unmap_vmas+0x719/0x911 [ 647.512697] [<ffffffff810c1781>] exit_mmap+0x102/0x1e4 [ 647.512911] [<ffffffff810c16e8>] ? exit_mmap+0x69/0x1e4 [ 647.513082] [<ffffffff810368bc>] mmput+0x48/0xb9 [ 647.513233] [<ffffffff8103ad90>] exit_mm+0x110/0x11d [ 647.513387] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5 [ 647.513538] [<ffffffff81039e2f>] ? kmsg_dump+0x13b/0x155 [ 647.513690] [<ffffffff8100616b>] ? 
oops_end+0x47/0x93 [ 647.513859] [<ffffffff810061b2>] oops_end+0x8e/0x93 [ 647.514009] [<ffffffff8101f3e5>] no_context+0x1fc/0x20b [ 647.514172] [<ffffffff8118b72b>] ? cfq_insert_request+0x7a/0x3b1 [ 647.514321] [<ffffffff8101f580>] __bad_area_nosemaphore+0x18c/0x1af [ 647.514473] [<ffffffff8101f7bb>] ? do_page_fault+0xa8/0x32d [ 647.514625] [<ffffffff8101f5b6>] bad_area_nosemaphore+0x13/0x15 [ 647.514777] [<ffffffff8101f886>] do_page_fault+0x173/0x32d [ 647.514929] [<ffffffff814103a3>] ? error_sti+0x5/0x6 [ 647.515084] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9 [ 647.515242] [<ffffffff8140ecfb>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 647.515397] [<ffffffff814101bf>] page_fault+0x1f/0x30 [ 647.515549] [<ffffffff810c60a0>] ? page_referenced+0xee/0x1dc [ 647.515701] [<ffffffff810c6032>] ? page_referenced+0x80/0x1dc [ 647.515853] [<ffffffff810c5f28>] ? try_to_unmap_anon+0xa2/0xb4 [ 647.516010] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7 [ 647.516167] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c [ 647.516323] [<ffffffff810b1155>] ? shrink_zone+0x11a/0x3d6 [ 647.516474] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1 [ 647.516627] [<ffffffff8140f000>] ? _raw_spin_lock_irq+0x19/0x79 [ 647.516780] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6 [ 647.516931] [<ffffffff810b155b>] ? shrink_slab+0x14a/0x15c [ 647.517086] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a [ 647.517243] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4 [ 647.517398] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc [ 647.517551] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79 [ 647.517703] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb [ 647.517856] [<ffffffff8140bdd4>] ? 
printk+0x41/0x45 [ 647.518011] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1 [ 647.518168] [<ffffffff81078a08>] hibernate+0xce/0x172 [ 647.518322] [<ffffffff81077775>] state_store+0x5c/0xd3 [ 647.518473] [<ffffffff8118f5d7>] kobj_attr_store+0x17/0x19 [ 647.518625] [<ffffffff8112e490>] sysfs_write_file+0x108/0x144 [ 647.518777] [<ffffffff810db69f>] vfs_write+0xb2/0x153 [ 647.518928] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 647.519084] [<ffffffff810db803>] sys_write+0x4a/0x71 [ 647.519240] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b [ 699.648857] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) [ 700.234923] SysRq : Emergency Sync [ 700.235341] Emergency Sync complete [ 700.982072] SysRq : Emergency Remount R/O [ 701.600802] SysRq : Resetting -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 21:25 ` Borislav Petkov
@ 2010-04-10 21:30 ` Linus Torvalds
  2010-04-10 21:51 ` Borislav Petkov
  0 siblings, 1 reply; 231+ messages in thread
From: Linus Torvalds @ 2010-04-10 21:30 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
      Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
      Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Sat, 10 Apr 2010, Borislav Petkov wrote:
>
> Damn, nope, still no joy :(. It looked like it was fixed but one of the
> tests was to hibernate right after the 3 kvm guests were shut down and I
> guess the mem freeing pattern kinda hits it where it most hurts.

Damn, I really hoped that was it. Three independent bugs found and fixed,
and still no joy? Oh well.

> By the way, do we want to create a new thread - the mailchain is off the
> screen limits of my netbook :)

I prefer to keep it in one thread so that they all show up together if I
need to, but feel free to start a new one. Not a biggie.

> [ 647.492781] BUG: unable to handle kernel NULL pointer dereference at (null)
> [ 647.493001] IP: [<ffffffff810c60a0>] page_referenced+0xee/0x1dc

Well, it sure is consistent. I'll start to think about what else could go
wrong..

		Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 21:30 ` Linus Torvalds
@ 2010-04-10 21:51 ` Borislav Petkov
  2010-04-11 13:08 ` Borislav Petkov
  0 siblings, 1 reply; 231+ messages in thread
From: Borislav Petkov @ 2010-04-10 21:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
      Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
      Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, Apr 10, 2010 at 02:30:49PM -0700

> On Sat, 10 Apr 2010, Borislav Petkov wrote:
> >
> > Damn, nope, still no joy :(. It looked like it was fixed but one of the
> > tests was to hibernate right after the 3 kvm guests were shut down and I
> > guess the mem freeing pattern kinda hits it where it most hurts.
>
> Damn, I really hoped that was it. Three independent bugs found and fixed,
> and still no joy? Oh well.

Yep, I'll redo the testing tomorrow, so that we are sure that even with
the _three_ bugs fixed we still hit the funky list element issue.

> > By the way, do we want to create a new thread - the mailchain is off the
> > screen limits of my netbook :)
>
> I prefer to keep it in one thread so that they all show up together if I
> need to, but feel free to start a new one. Not a biggie.

I'll keep the thread then - I didn't know it mattered. Mine was just a
suggestion, nevermind.

> > [ 647.492781] BUG: unable to handle kernel NULL pointer dereference at (null)
> > [ 647.493001] IP: [<ffffffff810c60a0>] page_referenced+0xee/0x1dc
>
> Well, it sure is consistent. I'll start to think about what else could go
> wrong..

Which could mean that even with those issues fixed, the real issue is
yet something else. Because obviously the fixes you throw at it don't
seem to change it - even the traces remain consistent across tests.
And if it is a use-after-free case, the funny patterns could be some
shifted SLUB poison values which we happen to "see" through the dangling
pointer... I dunno.

Hmm.

--
Regards/Gruss,
Boris.

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-10 21:51 ` Borislav Petkov
@ 2010-04-11 13:08 ` Borislav Petkov
  2010-04-11 13:19 ` [PATCH 1/3] mm: make page freeing path RCU-safe Borislav Petkov
  ` (4 more replies)
  0 siblings, 5 replies; 231+ messages in thread
From: Borislav Petkov @ 2010-04-11 13:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
      Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
      Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Borislav Petkov <bp@alien8.de>
Date: Sat, Apr 10, 2010 at 11:51:15PM +0200

> > Damn, I really hoped that was it. Three independent bugs found and fixed,
> > and still no joy? Oh well.
>
> Yep, I'll redo the testing tomorrow, so that we are sure that even with
> the _three_ bugs fixed we still hit the funky list element issue.

Ok, I could verify that the three patches we were talking about still
can't fix the issue. However, just to make sure, I'm sending the versions
of the patches I used for you guys to check.

[ 529.667108] PM: Preallocating image memory...
[ 529.930881] BUG: unable to handle kernel NULL pointer dereference at (null) [ 529.931275] IP: [<ffffffff810c603c>] page_referenced+0xee/0x1dc [ 529.931377] PGD 22e33d067 PUD 22ddc1067 PMD 0 [ 529.931377] Oops: 0000 [#1] PREEMPT SMP [ 529.931377] last sysfs file: /sys/power/state [ 529.931377] CPU 3 [ 529.931377] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd edac_core serial_core pcspkr k10temp [ 529.931377] [ 529.931377] Pid: 3354, comm: hib.sh Tainted: G W 2.6.34-rc3-00503-g0fcc334 #1 M3A78 PRO/System Product Name [ 529.931377] RIP: 0010:[<ffffffff810c603c>] [<ffffffff810c603c>] page_referenced+0xee/0x1dc [ 529.931377] RSP: 0018:ffff880105a118b8 EFLAGS: 00010283 [ 529.931377] RAX: ffff88022dc896c8 RBX: ffffea0007a15e10 RCX: 0000000000000000 [ 529.931377] RDX: ffff880105a11cf8 RSI: ffff88022dc896a0 RDI: ffff88022b760000 [ 529.931377] RBP: ffff880105a11938 R08: 0000000000000002 R09: 0000000000000000 [ 529.931377] R10: 0000000000000000 R11: ffffffff00000012 R12: 0000000000000000 [ 529.931377] R13: ffffffffffffffe0 R14: ffff88022dc89688 R15: ffff880105a11a00 [ 529.931377] FS: 00007f21045876f0(0000) GS:ffff88000a600000(0000) knlGS:0000000000000000 [ 529.931377] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 529.931377] CR2: 0000000000000000 CR3: 000000022b33f000 CR4: 00000000000006e0 [ 529.931377] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 529.931377] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 529.931377] Process hib.sh (pid: 3354, threadinfo ffff880105a10000, task ffff88022b760000) [ 529.931377] Stack: [ 529.931377] ffff88022dc896c8 00000000810b0082 0000000000000000 0000000000000000 [ 529.931377] <0> 0000000000000000 0000000000000000 0000000000000000 0000000000000020 [ 529.931377] <0> 0000000000000000 0000000200000000 7fffffffffffffff ffffea0007a15e38 [ 529.931377] 
Call Trace: [ 529.931377] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7 [ 529.931377] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1 [ 529.931377] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc [ 529.931377] [<ffffffff8140f9f6>] ? _raw_spin_unlock_irq+0x30/0x58 [ 529.931377] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c [ 529.931377] [<ffffffff810b0556>] ? shrink_active_list+0x232/0x244 [ 529.931377] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6 [ 529.931377] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a [ 529.931377] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4 [ 529.931377] [<ffffffff81078e1e>] ? memory_bm_test_bit+0x1/0x30 [ 529.931377] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc [ 529.931377] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79 [ 529.931377] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb [ 529.931377] [<ffffffff8140bd74>] ? printk+0x41/0x45 [ 529.931377] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1 [ 529.931377] [<ffffffff81078a08>] hibernate+0xce/0x172 [ 529.931377] [<ffffffff81077775>] state_store+0x5c/0xd3 [ 529.931377] [<ffffffff8118f573>] kobj_attr_store+0x17/0x19 [ 529.931377] [<ffffffff8112e42c>] sysfs_write_file+0x108/0x144 [ 529.931377] [<ffffffff810db63b>] vfs_write+0xb2/0x153 [ 529.931377] [<ffffffff810663c9>] ? 
trace_hardirqs_on_caller+0x1f/0x14b [ 529.931377] [<ffffffff810db79f>] sys_write+0x4a/0x71 [ 529.931377] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b [ 529.931377] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 11 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 [ 529.931377] RIP [<ffffffff810c603c>] page_referenced+0xee/0x1dc [ 529.931377] RSP <ffff880105a118b8> [ 529.931377] CR2: 0000000000000000 [ 529.945250] ---[ end trace caa5471c993e6461 ]--- [ 529.945558] note: hib.sh[3354] exited with preempt_count 2 [ 529.945710] BUG: scheduling while atomic: hib.sh/3354/0x10000003 [ 529.945858] INFO: lockdep is turned off. [ 529.946005] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 ohci_hcd edac_core serial_core pcspkr k10temp [ 529.947595] Pid: 3354, comm: hib.sh Tainted: G D W 2.6.34-rc3-00503-g0fcc334 #1 [ 529.947848] Call Trace: [ 529.947993] [<ffffffff810658df>] ? __debug_show_held_locks+0x1b/0x24 [ 529.948147] [<ffffffff8102dfac>] __schedule_bug+0x72/0x77 [ 529.948296] [<ffffffff8140c188>] schedule+0xe3/0x7ff [ 529.948449] [<ffffffff810bd0e4>] ? unmap_vmas+0x90c/0x911 [ 529.948599] [<ffffffff81030ecb>] __cond_resched+0x18/0x24 [ 529.948748] [<ffffffff8140c971>] _cond_resched+0x2c/0x37 [ 529.948896] [<ffffffff810bcef1>] unmap_vmas+0x719/0x911 [ 529.949049] [<ffffffff8140f01e>] ? _raw_spin_lock_irqsave+0x1e/0x85 [ 529.949199] [<ffffffff8105a878>] ? up+0x14/0x3e [ 529.949347] [<ffffffff810c171f>] exit_mmap+0x102/0x1e4 [ 529.949639] [<ffffffff810c1686>] ? exit_mmap+0x69/0x1e4 [ 529.949787] [<ffffffff810368bc>] mmput+0x48/0xb9 [ 529.949935] [<ffffffff8103ad90>] exit_mm+0x110/0x11d [ 529.950087] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5 [ 529.950236] [<ffffffff81039e2f>] ? kmsg_dump+0x13b/0x155 [ 529.950525] [<ffffffff8100616b>] ? 
oops_end+0x47/0x93 [ 529.950671] [<ffffffff810061b2>] oops_end+0x8e/0x93 [ 529.950819] [<ffffffff8101f3e5>] no_context+0x1fc/0x20b [ 529.950967] [<ffffffff8101f580>] __bad_area_nosemaphore+0x18c/0x1af [ 529.951120] [<ffffffff8101f7bb>] ? do_page_fault+0xa8/0x32d [ 529.951276] [<ffffffff8101f5b6>] bad_area_nosemaphore+0x13/0x15 [ 529.951572] [<ffffffff8101f886>] do_page_fault+0x173/0x32d [ 529.951719] [<ffffffff81410363>] ? error_sti+0x5/0x6 [ 529.951867] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9 [ 529.952018] [<ffffffff8140ec9b>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 529.952170] [<ffffffff8141017f>] page_fault+0x1f/0x30 [ 529.952319] [<ffffffff810c603c>] ? page_referenced+0xee/0x1dc [ 529.952615] [<ffffffff810c5fce>] ? page_referenced+0x80/0x1dc [ 529.952762] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7 [ 529.952911] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1 [ 529.953065] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc [ 529.953214] [<ffffffff8140f9f6>] ? _raw_spin_unlock_irq+0x30/0x58 [ 529.953363] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c [ 529.953627] [<ffffffff810b0556>] ? shrink_active_list+0x232/0x244 [ 529.953775] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6 [ 529.953924] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a [ 529.954077] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4 [ 529.954226] [<ffffffff81078e1e>] ? memory_bm_test_bit+0x1/0x30 [ 529.954486] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc [ 529.954632] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79 [ 529.954782] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb [ 529.954931] [<ffffffff8140bd74>] ? 
printk+0x41/0x45 [ 529.955083] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1 [ 529.955233] [<ffffffff81078a08>] hibernate+0xce/0x172 [ 529.955457] [<ffffffff81077775>] state_store+0x5c/0xd3 [ 529.955604] [<ffffffff8118f573>] kobj_attr_store+0x17/0x19 [ 529.955752] [<ffffffff8112e42c>] sysfs_write_file+0x108/0x144 [ 529.955900] [<ffffffff810db63b>] vfs_write+0xb2/0x153 [ 529.956053] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 529.956202] [<ffffffff810db79f>] sys_write+0x4a/0x71 [ 529.956351] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b [ 537.634362] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) [ 538.129750] SysRq : Emergency Sync [ 538.130161] Emergency Sync complete [ 538.902386] SysRq : Emergency Remount R/O [ 539.328830] SysRq : Resetting -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* [PATCH 1/3] mm: make page freeing path RCU-safe
  2010-04-11 13:08 ` Borislav Petkov
@ 2010-04-11 13:19 ` Borislav Petkov
  2010-04-11 13:19 ` [PATCH 2/3] mm: cleanup find_mergeable_anon_vma complexity Borislav Petkov
  ` (3 subsequent siblings)
  4 siblings, 0 replies; 231+ messages in thread
From: Borislav Petkov @ 2010-04-11 13:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
      Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
      Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>

On Sat, 10 Apr 2010, Linus Torvalds wrote:
> On Sat, 10 Apr 2010, Borislav Petkov wrote:
> >
> > And I got an oops again, this time the #GP from couple of days ago.
>
> Oh damn. So the list corruption really does happen still.

Ho humm.

Maybe I'm crazy, but something started bothering me. And I started
wondering: when is the 'page->mapping' of an anonymous page actually
cleared?

The thing is, the mapping of an anonymous page is actually cleared only
when the page is _freed_, in "free_hot_cold_page()".

Now, let's think about that. And in particular, let's think about how
that relates to the freeing of the 'anon_vma' that the page->mapping
points to.

The way the anon_vma is freed is when the mapping is torn down, and we do
roughly:

	tlb = tlb_gather_mmu(mm,..)
	..
	unmap_vmas(&tlb, vma ..
	..
	free_pgtables()
	..
	tlb_finish_mmu(tlb, start, end);

and we actually unmap all the pages in "unmap_vmas()", and then _after_
unmapping all the pages we do the "unlink_anon_vmas(vma);" in
"free_pgtables()". Fine so far - the anon_vma stays around until after
the page has been happily unmapped.

But "unmapped all the pages" is _not_ actually the same as "free'd all
the pages". The actual _freeing_ of the page happens generally in
tlb_finish_mmu(), because we can free the page only after we've flushed
any TLB entries.

So what we have in that tlb_gather structure is a list of _pending_ pages
to be freed, while we already actually free'd the anon_vmas earlier!

Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because
we use a per-cpu variable), but as far as I can tell it is _not_ an
RCU-safe region.

So I think we might actually get a real RCU freeing event while this all
happens. So now the 'anon_vma' that 'page->mapping' points to has not
just been released back to the SLUB caches, the page itself might have
been released too.

I dunno. Does the above sound at all sane? Or am I just raving?

Something hacky like the above might fix it if I'm not just raving. I
really might be missing something here.

		Linus

---
 include/asm-generic/tlb.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e43f976..2678118 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -14,6 +14,7 @@
 #define _ASM_GENERIC__TLB_H

 #include <linux/swap.h>
+#include <linux/rcupdate.h>
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>

@@ -62,6 +63,7 @@ tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)

 	tlb->fullmm = full_mm_flush;

+	rcu_read_lock();
 	return tlb;
 }

@@ -90,6 +92,7 @@ tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 	/* keep the page table cache within bounds */
 	check_pgt_cache();

+	rcu_read_unlock();
 	put_cpu_var(mmu_gathers);
 }

--
1.7.0.3

^ permalink raw reply related [flat|nested] 231+ messages in thread
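
[ Illustration, not part of the original message: the ordering described in
the message above can be sketched as a small userspace model. This is not
kernel code. Every toy_* name is hypothetical and invented for this sketch,
and a 'released' flag stands in for the anon_vma really going back to the
slab allocator. It only shows the window between unlinking the anon_vma and
the deferred page free, during which the page's mapping pointer still refers
to a released object. ]

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PENDING 8

/* Hypothetical stand-in for struct anon_vma; 'released' models a free. */
struct toy_anon_vma {
	bool released;
};

/* Hypothetical stand-in for struct page. */
struct toy_page {
	struct toy_anon_vma *mapping;	/* only cleared when the page is freed */
	bool mapped;
};

/* Hypothetical stand-in for the mmu_gather batch of pages awaiting free. */
struct toy_gather {
	struct toy_page *pending[TOY_MAX_PENDING];
	size_t nr;
};

/* Models unmap_vmas(): the page is unmapped, but its free is deferred. */
static bool toy_unmap_page(struct toy_gather *tlb, struct toy_page *page)
{
	if (tlb->nr == TOY_MAX_PENDING)
		return false;
	page->mapped = false;
	tlb->pending[tlb->nr++] = page;
	return true;
}

/* Models unlink_anon_vmas() in free_pgtables(): the anon_vma goes away. */
static void toy_unlink_anon_vma(struct toy_anon_vma *av)
{
	av->released = true;
}

/* Models tlb_finish_mmu(): only now are pages freed and mapping cleared. */
static void toy_finish(struct toy_gather *tlb)
{
	for (size_t i = 0; i < tlb->nr; i++)
		tlb->pending[i]->mapping = NULL;
	tlb->nr = 0;
}

/* True while the page still points at an anon_vma that was released. */
static bool toy_mapping_is_stale(const struct toy_page *page)
{
	return page->mapping != NULL && page->mapping->released;
}
```

[ In this model, anything that inspects the page between
toy_unlink_anon_vma() and toy_finish() sees a stale mapping. The
rcu_read_lock()/rcu_read_unlock() pair in the patch is, as the message
proposes, meant to keep RCU-freed memory from really being released during
that window. ]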
* [PATCH 2/3] mm: cleanup find_mergeable_anon_vma complexity
  2010-04-11 13:08 ` Borislav Petkov
  2010-04-11 13:19 ` [PATCH 1/3] mm: make page freeing path RCU-safe Borislav Petkov
@ 2010-04-11 13:19 ` Borislav Petkov
  2010-04-11 13:19 ` [PATCH 3/3] mm: fixup vma_adjust Borislav Petkov
  ` (2 subsequent siblings)
  4 siblings, 0 replies; 231+ messages in thread
From: Borislav Petkov @ 2010-04-11 13:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton,
      Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn,
      Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>

On Sat, 10 Apr 2010, Linus Torvalds wrote:
>
> But I think the fact that you are apparently not able to get the list
> corruption is a good sign. Of course, it might just be harder to trigger,
> and these things could all be a sign of a different bug, but my gut feel
> is that we did fix something, and you are just damn good at stressing the
> new code. Kudos.

Btw, I do hate the current 'find_mergeable_anon_vma()' with its
duplicated checks for prev/next compatibility that I just made even more
complex.

So I'm actually inclined to want to write my simple two-liner fix as a
rather more complex cleanup patch, below.

It adds way more lines than it deletes, but a lot of it is comments (and
some of it is just because one routine got split up into three), and I
think it makes the result a lot more readable. It also splits off the
decision of whether we can reuse an anon_vma from the decision of whether
we can merge the vma's - the two are kind of related, but they are not
really the same, and they have different issues. I think it's good to try
to keep separate issues separate.

This is UNTESTED! It's meant to be an "obvious cleanup" with no real
semantic difference, but if I did something wrong it won't work.
Also note the comment about the lack of locking between two adjacent anon_vma's taking a page fault at the same time: the ACCESS_ONCE() is unlikely to ever matter (anon_vma's are stable once they are set, so it's really just that you could first load a NULL, and then if you re-load the value you might get a non-NULL thing). Also note that when checking whether the anon_vma is a singleton, we don't hold any lock that protects the list we are checking. But "list_is_singular()" is safe and won't oops even if the pointers in the list are crap, because it only _compares_ the prev/next pointers, it doesn't dereference them. In short, what I'm saying is that there is a pretty subtle race in the very very unlikely case that two anon_vma's get prepared concurrently, but from a correctness standpoint it doesn't matter. We might sometimes - once in a blue moon - reject an anon_vma that could in theory have been merged, but that won't hurt. Comments? Rik, Johannes? Linus --- mm/mmap.c | 86 ++++++++++++++++++++++++++++++++++++++++++++----------------- 1 files changed, 62 insertions(+), 24 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 75557c6..acb023e 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -825,6 +825,61 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, } /* + * Rough compatbility check to quickly see if it's even worth looking + * at sharing an anon_vma. + * + * They need to have the same vm_file, and the flags can only differ + * in things that mprotect may change. + * + * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that + * we can merge the two vma's. For example, we refuse to merge a vma if + * there is a vm_ops->close() function, because that indicates that the + * driver is doing some kind of reference counting. But that doesn't + * really matter for the anon_vma sharing case. 
+ */ +static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b) +{ + return a->vm_end == b->vm_start && + mpol_equal(vma_policy(a), vma_policy(b)) && + a->vm_file == b->vm_file && + !((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC)) && + b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT); +} + +/* + * Do some basic sanity checking to see if we can re-use the anon_vma + * from 'old'. The 'a'/'b' vma's are in VM order - one of them will be + * the same as 'old', the other will be the new one that is trying + * to share the anon_vma. + * + * NOTE! This runs with mm_sem held for reading, so it is possible that + * the anon_vma of 'old' is concurrently in the process of being set up + * by another page fault trying to merge _that_. But that's ok: if it + * is being set up, that automatically means that it will be a singleton + * acceptable for merging, so we can do all of this optimistically. But + * we do that ACCESS_ONCE() to make sure that we never re-load the pointer. + * + * IOW: that the "list_is_singular()" test on the anon_vma_chain only + * matters for the 'stable anon_vma' case (ie the thing we want to avoid + * is to return an anon_vma that is "complex" due to having gone through + * a fork). + * + * We also make sure that the two vma's are compatible (adjacent, + * and with the same memory policies). That's all stable, even with just + * a read lock on the mm_sem. + */ +static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_area_struct *a, struct vm_area_struct *b) +{ + if (anon_vma_compatible(a, b)) { + struct anon_vma *anon_vma = ACCESS_ONCE(old->anon_vma); + + if (anon_vma && list_is_singular(&old->anon_vma_chain)) + return anon_vma; + } + return NULL; +} + +/* * find_mergeable_anon_vma is used by anon_vma_prepare, to check * neighbouring vmas for a suitable anon_vma, before it goes off * to allocate a new anon_vma. 
It checks because a repetitive @@ -834,28 +889,16 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, */ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma) { + struct anon_vma *anon_vma; struct vm_area_struct *near; - unsigned long vm_flags; near = vma->vm_next; if (!near) goto try_prev; - /* - * Since only mprotect tries to remerge vmas, match flags - * which might be mprotected into each other later on. - * Neither mlock nor madvise tries to remerge at present, - * so leave their flags as obstructing a merge. - */ - vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC); - vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC); - - if (near->anon_vma && vma->vm_end == near->vm_start && - mpol_equal(vma_policy(vma), vma_policy(near)) && - can_vma_merge_before(near, vm_flags, - NULL, vma->vm_file, vma->vm_pgoff + - ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT))) - return near->anon_vma; + anon_vma = reusable_anon_vma(near, vma, near); + if (anon_vma) + return anon_vma; try_prev: /* * It is potentially slow to have to call find_vma_prev here. @@ -868,14 +911,9 @@ try_prev: if (!near) goto none; - vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC); - vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC); - - if (near->anon_vma && near->vm_end == vma->vm_start && - mpol_equal(vma_policy(near), vma_policy(vma)) && - can_vma_merge_after(near, vm_flags, - NULL, vma->vm_file, vma->vm_pgoff)) - return near->anon_vma; + anon_vma = reusable_anon_vma(near, near, vma); + if (anon_vma) + return anon_vma; none: /* * There's no absolute need to look only at touching neighbours: -- 1.7.0.3 ^ permalink raw reply related [flat|nested] 231+ messages in thread
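The two checks the patch above leans on can be sketched in userspace C: the singleton test that only compares the head's own next/prev pointers, and the flag comparison that tolerates differences only in the mprotect-changeable bits. This is a minimal model, not the kernel code: struct list_head, list_init() and list_add() here are simplified stand-ins, flags_compatible() is a hypothetical helper name, and the VM_* values are assumed to match the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* An empty list is a head that points at itself. */
static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert 'node' right after 'head' (i.e. at the front of the list). */
static void list_add(struct list_head *node, struct list_head *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/*
 * Non-empty and next == prev. Only the head itself is read: the two
 * pointers stored in it are compared with each other and with the
 * head's address, never followed.
 */
static bool list_is_singular(const struct list_head *head)
{
	return head->next != head && head->next == head->prev;
}

/* Assumed flag values; VM_SHARED is only here to show a mismatch. */
#define VM_READ   0x1UL
#define VM_WRITE  0x2UL
#define VM_EXEC   0x4UL
#define VM_SHARED 0x8UL

/* Hypothetical helper: true if the flags differ only in rwx bits. */
static bool flags_compatible(unsigned long a, unsigned long b)
{
	return !((a ^ b) & ~(VM_READ | VM_WRITE | VM_EXEC));
}
```

Because list_is_singular() never follows the stored pointers, a head whose next and prev hold the same garbage value still yields an answer instead of a fault, which is the property the mail relies on when the list is examined without a lock.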
* [PATCH 3/3] mm: fixup vma_adjust 2010-04-11 13:08 ` Borislav Petkov 2010-04-11 13:19 ` [PATCH 1/3] mm: make page freeing path RCU-safe Borislav Petkov 2010-04-11 13:19 ` [PATCH 2/3] mm: cleanup find_mergeable_anon_vma complexity Borislav Petkov @ 2010-04-11 13:19 ` Borislav Petkov 2010-04-11 13:25 ` [PATCH 2/3] mm: cleanup find_mergeable_anon_vma complexity Borislav Petkov 2010-04-11 17:07 ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Linus Torvalds 4 siblings, 0 replies; 231+ messages in thread From: Borislav Petkov @ 2010-04-11 13:19 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Linus Torvalds <torvalds@linux-foundation.org> On Sat, 10 Apr 2010, Borislav Petkov wrote: > From: Borislav Petkov <bp@alien8.de> > Date: Sat, Apr 10, 2010 at 08:51:45PM +0200 > > > Anyways, testing... > > Nope, still b0rked. And this time is not a funny pattern but > ffffffffffffffe0 we had originally. Ok, I think that just depends on who happens to re-use the allocation and how it does it. I'm pretty sure it's a use-after-free issue, where we have free'd an anon_vma too early, even though it has pages associated with it. If it wasn't the RCU case, it's just something else. I think it's worth looking at "vma_adjust()", because as I already mentioned to Rik earlier - the code is very hard to understand, and it's accrued crud over many many years. And vma_adjust is the one place that does that anon_vma_merge(), which is apart from the actual unmapping sequence the only other place that actually free's anon_vmas. So there are reasons to be very suspicious of that code. And I think that code can actually lose an anon_vma chain. 
It's totally screwing up the "import anonvma" case: when it does if (anon_vma_clone(importer, vma)) { return -ENOMEM; } importer->anon_vma = anon_vma; we can actually have "importer == vma", but "anon_vma = next->anon_vma". In which case we actually end up with an _empty_ chain (because importer didn't have a chain to begin with!) but "importer->anon_vma" points to an anon_vma. And then when we do that "remove_next", we actually get rid of the only chain we ever had, and have lost all our references to the anon_vma. That looks _horribly_ buggy. Also, the conditional nesting makes no sense (the whole anon_vma_clone() only makes sense if importer is set, and it is only ever set _inside_ the earlier if-statement, so the whole code should be moved inside there), nor do some of the comments. This patch is scary and untested, but the more I look at that code, the more convinced I am that vma_adjust was _really_ badly screwed up. The patch below may make things worse. I'll test it myself too, but I'm sending it out first, since I was writing the email as I was looking at the piece of cr*p. 
Linus --- mm/mmap.c | 24 ++++++++---------------- 1 files changed, 8 insertions(+), 16 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index acb023e..f90ea92 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -507,11 +507,12 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start, struct address_space *mapping = NULL; struct prio_tree_root *root = NULL; struct file *file = vma->vm_file; - struct anon_vma *anon_vma = NULL; long adjust_next = 0; int remove_next = 0; if (next && !insert) { + struct vm_area_struct *exporter = NULL; + if (end >= next->vm_end) { /* * vma expands, overlapping all the next, and @@ -519,7 +520,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start, */ again: remove_next = 1 + (end > next->vm_end); end = next->vm_end; - anon_vma = next->anon_vma; + exporter = next; importer = vma; } else if (end > next->vm_start) { /* @@ -527,7 +528,7 @@ again: remove_next = 1 + (end > next->vm_end); * mprotect case 5 shifting the boundary up. */ adjust_next = (end - next->vm_start) >> PAGE_SHIFT; - anon_vma = next->anon_vma; + exporter = next; importer = vma; } else if (end < vma->vm_end) { /* @@ -536,28 +537,19 @@ again: remove_next = 1 + (end > next->vm_end); * mprotect case 4 shifting the boundary down. */ adjust_next = - ((vma->vm_end - end) >> PAGE_SHIFT); - anon_vma = next->anon_vma; + exporter = vma; importer = next; } - } - /* - * When changing only vma->vm_end, we don't really need anon_vma lock. - */ - if (vma->anon_vma && (insert || importer || start != vma->vm_start)) - anon_vma = vma->anon_vma; - if (anon_vma) { /* * Easily overlooked: when mprotect shifts the boundary, * make sure the expanding vma has anon_vma set if the * shrinking vma had, to cover any anon pages imported. */ - if (importer && !importer->anon_vma) { - /* Block reverse map lookups until things are set up. 
*/ - if (anon_vma_clone(importer, vma)) { + if (exporter && exporter->anon_vma && !importer->anon_vma) { + if (anon_vma_clone(importer, exporter)) return -ENOMEM; - } - importer->anon_vma = anon_vma; + importer->anon_vma = exporter->anon_vma; } } -- 1.7.0.3 ^ permalink raw reply related [flat|nested] 231+ messages in thread
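The "empty chain but non-NULL anon_vma" state described above can be reproduced with a toy model. Everything in this sketch is a simplification made up for illustration: struct vma_model, clone_chain(), import_old(), import_new() and chain_consistent() are hypothetical names, and the anon_vma chain is reduced to a bare entry count. It only shows why cloning from 'vma' when importer == vma leaves nothing linked, while cloning from an explicit exporter does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy vma: an anon_vma pointer plus the number of chain entries. */
struct vma_model {
	const void *anon_vma;	/* NULL if none */
	int chain_len;		/* entries on the anon_vma_chain */
};

/* Stand-in for anon_vma_clone(): dst gains one entry per src entry. */
static void clone_chain(struct vma_model *dst, const struct vma_model *src)
{
	dst->chain_len += src->chain_len;
}

/* Old logic: always clone from 'vma', then take next's anon_vma. */
static void import_old(struct vma_model *importer,
		       const struct vma_model *vma,
		       const struct vma_model *next)
{
	if (importer && !importer->anon_vma) {
		clone_chain(importer, vma);
		importer->anon_vma = next->anon_vma;
	}
}

/* Patched logic: clone from whichever vma exports the pages. */
static void import_new(struct vma_model *importer,
		       const struct vma_model *exporter)
{
	if (exporter && exporter->anon_vma && !importer->anon_vma) {
		clone_chain(importer, exporter);
		importer->anon_vma = exporter->anon_vma;
	}
}

/* An anon_vma pointer should be set exactly when the chain is non-empty. */
static bool chain_consistent(const struct vma_model *v)
{
	return (v->anon_vma != NULL) == (v->chain_len > 0);
}
```

With importer == vma and an empty chain, import_old() copies zero entries and then sets the pointer, which is the inconsistent state; import_new() with next as exporter leaves one entry and a matching pointer.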
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-11 13:08 ` Borislav Petkov ` (3 preceding siblings ...) 2010-04-11 13:25 ` [PATCH 2/3] mm: cleanup find_mergeable_anon_vma complexity Borislav Petkov @ 2010-04-11 17:07 ` Linus Torvalds 2010-04-11 17:16 ` Linus Torvalds 4 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-11 17:07 UTC (permalink / raw) To: Borislav Petkov Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sun, 11 Apr 2010, Borislav Petkov wrote: > > Ok, I could verify that the three patches we were talking about still > can't fix the issue. However, just to make sure I'm sending the versions > of the patches I used for you guys to check. Yup, the patches are the ones I wanted you to try. So either my fixes were buggy (possible, especially for the vma_adjust case), or there are other bugs still lurking. The scary part is that the _old_ anon_vma code didn't really care about the anon_vma all that deeply. It was just a placeholder, if you got some of it wrong the worst that would probably happen would be that a page could never find all the mappings it had. So it was a possible swap efficiency problem when we cannot get rid of all mapped pages, but if it only happens for some small and unusual special case, nobody would ever have noticed. With the new code, when you have a page that is associated with a stale anon_vma, you get the page_referenced() oops instead. And I can't find the bug. Everything I've looked at looks fine. So I'm going to ask you to start applying "validation patches" - code to check some internal consistency, and seeing if we break that internal consistency somewhere. It may be that Rik has some patches like this from his development work, but here's the first one. 
This patch should have caught the vma_adjust() problem, but all it caught for me was that "anon_vma_clone()" ended up cloning the avc entries in the wrong order so the lists didn't actually look exactly the same. The patch fixes that case, so if this triggers any warnings for you, I think it's a real bug. But I'm pretty sure that the problem is that we have a "page->mapping" that points to an anon_vma that no longer exists, and you can easily get that while still having valid vma chains - they just aren't necessarily the complete _set_ of chains they should be. [ In particular, I think that the _real_ problem is that we don't clear "page->mapping" when we unmap a page. See the comment at the end of page_remove_rmap(), and it also explains the test for "page_mapped()" in page_lock_anon_vma(). But I think the bug you see might be exactly the race between page_mapped() and actually getting the anon_vma spinlock. I'd have expected that window to be too small to ever hit, though, which is why I find it a bit unlikely. But it would explain why you _sometimes_ actually get a hung spinlock too - you never get the spinlock at all, and somebody replaced the data with something that the spinlock code thinks is a locked spinlock - but is no longer a spinlock at all ] Linus --- mm/mmap.c | 18 ++++++++++++++++++ mm/rmap.c | 2 +- 2 files changed, 19 insertions(+), 1 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index f90ea92..890c169 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1565,6 +1565,22 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, EXPORT_SYMBOL(get_unmapped_area); +static void verify_vma(struct vm_area_struct *vma) +{ + if (vma->anon_vma) { + struct anon_vma_chain *avc; + if (WARN_ONCE(list_empty(&vma->anon_vma_chain), "vma has anon_vma but empty chain")) + return; + /* The first entry of the avc chain should match! 
*/ + avc = list_entry(vma->anon_vma_chain.next, struct anon_vma_chain, same_vma); + WARN_ONCE(avc->anon_vma != vma->anon_vma, "anon_vma entry doesn't match anon_vma_chain"); + WARN_ONCE(avc->vma != vma, "vma entry doesn't match anon_vma_chain"); + } else { + WARN_ONCE(!list_empty(&vma->anon_vma_chain), "vma has no anon_vma but has chain"); + } +} + + /* Look up the first VMA which satisfies addr < vm_end, NULL if none. */ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr) { @@ -1598,6 +1614,8 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr) mm->mmap_cache = vma; } } + if (vma) + verify_vma(vma); return vma; } diff --git a/mm/rmap.c b/mm/rmap.c index eaa7a09..ee97d38 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -182,7 +182,7 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) { struct anon_vma_chain *avc, *pavc; - list_for_each_entry(pavc, &src->anon_vma_chain, same_vma) { + list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) { avc = anon_vma_chain_alloc(); if (!avc) goto enomem_failure; ^ permalink raw reply related [flat|nested] 231+ messages in thread
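Why iterating the source chain in reverse keeps the clone in the same order can be seen in a small model. The sketch assumes that each new entry is linked at the front of the destination chain (the mail only states that the clone came out in the wrong order); struct chain, chain_add_front(), clone_forward() and clone_reverse() are hypothetical names, and an int array stands in for the list.

```c
#include <assert.h>

#define MAX_CHAIN 8

/* Toy chain: ids[0] is the first entry, as seen from the list head. */
struct chain {
	int ids[MAX_CHAIN];
	int len;
};

/* Link a new entry at the front; extra entries are dropped if full. */
static void chain_add_front(struct chain *c, int id)
{
	if (c->len >= MAX_CHAIN)
		return;
	for (int i = c->len; i > 0; i--)
		c->ids[i] = c->ids[i - 1];
	c->ids[0] = id;
	c->len++;
}

/* Walk src front to back: front insertion reverses the order. */
static void clone_forward(struct chain *dst, const struct chain *src)
{
	for (int i = 0; i < src->len; i++)
		chain_add_front(dst, src->ids[i]);
}

/* Walk src back to front: front insertion restores the order. */
static void clone_reverse(struct chain *dst, const struct chain *src)
{
	for (int i = src->len - 1; i >= 0; i--)
		chain_add_front(dst, src->ids[i]);
}
```

Under that assumption the forward walk turns a chain 1, 2, 3 into 3, 2, 1, so a check that the first entry matches the vma's own anon_vma would warn on the clone, while the reverse walk yields 1, 2, 3 again.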
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-11 17:07 ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Linus Torvalds @ 2010-04-11 17:16 ` Linus Torvalds 2010-04-11 18:55 ` Borislav Petkov ` (2 more replies) 0 siblings, 3 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-11 17:16 UTC (permalink / raw) To: Borislav Petkov Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sun, 11 Apr 2010, Linus Torvalds wrote: > > But I think the bug you see might be exactly the race between > page_mapped() and actually getting the anon_vma spinlock. I'd have > expected that window to be too small to ever hit, though, which is why I > find it a bit unlikely. But it would explain why you _sometimes_ > actually get a hung spinlock too - you never get the spinlock at all, > and somebody replaced the data with something that the spinlock code > thinks is a locked spinlock - but is no longer a spinlock at all ] Actually, so if it's that race, then we might get rid of the oops with this total hack. NOTE! If this is the race, then the hack really is just a hack, because it doesn't really solve anything. We still take the spinlock, and if bad things have happened, _that_ can still very much fail, and you get the watchdog lockup message instead. So this doesn't really fix anything. But if this patch changes behavior, and you no longer see the oops, that tells us _something_. I'm not sure how useful that "something" is, but it at least means that there are no _mapped_ pages that have that stale anon_vma pointer in page->mapping. Conversely, if you still see the oops (rather than the watchdog), that means that we actually have pages that are still marked mapped, and that despite that mapped state have a stale page->mapping pointer. 
I actually find that the more likely case, because otherwise the window is _so_ small that I don't see how you can hit the oops so reliably. Anyway - probably worth testing, along with the verify_vma() patch. If nothing else, if there is no new behavior, even that tells us something. Even if that "something" is not a huge piece of information. Linus --- diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -302,7 +302,11 @@ struct anon_vma *page_lock_anon_vma(struct page *page) anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); spin_lock(&anon_vma->lock); - return anon_vma; + + if (page_mapped(page)) + return anon_vma; + + spin_unlock(&anon_vma->lock); out: rcu_read_unlock(); return NULL; ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-11 17:16 ` Linus Torvalds @ 2010-04-11 18:55 ` Borislav Petkov 2010-04-12 0:13 ` Linus Torvalds 2010-04-11 19:49 ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel 2010-04-11 21:45 ` Rik van Riel 2 siblings, 1 reply; 231+ messages in thread From: Borislav Petkov @ 2010-04-11 18:55 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Linus Torvalds <torvalds@linux-foundation.org> Date: Sun, Apr 11, 2010 at 10:16:10AM -0700 > Conversely, if you still see the oops (rather than the watchdog), that > means that we actually have pages that are still marked mapped, and that > despite that mapped state have a stale page->mapping pointer. I actually > find that the more likely case, because otherwise the window is _so_ small > that I don't see how you can hit the oops so reliably. Ok, did test with the all 5 patches applied. It oopsed with the same trace, see below. Except one kernel/sched.c:3555 warning checking spinlock count overflowing, nothing else. :( I tried to see whether the page->mapping pointer is stale, I dunno, maybe there could be something in the register dump which could tell us what's happening. This is how I see it, I could very well be wrong and missing something though: So, yes, we oops at the same place, however, a bit earlier we do anon_vma = page_lock_anon_vma(page); if (!anon_vma) return referenced; which compiles here to .loc 1 496 0 movq %rbx, %rdi # page, call page_lock_anon_vma # .LVL288: .loc 1 497 0 testq %rax, %rax # anon_vma .LVL289: .loc 1 496 0 movq %rax, %r14 #, anon_vma and I checked that on the path before the instruction where we oops we don't touch %r14 so the value in the register dump below should be that anon_vma. 
Which looks like valid kernel pointer. We dereference it later to get anon_vma->head.next with .loc 1 501 0 movq 64(%r14), %r13 # <variable>.head.next, <variable>.head.next .LBE1287: leaq 64(%r14), %rax #, movq %rax, -128(%rbp) #, %sfp .LBB1288: subq $32, %r13 #, avc which ends up in %r13 as ffffffffffffffe0. So, it really looks like at least that list_head in anon_vma is bollocks, or even the whole anon_vma. So if this is correct, it is highly likely that the anon_vma is already freed material or not initialized at all. Hm... [ 616.317201] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done. [ 616.329964] PM: Preallocating image memory... [ 616.586463] BUG: unable to handle kernel NULL pointer dereference at (null) [ 616.586851] IP: [<ffffffff810c614f>] page_referenced+0xee/0x1dc [ 616.587045] PGD 225dcf067 PUD 22627f067 PMD 0 [ 616.587126] Oops: 0000 [#1] PREEMPT SMP [ 616.587126] last sysfs file: /sys/power/state [ 616.587126] CPU 1 [ 616.587126] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd edac_core 8250_pnp 8250 serial_core pcspkr k10temp [ 616.587126] [ 616.587126] Pid: 3453, comm: hib.sh Tainted: G W 2.6.34-rc3-00505-g1d9bb34 #1 M3A78 PRO/System Product Name [ 616.587126] RIP: 0010:[<ffffffff810c614f>] [<ffffffff810c614f>] page_referenced+0xee/0x1dc [ 616.587126] RSP: 0018:ffff88022b3258b8 EFLAGS: 00010283 [ 616.587126] RAX: ffff880200ba4b88 RBX: ffffea00076b2b30 RCX: ffff88022eacaa58 [ 616.587126] RDX: ffffffff810c5e7a RSI: ffff880200ba4b60 RDI: ffff88022fa492e0 [ 616.587126] RBP: ffff88022b325938 R08: 0000000000000002 R09: 0000000000000000 [ 616.587126] R10: ffff88022eacaa30 R11: 0000000000000001 R12: 0000000000000000 [ 616.587126] R13: ffffffffffffffe0 R14: ffff880200ba4b48 R15: ffff88022b325a00 [ 616.587126] FS: 00007f0b140306f0(0000) GS:ffff88000a200000(0000) knlGS:0000000000000000 [ 616.587126] CS: 0010 
DS: 0000 ES: 0000 CR0: 000000008005003b [ 616.587126] CR2: 0000000000000000 CR3: 000000022c44f000 CR4: 00000000000006e0 [ 616.587126] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 616.587126] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 616.587126] Process hib.sh (pid: 3453, threadinfo ffff88022b324000, task ffff88022fa492e0) [ 616.587126] Stack: [ 616.587126] ffff880200ba4b88 00000000810c5e5f ffff88022b325918 ffffffff810c5fd7 [ 616.587126] <0> ffff880200000000 ffffffff00000001 ffff88022b325fd8 ffffea00076c1a80 [ 616.587126] <0> ffffea00076c1a80 000000022b325cf8 ffffea00076c1a80 ffffea00076b2b58 [ 616.587126] Call Trace: [ 616.587126] [<ffffffff810c5fd7>] ? try_to_unmap_anon+0xa2/0xb4 [ 616.587126] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7 [ 616.587126] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1 [ 616.587126] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc [ 616.587126] [<ffffffff8140fb06>] ? _raw_spin_unlock_irq+0x30/0x58 [ 616.587126] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c [ 616.587126] [<ffffffff810b0556>] ? shrink_active_list+0x232/0x244 [ 616.587126] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6 [ 616.587126] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a [ 616.587126] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4 [ 616.587126] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc [ 616.587126] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79 [ 616.587126] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb [ 616.587126] [<ffffffff8140be84>] ? printk+0x41/0x45 [ 616.587126] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1 [ 616.587126] [<ffffffff81078a08>] hibernate+0xce/0x172 [ 616.587126] [<ffffffff81077775>] state_store+0x5c/0xd3 [ 616.587126] [<ffffffff8118f687>] kobj_attr_store+0x17/0x19 [ 616.587126] [<ffffffff8112e540>] sysfs_write_file+0x108/0x144 [ 616.587126] [<ffffffff810db74f>] vfs_write+0xb2/0x153 [ 616.587126] [<ffffffff810663c9>] ? 
trace_hardirqs_on_caller+0x1f/0x14b [ 616.587126] [<ffffffff810db8b3>] sys_write+0x4a/0x71 [ 616.587126] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b [ 616.587126] Code: 3b 56 10 73 1e 48 83 fa f2 74 18 48 8d 4d cc 4d 89 f8 48 89 df e8 02 f2 ff ff 41 01 c4 83 7d cc 00 74 19 4d 8b 6d 20 49 83 ed 20 <49> 8b 45 20 0f 18 08 49 8d 45 20 48 39 45 80 75 aa 4c 89 f7 e8 [ 616.587126] RIP [<ffffffff810c614f>] page_referenced+0xee/0x1dc [ 616.587126] RSP <ffff88022b3258b8> [ 616.587126] CR2: 0000000000000000 [ 616.600838] ---[ end trace 0ea0c6b4ead21c8f ]--- [ 616.600984] note: hib.sh[3453] exited with preempt_count 2 [ 616.601282] BUG: scheduling while atomic: hib.sh/3453/0x10000003 [ 616.601431] INFO: lockdep is turned off. [ 616.601584] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod ohci_hcd edac_core 8250_pnp 8250 serial_core pcspkr k10temp [ 616.603115] Pid: 3453, comm: hib.sh Tainted: G D W 2.6.34-rc3-00505-g1d9bb34 #1 [ 616.603460] Call Trace: [ 616.603605] [<ffffffff810658df>] ? __debug_show_held_locks+0x1b/0x24 [ 616.603755] [<ffffffff8102dfac>] __schedule_bug+0x72/0x77 [ 616.603903] [<ffffffff8140c298>] schedule+0xe3/0x7ff [ 616.604051] [<ffffffff810bd0e4>] ? unmap_vmas+0x90c/0x911 [ 616.604230] [<ffffffff81030ecb>] __cond_resched+0x18/0x24 [ 616.604381] [<ffffffff8140ca81>] _cond_resched+0x2c/0x37 [ 616.604529] [<ffffffff810bcef1>] unmap_vmas+0x719/0x911 [ 616.604678] [<ffffffff810c16c0>] exit_mmap+0x102/0x1e4 [ 616.604826] [<ffffffff810c1627>] ? exit_mmap+0x69/0x1e4 [ 616.604975] [<ffffffff810368bc>] mmput+0x48/0xb9 [ 616.605124] [<ffffffff8103ad90>] exit_mm+0x110/0x11d [ 616.605280] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5 [ 616.605430] [<ffffffff81039e2f>] ? kmsg_dump+0x13b/0x155 [ 616.605579] [<ffffffff8100616b>] ? 
oops_end+0x47/0x93 [ 616.605727] [<ffffffff810061b2>] oops_end+0x8e/0x93 [ 616.605875] [<ffffffff8101f3e5>] no_context+0x1fc/0x20b [ 616.606023] [<ffffffff8101f580>] __bad_area_nosemaphore+0x18c/0x1af [ 616.606176] [<ffffffff8101f7bb>] ? do_page_fault+0xa8/0x32d [ 616.606330] [<ffffffff8101f5b6>] bad_area_nosemaphore+0x13/0x15 [ 616.606479] [<ffffffff8101f886>] do_page_fault+0x173/0x32d [ 616.606628] [<ffffffff81410463>] ? error_sti+0x5/0x6 [ 616.606776] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9 [ 616.606926] [<ffffffff8140edab>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 616.607076] [<ffffffff8141027f>] page_fault+0x1f/0x30 [ 616.607227] [<ffffffff810c5e7a>] ? page_lock_anon_vma+0x0/0xbb [ 616.607381] [<ffffffff810c614f>] ? page_referenced+0xee/0x1dc [ 616.607530] [<ffffffff810c60e1>] ? page_referenced+0x80/0x1dc [ 616.607678] [<ffffffff810c5fd7>] ? try_to_unmap_anon+0xa2/0xb4 [ 616.607827] [<ffffffff810b06bc>] shrink_page_list+0x154/0x4c7 [ 616.607976] [<ffffffff81067149>] ? print_lock_contention_bug+0x1b/0xe1 [ 616.608131] [<ffffffff810af59c>] ? isolate_pages_global+0xd0/0x1fc [ 616.608284] [<ffffffff8140fb06>] ? _raw_spin_unlock_irq+0x30/0x58 [ 616.608435] [<ffffffff810b0d8a>] shrink_inactive_list+0x35b/0x60c [ 616.608585] [<ffffffff810b0556>] ? shrink_active_list+0x232/0x244 [ 616.608734] [<ffffffff810b1347>] shrink_zone+0x30c/0x3d6 [ 616.608883] [<ffffffff810b1f3d>] do_try_to_free_pages+0x191/0x29a [ 616.609031] [<ffffffff810b20db>] shrink_all_memory+0x95/0xc4 [ 616.609183] [<ffffffff810af4cc>] ? isolate_pages_global+0x0/0x1fc [ 616.609337] [<ffffffff81079c9c>] ? count_data_pages+0x65/0x79 [ 616.609486] [<ffffffff81079f03>] hibernate_preallocate_memory+0x1aa/0x2cb [ 616.609636] [<ffffffff8140be84>] ? 
printk+0x41/0x45 [ 616.609784] [<ffffffff8107878f>] hibernation_snapshot+0x36/0x1e1 [ 616.609933] [<ffffffff81078a08>] hibernate+0xce/0x172 [ 616.610080] [<ffffffff81077775>] state_store+0x5c/0xd3 [ 616.610233] [<ffffffff8118f687>] kobj_attr_store+0x17/0x19 [ 616.610383] [<ffffffff8112e540>] sysfs_write_file+0x108/0x144 [ 616.610532] [<ffffffff810db74f>] vfs_write+0xb2/0x153 [ 616.610680] [<ffffffff810663c9>] ? trace_hardirqs_on_caller+0x1f/0x14b [ 616.610830] [<ffffffff810db8b3>] sys_write+0x4a/0x71 [ 616.610978] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b [ 682.501863] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) [ 683.552767] SysRq : Emergency Sync [ 683.553147] Emergency Sync complete [ 684.180708] SysRq : Emergency Remount R/O [ 684.927560] SysRq : Resetting -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-11 18:55 ` Borislav Petkov @ 2010-04-12 0:13 ` Linus Torvalds 2010-04-12 1:04 ` Linus Torvalds 0 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-12 0:13 UTC (permalink / raw) To: Borislav Petkov Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sun, 11 Apr 2010, Borislav Petkov wrote: > > > Conversely, if you still see the oops (rather than the watchdog), that > > means that we actually have pages that are still marked mapped, and that > > despite that mapped state have a stale page->mapping pointer. I actually > > find that the more likely case, because otherwise the window is _so_ small > > that I don't see how you can hit the oops so reliably. > > Ok, did test with the all 5 patches applied. It oopsed with the same > trace, see below. Except one kernel/sched.c:3555 warning checking > spinlock count overflowing, nothing else. :( Ok, that preempt-count thing is a real problem, but should be unrelated to your issues. Anyway, so this all means that we definitely have lost sight of an 'anon_vma', even if page->mapping still points to it, and even though the page is still mapped. I'll see if I can come up with a patch to do the same kind of validation on page->mapping as on the anon-vma chains themselves. > I tried to see whether the page->mapping pointer is stale, I dunno, > maybe there could be something in the register dump which could tell us > what's happening. Sadly, you cannot tell by the pointer. A stale pointer still is a perfectly fine kernel pointer, it's just that we've long since released the anon_vma it used to point to, and now it points to some random other data structure. > So, it really looks like at least that list_head in anon_vma is > bollocks, or even the whole anon_vma. 
So if this is correct, it is > highly likely that the anon_vma is already freed material or not > initialized at all. Yes, it's pretty certain it is long free'd, and re-allocated to something else. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12  0:13 ` Linus Torvalds
@ 2010-04-12  1:04 ` Linus Torvalds
  2010-04-12  7:20 ` Borislav Petkov
  0 siblings, 1 reply; 231+ messages in thread
From: Linus Torvalds @ 2010-04-12 1:04 UTC (permalink / raw)
To: Borislav Petkov
Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Sun, 11 Apr 2010, Linus Torvalds wrote:
>
> I'll see if I can come up with a patch to do the same kind of validation
> on page->mapping as on the anon-vma chains themselves.

Ok, this may or may not work. It hasn't triggered for me, which may be
because it's broken, but maybe it's because I'm not doing whatever it is
you are doing to break our VM.

It checks each anonymous page at unmap time against the vma it gets
unmapped from. It depends on the previous verify_vma debugging patch, and
it would be interesting to hear whether this patch causes any new warnings
for you..

If the warnings do happen, they are not going to be printing out any
hugely informative data apart from the fact that the bad case happened at
all. But if they do trigger, I can try to improve on them - it's just not
worth trying to make them any more interesting if they never trigger.
		Linus

---
 mm/memory.c |   21 +++++++++++++++++++++
 mm/mmap.c   |    2 +-
 2 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833952d..5d2df59 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -890,6 +890,25 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	return ret;
 }
 
+extern void verify_vma(struct vm_area_struct *);
+
+static void verify_anon_page(struct vm_area_struct *vma, struct page *page)
+{
+	struct anon_vma *anon_vma = vma->anon_vma;
+	struct anon_vma *need_anon_vma = page_anon_vma(page);
+	struct anon_vma_chain *avc;
+
+	verify_vma(vma);
+	if (WARN_ONCE(!anon_vma, "anonymous page in vma without anon_vma"))
+		return;
+	list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) {
+		WARN_ONCE(avc->vma != vma, "anon_vma_chain vma entry doesn't match");
+		if (avc->anon_vma == need_anon_vma)
+			return;
+	}
+	WARN_ONCE(1, "page->mapping does not exist in vma chain");
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
@@ -940,6 +959,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
+			if (PageAnon(page))
+				verify_anon_page(vma, page);
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
 						addr) != page->index)
diff --git a/mm/mmap.c b/mm/mmap.c
index 890c169..461f59c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1565,7 +1565,7 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 
 EXPORT_SYMBOL(get_unmapped_area);
 
-static void verify_vma(struct vm_area_struct *vma)
+void verify_vma(struct vm_area_struct *vma)
 {
 	if (vma->anon_vma) {
 		struct anon_vma_chain *avc;

^ permalink raw reply related [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12  1:04 ` Linus Torvalds
@ 2010-04-12  7:20 ` Borislav Petkov
  2010-04-12 16:02 ` Linus Torvalds
  0 siblings, 1 reply; 231+ messages in thread
From: Borislav Petkov @ 2010-04-12 7:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sun, Apr 11, 2010 at 06:04:39PM -0700

> It checks each anonymous page at unmap time against the vma it gets
> unmapped from. It depends on the previous verify_vma debugging patch, and
> it would be interesting to hear whether this patch causes any new warnings
> for you..
>
> If the warnings do happen, they are not going to be printing out any
> hugely informative data apart from the fact that the bad case happened at
> all. But if they do trigger, I can try to improve on them - it's just not
> worth trying to make them any more interesting if they never trigger.

Haa, I think you're gonna want to improve them :)

	WARN_ONCE(1, "page->mapping does not exist in vma chain");

triggered on the first resume showing a rather messy 4 WARN_ONCEs. Had I
more cores, there maybe would've been more of them :) Maybe need locking
if clean output is of interest (see below).

So, anyway, if I can read this correctly, there is a page->mapping
anon_vma which is _not_ in the anon_vmas chain of the vma
(avc->same_vma).

And the spot we oops on is in page_referenced_anon():

	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {

which is actually where we iterate over all vmas associated with this
anon_vma. So if that previous anon_vma pointed to by page->mapping
has been falsely unlinked at some point, no wonder we boom on that later.

By the way, I completely understand when you say that your head hurts
from looking at this :).
[ 486.580872] Restarting tasks ... done. [ 494.167242] [drm] Resetting GPU [ 495.422354] ------------[ cut here ]------------ [ 495.422407] WARNING: at mm/memory.c:909 unmap_vmas+0x548/0xa29() [ 495.422442] Hardware name: System Product Name [ 495.422474] page->mapping does not exist in vma chain [ 495.422504] Modules linked in: [ 495.422545] ------------[ cut here ]------------ [ 495.422555] ------------[ cut here ]------------ [ 495.422565] powernow_k8 [ 495.422583] WARNING: at mm/memory.c:909 unmap_vmas+0x548/0xa29() [ 495.422591] cpufreq_ondemand [ 495.422597] Hardware name: System Product Name [ 495.422602] page->mapping does not exist in vma chain cpufreq_powersave [ 495.422612] Modules linked in: cpufreq_userspace powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table freq_table cpufreq_conservative cpufreq_conservative binfmt_misc binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt kvm_amd dm_mod 8250_pnp kvm 8250 serial_core edac_core pcspkr k10temp ohci_hcd [ 495.422676] ipv6Pid: 2919, comm: udevd Tainted: G W 2.6.34-rc3-00506-g6c62fe4 #1 [ 495.422689] Call Trace: [ 495.422694] vfat [ 495.422700] ------------[ cut here ]------------ [ 495.422721] WARNING: at mm/memory.c:909 unmap_vmas+0x548/0xa29() [ 495.422729] fat [<ffffffff81038fe0>] warn_slowpath_common+0x7c/0x94 [ 495.422746] dm_crypt [ 495.422751] Hardware name: System Product Name [ 495.422758] dm_modpage->mapping does not exist in vma chain [ 495.422767] Modules linked in: 8250_pnp [<ffffffff8103904f>] warn_slowpath_fmt+0x41/0x43 [ 495.422784] powernow_k8 cpufreq_ondemand 8250 cpufreq_powersave [<ffffffff810bcd20>] unmap_vmas+0x548/0xa29 [ 495.422807] serial_core cpufreq_userspace [<ffffffff810bd021>] ? unmap_vmas+0x849/0xa29 [ 495.422828] edac_core freq_table pcspkr cpufreq_conservative [<ffffffff810c17d8>] exit_mmap+0x102/0x1e4 [ 495.422851] binfmt_misc [<ffffffff810c173f>] ? 
exit_mmap+0x69/0x1e4 [ 495.422863] k10temp [<ffffffff810368bc>] mmput+0x48/0xb9 [ 495.422876] kvm_amd [<ffffffff8103ad90>] exit_mm+0x110/0x11d [ 495.422889] ohci_hcd kvm [ 495.422903] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5 [ 495.422909] ipv6Pid: 2916, comm: udevd Tainted: G W 2.6.34-rc3-00506-g6c62fe4 #1 [ 495.422927] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9 [ 495.422934] Call Trace: [ 495.422940] vfat [<ffffffff8141016d>] ? retint_swapgs+0xe/0x13 [ 495.422956] fat [<ffffffff81038fe0>] warn_slowpath_common+0x7c/0x94 [ 495.422972] dm_crypt dm_mod 8250_pnp [<ffffffff8103cf8a>] do_group_exit+0x84/0xb0 [ 495.422989] 8250 serial_core [<ffffffff8103cfcd>] sys_exit_group+0x17/0x1b [ 495.423013] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b [ 495.423019] edac_core [ 495.423025] ---[ end trace d9664ac54d1edb0e ]--- [ 495.423031] pcspkr k10temp ohci_hcd [ 495.423043] Pid: 2914, comm: udevd Tainted: G W 2.6.34-rc3-00506-g6c62fe4 #1 [ 495.423055] [<ffffffff8103904f>] warn_slowpath_fmt+0x41/0x43 [ 495.423063] Call Trace: [ 495.423073] [<ffffffff810bcd20>] unmap_vmas+0x548/0xa29 [ 495.423087] [<ffffffff81038fe0>] warn_slowpath_common+0x7c/0x94 [ 495.423100] [<ffffffff810bd021>] ? unmap_vmas+0x849/0xa29 [ 495.423111] [<ffffffff8103904f>] warn_slowpath_fmt+0x41/0x43 [ 495.423123] [<ffffffff810bcd20>] unmap_vmas+0x548/0xa29 [ 495.423134] [<ffffffff810bd021>] ? unmap_vmas+0x849/0xa29 [ 495.423147] [<ffffffff810c17d8>] exit_mmap+0x102/0x1e4 [ 495.423159] [<ffffffff810c17d8>] exit_mmap+0x102/0x1e4 [ 495.423172] [<ffffffff810c173f>] ? exit_mmap+0x69/0x1e4 [ 495.423184] [<ffffffff810c173f>] ? 
exit_mmap+0x69/0x1e4 [ 495.423194] [<ffffffff810368bc>] mmput+0x48/0xb9 [ 495.423204] [<ffffffff810368bc>] mmput+0x48/0xb9 [ 495.423214] [<ffffffff8103ad90>] exit_mm+0x110/0x11d [ 495.423225] [<ffffffff8103ad90>] exit_mm+0x110/0x11d [ 495.423236] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5 [ 495.423246] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5 [ 495.423266] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9 [ 495.423277] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9 [ 495.423292] [<ffffffff8141016d>] ? retint_swapgs+0xe/0x13 [ 495.423303] [<ffffffff8141016d>] ? retint_swapgs+0xe/0x13 [ 495.423315] [<ffffffff8103cf8a>] do_group_exit+0x84/0xb0 [ 495.423325] [<ffffffff8103cfcd>] sys_exit_group+0x17/0x1b [ 495.423334] [<ffffffff8103cf8a>] do_group_exit+0x84/0xb0 [ 495.423346] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b [ 495.423357] [<ffffffff8103cfcd>] sys_exit_group+0x17/0x1b [ 495.423365] ---[ end trace d9664ac54d1edb0f ]--- [ 495.423386] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b [ 495.423402] ---[ end trace d9664ac54d1edb10 ]--- [ 495.424191] WARNING: at mm/memory.c:909 unmap_vmas+0x548/0xa29() [ 495.424215] Hardware name: System Product Name [ 495.424238] page->mapping does not exist in vma chain [ 495.424259] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 serial_core edac_core pcspkr k10temp ohci_hcd [ 495.424693] Pid: 1923, comm: udevd Tainted: G W 2.6.34-rc3-00506-g6c62fe4 #1 [ 495.424723] Call Trace: [ 495.424758] [<ffffffff81038fe0>] warn_slowpath_common+0x7c/0x94 [ 495.424788] [<ffffffff8103904f>] warn_slowpath_fmt+0x41/0x43 [ 495.424816] [<ffffffff810bcd20>] unmap_vmas+0x548/0xa29 [ 495.424843] [<ffffffff810bd021>] ? unmap_vmas+0x849/0xa29 [ 495.424875] [<ffffffff810c17d8>] exit_mmap+0x102/0x1e4 [ 495.424901] [<ffffffff810c173f>] ? 
exit_mmap+0x69/0x1e4 [ 495.424926] [<ffffffff810368bc>] mmput+0x48/0xb9 [ 495.424954] [<ffffffff8103ad90>] exit_mm+0x110/0x11d [ 495.424981] [<ffffffff8103c9e6>] do_exit+0x1c5/0x6e5 [ 495.425008] [<ffffffff81065387>] ? trace_hardirqs_off_caller+0x1f/0xa9 [ 495.425038] [<ffffffff8141016d>] ? retint_swapgs+0xe/0x13 [ 495.425065] [<ffffffff8103cf8a>] do_group_exit+0x84/0xb0 [ 495.425091] [<ffffffff8103cfcd>] sys_exit_group+0x17/0x1b [ 495.425119] [<ffffffff8100221b>] system_call_fastpath+0x16/0x1b [ 495.425156] ---[ end trace d9664ac54d1edb11 ]--- -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12  7:20 ` Borislav Petkov
@ 2010-04-12 16:02 ` Linus Torvalds
  2010-04-12 16:26 ` Linus Torvalds
  0 siblings, 1 reply; 231+ messages in thread
From: Linus Torvalds @ 2010-04-12 16:02 UTC (permalink / raw)
To: Borislav Petkov
Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Mon, 12 Apr 2010, Borislav Petkov wrote:
>
>
> > If the warnings do happen, they are not going to be printing out any
> > hugely informative data apart from the fact that the bad case happened at
> > all. But if they do trigger, I can try to improve on them - it's just not
> > worth trying to make them any more interesting if they never trigger.
>
> Haa, I think you're gonna want to improve them :)
>
> 	WARN_ONCE(1, "page->mapping does not exist in vma chain");
>
> triggered on the first resume showing a rather messy 4 WARN_ONCEs. Had I
> more cores, there maybe would've been more of them :) Maybe need locking
> if clean output is of interest (see below).

Goodie. I can't trigger this on my machine (not that I tried very hard -
but I did do some swapping loads etc by limiting my memory to just 1GB
etc). So I'm pretty sure my verification code is "correct", and verifies
things that should be right.

And the fact that it triggers under the exact load that you use to then
trigger the bug is a damn good thing. That means that we are finally on
the right track, and we have something that correlates well with the
actual bug.

> So, anyway, if I can read this correctly, there is a page->mapping
> anon_vma which is _not_ in the anon_vmas chain of the vma
> (avc->same_vma).

Yes, and that is supposed to be a no-no.
The page is clearly associated with the vma in question (since we are unmapping it through that vma), but the vma list of 'anon_vma's doesn't actually have the one that 'page->mapping' points to. And that, in turn, means that we've lost sight of the 'page->mapping' anon_vma, and THAT in turn means that it could well have been free'd as being no longer referenced. And if it was free'd, it could be re-allocated as something else (after the RCU grace period), and that directly explains your oops. > By the way, I completely understand when you say that your head hurts > from looking at this :). Well, I have to say that I'm happy I've spent the time on it, because this way I got to learn all the new rules. It's just that I really wish I wouldn't have _had_ to. Anyway, I'll have to think way more about this to see if I can come up with a debugging patch that shows more details about what actually caused this to happen in the first place. But we definitely have a smoking gun. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 16:02 ` Linus Torvalds
@ 2010-04-12 16:26 ` Linus Torvalds
  2010-04-12 18:40 ` Rik van Riel
  2010-04-12 21:50 ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Borislav Petkov
  0 siblings, 2 replies; 231+ messages in thread
From: Linus Torvalds @ 2010-04-12 16:26 UTC (permalink / raw)
To: Borislav Petkov
Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Mon, 12 Apr 2010, Linus Torvalds wrote:
>
> Yes, and that is supposed to be a no-no. The page is clearly associated
> with the vma in question (since we are unmapping it through that vma), but
> the vma list of 'anon_vma's doesn't actually have the one that
> 'page->mapping' points to.
>
> And that, in turn, means that we've lost sight of the 'page->mapping'
> anon_vma, and THAT in turn means that it could well have been free'd as
> being no longer referenced.
>
> And if it was free'd, it could be re-allocated as something else (after
> the RCU grace period), and that directly explains your oops.

I have a new theory. And this new theory is completely different from all
the other things we've been looking at.

The new theory is really simple: 'page->mapping' has been re-set to the
wrong mapping.

Now, there is one case where we reset page->mapping _intentionally_,
namely in the COW-breaking case of having the last user
("page_move_anon_rmap"). And that looks fine, and happens under normal
loads all the time. We _want_ to do it there.

But there is a _much_ more subtle case that involved swapping.

So guys, here's my fairly simple theory on what happens:

 - page gets allocated/mapped by process A. Let's call the anon_vma we
   associate the page with 'A' to keep it easy to track.

 - Process A forks, creating process B. The anon_vma in B is 'B', and has
   a chain that looks like 'B' -> 'A'. Everything is fine.

 - Swapping happens. The page (with mapping pointing to 'A') gets swapped
   out (perhaps not to disk - it's enough to assume that it's just not
   mapped any more, and lives entirely in the swap-cache)

 - Process B pages it in, which goes like this:

	do_swap_page ->
	  page = lookup_swap_cache(entry);
	  ...
	  set_pte_at(mm, address, page_table, pte);
	  page_add_anon_rmap(page, vma, address);

   And think about what happens here!

   In particular, what happens is that this will now be the "first"
   mapping of that page, so page_add_anon_rmap() will do

	if (first)
		__page_set_anon_rmap(page, vma, address);

   and notice what anon_vma it will use? It will use the anon_vma for
   process B!

   So now page->mapping actually points to anon_vma 'B', not 'A' like it
   used to.

What happens then? Trivial: process 'A' also pages it in (nothing happens,
it's not the first mapping), and then process 'B' execve's or exits or
unmaps, making anon_vma B go away.

End result: process A has a page that points to anon_vma B, but anon_vma B
does not exist any more. This can go on forever. Forget about RCU grace
periods, forget about locking, forget anything like that. The bug is
simply that page->mapping points to an anon_vma that was correct at one
point, but was _not_ the one that was shared by all users of that possible
mapping.

The patch below is my largely mindless try at fixing this. It's untested.
I'm not entirely sure that it actually works. But it makes some amount of
conceptual sense. No?
		Linus

---
 mm/rmap.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index ee97d38..4bad326 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page,
 static void __page_set_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
 {
-	struct anon_vma *anon_vma = vma->anon_vma;
+	struct anon_vma_chain *avc;
+	struct anon_vma *anon_vma;
+
+	BUG_ON(!vma->anon_vma);
+
+	/*
+	 * We must use the _oldest_ possible anon_vma for the page mapping!
+	 *
+	 * So take the last AVC chain entry in the vma, which is the deepest
+	 * ancestor, and use the anon_vma from that.
+	 */
+	avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
+	anon_vma = avc->anon_vma;
 
-	BUG_ON(!anon_vma);
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 	page->mapping = (struct address_space *) anon_vma;
 	page->index = linear_page_index(vma, address);

^ permalink raw reply related [flat|nested] 231+ messages in thread
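
[ Illustration, not part of the thread: the fork / swap-out / swap-in / exit
sequence described in the message above can be replayed in plain userspace C.
This is only a toy sketch under stated assumptions: every toy_* name is
hypothetical, a vma's chain is modelled as a small array with the newest
anon_vma first and the oldest ancestor last, and "freeing" an anon_vma is
just a flag. It is not the kernel code. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CHAIN 4

/* Hypothetical stand-in for struct anon_vma. */
struct toy_anon_vma {
	bool freed;	/* set once the owning process has gone away */
};

/* Hypothetical stand-in for a vma and its anon_vma chain. */
struct toy_vma {
	/* chain[0] is the vma's own (newest) anon_vma, chain[nr - 1] the oldest */
	struct toy_anon_vma *chain[TOY_MAX_CHAIN];
	size_t nr;
};

/* Hypothetical stand-in for struct page. */
struct toy_page {
	struct toy_anon_vma *mapping;
	int mapcount;
};

enum toy_pick { TOY_PICK_NEWEST, TOY_PICK_OLDEST };

/* Like page_add_anon_rmap(): only the first mapper (re)sets page->mapping. */
static void toy_add_anon_rmap(struct toy_page *page, struct toy_vma *vma,
			      enum toy_pick how)
{
	if (page->mapcount++ == 0)
		page->mapping = (how == TOY_PICK_OLDEST)
			? vma->chain[vma->nr - 1]
			: vma->chain[0];
}

/*
 * Replay: A maps the page, A forks B, the page is swapped out, B faults it
 * in first, A faults it in second, B exits.
 * Returns true if page->mapping ends up pointing at a freed anon_vma.
 */
static bool toy_replay(enum toy_pick how)
{
	struct toy_anon_vma A = { false }, B = { false };
	struct toy_vma vma_a = { { &A }, 1 };
	struct toy_vma vma_b = { { &B, &A }, 2 };	/* 'B' -> 'A' */
	struct toy_page page = { &A, 0 };	/* in swap cache, unmapped */

	toy_add_anon_rmap(&page, &vma_b, how);	/* B is the "first" mapping */
	toy_add_anon_rmap(&page, &vma_a, how);	/* A: not first, no change */

	page.mapcount--;			/* B exits and unmaps */
	B.freed = true;				/* anon_vma 'B' goes away */

	return page.mapping->freed;
}
```

Calling toy_replay(TOY_PICK_NEWEST) models the behaviour before the patch and
reports a dangling mapping; toy_replay(TOY_PICK_OLDEST) models the "use the
last AVC chain entry" choice and leaves the page pointing at 'A', which is
still alive because process A still maps the page.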
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 16:26 ` Linus Torvalds @ 2010-04-12 18:40 ` Rik van Riel 2010-04-12 19:00 ` Borislav Petkov 2010-04-12 21:50 ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Borislav Petkov 1 sibling, 1 reply; 231+ messages in thread From: Rik van Riel @ 2010-04-12 18:40 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/12/2010 12:26 PM, Linus Torvalds wrote: > But there is a _much_ more subtle case that involved swapping. > > So guys, here's my fairly simple theory on what happens: That bug looks entirely possible. Given that Borislav has heavy swapping going on, it is quite possible that this is the bug he has been triggering. > The patch below is my largely mindless try at fixing this. It's untested. > I'm not entirely sure that it actually works. But it makes some amount of > conceptual sense. No? The patch would help avoid the bug you described. It does have the drawback of moving all the pages of child processes back into the anon_vma of the parent process after swapin, even if they are privately owned pages by the child process. I am guessing it may need a check to see whether the page and swap slot are exclusively owned by the current process. Page or swap slot shared? => oldest anon_vma Page and swap slot exclusive? => newest anon_vma I suspect the easiest way to achieve this would be to pass a flag in from do_swap_page, where we already check this, a few lines above calling page_add_anon_rmap: if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) { pte = maybe_mkwrite(pte_mkdirty(pte), vma); flags &= ~FAULT_FLAG_WRITE; } ^ permalink raw reply [flat|nested] 231+ messages in thread
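
[ Illustration, not part of the thread: the shared/exclusive rule suggested in
the message above, written out as a tiny userspace C helper. This is a sketch
only; toy_pick_anon_vma and its "exclusive" parameter are hypothetical names,
and the chain is modelled as an array with the newest anon_vma first and the
oldest last. How the kernel would actually learn about exclusivity (for
example from reuse_swap_page()) is left out. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct anon_vma. */
struct toy_anon_vma {
	int id;
};

/*
 * chain[0] is the newest anon_vma, chain[nr - 1] the oldest ancestor.
 *
 *   page or swap slot shared    => oldest anon_vma
 *   page and swap slot exclusive => newest anon_vma
 */
static struct toy_anon_vma *
toy_pick_anon_vma(struct toy_anon_vma *const *chain, size_t nr, bool exclusive)
{
	assert(nr > 0);
	return exclusive ? chain[0] : chain[nr - 1];
}
```

With a two-entry chain { child, parent }, an exclusively owned page stays in
the child's anon_vma, while a page that may still be shared goes to the
parent's, which every sharer has on its chain.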
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 18:40 ` Rik van Riel @ 2010-04-12 19:00 ` Borislav Petkov 2010-04-12 19:17 ` Linus Torvalds 0 siblings, 1 reply; 231+ messages in thread From: Borislav Petkov @ 2010-04-12 19:00 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Rik van Riel <riel@redhat.com> Date: Mon, Apr 12, 2010 at 02:40:22PM -0400 > On 04/12/2010 12:26 PM, Linus Torvalds wrote: > > >But there is a _much_ more subtle case that involved swapping. > > > >So guys, here's my fairly simple theory on what happens: > > That bug looks entirely possible. Given that Borislav > has heavy swapping going on, it is quite possible that > this is the bug he has been triggering. Yeah, about that. I dunno whether you guys saw that but the machine has 8Gb of RAM and shouldn't be swapping, AFAIK. The largest mem usage I saw was 5Gb used, most of which pagecache. So I was kinda doubtful when Linus came up with the swapping theory earlier. I'll pay attention to the SwapCached in /proc/meminfo more to see whether we do any swapping. It could be that there is a small amount which is swapped out for whatever reason... Maybe that's the bug... But I'll give the patch a run anyway in an hour or so anyway. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA
  2010-04-12 19:00 ` Borislav Petkov
@ 2010-04-12 19:17 ` Linus Torvalds
  2010-04-12 20:22 ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Linus Torvalds
  0 siblings, 1 reply; 231+ messages in thread
From: Linus Torvalds @ 2010-04-12 19:17 UTC (permalink / raw)
To: Borislav Petkov
Cc: Rik van Riel, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

On Mon, 12 Apr 2010, Borislav Petkov wrote:
>
> But I'll give the patch a run anyway in an hour or so anyway.

Thanks. I suspect you will find that even if there is no actual disk IO
swapping going on during any of the normal loads, the shrink_all_memory()
thing in your hibernation event will cause swap to happen. Or at least
swap-cache entries to be done.

Oh, and I've decided that my rcu_read_lock() patch for the tlb_gather()
thing for unmapping is bogus. Exactly because the critical issue isn't
when the page is free'd (and page->mapping is cleared), but when the page
is unmapped (and page_mapped() clears). And that is done correctly even
with the delayed frees in tlb_gather. So adding the
rcu_read_lock/rcu_read_unlock around it all doesn't actually matter or
help.

So the patches that I think fix real bugs are

 - the anon_vma_prepare() fix to only share anon_vma's if they are
   singletons.

 - the vma_adjust() fix to copy the right anon_vma chains

 - the anon_vma_clone() fix to traverse the avc's in reverse order, so
   that the resulting cloned chain is the same as the original chain

   You got this patch as part of the "verify_vma()" patch, but the only
   part of that patch that matters is the one-liner that changes a
   "list_for_each_entry" to use the "_reverse()" version..

 - and that last patch to pick the right anon_vma when mapping a page
   (which could still be improved: the "insert new page" case does _not_
   have to take the oldest anon_vma, and Rik is correct that if we have
   an exclusive swap cache entry we could also take the top one)

I think I'll re-post all four patches with real commit messages, to get
ack's for them. I'd like to finally get the much delayed -rc4 out the
door.

Oh, and if that "pick the right anon_vma" patch doesn't fix it, I suspect
we'll have to revert the whole anon_vma changes for 2.6.34. It's getting
pretty late in the -rc series to fix this bug. I'm _hoping_ that I really
nailed it this time, and that we're ok, but if Borislav reports it still
happening, and people not having any other ideas, I think I'll just have
to do an -rc4 with it all reverted, and then we can try again for 35 if
somebody figures out the bug.

Hmm? I'd hate to revert it all now because of the hours I've put in
looking at the code (to the point that I feel I understand it), but at the
same time, if it was somebody else who was chasing this bug and not being
able to fix it, I'd tell them "revert it, it's too late". Amount of effort
spent doesn't matter if the bug still happens ;^(

		Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
* [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()
  2010-04-12 19:17 ` Linus Torvalds
@ 2010-04-12 20:22 ` Linus Torvalds
  2010-04-12 20:23 ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Linus Torvalds
  ` (4 more replies)
  0 siblings, 5 replies; 231+ messages in thread
From: Linus Torvalds @ 2010-04-12 20:22 UTC (permalink / raw)
To: Borislav Petkov
Cc: Rik van Riel, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, 10 Apr 2010 10:36:19 -0700
Subject: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare()

This changes the anon_vma reuse case to require that we only reuse
simple anon_vma's - ie the case when the vma only has a single anon_vma
associated with it.

This means that a reuse of an anon_vma from an adjacent vma will always
guarantee that both vma's are associated not only with the same
anon_vma, they will also have the same anon_vma chain (of just a single
entry in this case).

And since anon_vma re-use was the only case where the same anon_vma
might be associated with different chains of anon_vma's, we now have the
case that every vma that shares the same anon_vma will always also have
the same chain. That makes it much easier to think about merging vma's
that share the same anon_vma's: you can always just drop the other
anon_vma chain in anon_vma_merge() since you know that they are always
identical.

This also splits up the function to validate the anon_vma re-use, and
adds a lot of commentary about the possible races.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Ok, so I'm sending out this series of four patches, in the (perhaps
futile) hope that they will finally fix the problem that Borislav has been
so great at reporting.
I'd like to gather ack's, nak's and perhaps changelog improvement suggestions while doing this. mm/mmap.c | 86 ++++++++++++++++++++++++++++++++++++++++++++----------------- 1 files changed, 62 insertions(+), 24 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 75557c6..acb023e 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -825,6 +825,61 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, } /* + * Rough compatbility check to quickly see if it's even worth looking + * at sharing an anon_vma. + * + * They need to have the same vm_file, and the flags can only differ + * in things that mprotect may change. + * + * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that + * we can merge the two vma's. For example, we refuse to merge a vma if + * there is a vm_ops->close() function, because that indicates that the + * driver is doing some kind of reference counting. But that doesn't + * really matter for the anon_vma sharing case. + */ +static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b) +{ + return a->vm_end == b->vm_start && + mpol_equal(vma_policy(a), vma_policy(b)) && + a->vm_file == b->vm_file && + !((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC)) && + b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT); +} + +/* + * Do some basic sanity checking to see if we can re-use the anon_vma + * from 'old'. The 'a'/'b' vma's are in VM order - one of them will be + * the same as 'old', the other will be the new one that is trying + * to share the anon_vma. + * + * NOTE! This runs with mm_sem held for reading, so it is possible that + * the anon_vma of 'old' is concurrently in the process of being set up + * by another page fault trying to merge _that_. But that's ok: if it + * is being set up, that automatically means that it will be a singleton + * acceptable for merging, so we can do all of this optimistically. But + * we do that ACCESS_ONCE() to make sure that we never re-load the pointer. 
+ * + * IOW: that the "list_is_singular()" test on the anon_vma_chain only + * matters for the 'stable anon_vma' case (ie the thing we want to avoid + * is to return an anon_vma that is "complex" due to having gone through + * a fork). + * + * We also make sure that the two vma's are compatible (adjacent, + * and with the same memory policies). That's all stable, even with just + * a read lock on the mm_sem. + */ +static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_area_struct *a, struct vm_area_struct *b) +{ + if (anon_vma_compatible(a, b)) { + struct anon_vma *anon_vma = ACCESS_ONCE(old->anon_vma); + + if (anon_vma && list_is_singular(&old->anon_vma_chain)) + return anon_vma; + } + return NULL; +} + +/* * find_mergeable_anon_vma is used by anon_vma_prepare, to check * neighbouring vmas for a suitable anon_vma, before it goes off * to allocate a new anon_vma. It checks because a repetitive @@ -834,28 +889,16 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, */ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma) { + struct anon_vma *anon_vma; struct vm_area_struct *near; - unsigned long vm_flags; near = vma->vm_next; if (!near) goto try_prev; - /* - * Since only mprotect tries to remerge vmas, match flags - * which might be mprotected into each other later on. - * Neither mlock nor madvise tries to remerge at present, - * so leave their flags as obstructing a merge. - */ - vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC); - vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC); - - if (near->anon_vma && vma->vm_end == near->vm_start && - mpol_equal(vma_policy(vma), vma_policy(near)) && - can_vma_merge_before(near, vm_flags, - NULL, vma->vm_file, vma->vm_pgoff + - ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT))) - return near->anon_vma; + anon_vma = reusable_anon_vma(near, vma, near); + if (anon_vma) + return anon_vma; try_prev: /* * It is potentially slow to have to call find_vma_prev here. 
@@ -868,14 +911,9 @@ try_prev: if (!near) goto none; - vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC); - vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC); - - if (near->anon_vma && near->vm_end == vma->vm_start && - mpol_equal(vma_policy(near), vma_policy(vma)) && - can_vma_merge_after(near, vm_flags, - NULL, vma->vm_file, vma->vm_pgoff)) - return near->anon_vma; + anon_vma = reusable_anon_vma(near, near, vma); + if (anon_vma) + return anon_vma; none: /* * There's no absolute need to look only at touching neighbours: -- 1.7.1.rc1.dirty ^ permalink raw reply related [flat|nested] 231+ messages in thread
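For readers who want to poke at the reuse rule outside the kernel, here is a minimal userspace sketch of the two checks that patch 1/4 introduces. Everything named `toy_*` is hypothetical and exists only for this illustration: `toy_area` stands in for `struct vm_area_struct`, the chain is reduced to a length counter (so `chain_len == 1` plays the role of `list_is_singular()`), and the memory-policy comparison, `ACCESS_ONCE()` and all locking are left out. It is a model of the logic in the diff above, not the kernel code.

```c
#include <stddef.h>

/* Assumed values for the sketch; the real ones live in kernel headers. */
#define PAGE_SHIFT 12
#define VM_READ    0x1UL
#define VM_WRITE   0x2UL
#define VM_EXEC    0x4UL

/* Hypothetical stand-in for struct vm_area_struct. */
struct toy_area {
	unsigned long vm_start, vm_end, vm_pgoff, vm_flags;
	const void *vm_file;    /* NULL for anonymous memory */
	const void *anon_vma;   /* NULL until an anon_vma is attached */
	size_t chain_len;       /* number of entries on the anon_vma chain */
};

/*
 * Adjacent, same file, contiguous page offsets, and flags that differ
 * only in the bits mprotect may change.
 */
int toy_compatible(const struct toy_area *a, const struct toy_area *b)
{
	return a->vm_end == b->vm_start &&
	       a->vm_file == b->vm_file &&
	       !((a->vm_flags ^ b->vm_flags) & ~(VM_READ | VM_WRITE | VM_EXEC)) &&
	       b->vm_pgoff == a->vm_pgoff +
			((b->vm_start - a->vm_start) >> PAGE_SHIFT);
}

/*
 * 'old' is whichever of a/b already has an anon_vma. It is only handed
 * out when its chain has exactly one entry, i.e. it never went through
 * a fork.
 */
const void *toy_reusable(const struct toy_area *old,
			 const struct toy_area *a,
			 const struct toy_area *b)
{
	if (toy_compatible(a, b) && old->anon_vma && old->chain_len == 1)
		return old->anon_vma;
	return NULL;
}
```

The point of the singleton test is the one made in the changelog: if only single-entry chains are ever shared, two vma's with the same anon_vma necessarily have identical chains.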
* [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains 2010-04-12 20:22 ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Linus Torvalds @ 2010-04-12 20:23 ` Linus Torvalds 2010-04-12 20:23 ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Linus Torvalds ` (3 more replies) 2010-04-12 20:54 ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Rik van Riel ` (3 subsequent siblings) 4 siblings, 4 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-12 20:23 UTC (permalink / raw) To: Borislav Petkov Cc: Rik van Riel, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Linus Torvalds <torvalds@linux-foundation.org> Date: Sat, 10 Apr 2010 15:22:30 -0700 Subject: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains When we move the boundaries between two vma's due to things like mprotect, we need to make sure that the anon_vma of the pages that got moved from one vma to another gets properly copied around. And that was not always the case, in this rather hard-to-follow code sequence. Clarify the code, and fix it so that it copies the anon_vma from the right source. 
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- mm/mmap.c | 24 ++++++++---------------- 1 files changed, 8 insertions(+), 16 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index acb023e..f90ea92 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -507,11 +507,12 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start, struct address_space *mapping = NULL; struct prio_tree_root *root = NULL; struct file *file = vma->vm_file; - struct anon_vma *anon_vma = NULL; long adjust_next = 0; int remove_next = 0; if (next && !insert) { + struct vm_area_struct *exporter = NULL; + if (end >= next->vm_end) { /* * vma expands, overlapping all the next, and @@ -519,7 +520,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start, */ again: remove_next = 1 + (end > next->vm_end); end = next->vm_end; - anon_vma = next->anon_vma; + exporter = next; importer = vma; } else if (end > next->vm_start) { /* @@ -527,7 +528,7 @@ again: remove_next = 1 + (end > next->vm_end); * mprotect case 5 shifting the boundary up. */ adjust_next = (end - next->vm_start) >> PAGE_SHIFT; - anon_vma = next->anon_vma; + exporter = next; importer = vma; } else if (end < vma->vm_end) { /* @@ -536,28 +537,19 @@ again: remove_next = 1 + (end > next->vm_end); * mprotect case 4 shifting the boundary down. */ adjust_next = - ((vma->vm_end - end) >> PAGE_SHIFT); - anon_vma = next->anon_vma; + exporter = vma; importer = next; } - } - /* - * When changing only vma->vm_end, we don't really need anon_vma lock. - */ - if (vma->anon_vma && (insert || importer || start != vma->vm_start)) - anon_vma = vma->anon_vma; - if (anon_vma) { /* * Easily overlooked: when mprotect shifts the boundary, * make sure the expanding vma has anon_vma set if the * shrinking vma had, to cover any anon pages imported. */ - if (importer && !importer->anon_vma) { - /* Block reverse map lookups until things are set up. 
*/ - if (anon_vma_clone(importer, vma)) { + if (exporter && exporter->anon_vma && !importer->anon_vma) { + if (anon_vma_clone(importer, exporter)) return -ENOMEM; - } - importer->anon_vma = anon_vma; + importer->anon_vma = exporter->anon_vma; } } -- 1.7.1.rc1.dirty ^ permalink raw reply related [flat|nested] 231+ messages in thread
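The exporter/importer selection in patch 2/4 can be modelled in a few lines of plain C. This is a hedged sketch only: `toy_vma` and `toy_adjust_anon` are made-up names, the `insert` argument, the actual boundary updates, the `remove_next` handling and the chain cloning done by `anon_vma_clone()` are all omitted, and the anon_vma is just an opaque pointer. It shows nothing more than which side gives its anon_vma to which in the three cases of the diff.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct vm_area_struct. */
struct toy_vma {
	unsigned long vm_start, vm_end;
	const void *anon_vma;
};

/*
 * Pick exporter and importer the way the patched vma_adjust() does for
 * the "next && !insert" case, then let the importer inherit the
 * exporter's anon_vma if it has none of its own.
 * Returns the importer, or NULL if no boundary moved between the two.
 */
struct toy_vma *toy_adjust_anon(struct toy_vma *vma, struct toy_vma *next,
				unsigned long end)
{
	struct toy_vma *exporter = NULL, *importer = NULL;

	if (end >= next->vm_end) {
		/* vma expands over all of next */
		exporter = next;
		importer = vma;
	} else if (end > next->vm_start) {
		/* vma expands, next shrinks (boundary moves up) */
		exporter = next;
		importer = vma;
	} else if (end < vma->vm_end) {
		/* vma shrinks, next expands (boundary moves down) */
		exporter = vma;
		importer = next;
	}

	if (exporter && exporter->anon_vma && !importer->anon_vma)
		importer->anon_vma = exporter->anon_vma;

	return importer;
}
```

In every case the anon_vma travels in the same direction as the pages: the vma that gives up address range is the exporter, and the vma that gains it is the importer.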
* [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order 2010-04-12 20:23 ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Linus Torvalds @ 2010-04-12 20:23 ` Linus Torvalds 2010-04-12 20:23 ` [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma Linus Torvalds ` (3 more replies) 2010-04-12 20:54 ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Rik van Riel ` (2 subsequent siblings) 3 siblings, 4 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-12 20:23 UTC (permalink / raw) To: Borislav Petkov Cc: Rik van Riel, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Linus Torvalds <torvalds@linux-foundation.org> Date: Sun, 11 Apr 2010 17:15:03 -0700 Subject: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order We want to walk the chain in reverse order when cloning it, so that the order of the result chain will be the same as the order in the source chain. When we add entries to the chain, they go at the head of the chain, so we want to add the source head last. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- mm/rmap.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index eaa7a09..ee97d38 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -182,7 +182,7 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) { struct anon_vma_chain *avc, *pavc; - list_for_each_entry(pavc, &src->anon_vma_chain, same_vma) { + list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) { avc = anon_vma_chain_alloc(); if (!avc) goto enomem_failure; -- 1.7.1.rc1.dirty ^ permalink raw reply related [flat|nested] 231+ messages in thread
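The ordering argument in patch 3/4 is easy to check with a toy list. The sketch below is a userspace model under stated assumptions: `toy_list` is a hand-rolled circular doubly-linked list with head insertion, mimicking what `list_add()` does, and the `toy_*` helpers are invented for this illustration. Cloning by walking the source forward and inserting at the head reverses the order; walking it in reverse preserves it.

```c
#include <stddef.h>

/* Minimal circular doubly-linked list node; the head is a sentinel. */
struct toy_list {
	struct toy_list *next, *prev;
	int id;
};

void toy_list_init(struct toy_list *head)
{
	head->next = head->prev = head;
}

/* Insert 'entry' right after 'head', i.e. at the front of the list. */
void toy_list_add(struct toy_list *entry, struct toy_list *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/*
 * Copy every element of 'src' onto 'dst' using head insertion.
 * 'pool' supplies the storage for the copies. If 'reverse' is non-zero
 * the source is walked tail to head, as the patched code does.
 */
void toy_clone(struct toy_list *dst, struct toy_list *src,
	       struct toy_list *pool, int reverse)
{
	struct toy_list *pos = reverse ? src->prev : src->next;

	while (pos != src) {
		pool->id = pos->id;
		toy_list_add(pool++, dst);
		pos = reverse ? pos->prev : pos->next;
	}
}

/* Return 1 if both lists hold the same ids in the same order. */
int toy_same_order(const struct toy_list *a, const struct toy_list *b)
{
	const struct toy_list *pa = a->next, *pb = b->next;

	while (pa != a && pb != b) {
		if (pa->id != pb->id)
			return 0;
		pa = pa->next;
		pb = pb->next;
	}
	return pa == a && pb == b;
}

/* Build a three-entry source chain, clone it, and compare the orders. */
int toy_clone_preserves_order(int reverse)
{
	struct toy_list src, dst, nodes[3], pool[3];
	int i;

	toy_list_init(&src);
	toy_list_init(&dst);
	for (i = 0; i < 3; i++) {
		nodes[i].id = i + 1;
		toy_list_add(&nodes[i], &src);	/* src ends up as 3, 2, 1 */
	}
	toy_clone(&dst, &src, pool, reverse);
	return toy_same_order(&src, &dst);
}
```

This matters for patch 4/4, which relies on the last chain entry being the deepest ancestor in the clone as well as in the source.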
* [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma 2010-04-12 20:23 ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Linus Torvalds @ 2010-04-12 20:23 ` Linus Torvalds 2010-04-12 21:03 ` Rik van Riel 2010-04-13 0:41 ` Johannes Weiner 2010-04-12 20:57 ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Rik van Riel ` (2 subsequent siblings) 3 siblings, 2 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-12 20:23 UTC (permalink / raw) To: Borislav Petkov Cc: Rik van Riel, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 12 Apr 2010 12:44:29 -0700 Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma Otherwise we might be mapping in a page in a new mapping, but that page (through the swapcache) would later be mapped into an old mapping too. The page->mapping must be the case that works for everybody, not just the mapping that happened to page it in first. This can be improved in certain cases: if we know the page is private to just this particular mapping (for example, it's a new page, or it is the only swapcache entry), we could pick the top (most specific) anon_vma. But that's a future optimization. Make it _work_ reliably first. 
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- mm/rmap.c | 15 +++++++++++++-- 1 files changed, 13 insertions(+), 2 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index ee97d38..4bad326 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page, static void __page_set_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address) { - struct anon_vma *anon_vma = vma->anon_vma; + struct anon_vma_chain *avc; + struct anon_vma *anon_vma; + + BUG_ON(!vma->anon_vma); + + /* + * We must use the _oldest_ possible anon_vma for the page mapping! + * + * So take the last AVC chain entry in the vma, which is the deepest + * ancestor, and use the anon_vma from that. + */ + avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma); + anon_vma = avc->anon_vma; - BUG_ON(!anon_vma); anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; page->mapping = (struct address_space *) anon_vma; page->index = linear_page_index(vma, address); -- 1.7.1.rc1.dirty ^ permalink raw reply related [flat|nested] 231+ messages in thread
* Re: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma 2010-04-12 20:23 ` [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma Linus Torvalds @ 2010-04-12 21:03 ` Rik van Riel 2010-04-13 0:41 ` Johannes Weiner 1 sibling, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-12 21:03 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/12/2010 04:23 PM, Linus Torvalds wrote: > > From: Linus Torvalds<torvalds@linux-foundation.org> > Date: Mon, 12 Apr 2010 12:44:29 -0700 > Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma > > Otherwise we might be mapping in a page in a new mapping, but that page > (through the swapcache) would later be mapped into an old mapping too. > The page->mapping must be the case that works for everybody, not just > the mapping that happened to page it in first. > > This can be improved in certain cases: if we know the page is private to > just this particular mapping (for example, it's a new page, or it is the > only swapcache entry), we could pick the top (most specific) anon_vma. > > But that's a future optimization. Make it _work_ reliably first. Agreed. I'll send an incremental for that later, you can judge whether or not it's something you'll want to merge before or after 2.6.34 > Signed-off-by: Linus Torvalds<torvalds@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma 2010-04-12 20:23 ` [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma Linus Torvalds 2010-04-12 21:03 ` Rik van Riel @ 2010-04-13 0:41 ` Johannes Weiner 2010-04-13 1:08 ` Linus Torvalds 1 sibling, 1 reply; 231+ messages in thread From: Johannes Weiner @ 2010-04-13 0:41 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Rik van Riel, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Mon, Apr 12, 2010 at 01:23:50PM -0700, Linus Torvalds wrote: > > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Mon, 12 Apr 2010 12:44:29 -0700 > Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma > > Otherwise we might be mapping in a page in a new mapping, but that page > (through the swapcache) would later be mapped into an old mapping too. > The page->mapping must be the case that works for everybody, not just > the mapping that happened to page it in first. > > This can be improved in certain cases: if we know the page is private to > just this particular mapping (for example, it's a new page, or it is the > only swapcache entry), we could pick the top (most specific) anon_vma. > > But that's a future optimization. Make it _work_ reliably first. > > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Would you mind pasting that nice description of the error case from your other email into that changelog? I skimmed over the description but when I read this patch several hours later, I had to go back to that previous email to fully make sense of it. 
> --- > mm/rmap.c | 15 +++++++++++++-- > 1 files changed, 13 insertions(+), 2 deletions(-) > > diff --git a/mm/rmap.c b/mm/rmap.c > index ee97d38..4bad326 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page, > static void __page_set_anon_rmap(struct page *page, > struct vm_area_struct *vma, unsigned long address) > { > - struct anon_vma *anon_vma = vma->anon_vma; > + struct anon_vma_chain *avc; > + struct anon_vma *anon_vma; > + > + BUG_ON(!vma->anon_vma); > + > + /* > + * We must use the _oldest_ possible anon_vma for the page mapping! I think the key here is not only that it's the oldest (past) but also that it's the one with the longest extent (future), so that it's bound to stay until the last possible mapping for this page vanishes. Maybe it's just me, but I doubt the comment as it is would help me understand that code if I didn't already. > + * > + * So take the last AVC chain entry in the vma, which is the deepest > + * ancestor, and use the anon_vma from that. > + */ > + avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma); > + anon_vma = avc->anon_vma; > > - BUG_ON(!anon_vma); > anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; > page->mapping = (struct address_space *) anon_vma; > page->index = linear_page_index(vma, address); Hannes ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma 2010-04-13 0:41 ` Johannes Weiner @ 2010-04-13 1:08 ` Linus Torvalds 2010-04-13 4:23 ` Minchan Kim 0 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-13 1:08 UTC (permalink / raw) To: Johannes Weiner Cc: Borislav Petkov, Rik van Riel, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, 13 Apr 2010, Johannes Weiner wrote: > > Would you mind pasting that nice description of the error case from your > other email into that changelog? I skimmed over the description but when > I read this patch several hours later, I had to go back to that previous > email to fully make sense of it. It now looks like this.. Linus --- From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 12 Apr 2010 12:44:29 -0700 Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma Otherwise we might be mapping in a page in a new mapping, but that page (through the swapcache) would later be mapped into an old mapping too. The page->mapping must be the case that works for everybody, not just the mapping that happened to page it in first. Here's the scenario: - page gets allocated/mapped by process A. Let's call the anon_vma we associate the page with 'A' to keep it easy to track. - Process A forks, creating process B. The anon_vma in B is 'B', and has a chain that looks like 'B' -> 'A'. Everything is fine. - Swapping happens. The page (with mapping pointing to 'A') gets swapped out (perhaps not to disk - it's enough to assume that it's just not mapped any more, and lives entirely in the swap-cache) - Process B pages it in, which goes like this: do_swap_page -> page = lookup_swap_cache(entry); ... set_pte_at(mm, address, page_table, pte); page_add_anon_rmap(page, vma, address); And think about what happens here! 
In particular, what happens is that this will now be the "first" mapping of that page, so page_add_anon_rmap() used to do if (first) __page_set_anon_rmap(page, vma, address); and notice what anon_vma it will use? It will use the anon_vma for process B! What happens then? Trivial: process 'A' also pages it in (nothing happens, it's not the first mapping), and then process 'B' execve's or exits or unmaps, making anon_vma B go away. End result: process A has a page that points to anon_vma B, but anon_vma B does not exist any more. This can go on forever. Forget about RCU grace periods, forget about locking, forget anything like that. The bug is simply that page->mapping points to an anon_vma that was correct at one point, but was _not_ the one that was shared by all users of that possible mapping. Changing it to always use the deepest anon_vma in the anonvma chain gets us to the safest model. This can be improved in certain cases: if we know the page is private to just this particular mapping (for example, it's a new page, or it is the only swapcache entry), we could pick the top (most specific) anon_vma. But that's a future optimization. Make it _work_ reliably first. Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Borislav Petkov <bp@alien8.de> [ "What do you know, I think you fixed it!" ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- mm/rmap.c | 15 +++++++++++++-- 1 files changed, 13 insertions(+), 2 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index ee97d38..4bad326 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page, static void __page_set_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address) { - struct anon_vma *anon_vma = vma->anon_vma; + struct anon_vma_chain *avc; + struct anon_vma *anon_vma; + + BUG_ON(!vma->anon_vma); + + /* + * We must use the _oldest_ possible anon_vma for the page mapping! 
+ * + * So take the last AVC chain entry in the vma, which is the deepest + * ancestor, and use the anon_vma from that. + */ + avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma); + anon_vma = avc->anon_vma; - BUG_ON(!anon_vma); anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; page->mapping = (struct address_space *) anon_vma; page->index = linear_page_index(vma, address); -- 1.7.1.rc1.dirty ^ permalink raw reply related [flat|nested] 231+ messages in thread
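The fork/swap scenario from the changelog above can be replayed with a very small model. This is an illustrative sketch, not kernel code: `toy_anon_vma`, `toy_chain` and the helper names are hypothetical, the chain is a fixed array ordered as in the changelog ('B' -> 'A', newest first), and an `alive` flag stands in for the anon_vma's lifetime. It only shows why taking the last chain entry leaves page->mapping pointing at something that outlives process B.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct anon_vma; 'alive' models its lifetime. */
struct toy_anon_vma {
	char name;
	int alive;
};

/*
 * Hypothetical per-vma chain: entry[0] is the vma's own anon_vma,
 * entry[len - 1] is the deepest ancestor.
 */
struct toy_chain {
	struct toy_anon_vma *entry[4];
	size_t len;
};

/* Old behaviour: use the vma's own (most specific) anon_vma. */
struct toy_anon_vma *toy_pick_newest(const struct toy_chain *c)
{
	return c->entry[0];
}

/* Patched behaviour: use the last chain entry, the oldest ancestor. */
struct toy_anon_vma *toy_pick_oldest(const struct toy_chain *c)
{
	return c->entry[c->len - 1];
}

/*
 * Replay the scenario: A forks B, the page is unmapped into the swap
 * cache, B faults it back in first and so sets page->mapping, then B
 * exits while A still maps the page.
 * Returns 1 if page->mapping still refers to a live anon_vma.
 */
int toy_mapping_survives(int use_oldest)
{
	struct toy_anon_vma a = { 'A', 1 }, b = { 'B', 1 };
	struct toy_chain chain_b = { { &b, &a }, 2 };
	struct toy_anon_vma *page_mapping;

	page_mapping = use_oldest ? toy_pick_oldest(&chain_b)
				  : toy_pick_newest(&chain_b);

	b.alive = 0;	/* process B exits, anon_vma 'B' goes away */

	return page_mapping->alive;
}
```

With `use_oldest` set, the mapping refers to 'A' and stays valid after B is gone; without it, the model reproduces the dangling reference described in the changelog.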
* Re: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma 2010-04-13 1:08 ` Linus Torvalds @ 2010-04-13 4:23 ` Minchan Kim 2010-04-13 4:26 ` Minchan Kim 0 siblings, 1 reply; 231+ messages in thread From: Minchan Kim @ 2010-04-13 4:23 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, Borislav Petkov, Rik van Riel, KOSAKI Motohiro, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, Apr 13, 2010 at 10:08 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Tue, 13 Apr 2010, Johannes Weiner wrote: >> >> Would you mind pasting that nice description of the error case from your >> other email into that changelog? I skimmed over the description but when >> I read this patch several hours later, I had to go back to that previous >> email to fully make sense of it. > > It now looks like this.. > > Linus > --- > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Mon, 12 Apr 2010 12:44:29 -0700 > Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma > > Otherwise we might be mapping in a page in a new mapping, but that page > (through the swapcache) would later be mapped into an old mapping too. > The page->mapping must be the case that works for everybody, not just > the mapping that happened to page it in first. > > Here's the scenario: > > - page gets allocated/mapped by process A. Let's call the anon_vma we > associate the page with 'A' to keep it easy to track. > > - Process A forks, creating process B. The anon_vma in B is 'B', and has > a chain that looks like 'B' -> 'A'. Everything is fine. > > - Swapping happens. 
The page (with mapping pointing to 'A') gets swapped > out (perhaps not to disk - it's enough to assume that it's just not > mapped any more, and lives entirely in the swap-cache) > > - Process B pages it in, which goes like this: > > do_swap_page -> > page = lookup_swap_cache(entry); > ... > set_pte_at(mm, address, page_table, pte); > page_add_anon_rmap(page, vma, address); > > And think about what happens here! > > In particular, what happens is that this will now be the "first" > mapping of that page, so page_add_anon_rmap() used to do > > if (first) > __page_set_anon_rmap(page, vma, address); > > and notice what anon_vma it will use? It will use the anon_vma for > process B! > > What happens then? Trivial: process 'A' also pages it in (nothing > happens, it's not the first mapping), and then process 'B' execve's > or exits or unmaps, making anon_vma B go away. > > End result: process A has a page that points to anon_vma B, but > anon_vma B does not exist any more. This can go on forever. Forget > about RCU grace periods, forget about locking, forget anything like > that. The bug is simply that page->mapping points to an anon_vma > that was correct at one point, but was _not_ the one that was shared > by all users of that possible mapping. > > Changing it to always use the deepest anon_vma in the anonvma chain gets > us to the safest model. > > This can be improved in certain cases: if we know the page is private to > just this particular mapping (for example, it's a new page, or it is the > only swapcache entry), we could pick the top (most specific) anon_vma. > > But that's a future optimization. Make it _work_ reliably first. > > Reviewed-by: Rik van Riel <riel@redhat.com> > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > Tested-by: Borislav Petkov <bp@alien8.de> [ "What do you know, I think you fixed it!" 
] > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim> It was a great hunt and a chance to learn many things from the smart people on LKML. Once again I feel the power of OSS and the great process by which Linux evolves. Thanks, everybody. -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma 2010-04-13 4:23 ` Minchan Kim @ 2010-04-13 4:26 ` Minchan Kim 0 siblings, 0 replies; 231+ messages in thread From: Minchan Kim @ 2010-04-13 4:26 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, Borislav Petkov, Rik van Riel, KOSAKI Motohiro, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, Apr 13, 2010 at 1:23 PM, Minchan Kim <minchan.kim@gmail.com> wrote: > On Tue, Apr 13, 2010 at 10:08 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> >> On Tue, 13 Apr 2010, Johannes Weiner wrote: >>> >>> Would you mind pasting that nice description of the error case from your >>> other email into that changelog? I skimmed over the description but when >>> I read this patch several hours later, I had to go back to that previous >>> email to fully make sense of it. >> >> It now looks like this.. >> >> Linus >> --- >> From: Linus Torvalds <torvalds@linux-foundation.org> >> Date: Mon, 12 Apr 2010 12:44:29 -0700 >> Subject: [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma >> >> Otherwise we might be mapping in a page in a new mapping, but that page >> (through the swapcache) would later be mapped into an old mapping too. >> The page->mapping must be the case that works for everybody, not just >> the mapping that happened to page it in first. >> >> Here's the scenario: >> >> - page gets allocated/mapped by process A. Let's call the anon_vma we >> associate the page with 'A' to keep it easy to track. >> >> - Process A forks, creating process B. The anon_vma in B is 'B', and has >> a chain that looks like 'B' -> 'A'. Everything is fine. >> >> - Swapping happens. 
The page (with mapping pointing to 'A') gets swapped >> out (perhaps not to disk - it's enough to assume that it's just not >> mapped any more, and lives entirely in the swap-cache) >> >> - Process B pages it in, which goes like this: >> >> do_swap_page -> >> page = lookup_swap_cache(entry); >> ... >> set_pte_at(mm, address, page_table, pte); >> page_add_anon_rmap(page, vma, address); >> >> And think about what happens here! >> >> In particular, what happens is that this will now be the "first" >> mapping of that page, so page_add_anon_rmap() used to do >> >> if (first) >> __page_set_anon_rmap(page, vma, address); >> >> and notice what anon_vma it will use? It will use the anon_vma for >> process B! >> >> What happens then? Trivial: process 'A' also pages it in (nothing >> happens, it's not the first mapping), and then process 'B' execve's >> or exits or unmaps, making anon_vma B go away. >> >> End result: process A has a page that points to anon_vma B, but >> anon_vma B does not exist any more. This can go on forever. Forget >> about RCU grace periods, forget about locking, forget anything like >> that. The bug is simply that page->mapping points to an anon_vma >> that was correct at one point, but was _not_ the one that was shared >> by all users of that possible mapping. >> >> Changing it to always use the deepest anon_vma in the anonvma chain gets >> us to the safest model. >> >> This can be improved in certain cases: if we know the page is private to >> just this particular mapping (for example, it's a new page, or it is the >> only swapcache entry), we could pick the top (most specific) anon_vma. >> >> But that's a future optimization. Make it _work_ reliably first. >> >> Reviewed-by: Rik van Riel <riel@redhat.com> >> Acked-by: Johannes Weiner <hannes@cmpxchg.org> >> Tested-by: Borislav Petkov <bp@alien8.de> [ "What do you know, I think you fixed it!" 
] >> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Sorry for the mistake. I was extremely excited. :) -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order 2010-04-12 20:23 ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Linus Torvalds 2010-04-12 20:23 ` [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma Linus Torvalds @ 2010-04-12 20:57 ` Rik van Riel 2010-04-13 0:18 ` Johannes Weiner 2010-04-13 4:16 ` Minchan Kim 3 siblings, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-12 20:57 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/12/2010 04:23 PM, Linus Torvalds wrote: > > From: Linus Torvalds<torvalds@linux-foundation.org> > Date: Sun, 11 Apr 2010 17:15:03 -0700 > Subject: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order > > We want to walk the chain in reverse order when cloning it, so that the > order of the result chain will be the same as the order in the source > chain. When we add entries to the chain, they go at the head of the > chain, so we want to add the source head last. > > Signed-off-by: Linus Torvalds<torvalds@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order 2010-04-12 20:23 ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Linus Torvalds 2010-04-12 20:23 ` [PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma Linus Torvalds 2010-04-12 20:57 ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Rik van Riel @ 2010-04-13 0:18 ` Johannes Weiner 2010-04-13 4:16 ` Minchan Kim 3 siblings, 0 replies; 231+ messages in thread From: Johannes Weiner @ 2010-04-13 0:18 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Rik van Riel, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Mon, Apr 12, 2010 at 01:23:24PM -0700, Linus Torvalds wrote: > > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Sun, 11 Apr 2010 17:15:03 -0700 > Subject: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order > > We want to walk the chain in reverse order when cloning it, so that the > order of the result chain will be the same as the order in the source > chain. When we add entries to the chain, they go at the head of the > chain, so we want to add the source head last. > > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order 2010-04-12 20:23 ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Linus Torvalds ` (2 preceding siblings ...) 2010-04-13 0:18 ` Johannes Weiner @ 2010-04-13 4:16 ` Minchan Kim 3 siblings, 0 replies; 231+ messages in thread From: Minchan Kim @ 2010-04-13 4:16 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Rik van Riel, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, Apr 13, 2010 at 5:23 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Sun, 11 Apr 2010 17:15:03 -0700 > Subject: [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order > > We want to walk the chain in reverse order when cloning it, so that the > order of the result chain will be the same as the order in the source > chain. When we add entries to the chain, they go at the head of the > chain, so we want to add the source head last. > > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains 2010-04-12 20:23 ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Linus Torvalds 2010-04-12 20:23 ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Linus Torvalds @ 2010-04-12 20:54 ` Rik van Riel 2010-04-12 23:59 ` Johannes Weiner 2010-04-13 4:15 ` Minchan Kim 3 siblings, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-12 20:54 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/12/2010 04:23 PM, Linus Torvalds wrote: > > From: Linus Torvalds<torvalds@linux-foundation.org> > Date: Sat, 10 Apr 2010 15:22:30 -0700 > Subject: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains > > When we move the boundaries between two vma's due to things like > mprotect, we need to make sure that the anon_vma of the pages that got > moved from one vma to another gets properly copied around. And that was > not always the case, in this rather hard-to-follow code sequence. > > Clarify the code, and fix it so that it copies the anon_vma from the > right source. > > Signed-off-by: Linus Torvalds<torvalds@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains 2010-04-12 20:23 ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Linus Torvalds 2010-04-12 20:23 ` [PATCH 3/4] anon_vma: clone the anon_vma chain in the right order Linus Torvalds 2010-04-12 20:54 ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Rik van Riel @ 2010-04-12 23:59 ` Johannes Weiner 2010-04-13 4:15 ` Minchan Kim 3 siblings, 0 replies; 231+ messages in thread From: Johannes Weiner @ 2010-04-12 23:59 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Rik van Riel, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Mon, Apr 12, 2010 at 01:23:04PM -0700, Linus Torvalds wrote: > > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Sat, 10 Apr 2010 15:22:30 -0700 > Subject: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains > > When we move the boundaries between two vma's due to things like > mprotect, we need to make sure that the anon_vma of the pages that got > moved from one vma to another gets properly copied around. And that was > not always the case, in this rather hard-to-follow code sequence. > > Clarify the code, and fix it so that it copies the anon_vma from the > right source. > > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains 2010-04-12 20:23 ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Linus Torvalds ` (2 preceding siblings ...) 2010-04-12 23:59 ` Johannes Weiner @ 2010-04-13 4:15 ` Minchan Kim 3 siblings, 0 replies; 231+ messages in thread From: Minchan Kim @ 2010-04-13 4:15 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Rik van Riel, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, Apr 13, 2010 at 5:23 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Sat, 10 Apr 2010 15:22:30 -0700 > Subject: [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains > > When we move the boundaries between two vma's due to things like > mprotect, we need to make sure that the anon_vma of the pages that got > moved from one vma to another gets properly copied around. And that was > not always the case, in this rather hard-to-follow code sequence. > > Clarify the code, and fix it so that it copies the anon_vma from the > right source. > > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() 2010-04-12 20:22 ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Linus Torvalds 2010-04-12 20:23 ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Linus Torvalds @ 2010-04-12 20:54 ` Rik van Riel 2010-04-12 23:54 ` Johannes Weiner ` (2 subsequent siblings) 4 siblings, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-12 20:54 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/12/2010 04:22 PM, Linus Torvalds wrote: > > From: Linus Torvalds<torvalds@linux-foundation.org> > Date: Sat, 10 Apr 2010 10:36:19 -0700 > Subject: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() > Signed-off-by: Linus Torvalds<torvalds@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() 2010-04-12 20:22 ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Linus Torvalds 2010-04-12 20:23 ` [PATCH 2/4] vma_adjust: fix the copying of anon_vma chains Linus Torvalds 2010-04-12 20:54 ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Rik van Riel @ 2010-04-12 23:54 ` Johannes Weiner 2010-04-13 4:04 ` Minchan Kim 2010-04-13 9:51 ` Peter Zijlstra 4 siblings, 0 replies; 231+ messages in thread From: Johannes Weiner @ 2010-04-12 23:54 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Rik van Riel, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson Hi Linus, On Mon, Apr 12, 2010 at 01:22:33PM -0700, Linus Torvalds wrote: > > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Sat, 10 Apr 2010 10:36:19 -0700 > Subject: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() > > This changes the anon_vma reuse case to require that we only reuse > simple anon_vma's - ie the case when the vma only has a single anon_vma > associated with it. > > This means that a reuse of an anon_vma from an adjacent vma will always > guarantee that both vma's are associated not only with the same > anon_vma, they will also have the same anon_vma chain (of just a single > entry in this case). > > And since anon_vma re-use was the only case where the same anon_vma > might be associated with different chains of anon_vma's, we now have the > case that every vma that shares the same vma will always also have the ^^^ That should be anon_vma? > same chain. That makes it much easier to think about merging vma's that > share the same anon_vma's: you can always just drop the other anon_vma > chain in anon_vma_merge() since you know that they are always identical. 
I like to think of 'incomplete' and 'complete' versions of the same chain and that this new rule of yours simplifies things by limiting reuse to the cases where the incomplete and the complete version end up identical. I can live with your wording, though :) > This also splits up the function to validate the anon_vma re-use, and > adds a lot of commentary about the possible races. > > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> That said, I still don't like that the vma comparisons differ depending on whether we reuse an anon_vma or merge vmas. In my happy-place, the same vma comparison function is predicate for both cases, so I actually liked that aspect of the old code, but I also see that code reuse is a PITA in that file... Ah well, that can still be cleaned up later. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() 2010-04-12 20:22 ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Linus Torvalds ` (2 preceding siblings ...) 2010-04-12 23:54 ` Johannes Weiner @ 2010-04-13 4:04 ` Minchan Kim 2010-04-13 9:51 ` Peter Zijlstra 4 siblings, 0 replies; 231+ messages in thread From: Minchan Kim @ 2010-04-13 4:04 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Rik van Riel, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, Apr 13, 2010 at 5:22 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Sat, 10 Apr 2010 10:36:19 -0700 > Subject: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() > > This changes the anon_vma reuse case to require that we only reuse > simple anon_vma's - ie the case when the vma only has a single anon_vma > associated with it. > > This means that a reuse of an anon_vma from an adjacent vma will always > guarantee that both vma's are associated not only with the same > anon_vma, they will also have the same anon_vma chain (of just a single > entry in this case). > > And since anon_vma re-use was the only case where the same anon_vma > might be associated with different chains of anon_vma's, we now have the > case that every vma that shares the same vma will always also have the same vma => same anon_vma. > same chain. That makes it much easier to think about merging vma's that > share the same anon_vma's: you can always just drop the other anon_vma > chain in anon_vma_merge() since you know that they are always identical. > > This also splits up the function to validate the anon_vma re-use, and > adds a lot of commentary about the possible races. 
> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() 2010-04-12 20:22 ` [PATCH 1/4] Simplify and comment on anon_vma re-use for anon_vma_prepare() Linus Torvalds ` (3 preceding siblings ...) 2010-04-13 4:04 ` Minchan Kim @ 2010-04-13 9:51 ` Peter Zijlstra 4 siblings, 0 replies; 231+ messages in thread From: Peter Zijlstra @ 2010-04-13 9:51 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Rik van Riel, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Mon, 2010-04-12 at 13:22 -0700, Linus Torvalds wrote: > +static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b) > +{ > + return a->vm_end == b->vm_start && > + mpol_equal(vma_policy(a), vma_policy(b)) && > + a->vm_file == b->vm_file && > + !((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC)) && > + b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT); > +} Maybe write that as: static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b) { if (a->vm_end != b->vm_start) return 0; if (!mpol_equal(vma_policy(a), vma_policy(b))) return 0; if (a->vm_file != b->vm_file) return 0; if ((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC)) return 0; if (a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT) != b->vm_pgoff) return 0; return 1; } ^ permalink raw reply [flat|nested] 231+ messages in thread
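The early-return form suggested above can be tried out in userspace with mock types. What follows is a rough sketch, not kernel code: `struct vma_sketch` and its integer `policy` field are hypothetical stand-ins for `struct vm_area_struct` and the `mpol_equal()` comparison, and `PAGE_SHIFT` is assumed to be 12 (4 KiB pages). The parentheses are balanced here so that it compiles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12		/* assumption: 4 KiB pages */
#define VM_READ  0x1UL
#define VM_WRITE 0x2UL
#define VM_EXEC  0x4UL

/* Hypothetical, heavily reduced stand-in for struct vm_area_struct. */
struct vma_sketch {
	unsigned long vm_start, vm_end;	/* [vm_start, vm_end) in bytes */
	unsigned long vm_pgoff;		/* offset in pages */
	unsigned long vm_flags;
	const void *vm_file;
	int policy;			/* stands in for the NUMA policy */
};

/* True if b directly follows a and could share a's anon_vma. */
static bool anon_vma_compatible_sketch(const struct vma_sketch *a,
				       const struct vma_sketch *b)
{
	assert(a != NULL && b != NULL);

	/* b must start exactly where a ends. */
	if (a->vm_end != b->vm_start)
		return false;
	if (a->policy != b->policy)
		return false;
	if (a->vm_file != b->vm_file)
		return false;
	/* Flags may differ only in the read/write/exec bits. */
	if ((a->vm_flags ^ b->vm_flags) & ~(VM_READ | VM_WRITE | VM_EXEC))
		return false;
	/* The page offsets must line up with the address distance. */
	if (a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT) !=
	    b->vm_pgoff)
		return false;
	return true;
}
```

Each test gets its own line and its own comment, which is the readability gain the single `return` expression gives up.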
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 16:26 ` Linus Torvalds 2010-04-12 18:40 ` Rik van Riel @ 2010-04-12 21:50 ` Borislav Petkov 2010-04-12 22:11 ` Linus Torvalds 1 sibling, 1 reply; 231+ messages in thread From: Borislav Petkov @ 2010-04-12 21:50 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, Apr 12, 2010 at 09:26:57AM -0700 > I have a new theory. And this new theory is completely different from all > the other things we've been looking at. Yeah, because all starts with "I have a new theory..." :o) > The patch below is my largely mindless try at fixing this. It's untested. > I'm not entirely sure that it actually works. But it makes some amount of > conceptual sense. No? Linus, are you trying to give me a heart-attack? This sh*t just survived 20(!) hibernation runs without a problem (well, there is this nagging /sysfs lockdep warning) but apart from that, it survived! I even did my all time best when hitting on it. Normally, it used to crap up on the 6th cycle as latest. Now we're rock solid. And yes, there were something like ~64Mb in the swap cache. Also, I have your verification stuff in addition to the 4 patches you sent before. Not a single WARN_ONCE got triggered. So I have a gut feeling that it is fixed but you never know with these beasts. As before, I'll rebuild and reapply everything in the morning and retest just in case. And I guess I'll have to test all following -rc's so that we can be absolutely sure. So cheers! -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 21:50 ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Borislav Petkov @ 2010-04-12 22:11 ` Linus Torvalds 2010-04-12 22:18 ` Linus Torvalds 2010-04-13 9:38 ` Borislav Petkov 0 siblings, 2 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-12 22:11 UTC (permalink / raw) To: Borislav Petkov Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Mon, 12 Apr 2010, Borislav Petkov wrote: > > > I have a new theory. And this new theory is completely different from all > > the other things we've been looking at. > > Yeah, because all starts with "I have a new theory..." :o) Hey, all my other theories made sense too.. They just didn't work. But as Edison said: I didn't fail, I just found three other ways to not fix your bug. > > The patch below is my largely mindless try at fixing this. It's untested. > > I'm not entirely sure that it actually works. But it makes some amount of > > conceptual sense. No? > > Linus, are you trying to give me a heart-attack? This sh*t just survived > 20(!) hibernation runs without a problem (well, there is this nagging > /sysfs lockdep warning) but apart from that, it survived! I even did my > all time best when hitting on it. Normally, it used to crap up on the > 6th cycle as latest. Now we're rock solid. And yes, there were something > like ~64Mb in the swap cache. > > Also, I have your verification stuff in addition to the 4 patches you > sent before. Not a single WARN_ONCE got triggered. So I have a gut > feeling that it is fixed but you never know with these beasts. Ok. That does sound very positive. Of course, last time you sounded positive, I had an email from you half an hour later that said "oh no, it oopsed again". 
So I'll take it with a bit of salt, but on the whole I'll be optimistic about it. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 22:11 ` Linus Torvalds @ 2010-04-12 22:18 ` Linus Torvalds 2010-04-12 22:29 ` Borislav Petkov 2010-04-13 9:38 ` Borislav Petkov 1 sibling, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-12 22:18 UTC (permalink / raw) To: Borislav Petkov Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson Oh, btw, I like your email gateway. Only noticed now: mail.skyhub.de (SuperMail on ZX Spectrum 128k) that's a tough little machine. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 22:18 ` Linus Torvalds @ 2010-04-12 22:29 ` Borislav Petkov 0 siblings, 0 replies; 231+ messages in thread From: Borislav Petkov @ 2010-04-12 22:29 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, Apr 12, 2010 at 03:18:20PM -0700 > Oh, btw, I like your email gateway. Only noticed now: > > mail.skyhub.de (SuperMail on ZX Spectrum 128k) > > that's a tough little machine. Yeah, and it can handle all that mail traffic just fine :) -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 22:11 ` Linus Torvalds 2010-04-12 22:18 ` Linus Torvalds @ 2010-04-13 9:38 ` Borislav Petkov 2010-04-14 21:59 ` [PATCH] rmap: add exclusively owned pages to the newest anon_vma Rik van Riel 1 sibling, 1 reply; 231+ messages in thread From: Borislav Petkov @ 2010-04-13 9:38 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, Apr 12, 2010 at 03:11:53PM -0700 > Ok. That does sound very positive. Of course, last time you sounded > positive, I had an email from you half an hour later that said "oh no, it > oopsed again". So I'll take it with a bit of salt, but on the whole I'll > be optimistic about it. Ok, just finished testing -rc4 - no problems so far. Let's just go out on a limb here and say with a greater certainty that this really got fixed but be smart about it and keep an eye open if it happens again - you never know. Where is the champagne? -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* [PATCH] rmap: add exclusively owned pages to the newest anon_vma 2010-04-13 9:38 ` Borislav Petkov @ 2010-04-14 21:59 ` Rik van Riel 2010-04-14 23:20 ` Johannes Weiner ` (3 more replies) 0 siblings, 4 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-14 21:59 UTC (permalink / raw) To: Borislav Petkov Cc: Linus Torvalds, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson The recent anon_vma fixes cause many anonymous pages to end up in the parent process anon_vma, even when the page is exclusively owned by the current process. Adding exclusively owned anonymous pages to the top anon_vma reduces rmap scanning overhead, especially in workloads with forking servers. This patch adds a parameter to __page_set_anon_rmap that can be used to indicate whether or not the added page is exclusively owned by the current process. Pages added through page_add_new_anon_rmap are exclusively owned by the current process, and can be added to the top anon_vma. Pages added through page_add_anon_rmap can be either shared or exclusively owned, so we do the conservative thing and add it to the oldest anon_vma. A next step would be to add the exclusive parameter to page_add_anon_rmap, to be used from functions where we do know for sure whether a page is exclusively owned. Signed-off-by: Rik van Riel <riel@redhat.com> --- Borislav, I audited the code before making this change, but would still appreciate your testing of this patch :) Linus, once this patch survives Borislav's testing, I'll start looking at the next step. I'd like to do things one step at a time so I won't cause another regression... 
mm/rmap.c | 30 +++++++++++++++++++----------- 1 files changed, 19 insertions(+), 11 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 4bad326..12ac0f1 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -730,23 +730,31 @@ void page_move_anon_rmap(struct page *page, * @page: the page to add the mapping to * @vma: the vm area in which the mapping is added * @address: the user virtual address mapped + * @exclusive: the page is exclusively owned by the current process */ static void __page_set_anon_rmap(struct page *page, - struct vm_area_struct *vma, unsigned long address) + struct vm_area_struct *vma, unsigned long address, int exclusive) { struct anon_vma_chain *avc; struct anon_vma *anon_vma; BUG_ON(!vma->anon_vma); - /* - * We must use the _oldest_ possible anon_vma for the page mapping! - * - * So take the last AVC chain entry in the vma, which is the deepest - * ancestor, and use the anon_vma from that. - */ - avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma); - anon_vma = avc->anon_vma; + if (exclusive) + anon_vma = vma->anon_vma; + else { + /* + * The page may be shared between multiple processes. + * We must use the _oldest_ possible anon_vma for the + * page mapping! That anon_vma is guaranteed to be + * present in all processes that could share this page. + * + * So take the last AVC chain entry in the vma, which is the + * deepest ancestor, and use the anon_vma from that. 
+ */ + avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma); + anon_vma = avc->anon_vma; + } anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; page->mapping = (struct address_space *) anon_vma; @@ -802,7 +810,7 @@ void page_add_anon_rmap(struct page *page, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end); if (first) - __page_set_anon_rmap(page, vma, address); + __page_set_anon_rmap(page, vma, address, 0); else __page_check_anon_rmap(page, vma, address); } @@ -824,7 +832,7 @@ void page_add_new_anon_rmap(struct page *page, SetPageSwapBacked(page); atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */ __inc_zone_page_state(page, NR_ANON_PAGES); - __page_set_anon_rmap(page, vma, address); + __page_set_anon_rmap(page, vma, address, 1); if (page_evictable(page, vma)) lru_cache_add_lru(page, LRU_ACTIVE_ANON); else ^ permalink raw reply related [flat|nested] 231+ messages in thread
* Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma 2010-04-14 21:59 ` [PATCH] rmap: add exclusively owned pages to the newest anon_vma Rik van Riel @ 2010-04-14 23:20 ` Johannes Weiner 2010-04-15 8:34 ` Borislav Petkov ` (2 subsequent siblings) 3 siblings, 0 replies; 231+ messages in thread From: Johannes Weiner @ 2010-04-14 23:20 UTC (permalink / raw) To: Rik van Riel Cc: Borislav Petkov, Linus Torvalds, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Wed, Apr 14, 2010 at 05:59:28PM -0400, Rik van Riel wrote: > The recent anon_vma fixes cause many anonymous pages to end up > in the parent process anon_vma, even when the page is exclusively > owned by the current process. > > Adding exclusively owned anonymous pages to the top anon_vma > reduces rmap scanning overhead, especially in workloads with > forking servers. > > This patch adds a parameter to __page_set_anon_rmap that can > be used to indicate whether or not the added page is exclusively > owned by the current process. > > Pages added through page_add_new_anon_rmap are exclusively > owned by the current process, and can be added to the top > anon_vma. > > Pages added through page_add_anon_rmap can be either shared > or exclusively owned, so we do the conservative thing and > add it to the oldest anon_vma. > > A next step would be to add the exclusive parameter to > page_add_anon_rmap, to be used from functions where we do > know for sure whether a page is exclusively owned. > > Signed-off-by: Rik van Riel <riel@redhat.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma 2010-04-14 21:59 ` [PATCH] rmap: add exclusively owned pages to the newest anon_vma Rik van Riel 2010-04-14 23:20 ` Johannes Weiner @ 2010-04-15 8:34 ` Borislav Petkov 2010-04-15 16:02 ` Minchan Kim 2010-04-15 20:01 ` Linus Torvalds 3 siblings, 0 replies; 231+ messages in thread From: Borislav Petkov @ 2010-04-15 8:34 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Rik van Riel <riel@redhat.com> Date: Wed, Apr 14, 2010 at 05:59:28PM -0400 > The recent anon_vma fixes cause many anonymous pages to end up > in the parent process anon_vma, even when the page is exclusively > owned by the current process. > > Adding exclusively owned anonymous pages to the top anon_vma > reduces rmap scanning overhead, especially in workloads with > forking servers. > > This patch adds a parameter to __page_set_anon_rmap that can > be used to indicate whether or not the added page is exclusively > owned by the current process. > > Pages added through page_add_new_anon_rmap are exclusively > owned by the current process, and can be added to the top > anon_vma. > > Pages added through page_add_anon_rmap can be either shared > or exclusively owned, so we do the conservative thing and > add it to the oldest anon_vma. > > A next step would be to add the exclusive parameter to > page_add_anon_rmap, to be used from functions where we do > know for sure whether a page is exclusively owned. > > Signed-off-by: Rik van Riel <riel@redhat.com> > --- > Borislav, I audited the code before making this change, but would > still appreciate your testing of this patch :) Just did some light hammering and it looks ok so far. I'll keep watching out for oopsies/issues. Lightly-tested-by: Borislav Petkov <bp@alien8.de> -- Regards/Gruss, Boris. 
^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma 2010-04-14 21:59 ` [PATCH] rmap: add exclusively owned pages to the newest anon_vma Rik van Riel 2010-04-14 23:20 ` Johannes Weiner 2010-04-15 8:34 ` Borislav Petkov @ 2010-04-15 16:02 ` Minchan Kim 2010-04-15 20:01 ` Linus Torvalds 3 siblings, 0 replies; 231+ messages in thread From: Minchan Kim @ 2010-04-15 16:02 UTC (permalink / raw) To: Rik van Riel Cc: Borislav Petkov, Linus Torvalds, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Thu, Apr 15, 2010 at 6:59 AM, Rik van Riel <riel@redhat.com> wrote: > The recent anon_vma fixes cause many anonymous pages to end up > in the parent process anon_vma, even when the page is exclusively > owned by the current process. > > Adding exclusively owned anonymous pages to the top anon_vma > reduces rmap scanning overhead, especially in workloads with > forking servers. > > This patch adds a parameter to __page_set_anon_rmap that can > be used to indicate whether or not the added page is exclusively > owned by the current process. > > Pages added through page_add_new_anon_rmap are exclusively > owned by the current process, and can be added to the top > anon_vma. > > Pages added through page_add_anon_rmap can be either shared > or exclusively owned, so we do the conservative thing and > add it to the oldest anon_vma. > > A next step would be to add the exclusive parameter to > page_add_anon_rmap, to be used from functions where we do > know for sure whether a page is exclusively owned. > > Signed-off-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma 2010-04-14 21:59 ` [PATCH] rmap: add exclusively owned pages to the newest anon_vma Rik van Riel ` (2 preceding siblings ...) 2010-04-15 16:02 ` Minchan Kim @ 2010-04-15 20:01 ` Linus Torvalds 2010-04-16 6:09 ` Felipe Balbi 3 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-15 20:01 UTC (permalink / raw) To: Rik van Riel Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Wed, 14 Apr 2010, Rik van Riel wrote: > - /* > - * We must use the _oldest_ possible anon_vma for the page mapping! > - * > - * So take the last AVC chain entry in the vma, which is the deepest > - * ancestor, and use the anon_vma from that. > - */ > - avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma); > - anon_vma = avc->anon_vma; > + if (exclusive) > + anon_vma = vma->anon_vma; > + else { > + /* > + * The page may be shared between multiple processes. > + * We must use the _oldest_ possible anon_vma for the > + * page mapping! That anon_vma is guaranteed to be > + * present in all processes that could share this page. > + * > + * So take the last AVC chain entry in the vma, which is the > + * deepest ancestor, and use the anon_vma from that. > + */ > + avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma); > + anon_vma = avc->anon_vma; > + } I really dislike your coding style. If we do this conditionally, we're _much_ better off declaring the variables we only use inside that conditional block inside the block itself. And since we access "vma->anon_vma" in either case, just move that case outside the conditional statement, and avoid a pointless if/then/else. IOW, something like this. Totally untested. 
Linus --- mm/rmap.c | 26 +++++++++++++++----------- 1 files changed, 15 insertions(+), 11 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 4bad326..78d4730 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -732,21 +732,25 @@ void page_move_anon_rmap(struct page *page, * @address: the user virtual address mapped */ static void __page_set_anon_rmap(struct page *page, - struct vm_area_struct *vma, unsigned long address) + struct vm_area_struct *vma, unsigned long address, int exclusive) { - struct anon_vma_chain *avc; - struct anon_vma *anon_vma; + struct anon_vma *anon_vma = vma->anon_vma; - BUG_ON(!vma->anon_vma); + BUG_ON(!anon_vma); /* - * We must use the _oldest_ possible anon_vma for the page mapping! + * If the page isn't exclusively mapped into this vma, + * we must use the _oldest_ possible anon_vma for the + * page mapping! * - * So take the last AVC chain entry in the vma, which is the deepest - * ancestor, and use the anon_vma from that. + * So take the last AVC chain entry in the vma, which is + * the deepest ancestor, and use the anon_vma from that. 
*/ - avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma); - anon_vma = avc->anon_vma; + if (!exclusive) { + struct anon_vma_chain *avc; + avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma); + anon_vma = avc->anon_vma; + } anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; page->mapping = (struct address_space *) anon_vma; @@ -802,7 +806,7 @@ void page_add_anon_rmap(struct page *page, VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end); if (first) - __page_set_anon_rmap(page, vma, address); + __page_set_anon_rmap(page, vma, address, 0); else __page_check_anon_rmap(page, vma, address); } @@ -824,7 +828,7 @@ void page_add_new_anon_rmap(struct page *page, SetPageSwapBacked(page); atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */ __inc_zone_page_state(page, NR_ANON_PAGES); - __page_set_anon_rmap(page, vma, address); + __page_set_anon_rmap(page, vma, address, 1); if (page_evictable(page, vma)) lru_cache_add_lru(page, LRU_ACTIVE_ANON); else ^ permalink raw reply related [flat|nested] 231+ messages in thread
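The selection logic both versions of the patch implement can be sketched outside the kernel. This is a simplified model under stated assumptions: the chain is represented as a plain array whose last element is the oldest ancestor (the kernel uses a linked list of `anon_vma_chain` entries), and `anon_vma_sketch`, `vma_sketch` and `pick_anon_vma()` are hypothetical names that do not exist in `mm/rmap.c`.

```c
#include <assert.h>
#include <stddef.h>

struct anon_vma_sketch {
	int id;
};

/* Hypothetical reduced vma: its own anon_vma plus the ancestor chain. */
struct vma_sketch {
	struct anon_vma_sketch *anon_vma;	/* newest, owned by this process */
	struct anon_vma_sketch **chain;		/* chain[chain_len - 1] is the oldest */
	size_t chain_len;
};

/*
 * Start from the vma's own anon_vma and fall back to the oldest
 * ancestor only when the page may be shared with other processes.
 */
static struct anon_vma_sketch *pick_anon_vma(const struct vma_sketch *vma,
					     int exclusive)
{
	struct anon_vma_sketch *anon_vma = vma->anon_vma;

	assert(anon_vma != NULL);
	if (!exclusive) {
		assert(vma->chain_len > 0);
		anon_vma = vma->chain[vma->chain_len - 1];
	}
	return anon_vma;
}
```

Initialising the variable first and overriding it in a single conditional mirrors the shape of the second version of the patch: the chain lookup and its temporary live only inside the block that needs them.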
* Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma 2010-04-15 20:01 ` Linus Torvalds @ 2010-04-16 6:09 ` Felipe Balbi 2010-04-16 14:48 ` Linus Torvalds 0 siblings, 1 reply; 231+ messages in thread From: Felipe Balbi @ 2010-04-16 6:09 UTC (permalink / raw) To: ext Linus Torvalds Cc: Rik van Riel, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson Hi, On Thu, Apr 15, 2010 at 10:01:11PM +0200, ext Linus Torvalds wrote: >+ avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma); while at that, would it make sense to first provide list_last_entry() since we already have list_first_entry() ?? totally unrelated to this patch, sorry -- balbi ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH] rmap: add exclusively owned pages to the newest anon_vma 2010-04-16 6:09 ` Felipe Balbi @ 2010-04-16 14:48 ` Linus Torvalds 0 siblings, 0 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-16 14:48 UTC (permalink / raw) To: Felipe Balbi Cc: Rik van Riel, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Fri, 16 Apr 2010, Felipe Balbi wrote: > > while at that, would it make sense to first provide list_last_entry() since we > already have list_first_entry() ?? Yeah, it probably would make sense. Especially as doing a simple grep for 'list_entry.*prev' does seem to imply that there might be quite a few places that would be able to use it. Although some of them do seem to be about finding the previous entry rather than the last in a list. That said, doing the same grep for 'next' shows that a lot of places don't use the list_first_entry() that we _do_ have, so.. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
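A `list_last_entry()` helper along the lines discussed above might look like the following userspace sketch. It is an assumption about how such a macro could be written, mirroring the `list_first_entry()` pattern. The `container_of()` here is a simplified version without the kernel's type checking, and `struct item`, `list_add_tail_sketch()` and `last_item()` are hypothetical names used only for the demonstration.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Simplified container_of(): no type checking, unlike the kernel's. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_entry(ptr, type, member) \
	container_of(ptr, type, member)

#define list_first_entry(head, type, member) \
	list_entry((head)->next, type, member)

/* The proposed counterpart: the last element is the head's ->prev. */
#define list_last_entry(head, type, member) \
	list_entry((head)->prev, type, member)

struct item {
	int value;
	struct list_head node;
};

/* Append a node at the tail of a circular list (hypothetical helper). */
static void list_add_tail_sketch(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Like list_first_entry(), this is only valid on a non-empty list. */
static struct item *last_item(struct list_head *head)
{
	assert(head->prev != head);
	return list_last_entry(head, struct item, node);
}
```

With such a macro the line quoted above would read `avc = list_last_entry(&vma->anon_vma_chain, struct anon_vma_chain, same_vma);`, which states the intent ("last in the list") rather than the mechanism (`.prev`).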
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-11 17:16 ` Linus Torvalds 2010-04-11 18:55 ` Borislav Petkov @ 2010-04-11 19:49 ` Rik van Riel 2010-04-12 15:44 ` Linus Torvalds 2010-04-11 21:45 ` Rik van Riel 2 siblings, 1 reply; 231+ messages in thread From: Rik van Riel @ 2010-04-11 19:49 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/11/2010 01:16 PM, Linus Torvalds wrote: > NOTE! If this is the race, then the hack really is just a hack, because it > doesn't really solve anything. We still take the spinlock, and if bad > things has happened, _that_ can still very much fail, and you get the > watchdog lockup message instead. So this doesn't really fix anything. Looking around the code some more, zap_pte_range() calls page_remove_rmap(), which leaves the page->mapping in place and has this comment: /* * It would be tidy to reset the PageAnon mapping here, * but that might overwrite a racing page_add_anon_rmap * which increments mapcount after us but sets mapping * before us: so leave the reset to free_hot_cold_page, * and remember that it's only reliable while mapped. * Leaving it set also helps swapoff to reinstate ptes * faster for those pages still in swapcache. */ I wonder if we can clear page->mapping here, if list_is_singular(anon_vma->head). That way we will not leave stale pointers behind. Adding another VMA to the anon_vma can happen at fork time - which will not happen simultaneously with exit or munmap, because the mmap_sem is taken for write during either code path. Am I overlooking something here? ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-11 19:49 ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel @ 2010-04-12 15:44 ` Linus Torvalds 2010-04-12 15:51 ` Rik van Riel 0 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-12 15:44 UTC (permalink / raw) To: Rik van Riel Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sun, 11 Apr 2010, Rik van Riel wrote: > > Looking around the code some more, zap_pte_range() > calls page_remove_rmap(), which leaves the > page->mapping in place and has this comment: See my earlier email about this exact issue. It's well-known that there are stale page->mapping pointers. The "page_mapped()" check _should_ have meant that in that case we never follow them, though. > I wonder if we can clear page->mapping here, if > list_is_singular(anon_vma->head). That way we > will not leave stale pointers behind. What does that help? What if list _isn't_ singular? Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 15:44 ` Linus Torvalds @ 2010-04-12 15:51 ` Rik van Riel 0 siblings, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-12 15:51 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/12/2010 11:44 AM, Linus Torvalds wrote: > On Sun, 11 Apr 2010, Rik van Riel wrote: >> >> Looking around the code some more, zap_pte_range() >> calls page_remove_rmap(), which leaves the >> page->mapping in place and has this comment: > > See my earlier email about this exact issue. It's well-known that there > are stale page->mapping pointers. The "page_mapped()" check _should_ have > meant that in that case we never follow them, though. Good point. I wonder if we have some SMP reordering issue then? >> I wonder if we can clear page->mapping here, if >> list_is_singular(anon_vma->head). That way we >> will not leave stale pointers behind. > > What does that help? What if list _isn't_ singular? Yeah, that was a bad idea. Looking at the same code for 11 days straight seems to have put some knots in my brain :) ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-11 17:16 ` Linus Torvalds 2010-04-11 18:55 ` Borislav Petkov 2010-04-11 19:49 ` [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel @ 2010-04-11 21:45 ` Rik van Riel 2010-04-12 15:51 ` Linus Torvalds 2 siblings, 1 reply; 231+ messages in thread From: Rik van Riel @ 2010-04-11 21:45 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/11/2010 01:16 PM, Linus Torvalds wrote: > Actually, so if it's that race, then we might get rid of the oops with > this total hack. Another thing I just thought of. The anon_vma struct will not be reused for something completely different due to the SLAB_DESTROY_BY_RCU flag that the anon_vma_cachep is created with. The anon_vma_chain structs are allocated from a slab without that flag, so they can be reused for something else in the middle of an RCU section. Is that something worth fixing, or is this so subtle that we'd rather not have the code rely on this kind of behaviour at all? ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-11 21:45 ` Rik van Riel @ 2010-04-12 15:51 ` Linus Torvalds 2010-04-13 10:36 ` KOSAKI Motohiro 0 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-12 15:51 UTC (permalink / raw) To: Rik van Riel Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sun, 11 Apr 2010, Rik van Riel wrote: > > Another thing I just thought of. > > The anon_vma struct will not be reused for something completely > different due to the SLAB_DESTROY_BY_RCU flag that the anon_vma_cachep > is created with. Rik, we _know_ it got re-used by something totally different. That's clearly the problem. The page->mapping pointer does _not_ point to an anon_vma any more. That's the problem here. What we need to figure out is how we have a page on the LRU list that is still marked as 'mapped' that has that stale mapping pointer. I can easily see how the stale mapping pointer happens for a non-mapped page. That part is trivial. Here's a simple case: - vmscan does that whole "isolate LRU pages", and one of them is a (at that time mapped) anonymous page. It's now not on any LRU lists at all. - vmscan ends up waiting for pageout and/or writeback while holding that list of pages. - in the meantime, the process that had the page exits or unmaps, unmapping the page and freeing the vma and the anon_vma. - vmscan eventually gets to the page, and does that page_referenced() dance. page->mapping points to something that is long long gone (as in "IO access lifetimes", so we're talking something that has been freed literally milliseconds ago, rather than any RCU delays) So I can see the stale page->mapping pointer happening. That part is even trivial. What I don't see is how the page would be still marked 'mapped'. 
Everything that actually free's the vma/anon_vmas should also have unmapped the page before that - even if it didn't _free_ the page. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 15:51 ` Linus Torvalds @ 2010-04-13 10:36 ` KOSAKI Motohiro 0 siblings, 0 replies; 231+ messages in thread From: KOSAKI Motohiro @ 2010-04-13 10:36 UTC (permalink / raw) To: Linus Torvalds Cc: kosaki.motohiro, Rik van Riel, Borislav Petkov, Johannes Weiner, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson Hi Linus, > On Sun, 11 Apr 2010, Rik van Riel wrote: > > > > Another thing I just thought of. > > > > The anon_vma struct will not be reused for something completely > > different due to the SLAB_DESTROY_BY_RCU flag that the anon_vma_cachep > > is created with. > > Rik, we _know_ it got re-used by something totally different. That's > clearly the problem. The page->mapping pointer does _not_ point to an > anon_vma any more. That's the problem here. > > What we need to figure out is how we have a page on the LRU list that is > still marked as 'mapped' that has that stale mapping pointer. > > I can easily see how the stale mapping pointer happens for a non-mapped > page. That part is trivial. Here's a simple case: > > - vmscan does that whole "isolate LRU pages", and one of them is a (at > that time mapped) anonymous page. It's now not on any LRU lists at all. > > - vmscan ends up waiting for pageout and/or writeback while holding that > list of pages. > > - in the meantime, the process that had the page exists or unmaps, > unmapping the page and freeing the vma and the anon_vma. > > - vmscan eventually gets to the page, and does that page_referenced() > dance. page->mapping points to something that is long long gone (as in > "IO access lifetimes", so we're talking something that has been freed > literally milliseconds ago, rather than any RCU delays) > > So I can see the stale page->mapping pointer happening. That part is even > trivial. 
What I don't see is how the page would be still marked 'mapped'. > Everything that actually free's the vma/anon_vmas should also have > unmapped the page before that - even if it didn't _free_ the page. Sorry, I have now lost track of what is being discussed in this crazy long thread. IIUC, if the anon_vma that page->mapping points to was freed milliseconds ago, the page_mapped() check at (1) below returns false and we never actually dereference page->mapping. Am I missing something? =================================================================== struct anon_vma *page_lock_anon_vma(struct page *page) { struct anon_vma *anon_vma; unsigned long anon_mapping; rcu_read_lock(); anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping); if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) goto out; if (!page_mapped(page)) /* (1) here */ goto out; anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); spin_lock(&anon_vma->lock); return anon_vma; out: rcu_read_unlock(); return NULL; } ================================================= And I think your patch below is incorrect. The added page_mapped() check is called after spin_lock(&anon_vma->lock), which means it is a check-after-dereference. Such a check doesn't prevent an invalid pointer dereference, I think. Perhaps I'm missing something. I'll have to reread this whole thread from the start. --- diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -302,7 +302,11 @@ struct anon_vma *page_lock_anon_vma(struct page *page) anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); spin_lock(&anon_vma->lock); - return anon_vma; + + if (page_mapped(page)) + return anon_vma; + + spin_unlock(&anon_vma->lock); out: rcu_read_unlock(); return NULL; ^ permalink raw reply [flat|nested] 231+ messages in thread
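[ Editorial sketch, not part of the thread: the check order in the quoted page_lock_anon_vma() can be illustrated with a single-threaded userspace toy. Everything prefixed `toy_` is hypothetical; the two flag values are assumptions chosen for the toy, and "taking the lock" is reduced to setting a field. The toy only shows that the tag test and the mapped test both run before the pointer is followed; it does not model the SMP races the thread is actually chasing. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed toy values: the low pointer bits tag the kind of mapping. */
#define PAGE_MAPPING_ANON  ((uintptr_t)1)
#define PAGE_MAPPING_FLAGS ((uintptr_t)3)

struct toy_anon_vma {
	int lock_taken;		/* stands in for spin_lock(&anon_vma->lock) */
};

struct toy_page {
	uintptr_t mapping;	/* tagged pointer, like page->mapping */
	int mapcount;		/* starts at -1, like page->_mapcount */
};

/* Store the anon_vma pointer with the ANON tag in its low bits. */
static void toy_set_anon(struct toy_page *page, struct toy_anon_vma *av)
{
	page->mapping = (uintptr_t)av + PAGE_MAPPING_ANON;
}

static bool toy_page_mapped(const struct toy_page *page)
{
	return page->mapcount >= 0;
}

/* Mirrors the order of checks in the quoted function. */
static struct toy_anon_vma *toy_lock_anon_vma(struct toy_page *page)
{
	uintptr_t mapping = page->mapping;
	struct toy_anon_vma *av;

	if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		return NULL;
	if (!toy_page_mapped(page))	/* check (1): before any dereference */
		return NULL;

	av = (struct toy_anon_vma *)(mapping - PAGE_MAPPING_ANON);
	av->lock_taken = 1;		/* first access through the pointer */
	return av;
}

/* Returns 0 if the toy behaves as described. */
int toy_lock_demo(void)
{
	struct toy_anon_vma av = { 0 };
	struct toy_page page = { 0, -1 };

	/* Unmapped page with a leftover mapping: pointer is never followed. */
	toy_set_anon(&page, &av);
	assert(toy_lock_anon_vma(&page) == NULL);
	assert(av.lock_taken == 0);

	/* Mapped page: the pointer is decoded and "locked". */
	page.mapcount = 0;
	assert(toy_lock_anon_vma(&page) == &av);
	assert(av.lock_taken == 1);

	/* Untagged (file-style) mapping: rejected by the tag test. */
	page.mapping = (uintptr_t)&av;
	assert(toy_lock_anon_vma(&page) == NULL);
	return 0;
}
```

[ In this toy an unmapped page is always rejected at (1), which is the point made in the message above. The open question in the thread is how a page can still look mapped while its mapping pointer is already stale. ]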
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 20:05 ` Linus Torvalds 2010-04-10 20:12 ` Linus Torvalds @ 2010-04-10 20:24 ` Rik van Riel 2010-04-10 20:34 ` Linus Torvalds 2010-04-10 20:32 ` Rik van Riel 2 siblings, 1 reply; 231+ messages in thread From: Rik van Riel @ 2010-04-10 20:24 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/10/2010 04:05 PM, Linus Torvalds wrote: > And vma_adjust is the one place that does that anon_vma_merge(), which is > apart from the actual unmapping sequence the only other place that > actually free's anon_vmas. So there are reasons to be very suspicious of > that code. It frees anon_vma_chain structures, but not actual anon_vmas. Walking the anon_vma (from rmap) requires the anon_vma->lock, which is taken in anon_vma_merge whenever a chain is unlinked. > And I think that code can actually lose an anon_vma chain. It's totally > screwing up the "import anonvma" case: when it does > > if (anon_vma_clone(importer, vma)) { > return -ENOMEM; > } > importer->anon_vma = anon_vma; > > we can actually have "importer == vma", but "anon_vma = next->anon_vma". A few lines up from that code, we have: if (vma->anon_vma && (insert || importer || start != vma->vm_start)) anon_vma = vma->anon_vma; So anon_vma should always be vma->anon_vma. If we have already imported an anon_vma, we will not do so twice, because of the !importer->anon_vma check. What am I overlooking? > In which case we actually end up with an _empty_ chain (because importer > didn't have a chain to begin with!) but "importer->anon_vma" points to an > anon_vma. If we import a chain, from vma to importer, importer->anon_vma will be equal to vma->anon_vma. I do not see how 'importer' could get a state different from 'vma'. 
> Also, the conditional nesting makes no sense (the whole anon_vma_clone() > only makes sense if importer is set, and it is only ever set _inside_ the > earlier if-statement, so the whole code should be moved inside there), nor > does some of the comments. No argument there, vma_adjust is very hard to read and it took me a few days to convince myself that my changes kept things equivalent to how they were before. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 20:24 ` Rik van Riel @ 2010-04-10 20:34 ` Linus Torvalds 2010-04-10 20:43 ` Rik van Riel 0 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-10 20:34 UTC (permalink / raw) To: Rik van Riel Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sat, 10 Apr 2010, Rik van Riel wrote: > On 04/10/2010 04:05 PM, Linus Torvalds wrote: > > > And vma_adjust is the one place that does that anon_vma_merge(), which is > > apart from the actual unmapping sequence the only other place that > > actually free's anon_vmas. So there are reasons to be very suspicious of > > that code. > > It frees anon_vma_chain structures, but not actual anon_vmas. Rik, I think you're ignoring the fact that the anon_vma_chain is also the implicit refcount. So when you don't create the chains, you implicitly end up freeing the anon_vma too early. In fact, it might well happen at that 'anon_vma_merge()': when it does the unlink_anon_vmas(), it may be unlinking the last remaining anon_vma ref, and then anon_vma_unlink _will_ in fact free the anon_vma. Even though we have a 'vma->anon_vma' pointer that points to it - because the chains weren't set up correctly. > Walking the anon_vma (from rmap) requires the anon_vma->lock, > which is taken in anon_vma_merge whenever a chain is unlinked. None of that matters. If the dang thing got free'd, the lock isn't reliable any more. > A few lines up from that code, we have: > > if (vma->anon_vma && (insert || importer || start != vma->vm_start)) > anon_vma = vma->anon_vma; > > So anon_vma should always be vma->anon_vma. No. vma->anon_vma is NULL, so the above lines are total no-ops. We're trying to _fill_ it. But we're doing it wrong. 
So we end up with: anon_vma = next->anon_vma importer = vma and we do: if (anon_vma_clone(importer, vma)) { return -ENOMEM; } importer->anon_vma = anon_vma; do you see? The "anon_vma_clone(importer, vma)" does NOTHING, because it is cloning from the wrong source (from 'vma', rather than from 'next'), so it leaves the vma chains empty. And then, despite having empty chains, we do that importer->anon_vma = anon_vma; which sets the anon_vma to the (non-NULL) next->anon_vma. And then, a bit later, we'll do anon_vma_merge(vma, next); which will happily notice that the anon_vma's of both vma and next match (because we just _set_ them to match), and then frees the ONLY REMAINING CHAIN - the one in next. The one we DID NOT CORRECTLY COPY, because we got our sources completely screwed up. > What am I overlooking? Can you see it now? > If we import a chain, from vma to importer, importer->anon_vma > will be equal to vma->anon_vma. The thing you seem to miss is that we aren't supposed to import the chain from 'vma' AT ALL. The anon_vma came from _next_, not from 'vma'! > I do not see how 'importer' could get a state different from 'vma'. Stop worrying about 'vma'. Start worrying about 'next'. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
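[ Editorial sketch, not part of the thread: the failure described above (chains acting as the implicit refcount, and a clone from the wrong source leaving the importer with no chain) can be modelled in a few lines of userspace C. All `toy_` names are hypothetical, chains are a small fixed array instead of linked anon_vma_chain structs, and "freeing" only sets a flag so the result can be inspected safely. It is a simplified model of the argument, not the kernel code. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CHAIN 4

struct toy_anon_vma {
	int nr_chains;	/* chain links pointing here: the implicit refcount */
	bool freed;	/* set instead of calling free(), so it can be checked */
};

struct toy_vma {
	struct toy_anon_vma *anon_vma;
	struct toy_anon_vma *chain[TOY_MAX_CHAIN];
	int nr;		/* number of used chain slots */
};

/* Add one chain link from 'vma' to 'av'. */
static void toy_chain_link(struct toy_vma *vma, struct toy_anon_vma *av)
{
	assert(vma->nr < TOY_MAX_CHAIN);
	vma->chain[vma->nr++] = av;
	av->nr_chains++;
}

/* Copy every chain link of 'src' into 'dst'. */
static void toy_clone(struct toy_vma *dst, const struct toy_vma *src)
{
	int i, n = src->nr;	/* snapshot, since dst may alias src */

	for (i = 0; i < n; i++)
		toy_chain_link(dst, src->chain[i]);
}

/* Drop all chain links of 'vma'; the last link going away "frees". */
static void toy_unlink(struct toy_vma *vma)
{
	int i;

	for (i = 0; i < vma->nr; i++)
		if (--vma->chain[i]->nr_chains == 0)
			vma->chain[i]->freed = true;
	vma->nr = 0;
}

/*
 * Replays the sequence from the message: 'vma' starts without an
 * anon_vma, 'next' owns one.  Returns true if vma->anon_vma ends up
 * pointing at a freed anon_vma.
 */
bool toy_merge_leaves_dangling(bool clone_from_next)
{
	struct toy_anon_vma av = { 0, false };
	struct toy_vma vma = { 0 }, next = { 0 };

	next.anon_vma = &av;
	toy_chain_link(&next, &av);

	/* Wrong source: cloning 'vma' into itself copies nothing. */
	toy_clone(&vma, clone_from_next ? &next : &vma);
	vma.anon_vma = next.anon_vma;

	toy_unlink(&next);	/* the merge drops next's chain */
	return vma.anon_vma->freed;
}
```

[ Calling toy_merge_leaves_dangling(false) models the buggy source choice and reports a dangling pointer; toy_merge_leaves_dangling(true) models cloning from 'next' and keeps the anon_vma alive through the importer's own chain link. ]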
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 20:34 ` Linus Torvalds @ 2010-04-10 20:43 ` Rik van Riel 0 siblings, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-10 20:43 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/10/2010 04:34 PM, Linus Torvalds wrote: >> What am I overlooking? > > Can you see it now? Yeah, after reading through your patch it became obvious. It's the code above this code that sets up the problem. It's a small miracle it worked before... ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 20:05 ` Linus Torvalds 2010-04-10 20:12 ` Linus Torvalds 2010-04-10 20:24 ` Rik van Riel @ 2010-04-10 20:32 ` Rik van Riel 2 siblings, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-10 20:32 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/10/2010 04:05 PM, Linus Torvalds wrote: > This patch is scary and untested, but the more I look at that code, the > more convinced I am that vma_adjust was _really_ badly screwed up. The > patch below may make things worse. I'll test it myself too, but I'm > sending it out first, since I was writing the email as I was looking at > the piece of cr*p. Your patch looks correct. Gotta love how before, "vma" could be either exporter or importer! I'm guessing that it did not break before my changes, because of plain old luck... Acked-by: Rik van Riel <riel@redhat.com> ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 18:21 ` Linus Torvalds 2010-04-10 18:26 ` Linus Torvalds 2010-04-10 18:51 ` Borislav Petkov @ 2010-04-10 19:36 ` Rik van Riel 2010-04-12 14:40 ` Peter Zijlstra 3 siblings, 0 replies; 231+ messages in thread From: Rik van Riel @ 2010-04-10 19:36 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/10/2010 02:21 PM, Linus Torvalds wrote: > Maybe I'm crazy, but something started bothering me. And I started > wondering: when is the 'page->mapping' of an anonymous page actually > cleared? > > The thing is, the mapping of an anonymous page is actually cleared only > when the page is _freed_, in "free_hot_cold_page()". Which is also where they are removed from the LRU. The plot thickens... > Now, let's think about that. And in particular, let's think about how that > relates to the freeing of the 'anon_vma' that the page->mapping points to. > > The way the anon_vma is freed is when the mapping is torn down, and we do > roughly: > > tlb = tlb_gather_mmu(mm,..) > .. > unmap_vmas(&tlb, vma .. > .. > free_pgtables() > .. > tlb_finish_mmu(tlb, start, end); Looks like we should move the anon_vma freeing from free_pgtables over to remove_vma? This code is just below the tlb_finish_mmu in exit_mmap: /* * Walk the list again, actually closing and freeing it, * with preemption enabled, without holding any MM locks. */ while (vma) vma = remove_vma(vma); This comment in free_pgtables is a little suspect: /* * Hide vma from rmap and truncate_pagecache before freeing * pgtables */ unlink_anon_vmas(vma); unlink_file_vma(vma); After all, the rmap code will quickly notice that there either are no page tables, or the page tables no longer have anything in them. 
It looks like we may have had this use-after-free bug in the VM for quite a while... I am not entirely sure what exposed the bug, but I can see how it works. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 18:21 ` Linus Torvalds ` (2 preceding siblings ...) 2010-04-10 19:36 ` Rik van Riel @ 2010-04-12 14:40 ` Peter Zijlstra 2010-04-12 15:17 ` Minchan Kim 2010-04-12 15:19 ` Rik van Riel 3 siblings, 2 replies; 231+ messages in thread From: Peter Zijlstra @ 2010-04-12 14:40 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sat, 2010-04-10 at 11:21 -0700, Linus Torvalds wrote: > > Ho humm. > > Maybe I'm crazy, but something started bothering me. And I started > wondering: when is the 'page->mapping' of an anonymous page actually > cleared? > > The thing is, the mapping of an anonymous page is actually cleared only > when the page is _freed_, in "free_hot_cold_page()". > > Now, let's think about that. And in particular, let's think about how that > relates to the freeing of the 'anon_vma' that the page->mapping points to. > > The way the anon_vma is freed is when the mapping is torn down, and we do > roughly: > > tlb = tlb_gather_mmu(mm,..) > .. > unmap_vmas(&tlb, vma .. > .. > free_pgtables() > .. > tlb_finish_mmu(tlb, start, end); > > and we actually unmap all the pages in "unmap_vmas()", and then _after_ > unmapping all the pages we do the "unlink_anon_vmas(vma);" in > "free_pgtables()". Fine so far - the anon_vma stay around until after the > page has been happily unmapped. > > But "unmapped all the pages" is _not_ actually the same as "free'd all the > pages". The actual _freeing_ of the page happens generally in > tlb_finish_mmu(), because we can free the page only after we've flushed > any TLB entries. > > So what we have in that tlb_gather structure is a list of _pending_ pages > to be freed, while we already actually free'd the anon_vmas earlier! 
> > Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because > we use a per-cpu variable), but as far as I can tell it is _not_ an > RCU-safe region. > > So I think we might actually get a real RCU freeing event while this all > happens. So now the 'anon_vma' that 'page->mapping' points to has not just > been released back to the SLUB caches, the page itself might have been > released too. > > I dunno. Does the above sound at all sane? Or am I just raving? > > Something hacky like the above might fix it if I'm not just raving. I > really might be missing something here. Right, so unless you have CONFIG_TREE_PREEMPT_RCU=y, the preempt-disable == RCU read lock assumption does hold. But even with your patch it doesn't close all holes because while zap_pte_range() can remove the last mapcount of the page, the page_remove_tlb() et al. don't need to be the last use count of the page. Concurrent reclaim/gup/whatever could still have a count out on the page delaying the actual free beyond the tlb gather RCU section. So the reason page->mapping isn't cleared in page_remove_rmap() isn't detailed beyond a (possible) race with page_add_anon_rmap() (which I guess would be reclaim trying to unmap the page and a fault re-instating it). This also complicates the whole page_lock_anon_vma() thing, so it would be nice to be able to remove this race and clear page->mapping in page_remove_rmap(). ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 14:40 ` Peter Zijlstra @ 2010-04-12 15:17 ` Minchan Kim 2010-04-12 15:33 ` Peter Zijlstra 2010-04-12 15:19 ` Rik van Riel 1 sibling, 1 reply; 231+ messages in thread From: Minchan Kim @ 2010-04-12 15:17 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Mon, Apr 12, 2010 at 11:40 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Sat, 2010-04-10 at 11:21 -0700, Linus Torvalds wrote: >> > >> Ho humm. >> >> Maybe I'm crazy, but something started bothering me. And I started >> wondering: when is the 'page->mapping' of an anonymous page actually >> cleared? >> >> The thing is, the mapping of an anonymous page is actually cleared only >> when the page is _freed_, in "free_hot_cold_page()". >> >> Now, let's think about that. And in particular, let's think about how that >> relates to the freeing of the 'anon_vma' that the page->mapping points to. >> >> The way the anon_vma is freed is when the mapping is torn down, and we do >> roughly: >> >> tlb = tlb_gather_mmu(mm,..) >> .. >> unmap_vmas(&tlb, vma .. >> .. >> free_pgtables() >> .. >> tlb_finish_mmu(tlb, start, end); >> >> and we actually unmap all the pages in "unmap_vmas()", and then _after_ >> unmapping all the pages we do the "unlink_anon_vmas(vma);" in >> "free_pgtables()". Fine so far - the anon_vma stay around until after the >> page has been happily unmapped. >> >> But "unmapped all the pages" is _not_ actually the same as "free'd all the >> pages". The actual _freeing_ of the page happens generally in >> tlb_finish_mmu(), because we can free the page only after we've flushed >> any TLB entries. 
>> >> So what we have in that tlb_gather structure is a list of _pending_ pages >> to be freed, while we already actually free'd the anon_vmas earlier! >> >> Now, the thing is, tlb_gather_mmu() begins a preempt-safe region (because >> we use a per-cpu variable), but as far as I can tell it is _not_ an >> RCU-safe region. >> >> So I think we might actually get a real RCU freeing event while this all >> happens. So now the 'anon_vma' that 'page->mapping' points to has not just >> been released back to the SLUB caches, the page itself might have been >> released too. >> >> I dunno. Does the above sound at all sane? Or am I just raving? >> >> Something hacky like the above might fix it if I'm not just raving. I >> really might be missing something here. > > Right, so unless you have CONFIG_TREE_PREEMPT_RCU=y, the preempt-disable > == RCU read lock assumption does hold. Indeed. > > But even with your patch it doesn't close all holes because while > zap_pte_range() can remove the last mapcount of the page, the > page_remove_tlb() et al. don't need to be the last use count of the > page. > > Concurrent reclaim/gup/whatever could still have a count out on the page > delaying the actual free beyond the tlb gather RCU section. The anon_vma lock is only valid while the page is mapped. If reclaim/gup/whatever wants to use the anon_vma, it should check page_mapped() first. And the last put_page() doesn't touch the anon_vma when freeing the page, so I think it's not a problem. Am I missing something? > > This also complicates the whole page_lock_anon_vma() thing, so it would > be nice to be able to remove this race and clear page->mapping in > page_remove_rmap(). > BTW, I totally agree with you. anon_vma is now very complicated: SLAB_DESTROY_BY_RCU, vma merge, when page->mapping is cleared, anon_vma_chain and so on.. :( -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 15:17 ` Minchan Kim @ 2010-04-12 15:33 ` Peter Zijlstra 0 siblings, 0 replies; 231+ messages in thread From: Peter Zijlstra @ 2010-04-12 15:33 UTC (permalink / raw) To: Minchan Kim Cc: Linus Torvalds, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, 2010-04-13 at 00:17 +0900, Minchan Kim wrote: > > Concurrent reclaim/gup/whatever could still have a count out on the page > > delaying the actual free beyond the tlb gather RCU section. > > anon_vma lock is just valid in case of page_mapped. > if reclaim/gup/whatever want to use anon_vma, it should check with page_mapped. > And last put_page doesn't touch anon_vma for freeing the page so I > think it's not a problem. Do I miss something? Hmm, I think you're right. The race I was thinking of makes the page_lock_anon_vma() RCU section overlap with that of the mmu_gather, which ensures the thing is long enough, or hits the !_mapcount case. I'm not sure there are other page->mapping users that are interesting. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 14:40 ` Peter Zijlstra 2010-04-12 15:17 ` Minchan Kim @ 2010-04-12 15:19 ` Rik van Riel 2010-04-12 16:01 ` Peter Zijlstra 1 sibling, 1 reply; 231+ messages in thread From: Rik van Riel @ 2010-04-12 15:19 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/12/2010 10:40 AM, Peter Zijlstra wrote: > So the reason page->mapping isn't cleared in page_remove_rmap() isn't > detailed beyond a (possible) race with page_add_anon_rmap() (which I > guess would be reclaim trying to unmap the page and a fault re-instating > it). > > This also complicates the whole page_lock_anon_vma() thing, so it would > be nice to be able to remove this race and clear page->mapping in > page_remove_rmap(). For anonymous pages, I don't see where the race comes from. Both do_swap_page and the reclaim code hold the page lock across the entire operation, so they are already excluding each other. Hugh, do you remember what the race between page_remove_rmap and page_add_anon_rmap is/was all about? I don't see a race in the current code... ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 15:19 ` Rik van Riel @ 2010-04-12 16:01 ` Peter Zijlstra 2010-04-12 16:06 ` Rik van Riel 2010-04-13 10:53 ` KOSAKI Motohiro 0 siblings, 2 replies; 231+ messages in thread From: Peter Zijlstra @ 2010-04-12 16:01 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Mon, 2010-04-12 at 11:19 -0400, Rik van Riel wrote: > On 04/12/2010 10:40 AM, Peter Zijlstra wrote: > > > So the reason page->mapping isn't cleared in page_remove_rmap() isn't > > detailed beyond a (possible) race with page_add_anon_rmap() (which I > > guess would be reclaim trying to unmap the page and a fault re-instating > > it). > > > > This also complicates the whole page_lock_anon_vma() thing, so it would > > be nice to be able to remove this race and clear page->mapping in > > page_remove_rmap(). > > For anonymous pages, I don't see where the race comes from. > > Both do_swap_page and the reclaim code hold the page lock > across the entire operation, so they are already excluding > each other. > > Hugh, do you remember what the race between page_remove_rmap > and page_add_anon_rmap is/was all about? > > I don't see a race in the current code... Something like the below would be nice if possible. --- mm/rmap.c | 44 +++++++++++++++++++++++++++++++------------- 1 files changed, 31 insertions(+), 13 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index eaa7a09..241f75d 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -286,7 +286,22 @@ void __init anon_vma_init(void) /* * Getting a lock on a stable anon_vma from a page off the LRU is - * tricky: page_lock_anon_vma rely on RCU to guard against the races. 
+ * tricky: + * + * page_add_anon_vma() + * atomic_add_negative(page->_mapcount); + * page->mapping = anon_vma; + * + * + * page_remove_rmap() + * atomic_add_negative(); + * page->mapping = anon_vma; + * + * So we have to first read page->mapping(), and then verify + * _mapcount, and make sure we order them correctly. + * + * We take anon_vma->lock in between so that if we see the anon_vma + * with a mapcount we know it won't go away on us. */ struct anon_vma *page_lock_anon_vma(struct page *page) { @@ -294,14 +309,24 @@ struct anon_vma *page_lock_anon_vma(struct page *page) unsigned long anon_mapping; rcu_read_lock(); - anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping); + anon_mapping = (unsigned long)rcu_dereference(page->mapping); if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) goto out; - if (!page_mapped(page)) - goto out; anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); spin_lock(&anon_vma->lock); + + /* + * Order the reading of page->mapping and page->_mapcount against the + * mb() implied by the atomic_add_negative() in page_remove_rmap(). + */ + smp_rmb(); + if (!page_mapped(page)) { + spin_unlock(&anon_vma->lock); + anon_vma = NULL; + goto out; + } + return anon_vma; out: rcu_read_unlock(); @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page) __dec_zone_page_state(page, NR_FILE_MAPPED); mem_cgroup_update_file_mapped(page, -1); } - /* - * It would be tidy to reset the PageAnon mapping here, - * but that might overwrite a racing page_add_anon_rmap - * which increments mapcount after us but sets mapping - * before us: so leave the reset to free_hot_cold_page, - * and remember that it's only reliable while mapped. - * Leaving it set also helps swapoff to reinstate ptes - * faster for those pages still in swapcache. - */ + + page->mapping = NULL; } /* ^ permalink raw reply related [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 16:01 ` Peter Zijlstra @ 2010-04-12 16:06 ` Rik van Riel 2010-04-12 16:46 ` Linus Torvalds 2010-04-13 10:53 ` KOSAKI Motohiro 1 sibling, 1 reply; 231+ messages in thread From: Rik van Riel @ 2010-04-12 16:06 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/12/2010 12:01 PM, Peter Zijlstra wrote: > @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page) > __dec_zone_page_state(page, NR_FILE_MAPPED); > mem_cgroup_update_file_mapped(page, -1); > } > - /* > - * It would be tidy to reset the PageAnon mapping here, > - * but that might overwrite a racing page_add_anon_rmap > - * which increments mapcount after us but sets mapping > - * before us: so leave the reset to free_hot_cold_page, > - * and remember that it's only reliable while mapped. > - * Leaving it set also helps swapoff to reinstate ptes > - * faster for those pages still in swapcache. > - */ > + > + page->mapping = NULL; > } That would be a bug for file pages :) I could see how it could work for anonymous memory, though. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 16:06 ` Rik van Riel @ 2010-04-12 16:46 ` Linus Torvalds 2010-04-12 18:40 ` Peter Zijlstra 0 siblings, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-12 16:46 UTC (permalink / raw) To: Rik van Riel Cc: Peter Zijlstra, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Mon, 12 Apr 2010, Rik van Riel wrote: > On 04/12/2010 12:01 PM, Peter Zijlstra wrote: > > > @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page) > > __dec_zone_page_state(page, NR_FILE_MAPPED); > > mem_cgroup_update_file_mapped(page, -1); > > } > > - /* > > - * It would be tidy to reset the PageAnon mapping here, > > - * but that might overwrite a racing page_add_anon_rmap > > - * which increments mapcount after us but sets mapping > > - * before us: so leave the reset to free_hot_cold_page, > > - * and remember that it's only reliable while mapped. > > - * Leaving it set also helps swapoff to reinstate ptes > > - * faster for those pages still in swapcache. > > - */ > > + > > + page->mapping = NULL; > > } > > That would be a bug for file pages :) > > I could see how it could work for anonymous memory, though. I think it's scary for anonymous pages too. The _common_ case of page_remove_rmap() is from unmap/exit, which holds no locks on the page what-so-ever. So assuming the page could be reachable some other way (swap cache etc), I think the above is pretty scary. Also do note that the bug we've been chasing has _always_ had that test for "page_mapped(page)". See my other email about why the unmapped case isn't even interesting, because it's so easy to see how page->mapping can be stale for unmapped pages. It's the _mapped_ case that is interesting, not the unmapped one. 
So setting page->mapping to NULL when unmapping is perhaps a nice consistency issue ("never have stale pointers"), but it's missing the fact that it's not really the case we care about. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 16:46 ` Linus Torvalds @ 2010-04-12 18:40 ` Peter Zijlstra 2010-04-12 19:30 ` Peter Zijlstra 0 siblings, 1 reply; 231+ messages in thread From: Peter Zijlstra @ 2010-04-12 18:40 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Mon, 2010-04-12 at 09:46 -0700, Linus Torvalds wrote: > > On Mon, 12 Apr 2010, Rik van Riel wrote: > > > On 04/12/2010 12:01 PM, Peter Zijlstra wrote: > > > > > @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page) > > > __dec_zone_page_state(page, NR_FILE_MAPPED); > > > mem_cgroup_update_file_mapped(page, -1); > > > } > > > - /* > > > - * It would be tidy to reset the PageAnon mapping here, > > > - * but that might overwrite a racing page_add_anon_rmap > > > - * which increments mapcount after us but sets mapping > > > - * before us: so leave the reset to free_hot_cold_page, > > > - * and remember that it's only reliable while mapped. > > > - * Leaving it set also helps swapoff to reinstate ptes > > > - * faster for those pages still in swapcache. > > > - */ > > > + > > > + page->mapping = NULL; > > > } > > > > That would be a bug for file pages :) > > > > I could see how it could work for anonymous memory, though. > > I think it's scary for anonymous pages too. The _common_ case of > page_remove_rmap() is from unmap/exit, which holds no locks on the page > what-so-ever. So assuming the page could be reachable some other way (swap > cache etc), I think the above is pretty scary. Fully agreed. > Also do note that the bug we've been chasing has _always_ had that test > for "page_mapped(page)". See my other email about why the unmapped case > isn't even interesting, because it's so easy to see how page->mapping can > be stale for unmapped pages. 
> > It's the _mapped_ case that is interesting, not the unmapped one. So > setting page->mapping to NULL when unmapping is perhaps a nice consistency > issue ("never have stale pointers"), but it's missing the fact that it's > not really the case we care about.

Yes, I don't think this is the problem that has been plaguing us for over a week now. But while staring at that code it did get me worried that the current code (page_lock_anon_vma):

 - is missing the smp_read_barrier_depends() after the ACCESS_ONCE
 - isn't properly ordered wrt page->mapping and page->_mapcount.
 - doesn't appear to guarantee much at all when returning an anon_vma since it locks after checking page->_mapcount so:
   * it can return !NULL for an unmapped page (your patch cures that)
   * it can return !NULL but for a different anon_vma (my earlier patch checking page_rmapping() after the spin_lock cures that, but doesn't cure the above):

[ highly unlikely but not impossible race ]

CPU0                          CPU1                   CPU2                 CPU3
page_referenced(page_A)       try_to_unmap(page_A)   unrelated fault      fault page_A

rcu_read_lock()
anon_vma = page->mapping;
if (!anon_vma & ANON_BIT)
  goto out
if (!page_mapped(page))
  goto out
                              page_remove_rmap()
                              ...
                              anon_vma_free()-----\
                                                   v
                                                     anon_vma_alloc()
                                                                          anon_vma_alloc()
                                                                          page_add_anon_rmap()
                                                   ^
spin_lock(anon_vma->lock)-------------------------/

Now I don't think the above can happen due to how our slab allocators work, they won't share a slab page between cpus like that, but once we make the whole thing preemptible this race becomes a lot more likely.

So a page_lock_anon_vma(), that looks a little like the below should (I think) cure all our problems with it.
struct anon_vma *page_lock_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma;
	unsigned long anon_mapping;

	rcu_read_lock();
again:
	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		goto out;
	anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);

	/*
	 * The RCU read lock ensures we can safely dereference anon_vma
	 * since it ensures the backing slab won't go away. It will however
	 * not guarantee it's the right object.
	 *
	 * First take the anon_vma->lock, this will, per anon_vma_unlink()
	 * avoid this anon_vma from being freed if it is a valid object.
	 */
	spin_lock(&anon_vma->lock);

	/*
	 * Secondly, we have to re-read page->mapping, so ensure it
	 * has not changed, rely on spin_lock() being at least a
	 * compiler barrier to force the re-read.
	 */
	if (unlikely(page_rmapping(page) != anon_vma)) {
		spin_unlock(&anon_vma->lock);
		goto again;
	}

	/*
	 * Ensure we read page->mapping before page->_mapcount,
	 * orders against atomic_add_negative() in page_remove_rmap().
	 */
	smp_rmb();

	/*
	 * Finally check that the page is still mapped,
	 * if not, this can't possibly be the right anon_vma.
	 */
	if (!page_mapped(page))
		goto unlock;

	return anon_vma;

unlock:
	spin_unlock(&anon_vma->lock);
out:
	rcu_read_unlock();
	return NULL;
}

With this, I think we can actually drop the RCU read lock when returning since if this is indeed a valid anon_vma for this page, then the page is still mapped, and hence the anon_vma was not deleted, and a possible future delete will be held back by us holding the anon_vma->lock.

Now I could be totally wrong and have confused myself thoroughly, but how does this look?

^ permalink raw reply [flat|nested] 231+ messages in thread
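The lock-then-recheck shape of the proposed function can be tried out in user space. What follows is only a minimal sketch under loud assumptions: `struct owner`, `struct item`, `owner_lock()`, `owner_unlock()` and `item_lock_owner()` are hypothetical stand-ins (for anon_vma, struct page and page_lock_anon_vma()), C11 atomics replace the kernel primitives, and owners are assumed never to be freed or reused, which sidesteps exactly the object-lifetime question the thread is arguing about.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for anon_vma: nothing but a spinlock. */
struct owner {
	atomic_flag lock;
};

/* Hypothetical stand-in for struct page. */
struct item {
	_Atomic(struct owner *) owner;	/* plays the role of page->mapping */
	atomic_int mapcount;		/* > 0 means "still mapped" */
};

static inline void owner_lock(struct owner *o)
{
	while (atomic_flag_test_and_set_explicit(&o->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static inline void owner_unlock(struct owner *o)
{
	atomic_flag_clear_explicit(&o->lock, memory_order_release);
}

/*
 * Return the item's owner with its lock held, or NULL if the item has
 * no owner or is no longer mapped.
 *
 * ASSUMPTION: owners are never freed or reused, so locking a stale
 * pointer is harmless here. The kernel code has no such luxury.
 */
static inline struct owner *item_lock_owner(struct item *it)
{
	for (;;) {
		struct owner *o = atomic_load(&it->owner);

		if (!o)
			return NULL;

		owner_lock(o);

		/* Re-read the pointer under the lock; retry if it moved. */
		if (atomic_load(&it->owner) != o) {
			owner_unlock(o);
			continue;
		}

		/* Only a still-mapped item can vouch for its owner. */
		if (atomic_load(&it->mapcount) <= 0) {
			owner_unlock(o);
			return NULL;
		}

		return o;
	}
}
```

A caller would pair a non-NULL return with `owner_unlock()`. The sketch does not attempt to model RCU or slab reuse, so it says nothing about whether the kernel version is safe.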
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 18:40 ` Peter Zijlstra @ 2010-04-12 19:30 ` Peter Zijlstra 2010-04-12 19:44 ` Peter Zijlstra 0 siblings, 1 reply; 231+ messages in thread From: Peter Zijlstra @ 2010-04-12 19:30 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Mon, 2010-04-12 at 20:40 +0200, Peter Zijlstra wrote: Hmm, if interleaved like so > struct anon_vma *page_lock_anon_vma(struct page *page) > { > struct anon_vma *anon_vma; > unsigned long anon_mapping; page_remove_rmap() anon_vma_unlink() anon_vma_free() So that the below will all observe the old page->mapping: > rcu_read_lock(); > again: > anon_mapping = (unsigned long)rcu_dereference(page->mapping); > if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) > goto out; > anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON); > > /* > * The RCU read lock ensures we can safely dereference anon_vma > * since it ensures the backing slab won't go away. It will however > * not guarantee it's the right object. > * > * First take the anon_vma->lock, this will, per anon_vma_unlink() > * avoid this anon_vma from being freed if it is a valid object. > */ > spin_lock(&anon_vma->lock); > > /* > * Secondly, we have to re-read page->mapping, so ensure it > * has not changed, rely on spin_lock() being at least a > * compiler barrier to force the re-read. > */ > if (unlikely(page_rmapping(page) != anon_vma)) { > spin_unlock(&anon_vma->lock); > goto again; > } page_add_anon_rmap(), so that the page_mapped() test below would be positive, > /* > * Ensure we read page->mapping before page->_mapcount, > * orders against atomic_add_negative() in page_remove_rmap(). 
> */ > smp_rmb(); > > /* > * Finally check that the page is still mapped, > * if not, this can't possibly be the right anon_vma. > */ > if (!page_mapped(page)) > goto unlock; We could here return a non-valid and already freed anon_vma. > return anon_vma; > > unlock: > spin_unlock(&anon_vma->lock); > out: > rcu_read_unlock(); > return NULL; > } > > ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 19:30 ` Peter Zijlstra @ 2010-04-12 19:44 ` Peter Zijlstra 0 siblings, 0 replies; 231+ messages in thread From: Peter Zijlstra @ 2010-04-12 19:44 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Borislav Petkov, Johannes Weiner, KOSAKI Motohiro, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Mon, 2010-04-12 at 21:30 +0200, Peter Zijlstra wrote: > > We could here return a non-valid and already freed anon_vma. > OK, so none of the users of page_lock_anon_vma() with the exception of the memory-failure.c one could really care. And all of them seem to be safe enough wrt dealing with a dead one. So unless people care, I'm not going to spend more time on trying to make page_lock_anon_vma() behave. Instead I'll try and see wth it is that migrate.c and rmap_walk_anon are doing. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-12 16:01 ` Peter Zijlstra 2010-04-12 16:06 ` Rik van Riel @ 2010-04-13 10:53 ` KOSAKI Motohiro 2010-04-13 11:30 ` Peter Zijlstra 1 sibling, 1 reply; 231+ messages in thread From: KOSAKI Motohiro @ 2010-04-13 10:53 UTC (permalink / raw) To: Peter Zijlstra Cc: kosaki.motohiro, Rik van Riel, Linus Torvalds, Borislav Petkov, Johannes Weiner, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson > struct anon_vma *page_lock_anon_vma(struct page *page) > { > @@ -294,14 +309,24 @@ struct anon_vma *page_lock_anon_vma(struct page *page) > unsigned long anon_mapping; > > rcu_read_lock(); > - anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping); > + anon_mapping = (unsigned long)rcu_dereference(page->mapping); > if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) > goto out; > - if (!page_mapped(page)) > - goto out; > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > spin_lock(&anon_vma->lock); Is the anon_vma->lock dereference guaranteed to be safe if page->_mapcount==-1? The anon_vma could have been freed milliseconds ago, and rcu_read_lock() doesn't provide such a guarantee. Perhaps I'm missing your point. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-13 10:53 ` KOSAKI Motohiro @ 2010-04-13 11:30 ` Peter Zijlstra 2010-04-13 12:00 ` KOSAKI Motohiro 0 siblings, 1 reply; 231+ messages in thread From: Peter Zijlstra @ 2010-04-13 11:30 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Rik van Riel, Linus Torvalds, Borislav Petkov, Johannes Weiner, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, 2010-04-13 at 19:53 +0900, KOSAKI Motohiro wrote: > > struct anon_vma *page_lock_anon_vma(struct page *page) > > { > > @@ -294,14 +309,24 @@ struct anon_vma *page_lock_anon_vma(struct page *page) > > unsigned long anon_mapping; > > > > rcu_read_lock(); > > - anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping); > > + anon_mapping = (unsigned long)rcu_dereference(page->mapping); > > if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) > > goto out; > > - if (!page_mapped(page)) > > - goto out; > > > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > > spin_lock(&anon_vma->lock); > > Does anon->lock dereference is guranteed if page->_mapcount==-1? > It can be freed miliseconds ago, rcu_read_lock() doesn't provide such > gurantee. > > perhaps, I'm missing your point. No you're right, I got my head hopelessly twisted up trying to make page_lock_anon_vma() do something reliable, but there really isn't much that can be done. Luckily most users (with exception of the memory-failure.c one) don't really care and all take steps to verify the page is indeed in any of the vmas it might find. So I've given up on this and will only submit a patch like the below, which hopefully does still make sense... I do think there's a missing barrier in there as well, but I've made enough of a fool of myself. 
[ with the preemptible mmu_gather patches I introduce a refcount to the anon_vma, and then with atomic_inc_not_zero() we can add a guarantee that the returned anon_vma is alive ] --- mm/rmap.c | 18 ++++++++++++++++-- 1 files changed, 16 insertions(+), 2 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index eaa7a09..49a2533 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -285,8 +285,22 @@ void __init anon_vma_init(void) } /* - * Getting a lock on a stable anon_vma from a page off the LRU is - * tricky: page_lock_anon_vma rely on RCU to guard against the races. + * Getting a lock on a stable anon_vma from a page off the LRU is tricky! + * + * Since there is no serialization what so ever against page_remove_rmap() + * the best this function can do is return a locked anon_vma that might + * have been relevant to this page. + * + * The page might have been remapped to a different anon_vma or the anon_vma + * returned may already be freed (and even reused). + * + * All users of this function must be very careful when walking the anon_vma + * chain and verify that the page in question is indeed mapped in it + * [ something equivalent to page_mapped_in_vma() ]. + * + * Since anon_vma's slab is DESTROY_BY_RCU and we know from page_remove_rmap() + * that the anon_vma pointer from page->mapping is valid if there is a + * mapcount, we can dereference the anon_vma after observing those. */ struct anon_vma *page_lock_anon_vma(struct page *page) { ^ permalink raw reply related [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-13 11:30 ` Peter Zijlstra @ 2010-04-13 12:00 ` KOSAKI Motohiro 2010-04-14 14:27 ` Peter Zijlstra 0 siblings, 1 reply; 231+ messages in thread From: KOSAKI Motohiro @ 2010-04-13 12:00 UTC (permalink / raw) To: Peter Zijlstra Cc: kosaki.motohiro, Rik van Riel, Linus Torvalds, Borislav Petkov, Johannes Weiner, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson > On Tue, 2010-04-13 at 19:53 +0900, KOSAKI Motohiro wrote: > > > struct anon_vma *page_lock_anon_vma(struct page *page) > > > { > > > @@ -294,14 +309,24 @@ struct anon_vma *page_lock_anon_vma(struct page *page) > > > unsigned long anon_mapping; > > > > > > rcu_read_lock(); > > > - anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping); > > > + anon_mapping = (unsigned long)rcu_dereference(page->mapping); > > > if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) > > > goto out; > > > - if (!page_mapped(page)) > > > - goto out; > > > > > > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); > > > spin_lock(&anon_vma->lock); > > > > Does anon->lock dereference is guranteed if page->_mapcount==-1? > > It can be freed miliseconds ago, rcu_read_lock() doesn't provide such > > gurantee. > > > > perhaps, I'm missing your point. > > No you're right, I got my head hopelessly twisted up trying to make > page_lock_anon_vma() do something reliable, but there really isn't much > that can be done. > > Luckily most users (with exception of the memory-failure.c one) don't > really care and all take steps to verify the page is indeed in any of > the vmas it might find. > > So I've given up on this and will only submit a patch like the below, > which hopefully does still make sense... > > I do think there's a missing barrier in there as well, but I've made > enough of a fool of myself. 
> > [ with the preemptible mmu_gather patches I introduce a refcount to > the anon_vma, and then with atomic_inc_not_zero() we can add a > guarantee that the returned anon_vma is alive ] Indeed, a refcount is the best way. The anon_vma DESTROY_BY_RCU stuff seems over-engineered, I think. It is the fastest, but anon_vma allocation is not (and was not) a fork/exit bottleneck, so I guess the simplest way is best. The following patch also looks good to me. Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Thanks for that. I've thought this was really necessary, but my (very) poor English skills made me hesitate to write it. Sorry for my laziness ;) > > --- > mm/rmap.c | 18 ++++++++++++++++-- > 1 files changed, 16 insertions(+), 2 deletions(-) > > diff --git a/mm/rmap.c b/mm/rmap.c > index eaa7a09..49a2533 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -285,8 +285,22 @@ void __init anon_vma_init(void) > } > > /* > - * Getting a lock on a stable anon_vma from a page off the LRU is > - * tricky: page_lock_anon_vma rely on RCU to guard against the races. > + * Getting a lock on a stable anon_vma from a page off the LRU is tricky! > + * > + * Since there is no serialization what so ever against page_remove_rmap() > + * the best this function can do is return a locked anon_vma that might > + * have been relevant to this page. > + * > + * The page might have been remapped to a different anon_vma or the anon_vma > + * returned may already be freed (and even reused). > + * > + * All users of this function must be very careful when walking the anon_vma > + * chain and verify that the page in question is indeed mapped in it > + * [ something equivalent to page_mapped_in_vma() ]. > + * > + * Since anon_vma's slab is DESTROY_BY_RCU and we know from page_remove_rmap() > + * that the anon_vma pointer from page->mapping is valid if there is a > + * mapcount, we can dereference the anon_vma after observing those. 
> */ > struct anon_vma *page_lock_anon_vma(struct page *page) > { > ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-13 12:00 ` KOSAKI Motohiro @ 2010-04-14 14:27 ` Peter Zijlstra 0 siblings, 0 replies; 231+ messages in thread From: Peter Zijlstra @ 2010-04-14 14:27 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Rik van Riel, Linus Torvalds, Borislav Petkov, Johannes Weiner, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, 2010-04-13 at 21:00 +0900, KOSAKI Motohiro wrote: > > [ with the preemptible mmu_gather patches I introduce a refcount to > > the anon_vma, and then with atomic_inc_not_zero() we can add a > > guarantee that the returned anon_vma is alive ] > > Indeed. refcount is best way. anon_vma DESTROY_BY_RCU stuff seems > overengineering, I think. this is fastest, but anon_vma allocation is not > (and was not) fork/exit bottleneck point. So, I guess most simply way is > best. Well, that refcount stuff still relies on DESTROY_BY_RCU :-) Anyway, it also looks like a lot of races are avoided by ordering the rmap_add/remove calls wrt adding/removing the page to/from the LRU. Rmap calls come from LRU pages, and it looks like rmap state is only changed for pages that are not on the LRU. I still have to go through all that code again to make sure, but I couldn't find a race between page_add_anon_rmap() and page_lock_anon_vma() due to that. If there is, we need to look at page_mapped() before page->mapping because page_add_anon_rmap() first increments the mapcount and only then adjusts the mapping, so the existing order in page_lock_anon_vma() can end up dereferencing a long dead anon_vma. ^ permalink raw reply [flat|nested] 231+ messages in thread
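The "take a reference only if the object is still alive" idea behind atomic_inc_not_zero() can be illustrated outside the kernel. This is a minimal user-space sketch, assuming C11 `<stdatomic.h>` in place of the kernel's atomic API; `struct obj`, `obj_get_not_zero()` and `obj_put()` are hypothetical names, not anything from the preemptible mmu_gather patches.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; not the kernel's struct anon_vma. */
struct obj {
	atomic_int refcount;	/* 0 means "dead, must not be revived" */
};

/*
 * Take a reference only if the count is currently non-zero.
 * Returns true on success, false if the object was already dead.
 */
static inline bool obj_get_not_zero(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure the CAS reloads 'old' with the current value. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static inline bool obj_put(struct obj *o)
{
	return atomic_fetch_sub(&o->refcount, 1) == 1;
}
```

As Peter notes, such a counter only helps if the memory holding it stays valid while it is being inspected, which is why the refcount approach still leans on DESTROY_BY_RCU. The sketch leaves that lifetime problem out entirely.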
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 16:38 ` Borislav Petkov 2010-04-10 17:05 ` Linus Torvalds @ 2010-04-10 17:07 ` Borislav Petkov 1 sibling, 0 replies; 231+ messages in thread From: Borislav Petkov @ 2010-04-10 17:07 UTC (permalink / raw) To: Linus Torvalds Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Borislav Petkov <bp@alien8.de> Date: Sat, Apr 10, 2010 at 06:38:28PM +0200 > Im going to run the stress test on 2.6.33.2 to verify whether this is > actually software-related. Just in case. Just did a bunch of hibernation runs - 2.6.33.2 feels rock solid - no issues whatsoever. So in the face of such results a hw failure is kinda improbable... Hmm... -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 15:24 ` Linus Torvalds 2010-04-10 16:38 ` Borislav Petkov @ 2010-04-10 16:41 ` Linus Torvalds 2010-04-10 22:49 ` Johannes Weiner 1 sibling, 1 reply; 231+ messages in thread From: Linus Torvalds @ 2010-04-10 16:41 UTC (permalink / raw) To: Borislav Petkov Cc: Johannes Weiner, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sat, 10 Apr 2010, Linus Torvalds wrote: > > But I think the fact that you are apparently not able to get the list > corruption is a good sign. Of course, it might just be harder to trigger, > and these things could all be a sign of a different bug, but my gut feel > is that we did fix something, and you are just damn good at stressing the > new code. Kudos. Btw, I do hate the current 'find_mergeable_anon_vma()' with its duplicated checks for prev/next compatibility that I just made even more complex. So I'm actually inclined to want to write my simple two-liner fix as a rather more complex cleanup patch, below. It adds way more lines than it deletes, but a lot of it is comments (and some of it is just because one routine got split up into three), and I think it makes the result a lot more readable. It also splits off the decision of whether we can reuse an anon_vma from the decision of whether we can merge the vma's - the two are kind of related, but they are not really the same, and they have different issues. I think it's good to try to keep separate issues separate. This is UNTESTED! It's meant to be an "obvious cleanup" with no real semantic difference, but if I did something wrong it won't work. 
Also note the comment about the lack of locking between two adjacent anon_vma's taking a page fault at the same time: the ACCESS_ONCE() is unlikely to ever matter (anon_vma's are stable once they are set, so it's really just that you could first load a NULL, and then if you re-load the value you might get a non-NULL thing). Also note that when checking whether the anon_vma is a singleton, we don't hold any lock that protects the list we are checking. But "list_is_singular()" is safe and won't oops even if the pointers in the list are crap, because it only _compares_ the prev/next pointers, it doesn't dereference them. In short, what I'm saying is that there is a pretty subtle race in the very very unlikely case that two anon_vma's get prepared concurrently, but from a correctness standpoint it doesn't matter. We might sometimes - once in a blue moon - reject an anon_vma that could in theory have been merged, but that won't hurt. Comments? Rik, Johannes? Linus --- mm/mmap.c | 86 ++++++++++++++++++++++++++++++++++++++++++++----------------- 1 files changed, 62 insertions(+), 24 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 75557c6..acb023e 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -825,6 +825,61 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, } /* + * Rough compatbility check to quickly see if it's even worth looking + * at sharing an anon_vma. + * + * They need to have the same vm_file, and the flags can only differ + * in things that mprotect may change. + * + * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that + * we can merge the two vma's. For example, we refuse to merge a vma if + * there is a vm_ops->close() function, because that indicates that the + * driver is doing some kind of reference counting. But that doesn't + * really matter for the anon_vma sharing case. 
+ */ +static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b) +{ + return a->vm_end == b->vm_start && + mpol_equal(vma_policy(a), vma_policy(b)) && + a->vm_file == b->vm_file && + !((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC)) && + b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT); +} + +/* + * Do some basic sanity checking to see if we can re-use the anon_vma + * from 'old'. The 'a'/'b' vma's are in VM order - one of them will be + * the same as 'old', the other will be the new one that is trying + * to share the anon_vma. + * + * NOTE! This runs with mm_sem held for reading, so it is possible that + * the anon_vma of 'old' is concurrently in the process of being set up + * by another page fault trying to merge _that_. But that's ok: if it + * is being set up, that automatically means that it will be a singleton + * acceptable for merging, so we can do all of this optimistically. But + * we do that ACCESS_ONCE() to make sure that we never re-load the pointer. + * + * IOW: that the "list_is_singular()" test on the anon_vma_chain only + * matters for the 'stable anon_vma' case (ie the thing we want to avoid + * is to return an anon_vma that is "complex" due to having gone through + * a fork). + * + * We also make sure that the two vma's are compatible (adjacent, + * and with the same memory policies). That's all stable, even with just + * a read lock on the mm_sem. + */ +static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_area_struct *a, struct vm_area_struct *b) +{ + if (anon_vma_compatible(a, b)) { + struct anon_vma *anon_vma = ACCESS_ONCE(old->anon_vma); + + if (anon_vma && list_is_singular(&old->anon_vma_chain)) + return anon_vma; + } + return NULL; +} + +/* * find_mergeable_anon_vma is used by anon_vma_prepare, to check * neighbouring vmas for a suitable anon_vma, before it goes off * to allocate a new anon_vma. 
It checks because a repetitive @@ -834,28 +889,16 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, */ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma) { + struct anon_vma *anon_vma; struct vm_area_struct *near; - unsigned long vm_flags; near = vma->vm_next; if (!near) goto try_prev; - /* - * Since only mprotect tries to remerge vmas, match flags - * which might be mprotected into each other later on. - * Neither mlock nor madvise tries to remerge at present, - * so leave their flags as obstructing a merge. - */ - vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC); - vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC); - - if (near->anon_vma && vma->vm_end == near->vm_start && - mpol_equal(vma_policy(vma), vma_policy(near)) && - can_vma_merge_before(near, vm_flags, - NULL, vma->vm_file, vma->vm_pgoff + - ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT))) - return near->anon_vma; + anon_vma = reusable_anon_vma(near, vma, near); + if (anon_vma) + return anon_vma; try_prev: /* * It is potentially slow to have to call find_vma_prev here. @@ -868,14 +911,9 @@ try_prev: if (!near) goto none; - vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC); - vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC); - - if (near->anon_vma && near->vm_end == vma->vm_start && - mpol_equal(vma_policy(near), vma_policy(vma)) && - can_vma_merge_after(near, vm_flags, - NULL, vma->vm_file, vma->vm_pgoff)) - return near->anon_vma; + anon_vma = reusable_anon_vma(near, near, vma); + if (anon_vma) + return anon_vma; none: /* * There's no absolute need to look only at touching neighbours: ^ permalink raw reply related [flat|nested] 231+ messages in thread
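Two of the checks in the patch above are easy to poke at in isolation: the flag comparison that ignores the bits mprotect may change, and the claim that list_is_singular() only compares pointers and never dereferences them. The following is a minimal user-space sketch, not kernel code: the `VM_*` values are placeholder constants chosen for illustration, `flags_compatible()`, `list_init()` and `list_add()` are hypothetical helpers, and `struct list_head` is re-declared locally so the block stands alone.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit values, chosen only for this illustration. */
#define VM_READ   0x1UL
#define VM_WRITE  0x2UL
#define VM_EXEC   0x4UL
#define VM_LOCKED 0x2000UL	/* stands for "any flag mprotect cannot change" */

/* True if the two flag words differ only in the mprotect-changeable bits. */
static inline bool flags_compatible(unsigned long a, unsigned long b)
{
	return !((a ^ b) & ~(VM_READ | VM_WRITE | VM_EXEC));
}

/* Local re-declaration of a circular doubly linked list head. */
struct list_head {
	struct list_head *next, *prev;
};

static inline void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert 'entry' right after 'head'. */
static inline void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/*
 * True if the list has exactly one entry. Only the values of the
 * head's next/prev pointers are compared; nothing they point to
 * is ever read.
 */
static inline bool list_is_singular(const struct list_head *head)
{
	return head->next != head && head->next == head->prev;
}
```

Because the singular test reads only `head->next` and `head->prev`, it cannot fault on a neighbour's stale pointers, which is the property the patch description relies on when it runs the check without holding a lock on the list.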
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 16:41 ` Linus Torvalds @ 2010-04-10 22:49 ` Johannes Weiner 2010-04-10 23:31 ` Linus Torvalds 0 siblings, 1 reply; 231+ messages in thread From: Johannes Weiner @ 2010-04-10 22:49 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sat, Apr 10, 2010 at 09:41:52AM -0700, Linus Torvalds wrote: > [...] > > It also splits off the decision of whether we can reuse an anon_vma from > the decision of whether we can merge the vma's - the two are kind of > related, but they are not really the same, and they have different issues. > I think it's good to try to keep separate issues separate. > > [...] > > + * NOTE! The fact that we share an anon_vma doesn't _have_ to mean that > + * we can merge the two vma's. For example, we refuse to merge a vma if > + * there is a vm_ops->close() function, because that indicates that the > + * driver is doing some kind of reference counting. But that doesn't > + * really matter for the anon_vma sharing case. I am all in favor of only doing singletons, so that we don't have to inflict my psycho-active merging routine on civilians. I am not convinced it's a good idea to share an anon_vma, however, when we know beforehand the vmas will never merge, because it will increase rmap overhead of walking unrelated vmas for every page in every vma that is part of the reused anon_vma. So we usually take that as a trade-off when there is a chance the vmas could still reunite and we don't want to spoil that through differing anon_vmas. But if it's already clear that they won't, it appears to me it would be more efficient in the long run to just allocate our own anon_vma. Did you have something in mind that I missed? Hannes ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-10 22:49 ` Johannes Weiner @ 2010-04-10 23:31 ` Linus Torvalds 0 siblings, 0 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-10 23:31 UTC (permalink / raw) To: Johannes Weiner Cc: Borislav Petkov, KOSAKI Motohiro, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Sun, 11 Apr 2010, Johannes Weiner wrote: > > Did you have something in mind that I missed? Mostly that the corner cases will never matter, and I'd prefer to keep the code simpler than to care deeply. For example, the only case you'd see vm_ops->close() is for special device mappings. It's true that they cannot have their vma's merged, but it's also true that they (a) will seldom have anon_vma's anyway and (b) would never get mapped very many times so that anon_vma merging would be an issue. In other words, it's a "don't care" situation, where to keep the code simpler we just document that we don't care. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-09 0:50 ` Linus Torvalds 2010-04-09 1:30 ` Borislav Petkov @ 2010-04-09 1:45 ` KOSAKI Motohiro 1 sibling, 0 replies; 231+ messages in thread From: KOSAKI Motohiro @ 2010-04-09 1:45 UTC (permalink / raw) To: Linus Torvalds Cc: kosaki.motohiro, Borislav Petkov, Rik van Riel, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes > > > On Fri, 9 Apr 2010, Borislav Petkov wrote: > > > > Yep, looks good: its mmap_region()... > > Can you double-check your current diffs - maybe something got corrupted. > > mmap_region installs the vma with vma_link(), and the last thing > vma_link() does with my patch is that "anon_vma_prepare()". I agree, and at least your patch works fine on my box. I'll continue digging. > > Maybe with all the patches flying around, you had a reject or something, > and you lost that one anon_vma_prepare()? > > Or maybe I screwed up somewhere and sent you the wrong patch. Here it is > again, just in case. > > [ I have a horrible cold, and can hardly think straight. So who knows, > maybe I'm missing something. But if you have lost one of the > 'anon_vma_prepare()' call sites, that would certainly explain why you > get NULL anon_vma's ] > > Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA 2010-04-07 14:54 ` [PATCH -v2] " Rik van Riel 2010-04-07 15:30 ` Linus Torvalds @ 2010-04-07 15:55 ` Minchan Kim 1 sibling, 0 replies; 231+ messages in thread From: Minchan Kim @ 2010-04-07 15:55 UTC (permalink / raw) To: Rik van Riel Cc: KOSAKI Motohiro, Linus Torvalds, Borislav Petkov, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, hannes Hi, Rik. On Wed, Apr 7, 2010 at 11:54 PM, Rik van Riel <riel@redhat.com> wrote: > When a new VMA has a mergeable anon_vma with a neighboring VMA, > make sure all of the neighbor's old anon_vma structs are also > linked in. > > This is necessary because at some point the VMAs could get merged, > and we want to ensure no anon_vma structs get freed prematurely, > while the system still has anonymous pages that belong to those > structs. > > Reported-by: Borislav Petkov <bp@alien8.de> > Signed-off-by: Rik van Riel <riel@redhat.com> At last, you might have found the culprit. AFAIU from your description, don't we have to take care of the vma_merge case, too? Sorry if it is a dumb question. -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 23:27 ` Linus Torvalds 2010-04-06 23:54 ` [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel @ 2010-04-07 7:29 ` Borislav Petkov 2010-04-07 14:05 ` Paulo Marques 2 siblings, 0 replies; 231+ messages in thread From: Borislav Petkov @ 2010-04-07 7:29 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Linus Torvalds <torvalds@linux-foundation.org> Date: Tue, Apr 06, 2010 at 04:27:42PM -0700 > No, you're mis-reading the asm. It's again the first iteration, and the > code above it is again the end of the loop. And %rax is once more a kernel > pointer, not the return value of 'page_referenced_one()'. > > So it once more is 'anon_vma->head.next' that is crap, but now it's not > NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20 > subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e). No, maybe I expressed myself wrong (it was late an' all) - I was basically trying to confirm your assessment that anon_vma->head.next is crap but the code had changed since I had added the debugging 'if (!anon_vma->head.next)' and that was the value that was already in %r13 before iterating over the list chain. Yeah, just a minor nitpick and not that it matters. Nevermind though, we're on the same page. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 23:27 ` Linus Torvalds 2010-04-06 23:54 ` [PATCH] rmap: make anon_vma_prepare link in all the anon_vmas of a mergeable VMA Rik van Riel 2010-04-07 7:29 ` Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) Borislav Petkov @ 2010-04-07 14:05 ` Paulo Marques 2010-04-07 14:13 ` Borislav Petkov 2 siblings, 1 reply; 231+ messages in thread From: Paulo Marques @ 2010-04-07 14:05 UTC (permalink / raw) To: Linux Kernel Mailing List Cc: Borislav Petkov, Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson Linus Torvalds wrote: > [...] > So it once more is 'anon_vma->head.next' that is crap, but now it's not > NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20 > subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e). > > What does '0x2e' mean? It's ASCII '.', but that doesn't really mean > anything either. Just a wild shot in the dark: it can be a couple of gray pixels with intensity 0x2e at some 32 bits per pixel mode. I say this because of the zero bytes there and someone mentioning seeing the problem when starting X. -- Paulo Marques - www.grupopie.com "Don't worry, you'll be fine; I saw it work in a cartoon once..." ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-07 14:05 ` Paulo Marques @ 2010-04-07 14:13 ` Borislav Petkov 0 siblings, 0 replies; 231+ messages in thread From: Borislav Petkov @ 2010-04-07 14:13 UTC (permalink / raw) To: Paulo Marques Cc: Linux Kernel Mailing List, Borislav Petkov, Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Paulo Marques <pmarques@grupopie.com> Date: Wed, Apr 07, 2010 at 03:05:50PM +0100 > Linus Torvalds wrote: > > [...] > > So it once more is 'anon_vma->head.next' that is crap, but now it's not > > NULL, it's that very odd 0x002e2e2e002e2e2e pattern (the %r13 has had 0x20 > > subtracted from it, so that LSB of "0x0e" is actually _also_ a 0x2e). > > > > What does '0x2e' mean? It's ASCII '.', but that doesn't really mean > > anything either. > > Just a wild shot in the dark: it can be a couple of gray pixels with > intensity 0x2e at some 32 bits per pixel mode. I say this because of the > zero bytes there and someone mentioning seeing the problem when starting X. I don't think those are related: the problem when X was starting happens with Rik' newest patch and the funny %r13 value happened after enabling SLUB debugging last night. Thanks. -- Regards/Gruss, Boris. -- Advanced Micro Devices, Inc. Operating Systems Research Center ^ permalink raw reply [flat|nested] 231+ messages in thread
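The arithmetic behind the "0x0e is actually also a 0x2e" remark, and the pixel reading suggested in this sub-thread, can be checked with a few lines of C. This is only a sketch: the pattern and the 0x20 offset are taken from the messages above, while the helper names are made up for illustration and the pixel interpretation is the guess from the thread, not an established fact.

```c
#include <assert.h>
#include <stdint.h>

/* Value reported for anon_vma->head.next in the messages above. */
#define TOY_PATTERN 0x002e2e2e002e2e2eULL
/* Amount that, per the message, had been subtracted to give %r13. */
#define TOY_OFFSET  0x20ULL

/* What the register holds once the offset has been subtracted. */
static uint64_t toy_reg_from_ptr(uint64_t ptr)
{
	return ptr - TOY_OFFSET;
}

/* Least significant byte of a 64-bit value. */
static uint8_t toy_low_byte(uint64_t v)
{
	return (uint8_t)(v & 0xff);
}

/* View the 64-bit value as two 32-bit words; idx 0 is the low word. */
static uint32_t toy_word(uint64_t v, int idx)
{
	assert(idx == 0 || idx == 1);
	return (uint32_t)(v >> (32 * idx));
}
```

Subtracting 0x20 from the pattern only touches the lowest byte (0x2e - 0x20 = 0x0e, no borrow), and both 32-bit halves come out as 0x002e2e2e, which is what makes the "two gray pixels at 32 bits per pixel" guess at least plausible on its face.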
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 22:59 ` Borislav Petkov 2010-04-06 23:27 ` Linus Torvalds @ 2010-04-06 23:37 ` Linus Torvalds 1 sibling, 0 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-06 23:37 UTC (permalink / raw) To: Borislav Petkov Cc: Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Wed, 7 Apr 2010, Borislav Petkov wrote: > + > + if (!anon_vma->head.next) { > + printk(KERN_ERR "NULL anon_vma->head.next, page %lu\n", > + page_to_pfn(page)); > + > + object_err(anon_vma_cachep, page, (u8 *)anon_vma, "NULL next"); Oh, and since the debugging code never triggered ('head.next' wasn't actually NULL), you never got here, but the 'page' you passed in to object_err() should be the page of the slab allocation, not the page associated with the anon_vma. So it should be something like "virt_to_head_page(anon_vma)" that you pass in to object_err(). Not that it matters. I assume it is the fact that SLAB debugging is on that actually turns the NULL into a non-NULL thing. Poisoning is not active for SLUB caches with constructors or RCU-freeing, but things like redzoning still are. So enabling SLUB debugging will change the offsets within the pages of all the SLUB allocations. I wonder if that's just what caused it to now have that 0x002e2e2e002e2e2e instead of NULL. Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 21:27 ` Linus Torvalds 2010-04-06 22:59 ` Borislav Petkov @ 2010-04-06 23:22 ` Rik van Riel 2010-04-07 0:10 ` Linus Torvalds 1 sibling, 1 reply; 231+ messages in thread From: Rik van Riel @ 2010-04-06 23:22 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Andrew Morton, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/06/2010 05:27 PM, Linus Torvalds wrote: > I still don't see _how_ it happens, though. That 'struct anon_vma' is very > simple, and contains literally just the lock and that list_head. It gets more fun. It looks like the anon_vma is only allocated through anon_vma_alloc() and only handled by the functions in rmap.c By themselves, all of those functions look alright. However, I think I may have found a possible bug in the interplay between anon_vma_prepare() and vma_adjust(), across several mprotect invocations. Let me explain what I think may be going on in small steps, since it is quite subtle (assuming I am right). 
1) a process forks, creating a second "layer" of anon_vma objects for the VMAs that have anon pages
2) a new VMA is created adjacent to an existing one, with different permissions
3) anon_vma_prepare is called on the new VMA, this only links the "top" anon_vma to the new VMA, since that is the anon_vma where all new pages get instantiated anyway (this would be part of the bug)
4) mprotect changes the permission of one of the VMAs, causing the old and the new VMAs to get merged
5) vma_adjust calls anon_vma_merge, causing the anon_vma chain of one of the VMAs to get nuked - with bad luck, this is the original one, leaving just the new anon_vma attached to the VMA
6) if the parent process quits, the old anon_vma structs get freed
7) meanwhile, we may still have some anonymous pages sticking around in memory that have their page->mapping point to a freed anon_vma struct

Does this look like it could happen? If so, I'll cook up a patch to change anon_vma_prepare and find_mergeable_anon_vma to attach the whole chain of anon_vmas to the new VMA, using anon_vma_clone().
^ permalink raw reply [flat|nested] 231+ messages in thread
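The seven steps above can be replayed in a small user-space model. This is a toy sketch, not kernel code: every `toy_*` name is hypothetical, the `users` counter merely stands in for "the anon_vma is freed once nothing is linked to it any more", and the merge is simplified to "the original VMA's chain is dropped and the new VMA's chain survives", which is the bad-luck case from step 5.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CHAIN 4

/* Toy anon_vma: counts linked VMAs, "freed" once the count hits zero. */
struct toy_anon_vma {
	int users;
	bool freed;
};

/* Toy VMA: just the list of anon_vmas it is linked to. */
struct toy_vma {
	struct toy_anon_vma *chain[TOY_MAX_CHAIN];
	size_t n;
};

static void toy_link(struct toy_vma *vma, struct toy_anon_vma *av)
{
	assert(vma->n < TOY_MAX_CHAIN);
	vma->chain[vma->n++] = av;
	av->users++;
}

/* Link dst to every anon_vma of src (the role anon_vma_clone() plays). */
static void toy_clone(struct toy_vma *dst, const struct toy_vma *src)
{
	for (size_t i = 0; i < src->n; i++)
		toy_link(dst, src->chain[i]);
}

static void toy_unlink_all(struct toy_vma *vma)
{
	for (size_t i = 0; i < vma->n; i++)
		if (--vma->chain[i]->users == 0)
			vma->chain[i]->freed = true;
	vma->n = 0;
}

/*
 * Replay the scenario. Returns true if the parent's (old) anon_vma is
 * still alive at the end, i.e. old pages in the child do not point at
 * freed memory.
 */
static bool toy_old_anon_vma_survives(bool link_whole_chain)
{
	struct toy_anon_vma old_av = { 0, false }, top_av = { 0, false };
	struct toy_vma parent = { { NULL }, 0 };
	struct toy_vma child_orig = { { NULL }, 0 };
	struct toy_vma child_new = { { NULL }, 0 };

	toy_link(&parent, &old_av);

	/* 1) fork: the child VMA keeps the old anon_vma, gains a new one */
	toy_clone(&child_orig, &parent);
	toy_link(&child_orig, &top_av);

	/* 2) + 3) new adjacent VMA; link only the top anon_vma, or all */
	if (link_whole_chain)
		toy_clone(&child_new, &child_orig);
	else
		toy_link(&child_new, &top_av);

	/* 4) + 5) merge: with bad luck the original chain is dropped */
	toy_unlink_all(&child_orig);

	/* 6) the parent exits */
	toy_unlink_all(&parent);

	/* 7) pages faulted in before the fork still refer to old_av */
	return !old_av.freed;
}
```

In this model `toy_old_anon_vma_survives(false)` reports the old anon_vma as freed while the child could still have pages referring to it, and `toy_old_anon_vma_survives(true)` keeps it alive, which matches the direction of the proposed fix.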
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 23:22 ` Rik van Riel @ 2010-04-07 0:10 ` Linus Torvalds 2010-04-07 1:18 ` Rik van Riel 2010-04-07 10:09 ` Pekka Enberg 0 siblings, 2 replies; 231+ messages in thread From: Linus Torvalds @ 2010-04-07 0:10 UTC (permalink / raw) To: Rik van Riel Cc: Borislav Petkov, Andrew Morton, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, 6 Apr 2010, Rik van Riel wrote: > > It gets more fun. It looks like the anon_vma is only > allocated through anon_vma_alloc() and only handled > by the functions in rmap.c > > By themselves, all of those functions look alright. Yes. Very trivially so, in fact. > However, I think I may have found a possible bug in > the interplay between anon_vma_prepare() and vma_adjust(), > across several mprotect invocations. > > Let me explain what I think may be going on in small > steps, since it is quite subtle (assuming I am right). Sounds at least possible. Way more likely than any of the "trivially obvious" code being buggy, or the SLUB layer suddenly having a serious bug that only the new user could trigger. That said, the code that _really_ confuses me is the stuff that uses "anon_vma_clone()". Could you please also explain the code flow of vma_adjust() to mere mortals, please? I suspect Borislav is sleeping. But at least we have a patch for him to test when he wakes up ;) Linus ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-07 0:10 ` Linus Torvalds @ 2010-04-07 1:18 ` Rik van Riel 2010-04-07 7:22 ` Borislav Petkov 2010-04-07 10:09 ` Pekka Enberg 1 sibling, 1 reply; 231+ messages in thread From: Rik van Riel @ 2010-04-07 1:18 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Andrew Morton, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On 04/06/2010 08:10 PM, Linus Torvalds wrote: > That said, the code that _really_ confuses me is the stuff that uses > "anon_vma_clone()". Could you please also explain the code flow of > vma_adjust() to mere mortals, please? That's easier said than done. I spent 3 days with pen and paper, going over that code before I made the anon_vma changes, first verifying that the code is indeed correct and then figuring out how I could make the anon_vma changes safely. I am not happy with the complexity of the code around vma_adjust, but could not find a way to simplify it and still keep merging VMAs the way we do. My largest change to vma_adjust was moving some code closer to the beginning of the function, so I could bail out if the allocation failed, without making change to the vma... > I suspect Borislav is sleeping. But at least we have a patch for him to > test when he wakes up ;) I am looking forward to the test results. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-07 1:18 ` Rik van Riel @ 2010-04-07 7:22 ` Borislav Petkov 0 siblings, 0 replies; 231+ messages in thread From: Borislav Petkov @ 2010-04-07 7:22 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Andrew Morton, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson From: Rik van Riel <riel@redhat.com> Date: Tue, Apr 06, 2010 at 09:18:28PM -0400 Hi Rik, I think your patch needs a bit more baking, see below :) > >I suspect Borislav is sleeping. But at least we have a patch for him to > >test when he wakes up ;) > > I am looking forward to the test results. This happens when starting X, I haven't even started hibernating. [By the way, further testing will have to wait till tonight since I have a job, you know :) ] Also, mm/rmap.c:745 is BUG_ON(!anon_vma); in __page_set_anon_rmap(). --- [ 43.142371] ------------[ cut here ]------------ [ 43.142411] kernel BUG at mm/rmap.c:745! 
[ 43.142436] invalid opcode: 0000 [#1] PREEMPT SMP [ 43.142514] last sysfs file: /sys/devices/virtual/vtconsole/vtcon0/uevent [ 43.142537] CPU 0 [ 43.142559] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core k10temp ohci_hcd pcspkr [ 43.142997] [ 43.143012] Pid: 1940, comm: console-kit-dae Not tainted 2.6.34-rc3-00289-gae1ed76 #5 M3A78 PRO/System Product Name [ 43.143012] RIP: 0010:[<ffffffff810c08e7>] [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89 [ 43.143012] RSP: 0000:ffff88022c019da8 EFLAGS: 00010246 [ 43.143012] RAX: 0000000000000000 RBX: ffffea000774ff78 RCX: 000000002ce900f4 [ 43.143012] RDX: ffff88000a1d5dc8 RSI: 0000000000000007 RDI: ffffffff816e8740 [ 43.143012] RBP: ffff88022c019dc8 R08: 00007f29e3cfd928 R09: 000000000062c318 [ 43.143012] R10: 0000000000000000 R11: 0000000000000002 R12: ffff88022bbad960 [ 43.143012] R13: 00007f29e3cfd928 R14: 00007f29e3cfd928 R15: 80000002216d9067 [ 43.143012] FS: 00007f29e3d0f790(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000 [ 43.143012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 43.143012] CR2: 00007f29e3cfd928 CR3: 000000022dfd3000 CR4: 00000000000006f0 [ 43.143012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 43.143012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 43.143012] Process console-kit-dae (pid: 1940, threadinfo ffff88022c018000, task ffff88022ce90000) [ 43.143012] Stack: [ 43.143012] ffffffff810b8802 ffff88022bbad960 ffff88022ea3c600 ffff88022bb6d7e8 [ 43.143012] <0> ffff88022c019e48 ffffffff810b8823 ffff88022ea3c6b8 0000000000000246 [ 43.143012] <0> ffffea000774ff78 0000000000000001 00000001e3cfd928 ffff88022fdb58f0 [ 43.143012] Call Trace: [ 43.143012] [<ffffffff810b8802>] ? 
handle_mm_fault+0x2af/0x64e [ 43.143012] [<ffffffff810b8823>] handle_mm_fault+0x2d0/0x64e [ 43.143012] [<ffffffff8101f392>] do_page_fault+0x30b/0x32d [ 43.143012] [<ffffffff810615ce>] ? put_lock_stats+0xe/0x27 [ 43.143012] [<ffffffff81062a55>] ? lock_release_holdtime+0x104/0x109 [ 43.143012] [<ffffffff813f93e3>] ? error_sti+0x5/0x6 [ 43.143012] [<ffffffff813f7de2>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 43.143012] [<ffffffff813f91ff>] page_fault+0x1f/0x30 [ 43.143012] Code: 00 00 48 89 fb 49 89 f4 49 89 d5 f0 80 4f 02 10 be 07 00 00 00 c7 47 0c 00 00 00 00 e8 c5 30 ff ff 49 8b 44 24 78 48 85 c0 75 04 <0f> 0b eb fe 48 ff c0 4c 89 e6 48 89 df 48 89 43 18 4d 2b 6c 24 [ 43.143012] RIP [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89 [ 43.143012] RSP <ffff88022c019da8> [ 43.145276] ---[ end trace d6305f6e826dbd53 ]--- [ 43.145314] note: console-kit-dae[1940] exited with preempt_count 1 [ 73.644201] ------------[ cut here ]------------ [ 73.644218] kernel BUG at mm/rmap.c:745! [ 73.644226] invalid opcode: 0000 [#2] PREEMPT SMP [ 73.644266] last sysfs file: /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq [ 73.644278] CPU 0 [ 73.644287] Modules linked in: powernow_k8 cpufreq_ondemand cpufreq_powersave cpufreq_userspace freq_table cpufreq_conservative binfmt_misc kvm_amd kvm ipv6 vfat fat dm_crypt dm_mod 8250_pnp 8250 edac_core serial_core k10temp ohci_hcd pcspkr [ 73.644509] [ 73.644520] Pid: 2018, comm: iceowl-bin Tainted: G D 2.6.34-rc3-00289-gae1ed76 #5 M3A78 PRO/System Product Name [ 73.644534] RIP: 0010:[<ffffffff810c08e7>] [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89 [ 73.644553] RSP: 0000:ffff88022cd37da8 EFLAGS: 00010246 [ 73.644562] RAX: 0000000000000000 RBX: ffffea000764dfa8 RCX: 0000000000000002 [ 73.644572] RDX: ffff88000a1d5dc8 RSI: 0000000000000007 RDI: ffffffff816e8740 [ 73.644589] RBP: ffff88022cd37dc8 R08: 00007f2ce0aab928 R09: 0000000000000000 [ 73.644603] R10: 0000000000000000 R11: 000000000011da32 R12: ffff88022d5894b0 [ 73.644615] 
R13: 00007f2ce0aab928 R14: 00007f2ce0aab928 R15: 800000021cd23067 [ 73.644628] FS: 00007f2cee88b7b0(0000) GS:ffff88000a000000(0000) knlGS:0000000000000000 [ 73.644639] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 73.644652] CR2: 00007f2ce0aab928 CR3: 000000022b1b5000 CR4: 00000000000006f0 [ 73.644664] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 73.644675] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 73.644690] Process iceowl-bin (pid: 2018, threadinfo ffff88022cd36000, task ffff88022a74a5c0) [ 73.644701] Stack: [ 73.644708] ffffffff810b8802 ffff88022d5894b0 ffff88022ce41e00 ffff88022d4b0558 [ 73.644745] <0> ffff88022cd37e48 ffffffff810b8823 ffff88022ce41eb8 0000000000000246 [ 73.644801] <0> ffffea000764dfa8 0000000000000001 00000001e0aab928 ffff88022c0a4828 [ 73.644862] Call Trace: [ 73.644874] [<ffffffff810b8802>] ? handle_mm_fault+0x2af/0x64e [ 73.644885] [<ffffffff810b8823>] handle_mm_fault+0x2d0/0x64e [ 73.644895] [<ffffffff8101f392>] do_page_fault+0x30b/0x32d [ 73.644909] [<ffffffff810be3c2>] ? do_mmap_pgoff+0x290/0x2f3 [ 73.644921] [<ffffffff813f93e3>] ? error_sti+0x5/0x6 [ 73.644932] [<ffffffff81062b97>] ? trace_hardirqs_off_caller+0x1f/0xa9 [ 73.644943] [<ffffffff813f7de2>] ? 
trace_hardirqs_off_thunk+0x3a/0x3c [ 73.644952] [<ffffffff813f91ff>] page_fault+0x1f/0x30 [ 73.644963] Code: 00 00 48 89 fb 49 89 f4 49 89 d5 f0 80 4f 02 10 be 07 00 00 00 c7 47 0c 00 00 00 00 e8 c5 30 ff ff 49 8b 44 24 78 48 85 c0 75 04 <0f> 0b eb fe 48 ff c0 4c 89 e6 48 89 df 48 89 43 18 4d 2b 6c 24 [ 73.645001] RIP [<ffffffff810c08e7>] page_add_new_anon_rmap+0x3b/0x89 [ 73.645001] RSP <ffff88022cd37da8> [ 73.645610] ---[ end trace d6305f6e826dbd54 ]--- [ 73.645621] note: iceowl-bin[2018] exited with preempt_count 1 [ 77.562222] SysRq : HELP : loglevel(0-9) reBoot Crash show-all-locks(D) terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) [ 78.014120] SysRq : Emergency Sync [ 78.016864] Emergency Sync complete [ 78.585045] SysRq : Emergency Remount R/O [ 78.663367] Emergency Remount complete [ 79.098126] SysRq : Resetting -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-07 0:10 ` Linus Torvalds 2010-04-07 1:18 ` Rik van Riel @ 2010-04-07 10:09 ` Pekka Enberg 2010-04-07 10:12 ` KOSAKI Motohiro 1 sibling, 1 reply; 231+ messages in thread From: Pekka Enberg @ 2010-04-07 10:09 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Borislav Petkov, Andrew Morton, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, Christoph Lameter, Tejun Heo Hi Linus, On Wed, Apr 7, 2010 at 3:10 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > Sounds at least possible. Way more likely than any of the "trivially > obvious" code being buggy, or the SLUB layer suddenly having a serious bug > that only the new user could trigger. I haven't followed the discussion at all but if someone wants to investigate that angle more, the most likely suspect are the recent per-cpu changes. That said, I'd expect the problem to be more widespread if SLUB is to blame here. Pekka ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-07 10:09 ` Pekka Enberg @ 2010-04-07 10:12 ` KOSAKI Motohiro 0 siblings, 0 replies; 231+ messages in thread From: KOSAKI Motohiro @ 2010-04-07 10:12 UTC (permalink / raw) To: Pekka Enberg Cc: kosaki.motohiro, Linus Torvalds, Rik van Riel, Borislav Petkov, Andrew Morton, Minchan Kim, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson, Christoph Lameter, Tejun Heo > Hi Linus, > > On Wed, Apr 7, 2010 at 3:10 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > Sounds at least possible. Way more likely than any of the "trivially > > obvious" code being buggy, or the SLUB layer suddenly having a serious bug > > that only the new user could trigger. > > I haven't followed the discussion at all but if someone wants to > investigate that angle more, the most likely suspect are the recent > per-cpu changes. That said, I'd expect the problem to be more > widespread if SLUB is to blame here. Nope. We don't doubt SLUB nor per-cpu anymore. Rik found the bug in his patch. thanks. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 20:02 ` Linus Torvalds 2010-04-06 20:46 ` Steinar H. Gunderson 2010-04-06 20:51 ` Borislav Petkov @ 2010-04-07 8:41 ` Peter Zijlstra 2 siblings, 0 replies; 231+ messages in thread From: Peter Zijlstra @ 2010-04-07 8:41 UTC (permalink / raw) To: Linus Torvalds Cc: Borislav Petkov, Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins, sgunderson On Tue, 2010-04-06 at 13:02 -0700, Linus Torvalds wrote: > - Related to the above: perhaps the RCU freeing isn't working, or > slub/slab/slob ends up reusing the allocations for something else than > anonvma's, so together with the race _and_ an unlucky re-use, you get > some odd crud. > > I haven't looked at the kernel config files: do they perhaps share the > same (odd?) SLUB/SLAB/SLOB config? Right, so anon_vma uses SLAB_DESTROY_BY_RCU and as the huge comment in rmap.c explains, that doesn't mean the objects themself get RCU grace period delays in freeing, only the SLAB that backs these objects does. So the moment you do kmem_cache_free() on the anon_vma it can be re-used for another allocation. The only guarantee given by RCU is that the backing storage doesn't go away and hence you can 'safely' deref pointers, you still very much have to revalidate you got the object you were looking for. ^ permalink raw reply [flat|nested] 231+ messages in thread
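The point about SLAB_DESTROY_BY_RCU can be illustrated with a deliberately tiny, single-threaded model. It is a sketch under loud assumptions: there is no RCU and no locking here, the "cache" is a single static slot, and all `toy_*` names are invented. It only shows the two properties described above: the storage stays valid to dereference, and the object may be handed out again right after it is freed, so a holder of an old pointer has to revalidate.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object; 'key' is whatever identifies the object we expect. */
struct toy_obj {
	unsigned long key;
	int in_use;
};

/* One-slot "cache": the backing storage never goes away. */
static struct toy_obj toy_slot;

static struct toy_obj *toy_alloc(unsigned long key)
{
	assert(!toy_slot.in_use);
	toy_slot.key = key;
	toy_slot.in_use = 1;
	return &toy_slot;
}

/* Freeing makes the slot reusable immediately, with no grace period. */
static void toy_free(struct toy_obj *o)
{
	assert(o->in_use);
	o->in_use = 0;
}

/*
 * Dereferencing a possibly stale pointer is safe because the storage is
 * type-stable, but the caller must check it still is the expected object.
 */
static struct toy_obj *toy_revalidate(struct toy_obj *maybe_stale,
				      unsigned long expected_key)
{
	if (!maybe_stale->in_use || maybe_stale->key != expected_key)
		return NULL;
	return maybe_stale;
}
```

A caller that kept a pointer to the object with key 1 finds, after a free and a fresh allocation with key 2, that the same address now holds a different object; only the revalidation step tells the two apart.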
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-06 18:28 ` Linus Torvalds 2010-04-06 19:03 ` Andrew Morton @ 2010-04-07 8:36 ` Peter Zijlstra 2010-04-07 9:16 ` Johannes Weiner ` (2 more replies) 1 sibling, 3 replies; 231+ messages in thread From: Peter Zijlstra @ 2010-04-07 8:36 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Minchan Kim, KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote: > Just as an example of the kind of code that makes me worry: > > void unlink_anon_vmas(struct vm_area_struct *vma) > { > struct anon_vma_chain *avc, *next; > > /* Unlink each anon_vma chained to the VMA. */ > list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) { > anon_vma_unlink(avc); > list_del(&avc->same_vma); > anon_vma_chain_free(avc); > } > } > > Now, think about what happens for the *last* entry in that avc chain. It > will call that "anon_vma_unlink()" thing, which will delete perhaps the > last entry in the "same_anon_vma" one, and then it does > > if (empty) > anon_vma_free(anon_vma); > > *before* unlink_anon_vma's has actually does that > > list_del(&avc->same_vma); > > and what we essentially have is a stale anon_vma_chain entry that still > exists on that same_vma list, and points to an anon_vma that already got > deleted. > > Does it matter? I really can't see that it does. I think it does, the anon_vma thing has an RCU destroyed slab, but that doesn't mean the anon_vma object itself is rcu delayed. The moment we free it it can be re-used. So the above use after free is a bug. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3) 2010-04-07 8:36 ` Peter Zijlstra @ 2010-04-07 9:16 ` Johannes Weiner 2010-04-07 9:37 ` Peter Zijlstra 2010-04-07 14:12 ` Rik van Riel 2010-04-07 15:46 ` Linus Torvalds 2 siblings, 1 reply; 231+ messages in thread From: Johannes Weiner @ 2010-04-07 9:16 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Rik van Riel, Minchan Kim, KOSAKI Motohiro, Borislav Petkov, Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins On Wed, Apr 07, 2010 at 10:36:43AM +0200, Peter Zijlstra wrote: > On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote: > > Just as an example of the kind of code that makes me worry: > > > > void unlink_anon_vmas(struct vm_area_struct *vma) > > { > > struct anon_vma_chain *avc, *next; > > > > /* Unlink each anon_vma chained to the VMA. */ > > list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) { > > anon_vma_unlink(avc); > > list_del(&avc->same_vma); > > anon_vma_chain_free(avc); > > } > > } > > > > Now, think about what happens for the *last* entry in that avc chain. It > > will call that "anon_vma_unlink()" thing, which will delete perhaps the > > last entry in the "same_anon_vma" one, and then it does > > > > if (empty) > > anon_vma_free(anon_vma); > > > > *before* unlink_anon_vma's has actually does that > > > > list_del(&avc->same_vma); > > > > and what we essentially have is a stale anon_vma_chain entry that still > > exists on that same_vma list, and points to an anon_vma that already got > > deleted. > > > > Does it matter? I really can't see that it does. > > I think it does, the anon_vma thing has an RCU destroyed slab, but that > doesn't mean the anon_vma object itself is rcu delayed. The moment we > free it it can be re-used. So the above use after free is a bug. It frees avc->anon_vma, not avc. 
So the sequence is free(avc->anon_vma) in anon_vma_unlink() list_del(&avc->same_vma) in unlink_anon_vmas() It's not a use-after free. A problem would be if somebody should find the avc through this list (it is the vma->anon_vma_chain list) when its anon_vma pointer is invalid. I don't think this can happen, however. Both the unlinking and the looking at the list happen under vma->vm_mm's mmap_sem held for writing. ^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-07 9:16 ` Johannes Weiner
@ 2010-04-07 9:37   ` Peter Zijlstra
  0 siblings, 0 replies; 231+ messages in thread
From: Peter Zijlstra @ 2010-04-07 9:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
      Borislav Petkov, Andrew Morton, Linux Kernel Mailing List,
      Lee Schermerhorn, Nick Piggin, Andrea Arcangeli, Hugh Dickins

On Wed, 2010-04-07 at 11:16 +0200, Johannes Weiner wrote:
> On Wed, Apr 07, 2010 at 10:36:43AM +0200, Peter Zijlstra wrote:
> > On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:
> > > Just as an example of the kind of code that makes me worry:
> > >
> > >         void unlink_anon_vmas(struct vm_area_struct *vma)
> > >         {
> > >                 struct anon_vma_chain *avc, *next;
> > >
> > >                 /* Unlink each anon_vma chained to the VMA. */
> > >                 list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
> > >                         anon_vma_unlink(avc);
> > >                         list_del(&avc->same_vma);
> > >                         anon_vma_chain_free(avc);
> > >                 }
> > >         }
> > >
> > > Now, think about what happens for the *last* entry in that avc chain. It
> > > will call that "anon_vma_unlink()" thing, which will delete perhaps the
> > > last entry in the "same_anon_vma" one, and then it does
> > >
> > >         if (empty)
> > >                 anon_vma_free(anon_vma);
> > >
> > > *before* unlink_anon_vma's has actually does that
> > >
> > >         list_del(&avc->same_vma);
> > >
> > > and what we essentially have is a stale anon_vma_chain entry that still
> > > exists on that same_vma list, and points to an anon_vma that already got
> > > deleted.
> > >
> > > Does it matter? I really can't see that it does.
> >
> > I think it does, the anon_vma thing has an RCU destroyed slab, but that
> > doesn't mean the anon_vma object itself is rcu delayed. The moment we
> > free it it can be re-used. So the above use after free is a bug.
>
> It frees avc->anon_vma, not avc.

Sure, freeing avc does not involve RCU in any way.

> So the sequence is
>
>         free(avc->anon_vma) in anon_vma_unlink()
>         list_del(&avc->same_vma) in unlink_anon_vmas()
>
> It's not a use-after free. A problem would be if somebody should find the
> avc through this list (it is the vma->anon_vma_chain list) when its anon_vma
> pointer is invalid.
>
> I don't think this can happen, however. Both the unlinking and the looking
> at the list happen under vma->vm_mm's mmap_sem held for writing.

What I was worried about was it freeing anon_vma and then still having
the avc on the list. But I guess that cannot happen because it only frees if
it's actually empty.

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-07 8:36 ` Peter Zijlstra
  2010-04-07 9:16   ` Johannes Weiner
@ 2010-04-07 14:12   ` Rik van Riel
  2010-04-07 15:46   ` Linus Torvalds
  2 siblings, 0 replies; 231+ messages in thread
From: Rik van Riel @ 2010-04-07 14:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Minchan Kim, KOSAKI Motohiro, Borislav Petkov,
      Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn,
      Nick Piggin, Andrea Arcangeli, Hugh Dickins

On 04/07/2010 04:36 AM, Peter Zijlstra wrote:
> On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:

>>         if (empty)
>>                 anon_vma_free(anon_vma);
>>
>> *before* unlink_anon_vma's has actually does that
>>
>>         list_del(&avc->same_vma);
>>
>> and what we essentially have is a stale anon_vma_chain entry that still
>> exists on that same_vma list, and points to an anon_vma that already got
>> deleted.
>>
>> Does it matter? I really can't see that it does.
>
> I think it does, the anon_vma thing has an RCU destroyed slab, but that
> doesn't mean the anon_vma object itself is rcu delayed. The moment we
> free it it can be re-used. So the above use after free is a bug.

Peter, the avc is an anon_vma_chain, which is a different
object than the anon_vma itself.

There is no use after free of an anon_vma object in
unlink_anon_vmas + anon_vma_unlink.

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-07 8:36 ` Peter Zijlstra
  2010-04-07 9:16   ` Johannes Weiner
  2010-04-07 14:12   ` Rik van Riel
@ 2010-04-07 15:46   ` Linus Torvalds
  2 siblings, 0 replies; 231+ messages in thread
From: Linus Torvalds @ 2010-04-07 15:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Minchan Kim, KOSAKI Motohiro, Borislav Petkov,
      Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn,
      Nick Piggin, Andrea Arcangeli, Hugh Dickins

On Wed, 7 Apr 2010, Peter Zijlstra wrote:
> On Tue, 2010-04-06 at 11:28 -0700, Linus Torvalds wrote:
> > Just as an example of the kind of code that makes me worry:
> >
> >         void unlink_anon_vmas(struct vm_area_struct *vma)
> >         {
> >                 struct anon_vma_chain *avc, *next;
> >
> >                 /* Unlink each anon_vma chained to the VMA. */
> >                 list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
> >                         anon_vma_unlink(avc);
> >                         list_del(&avc->same_vma);
> >                         anon_vma_chain_free(avc);
> >                 }
> >         }
> >
> > Now, think about what happens for the *last* entry in that avc chain. It
> > will call that "anon_vma_unlink()" thing, which will delete perhaps the
> > last entry in the "same_anon_vma" one, and then it does
> >
> >         if (empty)
> >                 anon_vma_free(anon_vma);
> >
> > *before* unlink_anon_vma's has actually does that
> >
> >         list_del(&avc->same_vma);
> >
> > and what we essentially have is a stale anon_vma_chain entry that still
> > exists on that same_vma list, and points to an anon_vma that already got
> > deleted.
> >
> > Does it matter? I really can't see that it does.
>
> I think it does, the anon_vma thing has an RCU destroyed slab, but that
> doesn't mean the anon_vma object itself is rcu delayed. The moment we
> free it it can be re-used. So the above use after free is a bug.

Well, it's not really a "use after free" - it's just that a stale pointer
still exists in a live data structure that is linked into the list.

I don't think there is a real bug there, simply because I don't think
anybody will be accessing that list (we should hopefully have all the
sufficient mutual exclusion in place). So I just think it is bad form to
potentially free something before we get rid of all pointers to it.

So to me it's a cleanliness issue: good code shouldn't do things like
that, and it would be much cleaner to remove the AVC entry that has a
pointer to the anon_vma _before_ we might be freeing the anon_vma.

Maybe I'm just anal.

                Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
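
The ordering concern in the message above (drop the chain entry from the
list before the object it points to may be freed) can be illustrated with a
small userspace model. This is only a sketch under stated assumptions: the
types anon_vma_model, avc_model and vma_model, and every function below, are
hypothetical stand-ins invented for illustration. They are not the kernel's
structures, they use a plain singly linked list instead of list_head, and
"freeing" is modelled by setting a flag so that the stale window can be
observed safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins, not the kernel's real structures. */
struct anon_vma_model {
	int nr_chains;	/* chain entries that still point at this object */
	bool freed;	/* set instead of calling free(), so it can be inspected */
};

struct avc_model {
	struct anon_vma_model *anon_vma;
	struct avc_model *next;	/* stand-in for the same_vma list linkage */
};

struct vma_model {
	struct avc_model *chain;
};

/* Link a new chain entry for 'av' at the head of the VMA's list. */
static void vma_add_chain(struct vma_model *vma, struct anon_vma_model *av)
{
	struct avc_model *avc = malloc(sizeof(*avc));

	assert(avc != NULL);
	avc->anon_vma = av;
	avc->next = vma->chain;
	vma->chain = avc;
	av->nr_chains++;
}

/* Count entries still on the list whose anon_vma is already "freed". */
static int count_stale(const struct vma_model *vma)
{
	int n = 0;

	for (const struct avc_model *a = vma->chain; a; a = a->next)
		if (a->anon_vma->freed)
			n++;
	return n;
}

/* Drop one reference; mark the anon_vma freed when the last one goes. */
static void anon_vma_unlink_model(struct avc_model *avc)
{
	struct anon_vma_model *av = avc->anon_vma;

	assert(av->nr_chains > 0);
	if (--av->nr_chains == 0)
		av->freed = true;
}

/*
 * Tear down the whole chain. Returns the largest number of stale entries
 * that were reachable from the list at any point during the teardown.
 */
static int unlink_all(struct vma_model *vma, bool del_first)
{
	int worst = 0;

	while (vma->chain) {
		struct avc_model *avc = vma->chain;
		int stale;

		if (del_first) {
			vma->chain = avc->next;		/* unlink from the list first */
			anon_vma_unlink_model(avc);	/* may "free" the anon_vma */
			stale = count_stale(vma);
		} else {
			anon_vma_unlink_model(avc);	/* may "free" the anon_vma */
			stale = count_stale(vma);	/* avc is still linked here */
			vma->chain = avc->next;
		}
		if (stale > worst)
			worst = stale;
		free(avc);
	}
	return worst;
}
```

In this model, calling unlink_all() with del_first set to false mirrors the
order quoted in the thread and reports a window with one stale entry, while
del_first set to true never exposes one. Whether that window is reachable by
anybody in the real kernel is exactly what the thread debates; the model
says nothing about the locking.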
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 16:23 ` Minchan Kim
  2010-04-06 16:28   ` Linus Torvalds
@ 2010-04-06 16:32   ` Linus Torvalds
  2010-04-06 16:54     ` Minchan Kim
  1 sibling, 1 reply; 231+ messages in thread
From: Linus Torvalds @ 2010-04-06 16:32 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, KOSAKI Motohiro, Borislav Petkov, Andrew Morton,
      Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
      Andrea Arcangeli, Hugh Dickins

On Wed, 7 Apr 2010, Minchan Kim wrote:
> >
> > I don't think so. That isn't the racy case. We're working with a
> > anon_vma_chain, so the anonvma is all there.
>
> But the anon_vma is using for another anon_vma.

No, that can only happen if somebody has done "anon_vma_free()" on it. And
nobody does that if the anonvma still has a non-empty '&anon_vma->head'.

So as long as the anon_vma has an anon_vma_chain entry associated with it
(or a ksm refcount, but that's a separate issue), it's not going to be
re-allocated for any other use, because it's not going to be free'd.

                Linus

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 16:32 ` Linus Torvalds
@ 2010-04-06 16:54   ` Minchan Kim
  0 siblings, 0 replies; 231+ messages in thread
From: Minchan Kim @ 2010-04-06 16:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, KOSAKI Motohiro, Borislav Petkov, Andrew Morton,
      Linux Kernel Mailing List, Lee Schermerhorn, Nick Piggin,
      Andrea Arcangeli, Hugh Dickins

On Tue, 2010-04-06 at 09:32 -0700, Linus Torvalds wrote:
>
> On Wed, 7 Apr 2010, Minchan Kim wrote:
> > >
> > > I don't think so. That isn't the racy case. We're working with a
> > > anon_vma_chain, so the anonvma is all there.
> >
> > But the anon_vma is using for another anon_vma.
>
> No, that can only happen if somebody has done "anon_vma_free()" on it. And
> nobody does that if the anonvma still has a non-empty '&anon_vma->head'.
>
> So as long as the anon_vma has an anon_vma_chain entry associated with it
> (or a ksm refcount, but that's a separate issue), it's not going to be
> re-allocated for any other use, because it's not going to be free'd.
>
>                 Linus

That's what I am missing.
Thanks, Linus.

I will think over the problem. :)

--
Kind regards,
Minchan Kim

^ permalink raw reply [flat|nested] 231+ messages in thread
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 15:55 ` Linus Torvalds
  2010-04-06 16:23   ` Minchan Kim
@ 2010-04-07 8:37   ` Peter Zijlstra
  1 sibling, 0 replies; 231+ messages in thread
From: Peter Zijlstra @ 2010-04-07 8:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Minchan Kim, Rik van Riel, KOSAKI Motohiro, Borislav Petkov,
      Andrew Morton, Linux Kernel Mailing List, Lee Schermerhorn,
      Nick Piggin, Andrea Arcangeli, Hugh Dickins

On Tue, 2010-04-06 at 08:55 -0700, Linus Torvalds wrote:
> I do wonder if "page_lock_anon_vma()" should check the whole
> "page_mapped()" case _after_ taking the anon_vma lock. Because if the race
> happens, we're following a anon_vma list that has nothing to do with that
> page (it's stilla _valid_ list, since we locked the anon_vma, but will it
> be ok?)
>
> IOW, what is it that really keeps the anon_vma list reliable _and_
> relevant wrt the page? We know we may get a stale anon_vma, are we ok if
> that anon_vma list doesn't actually have anything to do with the page any
> more?

When doing the whole make i_mmap_lock/anon_vma->lock a mutex thing last
week I ran into the same issue and it's on my todo list to find out wth
is happening there.

So yes I think we should move that validation check inside
page_lock_anon_vma().

I'll cook up a patch once I'm done staring at the various funny arch
mmu_gather implementations.

^ permalink raw reply [flat|nested] 231+ messages in thread
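
The "check again after taking the lock" idea discussed in the message above
can be sketched as a single-threaded userspace model. Everything here is an
assumption made for illustration: anon_vma_m, page_m, the race_hook callback
and lock_anon_vma_checked() are hypothetical names, the lock is a plain flag
rather than a real spinlock, and the hook merely simulates another CPU
unmapping the page in the window between the first check and the lock. It is
not the patch mentioned in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel types and locks differ. */
struct anon_vma_m {
	bool locked;		/* models the anon_vma lock as a simple flag */
};

struct page_m {
	struct anon_vma_m *anon_vma;
	bool mapped;		/* models the result of a page_mapped() check */
};

/* Called between the first check and the lock to simulate a racing unmap. */
typedef void (*race_hook)(struct page_m *page);

static void simulate_unmap(struct page_m *page)
{
	page->mapped = false;
}

/*
 * Return the locked anon_vma, or NULL if the page is (or has become)
 * unmapped. The second 'mapped' test runs with the lock held, so a caller
 * that gets a non-NULL result knows the page was still mapped at lock time.
 */
static struct anon_vma_m *lock_anon_vma_checked(struct page_m *page,
						race_hook hook)
{
	struct anon_vma_m *av = page->anon_vma;

	if (av == NULL || !page->mapped)
		return NULL;		/* cheap early exit, not authoritative */

	if (hook)
		hook(page);		/* the race window */

	assert(!av->locked);
	av->locked = true;		/* take the "lock" */

	if (!page->mapped) {		/* revalidate under the lock */
		av->locked = false;
		return NULL;
	}
	return av;
}
```

Passing NULL as the hook models the uncontended case and returns the locked
object; passing simulate_unmap models the race and makes the function back
out with the lock released instead of handing back an anon_vma that may no
longer be relevant to the page.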
* Re: Ugly rmap NULL ptr deref oopsie on hibernate (was Linux 2.6.34-rc3)
  2010-04-06 14:38 ` Rik van Riel
  2010-04-06 15:34   ` Minchan Kim
@ 2010-04-06 17:05   ` Borislav Petkov
  1 sibling, 0 replies; 231+ messages in thread
From: Borislav Petkov @ 2010-04-06 17:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KOSAKI Motohiro, Linus Torvalds, Andrew Morton,
      Linux Kernel Mailing List, Lee Schermerhorn, Minchan Kim,
      Nick Piggin, Andrea Arcangeli, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 975 bytes --]

From: Rik van Riel <riel@redhat.com>
Date: Tue, Apr 06, 2010 at 10:38:18AM -0400

> This makes me wonder if perhaps the bug is a side effect
> of something Borislav (and the other reproducers) have
> in their kernel configuration, which we do not have.
>
> Another (unlikely) thing is that the fix for the leak
> makes the bug go away.

Yes, very unlikely.

>
> Borislav, could you please send us your .config ?

attached.

> Also, if you have the time, could you try out the
> patch (-v2) I mailed in a little up this thread
> that fixes the memory leak in anon_vma_fork?

Sure, building on top of v2.6.34-rc3-288-gab195c5. Will try to trigger
it but let me remind you that it will take a while since it doesn't
happen every time I suspend. Any other printks or debug output which might
be helpful to slap at the site, page_referenced_anon() I mean?

> I suspect it should not change anything, but it
> could be useful to rule out anyway.
>

--
Regards/Gruss,
Boris.
[-- Attachment #2: config-2.6.34-rc3 --] [-- Type: text/plain, Size: 62617 bytes --] # # Automatically generated make config: don't edit # Linux kernel version: 2.6.34-rc3 # Tue Mar 30 22:52:17 2010 # CONFIG_64BIT=y # CONFIG_X86_32 is not set CONFIG_X86_64=y CONFIG_X86=y CONFIG_OUTPUT_FORMAT="elf64-x86-64" CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig" CONFIG_GENERIC_TIME=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_HAVE_LATENCYTOP_SUPPORT=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_NEED_DMA_MAP_STATE=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y # CONFIG_RWSEM_GENERIC_SPINLOCK is not set CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_ARCH_HAS_CPU_RELAX=y CONFIG_ARCH_HAS_DEFAULT_IDLE=y CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y CONFIG_ZONE_DMA32=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_AUDIT_ARCH=y CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_HAVE_EARLY_RES=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_USE_GENERIC_SMP_HELPERS=y CONFIG_X86_64_SMP=y CONFIG_X86_HT=y CONFIG_X86_TRAMPOLINE=y # CONFIG_KTIME_SCALAR is not set CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_CONSTRUCTORS=y # # General setup # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y 
CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_KERNEL_GZIP=y # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set # CONFIG_KERNEL_LZO is not set CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_POSIX_MQUEUE_SYSCTL=y # CONFIG_BSD_PROCESS_ACCT is not set # CONFIG_TASKSTATS is not set # CONFIG_AUDIT is not set # # RCU Subsystem # CONFIG_TREE_RCU=y # CONFIG_TREE_PREEMPT_RCU is not set # CONFIG_TINY_RCU is not set # CONFIG_RCU_TRACE is not set CONFIG_RCU_FANOUT=64 # CONFIG_RCU_FANOUT_EXACT is not set # CONFIG_RCU_FAST_NO_HZ is not set # CONFIG_TREE_RCU_TRACE is not set CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_LOG_BUF_SHIFT=21 CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y # CONFIG_CGROUPS is not set # CONFIG_SYSFS_DEPRECATED_V2 is not set # CONFIG_RELAY is not set CONFIG_NAMESPACES=y # CONFIG_UTS_NS is not set # CONFIG_IPC_NS is not set # CONFIG_USER_NS is not set # CONFIG_PID_NS is not set # CONFIG_NET_NS is not set # CONFIG_BLK_DEV_INITRD is not set CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_SYSCTL=y CONFIG_ANON_INODES=y # CONFIG_EMBEDDED is not set CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y CONFIG_KALLSYMS_ALL=y # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_PCSPKR_PLATFORM=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_EPOLL=y CONFIG_SIGNALFD=y CONFIG_TIMERFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y CONFIG_AIO=y CONFIG_HAVE_PERF_EVENTS=y # # Kernel Performance Events And Counters # CONFIG_PERF_EVENTS=y # CONFIG_PERF_COUNTERS is not set # CONFIG_DEBUG_PERF_USE_VMALLOC is not set CONFIG_VM_EVENT_COUNTERS=y CONFIG_PCI_QUIRKS=y CONFIG_SLUB_DEBUG=y # CONFIG_COMPAT_BRK is not set # CONFIG_SLAB is not set CONFIG_SLUB=y # CONFIG_SLOB is not set # CONFIG_PROFILING is not set CONFIG_TRACEPOINTS=y CONFIG_HAVE_OPROFILE=y # CONFIG_KPROBES is not set CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y CONFIG_USER_RETURN_NOTIFIER=y CONFIG_HAVE_IOREMAP_PROT=y CONFIG_HAVE_KPROBES=y 
CONFIG_HAVE_KRETPROBES=y CONFIG_HAVE_OPTPROBES=y CONFIG_HAVE_ARCH_TRACEHOOK=y CONFIG_HAVE_DMA_ATTRS=y CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y CONFIG_HAVE_DMA_API_DEBUG=y CONFIG_HAVE_HW_BREAKPOINT=y CONFIG_HAVE_USER_RETURN_NOTIFIER=y # # GCOV-based kernel profiling # # CONFIG_GCOV_KERNEL is not set CONFIG_SLOW_WORK=y # CONFIG_SLOW_WORK_DEBUG is not set # CONFIG_HAVE_GENERIC_DMA_COHERENT is not set CONFIG_SLABINFO=y CONFIG_RT_MUTEXES=y CONFIG_BASE_SMALL=0 CONFIG_MODULES=y CONFIG_MODULE_FORCE_LOAD=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y CONFIG_MODVERSIONS=y CONFIG_MODULE_SRCVERSION_ALL=y CONFIG_STOP_MACHINE=y CONFIG_BLOCK=y CONFIG_BLK_DEV_BSG=y # CONFIG_BLK_DEV_INTEGRITY is not set CONFIG_BLOCK_COMPAT=y # # IO Schedulers # CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y # CONFIG_DEFAULT_DEADLINE is not set CONFIG_DEFAULT_CFQ=y # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="cfq" CONFIG_PREEMPT_NOTIFIERS=y # CONFIG_INLINE_SPIN_TRYLOCK is not set # CONFIG_INLINE_SPIN_TRYLOCK_BH is not set # CONFIG_INLINE_SPIN_LOCK is not set # CONFIG_INLINE_SPIN_LOCK_BH is not set # CONFIG_INLINE_SPIN_LOCK_IRQ is not set # CONFIG_INLINE_SPIN_LOCK_IRQSAVE is not set # CONFIG_INLINE_SPIN_UNLOCK is not set # CONFIG_INLINE_SPIN_UNLOCK_BH is not set # CONFIG_INLINE_SPIN_UNLOCK_IRQ is not set # CONFIG_INLINE_SPIN_UNLOCK_IRQRESTORE is not set # CONFIG_INLINE_READ_TRYLOCK is not set # CONFIG_INLINE_READ_LOCK is not set # CONFIG_INLINE_READ_LOCK_BH is not set # CONFIG_INLINE_READ_LOCK_IRQ is not set # CONFIG_INLINE_READ_LOCK_IRQSAVE is not set # CONFIG_INLINE_READ_UNLOCK is not set # CONFIG_INLINE_READ_UNLOCK_BH is not set # CONFIG_INLINE_READ_UNLOCK_IRQ is not set # CONFIG_INLINE_READ_UNLOCK_IRQRESTORE is not set # CONFIG_INLINE_WRITE_TRYLOCK is not set # CONFIG_INLINE_WRITE_LOCK is not set # CONFIG_INLINE_WRITE_LOCK_BH is not set # CONFIG_INLINE_WRITE_LOCK_IRQ is not set # CONFIG_INLINE_WRITE_LOCK_IRQSAVE is not set # CONFIG_INLINE_WRITE_UNLOCK is 
not set # CONFIG_INLINE_WRITE_UNLOCK_BH is not set # CONFIG_INLINE_WRITE_UNLOCK_IRQ is not set # CONFIG_INLINE_WRITE_UNLOCK_IRQRESTORE is not set # CONFIG_MUTEX_SPIN_ON_OWNER is not set CONFIG_FREEZER=y # # Processor type and features # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y CONFIG_GENERIC_CLOCKEVENTS_BUILD=y CONFIG_SMP=y # CONFIG_SPARSE_IRQ is not set # CONFIG_X86_MPPARSE is not set # CONFIG_X86_EXTENDED_PLATFORM is not set CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y CONFIG_SCHED_OMIT_FRAME_POINTER=y # CONFIG_PARAVIRT_GUEST is not set CONFIG_NO_BOOTMEM=y CONFIG_MEMTEST=y # CONFIG_M386 is not set # CONFIG_M486 is not set # CONFIG_M586 is not set # CONFIG_M586TSC is not set # CONFIG_M586MMX is not set # CONFIG_M686 is not set # CONFIG_MPENTIUMII is not set # CONFIG_MPENTIUMIII is not set # CONFIG_MPENTIUMM is not set # CONFIG_MPENTIUM4 is not set # CONFIG_MK6 is not set # CONFIG_MK7 is not set CONFIG_MK8=y # CONFIG_MCRUSOE is not set # CONFIG_MEFFICEON is not set # CONFIG_MWINCHIPC6 is not set # CONFIG_MWINCHIP3D is not set # CONFIG_MGEODEGX1 is not set # CONFIG_MGEODE_LX is not set # CONFIG_MCYRIXIII is not set # CONFIG_MVIAC3_2 is not set # CONFIG_MVIAC7 is not set # CONFIG_MPSC is not set # CONFIG_MCORE2 is not set # CONFIG_MATOM is not set # CONFIG_GENERIC_CPU is not set CONFIG_X86_CPU=y CONFIG_X86_INTERNODE_CACHE_SHIFT=6 CONFIG_X86_CMPXCHG=y CONFIG_X86_L1_CACHE_SHIFT=6 CONFIG_X86_XADD=y CONFIG_X86_WP_WORKS_OK=y CONFIG_X86_INTEL_USERCOPY=y CONFIG_X86_USE_PPRO_CHECKSUM=y CONFIG_X86_TSC=y CONFIG_X86_CMPXCHG64=y CONFIG_X86_CMOV=y CONFIG_X86_MINIMUM_CPU_FAMILY=64 CONFIG_X86_DEBUGCTLMSR=y CONFIG_CPU_SUP_INTEL=y CONFIG_CPU_SUP_AMD=y CONFIG_CPU_SUP_CENTAUR=y # CONFIG_X86_DS is not set CONFIG_HPET_TIMER=y CONFIG_HPET_EMULATE_RTC=y CONFIG_DMI=y CONFIG_GART_IOMMU=y # CONFIG_CALGARY_IOMMU is not set # CONFIG_AMD_IOMMU is not set CONFIG_SWIOTLB=y CONFIG_IOMMU_HELPER=y # CONFIG_IOMMU_API is not set # CONFIG_MAXSMP is not set CONFIG_NR_CPUS=8 # 
CONFIG_SCHED_SMT is not set CONFIG_SCHED_MC=y # CONFIG_PREEMPT_NONE is not set # CONFIG_PREEMPT_VOLUNTARY is not set CONFIG_PREEMPT=y CONFIG_X86_LOCAL_APIC=y CONFIG_X86_IO_APIC=y # CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set CONFIG_X86_MCE=y # CONFIG_X86_MCE_INTEL is not set CONFIG_X86_MCE_AMD=y CONFIG_X86_MCE_THRESHOLD=y # CONFIG_X86_MCE_INJECT is not set # CONFIG_I8K is not set CONFIG_MICROCODE=m # CONFIG_MICROCODE_INTEL is not set CONFIG_MICROCODE_AMD=y CONFIG_MICROCODE_OLD_INTERFACE=y CONFIG_X86_MSR=m CONFIG_X86_CPUID=m CONFIG_ARCH_PHYS_ADDR_T_64BIT=y CONFIG_DIRECT_GBPAGES=y # CONFIG_NUMA is not set CONFIG_ARCH_PROC_KCORE_TEXT=y CONFIG_ARCH_SPARSEMEM_DEFAULT=y CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_ARCH_SELECT_MEMORY_MODEL=y CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000 CONFIG_SELECT_MEMORY_MODEL=y # CONFIG_FLATMEM_MANUAL is not set # CONFIG_DISCONTIGMEM_MANUAL is not set CONFIG_SPARSEMEM_MANUAL=y CONFIG_SPARSEMEM=y CONFIG_HAVE_MEMORY_PRESENT=y CONFIG_SPARSEMEM_EXTREME=y CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y CONFIG_SPARSEMEM_VMEMMAP=y # CONFIG_MEMORY_HOTPLUG is not set CONFIG_PAGEFLAGS_EXTENDED=y CONFIG_SPLIT_PTLOCK_CPUS=999999 CONFIG_PHYS_ADDR_T_64BIT=y CONFIG_ZONE_DMA_FLAG=1 CONFIG_BOUNCE=y CONFIG_VIRT_TO_BUS=y CONFIG_MMU_NOTIFIER=y # CONFIG_KSM is not set CONFIG_DEFAULT_MMAP_MIN_ADDR=4096 CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y # CONFIG_MEMORY_FAILURE is not set # CONFIG_X86_CHECK_BIOS_CORRUPTION is not set CONFIG_X86_RESERVE_LOW_64K=y CONFIG_MTRR=y CONFIG_MTRR_SANITIZER=y CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0 CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1 CONFIG_X86_PAT=y CONFIG_ARCH_USES_PG_UNCACHED=y # CONFIG_EFI is not set # CONFIG_SECCOMP is not set # CONFIG_CC_STACKPROTECTOR is not set # CONFIG_HZ_100 is not set # CONFIG_HZ_250 is not set # CONFIG_HZ_300 is not set CONFIG_HZ_1000=y CONFIG_HZ=1000 CONFIG_SCHED_HRTICK=y # CONFIG_KEXEC is not set # CONFIG_CRASH_DUMP is not set CONFIG_PHYSICAL_START=0x1000000 # 
CONFIG_RELOCATABLE is not set CONFIG_PHYSICAL_ALIGN=0x1000000 CONFIG_HOTPLUG_CPU=y # CONFIG_COMPAT_VDSO is not set # CONFIG_CMDLINE_BOOL is not set CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y # # Power management and ACPI options # CONFIG_ARCH_HIBERNATION_HEADER=y CONFIG_PM=y # CONFIG_PM_DEBUG is not set CONFIG_PM_SLEEP_SMP=y CONFIG_PM_SLEEP=y CONFIG_SUSPEND=y CONFIG_SUSPEND_FREEZER=y CONFIG_HIBERNATION_NVS=y CONFIG_HIBERNATION=y CONFIG_PM_STD_PARTITION="/dev/sda2" # CONFIG_PM_RUNTIME is not set CONFIG_PM_OPS=y CONFIG_ACPI=y CONFIG_ACPI_SLEEP=y # CONFIG_ACPI_PROCFS is not set # CONFIG_ACPI_PROCFS_POWER is not set # CONFIG_ACPI_POWER_METER is not set CONFIG_ACPI_SYSFS_POWER=y # CONFIG_ACPI_PROC_EVENT is not set # CONFIG_ACPI_AC is not set # CONFIG_ACPI_BATTERY is not set # CONFIG_ACPI_BUTTON is not set # CONFIG_ACPI_FAN is not set CONFIG_ACPI_DOCK=y CONFIG_ACPI_PROCESSOR=y CONFIG_ACPI_HOTPLUG_CPU=y # CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set CONFIG_ACPI_THERMAL=y # CONFIG_ACPI_CUSTOM_DSDT is not set CONFIG_ACPI_BLACKLIST_YEAR=0 # CONFIG_ACPI_DEBUG is not set # CONFIG_ACPI_PCI_SLOT is not set CONFIG_X86_PM_TIMER=y CONFIG_ACPI_CONTAINER=y # CONFIG_ACPI_SBS is not set # CONFIG_SFI is not set # # CPU Frequency scaling # CONFIG_CPU_FREQ=y CONFIG_CPU_FREQ_TABLE=m # CONFIG_CPU_FREQ_DEBUG is not set # CONFIG_CPU_FREQ_STAT is not set CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set CONFIG_CPU_FREQ_GOV_PERFORMANCE=y CONFIG_CPU_FREQ_GOV_POWERSAVE=m CONFIG_CPU_FREQ_GOV_USERSPACE=m CONFIG_CPU_FREQ_GOV_ONDEMAND=m CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m # # CPUFreq processor drivers # # CONFIG_X86_PCC_CPUFREQ is not set CONFIG_X86_ACPI_CPUFREQ=m CONFIG_X86_POWERNOW_K8=m # CONFIG_X86_SPEEDSTEP_CENTRINO is not set # CONFIG_X86_P4_CLOCKMOD is not set # # shared options # # CONFIG_X86_SPEEDSTEP_LIB 
is not set CONFIG_CPU_IDLE=y CONFIG_CPU_IDLE_GOV_LADDER=y CONFIG_CPU_IDLE_GOV_MENU=y # # Memory power savings # # CONFIG_I7300_IDLE is not set # # Bus options (PCI etc.) # CONFIG_PCI=y CONFIG_PCI_DIRECT=y CONFIG_PCI_MMCONFIG=y CONFIG_PCI_DOMAINS=y CONFIG_PCIEPORTBUS=y CONFIG_PCIEAER=y # CONFIG_PCIE_ECRC is not set # CONFIG_PCIEAER_INJECT is not set # CONFIG_PCIEASPM is not set CONFIG_ARCH_SUPPORTS_MSI=y # CONFIG_PCI_MSI is not set # CONFIG_PCI_DEBUG is not set # CONFIG_PCI_STUB is not set CONFIG_HT_IRQ=y # CONFIG_PCI_IOV is not set CONFIG_PCI_IOAPIC=y CONFIG_ISA_DMA_API=y CONFIG_K8_NB=y # CONFIG_PCCARD is not set # CONFIG_HOTPLUG_PCI is not set # # Executable file formats / Emulations # CONFIG_BINFMT_ELF=y CONFIG_COMPAT_BINFMT_ELF=y # CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set # CONFIG_HAVE_AOUT is not set CONFIG_BINFMT_MISC=m CONFIG_IA32_EMULATION=y CONFIG_IA32_AOUT=y CONFIG_COMPAT=y CONFIG_COMPAT_FOR_U64_ALIGNMENT=y CONFIG_SYSVIPC_COMPAT=y CONFIG_NET=y # # Networking options # CONFIG_PACKET=m CONFIG_UNIX=y CONFIG_XFRM=y CONFIG_XFRM_USER=m # CONFIG_XFRM_SUB_POLICY is not set # CONFIG_XFRM_MIGRATE is not set # CONFIG_XFRM_STATISTICS is not set CONFIG_XFRM_IPCOMP=m # CONFIG_NET_KEY is not set CONFIG_INET=y CONFIG_IP_MULTICAST=y # CONFIG_IP_ADVANCED_ROUTER is not set CONFIG_IP_FIB_HASH=y # CONFIG_IP_PNP is not set CONFIG_NET_IPIP=m CONFIG_NET_IPGRE=m # CONFIG_NET_IPGRE_BROADCAST is not set # CONFIG_IP_MROUTE is not set # CONFIG_ARPD is not set CONFIG_SYN_COOKIES=y CONFIG_INET_AH=m CONFIG_INET_ESP=m CONFIG_INET_IPCOMP=m CONFIG_INET_XFRM_TUNNEL=m CONFIG_INET_TUNNEL=m CONFIG_INET_XFRM_MODE_TRANSPORT=m CONFIG_INET_XFRM_MODE_TUNNEL=m CONFIG_INET_XFRM_MODE_BEET=y # CONFIG_INET_LRO is not set CONFIG_INET_DIAG=m CONFIG_INET_TCP_DIAG=m # CONFIG_TCP_CONG_ADVANCED is not set CONFIG_TCP_CONG_CUBIC=y CONFIG_DEFAULT_TCP_CONG="cubic" # CONFIG_TCP_MD5SIG is not set CONFIG_IPV6=m CONFIG_IPV6_PRIVACY=y # CONFIG_IPV6_ROUTER_PREF is not set # CONFIG_IPV6_OPTIMISTIC_DAD is not set 
CONFIG_INET6_AH=m CONFIG_INET6_ESP=m CONFIG_INET6_IPCOMP=m # CONFIG_IPV6_MIP6 is not set CONFIG_INET6_XFRM_TUNNEL=m CONFIG_INET6_TUNNEL=m CONFIG_INET6_XFRM_MODE_TRANSPORT=m CONFIG_INET6_XFRM_MODE_TUNNEL=m CONFIG_INET6_XFRM_MODE_BEET=m # CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set CONFIG_IPV6_SIT=m # CONFIG_IPV6_SIT_6RD is not set CONFIG_IPV6_NDISC_NODETYPE=y CONFIG_IPV6_TUNNEL=m # CONFIG_IPV6_MULTIPLE_TABLES is not set # CONFIG_IPV6_MROUTE is not set CONFIG_NETWORK_SECMARK=y CONFIG_NETFILTER=y # CONFIG_NETFILTER_DEBUG is not set CONFIG_NETFILTER_ADVANCED=y CONFIG_BRIDGE_NETFILTER=y # # Core Netfilter Configuration # CONFIG_NETFILTER_NETLINK=m CONFIG_NETFILTER_NETLINK_QUEUE=m CONFIG_NETFILTER_NETLINK_LOG=m CONFIG_NF_CONNTRACK=m # CONFIG_NF_CT_ACCT is not set CONFIG_NF_CONNTRACK_MARK=y # CONFIG_NF_CONNTRACK_SECMARK is not set # CONFIG_NF_CONNTRACK_EVENTS is not set CONFIG_NF_CT_PROTO_DCCP=m CONFIG_NF_CT_PROTO_SCTP=m # CONFIG_NF_CT_PROTO_UDPLITE is not set # CONFIG_NF_CONNTRACK_AMANDA is not set # CONFIG_NF_CONNTRACK_FTP is not set # CONFIG_NF_CONNTRACK_H323 is not set # CONFIG_NF_CONNTRACK_IRC is not set # CONFIG_NF_CONNTRACK_NETBIOS_NS is not set # CONFIG_NF_CONNTRACK_PPTP is not set # CONFIG_NF_CONNTRACK_SANE is not set # CONFIG_NF_CONNTRACK_SIP is not set # CONFIG_NF_CONNTRACK_TFTP is not set # CONFIG_NF_CT_NETLINK is not set # CONFIG_NETFILTER_TPROXY is not set CONFIG_NETFILTER_XTABLES=m CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m # CONFIG_NETFILTER_XT_TARGET_CONNMARK is not set # CONFIG_NETFILTER_XT_TARGET_CT is not set CONFIG_NETFILTER_XT_TARGET_DSCP=m CONFIG_NETFILTER_XT_TARGET_HL=m CONFIG_NETFILTER_XT_TARGET_MARK=m CONFIG_NETFILTER_XT_TARGET_NFLOG=m CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m # CONFIG_NETFILTER_XT_TARGET_NOTRACK is not set CONFIG_NETFILTER_XT_TARGET_RATEEST=m CONFIG_NETFILTER_XT_TARGET_TRACE=m CONFIG_NETFILTER_XT_TARGET_SECMARK=m CONFIG_NETFILTER_XT_TARGET_TCPMSS=m CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP=m CONFIG_NETFILTER_XT_MATCH_CLUSTER=m 
CONFIG_NETFILTER_XT_MATCH_COMMENT=m # CONFIG_NETFILTER_XT_MATCH_CONNBYTES is not set # CONFIG_NETFILTER_XT_MATCH_CONNLIMIT is not set # CONFIG_NETFILTER_XT_MATCH_CONNMARK is not set # CONFIG_NETFILTER_XT_MATCH_CONNTRACK is not set CONFIG_NETFILTER_XT_MATCH_DCCP=m CONFIG_NETFILTER_XT_MATCH_DSCP=m CONFIG_NETFILTER_XT_MATCH_ESP=m CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m # CONFIG_NETFILTER_XT_MATCH_HELPER is not set CONFIG_NETFILTER_XT_MATCH_HL=m CONFIG_NETFILTER_XT_MATCH_IPRANGE=m CONFIG_NETFILTER_XT_MATCH_LENGTH=m CONFIG_NETFILTER_XT_MATCH_LIMIT=m CONFIG_NETFILTER_XT_MATCH_MAC=m CONFIG_NETFILTER_XT_MATCH_MARK=m CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m CONFIG_NETFILTER_XT_MATCH_OWNER=m CONFIG_NETFILTER_XT_MATCH_POLICY=m # CONFIG_NETFILTER_XT_MATCH_PHYSDEV is not set CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m CONFIG_NETFILTER_XT_MATCH_QUOTA=m CONFIG_NETFILTER_XT_MATCH_RATEEST=m CONFIG_NETFILTER_XT_MATCH_REALM=m CONFIG_NETFILTER_XT_MATCH_RECENT=m # CONFIG_NETFILTER_XT_MATCH_RECENT_PROC_COMPAT is not set CONFIG_NETFILTER_XT_MATCH_SCTP=m # CONFIG_NETFILTER_XT_MATCH_STATE is not set CONFIG_NETFILTER_XT_MATCH_STATISTIC=m CONFIG_NETFILTER_XT_MATCH_STRING=m CONFIG_NETFILTER_XT_MATCH_TCPMSS=m CONFIG_NETFILTER_XT_MATCH_TIME=m CONFIG_NETFILTER_XT_MATCH_U32=m CONFIG_NETFILTER_XT_MATCH_OSF=m # CONFIG_IP_VS is not set # # IP: Netfilter Configuration # CONFIG_NF_DEFRAG_IPV4=m CONFIG_NF_CONNTRACK_IPV4=m CONFIG_NF_CONNTRACK_PROC_COMPAT=y CONFIG_IP_NF_QUEUE=m CONFIG_IP_NF_IPTABLES=m CONFIG_IP_NF_MATCH_ADDRTYPE=m CONFIG_IP_NF_MATCH_AH=m CONFIG_IP_NF_MATCH_ECN=m CONFIG_IP_NF_MATCH_TTL=m CONFIG_IP_NF_FILTER=m CONFIG_IP_NF_TARGET_REJECT=m CONFIG_IP_NF_TARGET_LOG=m CONFIG_IP_NF_TARGET_ULOG=m CONFIG_NF_NAT=m CONFIG_NF_NAT_NEEDED=y CONFIG_IP_NF_TARGET_MASQUERADE=m CONFIG_IP_NF_TARGET_NETMAP=m CONFIG_IP_NF_TARGET_REDIRECT=m CONFIG_NF_NAT_SNMP_BASIC=m CONFIG_NF_NAT_PROTO_DCCP=m CONFIG_NF_NAT_PROTO_SCTP=m # CONFIG_NF_NAT_FTP is not set # CONFIG_NF_NAT_IRC is not set # CONFIG_NF_NAT_TFTP is not set # 
CONFIG_NF_NAT_AMANDA is not set # CONFIG_NF_NAT_PPTP is not set # CONFIG_NF_NAT_H323 is not set # CONFIG_NF_NAT_SIP is not set CONFIG_IP_NF_MANGLE=m # CONFIG_IP_NF_TARGET_CLUSTERIP is not set CONFIG_IP_NF_TARGET_ECN=m CONFIG_IP_NF_TARGET_TTL=m CONFIG_IP_NF_RAW=m CONFIG_IP_NF_ARPTABLES=m CONFIG_IP_NF_ARPFILTER=m CONFIG_IP_NF_ARP_MANGLE=m # # IPv6: Netfilter Configuration # CONFIG_NF_CONNTRACK_IPV6=m CONFIG_IP6_NF_QUEUE=m CONFIG_IP6_NF_IPTABLES=m CONFIG_IP6_NF_MATCH_AH=m CONFIG_IP6_NF_MATCH_EUI64=m CONFIG_IP6_NF_MATCH_FRAG=m CONFIG_IP6_NF_MATCH_OPTS=m CONFIG_IP6_NF_MATCH_HL=m CONFIG_IP6_NF_MATCH_IPV6HEADER=m CONFIG_IP6_NF_MATCH_MH=m CONFIG_IP6_NF_MATCH_RT=m CONFIG_IP6_NF_TARGET_HL=m CONFIG_IP6_NF_TARGET_LOG=m CONFIG_IP6_NF_FILTER=m CONFIG_IP6_NF_TARGET_REJECT=m CONFIG_IP6_NF_MANGLE=m CONFIG_IP6_NF_RAW=m # CONFIG_BRIDGE_NF_EBTABLES is not set CONFIG_IP_DCCP=m CONFIG_INET_DCCP_DIAG=m # # DCCP CCIDs Configuration (EXPERIMENTAL) # # CONFIG_IP_DCCP_CCID2_DEBUG is not set CONFIG_IP_DCCP_CCID3=y # CONFIG_IP_DCCP_CCID3_DEBUG is not set CONFIG_IP_DCCP_CCID3_RTO=100 CONFIG_IP_DCCP_TFRC_LIB=y # # DCCP Kernel Hacking # # CONFIG_IP_DCCP_DEBUG is not set CONFIG_IP_SCTP=m # CONFIG_SCTP_DBG_MSG is not set # CONFIG_SCTP_DBG_OBJCNT is not set # CONFIG_SCTP_HMAC_NONE is not set # CONFIG_SCTP_HMAC_SHA1 is not set CONFIG_SCTP_HMAC_MD5=y # CONFIG_RDS is not set # CONFIG_TIPC is not set # CONFIG_ATM is not set CONFIG_STP=m CONFIG_BRIDGE=m CONFIG_BRIDGE_IGMP_SNOOPING=y # CONFIG_NET_DSA is not set # CONFIG_VLAN_8021Q is not set # CONFIG_DECNET is not set CONFIG_LLC=m # CONFIG_LLC2 is not set CONFIG_IPX=m # CONFIG_IPX_INTERN is not set # CONFIG_ATALK is not set # CONFIG_X25 is not set # CONFIG_LAPB is not set # CONFIG_ECONET is not set # CONFIG_WAN_ROUTER is not set # CONFIG_PHONET is not set # CONFIG_IEEE802154 is not set CONFIG_NET_SCHED=y # # Queueing/Scheduling # CONFIG_NET_SCH_CBQ=m CONFIG_NET_SCH_HTB=m CONFIG_NET_SCH_HFSC=m CONFIG_NET_SCH_PRIO=m # CONFIG_NET_SCH_MULTIQ is not set 
CONFIG_NET_SCH_RED=m CONFIG_NET_SCH_SFQ=m CONFIG_NET_SCH_TEQL=m CONFIG_NET_SCH_TBF=m CONFIG_NET_SCH_GRED=m CONFIG_NET_SCH_DSMARK=m CONFIG_NET_SCH_NETEM=m # CONFIG_NET_SCH_DRR is not set CONFIG_NET_SCH_INGRESS=m # # Classification # CONFIG_NET_CLS=y CONFIG_NET_CLS_BASIC=m CONFIG_NET_CLS_TCINDEX=m CONFIG_NET_CLS_ROUTE4=m CONFIG_NET_CLS_ROUTE=y CONFIG_NET_CLS_FW=m CONFIG_NET_CLS_U32=m CONFIG_CLS_U32_PERF=y CONFIG_CLS_U32_MARK=y CONFIG_NET_CLS_RSVP=m CONFIG_NET_CLS_RSVP6=m # CONFIG_NET_CLS_FLOW is not set CONFIG_NET_EMATCH=y CONFIG_NET_EMATCH_STACK=32 CONFIG_NET_EMATCH_CMP=m CONFIG_NET_EMATCH_NBYTE=m CONFIG_NET_EMATCH_U32=m CONFIG_NET_EMATCH_META=m CONFIG_NET_EMATCH_TEXT=m CONFIG_NET_CLS_ACT=y CONFIG_NET_ACT_POLICE=m CONFIG_NET_ACT_GACT=m CONFIG_GACT_PROB=y CONFIG_NET_ACT_MIRRED=m CONFIG_NET_ACT_IPT=m # CONFIG_NET_ACT_NAT is not set CONFIG_NET_ACT_PEDIT=m CONFIG_NET_ACT_SIMP=m # CONFIG_NET_ACT_SKBEDIT is not set CONFIG_NET_CLS_IND=y CONFIG_NET_SCH_FIFO=y # CONFIG_DCB is not set # # Network testing # # CONFIG_NET_PKTGEN is not set # CONFIG_NET_DROP_MONITOR is not set # CONFIG_HAMRADIO is not set # CONFIG_CAN is not set # CONFIG_IRDA is not set # CONFIG_BT is not set CONFIG_AF_RXRPC=m # CONFIG_AF_RXRPC_DEBUG is not set # CONFIG_RXKAD is not set # CONFIG_WIRELESS is not set # CONFIG_WIMAX is not set CONFIG_RFKILL=m CONFIG_RFKILL_INPUT=y # CONFIG_NET_9P is not set # # Device Drivers # # # Generic Driver Options # CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug" # CONFIG_DEVTMPFS is not set CONFIG_STANDALONE=y CONFIG_PREVENT_FIRMWARE_BUILD=y CONFIG_FW_LOADER=y CONFIG_FIRMWARE_IN_KERNEL=y CONFIG_EXTRA_FIRMWARE="" # CONFIG_DEBUG_DRIVER is not set # CONFIG_DEBUG_DEVRES is not set # CONFIG_SYS_HYPERVISOR is not set # CONFIG_CONNECTOR is not set # CONFIG_MTD is not set # CONFIG_PARPORT is not set CONFIG_PNP=y # CONFIG_PNP_DEBUG_MESSAGES is not set # # Protocols # CONFIG_PNPACPI=y CONFIG_BLK_DEV=y # CONFIG_BLK_DEV_FD is not set # CONFIG_BLK_CPQ_DA is not set # CONFIG_BLK_CPQ_CISS_DA is 
not set # CONFIG_BLK_DEV_DAC960 is not set # CONFIG_BLK_DEV_UMEM is not set # CONFIG_BLK_DEV_COW_COMMON is not set CONFIG_BLK_DEV_LOOP=y CONFIG_BLK_DEV_CRYPTOLOOP=y # # DRBD disabled because PROC_FS, INET or CONNECTOR not selected # # CONFIG_BLK_DEV_NBD is not set # CONFIG_BLK_DEV_SX8 is not set # CONFIG_BLK_DEV_UB is not set # CONFIG_BLK_DEV_RAM is not set CONFIG_CDROM_PKTCDVD=m CONFIG_CDROM_PKTCDVD_BUFFERS=8 # CONFIG_CDROM_PKTCDVD_WCACHE is not set # CONFIG_ATA_OVER_ETH is not set # CONFIG_BLK_DEV_HD is not set CONFIG_MISC_DEVICES=y # CONFIG_AD525X_DPOT is not set # CONFIG_IBM_ASM is not set # CONFIG_PHANTOM is not set # CONFIG_SGI_IOC4 is not set # CONFIG_TIFM_CORE is not set # CONFIG_ICS932S401 is not set # CONFIG_ENCLOSURE_SERVICES is not set # CONFIG_CS5535_MFGPT is not set # CONFIG_HP_ILO is not set # CONFIG_ISL29003 is not set # CONFIG_SENSORS_TSL2550 is not set # CONFIG_DS1682 is not set # CONFIG_C2PORT is not set # # EEPROM support # # CONFIG_EEPROM_AT24 is not set # CONFIG_EEPROM_LEGACY is not set # CONFIG_EEPROM_MAX6875 is not set # CONFIG_EEPROM_93CX6 is not set # CONFIG_CB710_CORE is not set CONFIG_HAVE_IDE=y CONFIG_IDE=y # # Please see Documentation/ide/ide.txt for help/info on IDE drives # CONFIG_IDE_XFER_MODE=y CONFIG_IDE_ATAPI=y # CONFIG_BLK_DEV_IDE_SATA is not set CONFIG_IDE_GD=y CONFIG_IDE_GD_ATA=y CONFIG_IDE_GD_ATAPI=y CONFIG_BLK_DEV_IDECD=y CONFIG_BLK_DEV_IDECD_VERBOSE_ERRORS=y CONFIG_BLK_DEV_IDETAPE=y CONFIG_BLK_DEV_IDEACPI=y CONFIG_IDE_TASK_IOCTL=y # CONFIG_IDE_PROC_FS is not set # # IDE chipset support/bugfixes # CONFIG_IDE_GENERIC=m # CONFIG_BLK_DEV_PLATFORM is not set # CONFIG_BLK_DEV_CMD640 is not set # CONFIG_BLK_DEV_IDEPNP is not set CONFIG_BLK_DEV_IDEDMA_SFF=y # # PCI IDE chipsets support # CONFIG_BLK_DEV_IDEPCI=y CONFIG_IDEPCI_PCIBUS_ORDER=y # CONFIG_BLK_DEV_OFFBOARD is not set CONFIG_BLK_DEV_GENERIC=m # CONFIG_BLK_DEV_OPTI621 is not set # CONFIG_BLK_DEV_RZ1000 is not set CONFIG_BLK_DEV_IDEDMA_PCI=y # CONFIG_BLK_DEV_AEC62XX is not 
set # CONFIG_BLK_DEV_ALI15X3 is not set # CONFIG_BLK_DEV_AMD74XX is not set CONFIG_BLK_DEV_ATIIXP=y # CONFIG_BLK_DEV_CMD64X is not set # CONFIG_BLK_DEV_TRIFLEX is not set # CONFIG_BLK_DEV_CS5520 is not set # CONFIG_BLK_DEV_CS5530 is not set # CONFIG_BLK_DEV_HPT366 is not set # CONFIG_BLK_DEV_JMICRON is not set # CONFIG_BLK_DEV_SC1200 is not set # CONFIG_BLK_DEV_PIIX is not set # CONFIG_BLK_DEV_IT8172 is not set # CONFIG_BLK_DEV_IT8213 is not set # CONFIG_BLK_DEV_IT821X is not set # CONFIG_BLK_DEV_NS87415 is not set # CONFIG_BLK_DEV_PDC202XX_OLD is not set # CONFIG_BLK_DEV_PDC202XX_NEW is not set # CONFIG_BLK_DEV_SVWKS is not set # CONFIG_BLK_DEV_SIIMAGE is not set # CONFIG_BLK_DEV_SIS5513 is not set # CONFIG_BLK_DEV_SLC90E66 is not set # CONFIG_BLK_DEV_TRM290 is not set # CONFIG_BLK_DEV_VIA82CXXX is not set # CONFIG_BLK_DEV_TC86C001 is not set CONFIG_BLK_DEV_IDEDMA=y # # SCSI device support # CONFIG_SCSI_MOD=y # CONFIG_RAID_ATTRS is not set CONFIG_SCSI=y CONFIG_SCSI_DMA=y # CONFIG_SCSI_TGT is not set # CONFIG_SCSI_NETLINK is not set # CONFIG_SCSI_PROC_FS is not set # # SCSI support type (disk, tape, CD-ROM) # CONFIG_BLK_DEV_SD=y # CONFIG_CHR_DEV_ST is not set # CONFIG_CHR_DEV_OSST is not set # CONFIG_BLK_DEV_SR is not set # CONFIG_CHR_DEV_SG is not set # CONFIG_CHR_DEV_SCH is not set # CONFIG_SCSI_MULTI_LUN is not set # CONFIG_SCSI_CONSTANTS is not set # CONFIG_SCSI_LOGGING is not set # CONFIG_SCSI_SCAN_ASYNC is not set CONFIG_SCSI_WAIT_SCAN=m # # SCSI Transports # # CONFIG_SCSI_SPI_ATTRS is not set # CONFIG_SCSI_FC_ATTRS is not set # CONFIG_SCSI_ISCSI_ATTRS is not set # CONFIG_SCSI_SAS_ATTRS is not set # CONFIG_SCSI_SAS_LIBSAS is not set # CONFIG_SCSI_SRP_ATTRS is not set # CONFIG_SCSI_LOWLEVEL is not set # CONFIG_SCSI_DH is not set # CONFIG_SCSI_OSD_INITIATOR is not set CONFIG_ATA=y # CONFIG_ATA_NONSTANDARD is not set CONFIG_ATA_VERBOSE_ERROR=y CONFIG_ATA_ACPI=y CONFIG_SATA_PMP=y CONFIG_SATA_AHCI=y # CONFIG_SATA_SIL24 is not set # CONFIG_ATA_SFF is not set 
CONFIG_MD=y # CONFIG_BLK_DEV_MD is not set CONFIG_BLK_DEV_DM=m # CONFIG_DM_DEBUG is not set CONFIG_DM_CRYPT=m # CONFIG_DM_SNAPSHOT is not set # CONFIG_DM_MIRROR is not set # CONFIG_DM_ZERO is not set # CONFIG_DM_MULTIPATH is not set # CONFIG_DM_DELAY is not set # CONFIG_DM_UEVENT is not set # CONFIG_FUSION is not set # # IEEE 1394 (FireWire) support # # # You can enable one or both FireWire driver stacks. # # # The newer stack is recommended. # # CONFIG_FIREWIRE is not set CONFIG_IEEE1394=m CONFIG_IEEE1394_OHCI1394=m CONFIG_IEEE1394_PCILYNX=m CONFIG_IEEE1394_SBP2=m # CONFIG_IEEE1394_SBP2_PHYS_DMA is not set CONFIG_IEEE1394_ETH1394_ROM_ENTRY=y CONFIG_IEEE1394_ETH1394=m CONFIG_IEEE1394_RAWIO=m CONFIG_IEEE1394_VIDEO1394=m CONFIG_IEEE1394_DV1394=m # CONFIG_IEEE1394_VERBOSEDEBUG is not set # CONFIG_I2O is not set # CONFIG_MACINTOSH_DRIVERS is not set CONFIG_NETDEVICES=y # CONFIG_IFB is not set # CONFIG_DUMMY is not set # CONFIG_BONDING is not set # CONFIG_MACVLAN is not set # CONFIG_EQUALIZER is not set CONFIG_TUN=m # CONFIG_VETH is not set # CONFIG_NET_SB1000 is not set # CONFIG_ARCNET is not set CONFIG_PHYLIB=m # # MII PHY device drivers # CONFIG_MARVELL_PHY=m CONFIG_DAVICOM_PHY=m CONFIG_QSEMI_PHY=m CONFIG_LXT_PHY=m CONFIG_CICADA_PHY=m CONFIG_VITESSE_PHY=m CONFIG_SMSC_PHY=m # CONFIG_BROADCOM_PHY is not set # CONFIG_ICPLUS_PHY is not set # CONFIG_REALTEK_PHY is not set # CONFIG_NATIONAL_PHY is not set # CONFIG_STE10XP is not set # CONFIG_LSI_ET1011C_PHY is not set # CONFIG_MDIO_BITBANG is not set CONFIG_NET_ETHERNET=y CONFIG_MII=y # CONFIG_HAPPYMEAL is not set # CONFIG_SUNGEM is not set # CONFIG_CASSINI is not set # CONFIG_NET_VENDOR_3COM is not set # CONFIG_ETHOC is not set # CONFIG_DNET is not set # CONFIG_NET_TULIP is not set # CONFIG_HP100 is not set # CONFIG_IBM_NEW_EMAC_ZMII is not set # CONFIG_IBM_NEW_EMAC_RGMII is not set # CONFIG_IBM_NEW_EMAC_TAH is not set # CONFIG_IBM_NEW_EMAC_EMAC4 is not set # CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set # 
CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set # CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set CONFIG_NET_PCI=y # CONFIG_PCNET32 is not set # CONFIG_AMD8111_ETH is not set # CONFIG_ADAPTEC_STARFIRE is not set # CONFIG_KSZ884X_PCI is not set # CONFIG_B44 is not set # CONFIG_FORCEDETH is not set # CONFIG_E100 is not set # CONFIG_FEALNX is not set # CONFIG_NATSEMI is not set # CONFIG_NE2K_PCI is not set # CONFIG_8139CP is not set CONFIG_8139TOO=y # CONFIG_8139TOO_PIO is not set # CONFIG_8139TOO_TUNE_TWISTER is not set # CONFIG_8139TOO_8129 is not set # CONFIG_8139_OLD_RX_RESET is not set # CONFIG_R6040 is not set # CONFIG_SIS900 is not set # CONFIG_EPIC100 is not set # CONFIG_SMSC9420 is not set # CONFIG_SUNDANCE is not set # CONFIG_TLAN is not set # CONFIG_KS8842 is not set # CONFIG_KS8851_MLL is not set # CONFIG_VIA_RHINE is not set # CONFIG_SC92031 is not set # CONFIG_ATL2 is not set # CONFIG_NETDEV_1000 is not set # CONFIG_NETDEV_10000 is not set # CONFIG_TR is not set # CONFIG_WLAN is not set # # Enable WiMAX (Networking options) to see the WiMAX drivers # # # USB Network Adapters # # CONFIG_USB_CATC is not set # CONFIG_USB_KAWETH is not set # CONFIG_USB_PEGASUS is not set # CONFIG_USB_RTL8150 is not set # CONFIG_USB_USBNET is not set # CONFIG_USB_HSO is not set # CONFIG_WAN is not set # CONFIG_FDDI is not set # CONFIG_HIPPI is not set CONFIG_PPP=m CONFIG_PPP_MULTILINK=y CONFIG_PPP_FILTER=y CONFIG_PPP_ASYNC=m CONFIG_PPP_SYNC_TTY=m CONFIG_PPP_DEFLATE=m CONFIG_PPP_BSDCOMP=m CONFIG_PPP_MPPE=m CONFIG_PPPOE=m # CONFIG_PPPOL2TP is not set CONFIG_SLIP=m # CONFIG_SLIP_COMPRESSED is not set CONFIG_SLHC=m # CONFIG_SLIP_SMART is not set # CONFIG_SLIP_MODE_SLIP6 is not set # CONFIG_NET_FC is not set CONFIG_NETCONSOLE=y CONFIG_NETCONSOLE_DYNAMIC=y CONFIG_NETPOLL=y CONFIG_NETPOLL_TRAP=y CONFIG_NET_POLL_CONTROLLER=y # CONFIG_VMXNET3 is not set # CONFIG_ISDN is not set # CONFIG_PHONE is not set # # Input device support # CONFIG_INPUT=y # CONFIG_INPUT_FF_MEMLESS is not set 
CONFIG_INPUT_POLLDEV=m # CONFIG_INPUT_SPARSEKMAP is not set # # Userland interfaces # CONFIG_INPUT_MOUSEDEV=y CONFIG_INPUT_MOUSEDEV_PSAUX=y CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024 CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768 # CONFIG_INPUT_JOYDEV is not set CONFIG_INPUT_EVDEV=y # CONFIG_INPUT_EVBUG is not set # # Input Device Drivers # CONFIG_INPUT_KEYBOARD=y # CONFIG_KEYBOARD_ADP5588 is not set CONFIG_KEYBOARD_ATKBD=y # CONFIG_QT2160 is not set # CONFIG_KEYBOARD_LKKBD is not set # CONFIG_KEYBOARD_MAX7359 is not set # CONFIG_KEYBOARD_NEWTON is not set # CONFIG_KEYBOARD_OPENCORES is not set # CONFIG_KEYBOARD_STOWAWAY is not set # CONFIG_KEYBOARD_SUNKBD is not set # CONFIG_KEYBOARD_XTKBD is not set CONFIG_INPUT_MOUSE=y CONFIG_MOUSE_PS2=m CONFIG_MOUSE_PS2_ALPS=y CONFIG_MOUSE_PS2_LOGIPS2PP=y CONFIG_MOUSE_PS2_SYNAPTICS=y CONFIG_MOUSE_PS2_LIFEBOOK=y CONFIG_MOUSE_PS2_TRACKPOINT=y # CONFIG_MOUSE_PS2_ELANTECH is not set # CONFIG_MOUSE_PS2_SENTELIC is not set # CONFIG_MOUSE_PS2_TOUCHKIT is not set # CONFIG_MOUSE_SERIAL is not set # CONFIG_MOUSE_APPLETOUCH is not set # CONFIG_MOUSE_BCM5974 is not set # CONFIG_MOUSE_VSXXXAA is not set # CONFIG_MOUSE_SYNAPTICS_I2C is not set # CONFIG_INPUT_JOYSTICK is not set # CONFIG_INPUT_TABLET is not set # CONFIG_INPUT_TOUCHSCREEN is not set CONFIG_INPUT_MISC=y CONFIG_INPUT_PCSPKR=m # CONFIG_INPUT_ATLAS_BTNS is not set # CONFIG_INPUT_ATI_REMOTE is not set # CONFIG_INPUT_ATI_REMOTE2 is not set # CONFIG_INPUT_KEYSPAN_REMOTE is not set # CONFIG_INPUT_POWERMATE is not set # CONFIG_INPUT_YEALINK is not set # CONFIG_INPUT_CM109 is not set # CONFIG_INPUT_UINPUT is not set # CONFIG_INPUT_WINBOND_CIR is not set # # Hardware I/O ports # CONFIG_SERIO=y CONFIG_SERIO_I8042=y # CONFIG_SERIO_SERPORT is not set # CONFIG_SERIO_CT82C710 is not set # CONFIG_SERIO_PCIPS2 is not set CONFIG_SERIO_LIBPS2=y # CONFIG_SERIO_RAW is not set # CONFIG_SERIO_ALTERA_PS2 is not set # CONFIG_GAMEPORT is not set # # Character devices # CONFIG_VT=y CONFIG_CONSOLE_TRANSLATIONS=y 
CONFIG_VT_CONSOLE=y CONFIG_HW_CONSOLE=y # CONFIG_VT_HW_CONSOLE_BINDING is not set CONFIG_DEVKMEM=y # CONFIG_SERIAL_NONSTANDARD is not set # CONFIG_NOZOMI is not set # # Serial drivers # CONFIG_SERIAL_8250=m CONFIG_FIX_EARLYCON_MEM=y CONFIG_SERIAL_8250_PCI=m CONFIG_SERIAL_8250_PNP=m CONFIG_SERIAL_8250_NR_UARTS=16 CONFIG_SERIAL_8250_RUNTIME_UARTS=4 CONFIG_SERIAL_8250_EXTENDED=y CONFIG_SERIAL_8250_MANY_PORTS=y CONFIG_SERIAL_8250_SHARE_IRQ=y # CONFIG_SERIAL_8250_DETECT_IRQ is not set # CONFIG_SERIAL_8250_RSA is not set # # Non-8250 serial port support # CONFIG_SERIAL_CORE=m # CONFIG_SERIAL_JSM is not set # CONFIG_SERIAL_TIMBERDALE is not set CONFIG_UNIX98_PTYS=y # CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set # CONFIG_LEGACY_PTYS is not set # CONFIG_IPMI_HANDLER is not set # CONFIG_HW_RANDOM is not set # CONFIG_NVRAM is not set # CONFIG_R3964 is not set # CONFIG_APPLICOM is not set # CONFIG_MWAVE is not set # CONFIG_PC8736x_GPIO is not set # CONFIG_RAW_DRIVER is not set CONFIG_HPET=y CONFIG_HPET_MMAP=y # CONFIG_HANGCHECK_TIMER is not set # CONFIG_TCG_TPM is not set # CONFIG_TELCLOCK is not set CONFIG_DEVPORT=y CONFIG_I2C=y CONFIG_I2C_BOARDINFO=y CONFIG_I2C_COMPAT=y CONFIG_I2C_CHARDEV=m CONFIG_I2C_HELPER_AUTO=y CONFIG_I2C_ALGOBIT=y # # I2C Hardware Bus support # # # PC SMBus host controller drivers # # CONFIG_I2C_ALI1535 is not set # CONFIG_I2C_ALI1563 is not set # CONFIG_I2C_ALI15X3 is not set CONFIG_I2C_AMD756=m # CONFIG_I2C_AMD756_S4882 is not set CONFIG_I2C_AMD8111=m # CONFIG_I2C_I801 is not set # CONFIG_I2C_ISCH is not set # CONFIG_I2C_PIIX4 is not set # CONFIG_I2C_NFORCE2 is not set # CONFIG_I2C_SIS5595 is not set # CONFIG_I2C_SIS630 is not set # CONFIG_I2C_SIS96X is not set # CONFIG_I2C_VIA is not set # CONFIG_I2C_VIAPRO is not set # # ACPI drivers # # CONFIG_I2C_SCMI is not set # # I2C system bus drivers (mostly embedded / system-on-chip) # # CONFIG_I2C_OCORES is not set # CONFIG_I2C_SIMTEC is not set # CONFIG_I2C_XILINX is not set # # External I2C/SMBus adapter 
drivers # # CONFIG_I2C_PARPORT_LIGHT is not set # CONFIG_I2C_TAOS_EVM is not set # CONFIG_I2C_TINY_USB is not set # # Other I2C/SMBus bus drivers # # CONFIG_I2C_PCA_PLATFORM is not set # CONFIG_I2C_STUB is not set # CONFIG_I2C_DEBUG_CORE is not set # CONFIG_I2C_DEBUG_ALGO is not set # CONFIG_I2C_DEBUG_BUS is not set # CONFIG_SPI is not set # # PPS support # # CONFIG_PPS is not set CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y # CONFIG_GPIOLIB is not set # CONFIG_W1 is not set CONFIG_POWER_SUPPLY=y # CONFIG_POWER_SUPPLY_DEBUG is not set # CONFIG_PDA_POWER is not set # CONFIG_BATTERY_DS2760 is not set # CONFIG_BATTERY_DS2782 is not set # CONFIG_BATTERY_BQ27x00 is not set # CONFIG_BATTERY_MAX17040 is not set CONFIG_HWMON=y CONFIG_HWMON_VID=m # CONFIG_HWMON_DEBUG_CHIP is not set # # Native drivers # # CONFIG_SENSORS_ABITUGURU is not set # CONFIG_SENSORS_ABITUGURU3 is not set # CONFIG_SENSORS_AD7414 is not set # CONFIG_SENSORS_AD7418 is not set # CONFIG_SENSORS_ADM1021 is not set # CONFIG_SENSORS_ADM1025 is not set # CONFIG_SENSORS_ADM1026 is not set # CONFIG_SENSORS_ADM1029 is not set # CONFIG_SENSORS_ADM1031 is not set # CONFIG_SENSORS_ADM9240 is not set # CONFIG_SENSORS_ADT7411 is not set # CONFIG_SENSORS_ADT7462 is not set # CONFIG_SENSORS_ADT7470 is not set # CONFIG_SENSORS_ADT7475 is not set # CONFIG_SENSORS_ASC7621 is not set CONFIG_SENSORS_K8TEMP=m CONFIG_SENSORS_K10TEMP=m CONFIG_SENSORS_ASB100=m # CONFIG_SENSORS_ATXP1 is not set # CONFIG_SENSORS_DS1621 is not set # CONFIG_SENSORS_I5K_AMB is not set # CONFIG_SENSORS_F71805F is not set # CONFIG_SENSORS_F71882FG is not set # CONFIG_SENSORS_F75375S is not set # CONFIG_SENSORS_FSCHMD is not set # CONFIG_SENSORS_G760A is not set # CONFIG_SENSORS_GL518SM is not set # CONFIG_SENSORS_GL520SM is not set # CONFIG_SENSORS_CORETEMP is not set # CONFIG_SENSORS_IT87 is not set # CONFIG_SENSORS_LM63 is not set # CONFIG_SENSORS_LM73 is not set # CONFIG_SENSORS_LM75 is not set # CONFIG_SENSORS_LM77 is not set # CONFIG_SENSORS_LM78 is not 
set # CONFIG_SENSORS_LM80 is not set # CONFIG_SENSORS_LM83 is not set # CONFIG_SENSORS_LM85 is not set # CONFIG_SENSORS_LM87 is not set # CONFIG_SENSORS_LM90 is not set # CONFIG_SENSORS_LM92 is not set # CONFIG_SENSORS_LM93 is not set # CONFIG_SENSORS_LTC4215 is not set # CONFIG_SENSORS_LTC4245 is not set # CONFIG_SENSORS_LM95241 is not set # CONFIG_SENSORS_MAX1619 is not set # CONFIG_SENSORS_MAX6650 is not set # CONFIG_SENSORS_PC87360 is not set # CONFIG_SENSORS_PC87427 is not set # CONFIG_SENSORS_PCF8591 is not set # CONFIG_SENSORS_SIS5595 is not set # CONFIG_SENSORS_DME1737 is not set # CONFIG_SENSORS_SMSC47M1 is not set # CONFIG_SENSORS_SMSC47M192 is not set # CONFIG_SENSORS_SMSC47B397 is not set # CONFIG_SENSORS_ADS7828 is not set # CONFIG_SENSORS_AMC6821 is not set # CONFIG_SENSORS_THMC50 is not set # CONFIG_SENSORS_TMP401 is not set # CONFIG_SENSORS_TMP421 is not set # CONFIG_SENSORS_VIA_CPUTEMP is not set # CONFIG_SENSORS_VIA686A is not set # CONFIG_SENSORS_VT1211 is not set # CONFIG_SENSORS_VT8231 is not set # CONFIG_SENSORS_W83781D is not set # CONFIG_SENSORS_W83791D is not set # CONFIG_SENSORS_W83792D is not set # CONFIG_SENSORS_W83793 is not set # CONFIG_SENSORS_W83L785TS is not set # CONFIG_SENSORS_W83L786NG is not set # CONFIG_SENSORS_W83627HF is not set # CONFIG_SENSORS_W83627EHF is not set # CONFIG_SENSORS_HDAPS is not set # CONFIG_SENSORS_LIS3_I2C is not set # CONFIG_SENSORS_APPLESMC is not set # # ACPI drivers # # CONFIG_SENSORS_ATK0110 is not set # CONFIG_SENSORS_LIS3LV02D is not set CONFIG_THERMAL=y # CONFIG_THERMAL_HWMON is not set # CONFIG_WATCHDOG is not set CONFIG_SSB_POSSIBLE=y # # Sonics Silicon Backplane # # CONFIG_SSB is not set # # Multifunction device drivers # # CONFIG_MFD_CORE is not set # CONFIG_MFD_88PM860X is not set # CONFIG_MFD_SM501 is not set # CONFIG_HTC_PASIC3 is not set # CONFIG_TWL4030_CORE is not set # CONFIG_MFD_TMIO is not set # CONFIG_PMIC_DA903X is not set # CONFIG_PMIC_ADP5520 is not set # CONFIG_MFD_MAX8925 is not 
set # CONFIG_MFD_WM8400 is not set # CONFIG_MFD_WM831X is not set # CONFIG_MFD_WM8350_I2C is not set # CONFIG_MFD_WM8994 is not set # CONFIG_MFD_PCF50633 is not set # CONFIG_AB3100_CORE is not set # CONFIG_LPC_SCH is not set # CONFIG_REGULATOR is not set # CONFIG_MEDIA_SUPPORT is not set # # Graphics support # CONFIG_AGP=y CONFIG_AGP_AMD64=y # CONFIG_AGP_INTEL is not set # CONFIG_AGP_SIS is not set # CONFIG_AGP_VIA is not set CONFIG_VGA_ARB=y CONFIG_VGA_ARB_MAX_GPUS=16 # CONFIG_VGA_SWITCHEROO is not set CONFIG_DRM=y CONFIG_DRM_KMS_HELPER=y CONFIG_DRM_TTM=y # CONFIG_DRM_TDFX is not set # CONFIG_DRM_R128 is not set CONFIG_DRM_RADEON=y # CONFIG_DRM_RADEON_KMS is not set # CONFIG_DRM_MGA is not set # CONFIG_DRM_SIS is not set # CONFIG_DRM_VIA is not set # CONFIG_DRM_SAVAGE is not set # CONFIG_VGASTATE is not set # CONFIG_VIDEO_OUTPUT_CONTROL is not set CONFIG_FB=y # CONFIG_FIRMWARE_EDID is not set # CONFIG_FB_DDC is not set # CONFIG_FB_BOOT_VESA_SUPPORT is not set CONFIG_FB_CFB_FILLRECT=y CONFIG_FB_CFB_COPYAREA=y CONFIG_FB_CFB_IMAGEBLIT=y # CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set # CONFIG_FB_SYS_FILLRECT is not set # CONFIG_FB_SYS_COPYAREA is not set # CONFIG_FB_SYS_IMAGEBLIT is not set # CONFIG_FB_FOREIGN_ENDIAN is not set # CONFIG_FB_SYS_FOPS is not set # CONFIG_FB_SVGALIB is not set # CONFIG_FB_MACMODES is not set # CONFIG_FB_BACKLIGHT is not set CONFIG_FB_MODE_HELPERS=y # CONFIG_FB_TILEBLITTING is not set # # Frame buffer hardware drivers # # CONFIG_FB_CIRRUS is not set # CONFIG_FB_PM2 is not set # CONFIG_FB_CYBER2000 is not set # CONFIG_FB_ARC is not set # CONFIG_FB_ASILIANT is not set # CONFIG_FB_IMSTT is not set # CONFIG_FB_VGA16 is not set # CONFIG_FB_VESA is not set # CONFIG_FB_N411 is not set # CONFIG_FB_HGA is not set # CONFIG_FB_S1D13XXX is not set # CONFIG_FB_NVIDIA is not set # CONFIG_FB_RIVA is not set # CONFIG_FB_LE80578 is not set # CONFIG_FB_MATROX is not set # CONFIG_FB_RADEON is not set # CONFIG_FB_ATY128 is not set # CONFIG_FB_ATY is not set # 
CONFIG_FB_S3 is not set # CONFIG_FB_SAVAGE is not set # CONFIG_FB_SIS is not set # CONFIG_FB_VIA is not set # CONFIG_FB_NEOMAGIC is not set # CONFIG_FB_KYRO is not set # CONFIG_FB_3DFX is not set # CONFIG_FB_VOODOO1 is not set # CONFIG_FB_VT8623 is not set # CONFIG_FB_TRIDENT is not set # CONFIG_FB_ARK is not set # CONFIG_FB_PM3 is not set # CONFIG_FB_CARMINE is not set # CONFIG_FB_GEODE is not set # CONFIG_FB_VIRTUAL is not set # CONFIG_FB_METRONOME is not set # CONFIG_FB_MB862XX is not set # CONFIG_FB_BROADSHEET is not set CONFIG_BACKLIGHT_LCD_SUPPORT=y # CONFIG_LCD_CLASS_DEVICE is not set CONFIG_BACKLIGHT_CLASS_DEVICE=y # CONFIG_BACKLIGHT_GENERIC is not set # CONFIG_BACKLIGHT_PROGEAR is not set # CONFIG_BACKLIGHT_MBP_NVIDIA is not set # CONFIG_BACKLIGHT_SAHARA is not set # # Display device support # # CONFIG_DISPLAY_SUPPORT is not set # # Console display driver support # CONFIG_VGA_CONSOLE=y CONFIG_VGACON_SOFT_SCROLLBACK=y CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64 CONFIG_DUMMY_CONSOLE=y CONFIG_FRAMEBUFFER_CONSOLE=y # CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY is not set # CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set # CONFIG_FONTS is not set CONFIG_FONT_8x8=y CONFIG_FONT_8x16=y CONFIG_LOGO=y CONFIG_LOGO_LINUX_MONO=y CONFIG_LOGO_LINUX_VGA16=y CONFIG_LOGO_LINUX_CLUT224=y CONFIG_SOUND=y CONFIG_SOUND_OSS_CORE=y # CONFIG_SOUND_OSS_CORE_PRECLAIM is not set CONFIG_SND=y CONFIG_SND_TIMER=y CONFIG_SND_PCM=y CONFIG_SND_SEQUENCER=m CONFIG_SND_SEQ_DUMMY=m CONFIG_SND_OSSEMUL=y CONFIG_SND_MIXER_OSS=m CONFIG_SND_PCM_OSS=m CONFIG_SND_PCM_OSS_PLUGINS=y CONFIG_SND_SEQUENCER_OSS=y CONFIG_SND_HRTIMER=m CONFIG_SND_SEQ_HRTIMER_DEFAULT=y # CONFIG_SND_DYNAMIC_MINORS is not set # CONFIG_SND_SUPPORT_OLD_API is not set # CONFIG_SND_VERBOSE_PROCFS is not set # CONFIG_SND_VERBOSE_PRINTK is not set # CONFIG_SND_DEBUG is not set CONFIG_SND_VMASTER=y CONFIG_SND_DMA_SGBUF=y # CONFIG_SND_RAWMIDI_SEQ is not set # CONFIG_SND_OPL3_LIB_SEQ is not set # CONFIG_SND_OPL4_LIB_SEQ is not set # 
CONFIG_SND_SBAWE_SEQ is not set # CONFIG_SND_EMU10K1_SEQ is not set CONFIG_SND_AC97_CODEC=m # CONFIG_SND_DRIVERS is not set CONFIG_SND_PCI=y # CONFIG_SND_AD1889 is not set # CONFIG_SND_ALS300 is not set # CONFIG_SND_ALS4000 is not set # CONFIG_SND_ALI5451 is not set CONFIG_SND_ATIIXP=m # CONFIG_SND_ATIIXP_MODEM is not set # CONFIG_SND_AU8810 is not set # CONFIG_SND_AU8820 is not set # CONFIG_SND_AU8830 is not set # CONFIG_SND_AW2 is not set # CONFIG_SND_AZT3328 is not set # CONFIG_SND_BT87X is not set # CONFIG_SND_CA0106 is not set # CONFIG_SND_CMIPCI is not set # CONFIG_SND_OXYGEN is not set # CONFIG_SND_CS4281 is not set # CONFIG_SND_CS46XX is not set # CONFIG_SND_CS5530 is not set # CONFIG_SND_CS5535AUDIO is not set # CONFIG_SND_CTXFI is not set # CONFIG_SND_DARLA20 is not set # CONFIG_SND_GINA20 is not set # CONFIG_SND_LAYLA20 is not set # CONFIG_SND_DARLA24 is not set # CONFIG_SND_GINA24 is not set # CONFIG_SND_LAYLA24 is not set # CONFIG_SND_MONA is not set # CONFIG_SND_MIA is not set # CONFIG_SND_ECHO3G is not set # CONFIG_SND_INDIGO is not set # CONFIG_SND_INDIGOIO is not set # CONFIG_SND_INDIGODJ is not set # CONFIG_SND_INDIGOIOX is not set # CONFIG_SND_INDIGODJX is not set # CONFIG_SND_EMU10K1 is not set # CONFIG_SND_EMU10K1X is not set # CONFIG_SND_ENS1370 is not set # CONFIG_SND_ENS1371 is not set # CONFIG_SND_ES1938 is not set # CONFIG_SND_ES1968 is not set # CONFIG_SND_FM801 is not set CONFIG_SND_HDA_INTEL=y # CONFIG_SND_HDA_HWDEP is not set # CONFIG_SND_HDA_INPUT_BEEP is not set # CONFIG_SND_HDA_INPUT_JACK is not set # CONFIG_SND_HDA_PATCH_LOADER is not set CONFIG_SND_HDA_CODEC_REALTEK=y CONFIG_SND_HDA_CODEC_ANALOG=y CONFIG_SND_HDA_CODEC_SIGMATEL=y CONFIG_SND_HDA_CODEC_VIA=y CONFIG_SND_HDA_CODEC_ATIHDMI=y CONFIG_SND_HDA_CODEC_NVHDMI=y CONFIG_SND_HDA_CODEC_INTELHDMI=y CONFIG_SND_HDA_ELD=y CONFIG_SND_HDA_CODEC_CIRRUS=y CONFIG_SND_HDA_CODEC_CONEXANT=y CONFIG_SND_HDA_CODEC_CA0110=y CONFIG_SND_HDA_CODEC_CMEDIA=y CONFIG_SND_HDA_CODEC_SI3054=y 
CONFIG_SND_HDA_GENERIC=y CONFIG_SND_HDA_POWER_SAVE=y CONFIG_SND_HDA_POWER_SAVE_DEFAULT=0 # CONFIG_SND_HDSP is not set # CONFIG_SND_HDSPM is not set # CONFIG_SND_HIFIER is not set # CONFIG_SND_ICE1712 is not set # CONFIG_SND_ICE1724 is not set CONFIG_SND_INTEL8X0=m CONFIG_SND_INTEL8X0M=m # CONFIG_SND_KORG1212 is not set # CONFIG_SND_LX6464ES is not set # CONFIG_SND_MAESTRO3 is not set # CONFIG_SND_MIXART is not set # CONFIG_SND_NM256 is not set # CONFIG_SND_PCXHR is not set # CONFIG_SND_RIPTIDE is not set # CONFIG_SND_RME32 is not set # CONFIG_SND_RME96 is not set # CONFIG_SND_RME9652 is not set # CONFIG_SND_SONICVIBES is not set # CONFIG_SND_TRIDENT is not set # CONFIG_SND_VIA82XX is not set # CONFIG_SND_VIA82XX_MODEM is not set # CONFIG_SND_VIRTUOSO is not set # CONFIG_SND_VX222 is not set # CONFIG_SND_YMFPCI is not set # CONFIG_SND_USB is not set # CONFIG_SND_SOC is not set # CONFIG_SOUND_PRIME is not set CONFIG_AC97_BUS=m CONFIG_HID_SUPPORT=y CONFIG_HID=y # CONFIG_HIDRAW is not set # # USB Input Devices # CONFIG_USB_HID=y # CONFIG_HID_PID is not set # CONFIG_USB_HIDDEV is not set # # Special HID drivers # # CONFIG_HID_3M_PCT is not set CONFIG_HID_A4TECH=y CONFIG_HID_APPLE=y CONFIG_HID_BELKIN=y CONFIG_HID_CHERRY=y CONFIG_HID_CHICONY=y CONFIG_HID_CYPRESS=y CONFIG_HID_DRAGONRISE=y # CONFIG_DRAGONRISE_FF is not set CONFIG_HID_EZKEY=y CONFIG_HID_KYE=y CONFIG_HID_GYRATION=y CONFIG_HID_TWINHAN=y CONFIG_HID_KENSINGTON=y CONFIG_HID_LOGITECH=y # CONFIG_LOGITECH_FF is not set # CONFIG_LOGIRUMBLEPAD2_FF is not set # CONFIG_LOGIG940_FF is not set CONFIG_HID_MICROSOFT=y # CONFIG_HID_MOSART is not set CONFIG_HID_MONTEREY=y CONFIG_HID_NTRIG=y CONFIG_HID_ORTEK=y CONFIG_HID_PANTHERLORD=y # CONFIG_PANTHERLORD_FF is not set CONFIG_HID_PETALYNX=y # CONFIG_HID_QUANTA is not set CONFIG_HID_SAMSUNG=y CONFIG_HID_SONY=y # CONFIG_HID_STANTUM is not set CONFIG_HID_SUNPLUS=y CONFIG_HID_GREENASIA=y # CONFIG_GREENASIA_FF is not set CONFIG_HID_SMARTJOYPLUS=y # CONFIG_SMARTJOYPLUS_FF is not set 
CONFIG_HID_TOPSEED=y CONFIG_HID_THRUSTMASTER=y # CONFIG_THRUSTMASTER_FF is not set CONFIG_HID_ZEROPLUS=y # CONFIG_ZEROPLUS_FF is not set CONFIG_USB_SUPPORT=y CONFIG_USB_ARCH_HAS_HCD=y CONFIG_USB_ARCH_HAS_OHCI=y CONFIG_USB_ARCH_HAS_EHCI=y CONFIG_USB=y # CONFIG_USB_DEBUG is not set CONFIG_USB_ANNOUNCE_NEW_DEVICES=y # # Miscellaneous USB options # CONFIG_USB_DEVICEFS=y # CONFIG_USB_DEVICE_CLASS is not set # CONFIG_USB_DYNAMIC_MINORS is not set # CONFIG_USB_OTG is not set # CONFIG_USB_MON is not set # CONFIG_USB_WUSB is not set # CONFIG_USB_WUSB_CBAF is not set # # USB Host Controller Drivers # # CONFIG_USB_C67X00_HCD is not set # CONFIG_USB_XHCI_HCD is not set CONFIG_USB_EHCI_HCD=y CONFIG_USB_EHCI_ROOT_HUB_TT=y # CONFIG_USB_EHCI_TT_NEWSCHED is not set # CONFIG_USB_OXU210HP_HCD is not set # CONFIG_USB_ISP116X_HCD is not set # CONFIG_USB_ISP1760_HCD is not set # CONFIG_USB_ISP1362_HCD is not set CONFIG_USB_OHCI_HCD=m # CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set # CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set CONFIG_USB_OHCI_LITTLE_ENDIAN=y # CONFIG_USB_UHCI_HCD is not set # CONFIG_USB_SL811_HCD is not set # CONFIG_USB_R8A66597_HCD is not set # CONFIG_USB_WHCI_HCD is not set # CONFIG_USB_HWA_HCD is not set # # USB Device Class drivers # # CONFIG_USB_ACM is not set CONFIG_USB_PRINTER=m # CONFIG_USB_WDM is not set # CONFIG_USB_TMC is not set # # NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may # # # also be needed; see USB_STORAGE Help for more info # CONFIG_USB_STORAGE=y # CONFIG_USB_STORAGE_DEBUG is not set # CONFIG_USB_STORAGE_DATAFAB is not set # CONFIG_USB_STORAGE_FREECOM is not set # CONFIG_USB_STORAGE_ISD200 is not set # CONFIG_USB_STORAGE_USBAT is not set # CONFIG_USB_STORAGE_SDDR09 is not set # CONFIG_USB_STORAGE_SDDR55 is not set # CONFIG_USB_STORAGE_JUMPSHOT is not set # CONFIG_USB_STORAGE_ALAUDA is not set # CONFIG_USB_STORAGE_ONETOUCH is not set # CONFIG_USB_STORAGE_KARMA is not set # CONFIG_USB_STORAGE_CYPRESS_ATACB is not set # CONFIG_USB_LIBUSUAL is not 
set # # USB Imaging devices # # CONFIG_USB_MDC800 is not set # CONFIG_USB_MICROTEK is not set # # USB port drivers # # CONFIG_USB_SERIAL is not set # # USB Miscellaneous drivers # # CONFIG_USB_EMI62 is not set # CONFIG_USB_EMI26 is not set # CONFIG_USB_ADUTUX is not set # CONFIG_USB_SEVSEG is not set # CONFIG_USB_RIO500 is not set # CONFIG_USB_LEGOTOWER is not set # CONFIG_USB_LCD is not set # CONFIG_USB_LED is not set # CONFIG_USB_CYPRESS_CY7C63 is not set # CONFIG_USB_CYTHERM is not set # CONFIG_USB_IDMOUSE is not set # CONFIG_USB_FTDI_ELAN is not set # CONFIG_USB_APPLEDISPLAY is not set # CONFIG_USB_SISUSBVGA is not set # CONFIG_USB_LD is not set # CONFIG_USB_TRANCEVIBRATOR is not set # CONFIG_USB_IOWARRIOR is not set # CONFIG_USB_TEST is not set # CONFIG_USB_ISIGHTFW is not set # CONFIG_USB_GADGET is not set # # OTG and related infrastructure # # CONFIG_NOP_USB_XCEIV is not set # CONFIG_UWB is not set # CONFIG_MMC is not set # CONFIG_MEMSTICK is not set # CONFIG_NEW_LEDS is not set # CONFIG_ACCESSIBILITY is not set # CONFIG_INFINIBAND is not set CONFIG_EDAC=y # # Reporting subsystems # # CONFIG_EDAC_DEBUG is not set CONFIG_EDAC_DECODE_MCE=y CONFIG_EDAC_MM_EDAC=m CONFIG_EDAC_AMD64=m CONFIG_EDAC_AMD64_ERROR_INJECTION=y # CONFIG_EDAC_E752X is not set # CONFIG_EDAC_I82975X is not set # CONFIG_EDAC_I3000 is not set # CONFIG_EDAC_I3200 is not set # CONFIG_EDAC_X38 is not set # CONFIG_EDAC_I5400 is not set # CONFIG_EDAC_I5000 is not set # CONFIG_EDAC_I5100 is not set CONFIG_RTC_LIB=y CONFIG_RTC_CLASS=y CONFIG_RTC_HCTOSYS=y CONFIG_RTC_HCTOSYS_DEVICE="rtc0" # CONFIG_RTC_DEBUG is not set # # RTC interfaces # CONFIG_RTC_INTF_SYSFS=y CONFIG_RTC_INTF_PROC=y CONFIG_RTC_INTF_DEV=y # CONFIG_RTC_INTF_DEV_UIE_EMUL is not set # CONFIG_RTC_DRV_TEST is not set # # I2C RTC drivers # # CONFIG_RTC_DRV_DS1307 is not set # CONFIG_RTC_DRV_DS1374 is not set # CONFIG_RTC_DRV_DS1672 is not set # CONFIG_RTC_DRV_MAX6900 is not set # CONFIG_RTC_DRV_RS5C372 is not set # CONFIG_RTC_DRV_ISL1208 
is not set # CONFIG_RTC_DRV_X1205 is not set # CONFIG_RTC_DRV_PCF8563 is not set # CONFIG_RTC_DRV_PCF8583 is not set # CONFIG_RTC_DRV_M41T80 is not set # CONFIG_RTC_DRV_BQ32K is not set # CONFIG_RTC_DRV_S35390A is not set # CONFIG_RTC_DRV_FM3130 is not set # CONFIG_RTC_DRV_RX8581 is not set # CONFIG_RTC_DRV_RX8025 is not set # # SPI RTC drivers # # # Platform RTC drivers # CONFIG_RTC_DRV_CMOS=y # CONFIG_RTC_DRV_DS1286 is not set # CONFIG_RTC_DRV_DS1511 is not set # CONFIG_RTC_DRV_DS1553 is not set # CONFIG_RTC_DRV_DS1742 is not set # CONFIG_RTC_DRV_STK17TA8 is not set # CONFIG_RTC_DRV_M48T86 is not set # CONFIG_RTC_DRV_M48T35 is not set # CONFIG_RTC_DRV_M48T59 is not set # CONFIG_RTC_DRV_MSM6242 is not set # CONFIG_RTC_DRV_BQ4802 is not set # CONFIG_RTC_DRV_RP5C01 is not set # CONFIG_RTC_DRV_V3020 is not set # # on-CPU RTC drivers # # CONFIG_DMADEVICES is not set # CONFIG_AUXDISPLAY is not set # CONFIG_UIO is not set # # TI VLYNQ # # CONFIG_STAGING is not set # CONFIG_X86_PLATFORM_DEVICES is not set # # Firmware Drivers # # CONFIG_EDD is not set CONFIG_FIRMWARE_MEMMAP=y # CONFIG_DELL_RBU is not set # CONFIG_DCDBAS is not set # CONFIG_DMIID is not set # CONFIG_ISCSI_IBFT_FIND is not set # # File systems # CONFIG_EXT2_FS=y CONFIG_EXT2_FS_XATTR=y CONFIG_EXT2_FS_POSIX_ACL=y CONFIG_EXT2_FS_SECURITY=y # CONFIG_EXT2_FS_XIP is not set CONFIG_EXT3_FS=y # CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set CONFIG_EXT3_FS_XATTR=y CONFIG_EXT3_FS_POSIX_ACL=y CONFIG_EXT3_FS_SECURITY=y # CONFIG_EXT4_FS is not set CONFIG_JBD=y # CONFIG_JBD_DEBUG is not set CONFIG_FS_MBCACHE=y # CONFIG_REISERFS_FS is not set # CONFIG_JFS_FS is not set CONFIG_FS_POSIX_ACL=y # CONFIG_XFS_FS is not set # CONFIG_GFS2_FS is not set # CONFIG_OCFS2_FS is not set # CONFIG_BTRFS_FS is not set # CONFIG_NILFS2_FS is not set CONFIG_FILE_LOCKING=y CONFIG_FSNOTIFY=y # CONFIG_DNOTIFY is not set CONFIG_INOTIFY=y CONFIG_INOTIFY_USER=y # CONFIG_QUOTA is not set # CONFIG_AUTOFS_FS is not set # CONFIG_AUTOFS4_FS is not set 
CONFIG_FUSE_FS=m # CONFIG_CUSE is not set # # Caches # # CONFIG_FSCACHE is not set # # CD-ROM/DVD Filesystems # CONFIG_ISO9660_FS=m CONFIG_JOLIET=y CONFIG_ZISOFS=y CONFIG_UDF_FS=m CONFIG_UDF_NLS=y # # DOS/FAT/NT Filesystems # CONFIG_FAT_FS=m CONFIG_MSDOS_FS=m CONFIG_VFAT_FS=m CONFIG_FAT_DEFAULT_CODEPAGE=437 CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-15" CONFIG_NTFS_FS=m # CONFIG_NTFS_DEBUG is not set CONFIG_NTFS_RW=y # # Pseudo filesystems # CONFIG_PROC_FS=y CONFIG_PROC_KCORE=y CONFIG_PROC_SYSCTL=y CONFIG_PROC_PAGE_MONITOR=y CONFIG_SYSFS=y CONFIG_TMPFS=y # CONFIG_TMPFS_POSIX_ACL is not set CONFIG_HUGETLBFS=y CONFIG_HUGETLB_PAGE=y CONFIG_CONFIGFS_FS=y # CONFIG_MISC_FILESYSTEMS is not set CONFIG_NETWORK_FILESYSTEMS=y CONFIG_NFS_FS=m CONFIG_NFS_V3=y CONFIG_NFS_V3_ACL=y CONFIG_NFS_V4=y # CONFIG_NFS_V4_1 is not set CONFIG_NFSD=m CONFIG_NFSD_V2_ACL=y CONFIG_NFSD_V3=y CONFIG_NFSD_V3_ACL=y CONFIG_NFSD_V4=y CONFIG_LOCKD=m CONFIG_LOCKD_V4=y CONFIG_EXPORTFS=m CONFIG_NFS_ACL_SUPPORT=m CONFIG_NFS_COMMON=y CONFIG_SUNRPC=m CONFIG_SUNRPC_GSS=m CONFIG_RPCSEC_GSS_KRB5=m # CONFIG_RPCSEC_GSS_SPKM3 is not set # CONFIG_SMB_FS is not set # CONFIG_CEPH_FS is not set CONFIG_CIFS=m # CONFIG_CIFS_STATS is not set # CONFIG_CIFS_WEAK_PW_HASH is not set # CONFIG_CIFS_UPCALL is not set CONFIG_CIFS_XATTR=y CONFIG_CIFS_POSIX=y # CONFIG_CIFS_DEBUG2 is not set # CONFIG_CIFS_DFS_UPCALL is not set # CONFIG_CIFS_EXPERIMENTAL is not set # CONFIG_NCP_FS is not set # CONFIG_CODA_FS is not set # CONFIG_AFS_FS is not set # # Partition Types # CONFIG_PARTITION_ADVANCED=y # CONFIG_ACORN_PARTITION is not set # CONFIG_OSF_PARTITION is not set # CONFIG_AMIGA_PARTITION is not set # CONFIG_ATARI_PARTITION is not set CONFIG_MAC_PARTITION=y CONFIG_MSDOS_PARTITION=y CONFIG_BSD_DISKLABEL=y # CONFIG_MINIX_SUBPARTITION is not set CONFIG_SOLARIS_X86_PARTITION=y # CONFIG_UNIXWARE_DISKLABEL is not set CONFIG_LDM_PARTITION=y # CONFIG_LDM_DEBUG is not set # CONFIG_SGI_PARTITION is not set # CONFIG_ULTRIX_PARTITION is not set 
CONFIG_SUN_PARTITION=y
# CONFIG_KARMA_PARTITION is not set
# CONFIG_EFI_PARTITION is not set
# CONFIG_SYSV68_PARTITION is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-15"
CONFIG_NLS_CODEPAGE_437=m
CONFIG_NLS_CODEPAGE_737=m
CONFIG_NLS_CODEPAGE_775=m
CONFIG_NLS_CODEPAGE_850=m
CONFIG_NLS_CODEPAGE_852=m
CONFIG_NLS_CODEPAGE_855=m
CONFIG_NLS_CODEPAGE_857=m
CONFIG_NLS_CODEPAGE_860=m
CONFIG_NLS_CODEPAGE_861=m
CONFIG_NLS_CODEPAGE_862=m
CONFIG_NLS_CODEPAGE_863=m
CONFIG_NLS_CODEPAGE_864=m
CONFIG_NLS_CODEPAGE_865=m
CONFIG_NLS_CODEPAGE_866=m
CONFIG_NLS_CODEPAGE_869=m
CONFIG_NLS_CODEPAGE_936=m
CONFIG_NLS_CODEPAGE_950=m
CONFIG_NLS_CODEPAGE_932=m
CONFIG_NLS_CODEPAGE_949=m
CONFIG_NLS_CODEPAGE_874=m
CONFIG_NLS_ISO8859_8=m
CONFIG_NLS_CODEPAGE_1250=m
CONFIG_NLS_CODEPAGE_1251=m
CONFIG_NLS_ASCII=m
CONFIG_NLS_ISO8859_1=m
CONFIG_NLS_ISO8859_2=m
CONFIG_NLS_ISO8859_3=m
CONFIG_NLS_ISO8859_4=m
CONFIG_NLS_ISO8859_5=m
CONFIG_NLS_ISO8859_6=m
CONFIG_NLS_ISO8859_7=m
CONFIG_NLS_ISO8859_9=m
CONFIG_NLS_ISO8859_13=m
CONFIG_NLS_ISO8859_14=m
CONFIG_NLS_ISO8859_15=m
CONFIG_NLS_KOI8_R=m
CONFIG_NLS_KOI8_U=m
CONFIG_NLS_UTF8=m
# CONFIG_DLM is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
# CONFIG_STRIP_ASM_SYMS is not set
CONFIG_UNUSED_SYMBOLS=y
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
CONFIG_DETECT_SOFTLOCKUP=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_DETECT_HUNG_TASK=y
# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
CONFIG_SCHED_DEBUG=y
# CONFIG_SCHEDSTATS is not set
CONFIG_TIMER_STATS=y
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
# CONFIG_DEBUG_KMEMLEAK is not set
CONFIG_DEBUG_PREEMPT=y
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
# CONFIG_PROVE_RCU is not set
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_TRACE_IRQFLAGS=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_CREDENTIALS is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_CPU_STALL_DETECTOR is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
# CONFIG_LKDTM is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_LATENCYTOP is not set
# CONFIG_SYSCTL_SYSCALL_CHECK is not set
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FTRACE_NMI_ENTER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_RING_BUFFER=y
CONFIG_FTRACE_NMI_ENTER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_RING_BUFFER_ALLOW_SWAP=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_PREEMPT_TRACER=y
CONFIG_SYSPROF_TRACER=y
CONFIG_SCHED_TRACER=y
# CONFIG_FTRACE_SYSCALLS is not set
# CONFIG_BOOT_TRACER is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
# CONFIG_KSYM_TRACER is not set
# CONFIG_STACK_TRACER is not set
# CONFIG_KMEMTRACE is not set
# CONFIG_WORKQUEUE_TRACER is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_DYNAMIC_DEBUG is not set
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
CONFIG_HAVE_ARCH_KMEMCHECK=y
# CONFIG_STRICT_DEVMEM is not set
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
# CONFIG_X86_PTDUMP is not set
# CONFIG_DEBUG_RODATA is not set
# CONFIG_DEBUG_NX_TEST is not set
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_IOMMU_STRESS is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
# CONFIG_OPTIMIZE_INLINING is not set
# CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is not set

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_KEYS_DEBUG_PROC_KEYS is not set
# CONFIG_SECURITY is not set
# CONFIG_SECURITYFS is not set
# CONFIG_DEFAULT_SECURITY_SELINUX is not set
# CONFIG_DEFAULT_SECURITY_SMACK is not set
# CONFIG_DEFAULT_SECURITY_TOMOYO is not set
CONFIG_DEFAULT_SECURITY_DAC=y
CONFIG_DEFAULT_SECURITY=""
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=m
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_GF128MUL is not set
CONFIG_CRYPTO_NULL=m
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_WORKQUEUE=y
CONFIG_CRYPTO_CRYPTD=m
CONFIG_CRYPTO_AUTHENC=m
CONFIG_CRYPTO_TEST=m

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=m
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set
CONFIG_CRYPTO_FPU=m

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=m
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_GHASH is not set
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_MICHAEL_MIC=m
CONFIG_CRYPTO_RMD128=m
CONFIG_CRYPTO_RMD160=m
CONFIG_CRYPTO_RMD256=m
CONFIG_CRYPTO_RMD320=m
CONFIG_CRYPTO_SHA1=m
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_TGR192=m
CONFIG_CRYPTO_WP512=m
# CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set

#
# Ciphers
#
CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_AES_X86_64=m
CONFIG_CRYPTO_AES_NI_INTEL=m
# CONFIG_CRYPTO_ANUBIS is not set
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_BLOWFISH=m
# CONFIG_CRYPTO_CAMELLIA is not set
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_DES=m
# CONFIG_CRYPTO_FCRYPT is not set
CONFIG_CRYPTO_KHAZAD=m
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_TWOFISH_COMMON=m
CONFIG_CRYPTO_TWOFISH_X86_64=m

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_ZLIB=m
# CONFIG_CRYPTO_LZO is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
# CONFIG_CRYPTO_HW is not set
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_APIC_ARCHITECTURE=y
CONFIG_KVM_MMIO=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
# CONFIG_KVM_INTEL is not set
CONFIG_KVM_AMD=m
# CONFIG_VHOST_NET is not set
# CONFIG_VIRTIO_PCI is not set
# CONFIG_VIRTIO_BALLOON is not set
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_FIND_LAST_BIT=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
CONFIG_CRC_T10DIF=m
CONFIG_CRC_ITU_T=m
CONFIG_CRC32=y
CONFIG_CRC7=m
CONFIG_LIBCRC32C=m
CONFIG_ZLIB_INFLATE=m
CONFIG_ZLIB_DEFLATE=m
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_NLATTR=y